Ok, great. Now we need to get a vfs handler implementation for your
jzipfile - vfsjzip.
David M. Lloyd wrote:
When I say "ZipFile" I should really say "zip file". I of course
would heavily advocate using my "jzipfile" library that I wrote last
week which is exactly what you describe: a pure Java implementation
without the native and locking bullshit. :-)
That said however, even jzipfile requires a java.io.File and shuns the
scanning approach. Here's why.
The first reason is simple. To stream a zip file to find a single
entry, you have to read the entire zip file's data up to that entry
rather than just jumping to the exact offset of the data you're
interested in and pulling down the data. If you scan the zip file
completely even once, you might as well have just copied it and read
the directory instead.
The second reason is more involved. A zip file has two parts: the
local file section and the directory. The local file section looks
something like this:
    [local file 0]
      signature word 0x04034b50
      compression = deflate
      local CRC*
      local compressed size*
      local uncompressed size*
      name = foo/bar/foobar.class
      [..possibly compressed file data goes here..]
      real CRC**
      real compressed size**
      real uncompressed size**
    [local file 1]
      signature word 0x04034b50
      compression = deflate
      local CRC*
      local compressed size*
      local uncompressed size*
      name = foo/bar/foobaz.class
      [..possibly compressed file data goes here..]
      real CRC**
      real compressed size**
      real uncompressed size**
    [...]
The asterisk * indicates that this field is a valid header field but
it is almost *always* zero, with the actual information specified
*after* the compressed data (**). (There are a few extra things in
there that I don't mention, including some odd stuff that the JAR tool
does to distinguish a real JAR file from a file named .jar that was
created by some other zip-like tool.)
This means that if you're scanning from the start of a zip file (like
ZipInputStream), you usually have no way to know how long each file is
other than just reading the bytes with e.g. InflaterInputStream and
hoping that the EOF matches up correctly (which it doesn't always do).
But there is one worse case. If a nested zip file is "stored" rather
than "deflated", there may be *no* way to distinguish records in the
nested zip from records in the parent zip, so when you reach the end
of the stored data of a nested zip entry the streamer can get confused
and think that it hit the end of the parent entry. Very messy.
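To make the zeroed local fields concrete, here is a rough sketch (not
jzipfile code) that writes a one-entry zip with the JDK's own
ZipOutputStream and then decodes the first local header by hand. The
class and method names (LocalHeaderPeek, buildZip, readFirstLocalHeader)
are made up for illustration; the field offsets are the standard local
header layout.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
public class LocalHeaderPeek {
    /** Builds a one-entry zip in memory using the JDK's streaming writer. */
    static byte[] buildZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("foo/bar/foobar.class"));
            zip.write("not really a class file".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Returns {signature, flags, crc, compressedSize, uncompressedSize} of the first local header. */
    static long[] readFirstLocalHeader(byte[] zip) {
        // All multi-byte zip fields are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        long signature = buf.getInt(0) & 0xFFFFFFFFL;  // 0x04034b50
        long flags = buf.getShort(6) & 0xFFFF;         // bit 3 set = sizes follow the data
        long crc = buf.getInt(14) & 0xFFFFFFFFL;       // "local CRC*"
        long csize = buf.getInt(18) & 0xFFFFFFFFL;     // "local compressed size*"
        long usize = buf.getInt(22) & 0xFFFFFFFFL;     // "local uncompressed size*"
        return new long[] { signature, flags, crc, csize, usize };
    }

    public static void main(String[] args) throws IOException {
        long[] h = readFirstLocalHeader(buildZip());
        System.out.printf("signature=0x%08x flags=0x%04x crc=%d csize=%d usize=%d%n",
                h[0], h[1], h[2], h[3], h[4]);
    }
}
```

With a writer that streams entries out like this, the CRC and both
sizes should come back as zero and flag bit 3 should be set, which is
exactly the situation a forward-scanning reader has to cope with.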
Now the directory is at the *end* of the zip file and looks like this:
    [directory entry 0]
      signature word 0x02014b50
      compression = deflate
      timestamp = xxxx
      real CRC
      real compressed size
      real uncompressed size
      offset of local header = 0
      name = foo/bar/foobar.class
      comment = blah blah
    [directory entry 1]
      signature word 0x02014b50
      compression = deflate
      timestamp = xxxx
      real CRC
      real compressed size
      real uncompressed size
      offset of local header = 12345
      name = foo/bar/foobaz.class
      comment = blah blah
    [...]
    [end of directory]
      signature word 0x06054b50
      offset of first directory entry = 24680
      entry count = N
      comment = ""
    [EOF]
As you can see, the directory entries give us hard values we can use
for the lengths (both compressed and uncompressed), which makes reading
far more resilient against errors. Also, the directory is all packed
together, meaning that we can read just a small portion of the zip
file and thus fully index it (and, if we copied the zip to a "safe"
location, we can cache the directory and close the zip file, thus
avoiding one aspect of the infamous locking issues on Windows). So
ideally, we always use the directory to locate zip entries.
The question is, how do you find the directory? You can't scan forward
from the beginning of the file because of the problems already outlined
with that approach. The only reasonably accurate way to get the
directory is to read the *end* of the file first, which generally
means that InputStream is not an option to get this data. The only
real option is to use a random-access file method (FileChannel or
RandomAccessFile) to get at this data. This means that if you have an
InputStream (which is what you can generally get out of VFS), you need
to create a temp file and slurp the data into it before you can index it.
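Reading the end of the file first can be sketched roughly like this.
It is a minimal illustration under a few assumptions, not how jzipfile
does it: the names (EndOfDirectoryFinder, locateDirectory, writeSample)
are invented, there is no Zip64 support, and a careful reader would
also check that the comment length stored in the record matches the
bytes remaining after it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
public class EndOfDirectoryFinder {
    static final int EOCD_SIGNATURE = 0x06054b50;
    static final int EOCD_MIN_SIZE = 22;   // fixed part of the end-of-directory record
    static final int MAX_COMMENT = 0xFFFF; // the comment length is a 16-bit field

    /** Returns {entryCount, directoryOffset}, found by searching backwards from EOF. */
    static long[] locateDirectory(Path zip) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(zip.toFile(), "r")) {
            long fileLen = raf.length();
            // Only the tail of the file can contain the record.
            int tailLen = (int) Math.min(fileLen, EOCD_MIN_SIZE + MAX_COMMENT);
            byte[] tail = new byte[tailLen];
            raf.seek(fileLen - tailLen);
            raf.readFully(tail);
            ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
            for (int pos = tailLen - EOCD_MIN_SIZE; pos >= 0; pos--) {
                if (buf.getInt(pos) == EOCD_SIGNATURE) {
                    int entryCount = buf.getShort(pos + 10) & 0xFFFF;
                    long dirOffset = buf.getInt(pos + 16) & 0xFFFFFFFFL;
                    return new long[] { entryCount, dirOffset };
                }
            }
            throw new IOException("no end-of-directory record found");
        }
    }

    /** Writes a small two-entry zip so the sketch has something to read. */
    static Path writeSample() throws IOException {
        Path zip = Files.createTempFile("sample", ".zip");
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            for (String name : new String[] { "foo/bar/foobar.class", "foo/bar/foobaz.class" }) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return zip;
    }

    public static void main(String[] args) throws IOException {
        Path zip = writeSample();
        try {
            long[] dir = locateDirectory(zip);
            System.out.println("entries=" + dir[0] + " directoryOffset=" + dir[1]);
        } finally {
            Files.delete(zip);
        }
    }
}
```

Note that the whole thing hinges on seek(), which is why a plain
InputStream doesn't cut it here.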
I realize that the copy is a full iteration of the zip file, *however* it
means that once you do this, you have instant access to any zip entry
from that point on, *and* the lifetime of the zip file is no longer
tied to its parent (be it a directory or another zip file), which
fixes a wide variety of issues, including the requirement that
resources don't start disappearing before undeploy() is called (in the
case where the FS-backed deploy/ mechanism is used), more Windows
locking issues, etc. I also believe that if we *always* copy and
*never* scan the zip file, we'll have a net performance gain in the
end (assuming we only ever copy and index each zip exactly one time).
And of course things are a lot less likely to blow up surprisingly if
you have a copy of the zip file.
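A rough sketch of the copy-once-then-index idea, for illustration only.
Since jzipfile's API isn't shown in this thread, the sketch uses
java.util.zip.ZipFile as a stand-in for whatever directory-based reader
ends up being used; the names (CopyThenIndex, copyToTemp, readEntry,
sampleZip) are invented, and the ByteArrayInputStream stands in for the
stream you would get from VFS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
public class CopyThenIndex {
    /** Slurps the stream into a temp file; the caller is responsible for deleting it. */
    static Path copyToTemp(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("vfs-copy", ".zip");
        // createTempFile already created the file, so overwrite it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Reads one entry as text, located via the directory of the copied file. */
    static String readEntry(Path zip, String name) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            ZipEntry entry = zf.getEntry(name);
            if (entry == null) {
                throw new IOException("no such entry: " + name);
            }
            try (InputStream in = zf.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Builds a small two-entry zip in memory to play the role of the source stream. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("first".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("second".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Path copy = copyToTemp(new ByteArrayInputStream(sampleZip()));
        try {
            // Jumps straight to b.txt without reading a.txt's data.
            System.out.println(readEntry(copy, "b.txt"));
        } finally {
            Files.delete(copy);
        }
    }
}
```

Once the copy exists, the source stream (and whatever parent it came
from) can be closed; every later lookup goes against the temp file.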
- DML
On 06/19/2009 11:25 AM, Scott Stark wrote:
> I don't view the ZipFile api as flexible enough as it requires a
> java.io.File and nested jars. In addition all jdk implementations
> have a native component that has tended to introduce locking and
> memory faults. That is why I believe we need a pure java jar/zip file
> implementation that addresses these limitations.