[jboss-dev] Scanning jars for package capabilities
David M. Lloyd
david.lloyd at redhat.com
Fri Jun 19 13:12:27 EDT 2009
When I say "ZipFile" I should really say "zip file". I would of course
heavily advocate using the "jzipfile" library I wrote last week, which is
exactly what you describe: a pure Java implementation without the native
and locking bullshit. :-)
That said, even jzipfile requires a java.io.File and shuns the
scanning approach. Here's why.
The first reason is simple. To stream a zip file to find a single entry,
you have to read the entire zip file's data up to that entry rather than
just jumping to the exact offset of the data you're interested in and
pulling down the data. If you scan the zip file completely even once, you
might as well have just copied it and read the directory instead.
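To make the cost concrete, here is a minimal sketch using the JDK's
java.util.zip.ZipInputStream (not jzipfile). The class and method names
(StreamingLookup, entriesWalked, sampleZip) are made up for illustration.
It counts how many entries the stream has to walk through before it
reaches the one it wants; each skipped entry has its data read and
discarded along the way.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
final class StreamingLookup {
    /** Returns how many entries were walked before `name` was found, or -1 if absent. */
    static int entriesWalked(InputStream in, String name) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(in)) {
            int walked = 0;
            ZipEntry e;
            // getNextEntry() first consumes whatever is left of the previous entry.
            while ((e = zin.getNextEntry()) != null) {
                walked++;
                if (e.getName().equals(name)) {
                    return walked;
                }
            }
            return -1;
        }
    }

    /** Builds a small in-memory zip with `count` entries named pkg/C0.class, pkg/C1.class, ... */
    static byte[] sampleZip(int count) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (int i = 0; i < count; i++) {
                zos.putNextEntry(new ZipEntry("pkg/C" + i + ".class"));
                zos.write(("data" + i).getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }
}
```

Looking up the last entry of an N-entry archive this way walks all N
entries, and a lookup of a missing name walks the whole file.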
The second reason is more involved. A zip file has two parts: the local
file section and the directory. The local file section looks something
like this:
[local file 0]
    signature word 0x04034b50
    compression = deflate
    local CRC*
    local compressed size*
    local uncompressed size*
    name = foo/bar/foobar.class
    [..possibly compressed file data goes here..]
    real CRC**
    real compressed size**
    real uncompressed size**
[local file 1]
    signature word 0x04034b50
    compression = deflate
    local CRC*
    local compressed size*
    local uncompressed size*
    name = foo/bar/foobaz.class
    [..possibly compressed file data goes here..]
    real CRC**
    real compressed size**
    real uncompressed size**
[...]
The asterisk * indicates that this field is a valid header field but in
practice it is very often zero, with the actual information specified
*after* the compressed data (**). This happens whenever the archive was
written as a stream, which is what java.util.zip's ZipOutputStream does for
deflated entries unless the sizes are set up front. (There are a few extra
things in there that I don't mention, including some odd stuff that the JAR
tool does to distinguish a real JAR file from a file named .jar that was
created by some other zip-like tool).
This means that if you're scanning from the start of a zip file (like
ZipInputStream), you usually have no way to know how long each file is
other than just reading the bytes with e.g. InflaterInputStream and hoping
that the EOF matches up correctly (which it doesn't always do). But there
is an even worse case. If a nested zip file is "stored" rather than
"deflated", there may be *no* way to distinguish records in the nested zip
from records in the parent zip, so the streamer may hit a record inside the
stored data of a nested zip entry, get confused, and think that it has hit
the end of the parent entry. Very messy.
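A minimal sketch of the "sizes come after the data" problem, using only the
JDK's java.util.zip classes. The class and method names (StreamedSizes,
oneEntryZip, sizeBeforeReading, sizeAfterReading) are hypothetical. It
writes one deflated entry without declaring its size, then asks
ZipInputStream for the size before and after the entry data has been read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
final class StreamedSizes {
    /** Writes one deflated entry without setting size or CRC up front. */
    static byte[] oneEntryZip(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("foo/bar/foobar.class"));
            zos.write(payload);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    /** Size as reported from the local header alone (no entry data read yet). */
    static long sizeBeforeReading(byte[] zip) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            return zin.getNextEntry().getSize();
        }
    }

    /** Size as reported once the whole entry has been inflated. */
    static long sizeAfterReading(byte[] zip) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry e = zin.getNextEntry();
            byte[] buf = new byte[4096];
            while (zin.read(buf) != -1) {
                // drain the entry; the trailing descriptor is parsed at its end
            }
            return e.getSize();
        }
    }
}
```

With an archive written this way, sizeBeforeReading should report -1
(ZipEntry's "not known" value), and the real length only shows up once the
inflater has found the end of the data on its own.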
Now the directory is at the *end* of the zip file and looks like this:
[directory entry 0]
    signature word 0x02014b50
    compression = deflate
    timestamp = xxxx
    real CRC
    real compressed size
    real uncompressed size
    offset of local header = 0
    name = foo/bar/foobar.class
    comment = blah blah
[directory entry 1]
    signature word 0x02014b50
    compression = deflate
    timestamp = xxxx
    real CRC
    real compressed size
    real uncompressed size
    offset of local header = 12345
    name = foo/bar/foobaz.class
    comment = blah blah
[...]
[end of directory]
    signature word 0x06054b50
    offset of first directory entry = 24680
    entry count = N
    comment = ""
[EOF]
As you can see the directory entries give us hard values we can use for the
lengths (both compressed and uncompressed) which makes it far more
resilient against errors. Also, the directory is all packed together,
meaning that we can read just a small portion of the zip file and thus
fully index it (and, if we copied the zip to a "safe" location, we can
cache the directory and close the zip file, thus avoiding one aspect of the
infamous locking issues on Windows). So ideally, we always use the
directory to locate zip entries.
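As a rough illustration of "cache the directory and close the zip file",
here is a sketch that uses the JDK's java.util.zip.ZipFile as a stand-in
(jzipfile's own API is not shown in this thread, so nothing here should be
read as its interface). The class name DirectoryIndex and the choice to
keep only the uncompressed size per entry are assumptions made for brevity.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, for illustration only.
final class DirectoryIndex {
    /** Reads the directory once and returns name -> uncompressed size. */
    static Map<String, Long> index(File zip) throws IOException {
        Map<String, Long> sizes = new HashMap<>();
        try (ZipFile zf = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                // These values come from the directory, not the local headers.
                sizes.put(e.getName(), e.getSize());
            }
        }
        // The zip is closed here; the index stays usable.
        return sizes;
    }
}
```

A real index would also keep the local header offset and compressed size
so that entry data can be fetched later. ZipEntry does not expose the
offset, which is one reason a custom reader is attractive.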
The question is, how do you find the directory? You can't scan forward
from the beginning of the file because of the problems already outlined
with that approach. The only reasonably accurate way to get the directory
is to read the *end* of the file first, which generally means that
InputStream is not an option for getting this data. The only real option is
to use a random-access file method (FileChannel or RandomAccessFile). This
means that if you have an InputStream (which is what you can generally get
out of VFS), you need to create a temp file and slurp the data into it
before you can index it.
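Reading the end of the file first can be sketched roughly as below, with
RandomAccessFile. This is not jzipfile's code; the class name
EndOfDirectory and the long[] return shape are assumptions. It relies on
the standard layout of the end-of-directory record (a 22-byte fixed part
followed by a comment of at most 65535 bytes) and ignores Zip64 and
multi-disk archives.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Path;

// Hypothetical helper, for illustration only. No Zip64 support.
final class EndOfDirectory {
    static final int SIGNATURE = 0x06054b50;
    static final int FIXED_LEN = 22;        // record length without the comment
    static final int MAX_COMMENT = 0xFFFF;  // comment length is a 16-bit field

    /** Returns { entry count, offset of first directory entry }. */
    static long[] locate(Path zip) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(zip.toFile(), "r")) {
            long len = raf.length();
            if (len < FIXED_LEN) {
                throw new IOException("too short to be a zip file");
            }
            // Only the tail of the file can contain the record.
            int tailLen = (int) Math.min(len, (long) FIXED_LEN + MAX_COMMENT);
            byte[] tail = new byte[tailLen];
            raf.seek(len - tailLen);
            raf.readFully(tail);
            ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
            // Search backwards so a signature inside the comment is not picked first.
            for (int pos = tailLen - FIXED_LEN; pos >= 0; pos--) {
                if (buf.getInt(pos) != SIGNATURE) {
                    continue;
                }
                int commentLen = buf.getShort(pos + 20) & 0xFFFF;
                // Sanity check: the record plus its comment must end exactly at EOF.
                if (pos + FIXED_LEN + commentLen == tailLen) {
                    long count = buf.getShort(pos + 10) & 0xFFFF;
                    long offset = buf.getInt(pos + 16) & 0xFFFFFFFFL;
                    return new long[] { count, offset };
                }
            }
            throw new IOException("end-of-directory record not found");
        }
    }
}
```

From the returned offset, the directory entries (signature 0x02014b50) can
be read in one contiguous block. None of this is possible on a plain
InputStream without buffering the whole thing somewhere first.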
I realize that this is a full iteration of the zip file. *However*, once
you've done it, you have instant access to any zip entry from that point
on, *and* the lifetime of the zip file is no longer tied to its parent (be
it a directory or another zip file). That fixes a wide variety of issues:
resources no longer start disappearing before undeploy() is called (in the
case where the FS-backed deploy/ mechanism is used), more of the Windows
locking issues go away, etc. I also believe that if we *always* copy and
*never* scan the zip file, we'll have a net performance gain in the end
(assuming we only ever copy and index each zip exactly one time). And of
course things are a lot less likely to blow up surprisingly if you have a
copy of the zip file.
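A minimal sketch of the copy-once approach, again with the JDK's
java.util.zip.ZipFile standing in for jzipfile. The class name CopiedZip
and its methods are hypothetical, and InputStream.readAllBytes() assumes
Java 9 or later. The source stream is slurped into a temp file exactly
once; every lookup after that goes through the directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, for illustration only.
final class CopiedZip implements AutoCloseable {
    private final Path copy;
    private final ZipFile zip;

    /** Copies the stream to a private temp file and opens it for random access. */
    CopiedZip(InputStream source) throws IOException {
        copy = Files.createTempFile("copied", ".zip");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(source, copy, StandardCopyOption.REPLACE_EXISTING);
            zip = new ZipFile(copy.toFile());
        } catch (IOException | RuntimeException e) {
            Files.deleteIfExists(copy);
            throw e;
        }
    }

    /** Returns the entry's bytes, or null if there is no such entry. */
    byte[] read(String name) throws IOException {
        ZipEntry e = zip.getEntry(name);
        if (e == null) {
            return null;
        }
        try (InputStream in = zip.getInputStream(e)) {
            return in.readAllBytes();
        }
    }

    /** The copy lives exactly as long as this object, independent of its source. */
    @Override
    public void close() throws IOException {
        zip.close();
        Files.deleteIfExists(copy);
    }
}
```

Because the copy is private, whatever happens to the original (deleted
from deploy/, parent zip closed) has no effect on entries that are read
later.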
- DML
On 06/19/2009 11:25 AM, Scott Stark wrote:
> I don't view the ZipFile api as flexible enough as it requires a
> java.io.File and nested jars. In addition all jdk implementations have a
> native component that has tended to introduce locking and memory faults.
> That is why I believe we need a pure java jar/zip file implementation
> that addresses these limitations.
>
> David M. Lloyd wrote:
>> But folks, consider this. By doing this scanning by default, this
>> means there's yet another hoop the user must jump through in order to
>> use the most performant configuration (in this case, bypassing the
>> scanning).
>>
>> All that said though, I think this whole issue would become moot if we
>> eliminate using the JAR streaming API from VFS. If we instead used
>> the ZipFile API exclusively, "traversing the JAR" would amount to
>> scanning the keys of a HashMap, never mind all the other possible
>> issues with streaming zipfiles that I've raised elsewhere.
>>
>> - DML