[jboss-dev] Scanning jars for package capabilities

Fri Jun 19 13:12:27 EDT 2009

When I say "ZipFile" I should really say "zip file".  I of course would 
heavily advocate using my "jzipfile" library that I wrote last week which 
is exactly what you describe: a pure Java implementation without the native 
and locking bullshit. :-)

That said however, even jzipfile requires a java.io.File and shuns the 
scanning approach.  Here's why.

The first reason is simple.  To stream a zip file to find a single entry, 
you have to read the entire zip file's data up to that entry rather than 
just jumping to the exact offset of the data you're interested in and 
pulling down the data.  If you scan the zip file completely even once, you 
might as well have just copied it and read the directory instead.
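
To make the cost concrete, here is a minimal sketch of the streaming approach using only java.util.zip.  The class and method names (StreamScanDemo, buildZip, entriesVisitedToFind) are made up for illustration and are not part of VFS or jzipfile.  It counts how many entries the stream has to walk through before it reaches the one you asked for.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; names are illustrative only.
final class StreamScanDemo {

    /** Builds a small in-memory zip; each entry's content is simply its own name. */
    static byte[] buildZip(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /**
     * Streams through the archive until the wanted entry appears.
     * Returns the number of entries visited (including the match), or -1 if absent.
     */
    static int entriesVisitedToFind(InputStream in, String wanted) throws IOException {
        int visited = 0;
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry e;
            while ((e = zis.getNextEntry()) != null) {
                visited++;
                if (e.getName().equals(wanted)) {
                    return visited;
                }
                // The next getNextEntry() call must read past this entry's data
                // before it can see the following local header.
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = buildZip("a.class", "b.class", "c.class");
        System.out.println(entriesVisitedToFind(new ByteArrayInputStream(zip), "c.class"));
    }
}
```

Finding the last entry means consuming every byte before it, and a lookup for a name that isn't there consumes the whole archive.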

The second reason is more involved.  A zip file has two parts: the local 
file section and the directory.  The local file section looks something 
like this:

[local file 0]
    signature word 0x04034b50
    compression = deflate
    local CRC*
    local compressed size*
    local uncompressed size*
    name = foo/bar/foobar.class
    [..possibly compressed file data goes here..]
    real CRC**
    real compressed size**
    real uncompressed size**
[local file 1]
    signature word 0x04034b50
    compression = deflate
    local CRC*
    local compressed size*
    local uncompressed size*
    name = foo/bar/foobaz.class
    [..possibly compressed file data goes here..]
    real CRC**
    real compressed size**
    real uncompressed size**
[...]

The asterisk * indicates that this field is a valid header field but it is 
almost *always* zero in practice (this happens whenever the archive was 
written as a stream, i.e. general purpose flag bit 3 is set), with the 
actual information specified *after* the compressed data (**).  (There are 
a few extra things in there that I don't mention, including some odd stuff 
that the JAR tool does to distinguish a real JAR file from a file named 
.jar that was created by some other zip-like tool.)

This means that if you're scanning from the start of a zip file (like 
ZipInputStream does), you usually have no way to know how long each file 
is other than reading the bytes with e.g. InflaterInputStream and hoping 
that the EOF matches up correctly (which it doesn't always do).  But there 
is one worse case.  If a nested zip file is "stored" rather than 
"deflated", there may be *no* way to distinguish records in the nested zip 
from records in the parent zip, so the streamer may reach the end of an 
entry inside the nested zip and mistake it for the end of the parent 
entry.  Very messy.
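
You can see the zeroed fields for yourself with a small sketch like the one below.  It assumes the archive is written by java.util.zip.ZipOutputStream without the sizes being set up front, and the class and method names (LocalHeaderDemo and friends) are hypothetical.  The byte offsets come from the standard local file header layout: flags at 6, CRC at 14, compressed size at 18, uncompressed size at 22.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; names are illustrative only.
final class LocalHeaderDemo {

    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;

    /** Writes a one-entry deflated zip without telling the stream the sizes in advance. */
    static byte[] buildStreamedZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("foo/bar/foobar.class"));
            zos.write("not really a class file".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Returns {flags, crc, compressedSize, uncompressedSize} from the local header at offset 0. */
    static long[] readFirstLocalHeader(byte[] zip) {
        // All multi-byte zip fields are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("no local file header at offset 0");
        }
        long flags = buf.getShort(6) & 0xFFFF;
        long crc = buf.getInt(14) & 0xFFFFFFFFL;
        long compressedSize = buf.getInt(18) & 0xFFFFFFFFL;
        long uncompressedSize = buf.getInt(22) & 0xFFFFFFFFL;
        return new long[] { flags, crc, compressedSize, uncompressedSize };
    }

    public static void main(String[] args) throws IOException {
        long[] h = readFirstLocalHeader(buildStreamedZip());
        System.out.printf("flags=0x%04x crc=%d csize=%d size=%d%n", h[0], h[1], h[2], h[3]);
    }
}
```

With bit 3 (0x0008) set in the flags, the three local fields are zero and the real values only show up after the entry's data, which is exactly what a forward scanner has to cope with.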

Now the directory is at the *end* of the zip file and looks like this:

[directory entry 0]
    signature word 0x02014b50
    compression = deflate
    timestamp = xxxx
    real CRC
    real compressed size
    real uncompressed size
    offset of local header = 0
    name = foo/bar/foobar.class
    comment = blah blah
[directory entry 1]
    signature word 0x02014b50
    compression = deflate
    timestamp = xxxx
    real CRC
    real compressed size
    real uncompressed size
    offset of local header = 12345
    name = foo/bar/foobaz.class
    comment = blah blah
[...]
[end of directory]
    signature word 0x06054b50
    offset of first directory entry = 24680
    entry count = N
    comment = ""
[EOF]

As you can see, the directory entries give us hard values for the lengths 
(both compressed and uncompressed), which makes reading via the directory 
far more resilient against errors.  Also, the directory is all packed 
together, meaning that we can fully index the zip file by reading just a 
small portion of it (and, if we copied the zip to a "safe" location, we can 
cache the directory and close the zip file, thus avoiding one aspect of the 
infamous locking issues on Windows).  So ideally, we always use the 
directory to locate zip entries.

The question is, how do you find the directory?  You can't scan forward 
from the beginning of the file because of the problems already outlined 
with that approach.  The only reasonably accurate way to get the directory is to read 
the *end* of the file first, which generally means that InputStream is not 
an option to get this data.  The only real option is to use a random-access 
file method (FileChannel or RandomAccessFile) to get at this data.  This 
means that if you have an InputStream (which is what you can generally get 
out of VFS), you need to create a temp file and slurp the data into it 
before you can index it.
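
For illustration, here is a rough sketch of indexing from the end of the file with RandomAccessFile.  This is not the jzipfile code; the class name DirectoryIndexer and its constants are invented for the example.  It assumes a plain single-disk archive with no Zip64 records, assumes names are UTF-8, and does not guard against a bogus signature hidden inside the archive comment.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the jzipfile implementation.
final class DirectoryIndexer {

    static final int END_SIGNATURE = 0x06054b50;
    static final int ENTRY_SIGNATURE = 0x02014b50;
    static final int END_MIN_SIZE = 22;     // fixed part of the end-of-directory record
    static final int ENTRY_FIXED_SIZE = 46; // fixed part of a directory entry
    static final int MAX_COMMENT = 0xFFFF;  // the archive comment length is a 16-bit field

    /** Maps each entry name to the offset of its local header, reading only the file's tail and directory. */
    static Map<String, Long> index(RandomAccessFile file) throws IOException {
        long length = file.length();
        if (length < END_MIN_SIZE) {
            throw new IOException("too short to be a zip file");
        }
        // Read just the last chunk of the file, where the end record must live.
        int tailSize = (int) Math.min(length, END_MIN_SIZE + MAX_COMMENT);
        byte[] tailBytes = new byte[tailSize];
        file.seek(length - tailSize);
        file.readFully(tailBytes);
        ByteBuffer tail = ByteBuffer.wrap(tailBytes).order(ByteOrder.LITTLE_ENDIAN);

        // Search backward for the end-of-directory signature.
        int end = -1;
        for (int i = tailSize - END_MIN_SIZE; i >= 0; i--) {
            if (tail.getInt(i) == END_SIGNATURE) {
                end = i;
                break;
            }
        }
        if (end < 0) {
            throw new IOException("end-of-directory record not found");
        }
        int count = tail.getShort(end + 10) & 0xFFFF;
        long dirSize = tail.getInt(end + 12) & 0xFFFFFFFFL;
        long dirOffset = tail.getInt(end + 16) & 0xFFFFFFFFL;

        // The directory is contiguous, so one seek and one read fetch all of it.
        byte[] dirBytes = new byte[(int) dirSize];
        file.seek(dirOffset);
        file.readFully(dirBytes);
        ByteBuffer dir = ByteBuffer.wrap(dirBytes).order(ByteOrder.LITTLE_ENDIAN);

        Map<String, Long> index = new LinkedHashMap<>();
        int pos = 0;
        for (int n = 0; n < count; n++) {
            if (dir.getInt(pos) != ENTRY_SIGNATURE) {
                throw new IOException("bad directory entry at " + pos);
            }
            int nameLen = dir.getShort(pos + 28) & 0xFFFF;
            int extraLen = dir.getShort(pos + 30) & 0xFFFF;
            int commentLen = dir.getShort(pos + 32) & 0xFFFF;
            long localOffset = dir.getInt(pos + 42) & 0xFFFFFFFFL;
            String name = new String(dirBytes, pos + ENTRY_FIXED_SIZE, nameLen, StandardCharsets.UTF_8);
            index.put(name, localOffset);
            pos += ENTRY_FIXED_SIZE + nameLen + extraLen + commentLen;
        }
        return index;
    }
}
```

Note that none of this touches the entry data itself: two seeks and two reads, and you have the whole index.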

I realize that this is a full iteration of the zip file, *however* it means 
that once you do this, you have instant access to any zip entry from that 
point on, *and* the lifetime of the zip file is no longer tied to its 
parent (be it a directory or another zip file), which fixes a wide variety 
of issues, including the requirement that resources don't start 
disappearing before undeploy() is called (in the case where the FS-backed 
deploy/ mechanism is used), more Windows locking issues, etc.  I also 
believe that if we *always* copy and *never* scan the zip file, we'll have 
a net performance gain in the end (assuming we only ever copy and index 
each zip exactly one time).  And of course things are a lot less likely to 
blow up surprisingly if you have a copy of the zip file.
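
The copy-then-index idea boils down to something like the sketch below.  I'm using java.util.zip.ZipFile here purely as a stand-in for whatever directory-based reader ends up being used, and the names (CopyThenIndex, copyToTemp, entryNames, the "vfs-copy-" prefix) are invented for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical sketch; ZipFile stands in for any directory-based zip reader.
final class CopyThenIndex {

    /** Copies the whole stream to a private temp file so it can be read with random access. */
    static File copyToTemp(InputStream in) throws IOException {
        File temp = File.createTempFile("vfs-copy-", ".zip");
        temp.deleteOnExit();
        // createTempFile already created the file, so the copy must be allowed to overwrite it.
        Files.copy(in, temp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return temp;
    }

    /** Lists entry names from the copy's directory without inflating any entry data. */
    static List<String> entryNames(File zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip)) {
            for (ZipEntry e : Collections.list(zf.entries())) {
                names.add(e.getName());
            }
        }
        return names;
    }
}
```

Once the copy exists, the original stream (and whatever it came from) can go away without affecting anything that reads from the copy.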

- DML

On 06/19/2009 11:25 AM, Scott Stark wrote:
> I don't view the ZipFile api as flexible enough as it requires a 
> java.io.File and nested jars. In addition all jdk implementations have a 
> native component that has tended to introduce locking and memory faults. 
> That is why I believe we need a pure java jar/zip file implementation 
> that addresses these limitations.
> 
> David M. Lloyd wrote:
>> But folks, consider this.  By doing this scanning by default, this 
>> means there's yet another hoop the user must jump through in order to 
>> use the most performant configuration (in this case, bypassing the 
>> scanning).
>>
>> All that said though, I think this whole issue would become moot if we 
>> eliminate using the JAR streaming API from VFS.  If we instead used 
>> the ZipFile API exclusively, "traversing the JAR" would amount to 
>> scanning the keys of a HashMap, never mind all the other possible 
>> issues with streaming zipfiles that I've raised elsewhere.
>>
>> - DML
> 