[jboss-dev] Scanning jars for package capabilities
Scott Stark
sstark at redhat.com
Fri Jun 19 14:17:19 EDT 2009
Ok, great. Now we need a VFS handler implementation for your
jzipfile - vfsjzip.
David M. Lloyd wrote:
> When I say "ZipFile" I should really say "zip file". I of course
> would heavily advocate using my "jzipfile" library that I wrote last
> week which is exactly what you describe: a pure Java implementation
> without the native and locking bullshit. :-)
>
> That said, however, even jzipfile requires a java.io.File and shuns the
> scanning approach. Here's why.
>
> The first reason is simple. To stream a zip file to find a single
> entry, you have to read the entire zip file's data up to that entry
> rather than just jumping to the exact offset of the data you're
> interested in and pulling down the data. If you scan the zip file
> completely even once, you might as well have just copied it and read
> the directory instead.
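
To make that cost concrete, here is a minimal sketch of the two approaches
using the stock java.util.zip classes; the class and method names are purely
illustrative. The streaming variant has to pull every byte of entry data that
precedes the target through the stream, while the directory-backed variant
jumps straight to it.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import java.util.zip.ZipInputStream;

    public class StreamVsDirectory {

        // Streaming: every preceding entry's data is read (and typically
        // inflated) and thrown away before we reach the entry we want.
        static InputStream findByStreaming(String path, String name) throws IOException {
            ZipInputStream zin = new ZipInputStream(new FileInputStream(path));
            for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                if (e.getName().equals(name)) {
                    return zin; // stream is now positioned at this entry's data
                }
            }
            zin.close();
            return null;
        }

        // Directory-based: ZipFile reads the central directory up front, so
        // getEntry() knows the exact offset and we jump straight to the data.
        static InputStream findByDirectory(String path, String name) throws IOException {
            ZipFile zf = new ZipFile(path);
            ZipEntry e = zf.getEntry(name);
            return e == null ? null : zf.getInputStream(e);
        }
    }

In real code the ZipFile would also have to be closed once the caller is done
with the returned stream; the sketch skips that bookkeeping.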
>
> The second reason is more involved. A zip file has two parts: the
> local file section and the directory. The local file section looks
> something like this:
>
> [local file 0]
> signature word 0x04034b50
> compression = deflate
> local CRC*
> local compressed size*
> local uncompressed size*
> name = foo/bar/foobar.class
> [..possibly compressed file data goes here..]
> real CRC**
> real compressed size**
> real uncompressed size**
> [local file 1]
> signature word 0x04034b50
> compression = deflate
> local CRC*
> local compressed size*
> local uncompressed size*
> name = foo/bar/foobaz.class
> [..possibly compressed file data goes here..]
> real CRC**
> real compressed size**
> real uncompressed size**
> [...]
>
> The single asterisk (*) marks fields that are valid header fields but
> are almost *always* zero, with the actual values written in a data
> descriptor *after* the compressed data (**). (There are a few extra
> things in there that I don't mention, including some odd stuff that
> the JAR tool does to distinguish a real JAR file from a file named
> .jar that was created by some other zip-like tool.)
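
As a sketch of what that looks like on disk, reading the fixed 30-byte local
header is enough to see whether the sizes were deferred. Field offsets follow
the standard ZIP format; the class and method names here are made up.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.channels.FileChannel;

    final class LocalHeaderSketch {

        static final int LOCAL_SIG = 0x04034b50; // "PK\003\004"

        // Reads the fixed 30-byte local file header at 'offset' and reports
        // whether the CRC/sizes were deferred to a trailing data descriptor
        // (general purpose flag bit 3), in which case the header fields are zero.
        static void describeLocalHeader(FileChannel ch, long offset) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(30).order(ByteOrder.LITTLE_ENDIAN);
            ch.read(buf, offset);
            buf.flip();
            if (buf.getInt(0) != LOCAL_SIG) {
                throw new IOException("no local file header at offset " + offset);
            }
            int flags           = buf.getShort(6) & 0xffff;
            long crc            = buf.getInt(14) & 0xffffffffL;
            long compressedSize = buf.getInt(18) & 0xffffffffL;
            int nameLength      = buf.getShort(26) & 0xffff;
            boolean deferred = (flags & 0x0008) != 0;
            System.out.printf("nameLength=%d crc=%d csize=%d deferred=%b%n",
                              nameLength, crc, compressedSize, deferred);
        }
    }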
>
> This means that if you're scanning from the start of a zip file (like
> ZipInputStream does), you usually have no way to know how long each
> file is other than reading the bytes with e.g. InflaterInputStream and
> hoping that the EOF matches up correctly (which it doesn't always do).
> And there is an even worse case: if a nested zip file is "stored"
> rather than "deflated", there may be *no* way to distinguish records
> in the nested zip from records in the parent zip, so you may reach the
> end of the stored data of a nested zip entry and the streamer gets
> confused and thinks that it hit the end of the parent entry. Very messy.
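
You can see the first problem directly with java.util.zip: when the sizes
live in the data descriptor, ZipEntry.getSize() reports -1 from a
ZipInputStream until the entry's data has been read through. A rough sketch,
reading args[0] as the zip path:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class StreamingSizes {
        public static void main(String[] args) throws IOException {
            ZipInputStream zin = new ZipInputStream(new FileInputStream(args[0]));
            try {
                for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                    long before = e.getSize();   // frequently -1: the local header had zeros
                    while (zin.read() != -1) {   // read-and-hope until the entry "ends"
                        // discard the uncompressed bytes
                    }
                    long after = e.getSize();    // now filled in from the data descriptor
                    System.out.println(e.getName() + " before=" + before + " after=" + after);
                }
            } finally {
                zin.close();
            }
        }
    }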
>
> Now the directory is at the *end* of the zip file and looks like this:
> [directory entry 0]
> signature word 0x02014b50
> compression = deflate
> timestamp = xxxx
> real CRC
> real compressed size
> real uncompressed size
> offset of local header = 0
> name = foo/bar/foobar.class
> comment = blah blah
> [directory entry 1]
> signature word 0x02014b50
> compression = deflate
> timestamp = xxxx
> real CRC
> real compressed size
> real uncompressed size
> offset of local header = 12345
> name = foo/bar/foobaz.class
> comment = blah blah
> [...]
> [end of directory]
> signature word 0x06054b50
> offset of first directory entry = 24680
> entry count = N
> comment = ""
> [EOF]
>
> As you can see, the directory entries give us hard values we can use
> for the lengths (both compressed and uncompressed), which makes parsing
> far more resilient against errors. Also, the directory is all packed
> together, meaning that we can read just a small portion of the zip
> file and thus fully index it (and, if we copied the zip to a "safe"
> location, we can cache the directory and close the zip file, thus
> avoiding one aspect of the infamous locking issues on Windows). So
> ideally, we always use the directory to locate zip entries.
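
A rough sketch of the cache-the-directory-and-close idea, using the stock
ZipFile for illustration (CachedDirectory is an invented name, and jzipfile's
own API may well look different):

    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    final class CachedDirectory {

        private final Map<String, ZipEntry> index = new HashMap<String, ZipEntry>();

        // Read the central directory once, keep the per-entry metadata, and
        // close the file immediately so no handle lingers (and locks on Windows).
        CachedDirectory(String path) throws IOException {
            ZipFile zf = new ZipFile(path);
            try {
                for (Enumeration<? extends ZipEntry> en = zf.entries(); en.hasMoreElements();) {
                    ZipEntry e = en.nextElement();
                    index.put(e.getName(), e); // name -> sizes, CRC, timestamp, ...
                }
            } finally {
                zf.close();
            }
        }

        boolean contains(String name)      { return index.containsKey(name); }
        long uncompressedSize(String name) { return index.get(name).getSize(); }
    }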
>
> The question is, how do you find the directory? You can't scan forward
> from the beginning of the file because of the problems already outlined
> with that approach. The only reasonably accurate way to get the
> directory is to read the *end* of the file first, which generally
> means that InputStream is not an option to get this data. The only
> real option is to use a random-access file method (FileChannel or
> RandomAccessFile) to get at this data. This means that if you have an
> InputStream (which is what you can generally get out of VFS), you need
> to create a temp file and slurp the data into it before you can index it.
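
Here is a minimal sketch of locating the end-of-central-directory record with
RandomAccessFile; the class name is made up and the byte-at-a-time backwards
search is deliberately naive.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    final class EocdLocator {

        static final int EOCD_SIG    = 0x06054b50; // end-of-central-directory signature
        static final int EOCD_MIN    = 22;         // fixed part of the record
        static final int MAX_COMMENT = 0xffff;     // trailing comment can be up to 64k

        // Search backwards from EOF for the EOCD signature; the record holds the
        // directory offset and entry count, which is all we need to index the zip.
        static long findEocd(RandomAccessFile raf) throws IOException {
            long length = raf.length();
            long stop = Math.max(0, length - EOCD_MIN - MAX_COMMENT);
            byte[] b = new byte[4];
            for (long pos = length - EOCD_MIN; pos >= stop; pos--) {
                raf.seek(pos);
                raf.readFully(b);
                // the signature is stored little-endian on disk
                int sig = (b[0] & 0xff) | (b[1] & 0xff) << 8
                        | (b[2] & 0xff) << 16 | (b[3] & 0xff) << 24;
                if (sig == EOCD_SIG) {
                    return pos;
                }
            }
            throw new IOException("no end-of-central-directory record found");
        }
    }

A real implementation would read the last 64k of the file into a buffer once
rather than seeking per candidate position, but the point stands: you need
random access to the tail of the file, which a plain InputStream can't give you.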
>
> I realize that this is a full pass over the zip file. *However*, once
> you have done it, you have instant access to any zip entry from that
> point on, *and* the lifetime of the zip file is no longer tied to its
> parent (be it a directory or another zip file). That fixes a wide
> variety of issues: the requirement that resources not start
> disappearing before undeploy() is called (in the case where the
> FS-backed deploy/ mechanism is used), more of the Windows locking
> issues, and so on. I also believe that if we *always* copy and *never*
> scan the zip file, we'll see a net performance gain in the end
> (assuming we copy and index each zip exactly once). And of course
> things are a lot less likely to blow up surprisingly if you have a
> copy of the zip file.
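
A sketch of the copy-once step; the helper name is invented, and in practice
something like this would live behind the VFS handler rather than stand alone.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    final class ZipTempCopy {

        // Copy the stream VFS hands us into a temp file once, so the zip can
        // then be indexed via random access and outlive its parent deployment.
        static File copyToTemp(InputStream in) throws IOException {
            File tmp = File.createTempFile("vfszip", ".jar");
            tmp.deleteOnExit();
            OutputStream out = new FileOutputStream(tmp);
            try {
                byte[] buf = new byte[8192];
                for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                    out.write(buf, 0, n);
                }
            } finally {
                out.close();
                in.close();
            }
            return tmp;
        }
    }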
>
> - DML
>
> On 06/19/2009 11:25 AM, Scott Stark wrote:
>> I don't view the ZipFile API as flexible enough, as it requires a
>> java.io.File and so can't handle nested jars. In addition, all JDK
>> implementations have a native component that has tended to introduce
>> locking and memory faults. That is why I believe we need a pure Java
>> jar/zip file implementation that addresses these limitations.