Re: [jboss-dev] Scanning jars for package capabilities

Friday, 19 June 2009

Ok, great. Now we need to get a vfs handler implementation for your 
jzipfile - vfsjzip.

David M. Lloyd wrote:
...
 When I say "ZipFile" I should really say "zip file".  I of course 
 would heavily advocate using my "jzipfile" library that I wrote last 
 week which is exactly what you describe: a pure Java implementation 
 without the native and locking bullshit. :-)

 That said however, even jzipfile requires a java.io.File and shuns the 
 scanning approach.  Here's why.

 The first reason is simple.  To stream a zip file to find a single 
 entry, you have to read the entire zip file's data up to that entry 
 rather than just jumping to the exact offset of the data you're 
 interested in and pulling down the data.  If you scan the zip file 
 completely even once, you might as well have just copied it and read 
 the directory instead.

 The second reason is more involved.  A zip file has two parts: the 
 local file section and the directory.  The local file section looks 
 something like this:

 [local file 0]
    signature word 0x04034b50
    compression = deflate
    local CRC*
    local compressed size*
    local uncompressed size*
    name = foo/bar/foobar.class
    [..possibly compressed file data goes here..]
    real CRC**
    real compressed size**
    real uncompressed size**
 [local file 1]
    signature word 0x04034b50
    compression = deflate
    local CRC*
    local compressed size*
    local uncompressed size*
    name = foo/bar/foobaz.class
    [..possibly compressed file data goes here..]
    real CRC**
    real compressed size**
    real uncompressed size**
 [...]

 The asterisk * indicates that this field is a valid header field but 
 it is almost *always* zero, with the actual information specified 
 *after* the compressed data (**).  (There are a few extra things in 
 there that I don't mention, including some odd stuff that the JAR tool 
 does to distinguish a real JAR file from a file named .jar that was 
 created by some other zip-like tool.)

 This means that if you're scanning from the start of a zip file (like 
 ZipInputStream), you usually have no way to know how long each file is 
 other than just reading the bytes with e.g. InflaterInputStream and 
 hoping that the EOF matches up correctly (which it doesn't always do).  
 But there is one worse case.  If a nested zip file is "stored" rather 
 than "deflated", there may be *no* way to distinguish between records 
 in the nested zip from the parent zip, so you may reach the end of the 
 stored data of a nested zip entry and the streamer would get confused 
 and think that it hit the end of the parent entry.  Very messy.
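To make the streaming problem concrete, here is a minimal sketch. It assumes the archive was written by the stock java.util.zip.ZipOutputStream, which defers the sizes of deflated entries to the trailing descriptor. The class and helper names (StreamedSizeDemo, buildZip, and so on) are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class StreamedSizeDemo {
    static final int LOCAL_SIGNATURE = 0x04034b50;

    /** Builds a one-entry deflated zip in memory with the stock JDK writer. */
    static byte[] buildZip(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(content);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** True if the first local header says the real sizes follow the data. */
    static boolean hasDataDescriptor(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        // Bit 3 of the general purpose flags (offset 6) marks deferred sizes.
        return (buf.getShort(6) & 0x0008) != 0;
    }

    /** The "local compressed size" field (offset 18) of the first local header. */
    static long localCompressedSize(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        return buf.getInt(18) & 0xFFFFFFFFL;
    }

    /** The size a streaming reader reports before it has read the entry data. */
    static long sizeSeenWhileStreaming(byte[] zip) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry = in.getNextEntry();
            return entry.getSize();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = buildZip("foo/bar/foobar.class",
                "not really a class".getBytes(StandardCharsets.UTF_8));
        System.out.println("sizes deferred:       " + hasDataDescriptor(zip));
        System.out.println("local compressed size: " + localCompressedSize(zip));
        System.out.println("size while streaming:  " + sizeSeenWhileStreaming(zip));
    }
}
```

The reader has to inflate the entry to find out where it ends, which is exactly the situation described above.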

 Now the directory is at the *end* of the zip file and looks like this:
 [directory entry 0]
    signature word 0x02014b50
    compression = deflate
    timestamp = xxxx
    real CRC
    real compressed size
    real uncompressed size
    offset of local header = 0
    name = foo/bar/foobar.class
    comment = blah blah
 [directory entry 1]
    signature word 0x02014b50
    compression = deflate
    timestamp = xxxx
    real CRC
    real compressed size
    real uncompressed size
    offset of local header = 12345
    name = foo/bar/foobaz.class
    comment = blah blah
 [...]
 [end of directory]
    signature word 0x06054b50
    offset of first directory entry = 24680
    entry count = N
    comment = ""
 [EOF]

 As you can see the directory entries give us hard values we can use 
 for the lengths (both compressed and uncompressed) which makes it far 
 more resilient against errors.  Also, the directory is all packed 
 together, meaning that we can read just a small portion of the zip 
 file and thus fully index it (and, if we copied the zip to a "safe" 
 location, we can cache the directory and close the zip file, thus 
 avoiding one aspect of the infamous locking issues on Windows).  So 
 ideally, we always use the directory to locate zip entries.
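As a rough sketch of what indexing from the directory could look like, the following walks the directory records in a byte array and pulls out the fields listed above. The field offsets follow the standard zip central directory layout. The class name DirectoryReader and its Entry type are hypothetical, names are assumed to be UTF-8 (or plain ASCII), and Zip64 archives are not handled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class DirectoryReader {
    static final int ENTRY_SIGNATURE = 0x02014b50;
    static final int FIXED_SIZE = 46; // bytes before the variable-length name

    /** One directory record, reduced to the fields needed to locate the data. */
    static final class Entry {
        final String name;
        final long compressedSize;
        final long uncompressedSize;
        final long localHeaderOffset;

        Entry(String name, long compressedSize, long uncompressedSize, long localHeaderOffset) {
            this.name = name;
            this.compressedSize = compressedSize;
            this.uncompressedSize = uncompressedSize;
            this.localHeaderOffset = localHeaderOffset;
        }
    }

    /** Reads entryCount directory records starting at directoryOffset. */
    static List<Entry> read(byte[] zip, int directoryOffset, int entryCount) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        List<Entry> entries = new ArrayList<>();
        int pos = directoryOffset;
        for (int i = 0; i < entryCount; i++) {
            if (buf.getInt(pos) != ENTRY_SIGNATURE) {
                throw new IllegalArgumentException("bad directory signature at " + pos);
            }
            long compressedSize = buf.getInt(pos + 20) & 0xFFFFFFFFL;
            long uncompressedSize = buf.getInt(pos + 24) & 0xFFFFFFFFL;
            int nameLength = buf.getShort(pos + 28) & 0xFFFF;
            int extraLength = buf.getShort(pos + 30) & 0xFFFF;
            int commentLength = buf.getShort(pos + 32) & 0xFFFF;
            long localHeaderOffset = buf.getInt(pos + 42) & 0xFFFFFFFFL;
            String name = new String(zip, pos + FIXED_SIZE, nameLength, StandardCharsets.UTF_8);
            entries.add(new Entry(name, compressedSize, uncompressedSize, localHeaderOffset));
            // Records are packed back to back: skip name, extra field and comment.
            pos += FIXED_SIZE + nameLength + extraLength + commentLength;
        }
        return entries;
    }
}
```

Because the records are contiguous, the whole index can be built from one small read and kept in memory after the file is closed.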

 The question is, how do you find the directory?  You can't scan forward 
 from the beginning of the file because of the problems already outlined 
 with that approach.  The only reasonably accurate way to get the 
 directory is to read the *end* of the file first, which generally 
 means that InputStream is not an option to get this data.  The only 
 real option is to use a random-access file method (FileChannel or 
 RandomAccessFile) to get at this data.  This means that if you have an 
 InputStream (which is what you can generally get out of VFS), you need 
 to create a temp file and slurp the data into it before you can index it.
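Reading the end of the file first might look roughly like the sketch below. It uses RandomAccessFile to load the tail of the file and searches backwards for the end-of-directory signature. The names EndOfDirectoryLocator and DirectoryInfo are hypothetical, Zip64 and multi-disk archives are ignored, and this is not taken from jzipfile.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndOfDirectoryLocator {
    static final int END_SIGNATURE = 0x06054b50;
    static final int END_MIN_SIZE = 22;     // record size with an empty comment
    static final int MAX_COMMENT = 0xFFFF;  // comment length is a 16-bit field

    /** Where the directory starts and how many entries it holds. */
    static final class DirectoryInfo {
        final long directoryOffset;
        final int entryCount;

        DirectoryInfo(long directoryOffset, int entryCount) {
            this.directoryOffset = directoryOffset;
            this.entryCount = entryCount;
        }
    }

    static DirectoryInfo locate(RandomAccessFile file) throws IOException {
        long length = file.length();
        if (length < END_MIN_SIZE) {
            throw new IOException("too short to be a zip file");
        }
        // The record must lie within the last 22 + 65535 bytes of the file.
        int tailSize = (int) Math.min(length, (long) END_MIN_SIZE + MAX_COMMENT);
        byte[] tail = new byte[tailSize];
        file.seek(length - tailSize);
        file.readFully(tail);
        ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        // Search backwards so the record closest to EOF is found first.
        for (int pos = tailSize - END_MIN_SIZE; pos >= 0; pos--) {
            if (buf.getInt(pos) != END_SIGNATURE) {
                continue;
            }
            int commentLength = buf.getShort(pos + 20) & 0xFFFF;
            // Guard against a signature-like byte pattern inside the comment.
            if (pos + END_MIN_SIZE + commentLength != tailSize) {
                continue;
            }
            int entryCount = buf.getShort(pos + 10) & 0xFFFF;
            long directoryOffset = buf.getInt(pos + 16) & 0xFFFFFFFFL;
            return new DirectoryInfo(directoryOffset, entryCount);
        }
        throw new IOException("end of directory record not found");
    }
}
```

None of this is possible with a plain InputStream, since the very first step is a seek to a position near the end of the file.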

 I realize that this is a full iteration of the zip file, *however* it 
 means that once you do this, you have instant access to any zip entry 
 from that point on, *and* the lifetime of the zip file is no longer 
 tied to its parent (be it a directory or another zip file), which 
 fixes a wide variety of issues, including the requirement that 
 resources don't start disappearing before undeploy() is called (in the 
 case where the FS-backed deploy/ mechanism is used), more Windows 
 locking issues, etc.  I also believe that if we *always* copy and 
 *never* scan the zip file, we'll have a net performance gain in the 
 end (assuming we only ever copy and index each zip exactly one time).  
 And of course things are a lot less likely to blow up surprisingly if 
 you have a copy of the zip file.
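The copy-then-index idea can be sketched as follows. Since jzipfile's API is not shown in this thread, java.util.zip.ZipFile stands in as the directory-based reader; a pure Java reader would slot into the same place. The class name CopyThenIndex and its helpers are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class CopyThenIndex {

    /** Slurps the stream into a temp file so it can be read with random access. */
    static Path copyToTemp(InputStream source) throws IOException {
        Path copy = Files.createTempFile("vfs-zip-", ".zip");
        Files.copy(source, copy, StandardCopyOption.REPLACE_EXISTING);
        return copy;
    }

    /**
     * Indexes the copy from its directory and returns name -> uncompressed size.
     * ZipFile is only a stand-in here for whatever directory-based reader is used.
     */
    static Map<String, Long> indexSizes(Path zip) throws IOException {
        Map<String, Long> index = new LinkedHashMap<>();
        try (ZipFile file = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = file.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                index.put(entry.getName(), entry.getSize());
            }
        }
        // The file is closed again here; the index outlives the open handle.
        return index;
    }
}
```

The original stream is read exactly once, and from then on the copy is independent of wherever the stream came from.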

 - DML

 On 06/19/2009 11:25 AM, Scott Stark wrote:
> I don't view the ZipFile api as flexible enough as it requires a 
> java.io.File and nested jars. In addition all jdk implementations 
> have a native component that has tended to introduce locking and 
> memory faults. That is why I believe we need a pure java jar/zip file 
> implementation that addresses these limitations. 
