[jboss-dev] Scanning jars for package capabilities

Fri Jun 19 15:14:04 EDT 2009

Already done - or at least, we started on it.  Ales made a zip-layer 
abstraction, and we got a jzipfile impl working, but since the existing zip 
layer requires both streaming and file support, I had to write a fake 
streaming layer which copies the zip file and emulates streaming, which of 
course performs horribly since we open streams again and again (causing a 
copy each time).

What I'd like to do is rip out the entire existing zip and jar edifice and 
fully replace it, rather than pile yet more crap in (I'm more of a 
raze-and-rebuild-it-simpler guy than an add-another-abstraction guy).  But 
I guess providing a "pure" implementation might be a good start either way.

Ales, what is your timeline for splitting up the VFS project?  I want to do 
this work ASAP, so if you're going to do this work, would you be able to 
either (a) do it now (in the next week or so) or (b) wait until I'm done? 
Either way I'll be doing this work in a branch, so it won't interfere with 
any (non-restructuring) work you do (and also, if I fail there's no mess to 
clean up :-).

- DML

On 06/19/2009 01:17 PM, Scott Stark wrote:
> Ok, great. Now we need to get a vfs handler implementation for your 
> jzipfile - vfsjzip.
> 
> David M. Lloyd wrote:
>> When I say "ZipFile" I should really say "zip file".  I of course 
>> would heavily advocate using my "jzipfile" library that I wrote last 
>> week which is exactly what you describe: a pure Java implementation 
>> without the native and locking bullshit. :-)
>>
>> That said however, even jzipfile requires a java.io.File and shuns the 
>> scanning approach.  Here's why.
>>
>> The first reason is simple.  To stream a zip file to find a single 
>> entry, you have to read the entire zip file's data up to that entry 
>> rather than just jumping to the exact offset of the data you're 
>> interested in and pulling down the data.  If you scan the zip file 
>> completely even once, you might as well have just copied it and read 
>> the directory instead.
>>
>> The second reason is more involved.  A zip file has two parts: the 
>> local file section and the directory.  The local file section looks 
>> something like this:
>>
>> [local file 0]
>>    signature word 0x04034b50
>>    compression = deflate
>>    local CRC*
>>    local compressed size*
>>    local uncompressed size*
>>    name = foo/bar/foobar.class
>>    [..possibly compressed file data goes here..]
>>    real CRC**
>>    real compressed size**
>>    real uncompressed size**
>> [local file 1]
>>    signature word 0x04034b50
>>    compression = deflate
>>    local CRC*
>>    local compressed size*
>>    local uncompressed size*
>>    name = foo/bar/foobaz.class
>>    [..possibly compressed file data goes here..]
>>    real CRC**
>>    real compressed size**
>>    real uncompressed size**
>> [...]
>>
>> The asterisk * indicates that this field is a valid header field but 
>> it is almost *always* zero, with the actual information specified 
>> *after* the compressed data (**).  (There are a few extra things in 
>> there that I don't mention, including some odd stuff that the JAR tool 
>> does to distinguish a real JAR file from a file named .jar that was 
>> created by some other zip-like tool).
>>
>> This means that if you're scanning from the start of a zip file (like 
>> ZipInputStream), you usually have no way to know how long each file is 
>> other than just reading the bytes with e.g. InflaterInputStream and 
>> hoping that the EOF matches up correctly (which it doesn't always do).  
>> But there is one worse case.  If a nested zip file is "stored" rather 
>> than "deflated", there may be *no* way to distinguish between records 
>> in the nested zip from the parent zip, so you may reach the end of the 
>> stored data of a nested zip entry and the streamer would get confused 
>> and think that it hit the end of the parent entry.  Very messy.
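
To make the size problem concrete, here is a minimal Java sketch that uses only the JDK's java.util.zip classes (not jzipfile or VFS). It writes one deflated entry the way a streaming writer does, then asks ZipInputStream for the entry's size before and after reading the data. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class StreamedSizeDemo {
    /** Builds a one-entry zip in memory, leaving sizes and CRC for the writer to fill in. */
    static byte[] buildZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("foo/bar/foobar.class"));
            out.write("not really a class file".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Returns { size reported before reading, bytes actually read }. */
    static long[] sizesSeenByScanner(byte[] zip) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry = in.getNextEntry();
            long before = entry.getSize();   // taken from the local header only
            byte[] buf = new byte[256];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;                  // the only way to learn the length
            }
            return new long[] { before, total };
        }
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = sizesSeenByScanner(buildZip());
        System.out.println("size known before reading: " + sizes[0]);
        System.out.println("bytes actually read:       " + sizes[1]);
    }
}
```

The size reported before reading is -1 (unknown), because ZipOutputStream defers the sizes to the descriptor that follows the data; the scanner only learns the length by inflating to the end.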
>>
>> Now the directory is at the *end* of the zip file and looks like this:
>> [directory entry 0]
>>    signature word 0x02014b50
>>    compression = deflate
>>    timestamp = xxxx
>>    real CRC
>>    real compressed size
>>    real uncompressed size
>>    offset of local header = 0
>>    name = foo/bar/foobar.class
>>    comment = blah blah
>> [directory entry 1]
>>    signature word 0x02014b50
>>    compression = deflate
>>    timestamp = xxxx
>>    real CRC
>>    real compressed size
>>    real uncompressed size
>>    offset of local header = 12345
>>    name = foo/bar/foobaz.class
>>    comment = blah blah
>> [...]
>> [end of directory]
>>    signature word 0x06054b50
>>    offset of first directory entry = 24680
>>    entry count = N
>>    comment = ""
>> [EOF]
>>
>> As you can see the directory entries give us hard values we can use 
>> for the lengths (both compressed and uncompressed) which makes it far 
>> more resilient against errors.  Also, the directory is all packed 
>> together, meaning that we can read just a small portion of the zip 
>> file and thus fully index it (and, if we copied the zip to a "safe" 
>> location, we can cache the directory and close the zip file, thus 
>> avoiding one aspect of the infamous locking issues on Windows).  So 
>> ideally, we always use the directory to locate zip entries.
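
As a rough illustration of indexing from the directory alone, the Java sketch below walks a buffer holding the directory entries and records each name with its sizes and local header offset. It assumes the directory offset and entry count are already known, that names are UTF-8 (as in JAR files), and that the archive is not zip64. DirectoryIndex and DirEntry are hypothetical names. The demo in main also assumes the archive has no trailing comment, so the end-of-directory record is exactly the last 22 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipOutputStream;

final class DirectoryIndex {
    static final int ENTRY_SIGNATURE = 0x02014b50;
    static final int ENTRY_FIXED_SIZE = 46;   // bytes before the variable-length name

    /** The hard values the directory records for one entry. */
    static final class DirEntry {
        final long compressedSize;
        final long uncompressedSize;
        final long localHeaderOffset;

        DirEntry(long compressedSize, long uncompressedSize, long localHeaderOffset) {
            this.compressedSize = compressedSize;
            this.uncompressedSize = uncompressedSize;
            this.localHeaderOffset = localHeaderOffset;
        }
    }

    /** Parses entryCount directory entries starting at the buffer's position. */
    static Map<String, DirEntry> parse(ByteBuffer directory, int entryCount) throws ZipException {
        ByteBuffer buf = directory.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        Map<String, DirEntry> index = new LinkedHashMap<>();
        for (int i = 0; i < entryCount; i++) {
            int start = buf.position();
            if (buf.remaining() < ENTRY_FIXED_SIZE || buf.getInt(start) != ENTRY_SIGNATURE) {
                throw new ZipException("bad directory entry " + i);
            }
            long compressed = buf.getInt(start + 20) & 0xFFFFFFFFL;
            long uncompressed = buf.getInt(start + 24) & 0xFFFFFFFFL;
            int nameLength = buf.getShort(start + 28) & 0xFFFF;
            int extraLength = buf.getShort(start + 30) & 0xFFFF;
            int commentLength = buf.getShort(start + 32) & 0xFFFF;
            long localOffset = buf.getInt(start + 42) & 0xFFFFFFFFL;
            byte[] name = new byte[nameLength];
            buf.position(start + ENTRY_FIXED_SIZE);
            buf.get(name);
            index.put(new String(name, StandardCharsets.UTF_8),
                      new DirEntry(compressed, uncompressed, localOffset));
            // Skip the extra field and comment to reach the next entry.
            buf.position(start + ENTRY_FIXED_SIZE + nameLength + extraLength + commentLength);
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("foo/bar/foobar.class"));
            out.write(new byte[100]);
            out.closeEntry();
        }
        byte[] zip = bytes.toByteArray();
        ByteBuffer all = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        int eocd = zip.length - 22;                    // no archive comment assumed
        int count = all.getShort(eocd + 10) & 0xFFFF;  // entry count
        int dirOffset = all.getInt(eocd + 16);         // offset of first directory entry
        Map<String, DirEntry> index =
            parse(ByteBuffer.wrap(zip, dirOffset, eocd - dirOffset), count);
        for (Map.Entry<String, DirEntry> e : index.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().uncompressedSize
                + " bytes, local header at " + e.getValue().localHeaderOffset);
        }
    }
}
```

Only the directory bytes are touched here; none of the entry data is read or inflated.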
>>
>> The question is, how do you find the directory?  You can't scan forward 
>> from the beginning of the file because of the problems already outlined 
>> with that approach.  The only reasonably accurate way to get the 
>> directory is to read the *end* of the file first, which generally 
>> means that InputStream is not an option to get this data.  The only 
>> real option is to use a random-access file method (FileChannel or 
>> RandomAccessFile) to get at this data.  This means that if you have an 
>> InputStream (which is what you can generally get out of VFS), you need 
>> to create a temp file and slurp the data into it before you can index it.
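
One way to do that read from the end, sketched in Java with RandomAccessFile. This assumes a plain (non-zip64, single-disk) archive, and EndOfDirectoryLocator and DirectoryInfo are hypothetical names, not part of jzipfile or VFS. It searches backwards through the last 22 + 65535 bytes, the largest span the end-of-directory record and its comment can occupy, for the 0x06054b50 signature.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class EndOfDirectoryLocator {
    static final int EOCD_SIGNATURE = 0x06054b50;
    static final int EOCD_MIN_SIZE = 22;      // fixed part of the record
    static final int MAX_COMMENT = 0xFFFF;    // comment length is a 16-bit field

    static final class DirectoryInfo {
        final long offset;      // offset of the first directory entry
        final int entryCount;   // number of directory entries

        DirectoryInfo(long offset, int entryCount) {
            this.offset = offset;
            this.entryCount = entryCount;
        }
    }

    static DirectoryInfo locate(RandomAccessFile file) throws IOException {
        long length = file.length();
        if (length < EOCD_MIN_SIZE) {
            throw new IOException("too short to be a zip file");
        }
        // Read only the tail of the file; the rest is never touched.
        int tailSize = (int) Math.min(length, (long) EOCD_MIN_SIZE + MAX_COMMENT);
        byte[] tail = new byte[tailSize];
        file.seek(length - tailSize);
        file.readFully(tail);
        ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        for (int pos = tailSize - EOCD_MIN_SIZE; pos >= 0; pos--) {
            if (buf.getInt(pos) != EOCD_SIGNATURE) {
                continue;
            }
            // Guard against the signature bytes appearing inside the comment:
            // a genuine record plus its comment must end exactly at EOF.
            int commentLength = buf.getShort(pos + 20) & 0xFFFF;
            if (pos + EOCD_MIN_SIZE + commentLength != tailSize) {
                continue;
            }
            int entryCount = buf.getShort(pos + 10) & 0xFFFF;
            long offset = buf.getInt(pos + 16) & 0xFFFFFFFFL;
            return new DirectoryInfo(offset, entryCount);
        }
        throw new IOException("end of directory record not found");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: EndOfDirectoryLocator <zip file>");
            return;
        }
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            DirectoryInfo info = locate(file);
            System.out.println(info.entryCount + " entries, directory at offset " + info.offset);
        }
    }
}
```

The same search cannot be done on a plain InputStream without first consuming everything before the tail, which is why the temp-file copy is needed.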
>>
>> I realize that this is a full iteration of the zip file, *however* it 
>> means that once you do this, you have instant access to any zip entry 
>> from that point on, *and* the lifetime of the zip file is no longer 
>> tied to its parent (be it a directory or another zip file), which 
>> fixes a wide variety of issues, including the requirement that 
>> resources don't start disappearing before undeploy() is called (in the 
>> case where the FS-backed deploy/ mechanism is used), more Windows 
>> locking issues, etc.  I also believe that if we *always* copy and 
>> *never* scan the zip file, we'll have a net performance gain in the 
>> end (assuming we only ever copy and index each zip exactly one time).  
>> And of course things are a lot less likely to blow up surprisingly if 
>> you have a copy of the zip file.
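
A minimal Java sketch of the copy-first approach. It uses java.util.zip.ZipFile purely as a stand-in for a directory-based reader (the jzipfile API is not shown in this thread, so none of its calls are assumed), and CopyThenRead with its methods is a hypothetical name. For brevity it reopens the copy on every lookup; the approach described above would index once and keep the index.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class CopyThenRead {
    /** Slurps the stream into a private temp file, detaching it from its source. */
    static Path copyToTemp(InputStream in) throws IOException {
        Path temp = Files.createTempFile("zipcopy", ".zip");
        Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        return temp;
    }

    /** Reads one entry by name via the directory; returns null if it is absent. */
    static byte[] readEntry(Path zip, String name) throws IOException {
        try (ZipFile file = new ZipFile(zip.toFile())) {
            ZipEntry entry = file.getEntry(name);
            if (entry == null) {
                return null;
            }
            try (InputStream data = file.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = data.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CopyThenRead <zip file> <entry name>");
            return;
        }
        Path copy;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            copy = copyToTemp(in);   // the one full pass over the data
        }
        try {
            byte[] data = readEntry(copy, args[1]);
            System.out.println(data == null ? "no such entry" : data.length + " bytes");
        } finally {
            Files.deleteIfExists(copy);
        }
    }
}
```

Once copyToTemp returns, the original stream (and whatever it came from) can be closed or removed without affecting later reads.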
>>
>> - DML
>>
>> On 06/19/2009 11:25 AM, Scott Stark wrote:
>>> I don't view the ZipFile API as flexible enough, as it requires a 
>>> java.io.File and cannot open nested jars directly. In addition, all 
>>> JDK implementations have a native component that has tended to 
>>> introduce locking and memory faults. That is why I believe we need a 
>>> pure Java jar/zip file implementation that addresses these limitations.