[jboss-dev] Scanning jars for package capabilities
Scott Stark
sstark at redhat.com
Fri Jun 19 14:17:19 EDT 2009
Ok, great. Now we need a VFS handler implementation for your
jzipfile - vfsjzip.
David M. Lloyd wrote:
> When I say "ZipFile" I should really say "zip file". I of course
> would heavily advocate using my "jzipfile" library that I wrote last
> week which is exactly what you describe: a pure Java implementation
> without the native and locking bullshit. :-)
>
> That said, however, even jzipfile requires a java.io.File and shuns the
> scanning approach. Here's why.
>
> The first reason is simple. To stream a zip file to find a single
> entry, you have to read the entire zip file's data up to that entry
> rather than just jumping to the exact offset of the data you're
> interested in and pulling down the data. If you scan the zip file
> completely even once, you might as well have just copied it and read
> the directory instead.
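
To make that cost concrete, here is a minimal sketch of the two approaches
using the stock java.util.zip classes; the class and method names are purely
illustrative. The streaming variant has to pull every byte of entry data that
precedes the target through the stream, while the directory-backed variant
jumps straight to it.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import java.util.zip.ZipInputStream;

    public class StreamVsDirectory {

        // Streaming: every preceding entry's data is read (and typically
        // inflated) and thrown away before we reach the entry we want.
        static InputStream findByStreaming(String path, String name) throws IOException {
            ZipInputStream zin = new ZipInputStream(new FileInputStream(path));
            for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                if (e.getName().equals(name)) {
                    return zin; // stream is now positioned at this entry's data
                }
            }
            zin.close();
            return null;
        }

        // Directory-based: ZipFile reads the central directory up front, so
        // getEntry() knows the exact offset and we jump straight to the data.
        static InputStream findByDirectory(String path, String name) throws IOException {
            ZipFile zf = new ZipFile(path);
            ZipEntry e = zf.getEntry(name);
            return e == null ? null : zf.getInputStream(e);
        }
    }

In real code the ZipFile would also have to be closed once the caller is done
with the returned stream; the sketch skips that bookkeeping.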
>
> The second reason is more involved. A zip file has two parts: the
> local file section and the directory. The local file section looks
> something like this:
>
> [local file 0]
> signature word 0x04034b50
> compression = deflate
> local CRC*
> local compressed size*
> local uncompressed size*
> name = foo/bar/foobar.class
> [..possibly compressed file data goes here..]
> real CRC**
> real compressed size**
> real uncompressed size**
> [local file 1]
> signature word 0x04034b50
> compression = deflate
> local CRC*
> local compressed size*
> local uncompressed size*
> name = foo/bar/foobaz.class
> [..possibly compressed file data goes here..]
> real CRC**
> real compressed size**
> real uncompressed size**
> [...]
>
> The single asterisk (*) marks fields that are valid header fields but
> are almost *always* zero, with the actual values written in a data
> descriptor *after* the compressed data (**). (There are a few extra
> things in there that I don't mention, including some odd stuff that
> the JAR tool does to distinguish a real JAR file from a file named
> .jar that was created by some other zip-like tool.)
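
As a sketch of what that looks like on disk, reading the fixed 30-byte local
header is enough to see whether the sizes were deferred. Field offsets follow
the standard ZIP format; the class and method names here are made up.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.channels.FileChannel;

    final class LocalHeaderSketch {

        static final int LOCAL_SIG = 0x04034b50; // "PK\003\004"

        // Reads the fixed 30-byte local file header at 'offset' and reports
        // whether the CRC/sizes were deferred to a trailing data descriptor
        // (general purpose flag bit 3), in which case the header fields are zero.
        static void describeLocalHeader(FileChannel ch, long offset) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(30).order(ByteOrder.LITTLE_ENDIAN);
            ch.read(buf, offset);
            buf.flip();
            if (buf.getInt(0) != LOCAL_SIG) {
                throw new IOException("no local file header at offset " + offset);
            }
            int flags           = buf.getShort(6) & 0xffff;
            long crc            = buf.getInt(14) & 0xffffffffL;
            long compressedSize = buf.getInt(18) & 0xffffffffL;
            int nameLength      = buf.getShort(26) & 0xffff;
            boolean deferred = (flags & 0x0008) != 0;
            System.out.printf("nameLength=%d crc=%d csize=%d deferred=%b%n",
                              nameLength, crc, compressedSize, deferred);
        }
    }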
>
> This means that if you're scanning from the start of a zip file (like
> ZipInputStream does), you usually have no way to know how long each
> file is other than reading the bytes with e.g. InflaterInputStream and
> hoping that the EOF matches up correctly (which it doesn't always do).
> And there is an even worse case: if a nested zip file is "stored"
> rather than "deflated", there may be *no* way to distinguish records
> in the nested zip from records in the parent zip, so you may reach the
> end of the stored data of a nested zip entry and the streamer gets
> confused and thinks that it hit the end of the parent entry. Very messy.
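
You can see the first problem directly with java.util.zip: when the sizes
live in the data descriptor, ZipEntry.getSize() reports -1 from a
ZipInputStream until the entry's data has been read through. A rough sketch,
reading args[0] as the zip path:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class StreamingSizes {
        public static void main(String[] args) throws IOException {
            ZipInputStream zin = new ZipInputStream(new FileInputStream(args[0]));
            try {
                for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                    long before = e.getSize();   // frequently -1: the local header had zeros
                    while (zin.read() != -1) {   // read-and-hope until the entry "ends"
                        // discard the uncompressed bytes
                    }
                    long after = e.getSize();    // now filled in from the data descriptor
                    System.out.println(e.getName() + " before=" + before + " after=" + after);
                }
            } finally {
                zin.close();
            }
        }
    }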
>
> Now the directory is at the *end* of the zip file and looks like this:
> [directory entry 0]
> signature word 0x02014b50
> compression = deflate
> timestamp = xxxx
> real CRC
> real compressed size
> real uncompressed size
> offset of local header = 0
> name = foo/bar/foobar.class
> comment = blah blah
> [directory entry 1]
> signature word 0x02014b50
> compression = deflate
> timestamp = xxxx
> real CRC
> real compressed size
> real uncompressed size
> offset of local header = 12345
> name = foo/bar/foobaz.class
> comment = blah blah
> [...]
> [end of directory]
> signature word 0x06054b50
> offset of first directory entry = 24680
> entry count = N
> comment = ""
> [EOF]
>
> As you can see, the directory entries give us hard values we can use
> for the lengths (both compressed and uncompressed), which makes parsing
> far more resilient against errors. Also, the directory is all packed
> together, meaning that we can read just a small portion of the zip
> file and thus fully index it (and, if we copied the zip to a "safe"
> location, we can cache the directory and close the zip file, thus
> avoiding one aspect of the infamous locking issues on Windows). So
> ideally, we always use the directory to locate zip entries.
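
A rough sketch of the cache-the-directory-and-close idea, using the stock
ZipFile for illustration (CachedDirectory is an invented name, and jzipfile's
own API may well look different):

    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    final class CachedDirectory {

        private final Map<String, ZipEntry> index = new HashMap<String, ZipEntry>();

        // Read the central directory once, keep the per-entry metadata, and
        // close the file immediately so no handle lingers (and locks on Windows).
        CachedDirectory(String path) throws IOException {
            ZipFile zf = new ZipFile(path);
            try {
                for (Enumeration<? extends ZipEntry> en = zf.entries(); en.hasMoreElements();) {
                    ZipEntry e = en.nextElement();
                    index.put(e.getName(), e); // name -> sizes, CRC, timestamp, ...
                }
            } finally {
                zf.close();
            }
        }

        boolean contains(String name)      { return index.containsKey(name); }
        long uncompressedSize(String name) { return index.get(name).getSize(); }
    }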
>
> The question is, how do you find the directory? You can't scan forward
> from the beginning of the file because of the problems already outlined
> with that approach. The only reasonably accurate way to get the
> directory is to read the *end* of the file first, which generally
> means that InputStream is not an option to get this data. The only
> real option is to use a random-access file method (FileChannel or
> RandomAccessFile) to get at this data. This means that if you have an
> InputStream (which is what you can generally get out of VFS), you need
> to create a temp file and slurp the data into it before you can index it.
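
Here is a minimal sketch of locating the end-of-central-directory record with
RandomAccessFile; the class name is made up and the byte-at-a-time backwards
search is deliberately naive.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    final class EocdLocator {

        static final int EOCD_SIG    = 0x06054b50; // end-of-central-directory signature
        static final int EOCD_MIN    = 22;         // fixed part of the record
        static final int MAX_COMMENT = 0xffff;     // trailing comment can be up to 64k

        // Search backwards from EOF for the EOCD signature; the record holds the
        // directory offset and entry count, which is all we need to index the zip.
        static long findEocd(RandomAccessFile raf) throws IOException {
            long length = raf.length();
            long stop = Math.max(0, length - EOCD_MIN - MAX_COMMENT);
            byte[] b = new byte[4];
            for (long pos = length - EOCD_MIN; pos >= stop; pos--) {
                raf.seek(pos);
                raf.readFully(b);
                // the signature is stored little-endian on disk
                int sig = (b[0] & 0xff) | (b[1] & 0xff) << 8
                        | (b[2] & 0xff) << 16 | (b[3] & 0xff) << 24;
                if (sig == EOCD_SIG) {
                    return pos;
                }
            }
            throw new IOException("no end-of-central-directory record found");
        }
    }

A real implementation would read the last 64k of the file into a buffer once
rather than seeking per candidate position, but the point stands: you need
random access to the tail of the file, which a plain InputStream can't give you.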
>
> I realize that this is a full pass over the zip file. *However*, once
> you have done it, you have instant access to any zip entry from that
> point on, *and* the lifetime of the zip file is no longer tied to its
> parent (be it a directory or another zip file). That fixes a wide
> variety of issues: the requirement that resources not start
> disappearing before undeploy() is called (in the case where the
> FS-backed deploy/ mechanism is used), more of the Windows locking
> issues, and so on. I also believe that if we *always* copy and *never*
> scan the zip file, we'll see a net performance gain in the end
> (assuming we copy and index each zip exactly once). And of course
> things are a lot less likely to blow up surprisingly if you have a
> copy of the zip file.
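
A sketch of the copy-once step; the helper name is invented, and in practice
something like this would live behind the VFS handler rather than stand alone.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    final class ZipTempCopy {

        // Copy the stream VFS hands us into a temp file once, so the zip can
        // then be indexed via random access and outlive its parent deployment.
        static File copyToTemp(InputStream in) throws IOException {
            File tmp = File.createTempFile("vfszip", ".jar");
            tmp.deleteOnExit();
            OutputStream out = new FileOutputStream(tmp);
            try {
                byte[] buf = new byte[8192];
                for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                    out.write(buf, 0, n);
                }
            } finally {
                out.close();
                in.close();
            }
            return tmp;
        }
    }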
>
> - DML
>
> On 06/19/2009 11:25 AM, Scott Stark wrote:
>> I don't view the ZipFile API as flexible enough, as it requires a
>> java.io.File and so can't handle nested jars. In addition, all JDK
>> implementations have a native component that has tended to introduce
>> locking and memory faults. That is why I believe we need a pure Java
>> jar/zip file implementation that addresses these limitations.