[jboss-dev] Further profiling: Where should I focus?

David M. Lloyd david.lloyd at redhat.com
Tue Jan 5 18:38:35 EST 2010


On 01/05/2010 04:39 PM, Bill Burke wrote:
> David M. Lloyd wrote:
>> That's fine BUT a general-purpose caching system will just fuck things up
>> again.
>
> With 3.5 years of development you're telling me that our testsuite isn't
> good enough to pick up errors introduced by a "general-purpose caching
> system" or other optimizations?

Yup.  The test suite has historically had a very difficult time locating 
problems which have cropped up in the VFS2 implementation.  I guarantee 
that applying a general solution without understanding each use case will 
result in crappy code and time-bomb style errors.  You can't produce 
bug-free code just by writing regression tests.

>> If you know the info won't change, then don't keep asking for it.
>
> Easier said than done.
>
> Take for instance VirtualFile.visit().  Any visitor that is run calls
> isDirectory().  For each isDirectory() there is at least one
> ConcurrentHashMap lookup to find the mount, then to delegate to the
> FileSystem interface which itself does hash lookups.

Can't be avoided.  The topology of mounts may change at any time, which 
means that one call to isDirectory() might give a different answer than the 
next for the same file.  How can this possibly be cached when things can be 
mounted and unmounted at will, by threads other than your own?  A 
VirtualFile is just a symbolic name, after all - it does not and cannot 
represent any kind of link into the physical filesystem itself, since such a 
link cannot exist with anything resembling consistent semantics.

It might be worthwhile to explore using some type of copy-on-write (COW) 
HashMap setup instead of ConcurrentHashMap, since there are relatively few 
mounts compared to lookups.
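
Roughly what I have in mind - a minimal sketch of a copy-on-write map, with 
made-up names rather than the actual VFS3 internals:

import java.util.HashMap;
import java.util.Map;

// Minimal copy-on-write map sketch: a read is one volatile field load plus
// one plain HashMap lookup, with no locking; a writer copies the whole map
// under a lock, which is acceptable because mounts and unmounts are rare
// compared to lookups.
final class CopyOnWriteMap<K, V> {
    private volatile Map<K, V> snapshot = new HashMap<K, V>();

    V get(final K key) {
        return snapshot.get(key);        // lock-free read of the current snapshot
    }

    synchronized void put(final K key, final V value) {
        final Map<K, V> copy = new HashMap<K, V>(snapshot);
        copy.put(key, value);
        snapshot = copy;                 // publish the new snapshot
    }

    synchronized void remove(final K key) {
        final Map<K, V> copy = new HashMap<K, V>(snapshot);
        copy.remove(key);
        snapshot = copy;
    }
}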

In any case, O(1) + O(1) = O(1), so algorithmically it's sound.  It's just a 
question of optimization at this stage, if there are any optimizations left 
that can be made without breaking the API contract.

> Then, visit() continues with calling getChildren() if the file is a
> directory.  getChildren() recreates a totally brand new VirtualFile
> list, by, yes, doing N more hash lookups for each child entry.

You're talking about the creation of a list of objects which have three 
reference fields and one integer field.  It's cheap.  I already benchmarked 
the piss out of it, and I'm sure it's not the source of any significant 
performance issue unless you're doing something crazy.

Caching VirtualFile instances, on the other hand, adds a lot of complexity 
and it doesn't give you anything.  Again, creating VirtualFiles is cheap, 
or at least it was last time I measured.
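
For reference, the shape of the object being created is roughly the 
following - a simplified sketch matching that description, not the literal 
class (accessors, equals() and hashCode() elided):

// Three reference fields and one int, with no I/O in the constructor, so
// constructing one is an allocation plus a handful of field writes.
final class PathName {
    private final String name;        // this path element
    private final String lowerName;   // lower-case form for case-insensitive matching
    private final PathName parent;    // parent name, or null at the root
    private final int hash;           // precomputed hash code

    PathName(final String name, final PathName parent) {
        this.name = name;
        this.lowerName = name.toLowerCase();
        this.parent = parent;
        this.hash = name.hashCode() * 31 + (parent == null ? 0 : parent.hash);
    }
}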

>> There are many different ways in which VFS is used - mounting a library,
>> for example, can be done with a plain zip mount (in other words, everything
>> is in RAM), whereas mounting a hot deployment must first copy the JAR
>> before mounting to avoid locking issues on Windows and the "disappearing
>> classes" problem which occurs when you delete the JAR out from under the
>> classloader before the classloader is done with it.
>>
>> Mounting an exploded hot deployment must be done with a more complex copy
>> operation so that changes to dynamic content can be propagated using
>> different rules based on the content (e.g. HTML files should be copied
>> immediately, but the semantics of changing class files is less clear).
>> What can we cache here?
>
> You understand that just to register a jar library with a deployer
> archive, sar, etc, you're running VFS and creating these structures?
>
> **************
> Jar contents don't need these special hot deployment or mounting cases
> as they are static and always will be.
> ***************

If you're doing isDirectory in a mounted JAR, you're only ever hitting the 
cached value in RAM.  You only incur the filesystem cost if you are looking 
at the real filesystem.  JAR contents DO NOT HAVE the special hot 
deployment cases.
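
To make "cached value in RAM" concrete: a rough sketch of a zip-backed 
metadata index, assuming the zip directory is read once at mount time (the 
real zip filesystem is more involved, but the principle is the same):

import java.util.Enumeration;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// The zip directory is walked once when the JAR is mounted; every later
// metadata query (isDirectory and the like) is answered from the in-memory
// index and never touches the disk.
final class ZipIndex {
    private final Map<String, Boolean> directoryByName = new HashMap<String, Boolean>();

    ZipIndex(final ZipFile zip) {
        final Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            final ZipEntry entry = entries.nextElement();
            directoryByName.put(entry.getName(), Boolean.valueOf(entry.isDirectory()));
        }
    }

    boolean isDirectory(final String entryName) {
        final Boolean dir = directoryByName.get(entryName);
        return dir != null && dir.booleanValue();
    }
}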

Again, to break it down (the first two cases are sketched in code after the 
list):

- JAR Library - can be mounted in place.  Contents will never change, 
nested modules will never be mounted.
- JAR Hot Deployment - must be copied, then the copy can be mounted in 
place.  The copy is to allow the file to be deleted to signify undeployment 
(Windows), and to avoid undeployment failures due to disappearing classes 
(Seam exposed this problem last year sometime, but it applies to all JARred 
deployments).  Nested modules can be mounted "inside" a JAR deployment (but 
only on demand - if such a module is part of the classpath for example). 
The nested modules need not be copied.
- Exploded Hot Deployment - the simplest solution is basically to use the 
files in place.  Since they can change at any time, caching cannot be done 
reliably.
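
In code terms, the difference between the first two cases boils down to the 
copy step; the mount itself is the same.  A rough sketch, with a made-up 
helper name rather than the real mount API:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// A library JAR is mounted where it sits.  A hot-deployed JAR is first
// copied to a private temp file and the copy is mounted, so the original
// can be deleted (Windows locking, disappearing classes) without affecting
// the running deployment.
final class DeploymentCopy {

    static File copyForMount(final File deployedJar, final File tempDir) throws IOException {
        final File copy = File.createTempFile(deployedJar.getName(), ".jar", tempDir);
        final InputStream in = new FileInputStream(deployedJar);
        try {
            final OutputStream out = new FileOutputStream(copy);
            try {
                final byte[] buffer = new byte[8192];
                int count;
                while ((count = in.read(buffer)) != -1) {
                    out.write(buffer, 0, count);
                }
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
        return copy;   // mount this copy; the original file is now expendable
    }
}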

The only optimization I can see here is that it may be possible to make 
certain assumptions - for JAR libraries and class path entries, say - that 
there are never going to be any overlying submounts, so that the mount-table 
lookup can be bypassed.  The lookup within the filesystem itself cannot be 
avoided, however.
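
Concretely, something along these lines might work - a hypothetical sketch, 
with all names made up for illustration:

import java.util.Map;

// If a mount is flagged as never having overlying submounts (a static
// classpath JAR, say), the per-call walk of the mount table can be skipped
// and the call goes straight to the owning filesystem.
final class MountResolver {
    private final Map<String, Object> mountTable;   // mount point path -> backing filesystem

    MountResolver(final Map<String, Object> mountTable) {
        this.mountTable = mountTable;
    }

    Object resolve(final Object owningFileSystem, final boolean staticMount, final String path) {
        if (staticMount) {
            return owningFileSystem;   // nothing can ever be mounted over a static entry
        }
        return mountTable.get(path);   // general case: consult the mount table (walk elided)
    }
}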

> Caching bugs can be mitigated and supported simply by asking the file
> system if it is ok to cache indefinitely.

The file system already does this.  Caching can only be done at the 
filesystem level, because the topology of the VFS can change at any time.

> Or even better, by having the
> FileSystem serve up/create/instantiate VirtualFile instances.  Then you
> can have special cases per FileSystem type that are fully optimized.

This won't help - a VirtualFile is just a symbolic name.  If you're 
thinking of another *kind* of thing which hooks into a fully resolved 
filesystem node, it's iffy, because the semantics cannot be made consistent. 
For real filesystems, it would essentially be a File, which provides no more 
guarantees over the real filesystem than a VirtualFile does over the VFS 
(in other words, the topology can change out from under you at any time). 
For mounted filesystems, it might be something different, depending on what 
the filesystem internals can do.

>> The "ownership" semantics of static files are not clearly defined either,
>> and can depend on what is being mounted and why.  I suspect you'll find
>> that there are very few (if any) cases where the filesystem is being hit
>> for information such as "isDirectory" or "lastModified" information outside
>> of hot deployment, which is the case where it can't be cached anyway.
>
> "isDirectory" is being hit *constantly* by anything that uses the VFS.
> In my little test with ~5000 jar entries, it is being called 15,000 times.
>
> Sure, the "real" file system is not necessarily being hit, but there are
> a GIZILLION unnecessary hash lookups, object creations, etc each and
> every time a jar or directory is browsed (annotation scanning, package
> names, subdeployment searches).

That's the price you pay for having a virtual filesystem with a changeable 
topology.  All of these operations are O(1) and designed to be as fast as 
possible, though no doubt there are further micro-optimizations that could 
be made.

>>> As I stated early, define a parallel non-cache api that invalidates the
>>> cache on diffs for sensitive operations (like hot-deployment).
>>
>> Sure, if you like bugs and broken behavior.  Correctness first, speed
>> second.
>
> VFS has had 3-4 years to get correctness right.  If correctness isn't
> covered by the testsuite, then any significant code change will cause
> "bugs and broken behavior"  especially a complete rewrite of VFS via VFS 3.

That's rather disingenuous.  VFS failed to get correctness right after 3-4 
years, hence it has been rewritten with correctness as the primary goal. 
The test suite doesn't ensure correctness; it just ensures that condition A 
causes result B.  If you don't know what the semantics are supposed to be, 
then you have no way of knowing that A is correct, and the test is therefore 
useless.  This was the situation with VFS2: there was no specification of 
what each method was supposed to do, thus people were not using the VFS in a 
manner consistent with its design, and thus strange errors cropped up which 
were very, very difficult to replicate in tests.

By setting out a simple set of functionality, with exactly defined 
semantics for each method (you'll notice that VFS3 is meticulously 
documented), the test suite has real meaning, because you're spelling out 
what conditions are supported and then testing each one.  Anything missing 
from the documentation is not supported behavior, and any behavior which 
deviates from the documentation is a defect.  If the specification is 
wrong, you then have a basis to find all code which USES the specific 
feature and analyze the impact of a change on all consuming code, unlike 
what we do today (make a change and cross your fingers that the test suite 
will notice a problem).

If you start adding random stuff like caching which CHANGES the semantics 
away from what's documented, then you're introducing a bug by definition. 
The test suite may not pick up on it, because it tests to ensure that 
condition A causes result B, but it can't account for side effects of 
condition A unless someone thinks to put that test in there in the first place.

>>   This is a complex problem that you cannot paint with a broad
>> brush.  You must look at each use case separately before you can say stuff
>> like "we need caching".
>
> Just to let you know, I did work on VFS and optimizations of VFS end of
> 2006 for a month or two.  It used to do stupid shit like calculating
> path names each and every time you asked for it.

If you want the feature, you pay the price.  The requirements are always 
open for debate, I suppose.

- DML


