On 01/05/2010 04:39 PM, Bill Burke wrote:
David M. Lloyd wrote:
> That's fine BUT a general-purpose caching system will just fuck things up
> again.
With 3.5 years of development you're telling me that our testsuite isn't
good enough to pick up errors introduced by a "general-purpose caching
system" or other optimizations?
Yup. The test suite has historically had a very difficult time locating
problems which have cropped up in the VFS2 implementation. I guarantee
that applying a general solution without understanding each use case will
result in crappy code and time-bomb style errors. You can't produce
bug-free code just by writing regression tests.
> If you know the info won't change, then don't keep asking for it.
Easier said than done.
Take for instance VirtualFile.visit(). Any visitor that is run calls
isDirectory(). Each isDirectory() call does at least one
ConcurrentHashMap lookup to find the mount, then delegates to the
FileSystem interface, which itself does hash lookups.
Can't be avoided. The topology of mounts may change at any time, which
means that one call to isDirectory() might give a different answer than the
next for the same file. How can this possibly be cached when things can be
mounted and unmounted at will, by threads other than your own? A
VirtualFile is just a symbolic name, after all - it does not and cannot
represent any kind of link into the physical filesystem itself, since such
a link cannot exist with anything resembling consistent semantics.
It might be worthwhile to explore using some type of COW HashMap setup
instead of CHM, since there are relatively few mounts compared to lookups.
In any case, O(1) + O(1) = O(1) so algorithmically it's sound. It's just a
question of optimization at this stage, if there are any which can be done
without breaking the API contract.
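Something along these lines is what I have in mind - just a sketch of the
COW idea, nothing that exists in the tree, and the names are made up:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * Sketch of a copy-on-write mount table: a read is a single volatile
     * load plus one HashMap lookup, while a write (mount/unmount) copies
     * the whole map and swaps it in.
     */
    final class CowMountTable<V> {
        private volatile Map<String, V> mounts = Collections.emptyMap();

        V getMount(String mountPoint) {
            // lock-free read of the current snapshot
            return mounts.get(mountPoint);
        }

        synchronized void addMount(String mountPoint, V mount) {
            Map<String, V> copy = new HashMap<String, V>(mounts);
            copy.put(mountPoint, mount);
            mounts = copy; // publish the new snapshot in one volatile write
        }

        synchronized void removeMount(String mountPoint) {
            Map<String, V> copy = new HashMap<String, V>(mounts);
            copy.remove(mountPoint);
            mounts = copy;
        }
    }

Writes get more expensive, but mounts and unmounts are rare next to the
lookups, so that's probably the right trade.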
Then, visit() continues with calling getChildren() if the file is a
directory. getChildren() creates a totally brand new VirtualFile list
by, yes, doing yet another hash lookup for each child entry.
You're talking about the creation of a list of objects which have three
reference fields and one integer field. It's cheap. I already benchmarked
the piss out of it, and I'm sure it's not the source of any significant
performance issue unless you're doing something crazy.
Caching VirtualFile instances, on the other hand, adds a lot of complexity
and it doesn't give you anything. Again, creating VirtualFiles is cheap,
or at least it was last time I measured.
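For the sake of argument, the shape is roughly this (hypothetical fields,
not the actual VFS3 source - the point is that there is no I/O and no
filesystem handle involved in construction):

    // Hypothetical shape only: a symbolic name is a couple of references
    // and a precomputed int, nothing more.
    final class SymbolicName {
        private final SymbolicName parent; // enclosing component, null at the root
        private final String name;         // this path component
        private final int hashCode;        // precomputed from parent + name

        SymbolicName(SymbolicName parent, String name) {
            this.parent = parent;
            this.name = name;
            this.hashCode = (parent == null ? 0 : parent.hashCode * 31) + name.hashCode();
        }

        @Override
        public int hashCode() {
            return hashCode;
        }
    }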
> There are many different ways in which VFS is used - mounting a library,
> for example, can be done with a plain zip mount (in other words, everything
> is in RAM), whereas mounting a hot deployment must first copy the JAR
> before mounting to avoid locking issues on windows and the "disappearing
> classes" problem which occurs when you delete the JAR out from under the
> classloader before the classloader is done with it.
>
> Mounting an exploded hot deployment must be done with a more complex copy
> operation so that changes to dynamic content can be propagated using
> different rules based on the content (e.g. HTML files should be copied
> immediately, but the semantics of changing class files is less clear).
> What can we cache here?
You understand that just to register a jar library with a deployer
(archive, sar, etc.), you're running VFS and creating these structures?
**************
Jar contents don't need these special hot deployment or mounting cases
as they are static and always will be.
***************
If you're doing isDirectory in a mounted JAR, you're only ever hitting the
cached value in RAM. You only incur the filesystem cost if you are looking
at the real filesystem. JAR contents DO NOT HAVE the special hot
deployment cases.
Again to break it down:
- JAR Library - can be mounted in place. Contents will never change,
nested modules will never be mounted.
- JAR Hot Deployment - must be copied, then the copy can be mounted in
place. The copy is to allow the file to be deleted to signify undeployment
(Windows), and to avoid undeployment failures due to disappearing classes
(Seam exposed this problem last year sometime, but it applies to all JARred
deployments). Nested modules can be mounted "inside" a JAR deployment (but
only on demand - if such a module is part of the classpath for example).
The nested modules need not be copied.
- Exploded Hot Deployment - the simplest solution is to basically use the
files in place. Since they can change at any time, caching cannot be done
reliably.
The only optimization I can see here is that it may be possible to make
certain assumptions - like for JAR libraries/class path entries - that
there are never going to be any overlying submounts, so that lookup can be
bypassed. The lookup within the filesystem cannot be avoided however.
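To put the first two cases into code - a rough sketch against the VFS3
API, with mountZip/TempFileProvider written from memory, so treat the
exact signatures as approximate, and the copy helper is purely
illustrative:

    import java.io.Closeable;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.Executors;
    import org.jboss.vfs.TempFileProvider;
    import org.jboss.vfs.VFS;
    import org.jboss.vfs.VirtualFile;

    public class MountSketch {
        public static void main(String[] args) throws Exception {
            TempFileProvider tmp =
                TempFileProvider.create("deploy", Executors.newScheduledThreadPool(2));

            // JAR library: mount the zip in place. Contents are static,
            // nothing is ever mounted inside it, nothing needs to be copied.
            VirtualFile libName = VFS.getChild("/libs/some-library.jar");
            Closeable libMount =
                VFS.mountZip(new File("/repo/some-library.jar"), libName, tmp);

            // JAR hot deployment: copy first so the original can be deleted
            // to signify undeployment (Windows locking, "disappearing
            // classes"), then mount the copy.
            File copy = copyToWorkArea(new File("/deploy/app.jar"));
            VirtualFile appName = VFS.getChild("/deployments/app.jar");
            Closeable appMount = VFS.mountZip(copy, appName, tmp);

            // ... hand the VirtualFile names to the deployers ...

            // Closing a handle unmounts; the names stay valid as names,
            // they just resolve to nothing afterwards.
            appMount.close();
            libMount.close();
        }

        // Plain stream copy into a work area; purely illustrative.
        private static File copyToWorkArea(File original) throws IOException {
            File copy = File.createTempFile(original.getName(), ".jar");
            FileInputStream in = new FileInputStream(original);
            try {
                FileOutputStream out = new FileOutputStream(copy);
                try {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                } finally {
                    out.close();
                }
            } finally {
                in.close();
            }
            return copy;
        }
    }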
Caching bugs can be mitigated and supported simply by asking the file
system if it is ok to cache indefinitely.
The file system already does this. Caching can only be done on a
filesystem level because the topology of the VFS can change at any time.
Or even better, by having the
FileSystem serve up/create/instantiate VirtualFile instances. Then you
can have special cases per FileSystem type that are fully optimized.
This won't help - a VirtualFile is just a symbolic name. If you're
thinking of another *kind* of thing which hooks into a fully resolved
filesystem node, it's iffy because the semantics cannot be consistent. For
real filesystems, it's essentially a File, which provides no more
guarantees over the real filesystem than a VirtualFile does over the VFS
(in other words, the topology can change out from under you at any time).
For mounted filesystems, it might be something different depending on what
the filesystem internals can do.
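For illustration only, the shape of that resolution is roughly the
following (made-up names, not the real org.jboss.vfs.spi.FileSystem
signatures):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a query on a symbolic name is answered by finding
    // the closest enclosing mount, then delegating to whatever backs that
    // mount. The answer can legitimately change between two calls if a
    // mount or unmount happens in between.
    interface BackingFileSystem {
        boolean isDirectory(String pathRelativeToMount);
    }

    final class Resolver {
        // mount point -> backing filesystem (the hash lookup discussed above)
        private final Map<String, BackingFileSystem> mounts =
            new ConcurrentHashMap<String, BackingFileSystem>();

        void mount(String mountPoint, BackingFileSystem fs) {
            mounts.put(mountPoint, fs);
        }

        boolean isDirectory(String absolutePath) {
            // walk up the path until a mount point matches
            for (String p = absolutePath; p != null; p = parentOf(p)) {
                BackingFileSystem fs = mounts.get(p);
                if (fs != null) {
                    return fs.isDirectory(absolutePath.substring(p.length()));
                }
            }
            return false; // nothing mounted anywhere above this name
        }

        private static String parentOf(String path) {
            if (path.equals("/")) {
                return null;
            }
            int idx = path.lastIndexOf('/');
            return idx <= 0 ? "/" : path.substring(0, idx);
        }
    }

Nothing in there can be snapshotted per VirtualFile without changing the
answer you get once the mount table changes.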
> The "ownership" semantics of static files are not
clearly defined either,
> and can depend on what is being mounted and why. I suspect you'll find
> that there are very few (if any) cases where the filesystem is being hit
> for information such as "isDirectory" or "lastModified"
information outside
> of hot deployment, which is the case where it can't be cached anyway.
"isDirectory" is being hit *constantly* by anything that uses the VFS.
In my little test with ~5000 jar entries, it is being called 15,000 times.
Sure, the "real" file system is not necessarily being hit, but there are
a GIZILLION unnecessary hash lookups, object creations, etc each and
every time a jar or directory is browsed (annotation scanning, package
names, subdeployment searches).
That's the price you pay for having a virtual filesystem with a changeable
topology. All these operations are O(1) and designed to be as fast as
possible, though no doubt there are further micro-optimizations which are
possible.
>> As I stated early, define a parallel non-cache api that invalidates the
>> cache on diffs for sensitive operations (like hot-deployment).
>
> Sure, if you like bugs and broken behavior. Correctness first, speed
> second.
VFS has had 3-4 years to get correctness right. If correctness isn't
covered by the testsuite, then any significant code change will cause
"bugs and broken behavior", especially a complete rewrite of VFS via VFS 3.
That's rather disingenuous. VFS failed to get correctness right after 3-4
years, hence it has been rewritten with correctness as the primary goal.
The test suite doesn't ensure correctness; it just ensures that condition A
causes result B. If you don't know what the semantics are supposed to be,
then you have no way of knowing that A is correct, therefore the test is
useless. This was the situation with VFS2. There was no specification of
what each method was supposed to do, thus people were not using the VFS in
a manner consistent with its design, and thus strange errors cropped up
which are very, very difficult to replicate in tests.
By setting out a simple set of functionality, with exactly defined
semantics for each method (you'll notice that VFS3 is meticulously
documented), the test suite has real meaning because you're spelling out
what conditions are supported, and then testing each one. Anything missing
from the documentation is not supported behavior, and any behavior which
does not appear in the documentation is a defect. If the specification is
wrong, you then have a basis to find all code which USES the specific
feature and analyze the impact of change on all consuming code, unlike what
we do today (make a change and cross your fingers that the test suite will
notice a problem).
If you start adding random stuff like caching which CHANGES the semantics
away from what's documented, then you're introducing a bug by definition.
The test suite may not pick up on it, because it tests to ensure that
condition A causes result B, but it can't account for side effects of
condition A unless someone thinks to put that test in there in the first place.
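To make that concrete with a made-up example: if the documentation says
"getChild() never returns null" and "a name with nothing behind it reports
exists() == false", then those exact sentences become the tests - nothing
more, nothing less:

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertNotNull;

    import org.jboss.vfs.VFS;
    import org.jboss.vfs.VirtualFile;
    import org.junit.Test;

    public class DocumentedSemanticsTestCase {

        // Suppose the spec says: getChild() always returns a VirtualFile,
        // even if nothing is mounted at or behind that path.
        @Test
        public void getChildIsNeverNull() {
            VirtualFile file = VFS.getChild("/no/such/mount/point/file.txt");
            assertNotNull(file);
        }

        // Suppose the spec says: a name with nothing mounted over it and no
        // real file behind it reports exists() == false.
        @Test
        public void unmountedNameDoesNotExist() {
            VirtualFile file = VFS.getChild("/no/such/mount/point/file.txt");
            assertFalse(file.exists());
        }
    }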
> This is a complex problem that you cannot paint with a broad
> brush. You must look at each use case separately before you can say stuff
> like "we need caching".
Just to let you know, I did work on VFS and optimizations of VFS at the
end of 2006 for a month or two. It used to do stupid shit like calculating
path names each and every time you asked for it.
If you want the feature, you pay the price. The requirements are always
open for debate, I suppose.
- DML