[
https://jira.jboss.org/jira/browse/DNA-466?page=com.atlassian.jira.plugin...
]
Randall Hauch commented on DNA-466:
-----------------------------------
I've been working on a redesign of SessionCache that does not rely upon UUIDs from the
connectors. The current status is that there are a few regressions that can be
fixed pretty easily, and I'm at the point of merging more recent commits and
resolving quite a few conflicts. That shouldn't take too long, and then I'll be
able to commit this new design.
Rather than maintain the cache keyed by the UUIDs, the new design maintains a structured
graph of the cached information kept in sync with the changes to the session, making
traversals and navigation significantly faster. Because this structure is maintained,
getting the path for any loaded node is extremely fast. In fact, all paths are cached
(and actually make use of the ChildPath implementation, where each Path object is a single
segment plus a reference to the parent Path; this is a much better use of memory). I'll
go more into the design later, but it's interesting to see some of the same stats for
walking the same graph of 42 nodes in a JCR repository, AFTER the SessionCache redesign.
42 samples: min=00:00:00.000,002; avg=00:00:00.001,046; median=00:00:00.000,002;
stddev=6679752.669664584; max=00:00:00.043,817 -> 42
42 samples: min=00:00:00.000,001; avg=00:00:00.000,003; median=00:00:00.000,002;
stddev=4310.247242620428; max=00:00:00.000,03 -> 0
42 samples: min=00:00:00.000,001; avg=00:00:00.000,002; median=00:00:00.000,002;
stddev=2795.0596195151174; max=00:00:00.000,02 -> 0
Note that almost all of the stats are significantly lower (often several orders of
magnitude smaller); the maximum on the initial warmup run is higher. Note that with the
new design, I'm pretty sure the new tests are doing more than the old tests - the new
tests force loading of every node being walked, not just having enough information to
construct the path. The two different designs were just apples and oranges. When I only
include the time required to obtain the paths (what the old test was doing), the numbers
are significantly smaller still:
42 samples: min=00:00:00.000,002; avg=00:00:00.000,006; median=00:00:00.000,002;
stddev=22080.827094338383; max=00:00:00.000,147 -> 42
42 samples: min=00:00:00.000,001; avg=00:00:00.000,002; median=00:00:00.000,002;
stddev=3262.2523276768625; max=00:00:00.000,023 -> 0
42 samples: min=00:00:00.000,001; avg=00:00:00.000,002; median=00:00:00.000,002;
stddev=1807.0158058105023; max=00:00:00.000,013 -> 0
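To illustrate why path lookups are cheap when each path is a single segment plus a
reference to its parent, here is a minimal sketch. It is not the actual ChildPath code;
the class name PathSketch and all of its methods are hypothetical and only show the idea.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of a parent-linked path; not the real ChildPath class.
final class PathSketch {
    // The root path has no parent and an empty segment.
    static final PathSketch ROOT = new PathSketch(null, "");

    private final PathSketch parent;  // null only for the root
    private final String segment;     // the last segment of this path

    private PathSketch(PathSketch parent, String segment) {
        this.parent = parent;
        this.segment = segment;
    }

    // Creating a child path allocates one small object and reuses the parent as-is.
    PathSketch child(String segment) {
        return new PathSketch(this, segment);
    }

    PathSketch parent() {
        return parent;
    }

    // Number of segments, found by following parent references up to the root.
    int size() {
        int count = 0;
        for (PathSketch p = this; p.parent != null; p = p.parent) count++;
        return count;
    }

    @Override
    public String toString() {
        // Collect segments from this path up to the root; push() prepends,
        // so iteration order ends up root-most segment first.
        Deque<String> segments = new ArrayDeque<>();
        for (PathSketch p = this; p.parent != null; p = p.parent) segments.push(p.segment);
        return "/" + String.join("/", segments);
    }
}
```

In this sketch, sibling paths such as /a/b and /a/c share the same parent object for /a,
so caching a path for every loaded node costs roughly one small object per node rather
than a full list of segments per node.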
BTW, a small change was made to the printout. In each line, the last number represents
the total number of connections made to the source to do the various reads as the test
walked the graph. In the above example, the first walking produced a separate read for
each node; subsequent walks found everything and required no loading.
The new design also has a setting (in GraphSession) that defines the minimum depth to read
for every node being loaded (theoretically reducing the total number of read operations).
Here is a run with a value of 2 (meaning a node and its children are all read as a subgraph). With a
total depth of 5 (including the root), there's not a whole lot of reduction in the
number of connections (33, down from 42). Again, these stats include fully loading each
node (rather than just accessing the path, which is near instantaneous):
42 samples: min=00:00:00.000,001; avg=00:00:00.000,439; median=00:00:00.000,002;
stddev=2800757.049523132; max=00:00:00.018,373 -> 33
42 samples: min=00:00:00.000,001; avg=00:00:00.000,003; median=00:00:00.000,002;
stddev=3339.1106396682303; max=00:00:00.000,023 -> 0
42 samples: min=00:00:00.000,001; avg=00:00:00.000,002; median=00:00:00.000,002;
stddev=3382.377298657882; max=00:00:00.000,024 -> 0
Here is the same run, but with a depth value of 6, meaning that the entire graph is read
in with one read operation. Yes, that one read took 217 milliseconds, but keep in mind that
this test uses an in-memory source, so all the connections are fast; other sources
(e.g., databases or file systems) will likely take longer.
42 samples: min=00:00:00.000,001; avg=00:00:00.005,188; median=00:00:00.000,002;
stddev=3.3205427276732728E7; max=00:00:00.217,806 -> 1
42 samples: min=00:00:00.000,001; avg=00:00:00.000,002; median=00:00:00.000,002;
stddev=3136.0857790573014; max=00:00:00.000,022 -> 0
42 samples: min=00:00:00.000,001; avg=00:00:00.000,002; median=00:00:00.000,002;
stddev=3423.027925887531; max=00:00:00.000,024 -> 0
It's not clear what depth value will be useful and most efficient. But it is a
parameter that can be optimized later (perhaps with different recommended values for
different kinds of sources). All the better, considering that there was not much
additional code required to handle depth reading values > 1.
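To make the effect of the read depth on the number of connections concrete, here is a
rough simulation. It is only a sketch under stated assumptions: the class
DepthReadSketch, its methods, and the tree shape are all hypothetical and are not the
GraphSession API or the graph used in the runs above, so the counts will not match the
42/33/1 figures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation of depth-based subgraph reads; not the GraphSession API.
final class DepthReadSketch {
    // Stand-in for the source: each node path maps to the paths of its children.
    private final Map<String, List<String>> source = new HashMap<>();
    // Paths of the nodes already held in the session-side cache.
    private final Set<String> loaded = new HashSet<>();
    private final int readDepth;
    private int connections = 0;

    DepthReadSketch(int readDepth) {
        this.readDepth = readDepth;
    }

    // Builds a complete tree with the given number of levels and children per node.
    void addTree(String path, int levels, int fanout) {
        List<String> children = new ArrayList<>();
        source.put(path, children);
        if (levels <= 1) return;
        for (int i = 0; i < fanout; i++) {
            String child = path + "/" + i;
            children.add(child);
            addTree(child, levels - 1, fanout);
        }
    }

    // One connection to the source reads the node plus (depth - 1) levels below it.
    private void readSubgraph(String path) {
        connections++;
        markLoaded(path, readDepth);
    }

    private void markLoaded(String path, int depth) {
        if (depth <= 0) return;
        loaded.add(path);
        for (String child : source.get(path)) markLoaded(child, depth - 1);
    }

    // Walks the whole tree, reading from the source only for nodes not yet cached.
    // Returns the number of nodes visited.
    int walk(String path) {
        if (!loaded.contains(path)) readSubgraph(path);
        int visited = 1;
        for (String child : source.get(path)) visited += walk(child);
        return visited;
    }

    int connections() {
        return connections;
    }
}
```

In this simulation, a repeated walk over the same instance makes no further connections,
since every node is already cached. A larger depth does not always mean fewer reads: it
depends on where the subgraph boundaries fall relative to the shape of the tree, which is
one reason the best value probably differs from source to source.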
JCR requires connectors to expose UUID as identifier property
-------------------------------------------------------------
Key: DNA-466
URL: https://jira.jboss.org/jira/browse/DNA-466
Project: DNA
Issue Type: Bug
Components: JCR
Affects Versions: 0.5
Reporter: Randall Hauch
Assignee: Randall Hauch
Priority: Blocker
Fix For: 0.6
The JCR implementation currently expects connectors to return a UUID as the identifier
property. This is obviously incorrect, as it goes against several of the connectors
we've already implemented. In particular, the SessionCache is expecting to find the
UUID, and submits requests to the source with only the UUID (not the path).