[
https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p...
]
Roberto Tyley commented on JBCACHE-1473:
----------------------------------------
The actual pathology of the capacity loss race condition is:
* CLIENT THREAD : Client thread enters removeNode() to remove a Node X (which, as it
happens, is also a candidate for eviction - not uncommon if you have a cache much smaller
than the set of data you are trying to store in it) - theRemoveNodeCommand performs the
remove on the cache, and the interceptor chain call-stack starts to rewind...
* EVICTION THREAD : EvictionTimerTask wakes up and
BaseEvictionAlgorithm.processQueues() gets called, which first processes all the Eviction
Events in the EvictionEventQueue (note that no REMOVE_ELEMENT_EVENT(X) has been received
so far)
* CLIENT THREAD : ... finally the EvictionInterceptor does it's post-processing
on the RemoveNodeCommand invocation, and creates a REMOVE_ELEMENT_EVENT(X) on the eviction
event queue. The client thread exits removeNode()
* EVICTION THREAD : The EvictionAlgorithm then calls prune(): The algorithm decides it
needs to evict Node X, so it removes the NodeEntry for X from the EvictionQueue, then
calls EvictCommand.perform() to evict Node X.
* EVICTION THREAD : EvictCommand.perform() does a lookup for the Node with X's Fqn
- the lookup returns null, because the node has removed by the Client Thread. As a result,
the command returns 'false', attempting to indicate that the EvictCommand has done
no work (this is the bug!)
* EVICTION THREAD : The EvictionInterceptor gets this boolean value on the way back up
the interceptor chain, and interprets the 'false' to mean only data has been
removed from the node, and that it still exists in the cache. As a consequence it creates
a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in
future, and adds the event to theEvictionEventQueue on the Region. The EvictionEventQueue
now looks like this: REMOVE_ELEMENT_EVENT(X), then ADD_NODE_EVENT(X). The
EvictionTimerTask goes back to sleep.
* EVICTION THREAD : EvictionTimerTask wakes up again, and
BaseEvictionAlgorithm.processQueues() starts to process the Eviction Events. First the
REMOVE_ELEMENT_EVENT is processed - it's job is to remove the NodeEntry for X from the
EvictionQueue - as it happens, the NodeEntry is not there (not a fatal problem - see
http://fisheye.jboss.org/browse/JBossCache/core/trunk/src/main/java/org/j...)
because it was removed the previous time the Eviction thread ran and decided it wanted to
evict X. Finally ADD_NODE_EVENT(X) is processed, and despite the fact that X is not in the
cache, a NodeEntry is created for it in the EvictionQueue.
* The EvictionQueue now contains a NodeEntry for X, and is of length maxNodes, despite
the fact that X is not in the Cache, which contains only maxNodes-1 nodes. Thus the
effective capacity of the cache has become maxNodes-1.
Cache Regions lose capacity - EvictCommand.perform() causes dead
entries in EvictionQueue
-----------------------------------------------------------------------------------------
Key: JBCACHE-1473
URL:
https://jira.jboss.org/jira/browse/JBCACHE-1473
Project: JBoss Cache
Issue Type: Bug
Security Level: Public(Everyone can see)
Components: Eviction
Affects Versions: 3.0.2.GA
Reporter: Roberto Tyley
Assignee: Manik Surtani
There is a serious bug in EvictCommand.perform() which leads to cache regions declining
in effective capacity over time. The syndrome is:
* Cache fails to fill - node count initially reaches around 90% of configured
maxNodes capacity
* Cache capacity declines over time - in our case down to 70% of capacity after 36
hours.
In order to manifest this issue, you need to be removing as well as adding nodes to the
cache. The bug is exposed by a race condition between the Client Code thread removing a
node and the Eviction Thread processing the EvictionEventQueue, but the race condition is
not the problem (the race condition just explains why you don't see this issue all the
time), the problem is with the semantics of the boolean return value of
EvictCommand.perform().
Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no
longer in the cache:
* the EvictCommand.perform() method returns the boolean 'false', attempting
to indicate: "the EvictCommand has done no work"
* the EvictionInterceptor interprets the return value of 'false' to mean:
"only data has been removed from the node X, and the node still exists in the
cache".
As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure
that X will be revisited by the eviction thread in future, and adds the event to
theEvictionQueue.
The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The
EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache
(the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache,
so the cache can not grow to it's full capacity. These dead NodeEntries build up over
time and can take up a substantial portion of your cache if you're frequently
performing node removals.
The fix for this issue is thankfully simple: As only EvictionInterceptor reads the
boolean value returned by EvictCommand.perform(), the semantics of the return value should
be aligned with those of the EvictionInterceptor, and should simply return whether the
node STILL exists in the cache after the evict has been performed, as that is what the
EvictionInterceptor is really interested in.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira