[JBoss JIRA] Created: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue - jbosscache-issues

[JBoss JIRA] Created: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[JBoss JIRA] Created:...

Roberto Tyley (JIRA)

Tuesday, 3 February 2009 Tue, 3 Feb '09

2:48 p.m.

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public (Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira

Show replies by date

Roberto Tyley (JIRA)

Tuesday, 3 February Tue, 3 Feb

3:02 p.m.

New subject: [JBoss JIRA] Commented: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Roberto Tyley commented on JBCACHE-1473: ---------------------------------------- The actual pathology of the capacity loss race condition is: * CLIENT THREAD : Client thread enters removeNode() to remove a Node X (which, as it happens, is also a candidate for eviction - not uncommon if you have a cache much smaller than the set of data you are trying to store in it) - theRemoveNodeCommand performs the remove on the cache, and the interceptor chain call-stack starts to rewind... * EVICTION THREAD : EvictionTimerTask wakes up and BaseEvictionAlgorithm.processQueues() gets called, which first processes all the Eviction Events in the EvictionEventQueue (note that no REMOVE_ELEMENT_EVENT(X) has been received so far) * CLIENT THREAD : ... finally the EvictionInterceptor does it's post-processing on the RemoveNodeCommand invocation, and creates a REMOVE_ELEMENT_EVENT(X) on the eviction event queue. The client thread exits removeNode() * EVICTION THREAD : The EvictionAlgorithm then calls prune(): The algorithm decides it needs to evict Node X, so it removes the NodeEntry for X from the EvictionQueue, then calls EvictCommand.perform() to evict Node X. * EVICTION THREAD : EvictCommand.perform() does a lookup for the Node with X's Fqn - the lookup returns null, because the node has removed by the Client Thread. As a result, the command returns 'false', attempting to indicate that the EvictCommand has done no work (this is the bug!) * EVICTION THREAD : The EvictionInterceptor gets this boolean value on the way back up the interceptor chain, and interprets the 'false' to mean only data has been removed from the node, and that it still exists in the cache. As a consequence it creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionEventQueue on the Region. The EvictionEventQueue now looks like this: REMOVE_ELEMENT_EVENT(X), then ADD_NODE_EVENT(X). The EvictionTimerTask goes back to sleep. * EVICTION THREAD : EvictionTimerTask wakes up again, and BaseEvictionAlgorithm.processQueues() starts to process the Eviction Events. First the REMOVE_ELEMENT_EVENT is processed - it's job is to remove the NodeEntry for X from the EvictionQueue - as it happens, the NodeEntry is not there (not a fatal problem - see http://fisheye.jboss.org/browse/JBossCache/core/trunk/src/main/java/org/j...) because it was removed the previous time the Eviction thread ran and decided it wanted to evict X. Finally ADD_NODE_EVENT(X) is processed, and despite the fact that X is not in the cache, a NodeEntry is created for it in the EvictionQueue. * The EvictionQueue now contains a NodeEntry for X, and is of length maxNodes, despite the fact that X is not in the Cache, which contains only maxNodes-1 nodes. Thus the effective capacity of the cache has become maxNodes-1.

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira

Manik Surtani (JIRA)

Wednesday, 4 February Wed, 4 Feb

5:25 a.m.

New subject: [JBoss JIRA] Commented: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Manik Surtani commented on JBCACHE-1473: ---------------------------------------- Thanks for this excellent report and analysis. Do you have a unit test for this that you could contribute as well?

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Fix For: 3.1.0.GA There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Manik Surtani (JIRA)

5:25 a.m.

New subject: [JBoss JIRA] Updated: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Manik Surtani updated JBCACHE-1473: ----------------------------------- Fix Version/s: 3.1.0.GA Priority: Critical (was: Major)

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Roberto Tyley (JIRA)

5:56 a.m.

New subject: [JBoss JIRA] Commented: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Roberto Tyley commented on JBCACHE-1473: ---------------------------------------- Thanks Manik, we really appreciate your quick response. We do have a standalone test (currently about 150 lines long) which we're just tidying up - we can give you this as a zipped up pom if you like? best regards, Roberto Tyley Software Developer @ The Guardian - guardian.co.uk

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Manik Surtani (JIRA)

6:17 a.m.

New subject: [JBoss JIRA] Commented: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Manik Surtani commented on JBCACHE-1473: ---------------------------------------- If you could write this as a TestNG unit test that I could drop in to the JBC trunk test suite, that would be ideal. Have a look at some of the other unit tests in trunk to get an idea of how they are structured.

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Roberto Tyley (JIRA)

8:15 a.m.

New subject: [JBoss JIRA] Updated: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Roberto Tyley updated JBCACHE-1473: ----------------------------------- Attachment: JBCACHE-1473.TEST.patch JBCACHE-1473.FIX.patch Unit Test : checking that the correct value is returned from EvictCommand.perform() when evicting a node that no longer exists in the cache Fix : update EvictCommand so that it returns the correct value, and use a clearer variable name in EvictionInterceptor

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA Attachments: JBCACHE-1473.FIX.patch, JBCACHE-1473.TEST.patch There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Roberto Tyley (JIRA)

8:25 a.m.

New subject: [JBoss JIRA] Updated: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Roberto Tyley updated JBCACHE-1473: ----------------------------------- Attachment: JBCACHE-1473.INTEGRATION-TEST.zip Also attaching an integration test that adds and removes nodes for 2 seconds into a cache configured with a 1 second Eviction Thread wakeup interval, then monitors to see if the Eviction Queue and the Cache size stay in sync. You should be able to run it in Maven by just typing: mvn package The output looks like this: Adding and removing nodes from Cache for 2 SECONDS Target cache size: 5 nodes Waiting for cache to evict excess nodes. numberOfNodesInCache=1690 Cache is at target size of 5 nodes node count : 5 (cache) equals 5 (eviction queue) parent children : [n#9749, n#9750, n#9747, n#9748] maxAgeQueue : mother n#9747 n#9748 n#9749 n#9750 ttlQueue : n#9747 n#9748 n#9749 n#9750 mother node count : 1 (cache) DIFFERS 5 (eviction queue) parent children : [] maxAgeQueue : mother n#6609 n#6614 n#6615 n#6616 ttlQueue : n#6609 n#6614 n#6615 n#6616 mother *** THIS IS A FAIL ***

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA Attachments: JBCACHE-1473.FIX.patch, JBCACHE-1473.INTEGRATION-TEST.zip, JBCACHE-1473.TEST.patch There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Manik Surtani (JIRA)

10:30 a.m.

New subject: [JBoss JIRA] Resolved: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Manik Surtani resolved JBCACHE-1473. ------------------------------------ Resolution: Done Thanks for the patch and test; it all looks good and I have applied it. It will be in 3.1.0.BETA due out soon, and I have just published 3.1.0-SNAPSHOT with this fix.

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA Attachments: JBCACHE-1473.FIX.patch, JBCACHE-1473.INTEGRATION-TEST.zip, JBCACHE-1473.TEST.patch There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Roberto Tyley (JIRA)

Thursday, 5 February Thu, 5 Feb

4:12 a.m.

New subject: [JBoss JIRA] Commented: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Roberto Tyley commented on JBCACHE-1473: ---------------------------------------- Cheers Manik - we soak tested the fix last night (on a patched build of JBoss Cache 2.2.1, which is the version we're currently developing against) and the problem looks resolved - the cache stays splendidly full. We also tested the snapshot of 3.10 in our integration test, and can confirm that it passes with flying colours - obviously, we'll be upgrading to the new release as soon as it comes out! Thanks once again for your quick response to this issue, it's much appreciated. best regards, Roberto Tyley Software Developer @ The Guardian - guardian.co.uk

...

Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA Attachments: JBCACHE-1473.FIX.patch, JBCACHE-1473.INTEGRATION-TEST.zip, JBCACHE-1473.TEST.patch There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Manik Surtani (JIRA)

5 a.m.

New subject: [JBoss JIRA] Updated: (JBCACHE-1473) Memory leak: Cache Regions lose capacity, evictCommand.perform() causes dead entries in EvictionQueue

[ https://jira.jboss.org/jira/browse/JBCACHE-1473?page=com.atlassian.jira.p... ] Manik Surtani updated JBCACHE-1473: ----------------------------------- Summary: Memory leak: Cache Regions lose capacity, evictCommand.perform() causes dead entries in EvictionQueue (was: Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue)

...

Memory leak: Cache Regions lose capacity, evictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.1.0.GA Attachments: JBCACHE-1473.FIX.patch, JBCACHE-1473.INTEGRATION-TEST.zip, JBCACHE-1473.TEST.patch There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

Manik Surtani (JIRA)

Friday, 13 February Fri, 13 Feb

8:48 a.m.

New subject: [JBoss JIRA] Updated: (JBCACHE-1473) Memory leak: Cache Regions lose capacity, evictCommand.perform() causes dead entries in EvictionQueue

...

Memory leak: Cache Regions lose capacity, evictCommand.perform() causes dead entries in EvictionQueue ----------------------------------------------------------------------------------------------------- Key: JBCACHE-1473 URL: https://jira.jboss.org/jira/browse/JBCACHE-1473 Project: JBoss Cache Issue Type: Bug Security Level: Public(Everyone can see) Components: Eviction Affects Versions: 3.0.2.GA Reporter: Roberto Tyley Assignee: Manik Surtani Priority: Critical Fix For: 3.0.3.CR1, 3.0.3.GA Attachments: JBCACHE-1473.FIX.patch, JBCACHE-1473.INTEGRATION-TEST.zip, JBCACHE-1473.TEST.patch There is a serious bug in EvictCommand.perform() which leads to cache regions declining in effective capacity over time. The syndrome is: * Cache fails to fill - node count initially reaches around 90% of configured maxNodes capacity * Cache capacity declines over time - in our case down to 70% of capacity after 36 hours. In order to manifest this issue, you need to be removing as well as adding nodes to the cache. The bug is exposed by a race condition between the Client Code thread removing a node and the Eviction Thread processing the EvictionEventQueue, but the race condition is not the problem (the race condition just explains why you don't see this issue all the time), the problem is with the semantics of the boolean return value of EvictCommand.perform(). Failure Case: EvictCommand.perform() tries to evict a node X from the cache, when X is no longer in the cache: * the EvictCommand.perform() method returns the boolean 'false', attempting to indicate: "the EvictCommand has done no work" * the EvictionInterceptor interprets the return value of 'false' to mean: "only data has been removed from the node X, and the node still exists in the cache". As a consequence the EvictionInterceptor creates a new ADD_NODE_EVENT for X, to ensure that X will be revisited by the eviction thread in future, and adds the event to theEvictionQueue. The EvictionQueue now contains a NodeEntry for a Node which is not in the cache. The EvictionAlgorithm will evict nodes based on how many nodes it *thinks* are in the cache (the length of its EvictionQueue), rather than how many nodes are /actually/ in the cache, so the cache can not grow to it's full capacity. These dead NodeEntries build up over time and can take up a substantial portion of your cache if you're frequently performing node removals. The fix for this issue is thankfully simple: As only EvictionInterceptor reads the boolean value returned by EvictCommand.perform(), the semantics of the return value should be aligned with those of the EvictionInterceptor, and should simply return whether the node STILL exists in the cache after the evict has been performed, as that is what the EvictionInterceptor is really interested in.

5815

days inactive

5825

days old

jbosscache-issues@lists.jboss.org

Manage subscription

11 comments

2 participants

tags (0)

participants (2)

Manik Surtani (JIRA)
Roberto Tyley (JIRA)

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

[JBoss JIRA] Created: (JBCACHE-1473) Cache Regions lose capacity - EvictCommand.perform() causes dead entries in EvictionQueue