More fun with buddy replication. :-)
Saw an error in one of our failover tests where:
1) Node D had left the group, so lots of gravitation was going on.
2) Various nodes were sending DataGravitationCleanupCommands to the
cluster for /BUDDY_BACKUP/D_DEAD/1/JSESSION/st_localhost/xxx. The result
is that all nodes in the cluster try to remove various
/BUDDY_BACKUP/D_DEAD/1/JSESSION/st_localhost/xxx nodes. On node A those
nodes don't exist, so PessimisticLockInterceptor.handleRemoveCommand was
creating them just so they could be removed.
3) Concurrently with #2, a GravitateDataCommand for
/JSESSION/st_localhost/123 came in to node A. Session 123 was never
stored on node A, so this should have resulted in a cache miss. But what
happened once was:
[JBoss] 16:46:52,961 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-13,10.34.32.153:14736) Problems invoking command.
[JBoss] org.jboss.cache.NodeNotValidException: Node /_BUDDY_BACKUP_/10.34.32.156_48822:DEAD/1/JSESSION/st_localhost/UvzutZkoESBMRSnjv0eTRA__ is not valid. Perhaps it has been moved or removed.
[JBoss]    at org.jboss.cache.invocation.NodeInvocationDelegate.assertValid(NodeInvocationDelegate.java:527)
[JBoss]    at org.jboss.cache.invocation.NodeInvocationDelegate.getChildrenNames(NodeInvocationDelegate.java:292)
[JBoss]    at org.jboss.cache.commands.read.GravitateDataCommand.perform(GravitateDataCommand.java:176)
...
It seems the command is seeing a non-existent node. Yep; looking at the
logs it's clear the above GravitateDataCommand was executed concurrently
with another DataGravitationCleanupCommand for the same session. (I need
to investigate why that happened.)
Below is a possible patch to work around the issue. This points to a
more general locking problem though -- should these "phantom nodes"
created for removal be visible to other threads at all? Shouldn't there
be a write lock (WL) on them from the moment they are created until
after they are removed?
Hehe, answered my own question by writing it: the node is created by
PessimisticNodeBasedLockManager and only then locked, so there's a gap
in between where another thread can get a ref to it.
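To make the gap concrete, here's a minimal, self-contained sketch of the
pattern -- this is NOT the JBoss Cache code; all the names in it
(PhantomNodeGap, removePhantom, gravitate) are made up for illustration:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified model of the create-then-lock gap described above.
public class PhantomNodeGap
{
   static class Node
   {
      final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
      volatile boolean valid = true;
   }

   static final ConcurrentMap<String, Node> tree =
         new ConcurrentHashMap<String, Node>();

   // Cleanup thread: the node doesn't exist, so it is created just so
   // it can be removed -- analogous to handleRemoveCommand in #2 above.
   static void removePhantom(String fqn)
   {
      Node phantom = new Node();
      tree.put(fqn, phantom);
      // <-- GAP: the node is published but not yet write-locked;
      // another thread can grab a reference to it right here.
      phantom.lock.writeLock().lock();
      try
      {
         phantom.valid = false;
         tree.remove(fqn);
      }
      finally
      {
         phantom.lock.writeLock().unlock();
      }
   }

   // Gravitation thread: finds the phantom in the gap, blocks on the
   // read lock until the removal completes, and then sees an invalid
   // node -- the same shape as the NodeNotValidException in the log.
   static void gravitate(String fqn)
   {
      Node found = tree.get(fqn);
      if (found == null)
         return; // clean cache miss, the expected outcome
      found.lock.readLock().lock();
      try
      {
         if (!found.valid)
            throw new IllegalStateException("Node " + fqn
                  + " is not valid. Perhaps it has been moved or removed.");
      }
      finally
      {
         found.lock.readLock().unlock();
      }
   }
}

The fix the question above implies would be to publish the node only
after the WL is already held (or to create it in a locked state).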
Anyway, the patch:
### Eclipse Workspace Patch 1.0
#P jbosscache-core
Index: src/main/java/org/jboss/cache/commands/read/GravitateDataCommand.java
===================================================================
--- src/main/java/org/jboss/cache/commands/read/GravitateDataCommand.java	(revision 8163)
+++ src/main/java/org/jboss/cache/commands/read/GravitateDataCommand.java	(working copy)
@@ -29,6 +29,7 @@
 import org.jboss.cache.InternalNode;
 import org.jboss.cache.InvocationContext;
 import org.jboss.cache.Node;
+import org.jboss.cache.NodeNotValidException;
 import org.jboss.cache.NodeSPI;
 import org.jboss.cache.buddyreplication.BuddyFqnTransformer;
 import org.jboss.cache.buddyreplication.BuddyManager;
@@ -171,9 +172,18 @@
       else
       {
          // make sure we LOAD data for this node!!
-         actualNode.getData();
-         // and children!
-         actualNode.getChildrenNames();
+         try
+         {
+            actualNode.getData();
+            // and children!
+            actualNode.getChildrenNames();
+         }
+         catch (NodeNotValidException e)
+         {
+            if (trace)
+               log.trace("Found node " + actualNode.getFqn() + " but it is not valid. Returning 'no data found'", e);
+            return GravitateResult.noDataFound();
+         }
       }
 
       if (backupNodeFqn == null && searchSubtrees)
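FWIW, the design choice in the catch block: gravitation only ever sees
one of these phantoms because of the cleanup race, so treating the
NodeNotValidException as a plain cache miss (GravitateResult.noDataFound())
gives the caller the same answer it would have gotten had the
GravitateDataCommand arrived a moment later, after the cleanup finished.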
--
Brian Stansberry
Lead, AS Clustering
JBoss by Red Hat