[jbosscache-dev] GravitateDataCommand finds invalid "phantom" nodes

Brian Stansberry brian.stansberry at redhat.com
Thu Aug 6 13:03:07 EDT 2009


More fun with buddy replication. :-)

Saw an error in one of our failover tests, where:

1) Node D had left the group, so lots of gravitation was going on.
2) Various nodes were sending DataGravitationCleanupCommands to the 
cluster for /BUDDY_BACKUP/D_DEAD/1/JSESSION/st_localhost/xxx. The result 
is that all nodes in the cluster are trying to remove various 
/BUDDY_BACKUP/D_DEAD/1/JSESSION/st_localhost/xxx nodes. On node A those 
nodes don't exist, so PessimisticLockInterceptor.handleRemoveCommand is 
adding them and then removing them.
3) Concurrent with #2, a GravitateDataCommand for 
/JSESSION/st_localhost/123 came in to node A. Session 123 was never 
stored on node A, so this should have resulted in a cache miss. But 
what happened at least once was:

[JBoss] 16:46:52,961 TRACE [org.jboss.cache.marshall.CommandAwareRpcDispatcher] (Incoming-13,10.34.32.153:14736) Problems invoking command.
[JBoss] org.jboss.cache.NodeNotValidException: Node /_BUDDY_BACKUP_/10.34.32.156_48822:DEAD/1/JSESSION/st_localhost/UvzutZkoESBMRSnjv0eTRA__ is not valid.  Perhaps it has been moved or removed.
[JBoss] 	at org.jboss.cache.invocation.NodeInvocationDelegate.assertValid(NodeInvocationDelegate.java:527)
[JBoss] 	at org.jboss.cache.invocation.NodeInvocationDelegate.getChildrenNames(NodeInvocationDelegate.java:292)
[JBoss] 	at org.jboss.cache.commands.read.GravitateDataCommand.perform(GravitateDataCommand.java:176)
...

It seems the command is seeing a non-existent node. Yep; looking at the 
logs it's clear the above GravitateDataCommand was executed concurrently 
with a DataGravitationCleanupCommand for the same session. (I need 
to investigate why that happened.)

Below is a possible patch to work around the issue.  This points to a 
more general locking problem though -- should these "phantom nodes" 
created for removal be visible to other threads? Shouldn't there be a WL 
on them from the moment they are created until after they are removed?

Hehe, answered my own question by writing it. The node is created by 
PessimisticNodeBasedLockManager and then locked. There's a gap in 
between where another thread could get a ref to it.
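
That gap can be shown with a minimal, self-contained model. To be clear, 
this is a sketch, not the actual PessimisticNodeBasedLockManager code: 
the class PhantomNodeRace, the createForRemoval method, and the map-based 
"tree" are all hypothetical stand-ins. The point is only the ordering: 
the phantom node becomes visible to other threads before its write lock 
is taken.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PhantomNodeRace
{
   // hypothetical stand-in for the cache's node tree: Fqn -> per-node lock
   static final Map<String, ReentrantReadWriteLock> tree = new ConcurrentHashMap<>();

   // remover thread, step 1: the phantom node is created and published
   static ReentrantReadWriteLock createForRemoval(String fqn)
   {
      ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
      tree.put(fqn, lock); // node is now visible to every other thread
      // <-- the gap: no write lock is held yet
      return lock;
   }

   public static void main(String[] args)
   {
      String fqn = "/_BUDDY_BACKUP_/D:DEAD/1/JSESSION/st_localhost/123";
      ReentrantReadWriteLock lock = createForRemoval(fqn);

      // a concurrent GravitateDataCommand running in the gap sees the phantom
      boolean phantomVisible = tree.containsKey(fqn);

      // remover thread, step 2: only now is the write lock acquired
      lock.writeLock().lock();

      System.out.println("phantom visible before write lock: " + phantomVisible);
   }
}
```

The real fix would be to close that window, e.g. by creating the node 
already write-locked or by making create-and-lock atomic; the patch below 
only works around the symptom on the reader side.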

Anyway, the patch:

### Eclipse Workspace Patch 1.0
#P jbosscache-core
Index: src/main/java/org/jboss/cache/commands/read/GravitateDataCommand.java
===================================================================
--- src/main/java/org/jboss/cache/commands/read/GravitateDataCommand.java	(revision 8163)
+++ src/main/java/org/jboss/cache/commands/read/GravitateDataCommand.java	(working copy)
@@ -29,6 +29,7 @@
 import org.jboss.cache.InternalNode;
 import org.jboss.cache.InvocationContext;
 import org.jboss.cache.Node;
+import org.jboss.cache.NodeNotValidException;
 import org.jboss.cache.NodeSPI;
 import org.jboss.cache.buddyreplication.BuddyFqnTransformer;
 import org.jboss.cache.buddyreplication.BuddyManager;
@@ -171,9 +172,18 @@
          else
          {
             // make sure we LOAD data for this node!!
-            actualNode.getData();
-            // and children!
-            actualNode.getChildrenNames();
+            try
+            {
+               actualNode.getData();
+               // and children!
+               actualNode.getChildrenNames();
+            }
+            catch (NodeNotValidException e)
+            {
+               if (trace)
+                  log.trace("Found node " + actualNode.getFqn() + " but it is not valid. Returning 'no data found'", e);
+               return GravitateResult.noDataFound();
+            }
          }
 
          if (backupNodeFqn == null && searchSubtrees)


-- 
Brian Stansberry
Lead, AS Clustering
JBoss by Red Hat

