[jbosscache-dev] Cache unable to write to cluster

Tue Nov 11 17:33:08 EST 2008

We just found an intermittent failure in the EJB3 testsuite[1] that looks 
more like a JBC or JGroups issue. This is with JBC 3.0.0.CR4 and JG 2.6.6. 
I suspect it's related to the FLUSH work Vladimir's been doing[2][3].

Issue is an inability to replicate a put:

Caused by: org.jboss.cache.lock.TimeoutException: State retrieval timed out waiting for flush unblock.
at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:455)
at ....
org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:560)
at org.jboss.ha.cachemanager.CacheManagerManagedCache.put(CacheManagerManagedCache.java:285)
at org.jboss.ejb3.cache.tree.StatefulTreeCache.putInCache(StatefulTreeCache.java:511)
at org.jboss.ejb3.cache.tree.StatefulTreeCache.create(StatefulTreeCache.java:123)
... 70 more

Looking at RPCManagerImpl.java:455 we have:

if (channel.flushSupported() &&
    !flushBlockGate.await(configuration.getStateRetrievalTimeout(), TimeUnit.MILLISECONDS))
{
    throw new TimeoutException("State retrieval timed out waiting for flush unblock.");
}

In short, the call is timing out in flushBlockGate.await().  Looking at 
how flushBlockGate is used, the gate is closed in block() and opened in 
unblock(). *Assuming* there's no bug in ReclosableLatch, it seems block() 
is getting called here with no subsequent call to unblock(). 
(Unfortunately, the logs related to this failure are gone, so I can't 
prove that.)
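
To make the suspected sequence concrete, here is a minimal sketch of a 
gate that is closed by block() and opened by unblock(). It is an 
illustrative stand-in, not the actual ReclosableLatch source; the class 
name FlushGateSketch and the put() helper are hypothetical, and the 
50 ms timeout is arbitrary.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative reclosable gate; NOT the JBoss Cache ReclosableLatch implementation. */
public class FlushGateSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition opened = lock.newCondition();
    private boolean isOpen = true; // gate starts open: no flush in progress

    /** Plays the role of block(): close the gate so writers must wait. */
    public void close() {
        lock.lock();
        try {
            isOpen = false;
        } finally {
            lock.unlock();
        }
    }

    /** Plays the role of unblock(): open the gate and wake all waiting writers. */
    public void open() {
        lock.lock();
        try {
            isOpen = true;
            opened.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Returns true if the gate is open, or becomes open, within the timeout. */
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        long nanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (!isOpen) {
                if (nanos <= 0L) {
                    return false; // timed out with the gate still closed
                }
                nanos = opened.awaitNanos(nanos);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Hypothetical replicated put: waits on the gate first, as in the snippet above. */
    public String put(long timeoutMillis) {
        try {
            if (!await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return "timed out waiting for flush unblock";
            }
            return "replicated";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        FlushGateSketch gate = new FlushGateSketch();
        System.out.println(gate.put(50)); // gate open: replicated
        gate.close();                     // block() arrives
        System.out.println(gate.put(50)); // no unblock() follows: timed out
        gate.open();                      // unblock() finally arrives
        System.out.println(gate.put(50)); // replicated again
    }
}
```

With the gate closed and never reopened, every put fails after the 
timeout, which matches the exception in the trace.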

Questions:

1) Vladimir, could the JGRP-855 issue result in block() getting called 
with no subsequent call to unblock(), either on the flush coordinator or 
on one of the other nodes?  If yes, your JGRP-855 fix will probably fix 
this as well.

2) Looking at RPCManagerImpl.start(), it does a connect+state transfer 
in a try/catch where any failure should result in a CacheException being 
thrown from start().  That CacheException should have prevented 
deployment of the ejb; i.e. the call shown in the stack trace above 
shouldn't have happened. The only way I can see it happening is if the 
node that threw the above exception wasn't the flush coordinator; i.e. 
its cache started fine, but a problem on another node led to its block() 
being called with no matching unblock().  That's a big issue too, as it 
means a failure on one node can take down the entire cluster by leaving 
everyone's flushBlockGate closed.
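
The scenario in question 2 can be modelled with a toy simulation. This 
is only a sketch under stated assumptions: it does not use JGroups or 
model the real FLUSH protocol, and the names StuckFlushSketch, Node, 
flush() and writableCount() are all hypothetical. It assumes the failure 
mode is simply that the unblock round is never delivered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of a flush that blocks every node and may never unblock them. */
public class StuckFlushSketch {

    /** Hypothetical cluster member that owns its own flush gate. */
    public static final class Node {
        final String name;
        // A latch with count 0 is already open; a fresh latch with count 1 is closed.
        private volatile CountDownLatch gate = new CountDownLatch(0);

        public Node(String name) {
            this.name = name;
        }

        void block() {
            gate = new CountDownLatch(1); // close the gate
        }

        void unblock() {
            gate.countDown(); // open the gate
        }

        /** True if a write could proceed within the timeout. */
        boolean canWrite(long timeoutMillis) {
            try {
                return gate.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /**
     * Simulated flush: block every node, then unblock every node.
     * If failMidFlush is true, the unblock round never happens (assumed failure mode).
     */
    public static void flush(List<Node> cluster, boolean failMidFlush) {
        for (Node n : cluster) {
            n.block();
        }
        if (failMidFlush) {
            return; // something went wrong on one node; nobody gets unblock()
        }
        for (Node n : cluster) {
            n.unblock();
        }
    }

    /** Counts the nodes whose gate is open within the timeout. */
    public static long writableCount(List<Node> cluster, long timeoutMillis) {
        long count = 0;
        for (Node n : cluster) {
            if (n.canWrite(timeoutMillis)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Node> cluster = new ArrayList<>();
        for (String name : new String[] {"A", "B", "C"}) {
            cluster.add(new Node(name));
        }
        flush(cluster, false);
        System.out.println("after clean flush: " + writableCount(cluster, 20) + " writable");
        flush(cluster, true);
        System.out.println("after failed flush: " + writableCount(cluster, 20) + " writable");
    }
}
```

In this model a single incomplete flush leaves every node unable to 
write, including nodes whose own caches started without error.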

[1] https://jira.jboss.org/jira/browse/EJBTHREE-1580
[2] https://jira.jboss.org/jira/browse/JGRP-855
[3] http://www.jboss.com/index.html?module=bb&op=viewtopic&t=145138

-- 
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry at redhat.com