We just found an intermittent failure in the EJB3 testsuite[1] that looks
more like a JBC or JGroups issue. This is with JBC 3.0.0.CR4 and JG 2.6.6.
I'm speculating it relates to the FLUSH work Vladimir's been doing[2][3].
The issue is an inability to replicate a put:
Caused by: org.jboss.cache.lock.TimeoutException: State retrieval timed out waiting for flush unblock.
    at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:455)
    at ....
    at org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:560)
    at org.jboss.ha.cachemanager.CacheManagerManagedCache.put(CacheManagerManagedCache.java:285)
    at org.jboss.ejb3.cache.tree.StatefulTreeCache.putInCache(StatefulTreeCache.java:511)
    at org.jboss.ejb3.cache.tree.StatefulTreeCache.create(StatefulTreeCache.java:123)
    ... 70 more
Looking at RPCManagerImpl.java:455 we have:
    if (channel.flushSupported() &&
        !flushBlockGate.await(configuration.getStateRetrievalTimeout(), TimeUnit.MILLISECONDS))
    {
       throw new TimeoutException("State retrieval timed out waiting for flush unblock.");
    }
Basically, we're failing on flushBlockGate.await(). Looking at the use of
flushBlockGate, the gate is closed in block() and opened in unblock().
*Assuming* no bug in ReclosableLatch, it seems like block() is getting
called here with no subsequent call to unblock(). (Unfortunately, the logs
related to this failure are gone, so I can't prove that.)
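
To make the suspected sequence concrete, here is a minimal sketch of a
reclosable gate. It is not JBC's ReclosableLatch or RPCManagerImpl: the
FlushGateSketch and Gate names, the close()/open() methods and the 50 ms
timeout are all made up for illustration. It only shows that once the gate
has been closed (as block() does) and never reopened (as unblock() would),
every timed wait on it gives up. That is the condition under which
callRemoteMethods() throws the TimeoutException above.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative only: a hypothetical gate, not JBC's ReclosableLatch. */
class FlushGateSketch {

    static final class Gate {
        // Starts open, like a cache with no flush in progress.
        private boolean open = true;

        /** Stand-in for what block() does: close the gate. */
        synchronized void close() {
            open = false;
        }

        /** Stand-in for what unblock() does: open the gate and wake waiters. */
        synchronized void open() {
            open = true;
            notifyAll();
        }

        /**
         * Waits up to timeoutMillis for the gate to be open.
         * Returns false on timeout, or if interrupted (the interrupt
         * flag is restored in that case).
         */
        synchronized boolean await(long timeoutMillis) {
            long deadline = System.nanoTime()
                    + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            try {
                while (!open) {
                    long remaining = deadline - System.nanoTime();
                    if (remaining <= 0) {
                        return false; // nobody reopened the gate in time
                    }
                    TimeUnit.NANOSECONDS.timedWait(this, remaining);
                }
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Gate gate = new Gate();
        System.out.println("before block():  " + gate.await(50));
        gate.close(); // block() arrives
        System.out.println("after block():   " + gate.await(50));
        gate.open();  // unblock() arrives
        System.out.println("after unblock(): " + gate.await(50));
    }
}
```

The middle call is the interesting one: with no matching open(), it can only
return false after the timeout, however many times it is retried.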
Questions:
1) Vladimir, could the JGRP-855 issue result in block() getting called
with no subsequent call to unblock(), either on the flush coordinator or
on one of the other nodes? If yes, your JGRP-855 fix will probably fix
this as well.
2) Looking at RPCManagerImpl.start(), it does a connect+state transfer
in a try/catch where any failure should result in a CacheException being
thrown from start(). That CacheException should have prevented
deployment of the ejb; i.e. the call shown in the stack trace above
shouldn't have happened. The only way I see it could have happened is if the
node that threw the above exception wasn't the flush coordinator; i.e. its
cache started fine, but a problem on another node led to its block()
being called with no matching unblock(). That's a big issue too, as it
means a failure in one node can take down the entire cluster by leaving
everyone's flushBlockGate closed.
[1] https://jira.jboss.org/jira/browse/EJBTHREE-1580
[2] https://jira.jboss.org/jira/browse/JGRP-855
[3] http://www.jboss.com/index.html?module=bb&op=viewtopic&t=145138
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com