On 11 Nov 2008, at 22:33, Brian Stansberry wrote:
We just found an intermittent failure in the EJB3 testsuite[1]
that's more a JBC or JGroups issue. This is with JBC 3.0.0.CR4 and
JG 2.6.6. I'm speculating it relates to FLUSH work Vladimir's been
doing[2][3].
Issue is an inability to replicate a put:
Caused by: org.jboss.cache.lock.TimeoutException: State retrieval timed out waiting for flush unblock.
    at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:455)
    at ....
    org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:560)
    at org.jboss.ha.cachemanager.CacheManagerManagedCache.put(CacheManagerManagedCache.java:285)
    at org.jboss.ejb3.cache.tree.StatefulTreeCache.putInCache(StatefulTreeCache.java:511)
    at org.jboss.ejb3.cache.tree.StatefulTreeCache.create(StatefulTreeCache.java:123)
    ... 70 more
Looking at RPCManagerImpl.java:455 we have:
    if (channel.flushSupported() && !flushBlockGate.await(configuration.getStateRetrievalTimeout(), TimeUnit.MILLISECONDS))
    {
       throw new TimeoutException("State retrieval timed out waiting for flush unblock.");
    }
Basically, failing on flushBlockGate.await(). Looking at use of
flushBlockGate, the gate is closed in block() and opened in
unblock(). *Assuming* no bug in ReclosableLatch, seems like block()
is getting called here with no subsequent call to unblock().
(Unfortunately, logs related to this failure are gone, so I can't
prove that.)
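To make the suspected sequence concrete, here is a minimal sketch of a reclosable gate and of what a caller sees when the gate is closed and never reopened. It is not the JBC ReclosableLatch source; the class names GateSketch and ReclosableGate and the helper tryPass are made up for illustration, and the only assumption is that the real latch behaves like a gate with a timed await().

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

public class GateSketch {

    /** Hypothetical stand-in for a reclosable latch; NOT the JBC class. */
    static final class ReclosableGate extends AbstractQueuedSynchronizer {
        private static final int CLOSED = 0;
        private static final int OPEN = 1;

        ReclosableGate(boolean open) {
            setState(open ? OPEN : CLOSED);
        }

        @Override
        protected int tryAcquireShared(int ignored) {
            // Waiters get through only while the gate is open.
            return getState() == OPEN ? 1 : -1;
        }

        @Override
        protected boolean tryReleaseShared(int newState) {
            // Record the new state and let parked waiters re-check it.
            setState(newState);
            return true;
        }

        void open()  { releaseShared(OPEN); }    // what unblock() would do
        void close() { releaseShared(CLOSED); }  // what block() would do

        /** Returns false if the gate is still closed when the timeout expires. */
        boolean await(long time, TimeUnit unit) throws InterruptedException {
            return tryAcquireSharedNanos(1, unit.toNanos(time));
        }
    }

    /** Helper (hypothetical): timed await that reports an interrupt as "did not pass". */
    static boolean tryPass(ReclosableGate gate, long millis) {
        try {
            return gate.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ReclosableGate gate = new ReclosableGate(true);
        System.out.println("gate open, caller passes: " + tryPass(gate, 50));

        gate.close(); // block() arrives ...
        // ... and unblock() never does, so every caller times out here.
        System.out.println("gate closed, caller passes: " + tryPass(gate, 50));

        gate.open();  // a matching unblock() would have let callers through
        System.out.println("gate reopened, caller passes: " + tryPass(gate, 50));
    }
}
```

In this sketch the middle call is the one that corresponds to the TimeoutException in the trace above: the await() returns false only because nothing reopened the gate.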
Questions:
1) Vladimir, could the JGRP-855 issue result in block() getting
called with no subsequent call to unblock(), either on the flush
coordinator or on one of the other nodes? If yes, your JGRP-855 fix
will probably fix this as well.
2) Looking at RPCManagerImpl.start(), it does a connect+state
transfer in a try/catch where any failure should result in a
CacheException being thrown from start(). That CacheException
should have prevented deployment of the ejb; i.e. the call shown in
the stack trace above shouldn't have happened. Only way I see it
could have happened is if the node that threw above exception wasn't
the flush coordinator; i.e. its cache started fine, but a problem on
another node led to its block() being called with no matching
unblock(). That's a big issue too, as it means a failure in one
node can take down the entire cluster by leaving everyone's
flushBlockGate closed.
Yes, this was always an issue with the way we used FLUSH - that
someone in the group could initiate a FLUSH and then die leaving other
members' flushBlockGates closed. TBH, apart from adding timeouts to
the flushBlockGate, I can't see how we would get around this.
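One possible reading of "adding timeouts to the flushBlockGate" is a gate that refuses to stay closed for longer than some maximum period. The sketch below is only an illustration of that idea under stated assumptions: LeasedGateSketch, LeasedGate, maxClosedMillis and tryPass are hypothetical names, none of this is existing JBC or JGroups code, and whether it is safe to let traffic through while a FLUSH may still be in progress is exactly the open question.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class LeasedGateSketch {

    /** Hypothetical gate that reopens itself once it has been closed too long. */
    static final class LeasedGate {
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition opened = lock.newCondition();
        private final long maxClosedNanos;
        private boolean closed;
        private long closedAtNanos;

        LeasedGate(long maxClosedMillis) {
            this.maxClosedNanos = TimeUnit.MILLISECONDS.toNanos(maxClosedMillis);
        }

        void close() { // would be called from block()
            lock.lock();
            try {
                closed = true;
                closedAtNanos = System.nanoTime();
            } finally {
                lock.unlock();
            }
        }

        void open() { // would be called from unblock()
            lock.lock();
            try {
                closed = false;
                opened.signalAll();
            } finally {
                lock.unlock();
            }
        }

        /** Waits until the gate opens, the lease on the closed state expires, or the timeout runs out. */
        boolean await(long timeout, TimeUnit unit) throws InterruptedException {
            long remaining = unit.toNanos(timeout);
            lock.lock();
            try {
                while (closed) {
                    long leaseLeft = maxClosedNanos - (System.nanoTime() - closedAtNanos);
                    if (leaseLeft <= 0) {
                        // Closed for too long: assume the unblock() was lost and reopen.
                        closed = false;
                        opened.signalAll();
                        break;
                    }
                    if (remaining <= 0) {
                        return false; // caller's own timeout expired first
                    }
                    long waitFor = Math.min(remaining, leaseLeft);
                    long left = opened.awaitNanos(waitFor);
                    remaining -= (waitFor - left); // subtract the time actually waited
                }
                return true;
            } finally {
                lock.unlock();
            }
        }
    }

    /** Helper (hypothetical): timed await that reports an interrupt as "did not pass". */
    static boolean tryPass(LeasedGate gate, long millis) {
        try {
            return gate.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        LeasedGate gate = new LeasedGate(300); // stay closed for at most 300 ms
        gate.close();                          // block() with no unblock() to follow
        System.out.println("short wait passes: " + tryPass(gate, 20));
        System.out.println("long wait passes:  " + tryPass(gate, 2000));
    }
}
```

The trade-off in a design like this is that a node stops waiting for an unblock() that may never come, at the cost of possibly sending while a slow but healthy FLUSH is still running, so the maximum closed period would have to be chosen well above any realistic flush duration.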
Vladimir/Bela - in the scenario described (node initiates a FLUSH and
then dies) would other nodes still see a view change relating to the
node dying?
[1] https://jira.jboss.org/jira/browse/EJBTHREE-1580
[2] https://jira.jboss.org/jira/browse/JGRP-855
[3] http://www.jboss.com/index.html?module=bb&op=viewtopic&t=145138
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com
--
Manik Surtani
Lead, JBoss Cache
manik(a)jboss.org