[jbosscache-dev] Cache unable to write to cluster
Manik Surtani
manik at jboss.org
Wed Nov 12 05:03:11 EST 2008
On 11 Nov 2008, at 22:33, Brian Stansberry wrote:
> We just found an intermittent failure in the EJB3 testsuite[1]
> that's more a JBC or JGroups issue. This is with JBC 3.0.0.CR4 and
> JG 2.6.6. I'm speculating it relates to FLUSH work Vladimir's been
> doing[2][3].
>
> Issue is an inability to replicate a put:
>
> Caused by: org.jboss.cache.lock.TimeoutException: State retrieval timed out waiting for flush unblock.
>     at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:455)
>     at .... org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:560)
>     at org.jboss.ha.cachemanager.CacheManagerManagedCache.put(CacheManagerManagedCache.java:285)
>     at org.jboss.ejb3.cache.tree.StatefulTreeCache.putInCache(StatefulTreeCache.java:511)
>     at org.jboss.ejb3.cache.tree.StatefulTreeCache.create(StatefulTreeCache.java:123)
>     ... 70 more
>
> Looking at RPCManagerImpl.java:455 we have:
>
> if (channel.flushSupported() && !flushBlockGate.await(configuration.getStateRetrievalTimeout(), TimeUnit.MILLISECONDS))
> {
>    throw new TimeoutException("State retrieval timed out waiting for flush unblock.");
> }
>
> Basically, failing on flushBlockGate.await(). Looking at use of
> flushBlockGate, the gate is closed in block() and opened in
> unblock(). *Assuming* no bug in ReclosableLatch, seems like block()
> is getting called here with no subsequent call to unblock().
> (Unfortunately, logs related to this failure are gone, so I can't
> prove that.)
>
> Questions:
>
> 1) Vladimir, could the JGRP-855 issue result in block() getting
> called with no subsequent call to unblock(), either on the flush
> coordinator or on one of the other nodes? If yes, your JGRP-855 fix
> will probably fix this as well.
>
> 2) Looking at RPCManagerImpl.start(), it does a connect+state
> transfer in a try/catch where any failure should result in a
> CacheException being thrown from start(). That CacheException
> should have prevented deployment of the ejb; i.e. the call shown in
> the stack trace above shouldn't have happened. Only way I see it
> could have happened is if the node that threw above exception wasn't
> the flush coordinator; i.e. its cache started fine, but a problem on
> another node led to its block() being called with no matching
> unblock(). That's a big issue too, as it means a failure in one
> node can take down the entire cluster by leaving everyone's
> flushBlockGate closed.

Yes, this was always an issue with the way we used FLUSH: someone in
the group could initiate a FLUSH and then die, leaving other members'
flushBlockGates closed. TBH, apart from adding timeouts to the
flushBlockGate, I can't see how we would get around this.

Vladimir/Bela - in the scenario described (node initiates a FLUSH and
then dies), would other nodes still see a view change relating to the
node dying?
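
To make the "timeouts on the flushBlockGate" idea concrete, here is a rough sketch of one possible reading of it: a gate that reopens itself if nobody calls open() within a maximum closed period. This is not the JBoss Cache ReclosableLatch and uses no JGroups API; the class name ExpiringGate, its methods, and the maxClosed parameter are all hypothetical, chosen only for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical stand-in for flushBlockGate (NOT the real ReclosableLatch).
 * close() would be called from block(), open() from unblock().
 * If open() never arrives, the gate expires after maxClosed and reopens itself.
 */
class ExpiringGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition opened = lock.newCondition();
    private final long maxClosedNanos;   // assumed upper bound on a healthy flush
    private boolean isOpen = true;       // no flush in progress initially
    private long closedAt;               // System.nanoTime() at the last close()

    ExpiringGate(long maxClosed, TimeUnit unit) {
        this.maxClosedNanos = unit.toNanos(maxClosed);
    }

    /** Close the gate and start the expiry clock. */
    void close() {
        lock.lock();
        try {
            isOpen = false;
            closedAt = System.nanoTime();
        } finally {
            lock.unlock();
        }
    }

    /** Open the gate and wake every waiting caller. */
    void open() {
        lock.lock();
        try {
            isOpen = true;
            opened.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Wait until the gate is open, the gate expires, or the caller's timeout
     * runs out. Returns false only in the last case.
     */
    boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        long remaining = unit.toNanos(timeout);
        lock.lock();
        try {
            while (!isOpen) {
                long untilExpiry = maxClosedNanos - (System.nanoTime() - closedAt);
                if (untilExpiry <= 0L) {
                    // Nobody reopened the gate in time: treat the flush as abandoned.
                    isOpen = true;
                    opened.signalAll();
                    break;
                }
                if (remaining <= 0L) {
                    return false;
                }
                long wait = Math.min(remaining, untilExpiry);
                long left = opened.awaitNanos(wait);
                remaining -= (wait - left);   // subtract the time actually spent waiting
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

A caller would keep the existing pattern: if await() returns false, throw the TimeoutException as before. The expiry only bounds how long a dead flush initiator can keep everyone else's gate closed. The obvious trade-off is that a maxClosed value shorter than a legitimate flush would let writes through while a flush is still in progress, so it would have to be chosen conservatively.
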
>
>
> [1] https://jira.jboss.org/jira/browse/EJBTHREE-1580
> [2] https://jira.jboss.org/jira/browse/JGRP-855
> [3] http://www.jboss.com/index.html?module=bb&op=viewtopic&t=145138
>
> --
> Brian Stansberry
> Lead, AS Clustering
> JBoss, a division of Red Hat
> brian.stansberry at redhat.com
--
Manik Surtani
Lead, JBoss Cache
manik at jboss.org