[jbosscache-dev] Cache unable to write to cluster

Wed Nov 12 05:03:11 EST 2008

On 11 Nov 2008, at 22:33, Brian Stansberry wrote:

> We just found an intermittent failure in the EJB3 testsuite[1]  
> that's more a JBC or JGroups issue. This is with JBC 3.0.0.CR4 and  
> JG 2.6.6. I'm speculating it relates to FLUSH work Vladimir's been  
> doing[2][3].
>
> Issue is an inability to replicate a put:
>
> Caused by: org.jboss.cache.lock.TimeoutException: State retrieval timed out waiting for flush unblock.
> at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:455)
> at ....
> org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:560)
> at org.jboss.ha.cachemanager.CacheManagerManagedCache.put(CacheManagerManagedCache.java:285)
> at org.jboss.ejb3.cache.tree.StatefulTreeCache.putInCache(StatefulTreeCache.java:511)
> at org.jboss.ejb3.cache.tree.StatefulTreeCache.create(StatefulTreeCache.java:123)
> ... 70 more
>
> Looking at RPCManagerImpl.java:455 we have:
>
> if (channel.flushSupported() && !flushBlockGate.await(configuration.getStateRetrievalTimeout(), TimeUnit.MILLISECONDS))
> {
>   throw new TimeoutException("State retrieval timed out waiting for flush unblock.");
> }
>
> Basically, failing on flushBlockGate.await().  Looking at use of  
> flushBlockGate, the gate is closed in block() and opened in  
> unblock(). *Assuming* no bug in ReclosableLatch, seems like block()  
> is getting called here with no subsequent call to unblock().  
> (Unfortunately, logs related to this failure are gone, so I can't  
> prove that.)
>
> Questions:
>
> 1) Vladimir, could the JGRP-855 issue result in block() getting  
> called with no subsequent call to unblock(), either on the flush  
> coordinator or on one of the other nodes?  If yes, your JGRP-855 fix  
> will probably fix this as well.
>
> 2) Looking at RPCManagerImpl.start(), it does a connect+state  
> transfer in a try/catch where any failure should result in a  
> CacheException being thrown from start().  That CacheException  
> should have prevented deployment of the ejb; i.e. the call shown in  
> the stack trace above shouldn't have happened. Only way I see it  
> could have happened is if the node that threw above exception wasn't  
> the flush coordinator; i.e. its cache started fine, but a problem on  
> another node led to its block() being called with no matching  
> unblock().  That's a big issue too, as it means a failure in one  
> node can take down the entire cluster by leaving everyone's  
> flushBlockGate closed.

Yes, this was always an issue with the way we used FLUSH - that  
someone in the group could initiate a FLUSH and then die leaving other  
members' flushBlockGates closed.  TBH, apart from adding timeouts to  
the flushBlockGate, I can't see how we would get around this.
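
To make the failure mode concrete, here is a minimal sketch of a reclosable gate with a timed await. It is an illustration only: the Gate class and FlushGateSketch are hypothetical names, not the actual ReclosableLatch or RPCManagerImpl code, and the 100 ms timeout stands in for the configured state retrieval timeout.

```java
import java.util.concurrent.TimeUnit;

final class FlushGateSketch {

    /** Hypothetical stand-in for a reclosable latch; not JBoss Cache's ReclosableLatch. */
    static final class Gate {
        private boolean open = true;

        /** What block() would do in this sketch: stop callers from passing. */
        synchronized void close() {
            open = false;
        }

        /** What unblock() would do in this sketch: let waiting callers through. */
        synchronized void open() {
            open = true;
            notifyAll();
        }

        /** Waits until the gate is open; returns false if the timeout elapses first. */
        synchronized boolean await(long timeout, TimeUnit unit) throws InterruptedException {
            long deadline = System.nanoTime() + unit.toNanos(timeout);
            while (!open) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return false;
                }
                TimeUnit.NANOSECONDS.timedWait(this, remaining);
            }
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Gate gate = new Gate();

        // block() arrives, but the matching unblock() never does.
        gate.close();
        boolean passed = gate.await(100, TimeUnit.MILLISECONDS);
        System.out.println("passed with gate left closed: " + passed);

        // Once unblock() finally arrives, callers get through again.
        gate.open();
        passed = gate.await(100, TimeUnit.MILLISECONDS);
        System.out.println("passed after gate reopened: " + passed);
    }
}
```

In this sketch a caller that finds the gate closed gives up after the timeout, which corresponds to the TimeoutException in the trace. The gate itself stays closed until something calls open(), so every later caller waits out the full timeout as well.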

Vladimir/Bela - in the scenario described (node initiates a FLUSH and  
then dies) would other nodes still see a view change relating to the  
node dying?

>
>
> [1] https://jira.jboss.org/jira/browse/EJBTHREE-1580
> [2] https://jira.jboss.org/jira/browse/JGRP-855
> [3] http://www.jboss.com/index.html?module=bb&op=viewtopic&t=145138
>
> -- 
> Brian Stansberry
> Lead, AS Clustering
> JBoss, a division of Red Hat
> brian.stansberry at redhat.com

--
Manik Surtani
Lead, JBoss Cache
manik at jboss.org