On 11 Nov 2008, at 22:33, Brian Stansberry wrote:
 We just found an intermittent failure in the EJB3 testsuite[1]  
 that's more a JBC or JGroups issue. This is with JBC 3.0.0.CR4 and  
 JG 2.6.6. I'm speculating it relates to FLUSH work Vladimir's been  
 doing[2][3].
 Issue is an inability to replicate a put:
 Caused by: org.jboss.cache.lock.TimeoutException: State retrieval timed out waiting for flush unblock.
     at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:455)
     at ....
     org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:560)
     at org.jboss.ha.cachemanager.CacheManagerManagedCache.put(CacheManagerManagedCache.java:285)
     at org.jboss.ejb3.cache.tree.StatefulTreeCache.putInCache(StatefulTreeCache.java:511)
     at org.jboss.ejb3.cache.tree.StatefulTreeCache.create(StatefulTreeCache.java:123)
     ... 70 more
 Looking at RPCManagerImpl.java:455 we have:
 if (channel.flushSupported() && !flushBlockGate.await(configuration.getStateRetrievalTimeout(), TimeUnit.MILLISECONDS))
 {
    throw new TimeoutException("State retrieval timed out waiting for flush unblock.");
 }
 Basically, failing on flushBlockGate.await().  Looking at use of  
 flushBlockGate, the gate is closed in block() and opened in  
 unblock(). *Assuming* no bug in ReclosableLatch, seems like block()  
 is getting called here with no subsequent call to unblock().  
 (Unfortunately, logs related to this failure are gone, so I can't  
 prove that.)
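
To make the suspected sequence concrete, here is a minimal sketch of a reclosable gate and of what a caller sees when block() closes it and unblock() never arrives. GateSketch and its methods are hypothetical names for illustration; this is not the JBC ReclosableLatch source, only an assumption about how such a gate could be built on AbstractQueuedSynchronizer.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

// Hypothetical stand-in for flushBlockGate; not the actual JBC class.
public class GateSketch extends AbstractQueuedSynchronizer {
    private static final int CLOSED = 0;
    private static final int OPEN = 1;

    public GateSketch() {
        setState(OPEN); // the gate starts open
    }

    @Override
    protected int tryAcquireShared(int ignored) {
        // Waiters pass only while the gate is open.
        return getState() == OPEN ? 1 : -1;
    }

    @Override
    protected boolean tryReleaseShared(int newState) {
        setState(newState);
        return true; // wake waiters so they re-check the state
    }

    // What block() would do in this sketch.
    public void close() {
        releaseShared(CLOSED);
    }

    // What unblock() would do in this sketch.
    public void open() {
        releaseShared(OPEN);
    }

    // Returns false if the gate is still closed when the timeout elapses.
    public boolean await(long time, TimeUnit unit) throws InterruptedException {
        return tryAcquireSharedNanos(1, unit.toNanos(time));
    }

    public static void main(String[] args) throws InterruptedException {
        GateSketch gate = new GateSketch();
        gate.close(); // block() arrives, unblock() never does
        boolean passed = gate.await(50, TimeUnit.MILLISECONDS);
        // In RPCManagerImpl a false result here is what turns into the TimeoutException.
        System.out.println("passed gate: " + passed);
    }
}
```

If the gate itself behaves like this, the only way to hit the timeout is a close() with no later open(), which is the block()/unblock() mismatch suspected above.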
 Questions:
 1) Vladimir, could the JGRP-855 issue result in block() getting  
 called with no subsequent call to unblock(), either on the flush  
 coordinator or on one of the other nodes?  If yes, your JGRP-855 fix  
 will probably fix this as well.
 2) Looking at RPCManagerImpl.start(), it does a connect+state  
 transfer in a try/catch where any failure should result in a  
 CacheException being thrown from start().  That CacheException  
 should have prevented deployment of the ejb; i.e. the call shown in  
 the stack trace above shouldn't have happened. Only way I see it  
 could have happened is if the node that threw above exception wasn't  
 the flush coordinator; i.e. its cache started fine, but a problem on  
 another node led to its block() being called with no matching  
 unblock().  That's a big issue too, as it means a failure in one  
 node can take down the entire cluster by leaving everyone's  
 flushBlockGate closed. 
Yes, this was always an issue with the way we used FLUSH - that  
someone in the group could initiate a FLUSH and then die leaving other  
members' flushBlockGates closed.  TBH, apart from adding timeouts to  
the flushBlockGate, I can't see how we would get around this.
Vladimir/Bela - in the scenario described (node initiates a FLUSH and  
then dies) would other nodes still see a view change relating to the  
node dying?
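
As a rough sketch of what "adding timeouts to the flushBlockGate" could look like: the gate below reopens itself once it has been closed for longer than a configured limit, so a FLUSH initiator that dies cannot leave it closed forever. ExpiringGate, maxClosedNanos and the method names are all hypothetical; this is one possible reading of the idea, not existing JBC or JGroups code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical gate that gives up on a missing unblock() after a fixed time.
public class ExpiringGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition opened = lock.newCondition();
    private final long maxClosedNanos; // assumed config value: longest time the gate may stay closed
    private boolean closed;
    private long closedAt;

    public ExpiringGate(long maxClosed, TimeUnit unit) {
        this.maxClosedNanos = unit.toNanos(maxClosed);
    }

    // Called from block(): close the gate and remember when.
    public void close() {
        lock.lock();
        try {
            closed = true;
            closedAt = System.nanoTime();
        } finally {
            lock.unlock();
        }
    }

    // Called from unblock(): the normal way to reopen.
    public void open() {
        lock.lock();
        try {
            closed = false;
            opened.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // True once the gate is open or has expired; false if the caller's timeout runs out first.
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        long remaining = unit.toNanos(timeout);
        lock.lock();
        try {
            while (closed) {
                long untilExpiry = maxClosedNanos - (System.nanoTime() - closedAt);
                if (untilExpiry <= 0) {
                    // No unblock() arrived in time: force the gate open.
                    closed = false;
                    opened.signalAll();
                    break;
                }
                if (remaining <= 0) {
                    return false;
                }
                long wait = Math.min(remaining, untilExpiry);
                long left = opened.awaitNanos(wait);
                remaining -= (wait - left); // subtract the time actually spent waiting
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

The obvious trade-off is that a forced reopen may let replication traffic through while a slow but healthy FLUSH is still in progress, so the limit would have to be comfortably longer than any legitimate flush.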
 [1] https://jira.jboss.org/jira/browse/EJBTHREE-1580
 [2] https://jira.jboss.org/jira/browse/JGRP-855
 [3] http://www.jboss.com/index.html?module=bb&op=viewtopic&t=145138
 -- 
 Brian Stansberry
 Lead, AS Clustering
 JBoss, a division of Red Hat
 brian.stansberry(a)redhat.com
--
Manik Surtani
Lead, JBoss Cache
manik(a)jboss.org