[infinispan-issues] [JBoss JIRA] Updated: (ISPN-762) possible deadlock during start of 3 HotRod Server nodes
Jacek Gerbszt (JIRA)
jira-events at lists.jboss.org
Mon Nov 8 09:17:01 EST 2010
[ https://jira.jboss.org/browse/ISPN-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jacek Gerbszt updated ISPN-762:
-------------------------------
Summary: possible deadlock during start of 3 HotRod Server nodes (was: possible deadlock during start of 3 node HotRod Server nodes)
> possible deadlock during start of 3 HotRod Server nodes
> -------------------------------------------------------
>
> Key: ISPN-762
> URL: https://jira.jboss.org/browse/ISPN-762
> Project: Infinispan
> Issue Type: Bug
> Components: Cache Server
> Affects Versions: 4.2.0.BETA1
> Reporter: Jacek Gerbszt
> Assignee: Manik Surtani
> Attachments: topologycache.diff
>
>
> While starting 3 HotRodServer nodes at the same time, I receive this exception:
> 2010-11-08 13:21:11,081 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (InfinispanServer-Main) unexpected error while replicating
> org.infinispan.manager.NamedCacheNotFoundException: Cache: ___hotRodTopologyCache
> at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:72)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:176)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:148)
> at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:575)
> at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:486)
> at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:362)
> at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
> at org.jgroups.JChannel.up(JChannel.java:1453)
> at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
> at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:483)
> at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:265)
> at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
> at org.jgroups.protocols.FC.up(FC.java:494)
> at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
> at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
> at org.jgroups.protocols.UNICAST.up(UNICAST.java:309)
> at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:813)
> at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:671)
> at org.jgroups.protocols.BARRIER.up(BARRIER.java:120)
> at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:177)
> at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:270)
> at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
> at org.jgroups.protocols.Discovery.up(Discovery.java:283)
> at org.jgroups.protocols.PING.up(PING.java:67)
> at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
> at org.jgroups.protocols.TP.access$100(TP.java:56)
> at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1549)
> at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1531)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> I think the problem is that ___hotRodTopologyCache is defined too late. By the time the default cache is started in AbstractProtocolServer.start(),
> the cluster is already up and running and ready to receive replication messages. But unfortunately ___hotRodTopologyCache has not yet been defined;
> that only happens a few lines later, in HotRodServer.addSelfToTopologyView(). Simply moving one line of code from addSelfToTopologyView() into the
> start method, as shown in the attachment, would solve this problem.
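[Editor's note] The ordering fix described above can be sketched abstractly. The following is a hypothetical simulation in plain Java (not the real Infinispan API; all names are illustrative). The invariant the patch restores: once a node has joined the cluster, ___hotRodTopologyCache is already defined, so incoming replication for it can never hit NamedCacheNotFoundException.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simulation of the startup-ordering fix (not Infinispan code).
class Node {
    static final String TOPOLOGY_CACHE = "___hotRodTopologyCache";
    final Map<String, Map<String, String>> caches = new ConcurrentHashMap<>();
    volatile boolean joined = false;

    // Original order: join first, define the cache a few lines later --
    // replication arriving in between finds no such cache.
    void startOriginalOrder() {
        joined = true;                // cluster may replicate to us already...
        defineCache(TOPOLOGY_CACHE);  // ...but the cache only appears here
    }

    // Patched order: define the cache, then join the cluster.
    void startPatchedOrder() {
        defineCache(TOPOLOGY_CACHE);
        joined = true;
    }

    void defineCache(String name) {
        caches.putIfAbsent(name, new ConcurrentHashMap<>());
    }

    // Rough stand-in for InboundInvocationHandlerImpl.handle(): an incoming
    // replication succeeds only if the named cache is already defined.
    boolean handleReplication(String cacheName) {
        return caches.containsKey(cacheName);
    }
}
```

With the patched order, handleReplication(TOPOLOGY_CACHE) can never return false on a joined node; with the original order there is a window between the two statements where it does.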
> But that is not the end of the story. Even with all the caches defined in the right place, there is still a problem
> when starting 3 nodes at a time. Look at the errors in the log files:
> * wp-60814 log:
> 2010-11-08 14:29:26,687 INFO [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,wp-60814) This is a JOIN event! Wait for notification from
> new joiner wp-59660
> 2010-11-08 14:31:26,668 ERROR [org.infinispan.distribution.JoinTask] (Rehasher-wp-60814) Caught exception!
> org.infinispan.CacheException: org.infinispan.util.concurrent.TimeoutException: Timed out after 120 seconds waiting for a response from wp-34493
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:122)
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:403)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:125)
> at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:192)
> at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:87)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:53)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:33)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 120 seconds waiting for a response from wp-34493
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:304)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:120)
> ... 12 more
> * wp-34493 log:
> 2010-11-08 14:29:26,654 INFO [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,wp-34493) This is a JOIN event! Wait for notification from
> new joiner wp-59660
> 2010-11-08 14:29:36,759 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (InfinispanServer-Main) Execution error:
> org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses!
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:421)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
> ...
> at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:273)
> at org.infinispan.CacheDelegate.putIfAbsent(CacheDelegate.java:453)
> at org.infinispan.CacheSupport.putIfAbsent(CacheSupport.java:40)
> at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:85)
> at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:75)
> at org.infinispan.server.hotrod.HotRodServer.isViewUpdated(HotRodServer.scala:102)
> at org.infinispan.server.hotrod.HotRodServer.org$infinispan$server$hotrod$HotRodServer$$updateTopologyView(HotRodServer.scala:97)
> at org.infinispan.server.hotrod.HotRodServer.addSelfToTopologyView(HotRodServer.scala:75)
> at org.infinispan.server.hotrod.HotRodServer.startTransport(HotRodServer.scala:63)
> at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:70)
> at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:44)
> ...
>
> * wp-59660 log:
> 2010-11-08 14:29:26,699 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Received new cluster view: [wp-34493|2] [wp-34493, wp-60814, wp-59660]
> 2010-11-08 14:29:26,756 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Cache local address is wp-59660, physical addresses are [10.0.36.136:34302]
> 2010-11-08 14:29:26,851 ERROR [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Defined caches: [___hotRodTopologyCache]
> 2010-11-08 14:29:26,851 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Will try and wait for the cache to start
> 2010-11-08 14:29:56,864 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Cache named ___hotRodTopologyCache does not exist on this cache manager!
> 2010-11-08 14:49:26,897 ERROR [org.infinispan.distribution.JoinTask] (Rehasher-wp-59660) Caught exception!
> org.infinispan.CacheException: Unable to retrieve old consistent hash from coordinator even after several attempts at sleeping and retrying!
> at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:218)
> at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:87)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:53)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:33)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> The first node hits rehashRpcTimeout=120000 while waiting for the old consistent hash. The second one hits the 10 s timeout while
> adding itself to the topology view. And the third one also fails to retrieve the old hash, but after a much longer time (20 minutes).
> It looks like some kind of deadlock.
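[Editor's note] The three-way stall reads like a classic circular wait: each node is blocked on a response that another node can only send after its own wait completes, so nobody proceeds until a timeout fires. A minimal abstract illustration with plain Java latches (not Infinispan code; the latch names are only a loose analogy to the joiner/coordinator exchange above):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Abstract illustration of a circular wait broken only by timeouts.
// Each side signals the other only AFTER its own wait succeeds, so neither
// latch is ever counted down and both awaits time out.
public class CircularWait {
    public static boolean[] run() throws InterruptedException {
        CountDownLatch oldHashSent = new CountDownLatch(1);   // "coordinator answered the joiner"
        CountDownLatch joinerCacheUp = new CountDownLatch(1); // "joiner defined its cache"
        boolean[] timedOut = new boolean[2];

        // "Coordinator": won't answer the joiner until the joiner's cache exists.
        Thread coordinator = new Thread(() -> {
            try {
                if (joinerCacheUp.await(200, TimeUnit.MILLISECONDS)) oldHashSent.countDown();
                else timedOut[0] = true;
            } catch (InterruptedException ignored) { }
        });

        // "Joiner": won't define its cache until the coordinator has answered.
        Thread joiner = new Thread(() -> {
            try {
                if (oldHashSent.await(200, TimeUnit.MILLISECONDS)) joinerCacheUp.countDown();
                else timedOut[1] = true;
            } catch (InterruptedException ignored) { }
        });

        coordinator.start();
        joiner.start();
        coordinator.join();
        joiner.join();
        return timedOut; // both sides time out: {true, true}
    }
}
```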
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira