[infinispan-issues] [JBoss JIRA] Updated: (ISPN-762) possible deadlock during start of 3 HotRod Server nodes
Jacek Gerbszt (JIRA)
jira-events at lists.jboss.org
Mon Nov 8 09:17:01 EST 2010
[ https://jira.jboss.org/browse/ISPN-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jacek Gerbszt updated ISPN-762:
-------------------------------
Summary: possible deadlock during start of 3 HotRod Server nodes (was: possible deadlock during start of 3 node HotRod Server nodes)
> possible deadlock during start of 3 HotRod Server nodes
> -------------------------------------------------------
>
> Key: ISPN-762
> URL: https://jira.jboss.org/browse/ISPN-762
> Project: Infinispan
> Issue Type: Bug
> Components: Cache Server
> Affects Versions: 4.2.0.BETA1
> Reporter: Jacek Gerbszt
> Assignee: Manik Surtani
> Attachments: topologycache.diff
>
>
> While starting 3 HotRodServer nodes at the same time, I receive this exception:
> 2010-11-08 13:21:11,081 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (InfinispanServer-Main) unexpected error while replicating
> org.infinispan.manager.NamedCacheNotFoundException: Cache: ___hotRodTopologyCache
> at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:72)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:176)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:148)
> at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:575)
> at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:486)
> at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:362)
> at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
> at org.jgroups.JChannel.up(JChannel.java:1453)
> at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
> at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:483)
> at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:265)
> at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
> at org.jgroups.protocols.FC.up(FC.java:494)
> at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
> at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
> at org.jgroups.protocols.UNICAST.up(UNICAST.java:309)
> at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:813)
> at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:671)
> at org.jgroups.protocols.BARRIER.up(BARRIER.java:120)
> at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:177)
> at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:270)
> at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
> at org.jgroups.protocols.Discovery.up(Discovery.java:283)
> at org.jgroups.protocols.PING.up(PING.java:67)
> at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
> at org.jgroups.protocols.TP.access$100(TP.java:56)
> at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1549)
> at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1531)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> I think the problem is that ___hotRodTopologyCache is defined too late. By the time the default cache is started in AbstractProtocolServer.start(),
> the cluster is already up and running and ready to receive replication messages. But unfortunately ___hotRodTopologyCache has not yet been defined;
> that only happens a few lines later, in HotRodServer.addSelfToTopologyView(). Simply moving one line of code from addSelfToTopologyView() into the
> start method, as shown in the attachment, would solve this problem.
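[Editor's note] The ordering fix described above can be sketched abstractly. The following is a hypothetical simulation in plain Java (not the real Infinispan API; all names are illustrative). The invariant the patch restores: once a node has joined the cluster, ___hotRodTopologyCache is already defined, so incoming replication for it can never hit NamedCacheNotFoundException.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simulation of the startup-ordering fix (not Infinispan code).
class Node {
    static final String TOPOLOGY_CACHE = "___hotRodTopologyCache";
    final Map<String, Map<String, String>> caches = new ConcurrentHashMap<>();
    volatile boolean joined = false;

    // Original order: join first, define the cache a few lines later --
    // replication arriving in between finds no such cache.
    void startOriginalOrder() {
        joined = true;                // cluster may replicate to us already...
        defineCache(TOPOLOGY_CACHE);  // ...but the cache only appears here
    }

    // Patched order: define the cache, then join the cluster.
    void startPatchedOrder() {
        defineCache(TOPOLOGY_CACHE);
        joined = true;
    }

    void defineCache(String name) {
        caches.putIfAbsent(name, new ConcurrentHashMap<>());
    }

    // Rough stand-in for InboundInvocationHandlerImpl.handle(): an incoming
    // replication succeeds only if the named cache is already defined.
    boolean handleReplication(String cacheName) {
        return caches.containsKey(cacheName);
    }
}
```

With the patched order, handleReplication(TOPOLOGY_CACHE) can never return false on a joined node; with the original order there is a window between the two statements where it does.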
> But that is not the end of the story. Even with all the caches defined in the right place, there is still a problem
> when starting 3 nodes at a time. Look at the errors in the log files:
> * wp-60814 log:
> 2010-11-08 14:29:26,687 INFO [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,wp-60814) This is a JOIN event! Wait for notification from
> new joiner wp-59660
> 2010-11-08 14:31:26,668 ERROR [org.infinispan.distribution.JoinTask] (Rehasher-wp-60814) Caught exception!
> org.infinispan.CacheException: org.infinispan.util.concurrent.TimeoutException: Timed out after 120 seconds waiting for a response from wp-34493
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:122)
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:403)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:125)
> at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:192)
> at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:87)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:53)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:33)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 120 seconds waiting for a response from wp-34493
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:304)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:120)
> ... 12 more
> * wp-34493 log:
> 2010-11-08 14:29:26,654 INFO [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,wp-34493) This is a JOIN event! Wait for notification from
> new joiner wp-59660
> 2010-11-08 14:29:36,759 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (InfinispanServer-Main) Execution error:
> org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses!
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:421)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
> ...
> at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:273)
> at org.infinispan.CacheDelegate.putIfAbsent(CacheDelegate.java:453)
> at org.infinispan.CacheSupport.putIfAbsent(CacheSupport.java:40)
> at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:85)
> at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:75)
> at org.infinispan.server.hotrod.HotRodServer.isViewUpdated(HotRodServer.scala:102)
> at org.infinispan.server.hotrod.HotRodServer.org$infinispan$server$hotrod$HotRodServer$$updateTopologyView(HotRodServer.scala:97)
> at org.infinispan.server.hotrod.HotRodServer.addSelfToTopologyView(HotRodServer.scala:75)
> at org.infinispan.server.hotrod.HotRodServer.startTransport(HotRodServer.scala:63)
> at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:70)
> at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:44)
> ...
>
> * wp-59660 log:
> 2010-11-08 14:29:26,699 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Received new cluster view: [wp-34493|2] [wp-34493, wp-60814, wp-59660]
> 2010-11-08 14:29:26,756 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Cache local address is wp-59660, physical addresses are [10.0.36.136:34302]
> 2010-11-08 14:29:26,851 ERROR [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Defined caches: [___hotRodTopologyCache]
> 2010-11-08 14:29:26,851 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Will try and wait for the cache to start
> 2010-11-08 14:29:56,864 INFO [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Cache named ___hotRodTopologyCache does not exist on this cache manager!
> 2010-11-08 14:49:26,897 ERROR [org.infinispan.distribution.JoinTask] (Rehasher-wp-59660) Caught exception!
> org.infinispan.CacheException: Unable to retrieve old consistent hash from coordinator even after several attempts at sleeping and retrying!
> at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:218)
> at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:87)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:53)
> at org.infinispan.distribution.RehashTask.call(RehashTask.java:33)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> The first node hits rehashRpcTimeout=120000 while waiting for the old consistent hash. The second one hits the 10 s timeout while
> adding itself to the topology view. And the third one also fails to retrieve the old hash, but after a much longer time (20 minutes).
> It looks like some kind of deadlock.
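[Editor's note] The three-way stall reads like a classic circular wait: each node is blocked on a response that another node can only send after its own wait completes, so nobody proceeds until a timeout fires. A minimal abstract illustration with plain Java latches (not Infinispan code; the latch names are only a loose analogy to the joiner/coordinator exchange above):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Abstract illustration of a circular wait broken only by timeouts.
// Each side signals the other only AFTER its own wait succeeds, so neither
// latch is ever counted down and both awaits time out.
public class CircularWait {
    public static boolean[] run() throws InterruptedException {
        CountDownLatch oldHashSent = new CountDownLatch(1);   // "coordinator answered the joiner"
        CountDownLatch joinerCacheUp = new CountDownLatch(1); // "joiner defined its cache"
        boolean[] timedOut = new boolean[2];

        // "Coordinator": won't answer the joiner until the joiner's cache exists.
        Thread coordinator = new Thread(() -> {
            try {
                if (joinerCacheUp.await(200, TimeUnit.MILLISECONDS)) oldHashSent.countDown();
                else timedOut[0] = true;
            } catch (InterruptedException ignored) { }
        });

        // "Joiner": won't define its cache until the coordinator has answered.
        Thread joiner = new Thread(() -> {
            try {
                if (oldHashSent.await(200, TimeUnit.MILLISECONDS)) joinerCacheUp.countDown();
                else timedOut[1] = true;
            } catch (InterruptedException ignored) { }
        });

        coordinator.start();
        joiner.start();
        coordinator.join();
        joiner.join();
        return timedOut; // both sides time out: {true, true}
    }
}
```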
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira