[infinispan-issues] [JBoss JIRA] Updated: (ISPN-762) possible deadlock during start of 3 HotRod Server nodes

Jacek Gerbszt (JIRA) jira-events at lists.jboss.org
Mon Nov 8 09:19:01 EST 2010


     [ https://jira.jboss.org/browse/ISPN-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacek Gerbszt updated ISPN-762:
-------------------------------

    Description: 
While starting 3 HotRodServer nodes at the same time, I receive an exception: 
2010-11-08 13:21:11,081 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (InfinispanServer-Main) unexpected error while replicating
org.infinispan.manager.NamedCacheNotFoundException: Cache: ___hotRodTopologyCache
        at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:72)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:176)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:148)
        at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:575)
        at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:486)
        at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:362)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
        at org.jgroups.JChannel.up(JChannel.java:1453)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
        at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:483)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:265)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
        at org.jgroups.protocols.FC.up(FC.java:494)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
        at org.jgroups.protocols.UNICAST.up(UNICAST.java:309)
        at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:813)
        at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:671)
        at org.jgroups.protocols.BARRIER.up(BARRIER.java:120)
        at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:177)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:270)
        at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
        at org.jgroups.protocols.Discovery.up(Discovery.java:283)
        at org.jgroups.protocols.PING.up(PING.java:67)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
        at org.jgroups.protocols.TP.access$100(TP.java:56)
        at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1549)
        at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1531)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

I think the problem is that ___hotRodTopologyCache is defined too late. When the default cache is started in AbstractProtocolServer.start(), the cluster is already up and running and ready to receive replication messages, but ___hotRodTopologyCache is not yet defined; that only happens a few lines later, in HotRodServer.addSelfToTopologyView(). Simply moving one line of code from addSelfToTopologyView to the start method, as shown in the attachment, would solve this problem.
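
For illustration only, here is a minimal sketch of that ordering (this is not the attached topologycache.diff; the configuration helper and the replicated cache mode are my assumptions):

import org.infinispan.config.Configuration
import org.infinispan.manager.EmbeddedCacheManager

object TopologyCacheStartOrder {
  val TopologyCacheName = "___hotRodTopologyCache"

  def start(cacheManager: EmbeddedCacheManager) {
    // 1. Define the topology cache first (the line moved here from addSelfToTopologyView),
    //    so inbound replication commands can always resolve it.
    cacheManager.defineConfiguration(TopologyCacheName, topologyCacheConfig)

    // 2. Starting the default cache brings the transport up and joins the cluster;
    //    remote commands targeting ___hotRodTopologyCache now find a defined cache
    //    instead of failing with NamedCacheNotFoundException.
    cacheManager.getCache()

    // 3. Only afterwards register this node in the shared topology view.
    val topologyCache = cacheManager.getCache[AnyRef, AnyRef](TopologyCacheName)
    // ... putIfAbsent of this node's address, as HotRodServer.addSelfToTopologyView does
  }

  // Assumption: a replicated, synchronous topology cache.
  private def topologyCacheConfig: Configuration = {
    val cfg = new Configuration
    cfg.setCacheMode(Configuration.CacheMode.REPL_SYNC)
    cfg
  }
}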

But that's not the end of the story. Even with all the caches defined in the right place, there is still a problem when starting 3 nodes at the same time. Look at the errors from the log files: 

* wp-60814 log:
2010-11-08 14:29:26,687 INFO  [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,wp-60814) This is a JOIN event!  Wait for notification from new joiner wp-59660
2010-11-08 14:31:26,668 ERROR [org.infinispan.distribution.JoinTask] (Rehasher-wp-60814) Caught exception!
org.infinispan.CacheException: org.infinispan.util.concurrent.TimeoutException: Timed out after 120 seconds waiting for a response from wp-34493
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:122)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:403)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:125)
        at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:192)
        at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:87)
        at org.infinispan.distribution.RehashTask.call(RehashTask.java:53)
        at org.infinispan.distribution.RehashTask.call(RehashTask.java:33)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 120 seconds waiting for a response from wp-34493
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:304)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:120)
        ... 12 more

* wp-34493 log:
2010-11-08 14:29:26,654 INFO  [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,wp-34493) This is a JOIN event!  Wait for notification from new joiner wp-59660
2010-11-08 14:29:36,759 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (InfinispanServer-Main) Execution error: 
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses!
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:421)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:101)
		...
        at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:273)
        at org.infinispan.CacheDelegate.putIfAbsent(CacheDelegate.java:453)
        at org.infinispan.CacheSupport.putIfAbsent(CacheSupport.java:40)
        at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:85)
        at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:75)
        at org.infinispan.server.hotrod.HotRodServer.isViewUpdated(HotRodServer.scala:102)
        at org.infinispan.server.hotrod.HotRodServer.org$infinispan$server$hotrod$HotRodServer$$updateTopologyView(HotRodServer.scala:97)
        at org.infinispan.server.hotrod.HotRodServer.addSelfToTopologyView(HotRodServer.scala:75)
        at org.infinispan.server.hotrod.HotRodServer.startTransport(HotRodServer.scala:63)
        at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:70)
        at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:44)
		...
		
* wp-59660 log:
2010-11-08 14:29:26,699 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Received new cluster view: [wp-34493|2] [wp-34493, wp-60814, wp-59660]
2010-11-08 14:29:26,756 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Cache local address is wp-59660, physical addresses are [10.0.36.136:34302]
2010-11-08 14:29:26,851 ERROR [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Defined caches: [___hotRodTopologyCache]
2010-11-08 14:29:26,851 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Will try and wait for the cache to start
2010-11-08 14:29:56,864 INFO  [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-2,wp-59660) Cache named ___hotRodTopologyCache does not exist on this cache manager!
2010-11-08 14:49:26,897 ERROR [org.infinispan.distribution.JoinTask] (Rehasher-wp-59660) Caught exception!
org.infinispan.CacheException: Unable to retrieve old consistent hash from coordinator even after several attempts at sleeping and retrying!
        at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:218)
        at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:87)
        at org.infinispan.distribution.RehashTask.call(RehashTask.java:53)
        at org.infinispan.distribution.RehashTask.call(RehashTask.java:33)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

The first node hits rehashRpcTimeout=120000 (120 seconds) while waiting for the old consistent hash. The second one hits a 10-second timeout while adding itself to the topology view. And the third one also fails to retrieve the old consistent hash, but only after a much longer time, about 20 minutes.
It looks as though we have a kind of deadlock.
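
For completeness, a rough reproduction sketch: release all three node startups from a single latch so they race each other. startHotRodNode and the port numbers are hypothetical placeholders for whatever wiring a given deployment uses to build the cache manager and start the server:

import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object ConcurrentStartRepro {

  // Hypothetical placeholder: build an EmbeddedCacheManager and start a
  // HotRodServer listening on `port` (deployment-specific wiring goes here).
  def startHotRodNode(port: Int) {
    // ...
  }

  def main(args: Array[String]) {
    val ports = List(11222, 11223, 11224)
    val go = new CountDownLatch(1)
    val pool = Executors.newFixedThreadPool(ports.size)

    for (port <- ports) {
      pool.submit(new Runnable {
        def run() {
          go.await()            // hold each node until all three are queued
          startHotRodNode(port) // then start them at (nearly) the same instant
        }
      })
    }

    go.countDown()              // release all three starts together
    pool.shutdown()
    pool.awaitTermination(30, TimeUnit.MINUTES)
  }
}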

> possible deadlock during start of 3 HotRod Server nodes
> -------------------------------------------------------
>
>                 Key: ISPN-762
>                 URL: https://jira.jboss.org/browse/ISPN-762
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Cache Server
>    Affects Versions: 4.2.0.BETA1
>            Reporter: Jacek Gerbszt
>            Assignee: Manik Surtani
>         Attachments: topologycache.diff
>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

