[infinispan-issues] [JBoss JIRA] (ISPN-2612) Problem broadcasting CH_UPDATE command

Dan Berindei (JIRA) jira-events at lists.jboss.org
Tue Jan 15 07:02:22 EST 2013


     [ https://issues.jboss.org/browse/ISPN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Berindei resolved ISPN-2612.
--------------------------------

    Resolution: Won't Fix


[~mlinhard], it looks like the problem is that you don't have enough threads in your OOB thread pool, and the RSVP ACK message from the coordinator back to the self is dropped. It is resent after UNICAST.stable_interval expires (5 seconds with your configuration), but it is dropped again because 1 OOB thread is waiting for CH_UPDATE responses and 1 thread is busy sending the RSVP ACK message (sending doesn't take long, but JGroups doesn't wait at all before dropping the message).

I must have gotten lucky when I ran the test last night, because I tried it again after looking at the logs and I got the same TimeoutException you did. It went away when I increased oob_thread_pool.max_threads to 4.

I'm not sure what the absolute minimum number of OOB threads is, but I recommend running with the defaults in jgroups-udp.xml. JGroups won't start more threads than it needs, so I don't see a reason to move from the defaults.

Please reopen the JIRA if you can still reproduce it with 4 or more OOB threads.
                
> Problem broadcasting CH_UPDATE command
> --------------------------------------
>
>                 Key: ISPN-2612
>                 URL: https://issues.jboss.org/browse/ISPN-2612
>             Project: Infinispan
>          Issue Type: Bug
>          Components: RPC
>    Affects Versions: 5.2.0.Beta5
>            Reporter: Michal Linhard
>            Assignee: Dan Berindei
>            Priority: Blocker
>             Fix For: 5.2.0.CR2
>
>         Attachments: session-cluster.xml, test.zip, test_TRACE.zip
>
>
> Infinispan 5.2.0.Beta5
> JGroups 3.2.4.Final
> Steps to reproduce (I'm using two virtual interfaces test1, test2)
> 1. Start org.jboss.qa.jdg.Test with -Djgroups.udp.bind_addr=test1 -Djava.net.preferIPv4Stack=true
> 2. wait 10 sec
> 3. Start org.jboss.qa.jdg.Test with -Djgroups.udp.bind_addr=test2 -Djava.net.preferIPv4Stack=true
> After 5 seconds there should be this timeout exception:
> {code}
> 19:42:14,146 WARN  [org.infinispan.topology.CacheTopologyControlCommand] (OOB-2,mlinhard-work-37329) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=___defaultcache, type=REBALANCE_CONFIRM, sender=mlinhard-work-47337, joinInfo=null, topologyId=1, currentCH=null, pendingCH=null, throwable=null, viewId=1}
> java.util.concurrent.ExecutionException: org.infinispan.CacheException: org.jgroups.TimeoutException: TimeoutException
> 	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
> 	at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:563)
> 	at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:349)
> 	at org.infinispan.topology.ClusterTopologyManagerImpl.handleRebalanceCompleted(ClusterTopologyManagerImpl.java:213)
> 	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:160)
> 	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
> 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromLocalCluster(CommandAwareRpcDispatcher.java:252)
> 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:219)
> 	at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:483)
> 	at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:390)
> 	at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:248)
> 	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:598)
> 	at org.jgroups.JChannel.up(JChannel.java:703)
> 	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)
> 	at org.jgroups.protocols.RSVP.up(RSVP.java:172)
> 	at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
> 	at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
> 	at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
> 	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:896)
> 	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:244)
> 	at org.jgroups.protocols.UNICAST2.handleDataReceived(UNICAST2.java:736)
> 	at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:414)
> 	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:606)
> 	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)
> 	at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:187)
> 	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
> 	at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)
> 	at org.jgroups.protocols.Discovery.up(Discovery.java:359)
> 	at org.jgroups.protocols.TP.passMessageUp(TP.java:1287)
> 	at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1850)
> 	at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1823)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: TimeoutException
> 	at org.infinispan.util.Util.rewrapAsCacheException(Util.java:532)
> 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:152)
> 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:518)
> 	at org.infinispan.topology.ClusterTopologyManagerImpl$2.call(ClusterTopologyManagerImpl.java:545)
> 	at org.infinispan.topology.ClusterTopologyManagerImpl$2.call(ClusterTopologyManagerImpl.java:542)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	... 3 more
> Caused by: org.jgroups.TimeoutException: TimeoutException
> 	at org.jgroups.util.Promise._getResultWithTimeout(Promise.java:145)
> 	at org.jgroups.util.Promise.getResultWithTimeout(Promise.java:40)
> 	at org.jgroups.util.AckCollector.waitForAllAcks(AckCollector.java:93)
> 	at org.jgroups.protocols.RSVP$Entry.block(RSVP.java:287)
> 	at org.jgroups.protocols.RSVP.down(RSVP.java:118)
> 	at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
> 	at org.jgroups.JChannel.down(JChannel.java:718)
> 	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:616)
> 	at org.jgroups.blocks.RequestCorrelator.sendRequest(RequestCorrelator.java:173)
> 	at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:360)
> 	at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:103)
> 	at org.jgroups.blocks.Request.execute(Request.java:83)
> 	at org.jgroups.blocks.MessageDispatcher.cast(MessageDispatcher.java:335)
> 	at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:249)
> 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processCalls(CommandAwareRpcDispatcher.java:330)
> 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:145)
> 	... 8 more
> {code}
> Analysis:
> These are the messages sent after view change:
> {code}
> test1                     test2
>  <--- JOIN ---- 
>  ---- REBALANCE_START ---> 
>  <--- StateRequestCommand ---- 
>  ---- StateResponseCommand ---> 
>  <--- REBALANCE_CONFIRM ---- 
>  ---- CH_UPDATE --->
> {code}
> The last CH_UPDATE message is broadcast, test2 successfully processes it, but test1 stays in waiting state, because it for some reason awaits response also from itself - local variable entry in the method RSVP.down
> (https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/RSVP.java#L121)
> contained local address.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


More information about the infinispan-issues mailing list