[infinispan-issues] [JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses

Mon Feb 11 07:28:56 EST 2013

    [ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753387#comment-12753387 ] 

Radim Vansa commented on ISPN-2802:
-----------------------------------

Bela: sorry :) node3 = hyperion806 sends GET_STATUS; you can see that node8 = hyperion811 receives the message and responds with a message:
{code}
02:44:34,677 BYTEMAN (OOB-94,hyperion811-7798): Receiving local hyperion806-18589 -> null: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=hyperion806-18589, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=33}
02:44:34,678 BYTEMAN (OOB-94,hyperion811-7798): Received local hyperion806-18589 -> null: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=hyperion806-18589, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=33}
02:44:34,690 TRACE [org.jgroups.protocols.UNICAST2] (OOB-94,hyperion811-7798) hyperion811-7798 --> DATA(hyperion806-18589: #178639, conn_id=3)
{code}

Then these XMIT lines keep repeating:
{code}
02:45:07,652 TRACE [org.jgroups.protocols.UNICAST2] (OOB-94,hyperion811-7798) hyperion811-7798 <-- XMIT(hyperion806-18589: #[178639])
02:45:07,652 BYTEMAN (OOB-94,hyperion811-7798): XMIT 178639 to hyperion806-18589
{code}
(The order of messages in the log may differ because BYTEMAN writes to stdout rather than through the logging system.)
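
To see how often a given response is re-requested, the DATA and XMIT lines can be matched by sequence number. The following is only a rough sketch in Python, assuming the UNICAST2 TRACE line format quoted above. The function name and the regular expressions are illustrative and not part of JGroups or Infinispan, and the XMIT list is assumed to contain individual sequence numbers, as in the sample.

```python
import re
from collections import Counter

# Patterns modeled on the UNICAST2 TRACE lines quoted above (an assumption:
# other JGroups versions may format these lines differently).
DATA_RE = re.compile(r"--> DATA\((?P<dest>[^:]+): #(?P<seqno>\d+), conn_id=\d+\)")
XMIT_RE = re.compile(r"<-- XMIT\((?P<src>[^:]+): #\[(?P<seqnos>[^\]]*)\]\)")


def count_retransmit_requests(lines):
    """Return (sent, requested) for the given log lines.

    sent      -- set of (peer, seqno) pairs seen in outgoing DATA lines
    requested -- Counter of (peer, seqno) pairs seen in incoming XMIT lines
    """
    sent = set()
    requested = Counter()
    for line in lines:
        match = DATA_RE.search(line)
        if match:
            sent.add((match.group("dest"), int(match.group("seqno"))))
            continue
        match = XMIT_RE.search(line)
        if match:
            # Assumes the bracketed list holds plain sequence numbers.
            for token in re.findall(r"\d+", match.group("seqnos")):
                requested[(match.group("src"), int(token))] += 1
    return sent, requested


if __name__ == "__main__":
    import sys

    # Usage: python count_xmit.py node8.log  (the file name is hypothetical)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        sent, requested = count_retransmit_requests(log)
    for (peer, seqno), count in requested.most_common():
        marker = "sent" if (peer, seqno) in sent else "no DATA line seen"
        print(f"{peer} #{seqno}: {count} XMIT request(s), {marker}")
```

A sequence number that appears in a DATA line and then accumulates many XMIT requests from the same peer indicates that the sender transmitted the message but the peer keeps asking for it again.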


> Cache recovery fails due to missing responses
> ---------------------------------------------
>
>                 Key: ISPN-2802
>                 URL: https://issues.jboss.org/browse/ISPN-2802
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.2.0.CR3
>            Reporter: Radim Vansa
>            Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, so if you can suggest a particular BYTEMAN hook, I can try it.
