[infinispan-issues] [JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses

Mon Feb 11 08:41:56 EST 2013

    [ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753394#comment-12753394 ] 

Bela Ban commented on ISPN-2802:
--------------------------------

OK, so I guess I know the problem...

First off, 806 asks 811 to retransmit message #178639 from 2:45:07 to 2:45:49, i.e. for about 42 seconds:
(cat node8.log |grep -e "<-- XMIT(hyperion806-18589" -e "<-- STABLE(hyperion806-18589")

At 2:44:34, 806 actually does get message 178639 from 811:
02:44:34,677 [UNICAST2] hyperion806-18589 <-- DATA(hyperion811-7798: #178637, conn_id=3)

and we can see that 811 gets a STABLE message from 806 at 2:46:48:

02:46:48,246 [UNICAST2] hyperion811-7798 <-- STABLE(hyperion806-18589: 178641-178641, conn_id=3)

My guess is that the retransmitted message (especially #178639) is an OOB message, and that the OOB thread pool at the requester (806) is full. Each retransmitted copy is therefore dropped until a thread finally becomes available to process it.
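
To illustrate the suspected effect, here is a minimal sketch using only java.util.concurrent, not JGroups itself. It assumes a pool without a queue and with a "discard" rejection policy, which may or may not match the actual OOB pool configuration in this test; the class and method names are made up for the example. While all workers are busy, additional tasks are dropped silently, which is what would force the sender to keep retransmitting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a full OOB thread pool; not JGroups code.
class OobPoolDropSketch {

    /**
     * Submits `messages` tasks to a pool of `maxThreads` workers while every
     * worker is blocked, and returns how many tasks were actually processed.
     */
    static int deliver(int maxThreads, int messages) {
        AtomicInteger processed = new AtomicInteger();
        CountDownLatch release = new CountDownLatch(1);

        // Assumption: no queue (SynchronousQueue hands off directly) and
        // rejected tasks are discarded without any error.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 30, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>(),
                new ThreadPoolExecutor.DiscardPolicy());

        for (int i = 0; i < messages; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // simulate a worker stuck on a slow task
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                processed.incrementAndGet();
            });
        }

        release.countDown(); // let the accepted tasks finish
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        // With 2 workers and 5 messages, only the first 2 are processed.
        System.out.println("processed " + deliver(2, 5) + " of 5");
    }
}
```

With 2 threads and 5 messages, the 3 messages that arrive while both workers are busy are lost without any exception or log entry, so from the outside it only shows up as repeated XMIT requests.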

I guess this makes it all the more important to get that Infinispan-internal thread pool.

> Cache recovery fails due to missing responses
> ---------------------------------------------
>
>                 Key: ISPN-2802
>                 URL: https://issues.jboss.org/browse/ISPN-2802
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.2.0.CR3
>            Reporter: Radim Vansa
>            Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.
