[infinispan-issues] [JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses

Wed Feb 6 10:23:51 EST 2013

     [ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Vansa updated ISPN-2802:
------------------------------

    Description: 
When cache recovery starts, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for their responses. However, I have a reproducible test case in which it always times out waiting for the responses.

Here are the logs (TRACE-level logging is not feasible here, but I added some Byteman traces): http://dl.dropbox.com/u/103079234/recovery.zip
The problematic spot is on node3 at 05:37:57, when it receives cluster view 34.
All nodes (except the one that was killed, in this case node1) respond quickly to the GET_STATUS command (see the BYTEMAN Receiving - Received pairs, which are bound to command execution in CommandAwareRpcDispatcher), but some responses are never received on node3 (look for Receiving rsp, which is bound to GroupRequest).
JGroups tracing could be useful here, but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).

As mentioned above, the case is reproducible, so if you can suggest any particular Byteman hook, I can try it.
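
To make the failure mode concrete, the wait on the coordinator can be modelled roughly as in the sketch below. It is only an illustration and not Infinispan or JGroups code: the class StatusCollector, the method missingResponders, the node names node2 and node4, and the 200 ms timeout are all hypothetical. It shows a coordinator that waits under one shared deadline and then reports which nodes' responses never arrived, which is the information that would narrow down where the responses are lost.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model of a coordinator collecting status responses; not Infinispan code. */
public class StatusCollector {

    /**
     * Waits up to timeoutMillis in total (one shared deadline) for all pending
     * responses and returns the names of the nodes whose response did not arrive.
     */
    public static List<String> missingResponders(
            Map<String, CompletableFuture<String>> pending, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, CompletableFuture<String>> e : pending.entrySet()) {
            // Time left before the shared deadline; 0 still returns completed futures.
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                e.getValue().get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException | ExecutionException ex) {
                missing.add(e.getKey());
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                missing.add(e.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        // Hypothetical node that answered, and one whose response was never delivered.
        pending.put("node2", CompletableFuture.completedFuture("status"));
        pending.put("node4", new CompletableFuture<>());
        System.out.println(missingResponders(pending, 200)); // prints [node4]
    }
}
```

In the real system the equivalent information would have to come from a hook on the receiving side of GroupRequest, since the senders already log that they responded.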

  was:
When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.

Here are the logs (TRACE is not doable here, but I added some byteman traces): http://dl.dropbox.com/u/103079234/recovery.zip
The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but not all responses are not received on node3.
JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).

As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.

> Cache recovery fails due to missing responses
> ---------------------------------------------
>
>                 Key: ISPN-2802
>                 URL: https://issues.jboss.org/browse/ISPN-2802
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.2.0.CR3
>            Reporter: Radim Vansa
>            Assignee: Mircea Markus
