[infinispan-issues] [JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses

Dan Berindei (JIRA) jira-events at lists.jboss.org
Fri Feb 8 04:18:51 EST 2013


    [ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752964#comment-12752964 ] 

Dan Berindei commented on ISPN-2802:
------------------------------------

Radim, what's your JGroups configuration? 

Note that the CacheTopologyControlCommand responses are not themselves marked with the RSVP flag, so if they are dropped at any point it's up to UNICAST2 to re-transmit them. So if UNICAST2.stable_interval is the JGroups default (60 seconds), that would explain the timeout.

Most of the messages are received in < 40ms, but there are a couple response messages that arrive only after 30 seconds. That suggests there is some re-transmission involved, so you could try increasing the GET_STATUS timeout (configured with transport().distributedSyncTimeout()).

{code}
05:37:57,709 BYTEMAN (transport-thread-20): Sending hyperion988-43314 -> null: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=hyperion988-43314, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=33}
... lots of other responses ...
05:37:57,748 BYTEMAN (OOB-1110,hyperion988-43314): Received rsp from hyperion1006-37219
05:38:05,231 BYTEMAN (OOB-1114,hyperion988-43314): Received rsp from hyperion1005-58850
05:38:30,750 BYTEMAN (OOB-1114,hyperion988-43314): Received rsp from hyperion1016-23122
05:38:33,252 BYTEMAN (OOB-1114,hyperion988-43314): Received rsp from hyperion1015-10195
{code}

It looks like the cluster doesn't have much activity except for the rebalancing, because there are a lot of log messages about UNICAST2 connections being reaped after 60 seconds of inactivity. There are a couple JGroups bugs that might be relevant: JGRP-1563 (which should actually be fixed in JGroups 3.2.6.Final, the version we use in 5.2.0.CR3), and JGRP-1586 (which won't be fixed until JGroups 3.3). So you should definitely run your test with {{UNICAST2.conn_expiry_timeout=0}}.

We still need to get some logging from UNICAST2, though - we need to see when the messages are actually sent, when they are re-transmitted, when the connections are created etc. If you think enabling TRACE for UNICAST2 isn't feasible with log4j, then we should definitely look into log4j2.
                
> Cache recovery fails due to missing responses
> ---------------------------------------------
>
>                 Key: ISPN-2802
>                 URL: https://issues.jboss.org/browse/ISPN-2802
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.2.0.CR3
>            Reporter: Radim Vansa
>            Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


More information about the infinispan-issues mailing list