Dan Berindei commented on ISPN-2802:
------------------------------------
Radim, what's your JGroups configuration?
Note that the CacheTopologyControlCommand responses are not themselves marked with the
RSVP flag, so if they are dropped at any point it's up to UNICAST2 to re-transmit
them. So if UNICAST2.stable_interval is the JGroups default (60 seconds), that would
explain the timeout.
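For reference, the attribute in question lives on the UNICAST2 protocol in the JGroups stack XML; a minimal sketch (the value here is illustrative, not a recommendation):

{code:xml}
<!-- Fragment of a JGroups stack configuration (illustrative value). -->
<!-- Per the comment above, dropped responses are only re-sent by UNICAST2's -->
<!-- stability/re-transmission machinery, so the default stable_interval of -->
<!-- 60000 ms would line up with the multi-second delays in the log below. -->
<UNICAST2 stable_interval="5000"/>
{code}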
Most of the responses are received in < 40 ms, but a couple of them arrive only after
30 seconds. That suggests some re-transmission is involved, so you could try increasing
the GET_STATUS timeout (configured with {{transport().distributedSyncTimeout()}}).
{code}
05:37:57,709 BYTEMAN (transport-thread-20): Sending hyperion988-43314 -> null:
CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=hyperion988-43314,
joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=33}
... lots of other responses ...
05:37:57,748 BYTEMAN (OOB-1110,hyperion988-43314): Received rsp from hyperion1006-37219
05:38:05,231 BYTEMAN (OOB-1114,hyperion988-43314): Received rsp from hyperion1005-58850
05:38:30,750 BYTEMAN (OOB-1114,hyperion988-43314): Received rsp from hyperion1016-23122
05:38:33,252 BYTEMAN (OOB-1114,hyperion988-43314): Received rsp from hyperion1015-10195
{code}
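The GET_STATUS timeout can also be raised declaratively; a minimal sketch, assuming the Infinispan 5.2 XML schema (the cluster name and timeout value are illustrative):

{code:xml}
<!-- Infinispan global configuration fragment (illustrative values). -->
<!-- distributedSyncTimeout (in ms) bounds cluster-wide control operations -->
<!-- such as the GET_STATUS exchange during cache recovery. -->
<global>
   <transport clusterName="my-cluster"
              distributedSyncTimeout="480000"/>
</global>
{code}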
It looks like the cluster doesn't have much activity apart from the rebalancing,
because there are a lot of log messages about UNICAST2 connections being reaped after 60
seconds of inactivity. A couple of JGroups bugs might be relevant here: JGRP-1563
(which should actually be fixed in JGroups 3.2.6.Final, the version we use in 5.2.0.CR3),
and JGRP-1586 (which won't be fixed until JGroups 3.3). So you should definitely run
your test with {{UNICAST2.conn_expiry_timeout=0}}.
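In the stack XML that setting would look like this (the rest of the UNICAST2 attributes stay whatever your stack already uses):

{code:xml}
<!-- Setting conn_expiry_timeout to 0 disables reaping of idle UNICAST2 -->
<!-- connections, sidestepping JGRP-1563 / JGRP-1586 for this test. -->
<UNICAST2 conn_expiry_timeout="0"/>
{code}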
We still need to get some logging from UNICAST2, though: we need to see when the messages
are actually sent, when they are re-transmitted, when the connections are created, and so
on. If you think enabling TRACE for UNICAST2 isn't feasible with log4j, then we should
definitely look into log4j2.
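Before switching to log4j2, one thing worth trying is routing the UNICAST2 logger through log4j's AsyncAppender so the OOB threads don't block on appender locks; a hedged log4j 1.x sketch ({{FILE}} is a placeholder for an appender already defined in your config):

{code:xml}
<!-- log4j.xml fragment (illustrative): buffer UNICAST2 TRACE output -->
<!-- asynchronously. With Blocking="false", events are dropped rather -->
<!-- than stalling the logging thread when the buffer fills up. -->
<appender name="ASYNC" class="org.apache.log4j.AsyncAppender">
   <param name="BufferSize" value="8192"/>
   <param name="Blocking" value="false"/>
   <appender-ref ref="FILE"/>
</appender>

<logger name="org.jgroups.protocols.UNICAST2" additivity="false">
   <level value="TRACE"/>
   <appender-ref ref="ASYNC"/>
</logger>
{code}

Note that dropped log events are the trade-off for not blocking, so the trace may have gaps under load.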
Cache recovery fails due to missing responses
---------------------------------------------
Key: ISPN-2802
URL: https://issues.jboss.org/browse/ISPN-2802
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 5.2.0.CR3
Reporter: Radim Vansa
Assignee: Mircea Markus
When cache recovery starts, the new coordinator sends
CacheTopologyControlCommand.GET_STATUS to all nodes and waits for their responses.
However, I have a reproducible test case where it always times out waiting for the
responses.
Here are the logs (TRACE is not doable here, but I added some byteman traces - see
topology.btm in the archive):
http://dl.dropbox.com/u/103079234/recovery.zip
The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
All nodes (except the one that is killed, in this case node1) respond quickly to the
GET_STATUS command (see the BYTEMAN Receiving/Received pairs, which are bound to command
execution in CommandAwareRpcDispatcher), but some responses are not received on node3
(look for Receiving rsp, bound to GroupRequest).
JGroups tracing could be useful here, but it is not available: intensive logging often
blocks on internal log4j locks and the node becomes unresponsive.
As mentioned above, the case is reproducible, so if you can suggest any particular
BYTEMAN hook, I can try it.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
http://www.atlassian.com/software/jira