[jboss-jira] [JBoss JIRA] (JGRP-2030) GMS: view_ack_collection_timeout delay when last 2 members leave concurrently

Dan Berindei (JIRA) issues at jboss.org
Thu Mar 17 06:00:02 EDT 2016


Dan Berindei created JGRP-2030:
----------------------------------

             Summary: GMS: view_ack_collection_timeout delay when last 2 members leave concurrently
                 Key: JGRP-2030
                 URL: https://issues.jboss.org/browse/JGRP-2030
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.6.8
            Reporter: Dan Berindei
            Assignee: Bela Ban


When the coordinator ({{NodeE}}) leaves, it tries to install a new view on behalf of the new coordinator ({{NodeG}}, the last member).

{noformat}
21:33:26,844 TRACE (ViewHandler,InitialClusterSizeTest-NodeE-42422:) [GMS] InitialClusterSizeTest-NodeE-42422: mcasting view [InitialClusterSizeTest-NodeG-30521|3] (1) [InitialClusterSizeTest-NodeG-30521] (1 mbrs)
21:33:26,844 TRACE (ViewHandler,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] InitialClusterSizeTest-NodeE-42422: sending msg to null, src=InitialClusterSizeTest-NodeE-42422, headers are GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster_name=ISPN]
{noformat}

The message is actually sent later by the bundler, but {{NodeG}} is also sending its {{LEAVE_REQ}} message at the same time. Both nodes try to create a connection to each other, and only {{NodeG}} succeeds:

{noformat}
21:33:26,844 TRACE (ForkThread-2,InitialClusterSizeTest:) [TCP_NIO2] InitialClusterSizeTest-NodeG-30521: sending msg to InitialClusterSizeTest-NodeE-42422, src=InitialClusterSizeTest-NodeG-30521, headers are GMS: GmsHeader[LEAVE_REQ]: mbr=InitialClusterSizeTest-NodeG-30521, UNICAST3: DATA, seqno=1, conn_id=1, first, TP: [cluster_name=ISPN]

21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] InitialClusterSizeTest-NodeG-30521: sending 1 msgs (83 bytes (0.27% of max_bundle_size) to 1 dests(s): [ISPN:InitialClusterSizeTest-NodeE-42422]
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] InitialClusterSizeTest-NodeE-42422: sending 1 msgs (91 bytes (0.29% of max_bundle_size) to 1 dests(s): [ISPN]
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] dest=127.0.0.1:7900 (86 bytes)
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] dest=127.0.0.1:7920 (94 bytes)
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] 127.0.0.1:7900: connecting to 127.0.0.1:7920
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] 127.0.0.1:7920: connecting to 127.0.0.1:7900
21:33:26,866 TRACE (NioConnection.Reader [null],InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] 127.0.0.1:7920: rejected connection from 127.0.0.1:7900  (connection existed and my address won as it's higher)
21:33:26,867 TRACE (OOB-1,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] InitialClusterSizeTest-NodeE-42422: received [dst: InitialClusterSizeTest-NodeE-42422, src: InitialClusterSizeTest-NodeG-30521 (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[LEAVE_REQ]: mbr=InitialClusterSizeTest-NodeG-30521, UNICAST3: DATA, seqno=1, conn_id=1, first, TP: [cluster_name=ISPN]
{noformat}
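
The "rejected connection ... my address won as it's higher" line suggests a tie-break for simultaneous connects. Below is a minimal sketch of such a rule, inferred from that log message only. The class {{ConnectTieBreak}}, its method names and the exact ordering (IP bytes, then port) are assumptions for illustration, not the JGroups implementation.

```java
import java.net.InetSocketAddress;

// Hypothetical illustration; not a JGroups class.
class ConnectTieBreak {
    // Assumed ordering: compare IP bytes as unsigned values, then the port.
    static int compare(InetSocketAddress a, InetSocketAddress b) {
        byte[] x = a.getAddress().getAddress();
        byte[] y = b.getAddress().getAddress();
        if (x.length != y.length)
            return Integer.compare(x.length, y.length);
        for (int i = 0; i < x.length; i++) {
            int c = Integer.compare(x[i] & 0xff, y[i] & 0xff);
            if (c != 0)
                return c;
        }
        return Integer.compare(a.getPort(), b.getPort());
    }

    // Reject an incoming connection only if we already have a connection
    // to that peer and our own address is the higher one.
    static boolean rejectIncoming(InetSocketAddress local, InetSocketAddress peer,
                                  boolean haveConnectionToPeer) {
        return haveConnectionToPeer && compare(local, peer) > 0;
    }

    public static void main(String[] args) {
        InetSocketAddress nodeE = new InetSocketAddress("127.0.0.1", 7900);
        InetSocketAddress nodeG = new InetSocketAddress("127.0.0.1", 7920);
        // NodeG (7920) already has its own connection to NodeE (7900).
        System.out.println("NodeG rejects NodeE: " + rejectIncoming(nodeG, nodeE, true));
        System.out.println("NodeE rejects NodeG: " + rejectIncoming(nodeE, nodeG, true));
    }
}
```

Under this assumed rule only the connection opened by the higher address survives, so whatever the lower-address node had queued for its own connection needs another way to get delivered.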

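{{NodeE}}'s connection attempt is rejected, so the {{VIEW}} message apparently never reaches {{NodeG}}, and {{NodeE}} blocks waiting for the view ack until {{view_ack_collection_timeout}} expires. I'm guessing {{NodeE}} would need a {{STABLE}} round in order to retransmit the {{VIEW}} message, but I'm not sure if the stable round would work, since it already (partially?) installed the new view with {{NodeG}} as the only member. However, I think it should be possible for {{NodeE}} to remove {{NodeG}} from its {{AckCollector}} once it receives {{NodeG}}'s {{LEAVE_REQ}}, and stop blocking.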
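
To make the suggestion concrete, here is a rough sketch of an ack collector that stops waiting for a member once that member's {{LEAVE_REQ}} is seen. {{SimpleAckCollector}} and the method {{memberLeaving}} are hypothetical names made up for this example; this is not the JGroups {{AckCollector}} class, and member names are plain strings.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in; not the JGroups AckCollector.
class SimpleAckCollector {
    private final Set<String> missing = new HashSet<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition allAcked = lock.newCondition();

    SimpleAckCollector(Collection<String> expected) {
        missing.addAll(expected);
    }

    // A view ack arrived from this member.
    void ack(String member) {
        remove(member);
    }

    // Proposed: a LEAVE_REQ arrived from a member we are still waiting on,
    // so its ack is no longer needed.
    void memberLeaving(String member) {
        remove(member);
    }

    private void remove(String member) {
        lock.lock();
        try {
            if (missing.remove(member) && missing.isEmpty())
                allAcked.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Returns true if every expected member acked or left before the timeout.
    boolean waitForAllAcks(long timeoutMs) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (!missing.isEmpty()) {
                if (nanos <= 0)
                    return false;
                nanos = allAcked.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        SimpleAckCollector collector = new SimpleAckCollector(Arrays.asList("NodeG"));
        // Simulate NodeG's LEAVE_REQ arriving shortly after the view was multicast.
        new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                return;
            }
            collector.memberLeaving("NodeG");
        }).start();
        System.out.println("done before timeout: " + collector.waitForAllAcks(2000));
    }
}
```

With this shape, the thread installing the view would be released as soon as the leave request is handled instead of sitting out the full timeout. Whether that is safe with respect to the partially installed view is exactly the open question above.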

This is a minor annoyance in a few of the Infinispan tests - most of them shut down the nodes serially, so they don't see this delay.

The question is whether the concurrent connection setup can have an impact on other messages as well - e.g. during startup, when there aren't a lot of messages being sent around to trigger retransmission. Could the node that failed to open its connection retry immediately on the connection opened by the other node?
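
As a toy model of that last question: messages that could not be sent are parked per peer and flushed as soon as the peer's own connection is registered. Everything here ({{RetryOnPeerConnection}}, {{Conn}}, {{onPeerConnectionAccepted}}) is hypothetical and only illustrates the idea; it says nothing about how the JGroups transport actually orders these events.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical model; none of these types are JGroups classes.
class RetryOnPeerConnection {
    // Stand-in for an established connection; records what was written to it.
    static class Conn {
        final List<String> written = new ArrayList<>();

        void write(String msg) {
            written.add(msg);
        }
    }

    private final Map<String, Conn> conns = new HashMap<>();            // peer -> usable connection
    private final Map<String, Queue<String>> pending = new HashMap<>(); // peer -> unsent messages

    // Send now if a connection exists, otherwise park the message.
    void send(String peer, String msg) {
        Conn c = conns.get(peer);
        if (c != null)
            c.write(msg);
        else
            pending.computeIfAbsent(peer, k -> new ArrayDeque<>()).add(msg);
    }

    // The peer's connection was accepted (our own attempt lost the tie-break):
    // register it and immediately resend whatever was parked for that peer.
    void onPeerConnectionAccepted(String peer, Conn c) {
        conns.put(peer, c);
        Queue<String> q = pending.remove(peer);
        if (q != null)
            for (String m : q)
                c.write(m);
    }

    public static void main(String[] args) {
        RetryOnPeerConnection nodeE = new RetryOnPeerConnection();
        nodeE.send("NodeG", "VIEW");                  // no usable connection yet
        Conn fromNodeG = new Conn();
        nodeE.onPeerConnectionAccepted("NodeG", fromNodeG);
        System.out.println("sent over NodeG's connection: " + fromNodeG.written);
    }
}
```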



--
This message was sent by Atlassian JIRA
(v6.4.11#64026)