Dan Berindei created JGRP-2030:
----------------------------------
Summary: GMS: view_ack_collection_timeout delay when last 2 members leave concurrently
Key: JGRP-2030
URL:
https://issues.jboss.org/browse/JGRP-2030
Project: JGroups
Issue Type: Bug
Affects Versions: 3.6.8
Reporter: Dan Berindei
Assignee: Bela Ban
When the coordinator ({{NodeE}}) leaves, it tries to install a new view on behalf of the
new coordinator ({{NodeG}}, the last member).
{noformat}
21:33:26,844 TRACE (ViewHandler,InitialClusterSizeTest-NodeE-42422:) [GMS]
InitialClusterSizeTest-NodeE-42422: mcasting view [InitialClusterSizeTest-NodeG-30521|3]
(1) [InitialClusterSizeTest-NodeG-30521] (1 mbrs)
21:33:26,844 TRACE (ViewHandler,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2]
InitialClusterSizeTest-NodeE-42422: sending msg to null,
src=InitialClusterSizeTest-NodeE-42422, headers are GMS: GmsHeader[VIEW], NAKACK2: [MSG,
seqno=1], TP: [cluster_name=ISPN]
{noformat}
The {{VIEW}} message is actually sent later, by the bundler, but {{NodeG}} is sending its
{{LEAVE_REQ}} message at the same time. Both nodes try to open a connection to each
other, and only {{NodeG}} succeeds:
{noformat}
21:33:26,844 TRACE (ForkThread-2,InitialClusterSizeTest:) [TCP_NIO2]
InitialClusterSizeTest-NodeG-30521: sending msg to InitialClusterSizeTest-NodeE-42422,
src=InitialClusterSizeTest-NodeG-30521, headers are GMS: GmsHeader[LEAVE_REQ]:
mbr=InitialClusterSizeTest-NodeG-30521, UNICAST3: DATA, seqno=1, conn_id=1, first, TP:
[cluster_name=ISPN]
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2]
InitialClusterSizeTest-NodeG-30521: sending 1 msgs (83 bytes (0.27% of max_bundle_size) to
1 dests(s): [ISPN:InitialClusterSizeTest-NodeE-42422]
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2]
InitialClusterSizeTest-NodeE-42422: sending 1 msgs (91 bytes (0.29% of max_bundle_size) to
1 dests(s): [ISPN]
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2]
dest=127.0.0.1:7900 (86 bytes)
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2]
dest=127.0.0.1:7920 (94 bytes)
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2]
127.0.0.1:7900: connecting to 127.0.0.1:7920
21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2]
127.0.0.1:7920: connecting to 127.0.0.1:7900
21:33:26,866 TRACE (NioConnection.Reader [null],InitialClusterSizeTest-NodeG-30521:)
[TCP_NIO2] 127.0.0.1:7920: rejected connection from 127.0.0.1:7900 (connection existed
and my address won as it's higher)
21:33:26,867 TRACE (OOB-1,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2]
InitialClusterSizeTest-NodeE-42422: received [dst: InitialClusterSizeTest-NodeE-42422,
src: InitialClusterSizeTest-NodeG-30521 (3 headers), size=0 bytes, flags=OOB], headers are
GMS: GmsHeader[LEAVE_REQ]: mbr=InitialClusterSizeTest-NodeG-30521, UNICAST3: DATA,
seqno=1, conn_id=1, first, TP: [cluster_name=ISPN]
{noformat}
I'm guessing {{NodeE}} would need a {{STABLE}} round in order to retransmit the
{{VIEW}} message, but I'm not sure the stable round would work, since it has already
(partially?) installed the new view with {{NodeG}} as the only member. However, I think it
should be possible for {{NodeE}} to remove {{NodeG}} from its {{AckCollector}} once
it receives {{NodeG}}'s {{LEAVE_REQ}}, and stop blocking.
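A minimal sketch of that idea, assuming a simplified ack collector (this is not the real {{org.jgroups.util.AckCollector}}; all names and signatures here are made up for illustration): the view installer blocks until every expected member has acked or {{view_ack_collection_timeout}} expires, and dropping the leaving member from the missing-acks set when its {{LEAVE_REQ}} arrives would unblock the wait early.
{code:java}
// Simplified, hypothetical ack collector - NOT the actual JGroups AckCollector.
// It only illustrates the proposed behaviour: remove a member from the missing-acks
// set when its LEAVE_REQ arrives, so the wait does not run into the timeout.
import java.util.HashSet;
import java.util.Set;

public class ViewAckCollector {
    private final Set<String> missingAcks = new HashSet<>();

    public synchronized void expectAcksFrom(Set<String> members) {
        missingAcks.clear();
        missingAcks.addAll(members);
    }

    // Called when a VIEW ack arrives from a member
    public synchronized void ack(String member) {
        if (missingAcks.remove(member))
            notifyAll();
    }

    // Proposed addition: called when a LEAVE_REQ arrives from a member we are
    // still waiting on, so the leaver no longer blocks the view installation
    public synchronized void memberLeft(String member) {
        if (missingAcks.remove(member))
            notifyAll();
    }

    // Blocks until all expected acks arrived or the timeout elapsed; a 'false'
    // return corresponds to the view_ack_collection_timeout delay seen in the test
    public synchronized boolean waitForAllAcks(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!missingAcks.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0)
                return false;
            wait(remaining);
        }
        return true;
    }
}
{code}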
This is only a minor annoyance in a few of the Infinispan tests - most of them shut down the
nodes serially, so they don't see this delay.
The question is whether the concurrent connection setup can affect other messages as
well - e.g. during startup, when there aren't many messages being sent around to trigger
retransmission. Could the node that failed to open its connection retry immediately on the
connection opened by the other node?
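A rough sketch of what that retry could look like, assuming a very simplified connection table (this is not the actual {{TCP_NIO2}} code; all names here are hypothetical): if the outgoing connect is rejected by the peer's tie-break, fall back to the connection the peer has already opened to us instead of waiting for a later retransmission.
{code:java}
// Hypothetical sketch only - not the real TCP_NIO2 connection handling.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionTable {
    interface Connection {
        void write(byte[] msg) throws Exception;
    }

    private final Map<String, Connection> connections = new ConcurrentHashMap<>();

    // Registers a connection accepted from a remote peer (the side that won the tie-break)
    public void accepted(String peer, Connection conn) {
        connections.putIfAbsent(peer, conn);
    }

    public void send(String dest, byte[] msg) throws Exception {
        Connection conn = connections.get(dest);
        if (conn == null) {
            try {
                conn = connectTo(dest); // may be rejected by the peer's tie-break
            }
            catch (Exception rejected) {
                // Suggested retry: the peer rejected our connect because it already
                // opened a connection to us, so reuse that one if it is there by now
                conn = connections.get(dest);
                if (conn == null)
                    throw rejected;
            }
            connections.putIfAbsent(dest, conn);
            conn = connections.get(dest);
        }
        conn.write(msg);
    }

    private Connection connectTo(String dest) throws Exception {
        // Placeholder for the actual NIO connect; the tie-break
        // ("higher address wins") happens on the remote side
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
{code}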