[JBoss JIRA] (JGRP-2461) Clustering can fail when re-adding an existing node using TCP_NIO2

Thursday, 19 March 2020

    [ https://issues.redhat.com/browse/JGRP-2461?page=com.atlassian.jira.plugin... ]

Robert Mitchell commented on JGRP-2461:
---------------------------------------

We do rely on use_ip_addrs, mostly because we want to know whether certain
"special" nodes are in the cluster.  We could probably do that in other ways,
but because the IP addresses of these nodes are well known throughout the cluster, it was
the easiest way to do it.

I had wondered whether this setting was part of the problem, but have not been able to find
time to investigate.  Even so, I think the problem could still happen without it.  The
problem is that the connection made from the cluster coordinator is tied to the old logical
address of the node, and the node won't let the coordinator make a second connection that
uses the new logical address.  I think TCPPING would not cause problems without this
setting, because its use of the IpAddressUUID as a PhysicalAddress is closely tied to the
issue.  However, for UNICAST3, I would think its need to send the unacknowledged message to
the logical address would still tie the connection to the old logical address even if
use_ip_addrs were not set.

...
 Clustering can fail when re-adding an existing node using TCP_NIO2
 ------------------------------------------------------------------

                 Key: JGRP-2461
                 URL: https://issues.redhat.com/browse/JGRP-2461
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 4.1.8
            Reporter: Robert Mitchell
            Assignee: Bela Ban
            Priority: Major

 When a node leaves a cluster and later attempts to re-enter, a race condition can
occur in which clustering fails.  Here is the sequence of events that seems to
allow this:
 # The rejoining node must have a "higher" IP address than the current cluster
coordinator.
 # On the rejoin attempt, the coordinator sends a message to the rejoining node's prior
address before the rejoining node sends anything to the coordinator.  I have seen this
happen for two reasons:
 ## UNICAST3 is resending messages (which often happens with the final LEAVE_RSP from the
prior cluster membership because it apparently does not get acked before the connection
closes)
 ## TCPPING is sending a ping request to the cached prior address.
 # The connection gets established.  It will then be used by the rejoining node whenever
communicating with the cluster coordinator.
 # However, the cluster coordinator has this registered as the connection for the prior
address.  So the following happens whenever it wants to send a message to the rejoining node:
 ## It will attempt to create a new connection.
 ## The rejoining node will reject it as a redundant connection, with its current
connection taking precedence, since the new connection comes from the same logical address
as the "bad" connection.
 Since the messages needed to find and join the cluster or merge the two clusters are all
unicast messages, the rejoining node will never receive them and will not be able to join
until something causes the initial connection to be closed.
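
The sequence above can be sketched as a small stand-alone Java model. This is only an illustration under stated assumptions, not JGroups code: the Node class, the "ip/incarnation" address strings, the send/accept methods and the "higher IP keeps its existing connection" tie-break are all hypothetical stand-ins for the connection table and redundant-connection handling described in the report. IPs are compared as plain strings, which is enough for the sample addresses used here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the scenario described above; none of this is JGroups code.
class ReconnectRaceSketch {

    // Hypothetical connection table: one entry per peer address ("ip/incarnation").
    static class Node {
        final String self;
        final Map<String, String> connections = new HashMap<>();

        Node(String self) { this.self = self; }

        // Outgoing side: reuse the connection stored for dest, else try to open one.
        boolean send(Node target, String dest) {
            if (connections.containsKey(dest)) return true;
            if (!target.accept(self)) return false;
            connections.put(dest, "outgoing");
            return true;
        }

        // Incoming side: a second connection from a known peer is treated as redundant
        // and rejected when the local IP is "higher" (assumed tie-break rule).
        boolean accept(String peer) {
            boolean known = connections.containsKey(peer);
            if (known && ip(self).compareTo(ip(peer)) > 0) return false;
            connections.put(peer, "incoming");
            return true;
        }
    }

    static String ip(String addr) { return addr.substring(0, addr.indexOf('/')); }

    // Returns { sendToOldAddress, rejoinerReusesConnection, sendToNewAddress, sendAfterClose }.
    static boolean[] simulate(String coordIp, String rejoinerIp) {
        Node coord = new Node(coordIp + "/1");
        String oldAddr = rejoinerIp + "/1";            // address from the prior membership
        Node rejoiner = new Node(rejoinerIp + "/2");   // same IP, new logical identity

        // Step 2: coordinator reaches the node first, using the cached prior address.
        boolean toOld = coord.send(rejoiner, oldAddr);
        // Step 3: the rejoining node reuses that connection to talk to the coordinator.
        boolean reused = rejoiner.send(coord, coord.self);
        // Step 4: coordinator has no entry for the new address and tries a second connection.
        boolean toNew = coord.send(rejoiner, rejoiner.self);

        // Closing the initial connection on both sides clears the stale entries.
        coord.connections.remove(oldAddr);
        rejoiner.connections.remove(coord.self);
        boolean afterClose = coord.send(rejoiner, rejoiner.self);

        return new boolean[] { toOld, reused, toNew, afterClose };
    }

    public static void main(String[] args) {
        // Rejoining node has the higher IP: the send to the new address is blocked.
        System.out.println(Arrays.toString(simulate("10.0.0.1", "10.0.0.9")));
        // Rejoining node has the lower IP: the second connection is accepted.
        System.out.println(Arrays.toString(simulate("10.0.0.9", "10.0.0.1")));
    }
}
```

In this model the first run yields [true, true, false, true]: the coordinator cannot reach the new address until the initial connection is closed. The second run yields all true, which matches the observation that the rejoining node must have the "higher" IP address for the failure to appear.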

--
This message was sent by Atlassian Jira
(v7.13.8#713008)
