[JBoss JIRA] (JGRP-2470) JDBC_PING can face a split-brain issue when restarting a coordinator node

Wednesday, 27 May 2020

    [ https://issues.redhat.com/browse/JGRP-2470?page=com.atlassian.jira.plugin... ]

Radoslav Husar commented on JGRP-2470:
--------------------------------------

bq. What's the rationale for reverting JGRP-2199? If the old coord removes all
information, the new one will re-insert it (actually multiple times, if configured)...

That's the problem: the two operations are not synchronized in any way, so this is racy.
As Masafumi explained, if the database orders them so that 1. the new coordinator
re-inserts its data and 2. the stopping coordinator's clear-table query
{{clearTable(String clustername)}} then runs, the restarted member won't discover
anything and will start a singleton cluster.

bq. It seems that a periodic findMembers() triggered by MERGE3 can heal the singleton
clusters situation. The findMembers() can insert their own node information when the table
is empty. But this happens after the interval calculated by Math.max(min_interval,
Util.random(max_interval) + max_interval/2) where min_interval is 10000 and max_interval
is 30000 in JBoss EAP by default. And, the actual merge happens after this. So, it could
take a long time to heal the singleton clusters situation.
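To put numbers on the quoted formula, here is a minimal sketch of the interval calculation. It assumes that Util.random(n) returns a value in [1, n]; the random() helper below is a stand-in for it, and the class and method names are hypothetical, not JGroups code.

```java
import java.util.concurrent.ThreadLocalRandom;

public class Merge3IntervalSketch {
    // Stand-in for JGroups' Util.random(); ASSUMED to return a value in [1, range].
    static long random(long range) {
        return ThreadLocalRandom.current().nextLong(range) + 1;
    }

    // The formula quoted above: Math.max(min_interval, Util.random(max_interval) + max_interval/2).
    static long interval(long minInterval, long maxInterval) {
        return Math.max(minInterval, random(maxInterval) + maxInterval / 2);
    }

    // Smallest value the formula can produce under the [1, range] assumption.
    static long lowerBound(long minInterval, long maxInterval) {
        return Math.max(minInterval, 1 + maxInterval / 2);
    }

    // Largest value the formula can produce under the [1, range] assumption.
    static long upperBound(long minInterval, long maxInterval) {
        return Math.max(minInterval, maxInterval + maxInterval / 2);
    }

    public static void main(String[] args) {
        long min = 10_000, max = 30_000; // defaults quoted for JBoss EAP, in milliseconds
        System.out.println("bounds: " + lowerBound(min, max) + " .. " + upperBound(min, max) + " ms");
        System.out.println("sample: " + interval(min, max) + " ms");
    }
}
```

Under that assumption the quoted defaults give an interval of roughly 15 to 45 seconds before findMembers() runs, and the merge itself only starts after that.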

We don't even need to go into the details of merging partitions: by design, discovery
must never lead to this situation, because it causes partitions and data loss.

...
 JDBC_PING can face a split-brain issue when restarting a coordinator node
 -------------------------------------------------------------------------

                 Key: JGRP-2470
                 URL: https://issues.redhat.com/browse/JGRP-2470
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 4.1.9, 4.0.22
            Reporter: Masafumi Miura
            Assignee: Radoslav Husar
            Priority: Major
             Fix For: 4.2.5

 After [the change|https://github.com/belaban/JGroups/commit/215cdb6] for JGRP-2199,
JDBC_PING deletes all entries from the table during the shutdown of the coordinator node.

 This behavior can cause a split brain when a coordinator node restarts. In the following
scenario all entries are lost, so the restarting node cannot find any information about
the existing nodes in the table and does not join the existing cluster.
 0. node1 and node2 form a cluster. node1 is the coordinator.
 1. A restart of node1 is triggered.
 2. node1 removes its own node information from the table.
 3. node2 becomes the new coordinator.
 4. node2 updates its node information in the table.
 5. node1 clears all entries from the table.
 6. node1 starts again.
 7. node1 does not join the existing cluster because there is no node information in the
table.
 Note: If step 5 happens before step 4, the split brain does not occur. However, steps 4
and 5 run on different nodes and can execute in parallel, so their order is undefined.
For example, if the shutdown of node1 takes a long time, the issue is likely to occur.
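
 The two possible orderings of steps 4 and 5 can be replayed with a small sketch. It uses an in-memory set as a stand-in for the JDBC_PING table; the PingTable class and the restartCoordinator() method are hypothetical names for illustration and do not reflect the JGroups implementation or its table schema.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ClearTableRaceSketch {
    // In-memory stand-in for the rows of one cluster in the JDBC_PING table (hypothetical).
    static final class PingTable {
        private final Set<String> rows = new LinkedHashSet<>();
        synchronized void insert(String node) { rows.add(node); }
        synchronized void remove(String node) { rows.remove(node); }
        synchronized void clear() { rows.clear(); }
        synchronized Set<String> findMembers() { return new LinkedHashSet<>(rows); }
    }

    // Replays steps 0-7; the flag selects whether step 5 runs before step 4.
    static Set<String> restartCoordinator(boolean clearBeforeReinsert) {
        PingTable table = new PingTable();
        table.insert("node1");                           // step 0: both nodes are registered
        table.insert("node2");
        table.remove("node1");                           // step 2: node1 removes its own entry
        Runnable reinsert = () -> table.insert("node2"); // step 4: new coordinator writes its entry
        Runnable clear = table::clear;                   // step 5: old coordinator clears the table
        if (clearBeforeReinsert) {
            clear.run();
            reinsert.run();
        } else {
            reinsert.run();
            clear.run();
        }
        return table.findMembers();                      // step 7: what node1 sees after restarting
    }

    public static void main(String[] args) {
        System.out.println("step 5 before step 4: " + restartCoordinator(true));
        System.out.println("step 4 before step 5: " + restartCoordinator(false));
    }
}
```

 In the first ordering the restarted node still finds node2 in the table. In the second the table is empty, which corresponds to the singleton-cluster case described above.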

--
This message was sent by Atlassian Jira
(v7.13.8#713008)
