[jboss-jira] [JBoss JIRA] (JGRP-2470) JDBC_PING can face a split-brain issue when restarting a coordinator node

Tue Aug 11 10:59:00 EDT 2020

    [ https://issues.redhat.com/browse/JGRP-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380772#comment-14380772 ] 

Bela Ban edited comment on JGRP-2470 at 8/11/20 10:58 AM:
----------------------------------------------------------

OK, so {{removeAll()}} is only called when {{is_coord}} is true. So I suggest instead to set {{is_coord}} to false in {{super.stop()}} ({{Discovery}}). This is in line with setting {{is_server}} to false in {{Discovery}}.

Testing...

This works: by the time step 5 runs, {{is_coord}} is already false, so {{removeAll()}} is not called.
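
To illustrate the idea, here is a minimal, self-contained sketch. It is not the JGroups source: {{DiscoverySketch}}, {{PingSketch}}, {{lateCleanup()}} and the {{resetCoordOnStop}} switch are hypothetical names, and it only assumes what is stated above, namely that {{removeAll()}} is guarded by {{is_coord}} and that the proposal is to clear {{is_coord}} in {{stop()}}.

```java
import java.util.ArrayList;
import java.util.List;

class StopOrderSketch {
    /** Hypothetical stand-in for Discovery; only the two flags matter here. */
    static class DiscoverySketch {
        protected boolean is_server = true;
        protected boolean is_coord = true;
        // Assumption for the demo: true means the proposed change is applied.
        protected final boolean resetCoordOnStop;

        DiscoverySketch(boolean resetCoordOnStop) {
            this.resetCoordOnStop = resetCoordOnStop;
        }

        public void stop() {
            is_server = false;
            if (resetCoordOnStop)
                is_coord = false; // the proposed change
        }
    }

    /** Hypothetical stand-in for the table-backed discovery protocol. */
    static class PingSketch extends DiscoverySketch {
        final List<String> calls = new ArrayList<>();

        PingSketch(boolean resetCoordOnStop) {
            super(resetCoordOnStop);
        }

        void removeAll() {
            calls.add("removeAll"); // record the call instead of touching a database
        }

        /** Cleanup that may still run after stop(), as in step 5. */
        void lateCleanup() {
            if (is_coord)
                removeAll();
        }
    }

    /** Returns true if removeAll() was invoked by cleanup running after stop(). */
    static boolean removeAllCalledAfterStop(boolean resetCoordOnStop) {
        PingSketch ping = new PingSketch(resetCoordOnStop);
        ping.stop();
        ping.lateCleanup();
        return ping.calls.contains("removeAll");
    }

    public static void main(String[] args) {
        System.out.println("without change: " + removeAllCalledAfterStop(false));
        System.out.println("with change:    " + removeAllCalledAfterStop(true));
    }
}
```

In this model, the late cleanup only wipes the table when the flag was left set by {{stop()}}.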

> JDBC_PING can face a split-brain issue when restarting a coordinator node
> -------------------------------------------------------------------------
>
>                 Key: JGRP-2470
>                 URL: https://issues.redhat.com/browse/JGRP-2470
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.1.9, 4.0.22
>            Reporter: Masafumi Miura
>            Assignee: Radoslav Husar
>            Priority: Major
>             Fix For: 4.2.5, 5.0.1
>
>
> After [the change|https://github.com/belaban/JGroups/commit/215cdb6] for JGRP-2199, JDBC_PING deletes all entries from the table when the coordinator node shuts down.
> This can cause a split-brain when a coordinator node is restarted: in the following scenario all entries are lost, so the restarting node cannot find any information about the existing nodes in the table and does not join the existing cluster.
> 0. node1 and node2 form a cluster; node1 is the coordinator.
> 1. A restart of node1 is triggered.
> 2. node1 removes its own node information from the table.
> 3. node2 becomes the new coordinator.
> 4. node2 updates its own node information in the table.
> 5. node1 clears all entries from the table.
> 6. node1 starts again.
> 7. node1 does not join the existing cluster because the table contains no node information.
> Note: If step 5 happens before step 4, the split-brain does not occur. However, steps 4 and 5 run on different nodes and can execute in parallel, so their order is undefined. For example, if the shutdown of node1 takes a long time, the issue is likely to occur.
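
The effect of the two possible orderings of steps 4 and 5 can be replayed with a small in-memory model. This is only a sketch under stated assumptions: the table is modelled as a set of node names, and {{TableRaceSketch}} and {{replay()}} are hypothetical names that do not exist in JGroups. The real race is between two nodes, so the sketch replays each ordering sequentially rather than using threads.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class TableRaceSketch {
    /**
     * Replays the scenario from the issue description and returns the table
     * contents that node1 would see when it starts again (step 6).
     */
    static Set<String> replay(boolean clearBeforeNode2Update) {
        Set<String> table = new LinkedHashSet<>();
        table.add("node1");            // step 0: both nodes are registered
        table.add("node2");
        table.remove("node1");         // step 2: node1 removes its own entry
        if (clearBeforeNode2Update) {
            table.clear();             // step 5: node1 clears all entries
            table.add("node2");        // step 4: node2 writes its entry afterwards
        } else {
            table.add("node2");        // step 4: node2 updates its entry
            table.clear();             // step 5: node1 wipes it again
        }
        return table;                  // steps 6-7: what the restarted node1 reads
    }

    public static void main(String[] args) {
        System.out.println("step 4 then step 5: " + replay(false)); // prints []
        System.out.println("step 5 then step 4: " + replay(true));  // prints [node2]
    }
}
```

With step 4 before step 5 the table ends up empty, so the restarted node has nothing to discover; with the opposite order the entry for node2 survives.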

--
This message was sent by Atlassian Jira
(v7.13.8#713008)