[jboss-jira] [JBoss JIRA] (JGRP-2470) JDBC_PING can face a split-brain issue when restarting a coordinator node

Wed Aug 12 02:50:00 EDT 2020

    [ https://issues.redhat.com/browse/JGRP-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384134#comment-14384134 ] 

Masafumi Miura commented on JGRP-2470:
--------------------------------------

I saw [the fix|https://github.com/belaban/JGroups/commit/9b5aeb5ea28d99fb104dafe2e98e6ee9146b5eb5#diff-6b7306e1d2f7cbc9fddffed0e74b452cR134] that sets {{is_coord}} to false in {{Discovery#stop()}}. Hmm, in that case, when will {{removeAll(cluster_name)}} in {{JDBC_PING#stop()}} ever be invoked?

I think it will never be invoked. If so, wouldn't it be better to remove this dead code to avoid confusion?

{code:title=jgroups/src/org/jgroups/protocols/JDBC_PING.java}
@Override
public void stop() {
    super.stop();
    if(is_coord) // always false here because it's set to false in super.stop()
        removeAll(cluster_name);
}
{code}
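
To make the concern concrete, here is a minimal standalone sketch of the call order. The classes {{DiscoveryStub}} and {{JdbcPingStub}}, the {{removeAllCalls}} counter, and the cluster name are hypothetical stand-ins, not the real JGroups classes; the sketch only assumes that the parent {{stop()}} resets {{is_coord}} before the subclass checks it.

```java
// Simplified stand-ins for Discovery and JDBC_PING (hypothetical, not the real JGroups classes).
public class StopOrderSketch {

    static class DiscoveryStub {
        protected boolean is_coord;

        public void stop() {
            is_coord = false; // assumed to mirror the fix: the flag is reset on stop
        }
    }

    static class JdbcPingStub extends DiscoveryStub {
        int removeAllCalls = 0; // counts how often the table would have been cleared

        void removeAll(String clusterName) {
            removeAllCalls++;
        }

        @Override
        public void stop() {
            super.stop();                  // resets is_coord first
            if (is_coord)                  // so this check can no longer succeed
                removeAll("demo-cluster");
        }
    }

    // Returns how many times removeAll() ran during stop().
    static int removeAllCallsAfterStop(boolean wasCoordinator) {
        JdbcPingStub ping = new JdbcPingStub();
        ping.is_coord = wasCoordinator;
        ping.stop();
        return ping.removeAllCalls;
    }

    public static void main(String[] args) {
        System.out.println(removeAllCallsAfterStop(true));  // prints 0
        System.out.println(removeAllCallsAfterStop(false)); // prints 0
    }
}
```

Under these assumptions the branch is unreachable whether or not the node was the coordinator before {{stop()}} was called.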

> JDBC_PING can face a split-brain issue when restarting a coordinator node
> -------------------------------------------------------------------------
>
>                 Key: JGRP-2470
>                 URL: https://issues.redhat.com/browse/JGRP-2470
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.1.9, 4.0.22
>            Reporter: Masafumi Miura
>            Assignee: Radoslav Husar
>            Priority: Major
>             Fix For: 4.2.5, 5.0.1
>
>
> After [the change|https://github.com/belaban/JGroups/commit/215cdb6] for JGRP-2199, JDBC_PING deletes all entries from the table when the coordinator node shuts down.
> This behavior can cause a split-brain when a coordinator node is restarted. In the following scenario all entries are lost, so the restarting node cannot find any information about the existing nodes in the table and does not join the existing cluster.
> 0. node1 and node2 form a cluster. node1 is the coordinator.
> 1. A restart of node1 is triggered
> 2. node1 removes its own node information from the table
> 3. node2 becomes the new coordinator
> 4. node2 updates its node information in the table
> 5. node1 clears all entries from the table
> 6. node1 starts again
> 7. node1 does not join the existing cluster because there is no node information in the table
> Note: If step 5 happens before step 4, the split-brain does not occur. However, because steps 4 and 5 run on different nodes, they can happen in parallel and their order is undefined. For example, if the shutdown of node1 takes a long time, this issue is likely to occur.
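
The effect of the two possible orderings of step 4 and step 5 can be sketched with an in-memory map standing in for the JDBC_PING table. This is only an illustration: the class name, the map layout (node name to coordinator flag), and the method {{node1FindsCluster}} are hypothetical, and no database or JGroups code is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of the discovery table; not JGroups code.
public class PingTableRaceSketch {

    // Replays the scenario and reports whether the restarted node1
    // still finds node2 in the table.
    static boolean node1FindsCluster(boolean clearBeforeNode2Update) {
        Map<String, Boolean> table = new LinkedHashMap<>(); // node name -> is coordinator
        table.put("node1", true);   // step 0: node1 is the coordinator
        table.put("node2", false);

        table.remove("node1");      // step 2: node1 removes its own entry
        // step 3: node2 becomes the new coordinator (no table change yet)

        if (clearBeforeNode2Update) {
            table.clear();            // step 5 first: node1 clears the table
            table.put("node2", true); // step 4 afterwards: node2 re-registers
        } else {
            table.put("node2", true); // step 4 first: node2 re-registers
            table.clear();            // step 5 afterwards: node2's entry is wiped
        }

        // steps 6 and 7: node1 restarts and looks for existing members
        return table.containsKey("node2");
    }

    public static void main(String[] args) {
        System.out.println(node1FindsCluster(true));  // prints true: node1 can join
        System.out.println(node1FindsCluster(false)); // prints false: split-brain
    }
}
```

In the real system the two steps run on different nodes, so which branch is taken depends on timing rather than on a flag.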

--
This message was sent by Atlassian Jira
(v7.13.8#713008)