[jboss-jira] [JBoss JIRA] (JGRP-2470) JDBC_PING can face a split-brain issue when restarting a coordinator node

Wed Aug 12 05:19:00 EDT 2020

    [ https://issues.redhat.com/browse/JGRP-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384619#comment-14384619 ] 

Bela Ban commented on JGRP-2470:
--------------------------------

Probably, yes, and [~rhusar] has already submitted a PR to do this. I'll look into it today. In the meantime, please test with 4.2.5 to see whether it fixes the issue.

{{JDBC_PING.stop()}} calls {{super.stop()}}, so {{is_coord}} will be set to false. This is equivalent to removing the code, so you can go ahead and test.
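
To illustrate why this is equivalent to removing the code, here is a minimal sketch. It is a simplified model, not the JGroups source: the class names, {{isCoord}} and {{removeAllCalls}} are hypothetical, and it assumes the coordinator check runs after the call to the superclass.

```java
// Simplified model, NOT the JGroups source: all names are illustrative.
class DiscoverySketch {
    protected boolean isCoord;

    void becomeCoordinator() { isCoord = true; }

    // Models a base-class stop() that resets the coordinator flag.
    void stop() { isCoord = false; }
}

class JdbcPingSketch extends DiscoverySketch {
    // Counts how often the "delete every row" cleanup would have run.
    int removeAllCalls = 0;

    @Override
    void stop() {
        super.stop();            // the flag is already false after this call
        if (isCoord) {           // so this guard can never pass
            removeAllCalls++;    // stands in for clearing the whole table
        }
    }
}
```

In this model, stopping a node that was coordinator leaves {{removeAllCalls}} at 0, so the guarded cleanup is dead code.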

> JDBC_PING can face a split-brain issue when restarting a coordinator node
> -------------------------------------------------------------------------
>
>                 Key: JGRP-2470
>                 URL: https://issues.redhat.com/browse/JGRP-2470
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.1.9, 4.0.22
>            Reporter: Masafumi Miura
>            Assignee: Radoslav Husar
>            Priority: Major
>             Fix For: 4.2.5, 5.0.1
>
>
> After [the change|https://github.com/belaban/JGroups/commit/215cdb6] for JGRP-2199, JDBC_PING deletes all entries from the table during the shutdown of the coordinator node.
> This behavior can cause a split-brain when a coordinator node restarts. In the following scenario all entries are lost, so the restarting node cannot find any information about the existing nodes in the table and does not form a cluster with them.
> 0. node1 and node2 form a cluster. node1 is the coordinator.
> 1. A restart of node1 is triggered
> 2. node1 removes its node information from the table
> 3. node2 becomes the new coordinator
> 4. node2 updates its node information in the table
> 5. node1 clears all entries from the table
> 6. node1 starts again
> 7. node1 does not join the existing cluster because there is no node information in the table
> Note: If step 5 happens before step 4, the split-brain issue does not happen. However, steps 4 and 5 run on different nodes and can happen in parallel, so their order is undefined. For example, if the shutdown of node1 takes a long time, this issue is likely to occur.
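
The scenario in the issue description can be replayed with a small in-memory stand-in for the table. This is only a sketch: {{PingTableSketch}}, its methods and the address strings are hypothetical and do not use JGroups or JDBC; the flag {{clearBeforeUpdate}} selects the order of steps 4 and 5.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory stand-in for the JDBC_PING table; not JGroups code.
class PingTableSketch {
    private final Map<String, String> rows = new LinkedHashMap<>(); // node -> address

    void upsert(String node, String addr) { rows.put(node, addr); }
    void remove(String node) { rows.remove(node); }
    void removeAll() { rows.clear(); }
    List<String> members() { return new ArrayList<>(rows.keySet()); }

    // Replays the restart of node1 and returns what node1 finds on startup.
    static List<String> restartCoordinator(boolean clearBeforeUpdate) {
        PingTableSketch table = new PingTableSketch();
        table.upsert("node1", "addr1");      // step 0: both nodes are registered
        table.upsert("node2", "addr2");
        table.remove("node1");               // step 2: node1 removes its own row
        if (clearBeforeUpdate) {
            table.removeAll();               // step 5 runs first
            table.upsert("node2", "addr2");  // step 4 restores node2's row
        } else {
            table.upsert("node2", "addr2");  // step 4 runs first
            table.removeAll();               // step 5 wipes node2's row as well
        }
        return table.members();              // steps 6 and 7: rows visible to node1
    }
}
```

Here {{restartCoordinator(false)}} returns an empty list, so node1 finds nobody to join, while {{restartCoordinator(true)}} returns a list containing only node2.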
