[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.... ]
Bela Ban edited comment on ISPN-4949 at 11/15/14 2:37 AM:
----------------------------------------------------------
JGroups has had the same view installation algorithm for 15 years now.
In a view A,B,C,D, if both C and D crash or are partitioned away, there has never been a
guarantee that A installs view A,B immediately; the view succession can be A,B,C,D
-> A,B,C -> A,B *or* A,B,C,D -> A,B.
This depends solely on the failure detection protocol used.
We can work with view bundling timeouts, i.e. if the crashes of C and D are detected *within*
the view bundling timeout, then view A,B will likely be installed immediately; if not,
there will be a view A,B,C before A,B. But this is not guaranteed.
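To make this concrete, here is a minimal sketch against the JGroups 3.x API (the cluster
name is made up): an application observes views only through the viewAccepted() callback,
and it has to tolerate seeing either the intermediate view A,B,C or a direct jump to A,B.
{code:java}
import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class ViewWatcher {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel(); // default stack (udp.xml)
        ch.setReceiver(new ReceiverAdapter() {
            @Override
            public void viewAccepted(View view) {
                // Depending on when failure detection kicks in relative to
                // the view bundling timeout, this may fire once (-> A,B) or
                // twice (-> A,B,C -> A,B). Neither order is guaranteed.
                System.out.println("installed view: " + view.getMembers());
            }
        });
        ch.connect("demo-cluster"); // hypothetical cluster name
    }
}
{code}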
The fact that Infinispan cannot handle this scenario doesn't mean the view
installation algorithm is wrong :-)
I'm not even sure JGRP-1901 makes sense: imagine a partition in a 100-node view, where
view 1-100 splits into 1-50 and 51-100. If the first partition has to be created by going
from 1-100 -> 1-50, then the view installation has to be delayed until all members have
acked the new view. This means we would have to wait until failure detection has suspected
members 51-100 before we could install view 1-50! This could take a long time, especially
if we use FD (over TCP).
During this time, all RPCs from members in the first partition to members in the second
partition would block!
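As an aside, the usual way to keep such RPCs from hanging indefinitely is to cap every
synchronous call with a timeout. A hedged sketch against the JGroups 3.x RpcDispatcher
API; the method name "ping", the cluster name and the 5-second value are illustrative,
not taken from this issue:
{code:java}
import org.jgroups.JChannel;
import org.jgroups.blocks.MethodCall;
import org.jgroups.blocks.RequestOptions;
import org.jgroups.blocks.ResponseMode;
import org.jgroups.blocks.RpcDispatcher;
import org.jgroups.util.RspList;

public class BoundedRpc {
    public String ping() { return "pong"; } // invoked remotely by other members

    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();
        RpcDispatcher disp = new RpcDispatcher(ch, new BoundedRpc());
        ch.connect("demo-cluster"); // hypothetical cluster name

        // GET_ALL waits for responses from all members, but the 5000 ms
        // timeout bounds the wait even if half the view is unreachable
        // and not yet suspected by the failure detector.
        MethodCall call = new MethodCall("ping", new Object[0], new Class[0]);
        RspList<String> rsps = disp.callRemoteMethods(
            null, call, new RequestOptions(ResponseMode.GET_ALL, 5000));
        System.out.println(rsps);
    }
}
{code}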
The situation could get even worse: if some members in the second partition are suspended
due to GC or temporary network problems, they might come in and out of sight
(suspect-unsuspect-suspect), so an algorithm such as the one outlined in JGRP-1901 would
never come to a conclusion and install a new view.
I believe it's better to notify members of a crashed member as soon as possible, so
the application can react.
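That early notification is what the suspect() callback provides; a minimal sketch (again
JGroups 3.x API, cluster name made up) of an application reacting to a suspicion before
the corresponding view is installed:
{code:java}
import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class SuspectWatcher {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();
        ch.setReceiver(new ReceiverAdapter() {
            @Override
            public void suspect(Address mbr) {
                // Fires as soon as failure detection flags the member,
                // usually before a new view excluding it is installed.
                // The member may later be unsuspected, so treat this as
                // a hint, not as an authoritative membership change.
                System.out.println("suspected: " + mbr);
            }

            @Override
            public void viewAccepted(View view) {
                // The authoritative membership change arrives here.
                System.out.println("installed view: " + view);
            }
        });
        ch.connect("demo-cluster"); // hypothetical cluster name
    }
}
{code}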
Split brain: inconsistent data after merge
------------------------------------------
Key: ISPN-4949
URL: https://issues.jboss.org/browse/ISPN-4949
Project: Infinispan
Issue Type: Bug
Components: State Transfer
Affects Versions: 7.0.0.Final
Reporter: Radim Vansa
Priority: Critical
1) cluster A, B, C, D splits into 2 parts:
A, B (coord A) finds this out immediately and enters degraded mode with CH [A, B, C, D]
C, D (coord D) first detects that B is lost, gets view A, C, D and starts a rebalance with
CH [A, C, D]. Segment X is primary-owned by C (it had a backup on B, but that was lost)
2) D detects that A was lost as well, therefore enters degraded mode with CH [A, C, D]
3) C inserts an entry into X: all owners (only C) are present, therefore the modification
is allowed
4) cluster is merged and the coordinator finds out that the max stable topology has CH [A, B,
C, D] (the older of the two partitions' topologies, obtained from A, B) - it logs
'No active or unavailable partitions, so all the partitions must be in degraded
mode' (yes, all partitions are in degraded mode, but a write has happened in the
meantime; see the sketch after these steps)
5) The old CH is broadcast in the newest topology, and no rebalance happens
6) Inconsistency: a read of X may miss the update
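A layer above JGroups can at least detect such a merge and run its own reconciliation
instead of silently keeping the older topology. A hedged sketch against the JGroups 3.x
API; reconcileState() is a hypothetical hook, not actual Infinispan code:
{code:java}
import org.jgroups.JChannel;
import org.jgroups.MergeView;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class MergeDetector {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();
        ch.setReceiver(new ReceiverAdapter() {
            @Override
            public void viewAccepted(View view) {
                if (view instanceof MergeView) {
                    MergeView mv = (MergeView) view;
                    // Each subgroup was a partition that may have accepted
                    // writes on its own (step 3 above), so state from all
                    // subgroups has to be compared, not just the one with
                    // the "max stable" topology.
                    System.out.println("merged partitions: " + mv.getSubgroups());
                    reconcileState(mv);
                }
            }
        });
        ch.connect("demo-cluster"); // hypothetical cluster name
    }

    static void reconcileState(MergeView mv) {
        // Hypothetical: fetch the relevant entries from members of every
        // subgroup and resolve conflicts (e.g. versioning or last-writer-wins).
    }
}
{code}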