[infinispan-issues] [JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge

Bela Ban (JIRA) issues at jboss.org
Sat Nov 15 02:38:29 EST 2014


    [ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020302#comment-13020302 ] 

Bela Ban edited comment on ISPN-4949 at 11/15/14 2:37 AM:
----------------------------------------------------------

JGroups has had the same view installation algorithm for 15 years now.

In a view A,B,C,D, if both C and D crash or are partitioned away, there has never been a guarantee that A installs view A,B immediately; the view succession can be A,B,C,D -> A,B,C -> A,B *or* A,B,C,D -> A,B.
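
For illustration, a minimal sketch (plain JGroups 3.x API; the cluster name "demo" is made up) that just logs every view a member installs; running this on A during such a failure would print either succession:

    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class ViewLogger {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel(); // default UDP stack
            ch.setReceiver(new ReceiverAdapter() {
                @Override
                public void viewAccepted(View v) {
                    // Prints each installed view, e.g. something like [A|3] (2) [A, B]
                    System.out.println("installed view: " + v);
                }
            });
            ch.connect("demo");
        }
    }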

This depends solely on the failure detection protocol used.

We can work with view bundling timeouts, i.e. if the crashes of C and D are detected *within* the view bundling timeout, then view A,B will likely be installed immediately; if not, there will be a view A,B,C before A,B. But this is not guaranteed.
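
As a sketch (attribute names as in JGroups 3.x; the stack and the values here are arbitrary), view bundling is configured on GMS, e.g. with a programmatic stack:

    import org.jgroups.JChannel;
    import org.jgroups.protocols.*;
    import org.jgroups.protocols.pbcast.*;

    public class BundlingStack {
        public static void main(String[] args) throws Exception {
            GMS gms = new GMS();
            gms.setValue("view_bundling", true);     // coalesce multiple join/leave events...
            gms.setValue("max_bundling_time", 500L); // ...that arrive within 500 ms into one view
            JChannel ch = new JChannel(
                new UDP(), new PING(), new MERGE3(), new FD_ALL(),
                new VERIFY_SUSPECT(), new NAKACK2(), new UNICAST3(),
                new STABLE(), gms);
            ch.connect("demo");
        }
    }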

The fact that Infinispan cannot handle this scenario doesn't mean the view installation algorithm is wrong :-)

I'm not even sure JGRP-1901 makes sense: imagine a partition in a 100-node view, where view 1-100 splits into 1-50 and 51-100. If the first partition has to be created going from 1-100 -> 1-50, then the view installation has to be delayed until all members have acked the new view. This means we would have to wait until failure detection has suspected members 51-100 before we could install view 1-50! This could take a long time, especially if we use FD (over TCP).
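
To put rough numbers on this (hypothetical values, not any particular version's defaults): with FD configured with timeout=3000ms and max_tries=3, suspecting a single neighbor takes about 3 x 3s = 9s. Because FD only pings its successor in a logical ring, 50 crashed members sitting next to each other would be suspected one after another, i.e. on the order of 50 x 9s = 450s, or 7.5 minutes, before view 1-50 could finally be installed.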

During this time, all RPCs from members in the first partition to members in the second partition would block!

This could get even worse: if some members in the second partition are suspended due to GC pauses or temporary network problems, they might flap in and out of sight (suspect-unsuspect-suspect), so an algorithm such as the one outlined in JGRP-1901 might never reach a conclusion and install a new view.

I believe it's better to notify members of a crashed member as soon as possible, so the application can react.
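
On the application side, reacting to such notifications can be a simple Receiver. A minimal sketch (JGroups 3.x API; what "react" means, e.g. entering a degraded mode the way Infinispan's partition handling does, is up to the application):

    import org.jgroups.Address;
    import org.jgroups.MergeView;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class ReactingReceiver extends ReceiverAdapter {
        @Override
        public void suspect(Address mbr) {
            // Fired as soon as failure detection suspects a member,
            // typically before the new view excluding it is installed
            System.out.println("suspected: " + mbr);
        }

        @Override
        public void viewAccepted(View v) {
            if (v instanceof MergeView) {
                // Partitions are merging: the subgroups show what each side saw
                System.out.println("merge of " + ((MergeView) v).getSubgroups());
            } else {
                System.out.println("new view: " + v.getMembers());
            }
        }
    }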



> Split brain: inconsistent data after merge
> ------------------------------------------
>
>                 Key: ISPN-4949
>                 URL: https://issues.jboss.org/browse/ISPN-4949
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State Transfer
>    Affects Versions: 7.0.0.Final
>            Reporter: Radim Vansa
>            Priority: Critical
>
> 1) cluster A, B, C, D splits into 2 parts:
> A, B (coord A) finds this out immediately and enters degraded mode with CH [A, B, C, D]
> C, D (coord D) first detects that B is lost, gets view A, C, D and starts a rebalance with CH [A, C, D]. Segment X is primary-owned by C (it had a backup on B, but that was lost)
> 2) D detects that A was lost as well, therefore enters degraded mode with CH [A, C, D]
> 3) C inserts an entry into X: all owners (only C) are present, therefore the modification is allowed
> 4) cluster is merged and the coordinator finds out that the max stable topology has CH [A, B, C, D] (it is the older of the two partitions' topologies, obtained from A, B) - logs 'No active or unavailable partitions, so all the partitions must be in degraded mode' (yes, all partitions are in degraded mode, but a write has happened in the meantime)
> 5) The old CH is broadcast in newest topology, no rebalance happens
> 6) Inconsistency: a read in X may miss the update


