[infinispan-issues] [JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge

Radim Vansa (JIRA) issues at jboss.org
Tue Nov 18 04:27:39 EST 2014


    [ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020681#comment-13020681 ] 

Radim Vansa commented on ISPN-4949:
-----------------------------------

{quote}The fact that Infinispan cannot handle this scenario doesn't mean the view installation algorithm is wrong{quote}
I am not saying that it's wrong; my point is that Infinispan's partition handling was designed on top of incorrect assumptions.

Let's call the functionality Infinispan requires/expects a 'reliable view'. We can either
a) build this functionality in Infinispan itself, or
b) have JGroups provide it as a framework service, reusable by other applications as well.

Currently there is the JGroups view layer and the Infinispan cache-members layer. We can therefore add another layer in between, or make the JGroups view *optionally* 'reliable'.
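
To make the layering concrete, a minimal sketch; all type names below are hypothetical, not existing JGroups or Infinispan API, and only show where the optional 'reliable' guarantee could sit:

{code:java}
import java.util.List;

// Hypothetical types only -- not real JGroups or Infinispan API.
// Today the cache-members layer consumes plain JGroups views directly;
// a 'reliable view' service could be a new layer in between (built either in
// Infinispan or in JGroups as a reusable framework feature), or an optional
// mode of the existing view layer. The consumer does not need to change.
interface ViewConsumer {
    // Implemented by the Infinispan cache-members layer.
    void membersChanged(long viewId, List<String> members);
}

interface ViewSource {
    // Plain JGroups semantics: a view may be delivered before all of its
    // members have agreed on (or even seen) it.
    void register(ViewConsumer consumer);
}

interface ReliableViewSource extends ViewSource {
    // Optional stronger semantics: a view is delivered only after every member
    // listed in it has acknowledged it, or has been expelled after a timeout.
}
{code}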

{quote}I believe it's better to notify members of a crashed member as soon as possible, so the application can react.{quote}
In the general case, yes. But Infinispan would have to handle that anyway (and an eager notification would not really help it), so delegating such general functionality makes sense IMO.

{quote}I'm not even sure JGRP-1901 makes sense: imagine a partition in a 100 node view: view 1-100 splits into 1-50 and 51-100. If the first partition has to be created going from 1-100 -> 1-50, then the view installation has to be delayed until all members have acked the new view. This means we have to wait until the failure detection suspected members 51-100 before we could install view 1-50 ! This could take a long time, especially if we use FD (over TCP).{quote}
I haven't considered FD, as detecting such a split would take a long time anyway. I was really thinking of protocols where the failure-detection time is constant regardless of how many nodes failed (so on the first detected failure, FD would have to suspect all other nodes and use VERIFY_SUSPECT to check who is still alive).
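
As a rough illustration (not the real FD or VERIFY_SUSPECT code; all names are made up): on the first suspicion, pessimistically suspect every other member and keep only those that answer a liveness probe within one fixed timeout, so the detection time does not grow with the number of failed nodes.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Illustration only, not the real FD/VERIFY_SUSPECT protocols.
final class ConstantTimeSuspicion {
    private final Predicate<String> answersProbe; // hypothetical liveness probe per member
    private final long probeTimeoutMs;

    ConstantTimeSuspicion(Predicate<String> answersProbe, long probeTimeoutMs) {
        this.answersProbe = answersProbe;
        this.probeTimeoutMs = probeTimeoutMs;
    }

    // Called when the first failure is detected: suspect everyone else, probe them
    // in parallel, and report as dead whoever did not answer within the timeout.
    Set<String> verifySuspects(List<String> otherMembers) throws InterruptedException {
        if (otherMembers.isEmpty())
            return Collections.emptySet();
        Set<String> confirmedDead = ConcurrentHashMap.newKeySet();
        confirmedDead.addAll(otherMembers);           // suspect all of them up front
        ExecutorService pool = Executors.newFixedThreadPool(otherMembers.size());
        try {
            for (String member : otherMembers)
                pool.submit(() -> {
                    if (answersProbe.test(member))    // answered in time -> unsuspect
                        confirmedDead.remove(member);
                });
            pool.shutdown();
            pool.awaitTermination(probeTimeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            pool.shutdownNow();                       // stop probes that are still hanging
        }
        return confirmedDead;
    }
}
{code}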

{quote}This could even get worse: if some members in the second partition are suspended due to GC or temp network problems, then they might come in and out of sight (suspect-unsuspect-suspect), so an algorithm such as outlined in JGRP-1901 would never come to a conclusion and install a new view.{quote}
I'm not sure I understand: the algorithm installs the view as soon as it has received an ack, or hit the timeout, for every member. When a node stays out of sight for a long time, it is expelled from the view (it can, of course, rejoin later). However, once it has acked another view (from the second partition), it must not ack a view from the first partition before it has left the second one.
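
A rough sketch of that ack-or-timeout rule (hypothetical names, not the actual JGRP-1901 design): the coordinator proposes a view, waits until every member has either acked or timed out, and installs the view with the non-ackers left out; a member that is still committed to a view of another partition simply refuses to ack until it has left that partition.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only -- hypothetical names, not the actual JGRP-1901 design.
final class AckOrTimeoutViewInstaller {
    private final Set<String> pendingAcks = ConcurrentHashMap.newKeySet();
    private final List<String> acked = Collections.synchronizedList(new ArrayList<>());

    // Member-side rule: only ack a proposal from our current partition's coordinator;
    // if we are still part of another partition, we must leave it before acking here.
    static boolean mayAck(String committedCoordinator, String proposingCoordinator) {
        return committedCoordinator == null || committedCoordinator.equals(proposingCoordinator);
    }

    // Coordinator proposes a view and waits for acks (transport and timers not shown).
    void propose(List<String> proposedMembers) {
        acked.clear();
        pendingAcks.clear();
        pendingAcks.addAll(proposedMembers);
    }

    void onAck(String member) {
        if (pendingAcks.remove(member))
            acked.add(member);
        maybeInstall();
    }

    // Fired by a per-member timer: the member never acked, so it is expelled from
    // the view being installed (it can rejoin later).
    void onAckTimeout(String member) {
        pendingAcks.remove(member);
        maybeInstall();
    }

    private void maybeInstall() {
        if (pendingAcks.isEmpty())
            System.out.println("Installing view with members " + acked);
    }
}
{code}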

> Split brain: inconsistent data after merge
> ------------------------------------------
>
>                 Key: ISPN-4949
>                 URL: https://issues.jboss.org/browse/ISPN-4949
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State Transfer
>    Affects Versions: 7.0.0.Final
>            Reporter: Radim Vansa
>            Assignee: Dan Berindei
>            Priority: Critical
>
> 1) cluster A, B, C, D splits into 2 parts:
> A, B (coord A) finds this out immediately and enters degraded mode with CH [A, B, C, D]
> C, D (coord D) first detects that B is lost, gets view A, C, D and starts a rebalance with CH [A, C, D]. Segment X is primary-owned by C (it had a backup on B, but that was lost)
> 2) D detects that A is lost as well and therefore enters degraded mode with CH [A, C, D]
> 3) C inserts an entry into X: all owners (only C) are present, therefore the modification is allowed
> 4) the cluster is merged and the coordinator finds out that the max stable topology has CH [A, B, C, D] (the older of the two partitions' topologies, obtained from A and B) - it logs 'No active or unavailable partitions, so all the partitions must be in degraded mode' (yes, all partitions are in degraded mode, but a write has happened in the meantime)
> 5) The old CH is broadcast in the newest topology and no rebalance happens
> 6) Inconsistency: read in X may miss the update
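
To illustrate step 6, a toy model of the sequence above in plain Java maps (not Infinispan code; the helper is made up). Because the merge restores the old owner list for segment X without any state transfer, the backup owner B never receives the value C wrote during the split, so a read served by B is stale.

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the scenario above -- each node is just a map, and the consistent
// hash is reduced to a fixed owner list for the single segment X.
public class SplitBrainLostUpdate {
    public static void main(String[] args) {
        Map<String, Map<String, String>> nodes = new HashMap<>();
        for (String n : Arrays.asList("A", "B", "C", "D"))
            nodes.put(n, new HashMap<>());

        // Before the split: segment X is owned by C (primary) and B (backup).
        write(nodes, Arrays.asList("C", "B"), "key-in-X", "v1");

        // Split: partition {C, D} rebalances to CH [A, C, D], where X is owned by C
        // only, so the write in step 3 reaches nobody but C.
        write(nodes, Arrays.asList("C"), "key-in-X", "v2");

        // Merge (steps 4-5): the old CH [A, B, C, D] is restored with no rebalance,
        // so B still holds v1 for segment X.
        System.out.println("read served by C: " + nodes.get("C").get("key-in-X")); // v2
        System.out.println("read served by B: " + nodes.get("B").get("key-in-X")); // v1 -> stale (step 6)
    }

    private static void write(Map<String, Map<String, String>> nodes,
                              List<String> owners, String key, String value) {
        for (String owner : owners)
            nodes.get(owner).put(key, value);
    }
}
{code}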



--
This message was sent by Atlassian JIRA
(v6.3.8#6338)

