[infinispan-issues] [JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge

Mon Nov 10 07:14:30 EST 2014

    [ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018589#comment-13018589 ] 

Radim Vansa commented on ISPN-4949:
-----------------------------------

We cannot suddenly start forcing users into an odd numOwners when they want a consistent cluster. I believe that having each node in one and only one view is the way to go.

However, acking the update is not enough, IMO. If the network partitioning changes in rapid succession, or if it is not transitive, a node could ack being in two views as well. We need a means of both registering with and deregistering from a view:
ABCD breaks into AB, CDB (not sure why you corrected my example above; I wanted B not to be coord):
1) A broadcasts view ABC, all of them ack the new view to A
2) C broadcasts view CDB, C and D ack but B is already in a view of different coord
3) B replies 'I am in another view, wait' and sends 'leave view' request to A:
4a) A responds 'removing from view...', sends the new view A (or whatever residual members remain), lets it be acked, and only then sends 'you were removed' to B - then B can proceed with acking the view to C
4b) Request to A times out and B can proceed with acking view to C
Generally, any RPC should be answered immediately so that node responsiveness can be detected, but the caller should assume that the requested action itself can take a while.
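
The steps above can be sketched as a small in-memory simulation. This is only an illustration of the idea, not JGroups or Infinispan code: the class ExclusiveViewSketch, the types Coordinator, Node and Reply, and all method names are hypothetical, coordinators are modelled separately from ordinary members for brevity, and the acked installation of the residual view in step 4a is collapsed into a single synchronous call.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: a node acks a new view only after leaving its previous one. */
class ExclusiveViewSketch {

    enum Reply { ACK, WAIT }

    /** A view coordinator that tracks which members have acked its view. */
    static class Coordinator {
        final String name;
        final Set<String> ackedMembers = new LinkedHashSet<>();
        boolean reachable = true; // false simulates a coordinator we cannot reach (case 4b)

        Coordinator(String name) {
            this.name = name;
            ackedMembers.add(name); // the coordinator is a member of its own view
        }

        /** Handles a 'leave view' request; returns false if the request times out. */
        boolean handleLeave(String member) {
            if (!reachable) {
                return false;            // 4b: no answer, the caller gives up waiting
            }
            ackedMembers.remove(member); // 4a: install the view of residual members
            return true;                 // 4a: 'you were removed'
        }
    }

    /** An ordinary member that is in at most one view at any time. */
    static class Node {
        final String name;
        Coordinator current; // coordinator of the one view this node has acked, or null

        Node(String name) {
            this.name = name;
        }

        /** Steps 1-3: ack the proposal unless we are already in another coordinator's view. */
        Reply onViewProposal(Coordinator proposer) {
            if (current != null && current != proposer) {
                return Reply.WAIT; // 'I am in another view, wait'
            }
            current = proposer;
            proposer.ackedMembers.add(name);
            return Reply.ACK;
        }

        /** Step 4: ask the current coordinator to remove us; proceed on removal or timeout. */
        boolean leaveCurrentView() {
            boolean removed = current != null && current.handleLeave(name);
            current = null; // in both 4a and 4b the node may now ack another view
            return removed;
        }
    }

    public static void main(String[] args) {
        Coordinator a = new Coordinator("A");
        Coordinator c = new Coordinator("C");
        Node b = new Node("B");

        System.out.println("A proposes: " + b.onViewProposal(a)); // step 1
        System.out.println("C proposes: " + b.onViewProposal(c)); // steps 2-3
        System.out.println("removed by A: " + b.leaveCurrentView()); // step 4a
        System.out.println("C proposes again: " + b.onViewProposal(c));
        System.out.println("views: A=" + a.ackedMembers + " C=" + c.ackedMembers);
    }
}
```

Note that in the timeout case (4b) the old coordinator may still list B until it notices the loss itself; the sketch only guarantees that B never considers itself to be in two views at once.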

I think that it is JGroups' responsibility to implement any group membership algorithm (the current one is 'unreliable', so let's have a 'reliable' one as an alternative, required for split-brain handling). Although you could do that in Infinispan, let's keep the layers separate. Infinispan does not implement its own RPC either, just 'because it's possible with the JGroups API'.

[~belaban] Comments?

> Split brain: inconsistent data after merge
> ------------------------------------------
>
>                 Key: ISPN-4949
>                 URL: https://issues.jboss.org/browse/ISPN-4949
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State Transfer
>    Affects Versions: 7.0.0.Final
>            Reporter: Radim Vansa
>            Priority: Critical
>
> 1) cluster A, B, C, D splits into 2 parts:
> A, B (coord A) finds this out immediately and enters degraded mode with CH [A, B, C, D]
> C, D (coord D) first detects that B is lost, gets view A, C, D and starts rebalance with CH [A, C, D]. Segment X is primary owned by C (it had backup on B but this got lost)
> 2) D detects that A was lost as well, therefore enters degraded mode with CH [A, C, D]
> 3) C inserts an entry into X: all owners (only C) are present, therefore the modification is allowed
> 4) cluster is merged and coordinator finds out that the max stable topology has CH [A, B, C, D] (it is the older of the two partitions' topologies, got from A, B) - logs 'No active or unavailable partitions, so all the partitions must be in degraded mode' (yes, all partitions are in degraded mode, but write has happened in the meantime)
> 5) The old CH is broadcast in newest topology, no rebalance happens
> 6) Inconsistency: read in X may miss the update
