[infinispan-issues] [JBoss JIRA] (ISPN-4949) Split brain: inconsistent data after merge
Dan Berindei (JIRA)
issues at jboss.org
Wed Nov 19 05:51:39 EST 2014
[ https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021072#comment-13021072 ]
Dan Berindei commented on ISPN-4949:
------------------------------------
Bela, speaking of reliable view installation, can you clarify why view acks are needed with the current algorithm? I was reminded of them because I'm getting these warnings in my stress test:
{noformat}
22:38:00,391 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs (expected=2) for view [C-35962|5] after 2000ms, missing 1 ACKs from (1) C-35962
22:38:00,412 WARN (Incoming-1,A-23928:) [GMS] A-23928: failed to collect all ACKs (expected=2) for view [C-35962|5] after 2000ms, missing 1 ACKs from (1) A-23928
22:38:21,339 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs (expected=2) for view [C-35962|7] after 2000ms, missing 1 ACKs from (1) C-35962
22:38:21,364 WARN (Incoming-1,D-4191:) [GMS] D-4191: failed to collect all ACKs (expected=2) for view [C-35962|7] after 2000ms, missing 1 ACKs from (1) D-4191
22:38:45,348 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs (expected=2) for view [C-35962|9] after 2000ms, missing 1 ACKs from (1) C-35962
22:38:45,368 WARN (Incoming-1,B-18775:) [GMS] B-18775: failed to collect all ACKs (expected=2) for view [C-35962|9] after 2000ms, missing 1 ACKs from (1) B-18775
22:39:06,304 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs (expected=2) for view [C-35962|11] after 2000ms, missing 1 ACKs from (1) C-35962
22:39:06,326 WARN (Incoming-1,A-23928:) [GMS] A-23928: failed to collect all ACKs (expected=1) for view [C-35962|11] after 2000ms, missing 1 ACKs from (1) A-23928
22:39:18,935 WARN (Incoming-1,D-4191:) [GMS] D-4191: failed to collect all ACKs (expected=2) for view [C-35962|12] after 2000ms, missing 2 ACKs from (2) A-23928, D-4191
{noformat}
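For anyone who wants to group warnings like these by view, here is a minimal Python sketch. The regular expression is modelled only on the lines quoted above, so it is an assumption about the message format rather than a documented JGroups contract, and the helper name parse_ack_warnings is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern modelled on the GMS warnings quoted above (an assumption: other
# JGroups versions may word this message differently).
ACK_WARNING = re.compile(
    r"\[GMS\] (?P<node>\S+): failed to collect all ACKs "
    r"\(expected=(?P<expected>\d+)\) for view \[(?P<view>[^\]]+)\] "
    r"after (?P<timeout_ms>\d+)ms, missing (?P<missing>\d+) ACKs "
    r"from \(\d+\) (?P<members>.+)$"
)


def parse_ack_warnings(lines):
    """Return {view_id: [(reporting_node, [missing_members]), ...]}."""
    by_view = defaultdict(list)
    for line in lines:
        match = ACK_WARNING.search(line.strip())
        if match is None:
            continue  # not an ACK warning, skip it
        members = [m.strip() for m in match.group("members").split(",")]
        by_view[match.group("view")].append((match.group("node"), members))
    return dict(by_view)


# Two of the lines from the log above, copied verbatim.
sample = [
    "22:38:00,391 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs (expected=2) for view [C-35962|5] after 2000ms, missing 1 ACKs from (1) C-35962",
    "22:39:18,935 WARN (Incoming-1,D-4191:) [GMS] D-4191: failed to collect all ACKs (expected=2) for view [C-35962|12] after 2000ms, missing 2 ACKs from (2) A-23928, D-4191",
]

summary = parse_ack_warnings(sample)
for view, reports in summary.items():
    for node, missing in reports:
        print(f"view {view}: {node} is missing ACKs from {missing}")
```

Grouping by view makes it easy to see which node reported the timeout and whether the missing ack is its own.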
[~rvansa] the current PR looks pretty good in my stress tests. I still get some failures because MERGE3 sometimes merges the partitions in two steps and it takes more than 20 seconds to install the final view. Otherwise, waiting for an ack from all the members before handling view updates seems to do the trick.
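To illustrate the idea of waiting for an ack from all members before handling a view update, here is a rough Python sketch. It is not the code from the PR and not JGroups' ack collector: the names ViewAckGate and handle_view_update, and the choice to skip the update on timeout, are assumptions made only for this example.

```python
import threading


class ViewAckGate:
    """Hypothetical helper: wait until every member has acked a view."""

    def __init__(self, view_id, members):
        self.view_id = view_id
        self._missing = set(members)
        self._cond = threading.Condition()

    def ack(self, member):
        """Record an ack; wake waiters once nobody is missing."""
        with self._cond:
            self._missing.discard(member)
            if not self._missing:
                self._cond.notify_all()

    def await_all(self, timeout_s):
        """Block up to timeout_s; return the members still missing."""
        with self._cond:
            self._cond.wait_for(lambda: not self._missing, timeout=timeout_s)
            return set(self._missing)


def handle_view_update(gate, timeout_s, apply_update):
    """Apply the update only if all acks arrived (assumed policy)."""
    missing = gate.await_all(timeout_s)
    if missing:
        # The caller decides what to do: retry, or wait for the next view.
        return False, missing
    apply_update(gate.view_id)
    return True, missing


applied = []
gate = ViewAckGate("C-35962|5", ["A-23928", "C-35962"])

gate.ack("A-23928")
first = handle_view_update(gate, 0.05, applied.append)   # one ack missing

gate.ack("C-35962")
second = handle_view_update(gate, 0.05, applied.append)  # all acks present

print(first, second, applied)
```

The first call times out with one member missing and leaves the update unapplied; the second applies it once the last ack has arrived.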
> Split brain: inconsistent data after merge
> ------------------------------------------
>
> Key: ISPN-4949
> URL: https://issues.jboss.org/browse/ISPN-4949
> Project: Infinispan
> Issue Type: Bug
> Components: State Transfer
> Affects Versions: 7.0.0.Final
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Critical
>
> 1) Cluster A, B, C, D splits into 2 parts:
> A, B (coord A) find this out immediately and enter degraded mode with CH [A, B, C, D]
> C, D (coord D) first detect that B is lost, get view A, C, D and start a rebalance with CH [A, C, D]. Segment X is primary-owned by C (it had a backup on B, but that got lost)
> 2) D detects that A was lost as well, and therefore enters degraded mode with CH [A, C, D]
> 3) C inserts an entry into X: all owners (only C) are present, therefore the modification is allowed
> 4) The cluster is merged and the coordinator finds out that the max stable topology has CH [A, B, C, D] (it is the older of the two partitions' topologies, obtained from A, B). It logs 'No active or unavailable partitions, so all the partitions must be in degraded mode' (yes, all partitions are in degraded mode, but a write has happened in the meantime)
> 5) The old CH is broadcast in the newest topology, no rebalance happens
> 6) Inconsistency: a read in X may miss the update
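The six steps in the issue description can be replayed with a toy model. This is only a sketch under stated assumptions, not Infinispan code: each node is a plain dict, segment X holds a single key "k", a write is allowed only when every owner is reachable, and a read may be served by any owner of the segment. The owner lists come from the description (C primary with a backup on B under CH [A, B, C, D]; only C under CH [A, C, D]).

```python
# Owners of segment X under each consistent hash (CH), per the description.
OWNERS_OLD_CH = ["C", "B"]   # CH [A, B, C, D]: C primary, B backup
OWNERS_CD_CH = ["C"]         # CH [A, C, D] in partition {C, D}: backup lost

# Toy per-node stores (assumption: one dict per node).
stores = {node: {} for node in "ABCD"}


def write(key, value, owners, reachable):
    """Assumed degraded-mode rule: reject unless all owners are reachable."""
    if not set(owners) <= set(reachable):
        raise RuntimeError(f"owners {owners} not all in {sorted(reachable)}")
    for owner in owners:
        stores[owner][key] = value


# With the old CH, the write would have been rejected inside {C, D}.
try:
    write("k", "v1", OWNERS_OLD_CH, reachable={"C", "D"})
    rejected_with_old_ch = False
except RuntimeError:
    rejected_with_old_ch = True

# Step 3: under CH [A, C, D] the only owner is C, so the write is allowed.
write("k", "v1", OWNERS_CD_CH, reachable={"C", "D"})

# Steps 4-5: after the merge the old CH is installed without a rebalance,
# so no state transfer copies the entry to B.
owners_after_merge = OWNERS_OLD_CH

# Step 6: what each owner would return for a read of "k".
reads = {owner: stores[owner].get("k") for owner in owners_after_merge}
print(rejected_with_old_ch, reads)
```

In this model C returns the value and B returns nothing, which is the inconsistency from step 6: whether a read sees the update depends on which owner serves it.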
--
This message was sent by Atlassian JIRA
(v6.3.8#6338)