[
https://issues.jboss.org/browse/ISPN-4949?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-4949:
------------------------------------
Bela, speaking of reliable view installation, could you clarify why view acks are
needed with the current algorithm? I was reminded of them because I'm getting these
errors in my stress test:
{noformat}
22:38:00,391 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs
(expected=2) for view [C-35962|5] after 2000ms, missing 1 ACKs from (1) C-35962
22:38:00,412 WARN (Incoming-1,A-23928:) [GMS] A-23928: failed to collect all ACKs
(expected=2) for view [C-35962|5] after 2000ms, missing 1 ACKs from (1) A-23928
22:38:21,339 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs
(expected=2) for view [C-35962|7] after 2000ms, missing 1 ACKs from (1) C-35962
22:38:21,364 WARN (Incoming-1,D-4191:) [GMS] D-4191: failed to collect all ACKs
(expected=2) for view [C-35962|7] after 2000ms, missing 1 ACKs from (1) D-4191
22:38:45,348 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs
(expected=2) for view [C-35962|9] after 2000ms, missing 1 ACKs from (1) C-35962
22:38:45,368 WARN (Incoming-1,B-18775:) [GMS] B-18775: failed to collect all ACKs
(expected=2) for view [C-35962|9] after 2000ms, missing 1 ACKs from (1) B-18775
22:39:06,304 WARN (Incoming-1,C-35962:) [GMS] C-35962: failed to collect all ACKs
(expected=2) for view [C-35962|11] after 2000ms, missing 1 ACKs from (1) C-35962
22:39:06,326 WARN (Incoming-1,A-23928:) [GMS] A-23928: failed to collect all ACKs
(expected=1) for view [C-35962|11] after 2000ms, missing 1 ACKs from (1) A-23928
22:39:18,935 WARN (Incoming-1,D-4191:) [GMS] D-4191: failed to collect all ACKs
(expected=2) for view [C-35962|12] after 2000ms, missing 2 ACKs from (2) A-23928, D-4191
{noformat}
[~rvansa] the current PR looks pretty good in my stress tests. I still get some failures
because MERGE3 sometimes merges the partitions in 2 steps and it then takes more than
20 seconds to install the final view, but otherwise waiting for an ack from all the
members before handling view updates seems to do the trick.
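To illustrate what "waiting for an ack from all the members" could look like, here is a minimal sketch. It is not the code from the PR and not the JGroups GMS implementation; the class ViewAckCollector and all of its methods are hypothetical names used only for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of JGroups or Infinispan.
// Collects one ack per member for a single view id.
final class ViewAckCollector {
    private final int viewId;
    private final Set<String> missing = ConcurrentHashMap.newKeySet();
    private final CountDownLatch done;

    ViewAckCollector(int viewId, Set<String> members) {
        this.viewId = viewId;
        this.missing.addAll(members);
        this.done = new CountDownLatch(members.size());
    }

    // Records an ack; acks for another view id and duplicate acks are ignored.
    void ack(String member, int ackedViewId) {
        if (ackedViewId == viewId && missing.remove(member)) {
            done.countDown();
        }
    }

    // Returns true only if every member acked before the timeout expired.
    boolean awaitAll(long timeout, TimeUnit unit) {
        try {
            return done.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Snapshot of the members that have not acked yet.
    Set<String> missing() {
        return Set.copyOf(missing);
    }
}
```

A caller would create one collector per view, feed it the acks as they arrive, and handle the view update only once awaitAll(...) returns true; on a timeout, missing() names the members that never answered, similar in spirit to the warnings above.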
Split brain: inconsistent data after merge
------------------------------------------
Key: ISPN-4949
URL: https://issues.jboss.org/browse/ISPN-4949
Project: Infinispan
Issue Type: Bug
Components: State Transfer
Affects Versions: 7.0.0.Final
Reporter: Radim Vansa
Assignee: Dan Berindei
Priority: Critical
1) Cluster A, B, C, D splits into 2 partitions:
A, B (coord A) detects this immediately and enters degraded mode with CH [A, B, C, D].
C, D (coord D) first detects that B is lost, gets the view A, C, D and starts a rebalance
with CH [A, C, D]. Segment X is primary-owned by C (it had a backup on B, but that was lost).
2) D detects that A was lost as well, and therefore enters degraded mode with CH [A, C, D].
3) C inserts an entry into X: all owners (only C) are present, therefore the modification
is allowed.
4) The cluster is merged and the coordinator finds that the max stable topology has CH [A, B,
C, D] (it is the older of the two partitions' topologies, obtained from A, B). It logs
'No active or unavailable partitions, so all the partitions must be in degraded
mode' (yes, all partitions are in degraded mode, but a write has happened in the
meantime).
5) The old CH is broadcast in the newest topology; no rebalance happens.
6) Inconsistency: a read in X may miss the update.
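The steps above can be replayed in a toy model. This is only a sketch under simplifying assumptions: each node is a plain in-memory map, ownership of segment X is passed in as a list, and the class SplitBrainSketch, its methods and the key "k" are hypothetical names, not Infinispan APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the scenario; not Infinispan code.
final class SplitBrainSketch {
    // One local store per node.
    private final Map<String, Map<String, String>> stores = new HashMap<>();

    SplitBrainSketch(List<String> nodes) {
        for (String node : nodes) {
            stores.put(node, new HashMap<>());
        }
    }

    // Applies a write on every current owner of the segment.
    void write(List<String> owners, String key, String value) {
        for (String owner : owners) {
            stores.get(owner).put(key, value);
        }
    }

    // Local read on one node, with no remote lookup.
    String readLocal(String node, String key) {
        return stores.get(node).get(key);
    }

    // Replays steps 1-6 and returns the reads on C and on B after the merge.
    static List<String> replay() {
        SplitBrainSketch cluster = new SplitBrainSketch(List.of("A", "B", "C", "D"));
        // Steps 1-3: with CH [A, C, D] the only owner of X is C, so the write is allowed.
        List<String> ownersDuringSplit = List.of("C");
        cluster.write(ownersDuringSplit, "k", "v1");
        // Steps 4-5: the merge restores CH [A, B, C, D] without a rebalance,
        // so B is an owner of X again but never received the entry.
        // Step 6: a read served by B misses the update.
        return Arrays.asList(cluster.readLocal("C", "k"), cluster.readLocal("B", "k"));
    }
}
```

In this model replay() returns "v1" for the read on C and null for the read on B, which is the inconsistency described in step 6.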