[
https://issues.jboss.org/browse/JGRP-1910?page=com.atlassian.jira.plugin....
]
Bela Ban commented on JGRP-1910:
--------------------------------
To reduce the chances of multiple parallel merges, we could say that the _merge leaders_
are collected from all views, e.g.
{noformat}
0 -> [0|5] (4) [0,1,2,3]
5 -> [5|5] (2) [5,6]
9 -> [8|5] (2) [8,9]
{noformat}
Here, the current algorithm to pick merge leaders only collects view creators that
actually *sent* the {{INFO}} message: 0 and 5, but *not* 8. The reason is that we want to
pick responses from live members, from which we received {{INFO}} messages, and not from
members that were proposed by other members and might be incorrect.
For example, 8 might have crashed, but 9 still has it in its view. We don't want to
wait until 8 has been excluded by failure detection.
If 8 was indeed dead and we collected 0, 5 and 8 as merge leaders and 8 became merge
leader (by sorting 0, 5 and 8 (UUID-wise) and taking the first one), then we'd waste
merge rounds until 8 was actually declared dead and removed from the views.
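The difference between the two collection strategies can be sketched as follows. This is a minimal illustration and not the MERGE3 code: the class {{MergeLeaderSketch}} and its methods are hypothetical, members are plain integers standing in for addresses, and the map holds "sender of {{INFO}} -> creator of the view it reported", matching the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch, not JGroups code. Integers stand in for member addresses.
final class MergeLeaderSketch {

    // Behaviour described above as "current": keep a view creator only if that
    // creator itself sent an INFO message, i.e. it appears as a sender (map key).
    static SortedSet<Integer> liveCreators(Map<Integer, Integer> creatorBySender) {
        SortedSet<Integer> leaders = new TreeSet<>();
        for (Integer creator : creatorBySender.values()) {
            if (creatorBySender.containsKey(creator)) {
                leaders.add(creator);
            }
        }
        return leaders;
    }

    // Alternative discussed above: collect the creators of all reported views,
    // including creators we never heard from directly.
    static SortedSet<Integer> allCreators(Map<Integer, Integer> creatorBySender) {
        return new TreeSet<>(creatorBySender.values());
    }

    // The merge leader is the first candidate in sorted order (null if none).
    static Integer pickLeader(SortedSet<Integer> candidates) {
        return candidates.isEmpty() ? null : candidates.first();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> infos = new LinkedHashMap<>();
        infos.put(0, 0); // 0 -> [0|5] (4) [0,1,2,3]
        infos.put(5, 5); // 5 -> [5|5] (2) [5,6]
        infos.put(9, 8); // 9 -> [8|5] (2) [8,9]
        System.out.println("live creators: " + liveCreators(infos));
        System.out.println("all creators:  " + allCreators(infos));
        System.out.println("leader (live): " + pickLeader(liveCreators(infos)));
    }
}
```

With integer labels, 0 sorts first under both strategies. Real addresses are ordered UUID-wise, so the position of 8 in the sorted set is unrelated to its label, and a dead 8 could end up first when all creators are collected.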
MERGE3: Do not lose any members from view during a series of merges
-------------------------------------------------------------------
Key: JGRP-1910
URL: https://issues.jboss.org/browse/JGRP-1910
Project: JGroups
Issue Type: Bug
Reporter: Radim Vansa
Assignee: Bela Ban
Fix For: 3.6.3
Attachments: SplitMergeFailFastTest.java, SplitMergeTest.java
When the connection between nodes is re-established, MERGE3 should merge the cluster
back together. This often involves not a single MergeView but a series of such events. The
problematic property of this protocol is that some of those views can lack certain
members even though these are reachable.
This causes a problem in Infinispan, since the cache cannot be fully rebalanced before
another merge arrives, and all owners of a certain segment can be gradually removed from
(and added again to) the view. This is detected not as a partition but as crashed nodes,
and losing all owners means data loss.
Removing members from view should be the role of FDx protocols, not MERGEx.