[
https://issues.jboss.org/browse/ISPN-4900?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-4900:
------------------------------------
The problem in the log seems a little different: state transfer is not cancelled in step
3. Instead, node A finished receiving state and confirmed the rebalance:
{noformat}
edg-perf10.log:11:51:48,144 TRACE [org.infinispan.topology.LocalTopologyManagerImpl]
(transport-thread-15) Attempting to execute command on self:
CacheTopologyControlCommand{cache=testCache, type=REBALANCE_CONFIRM,
sender=edg-perf10-65050, joinInfo=null, topologyId=7, rebalanceId=0, currentCH=null,
pendingCH=null, availabilityMode=null, throwable=null, viewId=5}
{noformat}
But then A gets a new cluster view in which it is the only member, and it concludes that
the rebalance is now done:
{noformat}
edg-perf10.log:11:52:19,504 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf10-65050)
ISPN000094: Received new cluster view: [edg-perf10-65050|6] (1) [edg-perf10-65050]
edg-perf10.log:11:52:20,878 DEBUG [org.infinispan.topology.ClusterCacheStatus]
(transport-thread-16) Finished cluster-wide rebalance for cache testCache, topology id =
7
edg-perf10.log:11:52:20,969 TRACE [org.infinispan.topology.ClusterCacheStatus]
(transport-thread-16) Cache testCache topology updated: CacheTopology{id=8, rebalanceId=3,
currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf10-65050: 171+170,
edg-perf11-31342: 170+172, edg-perf13-33773: 171+170]}, pendingCH=null, unionCH=null},
members = [edg-perf10-65050], joiners = []
edg-perf10.log:11:52:21,194 TRACE [org.infinispan.topology.ClusterCacheStatus]
(transport-thread-16) Updating stable topology for cache testCache: CacheTopology{id=8,
rebalanceId=3, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf10-65050:
171+170, edg-perf11-31342: 170+172, edg-perf13-33773: 171+170]}, pendingCH=null,
unionCH=null}
{noformat}
The fix would be to update the list of expected rebalance confirmations based on the
current topology, not based on the list of expected members. Leavers are removed from the
expected members list even in degraded mode, but the current topology's member list
doesn't change unless the cache is available.
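
To make the difference concrete, here is a minimal sketch of the two ways of computing the outstanding confirmations. It is a hypothetical model, not the actual ClusterCacheStatus code: the class and method names are invented for illustration, and it assumes the current topology's members are the three CH owners shown in the log, with only A's confirmation received.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of rebalance confirmation tracking; not Infinispan's real classes. */
final class RebalanceConfirmationSketch {
    private RebalanceConfirmationSketch() {}

    /**
     * Returns the nodes that still have to confirm the rebalance:
     * every node in the base list that has not confirmed yet.
     */
    static Set<String> pendingConfirmations(List<String> base, Set<String> confirmed) {
        Set<String> pending = new LinkedHashSet<>(base);
        pending.removeAll(confirmed);
        return pending;
    }

    /** The rebalance counts as finished once no confirmation is outstanding. */
    static boolean rebalanceFinished(List<String> base, Set<String> confirmed) {
        return pendingConfirmations(base, confirmed).isEmpty();
    }
}
```

With the expected members list [edg-perf10-65050] as the base and only edg-perf10-65050 confirmed, rebalanceFinished returns true, which matches the premature "Finished cluster-wide rebalance" message above. With the current topology's members [edg-perf10-65050, edg-perf11-31342, edg-perf13-33773] as the base, the other two nodes stay pending and the rebalance is not considered finished.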
Split-brain: cancelled ST results in missing data
-------------------------------------------------
Key: ISPN-4900
URL: https://issues.jboss.org/browse/ISPN-4900
Project: Infinispan
Issue Type: Bug
Components: State Transfer
Reporter: Radim Vansa
Priority: Critical
Attachments: log.txt
1. Cluster [A, B, C, D]; in CH 1, segment X is owned by [D, C].
2. Split brain [A, B], [C, D]: A and B detect that D is missing, so they get view
[A, B, C] and start rebalancing; in CH 2, segment X is owned by [C, B].
3. A and B get a new view [A, B] (C is missing), state transfer of X is cancelled, and the
nodes enter degraded mode.
4. The split brain is fixed and all nodes find each other and merge. B becomes AVAILABLE
but still does not have the data for X.
5. Subsequent cache.get() requests on B return null.
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)