[JBoss JIRA] (ISPN-4900) Split-brain: cancelled ST results in missing data

Thursday, 30 October 2014

    [ https://issues.jboss.org/browse/ISPN-4900?page=com.atlassian.jira.plugin.... ]

Dan Berindei commented on ISPN-4900:
------------------------------------

The problem in the log seems a little different: state transfer is not cancelled in step
3. Instead, node A finished receiving state and confirmed the rebalance:

{noformat}
edg-perf10.log:11:51:48,144 TRACE [org.infinispan.topology.LocalTopologyManagerImpl]
(transport-thread-15) Attempting to execute command on self:
CacheTopologyControlCommand{cache=testCache, type=REBALANCE_CONFIRM,
sender=edg-perf10-65050, joinInfo=null, topologyId=7, rebalanceId=0, currentCH=null,
pendingCH=null, availabilityMode=null, throwable=null, viewId=5}
{noformat}

But then A gets a new cluster view in which it is the only member, and it concludes that
the rebalance is now done:

{noformat}
edg-perf10.log:11:52:19,504 INFO 
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf10-65050)
ISPN000094: Received new cluster view: [edg-perf10-65050|6] (1) [edg-perf10-65050]
edg-perf10.log:11:52:20,878 DEBUG [org.infinispan.topology.ClusterCacheStatus]
(transport-thread-16) Finished cluster-wide rebalance for cache testCache, topology id =
7
edg-perf10.log:11:52:20,969 TRACE [org.infinispan.topology.ClusterCacheStatus]
(transport-thread-16) Cache testCache topology updated: CacheTopology{id=8, rebalanceId=3,
currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf10-65050: 171+170,
edg-perf11-31342: 170+172, edg-perf13-33773: 171+170]}, pendingCH=null, unionCH=null},
members = [edg-perf10-65050], joiners = []
edg-perf10.log:11:52:21,194 TRACE [org.infinispan.topology.ClusterCacheStatus]
(transport-thread-16) Updating stable topology for cache testCache: CacheTopology{id=8,
rebalanceId=3, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf10-65050:
171+170, edg-perf11-31342: 170+172, edg-perf13-33773: 171+170]}, pendingCH=null,
unionCH=null}
{noformat}

The fix would be to update the list of expected rebalance confirmations based on the
members of the current topology, not on the list of expected members. Leavers are removed
from the expected members list even in degraded mode, whereas the current topology's
member list changes only while the cache is available.
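
To illustrate the difference, here is a minimal, self-contained sketch. It is not the Infinispan implementation: the class `RebalanceConfirmationSketch`, the `ConfirmationCollector` type and the method `doneAfterUpdate` are hypothetical names invented for this example, and the node names A, B, C stand in for the three owners seen in the log. It only models a set of pending confirmations that is pruned against a member list when the view changes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Infinispan code: all names here are assumptions.
public class RebalanceConfirmationSketch {

    /** Tracks which nodes still have to confirm a rebalance. */
    static final class ConfirmationCollector {
        private final Set<String> pending;

        ConfirmationCollector(List<String> rebalanceMembers) {
            this.pending = new HashSet<>(rebalanceMembers);
        }

        /** A node reported that it finished receiving state. */
        void confirm(String node) {
            pending.remove(node);
        }

        /** Drop pending confirmations from nodes missing from the given list. */
        void retainMembers(List<String> members) {
            pending.retainAll(members);
        }

        boolean isDone() {
            return pending.isEmpty();
        }
    }

    /**
     * Starts a rebalance among topologyMembers, records one confirmation,
     * then prunes the pending set against membersUsedForUpdate.
     * Returns whether the rebalance would be considered finished.
     */
    static boolean doneAfterUpdate(List<String> topologyMembers,
                                   String confirmedNode,
                                   List<String> membersUsedForUpdate) {
        ConfirmationCollector collector = new ConfirmationCollector(topologyMembers);
        collector.confirm(confirmedNode);
        collector.retainMembers(membersUsedForUpdate);
        return collector.isDone();
    }

    public static void main(String[] args) {
        // Members of the current cache topology; unchanged in degraded mode.
        List<String> topologyMembers = Arrays.asList("A", "B", "C");
        // Expected members after the view in which A is alone (leavers removed).
        List<String> expectedMembers = Arrays.asList("A");

        // Pruning against expected members: B and C are dropped, rebalance "finishes".
        System.out.println("expected members: done="
                + doneAfterUpdate(topologyMembers, "A", expectedMembers));

        // Pruning against the current topology: B and C are still awaited.
        System.out.println("current topology: done="
                + doneAfterUpdate(topologyMembers, "A", topologyMembers));
    }
}
```

In this model, pruning against the expected members reports the rebalance as done as soon as A is alone, which matches the "Finished cluster-wide rebalance" line in the log. Pruning against the current topology's members keeps waiting for B and C.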

...
 Split-brain: cancelled ST results in missing data
 -------------------------------------------------

                 Key: ISPN-4900
                 URL: https://issues.jboss.org/browse/ISPN-4900
             Project: Infinispan
          Issue Type: Bug
          Components: State Transfer
            Reporter: Radim Vansa
            Priority: Critical
         Attachments: log.txt

 1. Cluster [A, B, C, D], in CH 1 segment X owned by [D, C]
 2. Split brain [A, B], [C, D]: A and B detect that D is missing, so they get view
[A, B, C] and start rebalancing; in CH 2 segment X is owned by [C, B]
 3. A and B get new view [A, B] (C is missing), state transfer of X is cancelled, and the
nodes enter degraded mode.
 4. Split brain is fixed, all nodes find each other and merge. B becomes AVAILABLE
but still does not have the data for X
 5. Subsequent cache.get() requests on B return null

--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
