[infinispan-issues] [JBoss JIRA] (ISPN-4900) Split-brain: cancelled ST results in missing data

Thu Oct 30 09:51:40 EDT 2014

    [ https://issues.jboss.org/browse/ISPN-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016118#comment-13016118 ] 

Dan Berindei commented on ISPN-4900:
------------------------------------

The problem in the log seems a little different: state transfer is not cancelled in step 3. Instead, node A finished receiving state and confirmed the rebalance:

{noformat}
edg-perf10.log:11:51:48,144 TRACE [org.infinispan.topology.LocalTopologyManagerImpl] (transport-thread-15) Attempting to execute command on self: CacheTopologyControlCommand{cache=testCache, type=REBALANCE_CONFIRM, sender=edg-perf10-65050, joinInfo=null, topologyId=7, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, throwable=null, viewId=5}
{noformat}

But then A gets a new cluster view in which it is the only member, and it concludes that the rebalance is now done:

{noformat}
edg-perf10.log:11:52:19,504 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf10-65050) ISPN000094: Received new cluster view: [edg-perf10-65050|6] (1) [edg-perf10-65050]
edg-perf10.log:11:52:20,878 DEBUG [org.infinispan.topology.ClusterCacheStatus] (transport-thread-16) Finished cluster-wide rebalance for cache testCache, topology id = 7
edg-perf10.log:11:52:20,969 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-16) Cache testCache topology updated: CacheTopology{id=8, rebalanceId=3, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf10-65050: 171+170, edg-perf11-31342: 170+172, edg-perf13-33773: 171+170]}, pendingCH=null, unionCH=null}, members = [edg-perf10-65050], joiners = []
edg-perf10.log:11:52:21,194 TRACE [org.infinispan.topology.ClusterCacheStatus] (transport-thread-16) Updating stable topology for cache testCache: CacheTopology{id=8, rebalanceId=3, currentCH=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf10-65050: 171+170, edg-perf11-31342: 170+172, edg-perf13-33773: 171+170]}, pendingCH=null, unionCH=null}
{noformat}

The fix would be to update the list of expected rebalance confirmations based on the members of the current topology, not on the list of expected members. Leavers are removed from the expected members list even in degraded mode, but the current topology's member list doesn't change unless the cache is available.
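
To make the difference concrete, here is a minimal sketch of the two pruning strategies. It is a hypothetical model, not Infinispan code: the class name, the method pendingConfirmations, and the node names A, B, C are placeholders chosen for illustration, with A standing in for the node that ends up alone in the view.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the confirmation bookkeeping; none of these names exist in Infinispan.
class RebalanceConfirmationSketch {

    /**
     * Returns the nodes whose rebalance confirmation is still outstanding.
     *
     * @param rebalanceMembers nodes that had to confirm when the rebalance started
     * @param confirmed        nodes that already sent REBALANCE_CONFIRM
     * @param retained         member list used to prune the pending set after a view change
     */
    static Set<String> pendingConfirmations(List<String> rebalanceMembers,
                                            List<String> confirmed,
                                            List<String> retained) {
        Set<String> pending = new LinkedHashSet<>(rebalanceMembers);
        pending.removeAll(confirmed);   // confirmations received so far
        pending.retainAll(retained);    // drop nodes not in the list used for pruning
        return pending;
    }

    public static void main(String[] args) {
        // Assumed scenario: rebalance over three nodes, only A has confirmed.
        List<String> topologyMembers = Arrays.asList("A", "B", "C");
        List<String> confirmed = Arrays.asList("A");

        // After the split, A's view contains only A, so the expected members shrink to [A],
        // while the current topology's members stay [A, B, C] because the cache is degraded.
        List<String> expectedMembers = Arrays.asList("A");

        // Pruning by expected members: nothing is pending, so the rebalance looks finished.
        System.out.println(pendingConfirmations(topologyMembers, confirmed, expectedMembers).isEmpty()); // true

        // Pruning by the current topology: B and C are still pending, so the rebalance stays open.
        System.out.println(pendingConfirmations(topologyMembers, confirmed, topologyMembers)); // [B, C]
    }
}
```

With the second strategy, the rebalance cannot be declared finished while the cache is degraded, because the topology's member list only shrinks once the cache is available again.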

> Split-brain: cancelled ST results in missing data
> -------------------------------------------------
>
>                 Key: ISPN-4900
>                 URL: https://issues.jboss.org/browse/ISPN-4900
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State Transfer
>            Reporter: Radim Vansa
>            Priority: Critical
>         Attachments: log.txt
>
>
> 1. Cluster [A, B, C, D], in CH 1 segment X owned by [D, C]
> 2. Split brain [A, B], [C, D]: A and B detect that D is missing, so they get view [A, B, C] and start rebalancing; in CH 2, segment X is owned by [C, B]
> 3. A and B get new view [A, B] (C is missing) and state transfer of X is cancelled, nodes enter degraded mode.
> 4. Split brain is fixed, all nodes find each other and merge. B becomes AVAILABLE, but still does not have data for X
> 5. Subsequent requests on B return null upon cache.get()
