[JBoss JIRA] (ISPN-6107) State transfer should not fetch segments that were added during a rebalance

Friday, 29 January 2016

     [ https://issues.jboss.org/browse/ISPN-6107?page=com.atlassian.jira.plugin.... ]

Pedro Ruivo updated ISPN-6107:
------------------------------
        Status: Resolved  (was: Pull Request Sent)
    Resolution: Done

...
 State transfer should not fetch segments that were added during a
rebalance
 ---------------------------------------------------------------------------

                 Key: ISPN-6107
                 URL: https://issues.jboss.org/browse/ISPN-6107
             Project: Infinispan
          Issue Type: Bug
          Components: Core, Test Suite - Core
    Affects Versions: 8.1.0.Final
            Reporter: Dan Berindei
            Assignee: Dan Berindei
             Fix For: 8.2.0.Beta1

 When the last owner of a segment leaves the cache, the coordinator updates the
consistent hash (CH) and replaces that owner with {{numOwners}} new owners, so that a
segment always has at least 1 owner. If a rebalance is in progress, both the current and
the pending CH may have lost all the owners of a segment, in which case the coordinator
assigns new owners in both CHs (not necessarily the same ones).
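 The owner-replacement rule can be sketched as follows. This is a minimal illustration, not Infinispan's implementation: the function name {{reassign_lost_segments}}, the dict-of-lists model of a consistent hash, and the choice of simply taking the first {{num_owners}} members are all assumptions made for the example. The real coordinator may pick different owners.

```python
def reassign_lost_segments(ch, members, num_owners):
    """Return a copy of ch (segment -> list of owners) limited to live members.

    A segment whose owners have all left gets up to num_owners new owners.
    Picking the first members in the list is an arbitrary, hypothetical policy.
    """
    updated = {}
    for segment, owners in ch.items():
        survivors = [owner for owner in owners if owner in members]
        if not survivors:
            # Every owner left: assign fresh owners so the segment is never orphaned.
            survivors = list(members[:num_owners])
        updated[segment] = survivors
    return updated


# Both CHs of an in-progress rebalance are repaired independently,
# so they can end up with different owners for the same segment.
members = ["A", "B", "E"]
current_ch = reassign_lost_segments({0: ["C"], 1: ["B", "C"]}, members, num_owners=2)
pending_ch = reassign_lost_segments({0: ["B", "C"], 1: ["B", "C"]}, members, num_owners=2)
```

 With this policy, segment 0 gets brand-new owners in the current CH, while in the pending CH it keeps its surviving owner B, so the two CHs disagree about who owns it.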
 Sometimes, this causes tests that create clusters with many nodes to spend a lot of time
shutting down the cluster. Here's an example:
 # Cluster ABCDE, coordinator A, topology id = 0, currentCH = \{0: CD, 1: BC\}, pendingCH
= null
 # D leaves
 # A broadcasts a REBALANCE_START command with topology id 1, members = ABCE, currentCH =
\{0: C, 1: BC\}, pendingCH = \{0: BC, 1: BC\}
 # A and E confirm that they finished the rebalance
 # C leaves before sending the data for segment 0 to B
 # A broadcasts a CH_UPDATE command with topology id 2, members = ABE, currentCH = \{0:
AE, 1: B\}, pendingCH = \{0: B, 1: B\}
 # A now owns segment 0 in the writeCH (which is the union of currentCH and pendingCH)
 # A requests segment 0 from E, the other owner in the currentCH
 # B confirms that it finished the rebalance
 # A broadcasts a new topology: topology id 3, currentCH = \{0: B, 1: B\}, pendingCH =
null
 # E installs topology 3, and throws an IllegalArgumentException when handling A's
request for segments
 # A is not able to install topology 3, because it requests the transaction data while
holding the lock on the LocalCacheStatus
 # A receives the IllegalArgumentException from E and retries. But because it still has
the old topology, it keeps retrying on E ad infinitum, using a lot of CPU in the process.
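 The ownership change between topologies 1 and 2 in the steps above can be reproduced with a small model. This is only a sketch: {{write_ch}} and {{segments_to_request}} are hypothetical helper names, and a consistent hash is modelled as a plain dict from segment number to owner list.

```python
def write_ch(current_ch, pending_ch):
    """Union of the current and pending CH, per segment, preserving order."""
    merged = {}
    for segment in sorted(set(current_ch) | set(pending_ch)):
        owners = list(current_ch.get(segment, []))
        for owner in pending_ch.get(segment, []):
            if owner not in owners:
                owners.append(owner)
        merged[segment] = owners
    return merged


def segments_to_request(node, old_write_ch, new_write_ch, current_ch):
    """Map each segment the node newly owns to the other current owners it would ask."""
    requests = {}
    for segment, owners in new_write_ch.items():
        if node in owners and node not in old_write_ch.get(segment, []):
            requests[segment] = [o for o in current_ch.get(segment, []) if o != node]
    return requests


# Topology 1 (after D leaves) and topology 2 (after C leaves), as in the steps above.
write_1 = write_ch({0: ["C"], 1: ["B", "C"]}, {0: ["B", "C"], 1: ["B", "C"]})
current_2 = {0: ["A", "E"], 1: ["B"]}
write_2 = write_ch(current_2, {0: ["B"], 1: ["B"]})

requests_from_a = segments_to_request("A", write_1, write_2, current_2)
```

 In this model A becomes a write owner of segment 0 only because the coordinator added it to the currentCH, and the only node it can ask for data is E, which never held the segment either.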
 A requesting segment 0 from E is not a problem in itself: normally E would just send
back an empty set of transactions and entries. The problem is that the rest of the cluster
is able to install a new topology, because A already confirmed receiving all the data,
while A itself is stuck with the old topology.
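 The retry behaviour described above can be illustrated with a toy model. It is a sketch under stated assumptions, not Infinispan code: the {{Responder}} class and {{fetch_with_retries}} function are hypothetical, {{ValueError}} stands in for the IllegalArgumentException, and the ticket does not state the exact reason E rejects the request (here it is assumed that E no longer owns the segment in topology 3). The attempt cap exists only so the example terminates.

```python
class Responder:
    """A node that serves state requests for the segments it owns."""

    def __init__(self, owned_segments):
        self.owned_segments = set(owned_segments)

    def handle_request(self, segment):
        if segment not in self.owned_segments:
            # Stand-in for the IllegalArgumentException thrown by E.
            raise ValueError(f"segment {segment} is not owned by this node")
        return []  # an empty set of transactions and entries


def fetch_with_retries(responder, segment, max_attempts):
    """Retry against the same responder without ever refreshing the topology."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            responder.handle_request(segment)
            return attempts, True
        except ValueError:
            # The requester still holds the old topology, so it picks the same source.
            continue
    return attempts, False


# E under topology 2 still answers; E under topology 3 rejects every attempt.
e_topology_2 = Responder(owned_segments={0})
e_topology_3 = Responder(owned_segments=set())

ok_result = fetch_with_retries(e_topology_2, 0, max_attempts=5)
stuck_result = fetch_with_retries(e_topology_3, 0, max_attempts=5)
```

 Raising the cap only increases the number of failed attempts: as long as the requester cannot install the newer topology, nothing in the loop changes between retries.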

--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
