Dan Berindei created ISPN-9187:
----------------------------------
Summary: Lost segments logged when node leaves during rebalance
Key: ISPN-9187
URL: https://issues.jboss.org/browse/ISPN-9187
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 9.3.0.Beta1, 9.2.3.Final
Reporter: Dan Berindei
Assignee: Dan Berindei
Fix For: 9.3.0.CR1
When a node leaves during rebalance, we remove the leaver from both the current CH and the
pending CH with {{ConsistentHashFactory.updateMembers()}}. However, since the 4-phase
rebalance changes, joiners are also removed from the pending CH.
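To make the mechanism concrete, here is a minimal, hypothetical Java sketch. The
member-filtering and owner-replacement logic is illustrative only, not the real
{{ConsistentHashFactory.updateMembers()}} implementation; the node names match the log
below.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the updateMembers() problem: owners that are not
// in the new member list are dropped, and because the joiner is excluded
// from the pending CH's member list, a segment owned only by a joiner and
// a leaver ends up with no surviving owners.
public class UpdateMembersSketch {

   static List<String> updateOwners(List<String> owners, List<String> newMembers) {
      List<String> updated = new ArrayList<>();
      for (String owner : owners) {
         if (newMembers.contains(owner))
            updated.add(owner);
      }
      // If all owners were dropped, elect arbitrary replacements so the
      // segment keeps numOwners owners - but these nodes never received
      // the segment's data.
      int numOwners = owners.size();
      for (String member : newMembers) {
         if (updated.size() >= numOwners)
            break;
         if (!updated.contains(member))
            updated.add(member);
      }
      return updated;
   }

   public static void main(String[] args) {
      // Pending CH owners of segment 239: joiner D and leaver B
      List<String> seg239Owners = List.of("D", "B");
      // B has left, and D is excluded because it is a joiner
      List<String> pendingMembers = List.of("C", "A");
      // Both original owners are lost; C and A are elected instead
      System.out.println(updateOwners(seg239Owners, pendingMembers)); // [C, A]
   }
}
{code}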
In the following log, {{numOwners=2}} and the pending CH has one segment (239) owned by a
joiner (D) and a leaver (B). The updated pending CH contains neither B nor D, which means
both owners are lost and random owners (C and A) are elected instead. A sees that it was
allocated new segments by {{updateMembers}} and logs that the segment has been lost:
{noformat}
08:53:01,528 INFO (remote-thread-test-NodeA-p2-t6:[cluster-listener]) [CLUSTER]
[Context=cluster-listener] ISPN100002: Starting rebalance with members [test-NodeA-15001,
test-NodeB-55628, test-NodeC-62395, test-NodeD-29215], phase READ_OLD_WRITE_ALL, topology
id 10
08:53:01,554 TRACE (transport-thread-test-NodeD-p28-t2:[Topology-cluster-listener])
[CacheTopology] Current consistent hash's routing table: test-NodeA-15001 primary:
{... 237-238 244-245 248-250 252 254}, backup: {... 235-236 243 246-247 251 253}
test-NodeB-55628 primary: {... 231 234 240 242}, backup: {... 232-233 239 241 255}
test-NodeC-62395 primary: {... 232-233 235-236 239 241 243 246-247 251 253 255}, backup:
{... 231 234 237-238 240 242 244-245 248-250 252 254}
08:53:01,554 TRACE (transport-thread-test-NodeD-p28-t2:[Topology-cluster-listener])
[CacheTopology] Pending consistent hash's routing table: test-NodeA-15001 primary:
{... 237-238 245 248 252}, backup: {... 235-236 244 246-247 251}
test-NodeB-55628 primary: {... 231 240 242}, backup: {... 230 239 241}
test-NodeC-62395 primary: {... 232-233 235-236 241 243 246-247 251 253 255}, backup:
{... 231 234 240 242 245 249-250 252 254}
test-NodeD-29215 primary: {... 234 239 244 249-250 254}, backup: {... 232-233 237-238
243 248 253 255}
08:53:01,606 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener])
[ClusterCacheStatus] Removed node test-NodeB-55628 from cache cluster-listener: members =
[test-NodeA-15001, test-NodeC-62395, test-NodeD-29215], joiners = [test-NodeD-29215]
08:53:01,611 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [CacheTopology]
Current consistent hash's routing table: test-NodeA-15001 primary: {... 237-238
244-245 248-250 252 254}, backup: {... 235-236 243 246-247 251 253}
test-NodeC-62395 primary: {... 230-236 239-243 246-247 251 253 255}, backup: {...
237-238 244-245 248-250 252 254}
08:53:01,611 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [CacheTopology]
Pending consistent hash's routing table: test-NodeA-15001 primary: {... 237-238
244-245 248 252}, backup: {... 235-236 239 246-247 251}
test-NodeC-62395 primary: {... 227-236 239-243 246-247 249-251 253-255}, backup: {...
226 245 252}
08:53:01,613 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener])
[StateTransferManagerImpl] Installing new cache topology CacheTopology{id=11,
phase=READ_OLD_WRITE_ALL, rebalanceId=4, currentCH=DefaultConsistentHash{ns=256, owners =
(2)[test-NodeA-15001: 134+45, test-NodeC-62395: 122+50]},
pendingCH=DefaultConsistentHash{ns=256, owners = (2)[test-NodeA-15001: 129+45,
test-NodeC-62395: 127+28]}, unionCH=DefaultConsistentHash{ns=256, owners =
(2)[test-NodeA-15001: 134+62, test-NodeC-62395: 122+68]}, actualMembers=[test-NodeA-15001,
test-NodeC-62395, test-NodeD-29215],
persistentUUIDs=[0506cc27-9762-4703-ad56-6a3bf7953529,
c365b93f-e46c-4f11-ab46-6cafa2b2d92b, d3f21b0d-07f2-4089-b160-f754e719de83]} on cache
cluster-listener
08:53:01,662 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener])
[StateConsumerImpl] On cache cluster-listener we have: new segments: {1-18 21 30-53 56-63
67 69-73 75-78 81-85 88-101 107-115 117-118 125-134 136-139 142-156 158-165 168-170
172-174 177-179 181-188 193-211 214-226 228-229 235-239 243-254}; old segments: {1-15
30-53 56-63 69-73 75-77 81-85 88-101 107-115 117-118 125-131 136-139 142-145 150-156
158-165 168-170 172-174 177-179 183-188 193-211 214-218 220-226 228-229 235-238 243-254}
08:53:01,663 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener])
[StateConsumerImpl] On cache cluster-listener we have: added segments: {16-18 21 67 78
132-134 146-149 181-182 219 239}; removed segments: {}
08:53:01,663 DEBUG (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener])
[StateConsumerImpl] Not requesting segments {16-18 21 67 78 132-134 146-149 181-182 219
239} because the last owner left the cluster
{noformat}
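For reference, the set arithmetic behind the last two {{StateConsumerImpl}} trace lines is
a plain difference of segment sets: added = new - old and removed = old - new. A tiny,
hypothetical Java sketch, with small segment sets in place of the full ranges from the
log:
{code:java}
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the added/removed segment computation seen in
// the trace above, using tiny sets instead of the full 256-segment ranges.
public class SegmentDiffSketch {

   public static void main(String[] args) {
      Set<Integer> oldSegments = new TreeSet<>(Set.of(235, 236, 237, 238, 243));
      Set<Integer> newSegments = new TreeSet<>(Set.of(235, 236, 237, 238, 239, 243));

      Set<Integer> added = new TreeSet<>(newSegments);
      added.removeAll(oldSegments);   // {239}: A now owns segment 239

      Set<Integer> removed = new TreeSet<>(oldSegments);
      removed.removeAll(newSegments); // {}: nothing taken away from A

      System.out.println("added segments: " + added + "; removed segments: " + removed);
   }
}
{code}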
There isn't any visible inconsistency: A only owns segment 239 for writing, and the
coordinator immediately starts a new rebalance, ignoring the pending CH it sent out
earlier. However, the new rebalance causes problems of its own: see ISPN-8240.