M S edited comment on ISPN-9014 at 7/17/18 6:37 AM:
----------------------------------------------------
Hi.
In one of our environments, based on Infinispan version 9.3.0, where we have 15 nodes in
the cloud, we got to the point where a similar issue occurred on node 22 (Cache zones
encountered exception whilst trying to resolve conflicts on merge:
java.util.concurrent.CompletionException: org.infinispan.commons.CacheException).
We reproduced it with 15 nodes in the cloud by unplugging node11 and then plugging it
back in.
I'm attaching the Infinispan logs from the failed controllers and our cluster config.
Please have a look at whether this is really the same issue and the fix from the beta
version is not sufficient, or whether a new issue must be created.
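For context, a minimal sketch of the kind of cache configuration involved, written with the
programmatic API and placeholder values (the cache name comes from the error message above;
our actual configuration is in the attached zip):
{code:java}
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.conflict.MergePolicy;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.partitionhandling.PartitionHandling;

public class ZonesCacheConfig {
    public static void main(String[] args) {
        // Clustered cache manager with default transport settings (placeholder).
        DefaultCacheManager cacheManager =
                new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // Distributed cache with a merge policy; a merge policy other than NONE is
        // what makes Infinispan run conflict resolution after a partition merge.
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.clustering()
               .cacheMode(CacheMode.DIST_SYNC)
               .partitionHandling()
                   .whenSplit(PartitionHandling.ALLOW_READ_WRITES)
                   .mergePolicy(MergePolicy.PREFERRED_NON_NULL);

        cacheManager.defineConfiguration("zones", builder.build());
        cacheManager.getCache("zones");
    }
}
{code}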
Thx
was (Author: staho):
Hi.
In one of our environments, based on Infinispan version 9.1.3, where we have 15 nodes in
the cloud, we got to the point where a similar issue occurred on node 22 (Cache zones
encountered exception whilst trying to resolve conflicts on merge:
java.util.concurrent.CompletionException: org.infinispan.commons.CacheException).
We reproduced it with 15 nodes in the cloud by unplugging node11 and then plugging it
back in.
I'm attaching the Infinispan logs from the failed controllers and our cluster config.
Please have a look at whether this is really the same issue and the fix from the beta
version is not sufficient, or whether a new issue must be created.
Thx
Conflict resolution consistent hash should not include nodes that are not in the merged cluster view
----------------------------------------------------------------------------------------------------
Key: ISPN-9014
URL: https://issues.jboss.org/browse/ISPN-9014
Project: Infinispan
Issue Type: Bug
Components: Test Suite - Core
Affects Versions: 9.2.1.Final
Reporter: Dan Berindei
Assignee: Ryan Emerson
Labels: testsuite_stability
Fix For: 9.3.0.Beta1
Attachments: 15nodes-merge-issue.zip
Conflict resolution fails when trying to read entries from nodes that are not in the
JGroups cluster view, and this causes random failures in
{{ClusterListenerDistTest.testClusterListenerNodeGoesDown}}.
# NodeA leaves the cluster, but still manages to start a rebalance with [NodeB, NodeC]
(topology id 11)
# One node doesn't receive topology 11, so NodeB becomes coordinator and starts
conflict resolution with all 3 nodes in the pending CH (topology 12); see the sketch
after this list
# Conflict resolution fails because NodeB and NodeC can't read the entries from
NodeA
# {{onPartitionMerge}} also queued a rebalance, so NodeB starts a new rebalance without
canceling the previous rebalance first (topology 13)
# Because there is no reset topology, NodeB thinks it already requested all the segments
from NodeC in topology 11, so it doesn't add any new inbound transfer
# NodeC's state response arrives on NodeB with topology 11, NodeB discards it, and
state transfer hangs.
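The behaviour the summary asks for is to drop owners that are not in the merged cluster
view before conflict resolution starts. A minimal standalone sketch of that filtering step,
using plain strings instead of Infinispan's internal {{Address}}/{{CacheTopology}} types
(hypothetical helper, not the actual fix):
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ConflictResolutionMembers {

    /**
     * Keep only the CH owners that are present in the merged JGroups view. In the
     * log below this would drop Test-NodeA-57087 from the pending/union CH of
     * topology 12, so conflict resolution never tries to read entries from a node
     * that is no longer reachable (avoiding the SuspectException).
     */
    static List<String> restrictToMergedView(List<String> chOwners, List<String> mergedView) {
        return chOwners.stream()
                       .filter(mergedView::contains)
                       .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> chOwners = Arrays.asList("Test-NodeB-45145", "Test-NodeC-20831", "Test-NodeA-57087");
        List<String> mergedView = Arrays.asList("Test-NodeB-45145", "Test-NodeC-20831");
        System.out.println(restrictToMergedView(chOwners, mergedView));
        // [Test-NodeB-45145, Test-NodeC-20831]
    }
}
{code}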
{noformat}
14:52:52,426 INFO (testng-Test:[cluster-listener]) [CLUSTER] ISPN000310: Starting
cluster-wide rebalance for cache cluster-listener, topology CacheTopology{id=11,
phase=READ_OLD_WRITE_ALL, rebalanceId=4, currentCH=DefaultConsistentHash{ns=256, owners =
(2)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49]},
pendingCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 131+125,
Test-NodeC-20831: 125+131]}, unionCH=null, actualMembers=[Test-NodeB-45145,
Test-NodeC-20831], persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7,
ae95a681-2ba1-4e04-bfe5-05aa59425149]}
14:52:52,479 DEBUG (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3])
[ClusterCacheStatus] Recovered 2 partition(s) for cache cluster-listener:
[CacheTopology{id=11, phase=READ_OLD_WRITE_ALL, rebalanceId=4,
currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 128+50,
Test-NodeC-20831: 128+49]}, pendingCH=DefaultConsistentHash{ns=256, owners =
(2)[Test-NodeB-45145: 131+125, Test-NodeC-20831: 125+131]}, unionCH=null,
actualMembers=[Test-NodeB-45145, Test-NodeC-20831],
persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7,
ae95a681-2ba1-4e04-bfe5-05aa59425149]}, CacheTopology{id=9, phase=NO_REBALANCE,
rebalanceId=3, currentCH=DefaultConsistentHash{ns=256, owners = (3)[Test-NodeA-57087:
78+79, Test-NodeB-45145: 90+88, Test-NodeC-20831: 88+89]}, pendingCH=null, unionCH=null,
actualMembers=[Test-NodeA-57087, Test-NodeB-45145, Test-NodeC-20831],
persistentUUIDs=[48e3ddc7-ee97-42d8-a57d-283e8d28ec25,
301597c4-a4e4-46a6-8983-53e698ef70f7, ae95a681-2ba1-4e04-bfe5-05aa59425149]}]
14:52:52,484 DEBUG (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3])
[ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache
cluster-listener, topology = CacheTopology{id=12, phase=CONFLICT_RESOLUTION,
rebalanceId=5, currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145:
128+50, Test-NodeC-20831: 128+49]}, pendingCH=DefaultConsistentHash{ns=256, owners =
(3)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49, Test-NodeA-57087: 0+157]},
unionCH=DefaultConsistentHash{ns=256, owners = (3)[Test-NodeB-45145: 128+50,
Test-NodeC-20831: 128+49, Test-NodeA-57087: 0+157]}, actualMembers=[Test-NodeB-45145,
Test-NodeA-57087, Test-NodeC-20831],
persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7,
48e3ddc7-ee97-42d8-a57d-283e8d28ec25, ae95a681-2ba1-4e04-bfe5-05aa59425149]}, availability
mode = null
14:52:52,488 ERROR (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3])
[DefaultConflictManager] Cache cluster-listener encountered exception whilst trying to
resolve conflicts on merge: org.infinispan.remoting.transport.jgroups.SuspectException:
ISPN000400: Node Test-NodeA-57087 was suspected
14:52:52,532 INFO (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3])
[CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache cluster-listener, topology
CacheTopology{id=13, phase=READ_OLD_WRITE_ALL, rebalanceId=6,
currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 128+50,
Test-NodeC-20831: 128+49]}, pendingCH=DefaultConsistentHash{ns=256, owners =
(2)[Test-NodeB-45145: 131+125, Test-NodeC-20831: 125+131]}, unionCH=null,
actualMembers=[Test-NodeB-45145, Test-NodeC-20831],
persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7,
ae95a681-2ba1-4e04-bfe5-05aa59425149]}
14:52:52,577 TRACE
(stateTransferExecutor-thread-Test-NodeB-p23774-t3:[StateRequest-cluster-listener])
[StateConsumerImpl] Waiting for inbound transfer to finish:
InboundTransferTask{segments={19-21 28-33 38-44 50-55 60-62 72 77-79 86-91 101 104-107 113
116-126 168-169 172 181-182 188-189 195-197 200-202 223-226 235 242 245 249-254},
finishedSegments={}, unfinishedSegments={19-21 28-33 38-44 50-55 60-62 72 77-79 86-91 101
104-107 113 116-126 168-169 172 181-182 188-189 195-197 200-202 223-226 235 242 245
249-254}, source=Test-NodeC-20831, isCancelled=false,
completionFuture=java.util.concurrent.CompletableFuture@110952e8[Not completed],
topologyId=11, timeout=240000, cacheName=cluster-listener}
14:52:52,584 DEBUG (remote-thread-Test-NodeB-p23771-t4:[cluster-listener])
[StateConsumerImpl] Discarding state response with old topology id 11 for cache
cluster-listener, state transfer request topology was true
14:52:52,584 TRACE (remote-thread-Test-NodeB-p23771-t4:[]) [JGroupsTransport]
Test-NodeB-45145 sending response for request 13 to Test-NodeC-20831: null
{noformat}
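The last DEBUG/TRACE lines show why the transfer hangs: the inbound transfer was registered
for topology 11, but the queued rebalance has already installed topology 13 without
resetting the old transfers, so NodeC's response looks stale and is dropped. A simplified
illustration of that check (hypothetical names, not the actual {{StateConsumerImpl}} logic):
{code:java}
final class StaleStateResponseCheck {

    /**
     * A state response tagged with an older topology id than the topology currently
     * installed on the receiver is discarded instead of being applied.
     */
    static boolean accept(int responseTopologyId, int currentTopologyId) {
        return responseTopologyId >= currentTopologyId;
    }

    public static void main(String[] args) {
        int responseTopologyId = 11; // NodeC's state response (see log above)
        int currentTopologyId = 13;  // topology installed by the queued rebalance
        // Prints false: the response is discarded and the transfer never finishes.
        System.out.println(accept(responseTopologyId, currentTopologyId));
    }
}
{code}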