[infinispan-issues] [JBoss JIRA] (ISPN-9014) Conflict resolution consistent hash should not include nodes that are not in the merged cluster view

Wed Mar 28 12:46:01 EDT 2018

     [ https://issues.jboss.org/browse/ISPN-9014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Berindei updated ISPN-9014:
-------------------------------
    Status: Open  (was: New)


> Conflict resolution consistent hash should not include nodes that are not in the merged cluster view
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ISPN-9014
>                 URL: https://issues.jboss.org/browse/ISPN-9014
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Test Suite - Core
>    Affects Versions: 9.2.1.Final
>            Reporter: Dan Berindei
>            Assignee: Ryan Emerson
>              Labels: testsuite_stability
>             Fix For: 9.3.0.Alpha1
>
>
> Conflict resolution fails when trying to read entries from nodes that are not in the JGroups cluster view, and this causes random failures in {{ClusterListenerDistTest.testClusterListenerNodeGoesDown random}}.
> # NodeA leaves the cluster, but still manages to start a rebalance with [NodeB, NodeC] (topology id 11)
> # One node doesn't receive topology 11, so NodeB becomes coordinator and starts conflict resolution with all 3 nodes in the pending CH (topology 12)
> # Conflict resolution fails because NodeB and NodeC can't read the entries from NodeA
> # {{onPartitionMerge}} also queued a rebalance, so NodeB starts a new rebalance without canceling the previous rebalance first (topology 13)
> # Because there is no reset topology, NodeB thinks it already requested all the segments for NodeC's in topology 11, so it doesn't add any new inbound transfer
> # NodeC's state response arrives on NodeB with topology 11, NodeB discards it, and state transfer hangs.
> {noformat}
> 14:52:52,426 INFO  (testng-Test:[cluster-listener]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache cluster-listener, topology CacheTopology{id=11, phase=READ_OLD_WRITE_ALL, rebalanceId=4, currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49]}, pendingCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 131+125, Test-NodeC-20831: 125+131]}, unionCH=null, actualMembers=[Test-NodeB-45145, Test-NodeC-20831], persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7, ae95a681-2ba1-4e04-bfe5-05aa59425149]}
> 14:52:52,479 DEBUG (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3]) [ClusterCacheStatus] Recovered 2 partition(s) for cache cluster-listener: [CacheTopology{id=11, phase=READ_OLD_WRITE_ALL, rebalanceId=4, currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49]}, pendingCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 131+125, Test-NodeC-20831: 125+131]}, unionCH=null, actualMembers=[Test-NodeB-45145, Test-NodeC-20831], persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7, ae95a681-2ba1-4e04-bfe5-05aa59425149]}, CacheTopology{id=9, phase=NO_REBALANCE, rebalanceId=3, currentCH=DefaultConsistentHash{ns=256, owners = (3)[Test-NodeA-57087: 78+79, Test-NodeB-45145: 90+88, Test-NodeC-20831: 88+89]}, pendingCH=null, unionCH=null, actualMembers=[Test-NodeA-57087, Test-NodeB-45145, Test-NodeC-20831], persistentUUIDs=[48e3ddc7-ee97-42d8-a57d-283e8d28ec25, 301597c4-a4e4-46a6-8983-53e698ef70f7, ae95a681-2ba1-4e04-bfe5-05aa59425149]}]
> 14:52:52,484 DEBUG (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache cluster-listener, topology = CacheTopology{id=12, phase=CONFLICT_RESOLUTION, rebalanceId=5, currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49]}, pendingCH=DefaultConsistentHash{ns=256, owners = (3)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49, Test-NodeA-57087: 0+157]}, unionCH=DefaultConsistentHash{ns=256, owners = (3)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49, Test-NodeA-57087: 0+157]}, actualMembers=[Test-NodeB-45145, Test-NodeA-57087, Test-NodeC-20831], persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7, 48e3ddc7-ee97-42d8-a57d-283e8d28ec25, ae95a681-2ba1-4e04-bfe5-05aa59425149]}, availability mode = null
> 14:52:52,488 ERROR (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3]) [DefaultConflictManager] Cache cluster-listener encountered exception whilst trying to resolve conflicts on merge: org.infinispan.remoting.transport.jgroups.SuspectException: ISPN000400: Node Test-NodeA-57087 was suspected
> 14:52:52,532 INFO  (stateTransferExecutor-thread-Test-NodeB-p23774-t4:[Merge-3]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache cluster-listener, topology CacheTopology{id=13, phase=READ_OLD_WRITE_ALL, rebalanceId=6, currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 128+50, Test-NodeC-20831: 128+49]}, pendingCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeB-45145: 131+125, Test-NodeC-20831: 125+131]}, unionCH=null, actualMembers=[Test-NodeB-45145, Test-NodeC-20831], persistentUUIDs=[301597c4-a4e4-46a6-8983-53e698ef70f7, ae95a681-2ba1-4e04-bfe5-05aa59425149]}
> 14:52:52,577 TRACE (stateTransferExecutor-thread-Test-NodeB-p23774-t3:[StateRequest-cluster-listener]) [StateConsumerImpl] Waiting for inbound transfer to finish: InboundTransferTask{segments={19-21 28-33 38-44 50-55 60-62 72 77-79 86-91 101 104-107 113 116-126 168-169 172 181-182 188-189 195-197 200-202 223-226 235 242 245 249-254}, finishedSegments={}, unfinishedSegments={19-21 28-33 38-44 50-55 60-62 72 77-79 86-91 101 104-107 113 116-126 168-169 172 181-182 188-189 195-197 200-202 223-226 235 242 245 249-254}, source=Test-NodeC-20831, isCancelled=false, completionFuture=java.util.concurrent.CompletableFuture at 110952e8[Not completed], topologyId=11, timeout=240000, cacheName=cluster-listener}
> 14:52:52,584 DEBUG (remote-thread-Test-NodeB-p23771-t4:[cluster-listener]) [StateConsumerImpl] Discarding state response with old topology id 11 for cache cluster-listener, state transfer request topology was true
> 14:52:52,584 TRACE (remote-thread-Test-NodeB-p23771-t4:[]) [JGroupsTransport] Test-NodeB-45145 sending response for request 13 to Test-NodeC-20831: null
> {noformat}


--
This message was sent by Atlassian JIRA
(v7.5.0#75005)