[infinispan-issues] [JBoss JIRA] (ISPN-8587) Coordinator crash in 2-node cluster can lead to invalid cache topology

Wednesday, 6 December 2017

     [ https://issues.jboss.org/browse/ISPN-8587?page=com.atlassian.jira.plugin.... ]

Dan Berindei updated ISPN-8587:
-------------------------------
    Status: Open  (was: New)

...
 Coordinator crash in 2-node cluster can lead to invalid cache topology
 ----------------------------------------------------------------------

                 Key: ISPN-8587
                 URL: https://issues.jboss.org/browse/ISPN-8587
             Project: Infinispan
          Issue Type: Bug
          Components: Core
    Affects Versions: 9.2.0.Beta1, 9.1.3.Final
            Reporter: Dan Berindei
            Assignee: Dan Berindei
              Labels: testsuite_stability
             Fix For: 9.2.0.Beta2, 9.1.4.Final

 After the coordinator changes, {{PreferAvailabilityStrategy}} recovers the cache in three
steps. In the 1st step it broadcasts a cache topology with the {{currentCH}} of the "maximum"
topology. In the 2nd step it broadcasts a topology that removes all the topology members no
longer in the cluster, and in the 3rd step it queues a rebalance with the remaining members.
 If the cluster had only 2 nodes, {{A}} (the coordinator) and {{B}}, and {{B}} had not
finished joining the cache when {{A}} crashed, the maximum topology has {{A}} as its only
member. That means step 2 tries to remove all members and, in the process, removes the cache
topology from {{ClusterCacheStatus}}. When step 3 tries to rebalance with {{B}} as the only
member, it re-initializes {{ClusterCacheStatus}} with topology id 1. Because the
{{LocalTopologyManager}} on {{B}} already has a higher topology id, it never confirms the
rebalance.
 This sometimes happens in {{CacheManagerTest.testRestartReusingConfiguration}}. Like most
other tests, it waits for the cache to finish joining before killing a node, but it only
waits for the test cache, not for the {{CONFIG}} cache (which has
{{awaitInitialTransfer(false)}}). Also, most of the time {{A}} either finishes the
rebalance, or re-initializes {{ClusterCacheStatus}} and sends a topology update with {{B}}
as the only member, before leaving. The test only fails if {{B}} does not receive, or
ignores, one or more of those topology updates.
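 The sequence described above can be modelled with a small, self-contained sketch. The
classes {{CoordinatorStatus}} and {{LocalManager}} are hypothetical stand-ins for
{{ClusterCacheStatus}} and {{LocalTopologyManager}}, not the real Infinispan code, and the
"ignore anything with a lower id" check is a simplifying assumption. The sketch only shows
how dropping the topology in step 2 makes the topology id restart below the id that {{B}}
has already installed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopologyIdSketch {

    /** Hypothetical stand-in for the coordinator-side ClusterCacheStatus. */
    static class CoordinatorStatus {
        private Integer topologyId;               // null: no cache topology
        private List<String> members = new ArrayList<>();

        /** Step 1: install the currentCH members of the "maximum" topology. */
        int installMaxTopology(int maxTopologyId, List<String> currentChMembers) {
            topologyId = maxTopologyId + 1;
            members = new ArrayList<>(currentChMembers);
            return topologyId;
        }

        /** Step 2: remove members that are no longer in the cluster. */
        void removeLeavers(List<String> clusterMembers) {
            members.retainAll(clusterMembers);
            if (members.isEmpty()) {
                topologyId = null;                // topology dropped, id history lost
            } else {
                topologyId++;
            }
        }

        /** Step 3 and later joins: returns the id of the resulting topology. */
        int queueRebalance(List<String> wantedMembers) {
            if (topologyId == null) {
                topologyId = 1;                   // re-initialized like a new cache
                members = new ArrayList<>(wantedMembers);
            } else if (!members.containsAll(wantedMembers)) {
                topologyId++;                     // rebalance start topology
                members = new ArrayList<>(wantedMembers);
            }
            return topologyId;
        }
    }

    /** Hypothetical stand-in for LocalTopologyManager on node B. */
    static class LocalManager {
        private int currentTopologyId;

        LocalManager(int currentTopologyId) {
            this.currentTopologyId = currentTopologyId;
        }

        /** Returns true if the topology is installed, false if ignored as old. */
        boolean handle(int topologyId) {
            if (topologyId < currentTopologyId) {
                return false;                     // assumed rule: lower ids are stale
            }
            currentTopologyId = topologyId;
            return true;
        }
    }

    public static void main(String[] args) {
        CoordinatorStatus status = new CoordinatorStatus();
        LocalManager nodeB = new LocalManager(3); // B recovered topology 3

        int merged = status.installMaxTopology(3, Arrays.asList("A"));
        nodeB.handle(merged);                          // B installs topology 4
        status.removeLeavers(Arrays.asList("B"));      // A is gone: no members left
        status.queueRebalance(Arrays.asList("B"));     // re-initialized with id 1
        int rebalanceId = status.queueRebalance(Arrays.asList("B", "A2"));
        boolean accepted = nodeB.handle(rebalanceId);
        // prints "rebalance topology id=2, accepted=false"
        System.out.println("rebalance topology id=" + rebalanceId + ", accepted=" + accepted);
    }
}
```

 In this model the rebalance started for the restarted node gets topology id 2 while {{B}}
is still at topology 4, which matches the "Ignoring old rebalance" message in the log below.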
 {noformat}
 10:37:50,674 INFO  (remote-thread-Test-NodeA-p2265-t6:[]) [CLUSTER] ISPN000310: Starting
cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820:
256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134,
Test-NodeB-59687: 122]}, unionCH=null, phase=READ_OLD_WRITE_ALL,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}
 10:37:51,037 DEBUG (remote-thread-Test-NodeA-p2265-t6:[]) [ClusterTopologyManagerImpl]
Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology =
CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners =
(2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null,
phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
 10:37:51,097 DEBUG (remote-thread-Test-NodeA-p2265-t5:[]) [ClusterTopologyManagerImpl]
Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology =
CacheTopology{id=4, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners =
(2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null,
phase=READ_NEW_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
 10:37:51,203 DEBUG (testng-Test:[]) [ClusterTopologyManagerImpl] Updating cluster-wide
current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=5,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687:
256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeB-59687],
persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
 10:37:51,207 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new
cluster view for channel ISPN: [Test-NodeB-59687|2] (1) [Test-NodeB-59687]
 *** The topology updates are ignored here
 10:37:51,340 DEBUG
(transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Ignoring topology 4 for cache org.infinispan.CONFIG from old
coordinator Test-NodeA-37820
 10:37:51,340 DEBUG
(transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Ignoring topology 5 for cache org.infinispan.CONFIG from old
coordinator Test-NodeA-37820
 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus]
Recovered 1 partition(s) for cache org.infinispan.CONFIG: [CacheTopology{id=3,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820:
256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134,
Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}]
 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus]
Updating topologies after merge for cache org.infinispan.CONFIG, current topology =
CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, stable topology = CacheTopology{id=1,
rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820:
256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}, availability mode = null,
resolveConflicts = false
 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2])
[ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache
org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=3,
currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]},
pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820,
Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = null
 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2])
[ClusterTopologyManagerImpl] Updating cluster-wide stable topology for cache
org.infinispan.CONFIG, topology = CacheTopology{id=1, rebalanceId=1,
currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]},
pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}
 10:37:51,340 FATAL (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [CLUSTER]
[Context=org.infinispan.CONFIG]ISPN000313: Lost data because of abrupt leavers
[Test-NodeA-37820]
 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus]
Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687]
 10:37:51,341 DEBUG
(transport-thread-Test-NodeB-p2311-t6:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Updating local topology for cache org.infinispan.CONFIG:
CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}
 *** The topology is re-initialized without sending a topology update
 10:37:51,378 DEBUG (transport-thread-Test-NodeB-p2311-t1:[Merge-2]) [ClusterCacheStatus]
Queueing rebalance for cache ___defaultcache with members [Test-NodeB-59687]
 10:37:51,547 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new
cluster view for channel ISPN: [Test-NodeB-59687|3] (2) [Test-NodeB-59687,
Test-NodeA-12100]
 10:37:51,962 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Node Test-NodeA-12100
joining cache org.infinispan.CONFIG
 10:37:51,964 DEBUG (remote-thread-Test-NodeB-p2309-t6:[]) [ClusterCacheStatus] Queueing
rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687,
Test-NodeA-12100]
 *** The rebalance start is sent with the wrong topology id
 10:37:51,964 INFO  (remote-thread-Test-NodeB-p2309-t6:[]) [CLUSTER] ISPN000310: Starting
cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687:
256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129,
Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL,
actualMembers=[Test-NodeB-59687, Test-NodeA-12100],
persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb,
538b5324-cda9-49df-9786-7c6d6458332e]}
 10:37:51,965 DEBUG
(transport-thread-Test-NodeB-p2311-t4:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Ignoring old rebalance for cache org.infinispan.CONFIG, current
topology is 4: CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns =
256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256,
owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null,
phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100],
persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb,
538b5324-cda9-49df-9786-7c6d6458332e]}
 {noformat} 

--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
