[JBoss JIRA] (ISPN-8587) Coordinator crash in 2-node cluster can lead to invalid cache topology

Wednesday, 6 December 2017

Dan Berindei created ISPN-8587:
----------------------------------

             Summary: Coordinator crash in 2-node cluster can lead to invalid cache topology
                 Key: ISPN-8587
                 URL: https://issues.jboss.org/browse/ISPN-8587
             Project: Infinispan
          Issue Type: Bug
          Components: Core
    Affects Versions: 9.1.3.Final, 9.2.0.Beta1
            Reporter: Dan Berindei
            Assignee: Dan Berindei
             Fix For: 9.2.0.Beta2, 9.1.4.Final

After the coordinator changes, {{PreferAvailabilityStrategy}} first broadcasts a cache
topology with the {{currentCH}} of the "maximum" topology. In the 2nd step it
broadcasts a topology that removes all the topology members no longer in the cluster, and
in the 3rd step it queues a rebalance with the remaining members.

If the cluster had only 2 nodes, {{A}} (the coordinator) and {{B}}, and {{B}} had not finished
joining the cache, the maximum topology has {{A}} as the only member. That means step 2
tries to remove all members, and in the process removes the cache topology from
{{ClusterCacheStatus}}. When step 3 tries to rebalance with {{B}} as the only member, it
re-initializes {{ClusterCacheStatus}} with topology id 1. Because
{{LocalTopologyManager}} already has a higher topology id, it never confirms the
rebalance.
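
The failure can be reproduced in a toy model. The sketch below is not Infinispan code: {{CoordinatorStatus}}, {{LocalNode}} and {{rebalanceAccepted}} are hypothetical names, and the topology ids are simplified. It only illustrates how discarding the topology in step 2 makes step 3 send an id that the local node treats as stale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TopologyIdSketch {

    /** Hypothetical, heavily simplified stand-in for the coordinator-side cache status. */
    static final class CoordinatorStatus {
        private Integer topologyId;                 // null models "no cache topology"
        private final List<String> members = new ArrayList<>();

        /** Step 1: install the recovered "maximum" topology. */
        void installRecovered(int id, List<String> currentMembers) {
            topologyId = id;
            members.clear();
            members.addAll(currentMembers);
        }

        /** Step 2: drop members that are no longer in the cluster view. */
        void removeLeavers(List<String> clusterView) {
            members.retainAll(clusterView);
            if (members.isEmpty()) {
                topologyId = null;                  // assumption: models the behaviour described in the report
            } else {
                topologyId++;
            }
        }

        /** Step 3: start a rebalance and return the topology id that would be broadcast. */
        int startRebalance(List<String> newMembers) {
            // With no topology left, the status starts over at id 1.
            topologyId = (topologyId == null) ? 1 : topologyId + 1;
            members.clear();
            members.addAll(newMembers);
            return topologyId;
        }
    }

    /** Hypothetical stand-in for the local side, which only accepts newer topology ids. */
    static final class LocalNode {
        private int topologyId;

        LocalNode(int topologyId) {
            this.topologyId = topologyId;
        }

        boolean acceptRebalance(int incomingId) {
            if (incomingId <= topologyId) {
                return false;                       // treated as an old rebalance and ignored
            }
            topologyId = incomingId;
            return true;
        }
    }

    /** Runs the three steps and reports whether the local node accepts the rebalance. */
    static boolean rebalanceAccepted(List<String> recoveredMembers, List<String> clusterView) {
        CoordinatorStatus status = new CoordinatorStatus();
        LocalNode local = new LocalNode(4);         // local node already installed topology 4
        status.installRecovered(4, recoveredMembers);
        status.removeLeavers(clusterView);
        return local.acceptRebalance(status.startRebalance(clusterView));
    }

    public static void main(String[] args) {
        // B never finished joining: the recovered topology contains only A.
        System.out.println(rebalanceAccepted(Arrays.asList("A"), Arrays.asList("B")));
        // B had finished joining: one member survives step 2 and the ids keep increasing.
        System.out.println(rebalanceAccepted(Arrays.asList("A", "B"), Arrays.asList("B")));
    }
}
```

In this model the first call prints {{false}} and the second prints {{true}}: the rebalance is rejected only when step 2 empties the member list.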

This sometimes happens in {{CacheManagerTest.testRestartReusingConfiguration}}. Like most
other tests, it waits for the cache to finish joining before killing a node. But it only
waits for the test cache, not for the {{CONFIG}} cache (which has
{{awaitInitialTransfer(false)}}). Also, most of the time {{A}} either finishes the
rebalance or re-initializes {{ClusterCacheStatus}} and sends a topology update with {{B}}
as the only member before leaving. The test fails only if {{B}} either does not receive
or ignores one or more of these topology updates.

{noformat}
10:37:50,674 INFO  (remote-thread-Test-NodeA-p2265-t6:[]) [CLUSTER] ISPN000310: Starting
cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820:
256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134,
Test-NodeB-59687: 122]}, unionCH=null, phase=READ_OLD_WRITE_ALL,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}
10:37:51,037 DEBUG (remote-thread-Test-NodeA-p2265-t6:[]) [ClusterTopologyManagerImpl]
Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology =
CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners =
(2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null,
phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,097 DEBUG (remote-thread-Test-NodeA-p2265-t5:[]) [ClusterTopologyManagerImpl]
Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology =
CacheTopology{id=4, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners =
(2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null,
phase=READ_NEW_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,203 DEBUG (testng-Test:[]) [ClusterTopologyManagerImpl] Updating cluster-wide
current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=5,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687:
256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeB-59687],
persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,207 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new
cluster view for channel ISPN: [Test-NodeB-59687|2] (1) [Test-NodeB-59687]
*** Here topology updates are ignored
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Ignoring topology 4 for cache org.infinispan.CONFIG from old
coordinator Test-NodeA-37820
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Ignoring topology 5 for cache org.infinispan.CONFIG from old
coordinator Test-NodeA-37820
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus]
Recovered 1 partition(s) for cache org.infinispan.CONFIG: [CacheTopology{id=3,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820:
256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134,
Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}]
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus]
Updating topologies after merge for cache org.infinispan.CONFIG, current topology =
CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, stable topology = CacheTopology{id=1,
rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820:
256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}, availability mode = null,
resolveConflicts = false
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2])
[ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache
org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=3,
currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]},
pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820,
Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = null
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2])
[ClusterTopologyManagerImpl] Updating cluster-wide stable topology for cache
org.infinispan.CONFIG, topology = CacheTopology{id=1, rebalanceId=1,
currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]},
pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}
10:37:51,340 FATAL (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [CLUSTER]
[Context=org.infinispan.CONFIG]ISPN000313: Lost data because of abrupt leavers
[Test-NodeA-37820]
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus]
Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687]
10:37:51,341 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Updating local topology for cache org.infinispan.CONFIG:
CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners =
(1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE,
actualMembers=[Test-NodeA-37820, Test-NodeB-59687],
persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73,
96c95d15-440a-4dc7-915d-5d36ac4257bb]}
*** The topology is re-initialized, without sending a topology update
10:37:51,378 DEBUG (transport-thread-Test-NodeB-p2311-t1:[Merge-2]) [ClusterCacheStatus]
Queueing rebalance for cache ___defaultcache with members [Test-NodeB-59687]
10:37:51,547 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new
cluster view for channel ISPN: [Test-NodeB-59687|3] (2) [Test-NodeB-59687,
Test-NodeA-12100]
10:37:51,962 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Node Test-NodeA-12100
joining cache org.infinispan.CONFIG
10:37:51,964 DEBUG (remote-thread-Test-NodeB-p2309-t6:[]) [ClusterCacheStatus] Queueing
rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687,
Test-NodeA-12100]
*** Rebalance start is sent with wrong topology id
10:37:51,964 INFO  (remote-thread-Test-NodeB-p2309-t6:[]) [CLUSTER] ISPN000310: Starting
cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2,
rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687:
256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129,
Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL,
actualMembers=[Test-NodeB-59687, Test-NodeA-12100],
persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb,
538b5324-cda9-49df-9786-7c6d6458332e]}
10:37:51,965 DEBUG (transport-thread-Test-NodeB-p2311-t4:[Topology-org.infinispan.CONFIG])
[LocalTopologyManagerImpl] Ignoring old rebalance for cache org.infinispan.CONFIG, current
topology is 4: CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns =
256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256,
owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null,
phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100],
persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb,
538b5324-cda9-49df-9786-7c6d6458332e]}
{noformat}

--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
