[infinispan-issues] [JBoss JIRA] (ISPN-8587) Coordinator crash in 2-node cluster can lead to invalid cache topology

Tue Dec 19 08:29:00 EST 2017

     [ https://issues.jboss.org/browse/ISPN-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Emerson updated ISPN-8587:
-------------------------------
    Fix Version/s: 9.1.5.Final
                       (was: 9.1.4.Final)


> Coordinator crash in 2-node cluster can lead to invalid cache topology
> ----------------------------------------------------------------------
>
>                 Key: ISPN-8587
>                 URL: https://issues.jboss.org/browse/ISPN-8587
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 9.2.0.Beta1, 9.1.3.Final
>            Reporter: Dan Berindei
>            Assignee: Dan Berindei
>              Labels: testsuite_stability
>             Fix For: 9.2.0.CR1, 9.1.5.Final
>
>
> After the coordinator changes, {{PreferAvailabilityStrategy}} first broadcasts a cache topology with the {{currentCH}} of the "maximum" topology. In the 2nd step it broadcasts a topology that removes all the topology members no longer in the cluster, and in the 3rd step it queues a rebalance with the remaining members.
> If the cluster had only 2 nodes, {{A}} (the coordinator) and {{B}}, and B had not finished joining the cache, the maximum topology has {{A}} as the only member. That means step 2 tries to remove all members, and in the process removes the cache topology from {{ClusterCacheStatus}}. When step 3 tries to rebalance with {{B}} as the only member, it re-initializes {{ClusterCacheStatus}} with topology id 1, and because {{LocalTopologyManager}} already has a higher topology id it will never confirm the rebalance.
> This sometimes happens in {{CacheManagerTest.testRestartReusingConfiguration}}. Like most other tests, it waits for the cache to finish joining before killing a node. But it only waits for the test cache, not for the {{CONFIG}} cache (which has {{awaitInitialTransfer(false)}}). Also, most of the time {{A}} either finishes the rebalance or re-initializes {{ClusterCacheStatus}} and sends a topology update with {{B}} as the only member before leaving. The test only fails if {{B}} doesn't receive or ignores one or more topology updates.
> {noformat}
> 10:37:50,674 INFO  (remote-thread-Test-NodeA-p2265-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
> 10:37:51,037 DEBUG (remote-thread-Test-NodeA-p2265-t6:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
> 10:37:51,097 DEBUG (remote-thread-Test-NodeA-p2265-t5:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_NEW_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
> 10:37:51,203 DEBUG (testng-Test:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=5, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeB-59687], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
> 10:37:51,207 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|2] (1) [Test-NodeB-59687]
> *** Here topology updates are ignored
> 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 4 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
> 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 5 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
> 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Recovered 1 partition(s) for cache org.infinispan.CONFIG: [CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}]
> 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Updating topologies after merge for cache org.infinispan.CONFIG, current topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, stable topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}, availability mode = null, resolveConflicts = false
> 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = null
> 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide stable topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}
> 10:37:51,340 FATAL (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [CLUSTER] [Context=org.infinispan.CONFIG]ISPN000313: Lost data because of abrupt leavers [Test-NodeA-37820]
> 10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687]
> 10:37:51,341 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Updating local topology for cache org.infinispan.CONFIG: CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
> *** The topology is re-initialized, without sending topology update
> 10:37:51,378 DEBUG (transport-thread-Test-NodeB-p2311-t1:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache ___defaultcache with members [Test-NodeB-59687]
> 10:37:51,547 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|3] (2) [Test-NodeB-59687, Test-NodeA-12100]
> 10:37:51,962 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Node Test-NodeA-12100 joining cache org.infinispan.CONFIG
> 10:37:51,964 DEBUG (remote-thread-Test-NodeB-p2309-t6:[]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687, Test-NodeA-12100]
> *** Rebalance start is sent with wrong topology id
> 10:37:51,964 INFO  (remote-thread-Test-NodeB-p2309-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}
> 10:37:51,965 DEBUG (transport-thread-Test-NodeB-p2311-t4:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring old rebalance for cache org.infinispan.CONFIG, current topology is 4: CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}
> {noformat}


--
This message was sent by Atlassian JIRA
(v7.5.0#75005)