[JBoss JIRA] (ISPN-8587) Coordinator crash in 2-node cluster can lead to invalid cache topology
by Dan Berindei (JIRA)
Dan Berindei created ISPN-8587:
----------------------------------
Summary: Coordinator crash in 2-node cluster can lead to invalid cache topology
Key: ISPN-8587
URL: https://issues.jboss.org/browse/ISPN-8587
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 9.1.3.Final, 9.2.0.Beta1
Reporter: Dan Berindei
Assignee: Dan Berindei
Fix For: 9.2.0.Beta2, 9.1.4.Final
After a coordinator change, {{PreferAvailabilityStrategy}} first broadcasts a cache topology with the {{currentCH}} of the "maximum" topology. In the second step it broadcasts a topology that removes all topology members no longer in the cluster, and in the third step it queues a rebalance with the remaining members.
If the cluster has only 2 nodes, {{A}} (the coordinator) and {{B}}, and {{B}} has not finished joining the cache, the maximum topology has {{A}} as its only member. That means step 2 tries to remove all members, and in the process removes the cache topology from {{ClusterCacheStatus}}. When step 3 tries to rebalance with {{B}} as the only member, it re-initializes {{ClusterCacheStatus}} with topology id 1, and because {{LocalTopologyManager}} already has a higher topology id, it never confirms the rebalance.
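The effect can be seen with a minimal, hypothetical model of the topology-id check; the real {{LocalTopologyManagerImpl}} logic is more involved (coordinator checks, rebalance phases, etc.), but the monotonic-id guard is what drops the re-initialized rebalance:
{code:java}
// Hypothetical, simplified model of the topology-id guard: a joiner ignores any
// topology whose id is not newer than the one it already installed.
public class TopologyIdGuard {
   private int localTopologyId = 4;   // B already installed topology id 4 from the old coordinator

   boolean accept(int incomingTopologyId) {
      if (incomingTopologyId <= localTopologyId) {
         System.out.printf("Ignoring old topology %d, current topology is %d%n",
                           incomingTopologyId, localTopologyId);
         return false;   // the rebalance start with id=2 is dropped and never confirmed
      }
      localTopologyId = incomingTopologyId;
      return true;
   }

   public static void main(String[] args) {
      // The re-initialized ClusterCacheStatus numbers topologies from 1 again and
      // sends the rebalance start with id 2, which the joiner rejects as stale.
      new TopologyIdGuard().accept(2);
   }
}
{code}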
This sometimes happens in {{CacheManagerTest.testRestartReusingConfiguration}}. Like most other tests, it waits for the cache to finish joining before killing a node, but it only waits for the test cache, not for the {{CONFIG}} cache (which has {{awaitInitialTransfer(false)}}). Also, most of the time {{A}} either finishes the rebalance or re-initializes {{ClusterCacheStatus}} and sends a topology update with {{B}} as the only member before leaving. The test only fails if {{B}} misses or ignores one or more topology updates.
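For illustration, a test could also wait for the internal {{org.infinispan.CONFIG}} cache to finish joining before killing the node. The helper below is only a sketch, assuming the {{AdvancedCache}}/{{DistributionManager}} API (obtaining a handle to the internal cache is left aside); it is not the actual fix:
{code:java}
// Illustrative only: wait until a cache has the expected members and no pending CH before
// the test kills a node. Assumes DistributionManager#getCacheTopology() (Infinispan 9.x).
import org.infinispan.Cache;
import org.infinispan.topology.CacheTopology;

public class TopologyWaiter {
   static void waitForStableTopology(Cache<?, ?> cache, int expectedMembers, long timeoutMillis)
         throws InterruptedException {
      long deadline = System.currentTimeMillis() + timeoutMillis;
      while (true) {
         CacheTopology topology =
               cache.getAdvancedCache().getDistributionManager().getCacheTopology();
         if (topology.getPendingCH() == null && topology.getMembers().size() == expectedMembers) {
            return;
         }
         if (System.currentTimeMillis() > deadline) {
            throw new IllegalStateException("Cache " + cache.getName() + " did not stabilize in time");
         }
         Thread.sleep(50);
      }
   }
}
{code}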
{noformat}
10:37:50,674 INFO (remote-thread-Test-NodeA-p2265-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
10:37:51,037 DEBUG (remote-thread-Test-NodeA-p2265-t6:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,097 DEBUG (remote-thread-Test-NodeA-p2265-t5:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_NEW_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,203 DEBUG (testng-Test:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=5, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeB-59687], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,207 INFO (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|2] (1) [Test-NodeB-59687]
*** Here topology updates are ignored
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 4 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 5 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Recovered 1 partition(s) for cache org.infinispan.CONFIG: [CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}]
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Updating topologies after merge for cache org.infinispan.CONFIG, current topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, stable topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}, availability mode = null, resolveConflicts = false
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = null
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide stable topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}
10:37:51,340 FATAL (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [CLUSTER] [Context=org.infinispan.CONFIG]ISPN000313: Lost data because of abrupt leavers [Test-NodeA-37820]
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687]
10:37:51,341 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Updating local topology for cache org.infinispan.CONFIG: CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
*** The topology is re-initialized, without sending topology update
10:37:51,378 DEBUG (transport-thread-Test-NodeB-p2311-t1:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache ___defaultcache with members [Test-NodeB-59687]
10:37:51,547 INFO (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|3] (2) [Test-NodeB-59687, Test-NodeA-12100]
10:37:51,962 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Node Test-NodeA-12100 joining cache org.infinispan.CONFIG
10:37:51,964 DEBUG (remote-thread-Test-NodeB-p2309-t6:[]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687, Test-NodeA-12100]
*** Rebalance start is sent with wrong topology id
10:37:51,964 INFO (remote-thread-Test-NodeB-p2309-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}
10:37:51,965 DEBUG (transport-thread-Test-NodeB-p2311-t4:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring old rebalance for cache org.infinispan.CONFIG, current topology is 4: CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}
{noformat}
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8448) Retried prepare times out while partition is in degraded mode
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-8448?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-8448:
----------------------------------
Status: Resolved (was: Pull Request Sent)
Resolution: Done
> Retried prepare times out while partition is in degraded mode
> -------------------------------------------------------------
>
> Key: ISPN-8448
> URL: https://issues.jboss.org/browse/ISPN-8448
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.1.9.Final, 9.0.3.Final, 8.2.8.Final, 9.1.2.Final, 9.2.0.Alpha2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.1.10.Final, 8.2.9.Final, 9.1.3.Final, 9.2.0.Beta1
>
>
> Since ISPN-5046, prepare commands are retried if one of the prepare targets has left the cluster. However, when the cache enters degraded mode, the prepare targets still include the owners in other partitions, and the prepare command is retried again.
> Each retry automatically waits for cache topology {{<command topology> + 1}}. But the second retry is not really triggered by a topology change, so the retry blocks for {{remoteTimeout}} milliseconds before failing with a {{TimeoutException}}.
> This situation actually happens in {{OptimisticTxPartitionAndMergeDuringPrepareTest}}, but the tests do not fail because they don't wait for an {{AvailabilityException}} specifically; they just take 15+ seconds each.
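> A minimal, hypothetical model of the retry wait (not the actual interceptor code) shows why the second retry blocks for the full {{remoteTimeout}}:
> {code:java}
> // Hypothetical model: each retry waits for topology <command topology> + 1; in degraded
> // mode no new topology is installed, so the wait expires with a TimeoutException.
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public class PrepareRetryModel {
>    private final int currentTopologyId = 7;   // topology installed when degraded mode starts
>    private final CompletableFuture<Void> nextTopology = new CompletableFuture<>();
>
>    void retryPrepare(int commandTopologyId, long remoteTimeoutMillis) throws Exception {
>       int waitFor = commandTopologyId + 1;     // each retry waits for the next topology
>       if (waitFor > currentTopologyId) {
>          nextTopology.get(remoteTimeoutMillis, TimeUnit.MILLISECONDS);   // never completes in degraded mode
>       }
>       // ...re-send the prepare to the current owners...
>    }
>
>    public static void main(String[] args) throws Exception {
>       PrepareRetryModel model = new PrepareRetryModel();
>       model.retryPrepare(6, 100);      // first retry: topology 7 is already installed, proceeds
>       try {
>          model.retryPrepare(7, 100);   // second retry: waits for topology 8, which never arrives
>       } catch (TimeoutException e) {
>          System.out.println("second retry timed out waiting for topology 8");
>       }
>    }
> }
> {code}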
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8001) HotRodCustomMarshallerIteratorIT fails randomly
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-8001?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-8001:
----------------------------------
Status: Resolved (was: Pull Request Sent)
Resolution: Done
> HotRodCustomMarshallerIteratorIT fails randomly
> -----------------------------------------------
>
> Key: ISPN-8001
> URL: https://issues.jboss.org/browse/ISPN-8001
> Project: Infinispan
> Issue Type: Bug
> Components: Remote Protocols, Test Suite - Server
> Affects Versions: 9.0.0.Final
> Reporter: Adrian Nistor
> Assignee: Adrian Nistor
> Labels: testsuite_stability
> Fix For: 9.0.4.Final, 9.2.0.Beta2, 9.2.0.Final
>
>
> There seems to be a race condition between the execution of the test and the deployment of the marshaller.
> {code}
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.355 sec <<< FAILURE! - in org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT
> testIteration(org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT) Time elapsed: 0.103 sec <<< ERROR!
> org.infinispan.client.hotrod.exceptions.HotRodClientException: java.io.IOException: Unsupported protocol version 40
> at org.infinispan.client.hotrod.impl.protocol.Codec20.checkForErrorsInResponseStatus(Codec20.java:363)
> at org.infinispan.client.hotrod.impl.protocol.Codec20.readPartialHeader(Codec20.java:152)
> at org.infinispan.client.hotrod.impl.protocol.Codec20.readHeader(Codec20.java:138)
> at org.infinispan.client.hotrod.impl.operations.HotRodOperation.readHeaderAndValidate(HotRodOperation.java:60)
> at org.infinispan.client.hotrod.impl.operations.IterationNextOperation.execute(IterationNextOperation.java:51)
> at org.infinispan.client.hotrod.impl.iteration.RemoteCloseableIterator.fetch(RemoteCloseableIterator.java:104)
> at org.infinispan.client.hotrod.impl.iteration.RemoteCloseableIterator.hasNext(RemoteCloseableIterator.java:88)
> at org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT.iteratorToMap(HotRodCustomMarshallerIteratorIT.java:145)
> at org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT.testIteration(HotRodCustomMarshallerIteratorIT.java:132)
> {code}
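> One way a test could tolerate this race is to retry the first iteration for a bounded period until the deployed marshaller takes effect. This is only a hypothetical sketch, not the actual fix:
> {code:java}
> // Hypothetical retry helper: retry an operation while the server-side marshaller
> // deployment may still be in progress; give up after a bounded time. Illustrative only.
> import java.util.function.Supplier;
>
> import org.infinispan.client.hotrod.exceptions.HotRodClientException;
>
> public class Retry {
>    static <T> T withRetries(Supplier<T> operation, long timeoutMillis, long delayMillis)
>          throws InterruptedException {
>       long deadline = System.currentTimeMillis() + timeoutMillis;
>       while (true) {
>          try {
>             return operation.get();
>          } catch (HotRodClientException e) {
>             if (System.currentTimeMillis() > deadline) {
>                throw e;                  // the deployment never became visible; give up
>             }
>             Thread.sleep(delayMillis);   // wait and retry while the deployment settles
>          }
>       }
>    }
> }
> // Usage (hypothetical): the first iteration in the test would be wrapped as
> //    Retry.withRetries(() -> iteratorToMap(remoteCache.retrieveEntries(null, 10)), 30_000, 500);
> {code}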
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[JBoss JIRA] (ISPN-8001) HotRodCustomMarshallerIteratorIT fails randomly
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-8001?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-8001:
----------------------------------
Fix Version/s: 9.1.4.Final
(was: 9.0.4.Final)
> HotRodCustomMarshallerIteratorIT fails randomly
> -----------------------------------------------
>
> Key: ISPN-8001
> URL: https://issues.jboss.org/browse/ISPN-8001
> Project: Infinispan
> Issue Type: Bug
> Components: Remote Protocols, Test Suite - Server
> Affects Versions: 9.0.0.Final
> Reporter: Adrian Nistor
> Assignee: Adrian Nistor
> Labels: testsuite_stability
> Fix For: 9.2.0.Beta2, 9.2.0.Final, 9.1.4.Final
>
>
> There seems to be a race condition between the execution of the test and the deployment of the marshaller.
> {code}
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.355 sec <<< FAILURE! - in org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT
> testIteration(org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT) Time elapsed: 0.103 sec <<< ERROR!
> org.infinispan.client.hotrod.exceptions.HotRodClientException: java.io.IOException: Unsupported protocol version 40
> at org.infinispan.client.hotrod.impl.protocol.Codec20.checkForErrorsInResponseStatus(Codec20.java:363)
> at org.infinispan.client.hotrod.impl.protocol.Codec20.readPartialHeader(Codec20.java:152)
> at org.infinispan.client.hotrod.impl.protocol.Codec20.readHeader(Codec20.java:138)
> at org.infinispan.client.hotrod.impl.operations.HotRodOperation.readHeaderAndValidate(HotRodOperation.java:60)
> at org.infinispan.client.hotrod.impl.operations.IterationNextOperation.execute(IterationNextOperation.java:51)
> at org.infinispan.client.hotrod.impl.iteration.RemoteCloseableIterator.fetch(RemoteCloseableIterator.java:104)
> at org.infinispan.client.hotrod.impl.iteration.RemoteCloseableIterator.hasNext(RemoteCloseableIterator.java:88)
> at org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT.iteratorToMap(HotRodCustomMarshallerIteratorIT.java:145)
> at org.infinispan.server.test.client.hotrod.HotRodCustomMarshallerIteratorIT.testIteration(HotRodCustomMarshallerIteratorIT.java:132)
> {code}
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)