]
Dan Berindei updated ISPN-9291:
-------------------------------
Status: Pull Request Sent (was: Open)
Git Pull Request:
* Force STABLE gc before merge to reproduce the problem
* Compute the NAKACK2 digest for the merge view to fix it
BasePartitionHandlingTest.Partition.installMergeView() doesn't
compute the merge digest
---------------------------------------------------------------------------------------
Key: ISPN-9291
URL:
https://issues.jboss.org/browse/ISPN-9291
Project: Infinispan
Issue Type: Bug
Components: Test Suite - Core
Affects Versions: 9.3.0.CR1
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Minor
Labels: testsuite_stability
Fix For: 10.0.0.Final
The partition handling tests use
{{BasePartitionHandlingTest.Partition.installMergeView(view1, view2)}} to install the
merge view without waiting for {{MERGE3}} to run, making them much faster. Unfortunately,
the implementation is incorrect: {{GMS.installView(view)}} only works for regular views,
merge views need to be installed with {{GMS.installView(mergeView, digest)}}.
The result is that the nodes that got isolated from the coordinator request the
retransmission of all the {{NAKACK2}} messages (including view updates) since the cluster
first started. The isolated nodes cannot install the merge view until they deliver all the
older messages (even without knowing whether they're OOB or not). But if {{STABLE}}
ran and cleared a range of messages already, the retransmission request cannot be
satisfied, so the view updates will never be delivered.
This is easily reproducible in {{CrashedNodeDuringConflictResolutionTest}} if we add a
delay before updating the topology in {{StateConsumerImpl}}. The test installs the merge
view manually, but then kills NodeC and expects the cluster to install the new view
automatically. NodeD can't install the new view because it's waiting for earlier
messages from NodeA:
{noformat}
18:27:13,054 INFO (testng-test:[]) [TestSuiteProgress] Test starting:
org.infinispan.conflict.impl.CrashedNodeDuringConflictResolutionTest.testPartitionMergePolicy[DIST_SYNC]
18:27:13,640 DEBUG (testng-test:[]) [GMS] test-NodeA-39513: installing view
MergeView::[test-NodeA-39513|10] (4) [test-NodeA-39513, test-NodeB-9439, test-NodeC-43706,
test-NodeD-59078], 2 subgroups: [test-NodeA-39513|8] (2) [test-NodeA-39513,
test-NodeB-9439], [test-NodeC-43706|9] (2) [test-NodeC-43706, test-NodeD-59078]
18:27:13,674 DEBUG (testng-test:[]) [GMS] test-NodeD-59078: installing view
MergeView::[test-NodeA-39513|10] (4) [test-NodeA-39513, test-NodeB-9439, test-NodeC-43706,
test-NodeD-59078], 2 subgroups: [test-NodeA-39513|8] (2) [test-NodeA-39513,
test-NodeB-9439], [test-NodeC-43706|9] (2) [test-NodeC-43706, test-NodeD-59078]
18:27:13,828 TRACE (jgroups-7,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078: sending
XMIT_REQ ((1): {50}) to test-NodeA-39513
18:27:13,966 TRACE (Timer runner-1,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078:
sending XMIT_REQ ((49): {1-49}) to test-NodeA-39513
18:27:14,067 TRACE (Timer runner-1,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078:
sending XMIT_REQ ((45): {1-45}) to test-NodeA-39513
18:27:14,504 DEBUG (testng-test:[]) [DefaultCacheManager] Stopping cache manager ISPN on
test-NodeC-43706
18:27:18,642 TRACE (VERIFY_SUSPECT.TimerThread-89,test-NodeA-39513:[]) [GMS]
test-NodeA-39513: joiners=[], suspected=[test-NodeC-43706], leaving=[], new view:
[test-NodeA-39513|11] (3) [test-NodeA-39513, test-NodeB-9439, test-NodeD-59078]
18:27:18,643 TRACE (VERIFY_SUSPECT.TimerThread-89,test-NodeA-39513:[]) [GMS]
test-NodeA-39513: mcasting view [test-NodeA-39513|11] (3) [test-NodeA-39513,
test-NodeB-9439, test-NodeD-59078]
18:27:18,646 DEBUG (VERIFY_SUSPECT.TimerThread-89,test-NodeA-39513:[]) [GMS]
test-NodeA-39513: installing view [test-NodeA-39513|11] (3) [test-NodeA-39513,
test-NodeB-9439, test-NodeD-59078]
18:27:18,652 TRACE (VERIFY_SUSPECT.TimerThread-89,test-NodeA-39513:[]) [TCP_NIO2]
test-NodeA-39513: sending msg to null, src=test-NodeA-39513, headers are GMS:
GmsHeader[VIEW], NAKACK2: [MSG, seqno=63], TP: [cluster_name=ISPN]
18:27:18,656 TRACE (jgroups-20,test-NodeA-39513:[]) [TCP_NIO2] test-NodeA-39513: received
[dst: test-NodeA-39513, src: test-NodeB-9439 (3 headers), size=0 bytes,
flags=OOB|INTERNAL], headers are GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=100, TP:
[cluster_name=ISPN]
18:27:20,554 TRACE (Timer runner-1,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078:
sending XMIT_REQ ((45): {1-45}) to test-NodeA-39513
18:27:20,653 WARN (VERIFY_SUSPECT.TimerThread-89,test-NodeA-39513:[]) [GMS]
test-NodeA-39513: failed to collect all ACKs (expected=2) for view [test-NodeA-39513|11]
after 2000ms, missing 1 ACKs from (1) test-NodeD-59078
18:27:20,656 TRACE (Timer runner-1,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078:
sending XMIT_REQ ((45): {1-45}) to test-NodeA-39513
18:27:20,756 TRACE (Timer runner-1,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078:
sending XMIT_REQ ((45): {1-45}) to test-NodeA-39513
...
18:28:14,412 TRACE (Timer runner-1,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078:
sending XMIT_REQ ((45): {1-45}) to test-NodeA-39513
18:28:14,513 TRACE (Timer runner-1,test-NodeD-59078:[]) [NAKACK2] test-NodeD-59078:
sending XMIT_REQ ((45): {1-45}) to test-NodeA-39513
18:28:14,589 ERROR (testng-test:[]) [TestSuiteProgress] Test failed:
org.infinispan.conflict.impl.CrashedNodeDuringConflictResolutionTest.testPartitionMergePolicy[DIST_SYNC]
java.lang.RuntimeException: Cache ___defaultcache timed out waiting for rebalancing to
complete on node test-NodeA-39513, current topology is CacheTopology{id=21,
phase=CONFLICT_RESOLUTION, rebalanceId=7,
currentCH=PartitionerConsistentHash:DefaultConsistentHash{ns=256, owners =
(3)[test-NodeD-59078: 256+0, test-NodeA-39513: 0+256, test-NodeB-9439: 0+256]},
pendingCH=null, unionCH=null, actualMembers=[test-NodeD-59078, test-NodeA-39513,
test-NodeB-9439], persistentUUIDs=[828108c4-4251-49fc-9481-ff6392bea9fb,
1d4b6f07-b71b-41a1-adfb-abbe68944a9f, 3a1ece05-c282-433e-9eb5-7b3e0f1932aa]}.
rebalanceInProgress=true, currentChIsBalanced=true
at org.infinispan.test.TestingUtil.waitForNoRebalance(TestingUtil.java:392)
~[test-classes/:?]
at
org.infinispan.conflict.impl.CrashedNodeDuringConflictResolutionTest.performMerge(CrashedNodeDuringConflictResolutionTest.java:113)
~[test-classes/:?]
at
org.infinispan.conflict.impl.BaseMergePolicyTest.testPartitionMergePolicy(BaseMergePolicyTest.java:137)
~[test-classes/:?]
{noformat}