]
Pedro Zapata Fernandez updated ISPN-9588:
-----------------------------------------
Sprint: Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Alpha0, Sprint
10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32, DataGrid Sprint #33, DataGrid
Sprint #34, DataGrid Sprint #35, DataGrid Sprint #36, DataGrid Sprint #37, DataGrid Sprint
#38, DataGrid Sprint #39 (was: Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint
10.0.0.Alpha0, Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32, DataGrid
Sprint #33, DataGrid Sprint #34, DataGrid Sprint #35, DataGrid Sprint #36, DataGrid Sprint
#37, DataGrid Sprint #38)
JGroups fails to install new cluster view after coordinator leave
-----------------------------------------------------------------
Key: ISPN-9588
URL:
https://issues.redhat.com/browse/ISPN-9588
Project: Infinispan
Issue Type: Bug
Components: Core, Test Suite
Affects Versions: 9.4.0.CR3
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Major
Fix For: 9.4.7.Final, 10.0.0.Beta1
Attachments: ISPN-9127_ScatteredCrashInSequenceTest-infinispan-core.log.gz,
ISPN-9127_ThreeWaySplitAndMergeTest-infinispan-core.log.gz,
ThreeWaySplitAndMergeTest.clearContent.log.gz
The core test suite normally shuts down cache managers in the reverse order of their
start, so that the coordinator stops last. This should be just an optimization, to reduce
the number of coordinator changes, and with it the test suite duration and log size.
Unfortunately, the few tests that stop the coordinator first are routinely failing to
stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node
doesn't receive the view:
{noformat}
10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing
view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to
Test-NodeA-8936
10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3)
Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new
coordinator
10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[],
suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2)
[Test-NodeA-8936, Test-NodeB-20661]
10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view
[Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view
[Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request
140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG,
type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0,
currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null,
throwable=null, viewId=33}
10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave
request for cache org.infinispan.CONFIG to coordinator
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
{noformat}
Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the
stopped node and fails:
{noformat}
10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing
view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing
view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to
Test-NodeB-39137
10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view
[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing
view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the
merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2)
[Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3)
[Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message
garbage collection
10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume
task started, max_suspend_time=220000
10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task
Test-NodeB-39137::7 started with 2 participants
10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending
MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected
1 merge response(s) in 5153 ms
10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137:
consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137,
Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137,
Test-NodeC-57941]
consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing
merge view [Test-NodeB-39137|10] (2 members) in 1 coords
10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
Test-NodeB-39137::7 took 5155 ms
10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view
MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups:
[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
{noformat}