]
Dan Berindei edited comment on ISPN-9588 at 4/3/19 9:54 AM:
------------------------------------------------------------
We assumed that upgrading to JGroups 4.0.18.Final, which includes JGRP-2293 and JGRP-2324,
would fix the issue.
Apparently not, as a 9.4.x pull request build hanged in
{{ThreeWaySplitAndMergeTest.clearContent}} with the same symptoms:
{noformat}
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [DefaultCacheManager] Stopping
cache manager ThreeWaySplitAndMergeTest-NodeC on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache
org.infinispan.CONFIG on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,514 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache
___defaultcache on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,515 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS]
ThreeWaySplitAndMergeTest-NodeA-41022: members are (3)
ThreeWaySplitAndMergeTest-NodeC-48886,ThreeWaySplitAndMergeTest-NodeA-41022,ThreeWaySplitAndMergeTest-NodeB-21968,
coord=ThreeWaySplitAndMergeTest-NodeA-41022: I'm the new coordinator
17:11:23,522 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS]
ThreeWaySplitAndMergeTest-NodeA-41022: installing view
[ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022,
ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,523 INFO (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [CLUSTER]
ISPN000094: Received new cluster view for channel ISPN:
[ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022,
ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,526 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [TCP_NIO2] closing sockets and
stopping threads
17:11:25,525 WARN (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS]
ThreeWaySplitAndMergeTest-NodeA-41022: failed to collect all ACKs (expected=1) for view
[ThreeWaySplitAndMergeTest-NodeA-41022|34] after 2000ms, missing 1 ACKs from (1)
ThreeWaySplitAndMergeTest-NodeB-21968
17:15:23,558 ERROR
(transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER]
ISPN000196: Failed to recover cluster state after the current node became the coordinator
(or after merge)
java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException:
ISPN000476: Timed out waiting for responses for request 153 from
ThreeWaySplitAndMergeTest-NodeB-21968
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022) ~[?:?]
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:612)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:452)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.becomeCoordinator(ClusterTopologyManagerImpl.java:336)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:315)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:88)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:758)
~[classes/:?]
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting
for responses for request 153 from ThreeWaySplitAndMergeTest-NodeB-21968
at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
~[classes/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
~[?:?]
... 3 more
17:15:23,558 FATAL
(transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER]
ISPN100004: After merge (or coordinator change), the coordinator failed to recover
cluster. Cluster members are [ThreeWaySplitAndMergeTest-NodeA-41022,
ThreeWaySplitAndMergeTest-NodeB-21968].
{noformat}
was (Author: dan.berindei):
We assumed that upgrading to JGroups 4.0.18.Final, which includex JGRP-2324, would fix the
issue.
Apparently not, as a 9.4.x pull request build hanged in
{{ThreeWaySplitAndMergeTest.clearContent}} with the same symptoms:
{noformat}
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [DefaultCacheManager] Stopping
cache manager ThreeWaySplitAndMergeTest-NodeC on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache
org.infinispan.CONFIG on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,514 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache
___defaultcache on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,515 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS]
ThreeWaySplitAndMergeTest-NodeA-41022: members are (3)
ThreeWaySplitAndMergeTest-NodeC-48886,ThreeWaySplitAndMergeTest-NodeA-41022,ThreeWaySplitAndMergeTest-NodeB-21968,
coord=ThreeWaySplitAndMergeTest-NodeA-41022: I'm the new coordinator
17:11:23,522 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS]
ThreeWaySplitAndMergeTest-NodeA-41022: installing view
[ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022,
ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,523 INFO (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [CLUSTER]
ISPN000094: Received new cluster view for channel ISPN:
[ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022,
ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,526 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [TCP_NIO2] closing sockets and
stopping threads
17:11:25,525 WARN (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS]
ThreeWaySplitAndMergeTest-NodeA-41022: failed to collect all ACKs (expected=1) for view
[ThreeWaySplitAndMergeTest-NodeA-41022|34] after 2000ms, missing 1 ACKs from (1)
ThreeWaySplitAndMergeTest-NodeB-21968
17:15:23,558 ERROR
(transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER]
ISPN000196: Failed to recover cluster state after the current node became the coordinator
(or after merge)
java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException:
ISPN000476: Timed out waiting for responses for request 153 from
ThreeWaySplitAndMergeTest-NodeB-21968
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022) ~[?:?]
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:612)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:452)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.becomeCoordinator(ClusterTopologyManagerImpl.java:336)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:315)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:88)
~[classes/:?]
at
org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:758)
~[classes/:?]
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting
for responses for request 153 from ThreeWaySplitAndMergeTest-NodeB-21968
at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
~[classes/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
~[?:?]
... 3 more
17:15:23,558 FATAL
(transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER]
ISPN100004: After merge (or coordinator change), the coordinator failed to recover
cluster. Cluster members are [ThreeWaySplitAndMergeTest-NodeA-41022,
ThreeWaySplitAndMergeTest-NodeB-21968].
{noformat}
JGroups fails to install new cluster view after coordinator leave
-----------------------------------------------------------------
Key: ISPN-9588
URL:
https://issues.jboss.org/browse/ISPN-9588
Project: Infinispan
Issue Type: Bug
Components: Core, Test Suite - Core
Affects Versions: 9.4.0.CR3
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Major
Attachments: ISPN-9127_ScatteredCrashInSequenceTest-infinispan-core.log.gz,
ISPN-9127_ThreeWaySplitAndMergeTest-infinispan-core.log.gz
The core test suite normally shuts down cache managers in the reverse order of their
start, so that the coordinator stops last. This should be just an optimization, to reduce
the number of coordinator changes, and with it the test suite duration and log size.
Unfortunately, the few tests that stop the coordinator first are routinely failing to
stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node
doesn't receive the view:
{noformat}
10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing
view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to
Test-NodeA-8936
10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3)
Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new
coordinator
10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[],
suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2)
[Test-NodeA-8936, Test-NodeB-20661]
10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view
[Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view
[Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request
140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG,
type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0,
currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null,
throwable=null, viewId=33}
10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave
request for cache org.infinispan.CONFIG to coordinator
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
{noformat}
Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the
stopped node and fails:
{noformat}
10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing
view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing
view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to
Test-NodeB-39137
10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view
[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing
view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the
merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2)
[Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3)
[Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message
garbage collection
10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume
task started, max_suspend_time=220000
10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task
Test-NodeB-39137::7 started with 2 participants
10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending
MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected
1 merge response(s) in 5153 ms
10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137:
consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137,
Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137,
Test-NodeC-57941]
consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing
merge view [Test-NodeB-39137|10] (2 members) in 1 coords
10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
Test-NodeB-39137::7 took 5155 ms
10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view
MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups:
[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge
leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing
responses from 1 members, removing them from the merge
{noformat}