Tristan Tarrant updated ISPN-9801:
----------------------------------
Sprint: Sprint 10.0.0.Alpha2, Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32, DataGrid Sprint #33 (was: Sprint 10.0.0.Alpha2, Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32)
ClusterTopologyManagerImpl hangs when restarting a node with FORK
-----------------------------------------------------------------
Key: ISPN-9801
URL: https://issues.jboss.org/browse/ISPN-9801
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 10.0.0.Alpha1, 9.4.3.Final
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Major
Fix For: 9.4.7.Final, 10.0.0.Beta1
When a server is restarted with {{kill -9}} or similar, both the old node and the new one can be in the JGroups view for a while. Normally this isn't a problem, but sometimes the new node doesn't receive the {{HeartBeatCommand}}, and the coordinator then blocks waiting for the response and cannot process any new view updates.
{noformat}
14:29:19,981 INFO (jgroups-12,Test-NodeA:[]) [CLUSTER] ISPN000094: Received new cluster view for channel FORKISPN: [Test-NodeA|4] (5) [Test-NodeA, Test-NodeB, Test-NodeC, Test-NodeD, Test-NodeE]
14:29:19,982 TRACE (transport-thread-Test-NodeA-p4-t14:[ViewHandling]) [ClusterTopologyManagerImpl] Updating cluster members for all the caches. New list is [Test-NodeA, Test-NodeB, Test-NodeC, Test-NodeD, Test-NodeE]
14:29:19,982 TRACE (transport-thread-Test-NodeA-p4-t14:[ViewHandling]) [JGroupsTransport] Test-NodeA sending request 9 to all: org.infinispan.topology.HeartBeatCommand@1163beb6
14:29:19,986 TRACE (jgroups-6,Test-NodeA:[]) [JGroupsTransport] Test-NodeA received response for request 9 from Test-NodeC: SuccessfulResponse(null)
14:29:19,987 TRACE (jgroups-9,Test-NodeA:[]) [JGroupsTransport] Test-NodeA received response for request 9 from Test-NodeD: SuccessfulResponse(null)
14:29:20,032 TRACE (jgroups-6,Test-NodeE:[]) [TCP_NIO2] Test-NodeE: received message batch of 1 messages from Test-NodeA
14:29:20,032 TRACE (jgroups-6,Test-NodeE:[]) [NAKACK2] Test-NodeE: message Test-NodeA::39 was added to queue (not yet server)
14:29:20,054 TRACE (jgroups-6,Test-NodeE:[]) [NAKACK2] Test-NodeE: received Test-NodeA#38
14:29:20,054 TRACE (jgroups-6,Test-NodeE:[]) [NAKACK2] Test-NodeE: delivering Test-NodeA#38
# not actually delivered :)
14:29:20,054 TRACE (jgroups-6,Test-NodeE:[]) [MFC] Test-NodeA used 5 credits, 1999995 remaining
14:29:20,149 INFO (ForkThread-1,ForkChannelRestartTest:[]) [CLUSTER] ISPN000094: Received new cluster view for channel FORKISPN: [Test-NodeA|4] (5) [Test-NodeA, Test-NodeB, Test-NodeC, Test-NodeD, Test-NodeE]
14:29:21,119 DEBUG (testng-Test-1:[]) [ForkChannelRestartTest] Stopping channel Test-NodeB
14:29:23,319 INFO (VERIFY_SUSPECT.TimerThread-32,Test-NodeA:[]) [CLUSTER] ISPN000094: Received new cluster view for channel FORKISPN: [Test-NodeA|5] (4) [Test-NodeA, Test-NodeC, Test-NodeD, Test-NodeE]
14:29:23,320 TRACE (remote-thread-Test-NodeA-p2-t1:[]) [MultiTargetRequest] Target Test-NodeB of request 9 left the cluster view
{noformat}
So far, it looks like it's a JGroups bug similar to JGRP-2294, but we need to
investigate further.
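To illustrate why a single missing response is enough to block view handling, here is a minimal Java sketch. It assumes, as the {{MultiTargetRequest}} trace line suggests, that a request sent to all members completes only once every target has either responded or left the cluster view. The class and method names ({{MultiTargetRequestSketch}}, {{onResponse}}, {{onViewChange}}) are hypothetical and are not Infinispan's actual implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, NOT Infinispan's MultiTargetRequest: a request sent to
// several targets completes only when every target has either responded or
// disappeared from the cluster view.
class MultiTargetRequestSketch {
    private final Set<String> pending;
    private final CompletableFuture<Void> done = new CompletableFuture<>();

    MultiTargetRequestSketch(List<String> targets) {
        this.pending = new HashSet<>(targets);
        if (pending.isEmpty()) {
            done.complete(null);
        }
    }

    // A target replied (e.g. SuccessfulResponse(null) to the heartbeat).
    synchronized void onResponse(String sender) {
        removeTarget(sender);
    }

    // A new view was installed: targets missing from it will never reply,
    // so stop waiting for them ("Target ... left the cluster view").
    synchronized void onViewChange(List<String> members) {
        for (String target : new HashSet<>(pending)) {
            if (!members.contains(target)) {
                removeTarget(target);
            }
        }
    }

    // Only called while holding the lock (or from the constructor).
    private void removeTarget(String target) {
        if (pending.remove(target) && pending.isEmpty()) {
            done.complete(null);
        }
    }

    synchronized Set<String> pendingTargets() {
        return new HashSet<>(pending);
    }

    // The view-handling thread would wait on this future before processing
    // the next view; if it never completes, view handling is stuck.
    CompletableFuture<Void> future() {
        return done;
    }
}
```

Replaying the trace above against this model: Test-NodeC and Test-NodeD respond, and view 5 removes Test-NodeB, but no response from Test-NodeE appears, so Test-NodeE stays pending and the future never completes while Test-NodeE remains in the view.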