Miroslav Novak commented on WFLY-10736:
---------------------------------------
[~pferraro] This use case does not seem to be so uncommon in production. If any
liveness check of the server fails, there are scripts which simply restart the server. Yes,
they can try a clean shutdown first, but if the server does not shut down for any reason
within the given timeout, they kill and start the server again, which is the same scenario.
This is the default behaviour of our WildFly/EAP CD docker images for OpenShift: if the
server in the pod crashes, the server in the pod is restarted immediately with the same node name.
Also, this issue is a regression. Tests with this kill-and-restart scenario have been
executed since EAP 6, and there were no issues until now. The default configuration for
Infinispan and JGroups is used.
I tried the workaround:
{{/subsystem=infinispan/cache-container=server/replicated-cache=default/component=state-transfer:write-attribute(name=timeout,
value=0)}}
and it seems to work. I can still see the exception on node-2; however, node-1 no longer
hangs.
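For reference, the CLI workaround above should correspond to a persisted change in the infinispan subsystem of the standalone configuration roughly like the following fragment (a sketch; surrounding attributes of the cache are omitted and assumed unchanged):

{code:xml}
<cache-container name="server">
    <replicated-cache name="default">
        <!-- timeout=0 disables the state-transfer timeout that was being hit -->
        <state-transfer timeout="0"/>
    </replicated-cache>
</cache-container>
{code}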
cc-ing [~bmaxwell] [~msvehla] to let them know as well.
Server in cluster hangs during start after previous kill
--------------------------------------------------------
Key: WFLY-10736
URL:
https://issues.jboss.org/browse/WFLY-10736
Project: WildFly
Issue Type: Bug
Components: Clustering
Reporter: Miroslav Novak
Assignee: Paul Ferraro
Priority: Blocker
Labels: blocker-WF14
Attachments: clusterKilTest.zip,
node-1-thread-dump-before-kill-shutdown-sequence.txt, standalone-full-ha-1.xml,
standalone-full-ha-2.xml
There is a regression in JGroups or Infinispan in one of our tests for fault tolerance of
JMS bridges. However, the work on the JMS bridge appears to be unrelated. The issue was hit
in a WildFly weekly run.
Test Scenario:
* There are two servers: InQueue is deployed on Node 1, OutQueue is deployed on Node 2.
Both servers are started.
* Large byte messages are sent to InQueue deployed on Node 1. A bridge between the
servers/queues transfers messages from Node 1 to Node 2.
* Node 1 is killed and started again.
* All messages are received from OutQueue deployed on Node 2.
Result:
Node 1 does not start after the kill and hangs. The following exception is logged on node
2:
{code}
09:26:17,894 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100000: Node node-1
joined the cluster
09:26:18,520 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received
new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,521 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node
node-1 left the cluster
09:26:18,521 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received
new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node
node-1 left the cluster
09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received
new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node
node-1 left the cluster
09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received
new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,523 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node
node-1 left the cluster
09:26:18,868 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2) ISPN000310: Starting
cluster-wide rebalance for cache default, topology CacheTopology{id=17,
phase=READ_OLD_WRITE_ALL, rebalanceId=6, currentCH=ReplicatedConsistentHash{ns = 256,
owners = (2)[node-2: 122, node-1: 134]}, pendingCH=ReplicatedConsistentHash{ns = 256,
owners = (3)[node-2: 84, node-1: 90, node-1: 82]}, unionCH=null, actualMembers=[node-2,
node-1, node-1], persistentUUIDs=[12443bfb-e88a-46f3-919e-9213bf38ce19,
2873237f-d881-463f-8a5a-940bf1d764e5, a05ea8af-a83b-42a9-b937-dc2da1cae6d1]}
09:26:18,869 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2)
[Context=default][Scope=node-2]ISPN100002: Started rebalance with topology id 17
09:26:18,870 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t5)
[Context=default][Scope=node-2]ISPN100003: Node node-2 finished rebalance phase with
topology id 17
09:26:18,981 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2)
[Context=default][Scope=node-1]ISPN100003: Node node-1 finished rebalance phase with
topology id 17
09:27:18,530 WARN [org.infinispan.topology.ClusterTopologyManagerImpl]
(transport-thread--p15-t4) ISPN000197: Error updating cluster member list:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 4 from node-1
at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[rt.jar:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
Suppressed: java.util.concurrent.ExecutionException:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 4 from node-1
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
[rt.jar:1.8.0_131]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
[rt.jar:1.8.0_131]
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
at
org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
at
org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
at
org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
at
org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
at
org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[rt.jar:1.8.0_131]
at
org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
... 1 more
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out
waiting for responses for request 4 from node-1
at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[rt.jar:1.8.0_131]
... 1 more
[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 4 from node-1]
09:27:18,530 WARN [org.infinispan.topology.ClusterTopologyManagerImpl]
(transport-thread--p16-t4) ISPN000197: Error updating cluster member list:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 4 from node-1
at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[rt.jar:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
Suppressed: java.util.concurrent.ExecutionException:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 4 from node-1
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
[rt.jar:1.8.0_131]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
[rt.jar:1.8.0_131]
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
at
org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
at
org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
at
org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
at
org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
at
org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[rt.jar:1.8.0_131]
at
org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
... 1 more
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out
waiting for responses for request 4 from node-1
at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[rt.jar:1.8.0_131]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[rt.jar:1.8.0_131]
... 1 more
[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 4 from node-1]
{code}
The default JGroups UDP stack is configured and used by Infinispan. Both servers (jgroups
udp) are bound to 127.0.0.1. Node 2 has a port offset of 1000.
Attaching a thread dump from node 1, taken while it hangs during start.