[jboss-jira] [JBoss JIRA] (WFLY-10736) Server in cluster hangs during start after previous kill

Paul Ferraro (JIRA) issues at jboss.org
Mon Jul 30 08:08:02 EDT 2018


    [ https://issues.jboss.org/browse/WFLY-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612090#comment-13612090 ] 

Paul Ferraro commented on WFLY-10736:
-------------------------------------

[~mnovak]
{quote}This use case does not seem to be so uncommon in production. If any liveness check of the server fails, there are scripts which simply restart the server. Yes, they can try a clean shutdown first, but if the server does not shut down for any reason within the given timeout, they will kill and start the server again, which is the same scenario.{quote}

In the same way that we recommend staggering the startup of cluster nodes to facilitate the cluster joining process, those scripts should also be modified to include a brief delay between kill and start.
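For illustration, a minimal sketch of such a recovery script; the install path, PID lookup, and the length of the delay are all assumptions, not a tested recommendation:
{code}
#!/bin/sh
# Hypothetical liveness-recovery script: kill the unresponsive server,
# then pause briefly before restarting so the surviving nodes can
# install a new cluster view without the killed member.
JBOSS_HOME=/opt/wildfly                    # assumed install path
PID=$(pgrep -f 'jboss.node.name=node-1')   # assumed node name

kill -9 "$PID"

# Brief delay between kill and start; the exact length is an assumption.
sleep 30

"$JBOSS_HOME/bin/standalone.sh" -c standalone-full-ha-1.xml &
{code}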

{quote}
Also, this issue is a regression. Tests with this kill-and-restart scenario have been executed since EAP 6 and there were no issues until now. The default configuration is used for Infinispan and JGroups.
{quote}

I acknowledge that this might be a regression; however, I'm trying to determine whether the workaround is sufficient to lower the priority from blocker. State-transfer timeouts causing systems to fail to start have plagued customers in the past. We exposed the ability to configure state-transfer in a non-blocking fashion for that reason.
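For reference, a minimal sketch of that non-blocking configuration: set the state-transfer timeout to 0 on the affected cache. The cache-container and cache names below are assumptions based on the 'ejb' channel and 'default' cache in the logs; adjust them to the attached configuration:
{code}
# Set the state-transfer timeout to 0 so a joining node does not block
# startup waiting for initial state (container/cache names are illustrative):
bin/jboss-cli.sh --connect \
  --command="/subsystem=infinispan/cache-container=ejb/replicated-cache=default/component=state-transfer:write-attribute(name=timeout,value=0)"

# Reload for the change to take effect
bin/jboss-cli.sh --connect --command=":reload"
{code}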

{quote}
I tried the workaround [...] and it seems to work. I can still see the exception on node-2; however, node-1 does not hang any more.
{quote}

You will always see clustering-related exceptions following a failure condition. However, I want to make sure that the system recovers adequately (i.e. beyond not hanging on startup). Can you attach logs from node-1 & node-2 for this run?

> Server in cluster hangs during start after previous kill
> --------------------------------------------------------
>
>                 Key: WFLY-10736
>                 URL: https://issues.jboss.org/browse/WFLY-10736
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering
>            Reporter: Miroslav Novak
>            Assignee: Paul Ferraro
>            Priority: Blocker
>              Labels: blocker-WF14
>         Attachments: clusterKilTest.zip, node-1-thread-dump-before-kill-shutdown-sequence.txt, standalone-full-ha-1.xml, standalone-full-ha-2.xml
>
>
> There is a regression in JGroups or Infinispan in one of our tests for fault tolerance of JMS bridges; however, the JMS bridge work itself appears to be unrelated. The issue was hit in the WF weekly run.
> Test Scenario:
> * There are two servers: InQueue is deployed on Node 1 and OutQueue is deployed on Node 2. Both servers are started.
> * Large byte messages are sent to InQueue on Node 1. A bridge between the servers/queues transfers messages from Node 1 to Node 2.
> * Node 1 is killed and started again.
> * All messages are received from OutQueue on Node 2.
> Result:
> Node 1 does not start after the kill and hangs. The following exception is logged on node 2:
> {code}
> 09:26:17,894 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100000: Node node-1 joined the cluster
> 09:26:18,520 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,521 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,521 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,523 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,868 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) ISPN000310: Starting cluster-wide rebalance for cache default, topology CacheTopology{id=17, phase=READ_OLD_WRITE_ALL, rebalanceId=6, currentCH=ReplicatedConsistentHash{ns = 256, owners = (2)[node-2: 122, node-1: 134]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (3)[node-2: 84, node-1: 90, node-1: 82]}, unionCH=null, actualMembers=[node-2, node-1, node-1], persistentUUIDs=[12443bfb-e88a-46f3-919e-9213bf38ce19, 2873237f-d881-463f-8a5a-940bf1d764e5, a05ea8af-a83b-42a9-b937-dc2da1cae6d1]}
> 09:26:18,869 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-2]ISPN100002: Started rebalance with topology id 17
> 09:26:18,870 INFO  [org.infinispan.CLUSTER] (transport-thread--p14-t5) [Context=default][Scope=node-2]ISPN100003: Node node-2 finished rebalance phase with topology id 17
> 09:26:18,981 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-1]ISPN100003: Node node-1 finished rebalance phase with topology id 17
> 09:27:18,530 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p15-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> 	Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> 		at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> 		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> 		at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> 		at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> 		at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> 		... 1 more
> 	Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		... 1 more
> 	[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> 09:27:18,530 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p16-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> 	Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> 		at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> 		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> 		at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> 		at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> 		at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> 		... 1 more
> 	Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		... 1 more
> 	[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> {code}
> The default JGroups UDP stack is configured and used by Infinispan. Both servers (JGroups UDP) are bound to 127.0.0.1. Node 2 has a port offset of 1000.
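> For illustration, the two nodes can be started along these lines (paths and property values are assumptions, not the exact test harness commands):
> {code}
> # node 1: default loopback binding, default ports (assumed invocation)
> bin/standalone.sh -c standalone-full-ha-1.xml -Djboss.node.name=node-1
>
> # node 2: same host, every socket binding shifted by 1000
> bin/standalone.sh -c standalone-full-ha-2.xml -Djboss.node.name=node-2 \
>     -Djboss.socket.binding.port-offset=1000
> {code}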
> Attaching a thread dump from node 1, taken while it hangs during start.


