[jboss-jira] [JBoss JIRA] (WFLY-10736) Server in cluster hangs during start after previous kill

Kabir Khan (JIRA) issues at jboss.org
Thu Sep 13 06:28:02 EDT 2018


    [ https://issues.jboss.org/browse/WFLY-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632849#comment-13632849 ] 

Kabir Khan commented on WFLY-10736:
-----------------------------------

[~pferraro] I am able to reproduce the issue using the steps in [~mnovak]'s 'Steps to reproduce' section. The following is the last output in the log, which is apparently where it normally hangs:
{code}
10:39:37,482 INFO  [org.jboss.as.ejb3] (MSC service thread 1-4) WFLYEJB0042: Started message driven bean 'mdb' with 'activemq-ra.rar' resource adapter
{code}
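
For context, the 'mdb' bean is presumably a plain inbound JMS MDB along these lines (a hypothetical sketch; the actual bean lives in the eap-tests-hornetq test suite):
{code}
// Hypothetical sketch of the inbound-only MDB deployed via activemq-ra.rar;
// the real bean is part of the eap-tests-hornetq suite.
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination", propertyValue = "jms/queue/InQueue")
})
public class Mdb implements MessageListener {
    @Override
    public void onMessage(Message message) {
        // process the large byte message; the hang occurs during server start,
        // right after WFLYEJB0042 reports this bean as started
    }
}
{code}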

Using both the workaround and the patched JGroups, I end up with the following:
{code}
11:25:10,516 INFO  [org.jboss.as] (MSC service thread 1-2) WFLYSRV0050: JBoss EAP CD 7.2.0.CD14 (WildFly Core 6.0.4.Final-redhat-00001) stopped in 34ms
Sep 13, 2018 11:25:10 AM org.jboss.arquillian.container.test.impl.client.container.ClientContainerController kill
INFO: Hard killing of a server instance
{code}
Here it waits for a while before proceeding with:
{code}
11:26:40,552 Thread-34 INFO  [org.jboss.qa.hornetq.tools.ContainerUtils:94] File exists so appending suffix. Printing thread dump for pid: 28850 to file: /Users/kabir/sourcecontrol/temp/eap-tests-hornetq/scripts/server1/jboss-eap/standalone/log/node-1-thread-dump-before-kill-shutdown-sequence.txt
11:26:40,552 Thread-34 INFO  [org.jboss.qa.hornetq.tools.ContainerUtils:98] Check if thread dump is supported on this platform: Mac OS X
11:26:40,689 Thread-34 WARN  [org.jboss.qa.hornetq.tools.ContainerUtils:146] Print thread dump of process: 28850 was not successful. Return value was 1
{code}
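
Since the automated thread dump was not successful on this platform, one can still be captured manually with jstack while the server hangs, e.g.:
{code}
# 28850 is the server pid from the log above
jstack -l 28850 > node-1-thread-dump-before-kill-shutdown-sequence.txt
{code}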

I think [~rhusar] is able to reproduce this as well, so he can hopefully provide more details.

> Server in cluster hangs during start after previous kill
> --------------------------------------------------------
>
>                 Key: WFLY-10736
>                 URL: https://issues.jboss.org/browse/WFLY-10736
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering
>            Reporter: Miroslav Novak
>            Assignee: Paul Ferraro
>            Priority: Blocker
>              Labels: blocker-WF14
>             Fix For: 15.0.0.Alpha1
>
>         Attachments: Lodh2TestCase.testRemoteJcaInboundOnly-traces.zip, Lodh2TestCase.testRemoteJcaInboundOnly.zip, Lodh2TestCase.testRemoteJcaInboundOnly2.zip, clusterKilTest.zip, logs-traces.zip, logs-traces2.zip, logs-traces3.zip, logs-traces4.zip, logs-with-workaround.zip, node-1-thread-dump-before-kill-shutdown-sequence.txt, server-with-mdb.log, standalone-full-ha-1.xml, standalone-full-ha-2.xml
>
>
> There is a regression in JGroups or Infinispan in one of our tests for fault tolerance of JMS bridges. However, the JMS bridge work itself appears to be unrelated. The issue was hit in the WildFly weekly run.
> Test Scenario:
> * There are two servers: InQueue is deployed on Node 1 and OutQueue is deployed on Node 2. Both servers are started.
> * Large byte messages are sent to InQueue on Node 1. A JMS bridge between the servers/queues transfers the messages from Node 1 to Node 2 (a hypothetical bridge definition is sketched after this list).
> * Node 1 is killed and started again.
> * All messages are received from OutQueue on Node 2.
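>
> For illustration only, such a bridge might be defined via the CLI roughly as follows (a sketch using the messaging-activemq subsystem's jms-bridge resource; the actual configuration is in the attached standalone-full-ha-*.xml files, and the bridge name and values here are hypothetical):
> {code}
> # hypothetical bridge definition; see attached standalone-full-ha-*.xml for the real config
> /subsystem=messaging-activemq/jms-bridge=lodh2-bridge:add( \
>     quality-of-service=ONCE_AND_ONLY_ONCE, failure-retry-interval=1000, max-retries=-1, \
>     max-batch-size=10, max-batch-time=100, \
>     source-connection-factory=java:/ConnectionFactory, source-destination=jms/queue/InQueue, \
>     target-connection-factory=jms/RemoteConnectionFactory, target-destination=jms/queue/OutQueue, \
>     target-context={java.naming.factory.initial=org.wildfly.naming.client.WildFlyInitialContextFactory, \
>         java.naming.provider.url=remote+http://127.0.0.1:9080})
> {code}
>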
> Result:
> Node 1 does not start after the kill and hangs. The following exception is logged on Node 2:
> {code}
> 09:26:17,894 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100000: Node node-1 joined the cluster
> 09:26:18,520 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,521 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,521 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,523 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,868 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) ISPN000310: Starting cluster-wide rebalance for cache default, topology CacheTopology{id=17, phase=READ_OLD_WRITE_ALL, rebalanceId=6, currentCH=ReplicatedConsistentHash{ns = 256, owners = (2)[node-2: 122, node-1: 134]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (3)[node-2: 84, node-1: 90, node-1: 82]}, unionCH=null, actualMembers=[node-2, node-1, node-1], persistentUUIDs=[12443bfb-e88a-46f3-919e-9213bf38ce19, 2873237f-d881-463f-8a5a-940bf1d764e5, a05ea8af-a83b-42a9-b937-dc2da1cae6d1]}
> 09:26:18,869 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-2]ISPN100002: Started rebalance with topology id 17
> 09:26:18,870 INFO  [org.infinispan.CLUSTER] (transport-thread--p14-t5) [Context=default][Scope=node-2]ISPN100003: Node node-2 finished rebalance phase with topology id 17
> 09:26:18,981 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-1]ISPN100003: Node node-1 finished rebalance phase with topology id 17
> 09:27:18,530 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p15-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> 	Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> 		at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> 		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> 		at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> 		at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> 		at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> 		... 1 more
> 	Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		... 1 more
> 	[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> 09:27:18,530 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p16-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> 	Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> 		at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> 		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> 		at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> 		at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> 		at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> 		at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> 		... 1 more
> 	Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> 		at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> 		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> 		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> 		... 1 more
> 	[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> {code}
> The default JGroups UDP stack is configured and is used by Infinispan. Both servers (JGroups UDP) are bound to 127.0.0.1. Node 2 has a port offset of 1000.
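>
> For illustration, the second node might be started along these lines (standard WildFly arguments; the exact invocation used by the test scripts may differ):
> {code}
> # hypothetical invocation for node 2 (port offset 1000, so e.g. 8080 -> 9080)
> ./standalone.sh -c standalone-full-ha-2.xml -Djboss.node.name=node-2 \
>     -Djboss.bind.address=127.0.0.1 -Djboss.bind.address.private=127.0.0.1 \
>     -Djboss.socket.binding.port-offset=1000
> {code}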
> Attaching a thread dump from Node 1 taken while it hangs during start.


