[jboss-jira] [JBoss JIRA] (WFLY-10736) Server in cluster hangs during start after previous kill
Miroslav Novak (JIRA)
issues at jboss.org
Thu Sep 13 07:26:02 EDT 2018
[ https://issues.jboss.org/browse/WFLY-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632878#comment-13632878 ]
Miroslav Novak commented on WFLY-10736:
---------------------------------------
Thanks [~kabirkhan] for running the reproducer and confirming that the issue is still there (with the VERIFY_SUSPECT workaround and on the custom JGroups branch).
[~pferraro] [~rhusar] [~kabirkhan]
I’ll update “Steps to Reproduce:” and provide a link to a Jenkins job which will automatically run the reproducer against your custom EAP/WF zip, in case you cannot reproduce it locally, and will verify that your fix is correct. Still, I have info from [~kabirkhan] that he had no problem reproducing it locally.
??The fact that the restarted pod has the same logical name is inconsequential - JGroups member identity is based on UUID, not logical name. AIUI, Kubernetes dynamically assigns addresses to pods when they start, so it would most likely have a distinct physical address...??
[~pferraro] Unfortunately that’s not correct. As mentioned, the pod itself is not restarted and remains the same. Also, the restarted docker container with EAP in the pod keeps the same MAC address.
Check the following snippet from @maschmidt. He logs into the docker container in the pod and prints the MAC address, which is {{0a:58:0a:83:01:97}}. Then he kills EAP and waits until the docker container with EAP is restarted. He logs into the restarted docker container again, checks the MAC address, and it’s still the same {{0a:58:0a:83:01:97}}:
{code}
[maschmid at archie master]$ oc get pod hsc-1-ch95s
NAME READY STATUS RESTARTS AGE
hsc-1-ch95s 1/1 Running 0 9m
[maschmid at archie master]$ oc rsh hsc-1-ch95s
sh-4.2$ cat /sys/class/net/eth0/address
0a:58:0a:83:01:97
sh-4.2$ jps
433 jboss-modules.jar
2435 Jps
sh-4.2$ kill -9 433
sh-4.2$ command terminated with exit code 137
[maschmid at archie master]$
[maschmid at archie master]$ oc get pod hsc-1-ch95s
NAME READY STATUS RESTARTS AGE
hsc-1-ch95s 0/1 Error 0 9m
[maschmid at archie master]$ oc get pod hsc-1-ch95s
NAME READY STATUS RESTARTS AGE
hsc-1-ch95s 0/1 Running 1 10m
[maschmid at archie master]$ oc rsh hsc-1-ch95s
sh-4.2$ cat /sys/class/net/eth0/address
0a:58:0a:83:01:97
{code}
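The pod IP, which is what the JGroups physical address is actually derived from, can be checked the same way. Because only the container inside the pod is restarted (not the pod itself), the pod typically keeps its IP as well. A minimal sketch reusing the pod name from the snippet above; these are standard {{oc}} commands, not taken from @maschmidt's session:
{code}
# pod IP before the kill (shown in the IP column)
oc get pod hsc-1-ch95s -o wide

# ...kill EAP inside the container as above and wait for the restart...

# pod IP after the container restart - unchanged, since the pod was not recreated
oc get pod hsc-1-ch95s -o wide
{code}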
Thus this issue affects the OpenShift use case and remains a blocker for CD14.
> Server in cluster hangs during start after previous kill
> --------------------------------------------------------
>
> Key: WFLY-10736
> URL: https://issues.jboss.org/browse/WFLY-10736
> Project: WildFly
> Issue Type: Bug
> Components: Clustering
> Reporter: Miroslav Novak
> Assignee: Paul Ferraro
> Priority: Blocker
> Labels: blocker-WF14
> Fix For: 15.0.0.Alpha1
>
> Attachments: Lodh2TestCase.testRemoteJcaInboundOnly-traces.zip, Lodh2TestCase.testRemoteJcaInboundOnly.zip, Lodh2TestCase.testRemoteJcaInboundOnly2.zip, clusterKilTest.zip, logs-traces.zip, logs-traces2.zip, logs-traces3.zip, logs-traces4.zip, logs-with-workaround.zip, node-1-thread-dump-before-kill-shutdown-sequence.txt, server-with-mdb.log, standalone-full-ha-1.xml, standalone-full-ha-2.xml
>
>
> There is a regression in JGroups or Infinispan in one of our tests for fault tolerance of JMS bridges. However, the work on JMS bridges appears to be unrelated. The issue was hit in the WF weekly run.
> Test Scenario:
> * There are two servers: InQueue is deployed on Node 1, OutQueue is deployed on Node 2. Both servers are started.
> * Large byte messages are sent to InQueue deployed on Node 1. A bridge between the servers/queues transfers messages from Node 1 to Node 2.
> * Node 1 is killed and started again (see the sketch after this list).
> * All messages are received from OutQueue deployed on Node 2.
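> A minimal sketch of the kill-and-restart step, assuming a local standalone WildFly/EAP installation ({{$JBOSS_HOME_1}} is an illustrative placeholder; the exact scripts used by the test may differ):
> {code}
> # kill the WildFly/EAP process hard (jps lists it as jboss-modules.jar)
> kill -9 $(jps | grep jboss-modules.jar | awk '{print $1}')
>
> # start Node 1 again with the same configuration
> $JBOSS_HOME_1/bin/standalone.sh -c standalone-full-ha-1.xml
> {code}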
> Result:
> Node 1 does not start after the kill and hangs. The following exception is logged on Node 2:
> {code}
> 09:26:17,894 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100000: Node node-1 joined the cluster
> 09:26:18,520 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,521 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,521 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,523 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,868 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2) ISPN000310: Starting cluster-wide rebalance for cache default, topology CacheTopology{id=17, phase=READ_OLD_WRITE_ALL, rebalanceId=6, currentCH=ReplicatedConsistentHash{ns = 256, owners = (2)[node-2: 122, node-1: 134]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (3)[node-2: 84, node-1: 90, node-1: 82]}, unionCH=null, actualMembers=[node-2, node-1, node-1], persistentUUIDs=[12443bfb-e88a-46f3-919e-9213bf38ce19, 2873237f-d881-463f-8a5a-940bf1d764e5, a05ea8af-a83b-42a9-b937-dc2da1cae6d1]}
> 09:26:18,869 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-2]ISPN100002: Started rebalance with topology id 17
> 09:26:18,870 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t5) [Context=default][Scope=node-2]ISPN100003: Node node-2 finished rebalance phase with topology id 17
> 09:26:18,981 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-1]ISPN100003: Node node-1 finished rebalance phase with topology id 17
> 09:27:18,530 WARN [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p15-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> ... 1 more
> Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> ... 1 more
> [CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> 09:27:18,530 WARN [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p16-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> ... 1 more
> Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> ... 1 more
> [CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> {code}
> The default JGroups udp stack is configured, which is used by Infinispan. Both servers (JGroups udp) are bound to 127.0.0.1. Node 2 has a port offset of 1000.
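> For reference, a hedged sketch of how the two servers would typically be started with these bindings ({{$JBOSS_HOME_1}}/{{$JBOSS_HOME_2}} are illustrative placeholders; the attached configurations are used):
> {code}
> # Node 1: default ports, public and private (JGroups) interfaces on loopback
> $JBOSS_HOME_1/bin/standalone.sh -c standalone-full-ha-1.xml -b 127.0.0.1 -bprivate=127.0.0.1
>
> # Node 2: same binding, all socket bindings shifted by 1000
> $JBOSS_HOME_2/bin/standalone.sh -c standalone-full-ha-2.xml -b 127.0.0.1 -bprivate=127.0.0.1 -Djboss.socket.binding.port-offset=1000
> {code}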
> Attaching a thread dump from Node 1, taken while it hangs during start.
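> Such a thread dump can be captured with standard JDK tooling once the PID of the hanging server is known, for example:
> {code}
> # find the PID of the hanging server
> jps | grep jboss-modules.jar
>
> # capture the thread dump (either variant works)
> jstack <pid> > thread-dump.txt
> kill -3 <pid>   # dump is written to the server's standard output
> {code}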
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)