[jboss-jira] [JBoss JIRA] (WFLY-10736) Server in cluster hangs during start after previous kill
Paul Ferraro (JIRA)
issues at jboss.org
Wed Sep 12 17:48:01 EDT 2018
[ https://issues.jboss.org/browse/WFLY-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632569#comment-13632569 ]
Paul Ferraro edited comment on WFLY-10736 at 9/12/18 5:47 PM:
--------------------------------------------------------------
Again, the source of the problem is the following:
06:03:12,437 INFO [org.infinispan.CLUSTER] (thread-8,ejb,node-1) ISPN000094: Received new cluster view for channel ejb: [node-1|2] (3) [node-1, node-2, node-2]
06:03:12,437 INFO [org.infinispan.CLUSTER] (thread-8,ejb,node-1) ISPN100000: Node node-2 joined the cluster
06:03:12,441 INFO [org.infinispan.CLUSTER] (thread-8,ejb,node-1) ISPN000094: Received new cluster view for channel ejb: [node-1|2] (3) [node-1, node-2, node-2]
06:03:12,441 INFO [org.infinispan.CLUSTER] (thread-8,ejb,node-1) ISPN100000: Node node-2 joined the cluster
06:03:12,442 INFO [org.infinispan.CLUSTER] (thread-8,ejb,node-1) ISPN000094: Received new cluster view for channel ejb: [node-1|2] (3) [node-1, node-2, node-2]
06:03:12,442 INFO [org.infinispan.CLUSTER] (thread-8,ejb,node-1) ISPN100000: Node node-2 joined the cluster
06:03:12,442 INFO [org.infinispan.CLUSTER] (thread-8,ejb,node-1) ISPN000094: Received new cluster view for channel ejb: [node-1|2] (3) [node-1, node-2, node-2]
Here's what I believe is happening:
1. FD_SOCK detects very quickly that node-2 failed, because its socket closed.
2. FD_SOCK sends a SUSPECT event up the stack.
3. node-2 restarts, reusing the same physical address.
4. The SUSPECT event is handled by VERIFY_SUSPECT, which sends a message to the suspected member.
5. The restarted server receives that message (intended for the old member) and replies with an I_AM_NOT_DEAD response, so the killed member never leaves the view.
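To make steps 4 and 5 concrete, here is a purely illustrative model - plain Java, not JGroups code and not the JGRP-2286 fix; the names, addresses, and verification logic are invented. It shows why a check of the form "did anything at the suspect's physical address respond?" is fooled by a reincarnated member, while comparing the responder's identity to the suspected UUID is not:
{code:java}
import java.util.Map;
import java.util.UUID;

// Illustrative model only -- this is NOT JGroups code and NOT the JGRP-2286 fix.
public class SuspectVerificationModel {

    record Member(UUID id, String physicalAddress) {}

    // Failure mode from steps 4-5: "someone at the suspect's physical address
    // answered", so the suspicion is dropped even though the responder is a
    // different (reincarnated) member.
    static boolean verifiedAliveByPhysicalAddress(Member suspect, Map<String, Member> live) {
        return live.containsKey(suspect.physicalAddress());
    }

    // Checking the responder's identity avoids that: the restarted node-2 has a
    // new UUID, so the killed member stays suspected and is removed from the view.
    static boolean verifiedAliveByIdentity(Member suspect, Map<String, Member> live) {
        Member responder = live.get(suspect.physicalAddress());
        return responder != null && responder.id().equals(suspect.id());
    }

    public static void main(String[] args) {
        Member killedNode2 = new Member(UUID.randomUUID(), "127.0.0.1:55200");    // killed instance
        Member restartedNode2 = new Member(UUID.randomUUID(), "127.0.0.1:55200"); // restarted, same physical address
        Map<String, Member> live = Map.of(restartedNode2.physicalAddress(), restartedNode2);

        System.out.println(verifiedAliveByPhysicalAddress(killedNode2, live)); // true  -> old member never leaves the view
        System.out.println(verifiedAliveByIdentity(killedNode2, live));        // false -> old member removed as expected
    }
}
{code}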
Ideally, VERIFY_SUSPECT shouldn't be prone to false positives like this. I've opened https://issues.jboss.org/browse/JGRP-2286 to solve this problem in JGroups directly. I have a fix prepared, and am in the process of verifying it locally.
In the meantime, I can think of a couple of workarounds:
1. Remove VERIFY_SUSPECT from the stack (a rough sketch of this appears below, after the suggested snippet). This will certainly solve the problem, but it has negative consequences: suspected members are removed from the view immediately, which increases the likelihood that false positives from failure detection remove healthy members. If this proves too problematic, we can mitigate it somewhat by increasing the timeouts used by FD_ALL.
2. Decrease the timeout used by VERIFY_SUSPECT. This shrinks the window in which a reincarnated member can join before the previously killed member is removed by VERIFY_SUSPECT. The timeout is currently set to 2 seconds; given that WF starts very quickly, we can try reducing it. The trade-off is that a shorter timeout also gives a suspected member less time to respond, so an overloaded member can still be removed from the view too quickly. That said, this option is still preferable to #1. I'll try to verify locally whether this is sufficient to reduce the likelihood of the above issue to an acceptable level.
[~mnovak] Can you try the following modification to the JGroups subsystem to see if this adequately works around the problem?
{noformat}
<protocol type="VERIFY_SUSPECT">
<property name="timeout">1000</property>
</protocol>
{noformat}
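For reference, workaround #1 would amount to something along these lines in the stack definition - this is only a rough sketch, the surrounding protocols are elided, and the FD_ALL timeout value is illustrative rather than something I've tested:
{noformat}
<stack name="udp">
    <transport type="UDP" socket-binding="jgroups-udp"/>
    ...
    <protocol type="FD_ALL">
        <property name="timeout">20000</property>
    </protocol>
    <!-- VERIFY_SUSPECT intentionally omitted from the stack -->
    ...
</stack>
{noformat}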
{quote}afaik in OpenShift node name is taken from name of the pod. It can be for example "eap-app-2-pzhlk". If EAP/WF crashes inside the pod then it's restarted immediately in the same pod and has the same node name "eap-app-2-pzhlk".{quote}
The fact that the restarted pod has the same logical name is inconsequential - JGroups member identity is based on UUID, not logical name. AIUI, Kubernetes dynamically assigns addresses to pods when they start, so the restarted pod would most likely have a distinct physical address. On bare metal, the same physical address is reused, hence the problem above.
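For what it's worth, a minimal sketch against the plain JGroups API (using the default stack, not the WF subsystem configuration) illustrates that each incarnation gets a fresh UUID-based address even when the logical name is identical:
{code:java}
import org.jgroups.JChannel;

public class ReincarnationDemo {
    public static void main(String[] args) throws Exception {
        // Two successive "incarnations" of the same logical member.
        for (int i = 0; i < 2; i++) {
            JChannel ch = new JChannel();   // default UDP stack
            ch.setName("node-2");           // same logical name every time
            ch.connect("ejb");
            // The address is a new UUID per incarnation, even if the logical
            // name and (on bare metal) the physical address stay the same.
            System.out.println(ch.getName() + " -> " + ch.getAddress());
            ch.close();
        }
    }
}
{code}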
> Server in cluster hangs during start after previous kill
> --------------------------------------------------------
>
> Key: WFLY-10736
> URL: https://issues.jboss.org/browse/WFLY-10736
> Project: WildFly
> Issue Type: Bug
> Components: Clustering
> Reporter: Miroslav Novak
> Assignee: Paul Ferraro
> Priority: Blocker
> Labels: blocker-WF14
> Fix For: 15.0.0.Alpha1
>
> Attachments: Lodh2TestCase.testRemoteJcaInboundOnly-traces.zip, Lodh2TestCase.testRemoteJcaInboundOnly.zip, Lodh2TestCase.testRemoteJcaInboundOnly2.zip, clusterKilTest.zip, logs-traces.zip, logs-traces2.zip, logs-traces3.zip, logs-traces4.zip, logs-with-workaround.zip, node-1-thread-dump-before-kill-shutdown-sequence.txt, server-with-mdb.log, standalone-full-ha-1.xml, standalone-full-ha-2.xml
>
>
> There is a regression in JGroups or Infinispan in one of our tests for fault tolerance of JMS bridges. However, the work on the JMS bridge appears to be unrelated. The issue was hit in the WF weekly run.
> Test Scenario:
> * There are two servers. InQueue is deployed on Node 1, OutQueue is deployed on Node 2. Both servers are started.
> * Large byte messages are sent to InQueue deployed on Node 1. A bridge between the servers/queues transfers messages from Node 1 to Node 2.
> * Node 1 is killed and started again.
> * All messages are received from OutQueue deployed on Node 2.
> Result:
> Node 1 does not start after the kill and hangs. The following exception is logged on Node 2:
> {code}
> 09:26:17,894 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100000: Node node-1 joined the cluster
> 09:26:18,520 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,521 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,521 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,522 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
> 09:26:18,523 INFO [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
> 09:26:18,868 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2) ISPN000310: Starting cluster-wide rebalance for cache default, topology CacheTopology{id=17, phase=READ_OLD_WRITE_ALL, rebalanceId=6, currentCH=ReplicatedConsistentHash{ns = 256, owners = (2)[node-2: 122, node-1: 134]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (3)[node-2: 84, node-1: 90, node-1: 82]}, unionCH=null, actualMembers=[node-2, node-1, node-1], persistentUUIDs=[12443bfb-e88a-46f3-919e-9213bf38ce19, 2873237f-d881-463f-8a5a-940bf1d764e5, a05ea8af-a83b-42a9-b937-dc2da1cae6d1]}
> 09:26:18,869 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-2]ISPN100002: Started rebalance with topology id 17
> 09:26:18,870 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t5) [Context=default][Scope=node-2]ISPN100003: Node node-2 finished rebalance phase with topology id 17
> 09:26:18,981 INFO [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-1]ISPN100003: Node node-1 finished rebalance phase with topology id 17
> 09:27:18,530 WARN [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p15-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> ... 1 more
> Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> ... 1 more
> [CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> 09:27:18,530 WARN [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p16-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
> Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
> at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
> at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
> at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
> at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
> at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
> at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
> at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> ... 1 more
> Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
> at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
> at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
> ... 1 more
> [CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
> {code}
> The default JGroups UDP stack is configured and is used by Infinispan. Both servers (JGroups UDP) are bound to 127.0.0.1. Node 2 has a port offset of 1000.
> Attaching a thread dump from node 1, taken while it hangs during start.
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)