[JBoss JIRA] (WFLY-9890) "ISPN000476: Timed out waiting for responses for request X from node Y" immediately after node Y rejoins the cluster (failover)

Friday, 9 March 2018

    [
https://issues.jboss.org/browse/WFLY-9890?page=com.atlassian.jira.plugin....
] 

Michal Vinkler commented on WFLY-9890:
--------------------------------------

Right now we have an automated way of building nightly wildfly images. With image built
last night, I can still see the WFLY-9890 occuring:

https://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_perflab...

In the perf19 log, it can be clearly seen that perf18 leaves cluster (jvmkill, no errors),
perf18 joins the cluster and exactly 10 seconds later, the perf19 starts logging errors:
{code}
[JBossINF] [0m[0m09:43:49,106 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|4] (3) [perf20, perf19,
perf21]
[JBossINF] [0m[0m09:43:49,108 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN100001: Node perf18 left the cluster
[JBossINF] [0m[0m09:43:49,109 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|4] (3) [perf20, perf19,
perf21]
[JBossINF] [0m[0m09:43:49,109 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN100001: Node perf18 left the cluster
[JBossINF] [0m[0m09:43:49,110 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|4] (3) [perf20, perf19,
perf21]
[JBossINF] [0m[0m09:43:49,110 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN100001: Node perf18 left the cluster
[JBossINF] [0m[0m09:43:49,111 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|4] (3) [perf20, perf19,
perf21]
[JBossINF] [0m[0m09:43:49,111 INFO  [org.infinispan.CLUSTER] (thread-21,ejb,perf19)
ISPN100001: Node perf18 left the cluster
[JBossINF] [0m[0m09:44:51,070 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|5] (4) [perf20, perf19,
perf21, perf18]
[JBossINF] [0m[0m09:44:51,072 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN100000: Node perf18 joined the cluster
[JBossINF] [0m[0m09:44:51,073 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|5] (4) [perf20, perf19,
perf21, perf18]
[JBossINF] [0m[0m09:44:51,073 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN100000: Node perf18 joined the cluster
[JBossINF] [0m[0m09:44:51,074 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|5] (4) [perf20, perf19,
perf21, perf18]
[JBossINF] [0m[0m09:44:51,074 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN100000: Node perf18 joined the cluster
[JBossINF] [0m[0m09:44:51,074 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN000094: Received new cluster view for channel ejb: [perf20|5] (4) [perf20, perf19,
perf21, perf18]
[JBossINF] [0m[0m09:44:51,074 INFO  [org.infinispan.CLUSTER] (thread-24,ejb,perf19)
ISPN100000: Node perf18 joined the cluster
[JBossINF] [0m[31m09:45:01,140 ERROR
[org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p12-t1)
ISPN000136: Error executing command PrepareCommand, writing keys
[SessionAttributesKey(v7OhmM7PY_vkHTkMsd1NZaZaDb6NkBy3q4PvQDMd),
SessionAccessMetaDataKey(v7OhmM7PY_vkHTkMsd1NZaZaDb6NkBy3q4PvQDMd),
SessionCreationMetaDataKey(v7OhmM7PY_vkHTkMsd1NZaZaDb6NkBy3q4PvQDMd)]:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 29574 from perf18
[JBossINF] 	at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:164)
[JBossINF] 	at
org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
[JBossINF] 	at
org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
[JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[JBossINF] 	at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[JBossINF] 	at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[JBossINF] 	at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[JBossINF] 	at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[JBossINF] 	at java.lang.Thread.run(Thread.java:748)
[JBossINF] 
{code} 

There are no sampling errors at that point o the client though.

...
 "ISPN000476: Timed out waiting for responses for request X from
node Y" immediately after node Y rejoins the cluster (failover)

-------------------------------------------------------------------------------------------------------------------------------

                 Key: WFLY-9890
                 URL: https://issues.jboss.org/browse/WFLY-9890
             Project: WildFly
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 12.0.0.Beta1
            Reporter: Michal Vinkler
            Assignee: Paul Ferraro
            Priority: Critical
             Fix For: 13.0.0.Beta1

 Seen in SF failover tests - scenarios: 
 failover-http-granular-shutdown-repl-sync (replication granularity is set to ATTRIBUTE
level)
 failover-http-session-jvmkill-repl-sync (replication granularity is set to SESSION
level)
 Description: In the failover test, each server is shut down and after some time it
rejoins the cluster. After joining the cluster again, the load is distributed to this
node. It seems that first request for some session handled by that node produces ERROR in
the log of some other server until the time another cluster topology change occurs. 
 Stacktrace of the error: 
 {code}
 [JBossINF] [0m[31m10:18:00,256 ERROR
[org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p12-t1)
ISPN000136: Error executing command PrepareCommand, writing keys
[SessionCreationMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9),
SessionAccessMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9),
SessionAttributeKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9[1])]:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 30250 from perf18
 [JBossINF] 	at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:163)
 [JBossINF] 	at
org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
 [JBossINF] 	at
org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
 [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 [JBossINF] 	at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 [JBossINF] 	at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 [JBossINF] 	at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 [JBossINF] 	at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 [JBossINF] 	at java.lang.Thread.run(Thread.java:748)
 {code}
 Client gets the stale data: 
 {code}
 2018/02/21 10:18:00:272 EST [WARN ][Runner - 16] HOST
perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: 
<org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data
received. Expected 60, received 59, Runner: 16>
         org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data
received. Expected 60, received 59, Runner: 16
 	at
org.jboss.smartfrog.loaddriver.http.AbstractSerialNumberValidatorFactoryImpl$SerialNumberValidator.processRequest(AbstractSerialNumberValidatorFactoryImpl.java:133)
 	at
org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
 	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:103)
 	at java.lang.Thread.run(Thread.java:748)
 {code}
 Example - timeline:
 {code}
 4 servers are running, 2000 clients, no issue in the beggining
 2018/02/21 10:16:43:234 EST [INFO ][TestController] HOST
perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Node 0 (perf18) is down. 
 - still no issues
 2018/02/21 10:17:43:234 EST [INFO ][TestController] HOST
perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Bringing back node 0 (perf18)
 2018/02/21 10:17:55:611 EST [INFO ][TestController] HOST
perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Node 0 (perf18) is up.
 2018/02/21 10:18:00:272 EST [WARN ][Runner - 16] HOST
perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: 
<org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data
received. Expected 60, received 59, Runner: 16>
 {code}
 perf20 started logging the errors meanwhile and stopped only after perf19 was shut down:
 {code}
 [JBossINF] [0m[0m10:17:50,227 INFO  [org.infinispan.CLUSTER] (thread-19,ejb,perf20)
ISPN000094: Received new cluster view for channel ejb: [perf19|5] (4) [perf19, perf20,
perf21, perf18]
 [JBossINF] [0m[0m10:17:50,228 INFO  [org.infinispan.CLUSTER] (thread-19,ejb,perf20)
ISPN100000: Node perf18 joined the cluster
 [JBossINF] [0m[31m10:18:00,256 ERROR
[org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p12-t1)
ISPN000136: Error executing command PrepareCommand, writing keys
[SessionCreationMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9),
SessionAccessMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9),
SessionAttributeKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9[1])]:
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for
responses for request 30250 from perf18
 [JBossINF] 	at
org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:163)
 [JBossINF] 	at
org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
 [JBossINF] 	at
org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
 [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 [JBossINF] 	at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 [JBossINF] 	at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 [JBossINF] 	at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 [JBossINF] 	at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 [JBossINF] 	at java.lang.Thread.run(Thread.java:748)
 [JBossINF] 
 {code}
 There were no issues in this scenario in EAP 7.1.0.GA.
 Links:
 Clients:
http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-fa...
 perf18:
http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-fa...
 perf20:
http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-fa...
 Another test:
 perf19:
http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-fa...

--
This message was sent by Atlassian JIRA
(v7.5.0#75005)

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006