]
Paul Ferraro commented on WFLY-6926:
------------------------------------
This is likely due to ISPN-6745 - for which a fix is already committed in the 8.2.x
branch.
TimeoutException: Replication timeout when handling request with DIST
cache
---------------------------------------------------------------------------
Key: WFLY-6926
URL:
https://issues.jboss.org/browse/WFLY-6926
Project: WildFly
Issue Type: Bug
Components: Clustering
Affects Versions: 10.1.0.CR1
Reporter: Michal Vinkler
Assignee: Paul Ferraro
Priority: Blocker
Seen in http-session scenarios with *DIST* cache only. REPL cache scenarios seem not to
be affected.
TimeoutExceptions occur right away after sending first request and definitely result in
client getting 500:
{code}
[JBossINF] [0m[31m04:31:39,033 ERROR
[org.infinispan.interceptors.InvocationContextInterceptor] (default task-1) ISPN000136:
Error executing command GetKeyValueCommand, writing keys []:
org.infinispan.util.concurrent.TimeoutException: Replication timeout for perf19
[JBossINF] at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:801)
[JBossINF] at
org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$1(JGroupsTransport.java:642)
[JBossINF] at
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
[JBossINF] at
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
[JBossINF] at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
[JBossINF] at
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
[JBossINF] at
org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.staggeredProcessNext(CommandAwareRpcDispatcher.java:375)
[JBossINF] at
org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.lambda$processCallsStaggered$3(CommandAwareRpcDispatcher.java:357)
[JBossINF] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[JBossINF] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[JBossINF] at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[JBossINF] at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[JBossINF] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[JBossINF] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[JBossINF] at java.lang.Thread.run(Thread.java:745)
{code}
Client:
{code}
2016/08/08 04:31:39:068 EDT [WARN ][Runner - 4] HOST
perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: Internal Server
Error>
org.jboss.smartfrog.loaddriver.RequestProcessingException: Invalid response code:
500 Content: Internal Server Error
at
org.jboss.smartfrog.loaddriver.http.HttpRequestProcessorFactoryImpl$HttpRequestProcessor.processRequest(HttpRequestProcessorFactoryImpl.java:163)
at
org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:103)
at java.lang.Thread.run(Thread.java:745)
{code}
I was able to reproduce it even with a very small load of 30 sessions.
Server link (run with 30 sessions):
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failo...
Client link (run with 30 sessions):
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failo...
This issue occurs under different conditions than JBEAP-3779 (which is similar),
that's why I opened new issue.
Number of occurrences of TimeoutExceptions is very high. The server logs are full of
them.
This issue makes clustering with *DIST* cache unusable, thus giving blocker priority.
Also, there is a issue with shutting down the servers in the end of the test, I don't
know if it can be related, but I am hitting it constantly in the http-session DIST
scenarios.
The server shutdown usually gets stuck on this command:
{code}
[JBossINF] [0m[0m04:38:51,694 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-7)
ISPN000082: Stopping the RpcDispatcher for channel web
{code}
*EDIT:*
Also ejb-remote tests with *DIST* cache are affected (even if only 20 sessions are
created).
Server log for ejb-remote run:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failo...
Client log for ejb-remote run:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failo...
*EDIT #2:*
Can be very easily reproduced locally. Just start a cluster of at least 3 servers with
clusterbench deployed. Then do few requests to various nodes until you get 500 as a
response. Also notice, when shutting down a node, which logged TimeoutException before,
the node won't stop and get stuck during the shutdown.