[jboss-jira] [JBoss JIRA] (WFLY-6926) TimeoutException: Replication timeout when handling request with DIST cache

Wed Aug 10 07:00:01 EDT 2016

    [ https://issues.jboss.org/browse/WFLY-6926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277224#comment-13277224 ] 

Michal Vinkler commented on WFLY-6926:
--------------------------------------

[~pferraro] I was able to reproduce the issue with the latest infinispan-core.jar built from 8.2.x branch.
Logs:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/29/console-perf17/
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/29/console-perf19/  

>From server logs: Infinispan version: Infinispan 'Chakra' 8.2.4-SNAPSHOT

> TimeoutException: Replication timeout when handling request with DIST cache
> ---------------------------------------------------------------------------
>
>                 Key: WFLY-6926
>                 URL: https://issues.jboss.org/browse/WFLY-6926
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 10.1.0.CR1
>            Reporter: Michal Vinkler
>            Assignee: Paul Ferraro
>            Priority: Blocker
>
> Seen in http-session scenarios with *DIST* cache only. REPL cache scenarios seem not to be affected.
> TimeoutExceptions occur right away after sending first request and definitely result in client getting 500:
> {code}
> [JBossINF] [0m[31m04:31:39,033 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (default task-1) ISPN000136: Error executing command GetKeyValueCommand, writing keys []: org.infinispan.util.concurrent.TimeoutException: Replication timeout for perf19
> [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:801)
> [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$1(JGroupsTransport.java:642)
> [JBossINF] 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
> [JBossINF] 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> [JBossINF] 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [JBossINF] 	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
> [JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.staggeredProcessNext(CommandAwareRpcDispatcher.java:375)
> [JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.lambda$processCallsStaggered$3(CommandAwareRpcDispatcher.java:357)
> [JBossINF] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [JBossINF] 	at java.lang.Thread.run(Thread.java:745)
> {code}
> Client:
> {code}
> 2016/08/08 04:31:39:068 EDT [WARN ][Runner - 4] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  Internal Server Error>
>         org.jboss.smartfrog.loaddriver.RequestProcessingException: Invalid response code: 500 Content: Internal Server Error
> 	at org.jboss.smartfrog.loaddriver.http.HttpRequestProcessorFactoryImpl$HttpRequestProcessor.processRequest(HttpRequestProcessorFactoryImpl.java:163)
> 	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
> 	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:103)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> I was able to reproduce it even with a very small load of 30 sessions.
> Server link (run with 30 sessions):
> http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf21/consoleText
> Client link (run with 30 sessions):
> http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf17/consoleText
> This issue occurs under different conditions than JBEAP-3779 (which is similar), that's why I opened new issue.
> Number of occurrences of TimeoutExceptions is very high. The server logs are full of them. 
> This issue makes clustering with *DIST* cache unusable, thus giving blocker priority. 
> Also, there is a issue with shutting down the servers in the end of the test, I don't know if it can be related, but I am hitting it constantly in the http-session DIST scenarios.
> The server shutdown usually gets stuck on this command:
> {code}
> [JBossINF] [0m[0m04:38:51,694 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-7) ISPN000082: Stopping the RpcDispatcher for channel web
> {code}
> *EDIT:*
> Also ejb-remote tests with *DIST* cache are affected (even if only 20 sessions are created).
> Server log for ejb-remote run:
> http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf20/
> Client log for ejb-remote run:
> http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf17/
> *EDIT #2:*
> Can be very easily reproduced locally. Just start a cluster of at least 3 servers with clusterbench deployed. Then do few requests to various nodes until you get 500 as a response. Also notice, when shutting down a node, which logged TimeoutException before, the node won't stop and get stuck during the shutdown.

--
This message was sent by Atlassian JIRA
(v6.4.11#64026)