[infinispan-issues] [JBoss JIRA] (ISPN-1838) State transfer takes more than 10 minutes with only 10K entries.

Sun Feb 5 15:26:48 EST 2012

    [ https://issues.jboss.org/browse/ISPN-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664320#comment-12664320 ] 

Dan Berindei commented on ISPN-1838:
------------------------------------

The logs are quite weird, node-01 sends an APPLY_STATE command to node-02 for view 8 which never seems to reach its destination.

Each of the surviving nodes (node-01, node-03, node-04) sent an APPLY_STATE command to node-02. For the node-03 command it took 1 second to receive and unmarshal it, for the node-04 command it took 1 minute, and the node-01 command never started executing (either because it never arrived or because unmarshalling took too long).

{noformat}
2012-02-02 05:18:17,479 TRACE [RpcManagerImpl] (transport-thread-4) edg-perf03-36944 broadcasting call StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=8, state=1806} to recipient list [edg-perf02-21799]
2012-02-02 05:18:18,479 TRACE [CommandAwareRpcDispatcher] (OOB-100,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=8, state=1806} [sender=edg-perf03-36944]
{noformat}

{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=8, state=1443} to addresses [edg-perf02-21799]
2012-02-02 05:19:16,630 TRACE [CommandAwareRpcDispatcher] (OOB-98,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=8, state=1443} [sender=edg-perf04-45788]
{noformat}

{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=8, state=1903} to addresses [edg-perf02-21799]
???
{noformat}

Michal, can you run the test again with TRACE enabled for org.jgroups and for org.infinispan.marshall as well?

> State transfer takes more than 10 minutes with only 10K entries.
> ----------------------------------------------------------------
>
>                 Key: ISPN-1838
>                 URL: https://issues.jboss.org/browse/ISPN-1838
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.1.1.CR1
>            Reporter: Michal Linhard
>            Assignee: Dan Berindei
>         Attachments: apply_state.log
>
>
> This could be categorized as a performance problem.
> It happened in resilience test run: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23
> originally to verify ISPN-1826
> It was run with infinispan special build from Galder's branch (https://github.com/galderz/infinispan/tree/t_1826_5)
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-infinispan-from-source/45/
> test starts 4 nodes, kills node2, starts node2 and sees what happens
> trace logging on server side was on. there were two runs
> 200 clients, 10K entries
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23
> 20 clients, 1K entries
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24
> in run 24 everyting looks nice:
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24/artifact/report/stats-throughput.png
> in run 23 the state transfer takes forever (more than 10 min)
> these important views are installed on coordinator (node03):
> {code}
> 2012-02-02 05:11:00,560 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=6, members=[edg-perf04-45788, edg-perf03-36944, edg-perf02-51026, edg-perf01-47003]}
> 2012-02-02 05:15:13,591 TRACE [BaseStateTransferManagerImpl] (transport-thread-9) Received new cache view: testCache CacheView{viewId=7, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003]}
> 2012-02-02 05:18:17,219 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
> 2012-02-02 05:28:17,511 TRACE [BaseStateTransferManagerImpl] (transport-thread-22) Received new cache view: testCache CacheView{viewId=10, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
> {code}
> viewId=8 is the one that takes 10 min to prepare and after that the prepare fails:
> {code}
> 2012-02-02 05:28:17,219 ERROR [CacheViewsManagerImpl] (CacheViewInstaller-9,edg-perf03-36944) ISPN000172: Failed to prepare view CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]} for cache  testCache, ro..
> java.util.concurrent.TimeoutException
> 	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:877)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> {code}
> viewId=10 is a retry and that succeeds quite quickly but the test is already ending about that time.
> It might be worth looking at the tracelogs since they're already there...
> 10K entries and 200 clients isn't such a big load ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira