[infinispan-issues] [JBoss JIRA] (ISPN-1838) State transfer takes more than 10 minutes with only 10K entries.

Tue Feb 14 04:54:02 EST 2012

    [ https://issues.jboss.org/browse/ISPN-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666315#comment-12666315 ] 

RH Bugzilla Integration commented on ISPN-1838:
-----------------------------------------------

JBoss JIRA Server <jira-update at redhat.com> made a comment on [bug 786202|https://bugzilla.redhat.com/show_bug.cgi?id=786202]

Dan Berindei <dan.berindei at gmail.com> made a comment on jira ISPN-1838

For comparison, here are the logs for the next view's APPLY_STATE commands. They all take < 100ms from sending to execution on the target:

{noformat}
2012-02-02 05:28:17,689 TRACE [CommandAwareRpcDispatcher] (transport-thread-6) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=10, state=1443} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,751 TRACE [CommandAwareRpcDispatcher] (OOB-99,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf04-45788, viewId=10, state=1443} [sender=edg-perf04-45788]
{noformat}

{noformat}
2012-02-02 05:28:17,724 TRACE [CommandAwareRpcDispatcher] (transport-thread-24) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=10, state=1806} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,799 TRACE [CommandAwareRpcDispatcher] (OOB-100,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf03-36944, viewId=10, state=1806} [sender=edg-perf03-36944]
{noformat}

{noformat}
2012-02-02 05:28:17,792 TRACE [CommandAwareRpcDispatcher] (transport-thread-6) Replication task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=10, state=1903} to addresses [edg-perf02-21799]
2012-02-02 05:28:17,876 TRACE [CommandAwareRpcDispatcher] (OOB-98,edg-perf02-21799) Attempting to execute command: StateTransferControlCommand{cache=testCache, type=APPLY_STATE, sender=edg-perf01-47003, viewId=10, state=1903} [sender=edg-perf01-47003]
{noformat}

One difference that might be significant is that in view 8 all three nodes send their commands at exactly the same time (within only 1ms of each other). In view 10 the commands are more spread out, there's about 100ms between the first and the last send.

> State transfer takes more than 10 minutes with only 10K entries.
> ----------------------------------------------------------------
>
>                 Key: ISPN-1838
>                 URL: https://issues.jboss.org/browse/ISPN-1838
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.1.1.CR1
>            Reporter: Michal Linhard
>            Assignee: Dan Berindei
>         Attachments: apply_state.log
>
>
> This could be categorized as a performance problem.
> It happened in resilience test run: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23
> originally to verify ISPN-1826
> It was run with infinispan special build from Galder's branch (https://github.com/galderz/infinispan/tree/t_1826_5)
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-infinispan-from-source/45/
> test starts 4 nodes, kills node2, starts node2 and sees what happens
> trace logging on server side was on. there were two runs
> 200 clients, 10K entries
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23
> 20 clients, 1K entries
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24
> in run 24 everyting looks nice:
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24/artifact/report/stats-throughput.png
> in run 23 the state transfer takes forever (more than 10 min)
> these important views are installed on coordinator (node03):
> {code}
> 2012-02-02 05:11:00,560 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=6, members=[edg-perf04-45788, edg-perf03-36944, edg-perf02-51026, edg-perf01-47003]}
> 2012-02-02 05:15:13,591 TRACE [BaseStateTransferManagerImpl] (transport-thread-9) Received new cache view: testCache CacheView{viewId=7, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003]}
> 2012-02-02 05:18:17,219 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
> 2012-02-02 05:28:17,511 TRACE [BaseStateTransferManagerImpl] (transport-thread-22) Received new cache view: testCache CacheView{viewId=10, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
> {code}
> viewId=8 is the one that takes 10 min to prepare and after that the prepare fails:
> {code}
> 2012-02-02 05:28:17,219 ERROR [CacheViewsManagerImpl] (CacheViewInstaller-9,edg-perf03-36944) ISPN000172: Failed to prepare view CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]} for cache  testCache, ro..
> java.util.concurrent.TimeoutException
> 	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:877)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> {code}
> viewId=10 is a retry and that succeeds quite quickly but the test is already ending about that time.
> It might be worth looking at the tracelogs since they're already there...
> 10K entries and 200 clients isn't such a big load ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira