[infinispan-issues] [JBoss JIRA] (ISPN-1838) State transfer takes more than 10 minutes with only 10K entries.

Thu Mar 1 09:20:40 EST 2012

    [ https://issues.jboss.org/browse/ISPN-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672666#comment-12672666 ] 

RH Bugzilla Integration commented on ISPN-1838:
-----------------------------------------------

JBoss JIRA Server <jira-update at redhat.com> made a comment on [bug 786202|https://bugzilla.redhat.com/show_bug.cgi?id=786202]

Bela Ban <bela at jboss.com> made a comment on jira ISPN-1838

I ran UUPerf on my local cluster and got good results (between 30 and 40 reqs/sec/node for a 4 node cluster).

However, the same test (same config and same JGroups version) had abysmal perf on cluster01-04! I think that this is due to a switch that is not optimally configured for UDP datagrams and/or has small buffers for datagram packets.
Note that I've had this suspicion before, as MPerf numbers are also bad...

The TCP transport scales its send window down, to reduce packet loss and subsequent retransmission. However, JGroups' UFC (Unicast Flow Control) protocol has a fixed window size (number of credits). (This will be changed in 3.2 or 3.3, see [1]).
Due to this, messages are sent too fast, causing packet loss at the switch (still my assumption !) and retransmission which is costly and generates more traffic.
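
To illustrate why a fixed window matters, here is a minimal sketch of credit-based flow control with a constant number of credits. It is not JGroups code: the class and method names are hypothetical, and the 200K figure in the usage note is only the value suggested in this comment. The point it shows is that the window is refilled to the same maximum every time, so it never shrinks in response to packet loss the way a TCP send window does.

```java
// Hypothetical sketch of fixed-window, credit-based flow control.
// Not the JGroups UFC implementation; all names are invented for illustration.
public class FixedCreditSender {
    private final long maxCredits; // fixed window size in bytes, never adjusted
    private long credits;          // bytes that may still be sent before waiting

    public FixedCreditSender(long maxCredits) {
        this.maxCredits = maxCredits;
        this.credits = maxCredits;
    }

    /** Returns true if the message may be sent now, false if the sender has to wait. */
    public boolean trySend(int messageBytes) {
        if (messageBytes > credits) {
            return false; // out of credits: wait until the receiver replenishes
        }
        credits -= messageBytes;
        return true;
    }

    /** Called when the receiver hands credits back: always refills to the fixed maximum. */
    public void replenish() {
        credits = maxCredits;
    }

    /** Bytes that can still be sent with the current credits. */
    public long remaining() {
        return credits;
    }
}
```

With `new FixedCreditSender(200_000)`, a sender can have at most 200K bytes outstanding per receiver, which caps the size of a burst that reaches the switch.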

There are 2 things that correct this:
#1 Reduce UFC.max_bytes to (say) 200K

#2 Reduce FRAG2.frag_size to 8000. Note that the MTU on cluster01-10 is 9000, so a frag_size of 8K won't lead to IP fragmentation. On other systems, such as perf01-10 where the MTU is 1500, we'd have to experiment with even smaller sizes.
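
The MTU arithmetic behind #2 can be sketched as follows. This is a rough check, assuming IPv4 without options (20-byte header) and an 8-byte UDP header; it ignores the headers that JGroups itself adds on top of frag_size, so the real headroom is somewhat smaller. The class and method names are hypothetical.

```java
// Hypothetical helper: does a UDP datagram of a given payload size fit into one IP packet?
public class FragSizeCheck {
    // Assumed header sizes: IPv4 without options and UDP.
    public static final int IPV4_HEADER = 20;
    public static final int UDP_HEADER = 8;

    /** Largest UDP payload that fits into a single IP packet for the given MTU. */
    public static int maxUdpPayload(int mtu) {
        return mtu - IPV4_HEADER - UDP_HEADER;
    }

    /** True if a datagram with payloadBytes of data needs no IP fragmentation. */
    public static boolean fitsInOnePacket(int payloadBytes, int mtu) {
        return payloadBytes <= maxUdpPayload(mtu);
    }

    /** Number of IP fragments needed to carry one UDP datagram of payloadBytes. */
    public static int ipFragments(int payloadBytes, int mtu) {
        // Fragment data (except in the last fragment) must be a multiple of 8 bytes.
        int perFragment = ((mtu - IPV4_HEADER) / 8) * 8;
        int total = UDP_HEADER + payloadBytes; // the UDP header travels in the first fragment
        return (total + perFragment - 1) / perFragment; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println("MTU 9000, 8000 bytes fits: " + fitsInOnePacket(8000, 9000));
        System.out.println("MTU 1500, 8000 bytes fits: " + fitsInOnePacket(8000, 1500));
        System.out.println("MTU 1500, IP fragments:    " + ipFragments(8000, 1500));
    }
}
```

Under these assumptions an 8000-byte payload leaves 972 bytes of headroom with an MTU of 9000, while with an MTU of 1500 the same datagram is split into several IP fragments, and losing any one of them loses the whole datagram.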

The config file I used to run UUPerf on cluster01-04 is attached (dan.xml).

[1] https://issues.jboss.org/browse/JGRP-1053

> State transfer takes more than 10 minutes with only 10K entries.
> ----------------------------------------------------------------
>
>                 Key: ISPN-1838
>                 URL: https://issues.jboss.org/browse/ISPN-1838
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.1.1.CR1
>            Reporter: Michal Linhard
>            Assignee: Dan Berindei
>             Fix For: 5.1.2.FINAL
>
>         Attachments: apply_state.log, apply_state.txt, dan.xml, retransmissions.txt, uuperf-tcp.txt, uuperf-udp.txt, uuperf-unicast1.txt
>
>
> This could be categorized as a performance problem.
> It happened in resilience test run: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23
> originally to verify ISPN-1826
> It was run with infinispan special build from Galder's branch (https://github.com/galderz/infinispan/tree/t_1826_5)
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-infinispan-from-source/45/
> test starts 4 nodes, kills node2, starts node2 and sees what happens
> trace logging on server side was on. there were two runs
> 200 clients, 10K entries
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/23
> 20 clients, 1K entries
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24
> in run 24 everything looks nice:
> http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/24/artifact/report/stats-throughput.png
> in run 23 the state transfer takes forever (more than 10 min)
> these important views are installed on coordinator (node03):
> {code}
> 2012-02-02 05:11:00,560 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=6, members=[edg-perf04-45788, edg-perf03-36944, edg-perf02-51026, edg-perf01-47003]}
> 2012-02-02 05:15:13,591 TRACE [BaseStateTransferManagerImpl] (transport-thread-9) Received new cache view: testCache CacheView{viewId=7, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003]}
> 2012-02-02 05:18:17,219 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
> 2012-02-02 05:28:17,511 TRACE [BaseStateTransferManagerImpl] (transport-thread-22) Received new cache view: testCache CacheView{viewId=10, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
> {code}
> viewId=8 is the one that takes 10 min to prepare and after that the prepare fails:
> {code}
> 2012-02-02 05:28:17,219 ERROR [CacheViewsManagerImpl] (CacheViewInstaller-9,edg-perf03-36944) ISPN000172: Failed to prepare view CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]} for cache  testCache, ro..
> java.util.concurrent.TimeoutException
> 	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)
> 	at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:877)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> {code}
> viewId=10 is a retry and that succeeds quite quickly but the test is already ending about that time.
> It might be worth looking at the tracelogs since they're already there...
> 10K entries and 200 clients isn't such a big load ...
