RH Bugzilla Integration commented on ISPN-1838:
-----------------------------------------------
Dan Berindei <dan.berindei(a)gmail.com> made a comment on jira ISPN-1838
I discussed this with Bela yesterday and we came up with several workarounds:
1. Stagger APPLY_STATE commands - add a random 0-500ms delay before sending the command
2. Make the chunk size smaller to decrease the chance of collisions.
3. Use RSVP for APPLY_STATE commands - would require JGroups 3.1
4. Use UNICAST instead of UNICAST2
5. Decrease the STABLE timeout - can't lower it too much, as it is quite expensive
6. Decrease FRAG2.frag_size (currently 60000 bytes). When there is a collision, an entire
fragment has to be discarded and sent again, even if the Ethernet frame that collided was
only 1500 bytes.
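For concreteness, options #4-#6 map onto JGroups stack settings roughly like this. This is a hypothetical fragment, not the attached dan.xml; the values are the ones quoted in this comment, and I'm assuming desired_avg_gossip is the relevant STABLE knob:

```xml
<!-- Hypothetical JGroups stack excerpt illustrating options #4, #5 and #6.
     Only the protocols touched by the workarounds are shown. -->
<config xmlns="urn:org:jgroups">
    <!-- ... transport and discovery protocols omitted ... -->
    <!-- #4: plain UNICAST instead of UNICAST2 -->
    <UNICAST/>
    <!-- #5: lower STABLE timeout (assumed knob: desired_avg_gossip, in ms) -->
    <pbcast.STABLE desired_avg_gossip="10000"/>
    <!-- #6: smaller fragments, down from the current 60000 bytes -->
    <FRAG2 frag_size="15000"/>
</config>
```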
I have tried options #2 (500KB chunks), #5 (10 seconds) and #6 (15000 bytes), but I
still got a timeout after 400 seconds with 4 senders and 1 receiver.
I have also tried option #4; the results are in the attached uuperf-unicast1.txt. It still
scales worse than UNICAST2 over TCP, but it's much better than UNICAST2 over UDP.
I haven't tried options #1 and #3 yet, but I suspect #1 will only make the problem
harder to reproduce (because once STABLE kicks in, the retransmission responses are not
staggered).
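Option #1 would be little more than a random sleep around the send. A minimal sketch (hypothetical helper, not Infinispan code; only the 0-500ms range comes from the proposal above):

```java
import java.util.concurrent.ThreadLocalRandom;

public class ApplyStateStagger {

    // Workaround #1: pick a random delay in [0, 500] ms before sending APPLY_STATE,
    // so that concurrent senders don't all hit the wire at the same instant.
    static long staggerDelayMs() {
        return ThreadLocalRandom.current().nextLong(0, 501);
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = staggerDelayMs();
        Thread.sleep(delay);       // stagger the send
        // sendApplyState(...);    // hypothetical: the actual command send would go here
        System.out.println(delay >= 0 && delay <= 500);
    }
}
```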
State transfer takes more than 10 minutes with only 10K entries.
----------------------------------------------------------------
Key: ISPN-1838
URL:
https://issues.jboss.org/browse/ISPN-1838
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 5.1.1.CR1
Reporter: Michal Linhard
Assignee: Dan Berindei
Fix For: 5.1.2.FINAL
Attachments: apply_state.log, apply_state.txt, dan.xml, retransmissions.txt,
uuperf-tcp.txt, uuperf-udp.txt, uuperf-unicast1.txt
This could be categorized as a performance problem.
It happened in a resilience test run:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
originally to verify ISPN-1826
It was run with an Infinispan special build from Galder's branch
(https://github.com/galderz/infinispan/tree/t_1826_5)
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-...
The test starts 4 nodes, kills node2, starts node2 again and sees what happens.
Trace logging on the server side was on. There were two runs:
200 clients, 10K entries
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
20 clients, 1K entries
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
In run 24 everything looks nice:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
In run 23 the state transfer takes forever (more than 10 min).
These are the important cache views installed on the coordinator (node03):
{code}
2012-02-02 05:11:00,560 TRACE [BaseStateTransferManagerImpl] (transport-thread-1)
Received new cache view: testCache CacheView{viewId=6, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf02-51026, edg-perf01-47003]}
2012-02-02 05:15:13,591 TRACE [BaseStateTransferManagerImpl] (transport-thread-9)
Received new cache view: testCache CacheView{viewId=7, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf01-47003]}
2012-02-02 05:18:17,219 TRACE [BaseStateTransferManagerImpl] (transport-thread-1)
Received new cache view: testCache CacheView{viewId=8, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
2012-02-02 05:28:17,511 TRACE [BaseStateTransferManagerImpl] (transport-thread-22)
Received new cache view: testCache CacheView{viewId=10, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
{code}
viewId=8 is the one that takes 10 minutes to prepare, and then the prepare fails:
{code}
2012-02-02 05:28:17,219 ERROR [CacheViewsManagerImpl]
(CacheViewInstaller-9,edg-perf03-36944) ISPN000172: Failed to prepare view
CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003,
edg-perf02-21799]} for cache testCache, ro..
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at
org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)
at
org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)
at
org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:877)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
{code}
viewId=10 is a retry; it succeeds quite quickly, but the test is already ending around
that time.
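For what it's worth, the TimeoutException in the trace above is just FutureTask.get(timeout) expiring while the view prepare is still in flight. A standalone sketch of that failure mode (nothing Infinispan-specific; the timeout is shortened for illustration):

```java
import java.util.concurrent.*;

public class PrepareTimeoutDemo {

    // Returns true if the future does not complete within the given timeout,
    // mirroring how clusterPrepareView's bounded FutureTask.get can expire.
    static boolean timesOut(Future<?> future, long timeoutMs) {
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;                        // the ISPN000172 case
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Simulates a prepare that hangs (e.g. waiting on lost APPLY_STATE responses).
        Future<?> prepare = pool.submit(() -> {
            try { Thread.sleep(5_000); } catch (InterruptedException ignored) { }
        });
        System.out.println(timesOut(prepare, 100));  // prints "true"
        pool.shutdownNow();
    }
}
```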
It might be worth looking at the trace logs since they're already there...
10K entries and 200 clients isn't such a big load ...