[ https://issues.jboss.org/browse/ISPN-1838?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-1838:
-----------------------------------------------
JBoss JIRA Server <jira-update@redhat.com> made a comment on [bug 786202|https://bugzilla.redhat.com/show_bug.cgi?id=786202]
Dan Berindei <dan.berindei@gmail.com> made a comment on JIRA ISPN-1838
This looks like a bad interaction between our state transfer behaviour on join and
UNICAST2.
I wrote a test called UUPerf, modeled after the existing UPerf but mimicking our state
transfer behaviour: all the nodes send very few, very large messages to a single node (the
last one to join the cluster). The code is here:
https://github.com/danberindei/JGroups/commit/8bdbce0d0fe3e5a65caf5d28e43...
I have run the test with Infinispan's jgroups-udp.xml and jgroups-tcp.xml
configurations on cluster01..04, and I have attached the results as uuperf-udp.txt and
uuperf-tcp.txt.
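For context, the traffic pattern UUPerf exercises boils down to something like the sketch
below. This is not the actual UUPerf code (that is in the commit linked above), just a
minimal approximation against the JGroups 3.x API; the class name, cluster name, message
size and message count are made up, and it assumes Infinispan's jgroups-udp.xml is on the
classpath:
{code}
import java.util.List;

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;

// Rough approximation of the UUPerf traffic pattern (not the real test):
// every node sends a few very large unicast messages to the last member of
// the view, i.e. the node that joined last. Sizes and counts are invented.
public class UUPerfSketch {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel("jgroups-udp.xml"); // Infinispan's UDP stack, assumed on the classpath
        ch.connect("uuperf-sketch");

        List<Address> members = ch.getView().getMembers();
        Address joiner = members.get(members.size() - 1); // last node to join

        if (!ch.getAddress().equals(joiner)) {
            for (int i = 0; i < 4; i++) {                 // very few messages...
                Message msg = new Message(joiner);
                msg.setBuffer(new byte[8 * 1024 * 1024]); // ...but very large ones
                ch.send(msg);                             // every sender targets the same node
            }
        }
        // The joiner would just count received bytes in a Receiver; omitted here.
        ch.close();
    }
}
{code}
The point is simply that every sender targets the same destination at roughly the same
time, which is the pattern a joining node sees during state transfer.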
As I increased the number of nodes in the cluster, the TCP throughput dropped
relatively quickly because of (I assume) collisions on the network between the switch and
the joiner. With UDP, however, the effect was much more pronounced: I even got a timeout
after 400 seconds (thus reproducing Michal's original report).
The problem seems to be that UNICAST2 (just like NAKACK/NAKACK2) can't detect missing
messages when they are the last messages sent by a node. So if there are lots of
collisions on the wire, there is a high chance that the missing messages will only be
detected when the STABLE timer expires (every 50 seconds in our configurations). However,
since the joiner sends retransmission requests to all the nodes at the same time, there is
again a high chance that the retransmission responses will collide and the cycle will repeat.
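If that theory is right, shrinking the STABLE interval should shrink the gap before the
retransmissions kick in. A quick, experiment-only way to try that, assuming the JGroups
3.x API (the 5-second value is an arbitrary guess for testing, not a recommended setting):
{code}
import org.jgroups.JChannel;
import org.jgroups.protocols.pbcast.STABLE;
import org.jgroups.stack.ProtocolStack;

// Experiment only: lower the STABLE gossip interval so that "lost last
// messages" are detected sooner than the ~50s in our configurations.
public class ShortStableInterval {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel("jgroups-udp.xml");
        ProtocolStack stack = ch.getProtocolStack();
        STABLE stable = (STABLE) stack.findProtocol(STABLE.class);
        if (stable != null) {
            stable.setDesiredAverageGossip(5000); // ms; our configs use ~50000
        }
        ch.connect("stable-test");
        // ... run the same UUPerf-style workload here ...
        ch.close();
    }
}
{code}
This wouldn't be a fix, but it would be a cheap way to confirm that the delay comes from
waiting for STABLE rather than from the transfer itself.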
State transfer takes more than 10 minutes with only 10K entries.
----------------------------------------------------------------
Key: ISPN-1838
URL: https://issues.jboss.org/browse/ISPN-1838
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 5.1.1.CR1
Reporter: Michal Linhard
Assignee: Dan Berindei
Fix For: 5.1.2.FINAL
Attachments: apply_state.log, apply_state.txt, dan.xml, retransmissions.txt,
uuperf-tcp.txt, uuperf-udp.txt, uuperf-unicast1.txt
This could be categorized as a performance problem.
It happened in a resilience test run (originally meant to verify ISPN-1826):
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
It was run with a special Infinispan build from Galder's branch
(https://github.com/galderz/infinispan/tree/t_1826_5):
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-...
The test starts 4 nodes, kills node2, starts node2 again and sees what happens.
Trace logging on the server side was on. There were two runs:
200 clients, 10K entries
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
20 clients, 1K entries
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
In run 24 everything looks nice:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
In run 23 the state transfer takes forever (more than 10 minutes).
These important views are installed on the coordinator (node03):
{code}
2012-02-02 05:11:00,560 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=6, members=[edg-perf04-45788, edg-perf03-36944, edg-perf02-51026, edg-perf01-47003]}
2012-02-02 05:15:13,591 TRACE [BaseStateTransferManagerImpl] (transport-thread-9) Received new cache view: testCache CacheView{viewId=7, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003]}
2012-02-02 05:18:17,219 TRACE [BaseStateTransferManagerImpl] (transport-thread-1) Received new cache view: testCache CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
2012-02-02 05:28:17,511 TRACE [BaseStateTransferManagerImpl] (transport-thread-22) Received new cache view: testCache CacheView{viewId=10, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
{code}
viewId=8 is the one that takes 10 minutes to prepare (received at 05:18:17, failure logged at 05:28:17), and after that the prepare fails:
{code}
2012-02-02 05:28:17,219 ERROR [CacheViewsManagerImpl] (CacheViewInstaller-9,edg-perf03-36944) ISPN000172: Failed to prepare view CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]} for cache testCache, ro..
java.util.concurrent.TimeoutException
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)
	at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)
	at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:877)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
{code}
viewId=10 is a retry, and that one succeeds quite quickly, but the test is already ending
around that time.
It might be worth looking at the trace logs since they're already there...
10K entries and 200 clients isn't such a big load ...