[
https://issues.jboss.org/browse/ISPN-1838?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-1838:
------------------------------------
The logs are quite weird, node-01 sends an APPLY_STATE command to node-02 for view 8 which
never seems to reach its destination.
Each of the surviving nodes (node-01, node-03, node-04) sent an APPLY_STATE command to
node-02. For the node-03 command it took 1 second to receive and unmarshal it, for the
node-04 command it took 1 minute, and the node-01 command never started executing (either
because it never arrived or because unmarshalling took too long).
{noformat}
2012-02-02 05:18:17,479 TRACE [RpcManagerImpl] (transport-thread-4) edg-perf03-36944
broadcasting call StateTransferControlCommand{cache=testCache, type=APPLY_STATE,
sender=edg-perf03-36944, viewId=8, state=1806} to recipient list [edg-perf02-21799]
2012-02-02 05:18:18,479 TRACE [CommandAwareRpcDispatcher] (OOB-100,edg-perf02-21799)
Attempting to execute command: StateTransferControlCommand{cache=testCache,
type=APPLY_STATE, sender=edg-perf03-36944, viewId=8, state=1806}
[sender=edg-perf03-36944]
{noformat}
{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication
task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE,
sender=edg-perf04-45788, viewId=8, state=1443} to addresses [edg-perf02-21799]
2012-02-02 05:19:16,630 TRACE [CommandAwareRpcDispatcher] (OOB-98,edg-perf02-21799)
Attempting to execute command: StateTransferControlCommand{cache=testCache,
type=APPLY_STATE, sender=edg-perf04-45788, viewId=8, state=1443}
[sender=edg-perf04-45788]
{noformat}
{noformat}
2012-02-02 05:18:17,468 TRACE [CommandAwareRpcDispatcher] (transport-thread-5) Replication
task sending StateTransferControlCommand{cache=testCache, type=APPLY_STATE,
sender=edg-perf01-47003, viewId=8, state=1903} to addresses [edg-perf02-21799]
???
{noformat}
Michal, can you run the test again with TRACE enabled for org.jgroups and for
org.infinispan.marshall as well?
State transfer takes more than 10 minutes with only 10K entries.
----------------------------------------------------------------
Key: ISPN-1838
URL:
https://issues.jboss.org/browse/ISPN-1838
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 5.1.1.CR1
Reporter: Michal Linhard
Assignee: Dan Berindei
Attachments: apply_state.log
This could be categorized as a performance problem.
It happened in resilience test run:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
originally to verify ISPN-1826
It was run with infinispan special build from Galder's branch
(
https://github.com/galderz/infinispan/tree/t_1826_5)
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-...
test starts 4 nodes, kills node2, starts node2 and sees what happens
trace logging on server side was on. there were two runs
200 clients, 10K entries
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
20 clients, 1K entries
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
in run 24 everyting looks nice:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/j...
in run 23 the state transfer takes forever (more than 10 min)
these important views are installed on coordinator (node03):
{code}
2012-02-02 05:11:00,560 TRACE [BaseStateTransferManagerImpl] (transport-thread-1)
Received new cache view: testCache CacheView{viewId=6, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf02-51026, edg-perf01-47003]}
2012-02-02 05:15:13,591 TRACE [BaseStateTransferManagerImpl] (transport-thread-9)
Received new cache view: testCache CacheView{viewId=7, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf01-47003]}
2012-02-02 05:18:17,219 TRACE [BaseStateTransferManagerImpl] (transport-thread-1)
Received new cache view: testCache CacheView{viewId=8, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
2012-02-02 05:28:17,511 TRACE [BaseStateTransferManagerImpl] (transport-thread-22)
Received new cache view: testCache CacheView{viewId=10, members=[edg-perf04-45788,
edg-perf03-36944, edg-perf01-47003, edg-perf02-21799]}
{code}
viewId=8 is the one that takes 10 min to prepare and after that the prepare fails:
{code}
2012-02-02 05:28:17,219 ERROR [CacheViewsManagerImpl]
(CacheViewInstaller-9,edg-perf03-36944) ISPN000172: Failed to prepare view
CacheView{viewId=8, members=[edg-perf04-45788, edg-perf03-36944, edg-perf01-47003,
edg-perf02-21799]} for cache testCache, ro..
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at
org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)
at
org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)
at
org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:877)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
{code}
viewId=10 is a retry and that succeeds quite quickly but the test is already ending about
that time.
It might be worth looking at the tracelogs since they're already there...
10K entries and 200 clients isn't such a big load ...
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see:
http://www.atlassian.com/software/jira