[JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.... ]
Radim Vansa edited comment on ISPN-2802 at 2/11/13 11:21 AM:
-------------------------------------------------------------
I would suspect OOB TP as it's pretty popular nowadays, but it's rather strange what all the threads do when there's no background load in the cluster. I should setup the threadpool monitor here... will do and publish the result.
Btw., I've also tried to run tcpdump on 806 to confirm if the message is received and the issue did not happen. Viva la resistencialism! :-) (http://en.wikipedia.org/wiki/Resistentialism)
was (Author: rvansa):
I would suspect OOB TP as it's pretty popular nowadays, but it's pretty strange what all the threads do when there's no background load in the cluster. I should setup the threadpool monitor here... will do and publish the result.
Btw., I've also tried to run tcpdump on 806 to confirm if the message is received and the issue did not happen. Viva la resistencialism! :-) (http://en.wikipedia.org/wiki/Resistentialism)
> Cache recovery fails due to missing responses
> ---------------------------------------------
>
> Key: ISPN-2802
> URL: https://issues.jboss.org/browse/ISPN-2802
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.0.CR3
> Reporter: Radim Vansa
> Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2608) VersionedDistStateTransferTest intermittent failure
by Adrian Nistor (JIRA)
[ https://issues.jboss.org/browse/ISPN-2608?page=com.atlassian.jira.plugin.... ]
Adrian Nistor commented on ISPN-2608:
-------------------------------------
This is nearly impossible to investigate without a trace log. I'm running now the suite hoping to reproduce it but still did not encounter it after 6 runs...
> VersionedDistStateTransferTest intermittent failure
> ---------------------------------------------------
>
> Key: ISPN-2608
> URL: https://issues.jboss.org/browse/ISPN-2608
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer, Test Suite
> Affects Versions: 5.2.0.Beta5
> Reporter: Mircea Markus
> Assignee: Adrian Nistor
> Labels: testsuite_stability
> Fix For: 5.2.2, 5.3.0.Final
>
>
> Failure:
> {quote}
> org.infinispan.CacheException: Could not commit implicit transaction
> at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1168)
> at org.infinispan.CacheImpl.putInternal(CacheImpl.java:741)
> at org.infinispan.CacheImpl.put(CacheImpl.java:735)
> at org.infinispan.CacheImpl.put(CacheImpl.java:729)
> at org.infinispan.CacheSupport.put(CacheSupport.java:53)
> at org.infinispan.container.versioning.VersionedDistStateTransferTest.testStateTransfer(VersionedDistStateTransferTest.java:101)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
> at org.testng.internal.Invoker.invokeMethod(Invoker.java:715)
> at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:907)
> at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1237)
> at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
> at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
> at org.testng.TestRunner.privateRun(TestRunner.java:767)
> at org.testng.TestRunner.run(TestRunner.java:617)
> at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
> at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
> at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
> at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:680)
> {quote}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2608) VersionedDistStateTransferTest intermittent failure
by Adrian Nistor (JIRA)
[ https://issues.jboss.org/browse/ISPN-2608?page=com.atlassian.jira.plugin.... ]
Work on ISPN-2608 started by Adrian Nistor.
> VersionedDistStateTransferTest intermittent failure
> ---------------------------------------------------
>
> Key: ISPN-2608
> URL: https://issues.jboss.org/browse/ISPN-2608
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer, Test Suite
> Affects Versions: 5.2.0.Beta5
> Reporter: Mircea Markus
> Assignee: Adrian Nistor
> Labels: testsuite_stability
> Fix For: 5.2.2, 5.3.0.Final
>
>
> Failure:
> {quote}
> org.infinispan.CacheException: Could not commit implicit transaction
> at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1168)
> at org.infinispan.CacheImpl.putInternal(CacheImpl.java:741)
> at org.infinispan.CacheImpl.put(CacheImpl.java:735)
> at org.infinispan.CacheImpl.put(CacheImpl.java:729)
> at org.infinispan.CacheSupport.put(CacheSupport.java:53)
> at org.infinispan.container.versioning.VersionedDistStateTransferTest.testStateTransfer(VersionedDistStateTransferTest.java:101)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
> at org.testng.internal.Invoker.invokeMethod(Invoker.java:715)
> at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:907)
> at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1237)
> at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
> at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
> at org.testng.TestRunner.privateRun(TestRunner.java:767)
> at org.testng.TestRunner.run(TestRunner.java:617)
> at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
> at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
> at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
> at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:680)
> {quote}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2718) MinViewIdCalculusTest.testMinViewId1 randomly failing
by Adrian Nistor (JIRA)
[ https://issues.jboss.org/browse/ISPN-2718?page=com.atlassian.jira.plugin.... ]
Adrian Nistor closed ISPN-2718.
-------------------------------
Resolution: Done
> MinViewIdCalculusTest.testMinViewId1 randomly failing
> -----------------------------------------------------
>
> Key: ISPN-2718
> URL: https://issues.jboss.org/browse/ISPN-2718
> Project: Infinispan
> Issue Type: Bug
> Components: Locking and Concurrency
> Reporter: Galder Zamarreño
> Assignee: Adrian Nistor
> Labels: testsuite_stability
> Fix For: 5.2.2, 5.3.0.Alpha1
>
> Attachments: testMinViewId1-4.tgz
>
>
> {code}testMinViewId1(org.infinispan.lock.singlelock.MinViewIdCalculusTest) Time elapsed: 0.002 sec <<< FAILURE!
> java.lang.AssertionError: expected [4] but found [3]
> at org.testng.Assert.fail(Assert.java:89)
> at org.testng.Assert.failNotEquals(Assert.java:489)
> at org.testng.Assert.assertEquals(Assert.java:118)
> at org.testng.Assert.assertEquals(Assert.java:365)
> at org.testng.Assert.assertEquals(Assert.java:375)
> at org.infinispan.lock.singlelock.MinViewIdCalculusTest.testMinViewId1(MinViewIdCalculusTest.java:81)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
> at org.testng.internal.Invoker.invokeMethod(Invoker.java:715)
> at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:907)
> at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1237)
> at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
> at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
> at org.testng.TestRunner.privateRun(TestRunner.java:767)
> at org.testng.TestRunner.run(TestRunner.java:617)
> at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
> at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
> at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
> at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:680){code}
> Is this related to ISPN-2114?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2718) MinViewIdCalculusTest.testMinViewId1 randomly failing
by Adrian Nistor (JIRA)
[ https://issues.jboss.org/browse/ISPN-2718?page=com.atlassian.jira.plugin.... ]
Adrian Nistor commented on ISPN-2718:
-------------------------------------
By analyzing the log I found that the call to waitForClusterToForm() in testMinViewId1() returns too soon. The topology 3 (which is a rebalance) is installed on all nodes when it returns but the topology 4 (the CH update that follows the rebalance) is not processed yet by node B. Still, TestUtil.waitForRehashToComplete logs "Node MinViewIdCalculusTest-Node?-.* finished state transfer" for all 3 nodes and returns. The problem with TestUtil.waitForRehashToComplete returning too soon was fixed since then by changes in commit 1eeebcff5f5cea39fb2a71376d2bb5c0ae21fe7e done for ISPN-2507.
> MinViewIdCalculusTest.testMinViewId1 randomly failing
> -----------------------------------------------------
>
> Key: ISPN-2718
> URL: https://issues.jboss.org/browse/ISPN-2718
> Project: Infinispan
> Issue Type: Bug
> Components: Locking and Concurrency
> Reporter: Galder Zamarreño
> Assignee: Adrian Nistor
> Labels: testsuite_stability
> Fix For: 5.2.2, 5.3.0.Alpha1
>
> Attachments: testMinViewId1-4.tgz
>
>
> {code}testMinViewId1(org.infinispan.lock.singlelock.MinViewIdCalculusTest) Time elapsed: 0.002 sec <<< FAILURE!
> java.lang.AssertionError: expected [4] but found [3]
> at org.testng.Assert.fail(Assert.java:89)
> at org.testng.Assert.failNotEquals(Assert.java:489)
> at org.testng.Assert.assertEquals(Assert.java:118)
> at org.testng.Assert.assertEquals(Assert.java:365)
> at org.testng.Assert.assertEquals(Assert.java:375)
> at org.infinispan.lock.singlelock.MinViewIdCalculusTest.testMinViewId1(MinViewIdCalculusTest.java:81)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
> at org.testng.internal.Invoker.invokeMethod(Invoker.java:715)
> at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:907)
> at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1237)
> at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
> at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
> at org.testng.TestRunner.privateRun(TestRunner.java:767)
> at org.testng.TestRunner.run(TestRunner.java:617)
> at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
> at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
> at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
> at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:680){code}
> Is this related to ISPN-2114?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on ISPN-2802:
--------------------------------
Obviously, times are not synchronized between 811 and 806, that's why 806 gets the message before 811 starts retransmitting it :-)
> Cache recovery fails due to missing responses
> ---------------------------------------------
>
> Key: ISPN-2802
> URL: https://issues.jboss.org/browse/ISPN-2802
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.0.CR3
> Reporter: Radim Vansa
> Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.... ]
Bela Ban edited comment on ISPN-2802 at 2/11/13 8:43 AM:
---------------------------------------------------------
OK, so I guess I know the problem...
First off, 806 asks 811 to retransmit message #178639 from 2:45:07 - 2:45:49, for ca. 42 seconds:
{noformat}
(cat node8.log |grep -e "<-- XMIT(hyperion806-18589" -e "<-- STABLE(hyperion806-18589")
{noformat}
At 2:44:34, 806 actually does get message 178639 from 811:
{noformat}
02:44:34,677 [UNICAST2] hyperion806-18589 <-- DATA(hyperion811-7798: #178637, conn_id=3)
{noformat}
and we can see that 811 gets a STABLE message from 806 at 2:46:48:
{noformat}
02:46:48,246 [UNICAST2] hyperion811-7798 <-- STABLE(hyperion806-18589: 178641-178641, conn_id=3)
{noformat}
My guess is that the retransmitted message (especially #178639) is an OOB message, and that the OOB thread pool at the requester is full, therefore dropping the message to the ground, until there is finally a thread available that's processing the retransmitted message.
I guess all the more important to get that Infinispan-internal thread pool.
was (Author: belaban):
OK, so I guess I know the problem...
First off, 806 asks 811 to retransmit message #178639 from 2:45:07 - 2:45:49, for ca. 42 seconds:
(cat node8.log |grep -e "<-- XMIT(hyperion806-18589" -e "<-- STABLE(hyperion806-18589")
At 2:44:34, 806 actually does get message 178639 from 811:
02:44:34,677 [UNICAST2] hyperion806-18589 <-- DATA(hyperion811-7798: #178637, conn_id=3)
and we can see that 811 gets a STABLE message from 806 at 2:46:48:
02:46:48,246 [UNICAST2] hyperion811-7798 <-- STABLE(hyperion806-18589: 178641-178641, conn_id=3)
My guess is that the retransmitted message (especially #178639) is an OOB message, and that the OOB thread pool at the requester is full, therefore dropping the message to the ground, until there is finally a thread available that's processing the retransmitted message.
I guess all the more important to get that Infinispan-internal thread pool.
> Cache recovery fails due to missing responses
> ---------------------------------------------
>
> Key: ISPN-2802
> URL: https://issues.jboss.org/browse/ISPN-2802
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.0.CR3
> Reporter: Radim Vansa
> Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on ISPN-2802:
--------------------------------
OK, so I guess I know the problem...
First off, 806 asks 811 to retransmit message #178639 from 2:45:07 - 2:45:49, for ca. 42 seconds:
(cat node8.log |grep -e "<-- XMIT(hyperion806-18589" -e "<-- STABLE(hyperion806-18589")
At 2:44:34, 806 actually does get message 178639 from 811:
02:44:34,677 [UNICAST2] hyperion806-18589 <-- DATA(hyperion811-7798: #178637, conn_id=3)
and we can see that 811 gets a STABLE message from 806 at 2:46:48:
02:46:48,246 [UNICAST2] hyperion811-7798 <-- STABLE(hyperion806-18589: 178641-178641, conn_id=3)
My guess is that the retransmitted message (especially #178639) is an OOB message, and that the OOB thread pool at the requester is full, therefore dropping the message to the ground, until there is finally a thread available that's processing the retransmitted message.
I guess all the more important to get that Infinispan-internal thread pool.
> Cache recovery fails due to missing responses
> ---------------------------------------------
>
> Key: ISPN-2802
> URL: https://issues.jboss.org/browse/ISPN-2802
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.0.CR3
> Reporter: Radim Vansa
> Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month
[JBoss JIRA] (ISPN-2802) Cache recovery fails due to missing responses
by Radim Vansa (JIRA)
[ https://issues.jboss.org/browse/ISPN-2802?page=com.atlassian.jira.plugin.... ]
Radim Vansa commented on ISPN-2802:
-----------------------------------
Bela: sorry :) node3 = hyperion806 sends GET_STATUS, you can see that node8 = hyperion811 gets the message and responds with message
{code}
02:44:34,677 BYTEMAN (OOB-94,hyperion811-7798): Receiving local hyperion806-18589 -> null: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=hyperion806-18589, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=33}
02:44:34,678 BYTEMAN (OOB-94,hyperion811-7798): Received local hyperion806-18589 -> null: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=hyperion806-18589, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=33}
02:44:34,690 TRACE [org.jgroups.protocols.UNICAST2] (OOB-94,hyperion811-7798) hyperion811-7798 --> DATA(hyperion806-18589: #178639, conn_id=3)
{code}
Then there are repeating
{code}
02:45:07,652 TRACE [org.jgroups.protocols.UNICAST2] (OOB-94,hyperion811-7798) hyperion811-7798 <-- XMIT(hyperion806-18589: #[178639])
02:45:07,652 BYTEMAN (OOB-94,hyperion811-7798): XMIT 178639 to hyperion806-18589
{code}
(the order of messages in log seems to differ as BYTEMAN writes to stdout and not through the logging system)
> Cache recovery fails due to missing responses
> ---------------------------------------------
>
> Key: ISPN-2802
> URL: https://issues.jboss.org/browse/ISPN-2802
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.0.CR3
> Reporter: Radim Vansa
> Assignee: Mircea Markus
>
> When the cache recovery is started, the new coordinator sends CacheTopologyControlCommand.GET_STATUS to all nodes and waits for responses. However, I have a reproducible test-case where it always times out waiting for the responses.
> Here are the logs (TRACE is not doable here, but I added some byteman traces - see topology.btm in the archive): http://dl.dropbox.com/u/103079234/recovery.zip
> The problematic spot is on node3 at 05:37:57 receiving cluster view 34.
> All nodes (except the one which is killed, in this case node1) respond quickly to the GET_STATUS command (see BYTEMAN Receiving - Received pairs, these are bound to command execution in CommandAwareRpcDispatcher), but some responses are not received on node3 (look for Receiving rsp bound to GroupRequest).
> JGroups tracing could be useful here but it is not available (intensive logging often blocks on internal log4j locks and the node becomes unresponsive).
> As mentioned above, the case is reproducible, therefore if you can suggest any particular BYTEMAN hook, I can try it.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 1 month