[JBoss JIRA] (ISPN-4845) statetransfer.ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus fails randomly
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-4845?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant closed ISPN-4845.
---------------------------------
Resolution: Done
> statetransfer.ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus fails randomly
> -------------------------------------------------------------------------------------
>
> Key: ISPN-4845
> URL: https://issues.jboss.org/browse/ISPN-4845
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 5.2.10.Final
> Reporter: Michal Vinkler
> Assignee: Dan Berindei
> Labels: 5.2.x
> Fix For: 8.1.4.Final, 8.2.0.Beta1
>
>
> Seen with EAP 6.3.0.ER10, Infinispan 5.2.10
> Test org.infinispan.statetransfer.ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus randomly fails (seen on Solaris and HP-UX).
> Might be the same as ISPN-4743.
> Stacktraces:
> HP-UX version
> Error Message
> {code}
> Timed out waiting for rebalancing to complete on node ClusterTopologyManagerTest-NodeB-47391, expected member list is [ClusterTopologyManagerTest-NodeB-47391], current member list is [ClusterTopologyManagerTest-NodeB-47391, ClusterTopologyManagerTest-NodeC-55740]!
> {code}
> Stacktrace
> {code}
> java.lang.RuntimeException: Timed out waiting for rebalancing to complete on node ClusterTopologyManagerTest-NodeB-47391, expected member list is [ClusterTopologyManagerTest-NodeB-47391], current member list is [ClusterTopologyManagerTest-NodeB-47391, ClusterTopologyManagerTest-NodeC-55740]!
> at org.infinispan.test.TestingUtil.waitForRehashToComplete(TestingUtil.java:203)
> at org.infinispan.statetransfer.ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus(ClusterTopologyManagerTest.java:353)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
> at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
> at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
> at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
> at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
> at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
> at org.testng.TestRunner.privateRun(TestRunner.java:767)
> at org.testng.TestRunner.run(TestRunner.java:617)
> at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
> at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
> at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
> at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> Also see standard output:
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Infi...
> Solaris version
> Error Message
> {code}
> Thread already timed out waiting for event 3 left
> {code}
> Stacktrace
> {code}
> java.lang.IllegalStateException: Thread already timed out waiting for event 3 left
> at org.infinispan.test.fwk.CheckPoint.trigger(CheckPoint.java:150)
> at org.infinispan.test.fwk.CheckPoint.trigger(CheckPoint.java:135)
> at org.infinispan.statetransfer.ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus(ClusterTopologyManagerTest.java:350)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
> at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
> at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
> at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
> at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
> at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
> at org.testng.TestRunner.privateRun(TestRunner.java:767)
> at org.testng.TestRunner.run(TestRunner.java:617)
> at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
> at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
> at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
> at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> Also see standard output:
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Infi...
> Downstream BZ was: https://bugzilla.redhat.com/show_bug.cgi?id=987461
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
8 years, 8 months
[JBoss JIRA] (ISPN-5481) ConfigurationOverrideTest random failures
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-5481?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant reopened ISPN-5481:
-----------------------------------
> ConfigurationOverrideTest random failures
> -----------------------------------------
>
> Key: ISPN-5481
> URL: https://issues.jboss.org/browse/ISPN-5481
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 7.2.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 8.2.0.Beta1, 8.2.0.Final
>
>
> {{ConfigurationOverrideTest}} uses the default global configuration, and it fails when another test has already registered a cache manager MBean in JMX with the same name:
> {noformat}
> org.infinispan.jmx.JmxDomainConflictException: ISPN000034: There's already a JMX MBean instance type=CacheManager,name="DefaultCacheManager" already registered under 'org.infinispan' JMX domain. If you want to allow multiple instances configured with same JMX domain enable 'allowDuplicateDomains' attribute in 'globalJmxStatistics' config element
> at org.infinispan.jmx.JmxUtil.buildJmxDomain(JmxUtil.java:51)
> at org.infinispan.jmx.CacheManagerJmxRegistration.updateDomain(CacheManagerJmxRegistration.java:79)
> at org.infinispan.jmx.CacheManagerJmxRegistration.buildRegistrar(CacheManagerJmxRegistration.java:73)
> at org.infinispan.jmx.AbstractJmxRegistration.registerMBeans(AbstractJmxRegistration.java:37)
> at org.infinispan.jmx.CacheManagerJmxRegistration.start(CacheManagerJmxRegistration.java:41)
> at org.infinispan.manager.DefaultCacheManager.start(DefaultCacheManager.java:625)
> at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:218)
> at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:199)
> at org.infinispan.configuration.ConfigurationOverrideTest.testOverrideWithStore(ConfigurationOverrideTest.java:80)
> {noformat}
> We should verify the other tests as well, to make sure they all use the {{PerThreadMBeanServerLookup}} and/or a unique JMX domain.
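The conflict described above can be reproduced with nothing but the JDK's JMX classes. The following is a minimal sketch, not Infinispan code: `JmxDomainConflictSketch`, `DummyManager` and `register` are hypothetical names, and a plain `MBeanServer` created through `MBeanServerFactory` stands in for whatever server the test suite shares. It shows that a second registration under the same domain and name is rejected, while a per-test domain or a separate `MBeanServer` (the idea behind a per-thread lookup) avoids the clash.

```java
import javax.management.InstanceAlreadyExistsException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

public class JmxDomainConflictSketch {
    // Standard MBean convention: the interface is named <class name> + "MBean".
    public interface DummyManagerMBean {
        String getDomain();
    }

    // Hypothetical stand-in for a cache manager that exposes itself over JMX.
    public static class DummyManager implements DummyManagerMBean {
        private final String domain;

        public DummyManager(String domain) {
            this.domain = domain;
        }

        @Override
        public String getDomain() {
            return domain;
        }
    }

    /** Registers a dummy manager; returns false if the same name is already registered. */
    public static boolean register(MBeanServer server, String domain) {
        try {
            ObjectName name = new ObjectName(
                    domain + ":type=CacheManager,name=\"DefaultCacheManager\"");
            server.registerMBean(new DummyManager(domain), name);
            return true;
        } catch (InstanceAlreadyExistsException e) {
            // Same situation as the JmxDomainConflictException in the stack trace.
            return false;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        MBeanServer shared = MBeanServerFactory.newMBeanServer();
        // Two tests using the default domain on the same server: the second one fails.
        System.out.println("first, default domain:  " + register(shared, "org.infinispan"));
        System.out.println("second, default domain: " + register(shared, "org.infinispan"));
        // Option 1: a domain unique to the test.
        System.out.println("unique domain:          "
                + register(shared, "org.infinispan.ConfigurationOverrideTest"));
        // Option 2: a separate MBeanServer for the test.
        MBeanServer isolated = MBeanServerFactory.newMBeanServer();
        System.out.println("isolated server:        " + register(isolated, "org.infinispan"));
    }
}
```

In an actual Infinispan test the equivalent choice would be made in the global configuration rather than by registering MBeans by hand.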
[JBoss JIRA] (ISPN-5481) ConfigurationOverrideTest random failures
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-5481?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-5481:
----------------------------------
Fix Version/s: 8.1.4.Final
> ConfigurationOverrideTest random failures
> -----------------------------------------
>
> Key: ISPN-5481
> URL: https://issues.jboss.org/browse/ISPN-5481
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 7.2.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 8.2.0.Beta1, 8.2.0.Final, 8.1.4.Final
>
>
> {{ConfigurationOverrideTest}} uses the default global configuration, and it fails when another test has already registered a cache manager MBean in JMX with the same name:
> {noformat}
> org.infinispan.jmx.JmxDomainConflictException: ISPN000034: There's already a JMX MBean instance type=CacheManager,name="DefaultCacheManager" already registered under 'org.infinispan' JMX domain. If you want to allow multiple instances configured with same JMX domain enable 'allowDuplicateDomains' attribute in 'globalJmxStatistics' config element
> at org.infinispan.jmx.JmxUtil.buildJmxDomain(JmxUtil.java:51)
> at org.infinispan.jmx.CacheManagerJmxRegistration.updateDomain(CacheManagerJmxRegistration.java:79)
> at org.infinispan.jmx.CacheManagerJmxRegistration.buildRegistrar(CacheManagerJmxRegistration.java:73)
> at org.infinispan.jmx.AbstractJmxRegistration.registerMBeans(AbstractJmxRegistration.java:37)
> at org.infinispan.jmx.CacheManagerJmxRegistration.start(CacheManagerJmxRegistration.java:41)
> at org.infinispan.manager.DefaultCacheManager.start(DefaultCacheManager.java:625)
> at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:218)
> at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:199)
> at org.infinispan.configuration.ConfigurationOverrideTest.testOverrideWithStore(ConfigurationOverrideTest.java:80)
> {noformat}
> We should verify the other tests as well, to make sure they all use the {{PerThreadMBeanServerLookup}} and/or a unique JMX domain.
[JBoss JIRA] (ISPN-5481) ConfigurationOverrideTest random failures
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-5481?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant closed ISPN-5481.
---------------------------------
Resolution: Done
> ConfigurationOverrideTest random failures
> -----------------------------------------
>
> Key: ISPN-5481
> URL: https://issues.jboss.org/browse/ISPN-5481
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 7.2.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 8.1.4.Final, 8.2.0.Final, 8.2.0.Beta1
>
>
> {{ConfigurationOverrideTest}} uses the default global configuration, and it fails when another test has already registered a cache manager MBean in JMX with the same name:
> {noformat}
> org.infinispan.jmx.JmxDomainConflictException: ISPN000034: There's already a JMX MBean instance type=CacheManager,name="DefaultCacheManager" already registered under 'org.infinispan' JMX domain. If you want to allow multiple instances configured with same JMX domain enable 'allowDuplicateDomains' attribute in 'globalJmxStatistics' config element
> at org.infinispan.jmx.JmxUtil.buildJmxDomain(JmxUtil.java:51)
> at org.infinispan.jmx.CacheManagerJmxRegistration.updateDomain(CacheManagerJmxRegistration.java:79)
> at org.infinispan.jmx.CacheManagerJmxRegistration.buildRegistrar(CacheManagerJmxRegistration.java:73)
> at org.infinispan.jmx.AbstractJmxRegistration.registerMBeans(AbstractJmxRegistration.java:37)
> at org.infinispan.jmx.CacheManagerJmxRegistration.start(CacheManagerJmxRegistration.java:41)
> at org.infinispan.manager.DefaultCacheManager.start(DefaultCacheManager.java:625)
> at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:218)
> at org.infinispan.manager.DefaultCacheManager.<init>(DefaultCacheManager.java:199)
> at org.infinispan.configuration.ConfigurationOverrideTest.testOverrideWithStore(ConfigurationOverrideTest.java:80)
> {noformat}
> We should verify the other tests as well, to make sure they all use the {{PerThreadMBeanServerLookup}} and/or a unique JMX domain.
[JBoss JIRA] (ISPN-5883) Node can apply new topology after sending status response
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-5883?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant reopened ISPN-5883:
-----------------------------------
> Node can apply new topology after sending status response
> ---------------------------------------------------------
>
> Key: ISPN-5883
> URL: https://issues.jboss.org/browse/ISPN-5883
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 8.0.1.Final, 7.2.5.Final, 8.1.0.Alpha2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: testsuite_stability
> Fix For: 8.2.0.Beta1
>
>
> {{LocalTopologyManagerImpl}} is responsible for sending the {{ClusterTopologyControlCommand(GET_STATUS)}} response, and when it sends the response it doesn't check the current view id against the new coordinator's view id. If the old coordinator already sent a topology update before the merge, that topology update might be processed after sending the status response. The new coordinator will send a topology update with a topology id of {{max(status response topology ids) + 1}}. The node will then process the topology update from the old coordinator, but it will ignore the topology update from the new coordinator with the same topology id.
> This is especially common in the partition handling tests, e.g. {{BasePessimisticTxPartitionAndMergeTest}} subclasses, because the test "injects" the JGroups view on each node serially, and the 4th node often sends the status response before it gets the new view.
> {noformat}
> 22:16:37,776 DEBUG (remote-thread-NodeD-p26-t6:[]) [LocalTopologyManagerImpl] Sending cluster status response for view 10
> // Topology from NodeC
> 22:16:37,778 DEBUG (transport-thread-NodeD-p28-t2:[]) [LocalTopologyManagerImpl] Updating local topology for cache pes-cache: CacheTopology{id=8, rebalanceId=3, currentCH=DefaultConsistentHash{ns=60, owners = (4)[NodeA-37631: 15+15, NodeB-47846: 15+15, NodeC-46467: 15+15, NodeD-30486: 15+15]}, pendingCH=null, unionCH=null, actualMembers=[NodeC-46467, NodeD-30486]}
> // Later, topology from NodeA
> 22:16:37,827 DEBUG (transport-thread-NodeD-p28-t1:[]) [LocalTopologyManagerImpl] Ignoring late consistent hash update for cache pes-cache, current topology is 8: CacheTopology{id=8, rebalanceId=3, currentCH=DefaultConsistentHash{ns=60, owners = (4)[NodeA-37631: 15+15, NodeB-47846: 15+15, NodeC-46467: 15+15, NodeD-30486: 15+15]}, pendingCH=null, unionCH=null, actualMembers=[NodeA-37631, NodeB-47846, NodeC-46467, NodeD-30486]}
> {noformat}
> As a solution, we can delay sending the status response until we have the same view as the coordinator (or a later one). We already check that the sender is the current coordinator before applying a topology update, so this guarantees that we don't apply further topology updates from the old coordinator. Since the status request is only sent after the new view was installed, this will not introduce any delays in the vast majority of cases.
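The proposed delay can be pictured as a small gate that the status handler passes through before replying. This is only a sketch of the idea under stated assumptions, not the actual `LocalTopologyManagerImpl` change: `ViewGateSketch`, `installView` and `awaitView` are hypothetical names, and view ids are modelled as plain integers.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class ViewGateSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition viewChanged = lock.newCondition();
    private int currentViewId;

    public ViewGateSketch(int initialViewId) {
        this.currentViewId = initialViewId;
    }

    /** Called when a newer view is installed on this node. */
    public void installView(int viewId) {
        lock.lock();
        try {
            if (viewId > currentViewId) {
                currentViewId = viewId;
                viewChanged.signalAll();
            }
        } finally {
            lock.unlock();
        }
    }

    /**
     * Waits until the local view id is at least the coordinator's view id.
     * Returns true if it is safe to send the status response, false on
     * timeout or interruption.
     */
    public boolean awaitView(int coordinatorViewId, long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (currentViewId < coordinatorViewId) {
                if (nanos <= 0L) {
                    return false;
                }
                nanos = viewChanged.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ViewGateSketch gate = new ViewGateSketch(9);
        // The new coordinator asks for status in view 10, but this node still has view 9.
        new Thread(() -> gate.installView(10)).start();
        if (gate.awaitView(10, 5_000)) {
            System.out.println("Sending cluster status response for view 10");
        }
    }
}
```

In the common case the view is already installed when the request arrives, so the loop body never runs and the reply is not delayed.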
[JBoss JIRA] (ISPN-5883) Node can apply new topology after sending status response
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-5883?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-5883:
----------------------------------
Fix Version/s: 8.1.4.Final
> Node can apply new topology after sending status response
> ---------------------------------------------------------
>
> Key: ISPN-5883
> URL: https://issues.jboss.org/browse/ISPN-5883
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 8.0.1.Final, 7.2.5.Final, 8.1.0.Alpha2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: testsuite_stability
> Fix For: 8.2.0.Beta1, 8.1.4.Final
>
>
> {{LocalTopologyManagerImpl}} is responsible for sending the {{ClusterTopologyControlCommand(GET_STATUS)}} response, and when it sends the response it doesn't check the current view id against the new coordinator's view id. If the old coordinator already sent a topology update before the merge, that topology update might be processed after sending the status response. The new coordinator will send a topology update with a topology id of {{max(status response topology ids) + 1}}. The node will then process the topology update from the old coordinator, but it will ignore the topology update from the new coordinator with the same topology id.
> This is especially common in the partition handling tests, e.g. {{BasePessimisticTxPartitionAndMergeTest}} subclasses, because the test "injects" the JGroups view on each node serially, and the 4th node often sends the status response before it gets the new view.
> {noformat}
> 22:16:37,776 DEBUG (remote-thread-NodeD-p26-t6:[]) [LocalTopologyManagerImpl] Sending cluster status response for view 10
> // Topology from NodeC
> 22:16:37,778 DEBUG (transport-thread-NodeD-p28-t2:[]) [LocalTopologyManagerImpl] Updating local topology for cache pes-cache: CacheTopology{id=8, rebalanceId=3, currentCH=DefaultConsistentHash{ns=60, owners = (4)[NodeA-37631: 15+15, NodeB-47846: 15+15, NodeC-46467: 15+15, NodeD-30486: 15+15]}, pendingCH=null, unionCH=null, actualMembers=[NodeC-46467, NodeD-30486]}
> // Later, topology from NodeA
> 22:16:37,827 DEBUG (transport-thread-NodeD-p28-t1:[]) [LocalTopologyManagerImpl] Ignoring late consistent hash update for cache pes-cache, current topology is 8: CacheTopology{id=8, rebalanceId=3, currentCH=DefaultConsistentHash{ns=60, owners = (4)[NodeA-37631: 15+15, NodeB-47846: 15+15, NodeC-46467: 15+15, NodeD-30486: 15+15]}, pendingCH=null, unionCH=null, actualMembers=[NodeA-37631, NodeB-47846, NodeC-46467, NodeD-30486]}
> {noformat}
> As a solution, we can delay sending the status response until we have the same view as the coordinator (or a later one). We already check that the sender is the current coordinator before applying a topology update, so this guarantees that we don't apply further topology updates from the old coordinator. Since the status request is only sent after the new view was installed, this will not introduce any delays in the vast majority of cases.
[JBoss JIRA] (ISPN-6384) JGroupsTransport.invokeRemotelyAsync with a filter returns null on timeout
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6384?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-6384:
----------------------------------
Fix Version/s: 8.1.4.Final
> JGroupsTransport.invokeRemotelyAsync with a filter returns null on timeout
> --------------------------------------------------------------------------
>
> Key: ISPN-6384
> URL: https://issues.jboss.org/browse/ISPN-6384
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.Final, 9.0.0.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.2.1.Final, 9.0.0.Alpha1, 8.1.4.Final, 9.0.0.Final
>
>
> {{JGroupsTransport.invokeRemotelyAsync()}} has a {{ResponseFilter}} parameter that was traditionally used only with {{ResponseMode.GET_FIRST}}, for remote get commands. In that particular case, returning a {{null}} when some of the nodes timed out and other nodes returned invalid responses (i.e. {{null}}) was acceptable.
> Since ISPN-4979, {{JGroupsTransport.invokeRemotelyAsync()}} is also used by {{ClusterTopologyManagerImpl}}, with {{ResponseMode.GET_ALL}}. Here, however, returning a {{null}} instead of throwing a {{TimeoutException}} is not acceptable.
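The difference between the two modes can be illustrated with plain `CompletableFuture`s. This is a rough sketch and not the `JGroupsTransport` implementation: `GetAllTimeoutSketch`, `collectAll` and the unchecked `ResponseTimeoutException` are hypothetical names, with the latter standing in for Infinispan's `TimeoutException`. The point is only that a caller waiting for all responses gets an exception when a node does not answer in time, rather than a `null` it would then have to interpret.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GetAllTimeoutSketch {
    /** Hypothetical unchecked exception used in place of Infinispan's TimeoutException. */
    public static class ResponseTimeoutException extends RuntimeException {
        public ResponseTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Waits for a response from every node, sharing one overall deadline.
     * Throws ResponseTimeoutException instead of returning null when any
     * node fails to answer in time.
     */
    public static Map<String, String> collectAll(
            Map<String, CompletableFuture<String>> pending, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        Map<String, String> responses = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<String>> entry : pending.entrySet()) {
            long remaining = Math.max(deadline - System.nanoTime(), 0L);
            try {
                responses.put(entry.getKey(),
                        entry.getValue().get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException e) {
                throw new ResponseTimeoutException(
                        "Timed out waiting for a response from " + entry.getKey(), e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for responses", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(
                        "Remote invocation failed on " + entry.getKey(), e.getCause());
            }
        }
        return responses;
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        pending.put("NodeA", CompletableFuture.completedFuture("status-A"));
        pending.put("NodeB", new CompletableFuture<>()); // never completes
        try {
            collectAll(pending, 100);
        } catch (ResponseTimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```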