[JBoss JIRA] (ISPN-7606) DistributedExecutorMassIndexer.executeInternal should use an existing pool
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-7606?page=com.atlassian.jira.plugin.... ]
Dan Berindei reassigned ISPN-7606:
----------------------------------
Assignee: Dan Berindei (was: Gustavo Fernandes)
> DistributedExecutorMassIndexer.executeInternal should use an existing pool
> --------------------------------------------------------------------------
>
> Key: ISPN-7606
> URL: https://issues.jboss.org/browse/ISPN-7606
> Project: Infinispan
> Issue Type: Bug
> Components: Embedded Querying
> Affects Versions: 9.0.0.CR2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Final
>
>
> {{DistributedExecutorMassIndexer.executeInternal}} creates 2 executors:
> 1. {{new DefaultExecutorService(cache)}} implicitly creates a single-threaded executor to run tasks on the local node.
> 2. {{compositeFuture.whenCompleteAsync(consumer, Executors.newSingleThreadExecutor())}} explicitly creates a new single-threaded executor to run the consumer on.
> Neither of these executors are shut down, and rather than complicating the code to stop them properly it would be better to use an existing executor.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 9 months
[JBoss JIRA] (ISPN-10111) Avoiding Duplicate MBean Registration with Spring
by Donald Naro (Jira)
[ https://issues.jboss.org/browse/ISPN-10111?page=com.atlassian.jira.plugin... ]
Donald Naro updated ISPN-10111:
-------------------------------
Status: Pull Request Sent (was: Open)
Git Pull Request: https://github.com/infinispan/infinispan/pull/6830
> Avoiding Duplicate MBean Registration with Spring
> -------------------------------------------------
>
> Key: ISPN-10111
> URL: https://issues.jboss.org/browse/ISPN-10111
> Project: Infinispan
> Issue Type: Enhancement
> Affects Versions: 10.0.0.Beta4, 9.4.12.Final
> Reporter: Donald Naro
> Assignee: Donald Naro
> Priority: Major
>
> org.springframework.jmx.export.UnableToRegisterMBeanException: Unable to register MBean [org.infinispan.client.hotrod.RemoteCacheManager@51a06cbe] with key 'remoteCacheManager'; nested exception is javax.management.InstanceAlreadyExistsException: MXBean already registered with name org.infinispan:type=HotRodClient,name=Default
> at org.springframework.jmx.export.MBeanExporter.registerBeanNameOrInstance(MBeanExporter.java:625) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.lambda$registerBeans$2(MBeanExporter.java:551) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at java.util.HashMap.forEach(HashMap.java:1289) ~[na:1.8.0_201]
> at org.springframework.jmx.export.MBeanExporter.registerBeans(MBeanExporter.java:551) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.afterSingletonsInstantiated(MBeanExporter.java:434) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:863) ~[spring-beans-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:863) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:546) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:142) ~[spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:775) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:397) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:316) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:1260) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:1248) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at me.ignaciosanchez.hotrodtester.HotrodTesterApplication.main(HotrodTesterApplication.java:10) [classes/:na]
> Caused by: javax.management.InstanceAlreadyExistsException: MXBean already registered with name org.infinispan:type=HotRodClient,name=Default
> at com.sun.jmx.mbeanserver.MXBeanLookup.addReference(MXBeanLookup.java:151) ~[na:1.8.0_201]
> at com.sun.jmx.mbeanserver.MXBeanSupport.register(MXBeanSupport.java:160) ~[na:1.8.0_201]
> at com.sun.jmx.mbeanserver.MBeanSupport.preRegister2(MBeanSupport.java:173) ~[na:1.8.0_201]
> at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:930) ~[na:1.8.0_201]
> at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900) ~[na:1.8.0_201]
> at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324) ~[na:1.8.0_201]
> at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522) ~[na:1.8.0_201]
> at org.springframework.jmx.support.MBeanRegistrationSupport.doRegister(MBeanRegistrationSupport.java:137) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.registerBeanInstance(MBeanExporter.java:671) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.registerBeanNameOrInstance(MBeanExporter.java:615) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> ... 14 common frames omitted
> We are using a Spring Boot application (with Spring Cache and Spring Session) and maybe this could be the problem. My configuration is:
> @Bean
> public InfinispanRemoteConfigurer infinispanRemoteConfigurer() {
> return () -> new ConfigurationBuilder()
> .addServer()
> .host(host)
> .port(Integer.parseInt(port))
> .statistics()
> .enable()
> .jmxEnable()
> .build();
> }
> And I have also tried with this one:
> @Bean
> public SpringRemoteCacheManager cacheManager() {
> return new SpringRemoteCacheManager(infinispanCacheManager());
> }
> @Bean
> public RemoteCacheManager infinispanCacheManager() {
> ConfigurationBuilder builder = new ConfigurationBuilder();
> builder.addServer()
> .host(host)
> .port(Integer.parseInt(port))
> .statistics()
> .enable()
> .jmxEnable();
> return new RemoteCacheManager(builder.build());
> }
> But both return me the same error.
> Disable Spring JMX auto configuration: spring.jmx.enabled=false
> https://docs.spring.io/spring-boot/docs/current/api/org/springframework/b...
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 9 months
[JBoss JIRA] (ISPN-10111) Avoiding Duplicate MBean Registration with Spring
by Donald Naro (Jira)
[ https://issues.jboss.org/browse/ISPN-10111?page=com.atlassian.jira.plugin... ]
Donald Naro updated ISPN-10111:
-------------------------------
Status: Open (was: New)
> Avoiding Duplicate MBean Registration with Spring
> -------------------------------------------------
>
> Key: ISPN-10111
> URL: https://issues.jboss.org/browse/ISPN-10111
> Project: Infinispan
> Issue Type: Enhancement
> Affects Versions: 10.0.0.Beta4, 9.4.12.Final
> Reporter: Donald Naro
> Assignee: Donald Naro
> Priority: Major
>
> org.springframework.jmx.export.UnableToRegisterMBeanException: Unable to register MBean [org.infinispan.client.hotrod.RemoteCacheManager@51a06cbe] with key 'remoteCacheManager'; nested exception is javax.management.InstanceAlreadyExistsException: MXBean already registered with name org.infinispan:type=HotRodClient,name=Default
> at org.springframework.jmx.export.MBeanExporter.registerBeanNameOrInstance(MBeanExporter.java:625) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.lambda$registerBeans$2(MBeanExporter.java:551) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at java.util.HashMap.forEach(HashMap.java:1289) ~[na:1.8.0_201]
> at org.springframework.jmx.export.MBeanExporter.registerBeans(MBeanExporter.java:551) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.afterSingletonsInstantiated(MBeanExporter.java:434) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:863) ~[spring-beans-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:863) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:546) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:142) ~[spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:775) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:397) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:316) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:1260) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:1248) [spring-boot-2.1.2.RELEASE.jar:2.1.2.RELEASE]
> at me.ignaciosanchez.hotrodtester.HotrodTesterApplication.main(HotrodTesterApplication.java:10) [classes/:na]
> Caused by: javax.management.InstanceAlreadyExistsException: MXBean already registered with name org.infinispan:type=HotRodClient,name=Default
> at com.sun.jmx.mbeanserver.MXBeanLookup.addReference(MXBeanLookup.java:151) ~[na:1.8.0_201]
> at com.sun.jmx.mbeanserver.MXBeanSupport.register(MXBeanSupport.java:160) ~[na:1.8.0_201]
> at com.sun.jmx.mbeanserver.MBeanSupport.preRegister2(MBeanSupport.java:173) ~[na:1.8.0_201]
> at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:930) ~[na:1.8.0_201]
> at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900) ~[na:1.8.0_201]
> at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324) ~[na:1.8.0_201]
> at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522) ~[na:1.8.0_201]
> at org.springframework.jmx.support.MBeanRegistrationSupport.doRegister(MBeanRegistrationSupport.java:137) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.registerBeanInstance(MBeanExporter.java:671) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> at org.springframework.jmx.export.MBeanExporter.registerBeanNameOrInstance(MBeanExporter.java:615) ~[spring-context-5.1.4.RELEASE.jar:5.1.4.RELEASE]
> ... 14 common frames omitted
> We are using a Spring Boot application (with Spring Cache and Spring Session) and maybe this could be the problem. My configuration is:
> @Bean
> public InfinispanRemoteConfigurer infinispanRemoteConfigurer() {
> return () -> new ConfigurationBuilder()
> .addServer()
> .host(host)
> .port(Integer.parseInt(port))
> .statistics()
> .enable()
> .jmxEnable()
> .build();
> }
> And I have also tried with this one:
> @Bean
> public SpringRemoteCacheManager cacheManager() {
> return new SpringRemoteCacheManager(infinispanCacheManager());
> }
> @Bean
> public RemoteCacheManager infinispanCacheManager() {
> ConfigurationBuilder builder = new ConfigurationBuilder();
> builder.addServer()
> .host(host)
> .port(Integer.parseInt(port))
> .statistics()
> .enable()
> .jmxEnable();
> return new RemoteCacheManager(builder.build());
> }
> But both return me the same error.
> Disable Spring JMX auto configuration: spring.jmx.enabled=false
> https://docs.spring.io/spring-boot/docs/current/api/org/springframework/b...
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 9 months
[JBoss JIRA] (ISPN-10108) Query Project - Store the test results using TestNG
by Diego Lovison (Jira)
[ https://issues.jboss.org/browse/ISPN-10108?page=com.atlassian.jira.plugin... ]
Diego Lovison updated ISPN-10108:
---------------------------------
Summary: Query Project - Store the test results using TestNG (was: In the query project we cannot mix Junit with TestNG)
> Query Project - Store the test results using TestNG
> ---------------------------------------------------
>
> Key: ISPN-10108
> URL: https://issues.jboss.org/browse/ISPN-10108
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 10.0.0.Final, 9.4.11.Final
> Reporter: Diego Lovison
> Priority: Major
>
> Running the command {{mvn verify -pl 'query' -Dtest=MassIndexerAsyncBackendTest}} from the root folder will produce two XMLs
> TEST-infinispan-query.xml
> TEST-org.infinispan.query.distributed.MassIndexerAsyncBackendTest.xml
> One will be the Junit report and the other one will be the Polarion Report.
> The query project uses TestNG instead of JUNIT
> I would like to fix the following
> {code:xml}
> <testcase name="query.distributed.MassIndexerAsyncBackendTest" classname="org.infinispan.query.distributed.MassIndexerAsyncBackendTest" time="3.469"/>
> {code}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 9 months
[JBoss JIRA] (ISPN-9588) JGroups fails to install new cluster view after coordinator leave
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9588?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-9588:
------------------------------------
The {{ThreeWaySplitAndMergeTest}} failure could be related to ISPN-9291. I have reproduced the failure with trace logs by setting {{@Test(..., invocationCount=100)}} and attached the log ( [^ThreeWaySplitAndMergeTest.clearContent.log.gz]):
{noformat}
16:55:50,316 TRACE (Timer runner-1,Test-NodeA-28036:[]) [TCP_NIO2] Test-NodeA-28036: sending msg to null, src=Test-NodeA-28036, headers are NAKACK2: [HIGHEST_SEQNO, seqno=35], TP: [cluster_name=ISPN]
16:55:50,317 TRACE (jgroups-24,Test-NodeB-20743:[]) [TCP_NIO2] Test-NodeB-20743: received [dest: null sender: Test-NodeA-28036 (2 headers), size=0 bytes, flags=OOB|INTERNAL], headers are NAKACK2: [HIGHEST_SEQNO, seqno=35], TP: [cluster_name=ISPN]
16:55:50,317 TRACE (jgroups-7,Test-NodeC-39605:[]) [TCP_NIO2] Test-NodeC-39605: received [dest: null sender: Test-NodeA-28036 (2 headers), size=0 bytes, flags=OOB|INTERNAL], headers are NAKACK2: [HIGHEST_SEQNO, seqno=35], TP: [cluster_name=ISPN]
16:55:50,391 TRACE (jgroups-6,Test-NodeA-28036:[]) [GMS] Test-NodeA-28036: mcasting view [Test-NodeA-28036|1843] (2) [Test-NodeA-28036, Test-NodeB-20743]
16:55:50,392 DEBUG (jgroups-6,Test-NodeA-28036:[]) [GMS] Test-NodeA-28036: installing view [Test-NodeA-28036|1843] (2) [Test-NodeA-28036, Test-NodeB-20743]
16:55:50,395 TRACE (transport-thread-Test-NodeA-p6500-t6:[ViewHandling]) [JGroupsTransport] Test-NodeA-28036 sending request 140 to all: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=Test-NodeA-28036, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=1843}
16:55:50,395 TRACE (transport-thread-Test-NodeA-p6500-t6:[ViewHandling]) [TCP_NIO2] Test-NodeA-28036: sending msg to null, src=Test-NodeA-28036, headers are RequestCorrelator: corr_id=0, type=REQ, req_id=140, rsp_expected=true, NAKACK2: [MSG, seqno=36], TP: [cluster_name=ISPN]
16:55:50,395 TRACE (jgroups-22,Test-NodeB-20743:[]) [TCP_NIO2] Test-NodeB-20743: received [dest: null sender: Test-NodeA-28036 (3 headers), size=11 bytes, flags=OOB|NO_TOTAL_ORDER|NO_RELAY], headers are RequestCorrelator: corr_id=0, type=REQ, req_id=140, rsp_expected=true, NAKACK2: [MSG, seqno=36], TP: [cluster_name=ISPN]
{noformat}
It's clear from the log that NodeA doesn't try to send the view message to NodeB, because the last seqno before the view is 35 and the next multicast message after the view gets seqno 36.
> JGroups fails to install new cluster view after coordinator leave
> -----------------------------------------------------------------
>
> Key: ISPN-9588
> URL: https://issues.jboss.org/browse/ISPN-9588
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 9.4.0.CR3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Attachments: ISPN-9127_ScatteredCrashInSequenceTest-infinispan-core.log.gz, ISPN-9127_ThreeWaySplitAndMergeTest-infinispan-core.log.gz, ThreeWaySplitAndMergeTest.clearContent.log.gz
>
>
> The core test suite normally shuts down cache managers in the reverse order of their start, so that the coordinator stops last. This should be just an optimization, to reduce the number of coordinator changes, and with it the test suite duration and log size.
> Unfortunately, the few tests that stop the coordinator first are routinely failing to stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node doesn't receive the view:
> {noformat}
> 10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to Test-NodeA-8936
> 10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3) Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new coordinator
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[], suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request 140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG, type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=33}
> 10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave request for cache org.infinispan.CONFIG to coordinator
> org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
> {noformat}
> Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the stopped node and fails:
> {noformat}
> 10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to Test-NodeB-39137
> 10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message garbage collection
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume task started, max_suspend_time=220000
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task Test-NodeB-39137::7 started with 2 participants
> 10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
> 10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected 1 merge response(s) in 5153 ms
> 10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing merge view [Test-NodeB-39137|10] (2 members) in 1 coords
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge Test-NodeB-39137::7 took 5155 ms
> 10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 9 months
[JBoss JIRA] (ISPN-9588) JGroups fails to install new cluster view after coordinator leave
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9588?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-9588:
-------------------------------
Attachment: ThreeWaySplitAndMergeTest.clearContent.log.gz
> JGroups fails to install new cluster view after coordinator leave
> -----------------------------------------------------------------
>
> Key: ISPN-9588
> URL: https://issues.jboss.org/browse/ISPN-9588
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 9.4.0.CR3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Attachments: ISPN-9127_ScatteredCrashInSequenceTest-infinispan-core.log.gz, ISPN-9127_ThreeWaySplitAndMergeTest-infinispan-core.log.gz, ThreeWaySplitAndMergeTest.clearContent.log.gz
>
>
> The core test suite normally shuts down cache managers in the reverse order of their start, so that the coordinator stops last. This should be just an optimization, to reduce the number of coordinator changes, and with it the test suite duration and log size.
> Unfortunately, the few tests that stop the coordinator first are routinely failing to stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node doesn't receive the view:
> {noformat}
> 10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to Test-NodeA-8936
> 10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3) Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new coordinator
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[], suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request 140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG, type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=33}
> 10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave request for cache org.infinispan.CONFIG to coordinator
> org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
> {noformat}
> Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the stopped node and fails:
> {noformat}
> 10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to Test-NodeB-39137
> 10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message garbage collection
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume task started, max_suspend_time=220000
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task Test-NodeB-39137::7 started with 2 participants
> 10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
> 10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected 1 merge response(s) in 5153 ms
> 10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing merge view [Test-NodeB-39137|10] (2 members) in 1 coords
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge Test-NodeB-39137::7 took 5155 ms
> 10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 9 months
[JBoss JIRA] (ISPN-9588) JGroups fails to install new cluster view after coordinator leave
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9588?page=com.atlassian.jira.plugin.... ]
Dan Berindei edited comment on ISPN-9588 at 4/3/19 9:54 AM:
------------------------------------------------------------
We assumed that upgrading to JGroups 4.0.18.Final, which includes JGRP-2293 and JGRP-2324, would fix the issue.
Apparently not, as a 9.4.x pull request build hanged in {{ThreeWaySplitAndMergeTest.clearContent}} with the same symptoms:
{noformat}
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [DefaultCacheManager] Stopping cache manager ThreeWaySplitAndMergeTest-NodeC on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache org.infinispan.CONFIG on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,514 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache ___defaultcache on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,515 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS] ThreeWaySplitAndMergeTest-NodeA-41022: members are (3) ThreeWaySplitAndMergeTest-NodeC-48886,ThreeWaySplitAndMergeTest-NodeA-41022,ThreeWaySplitAndMergeTest-NodeB-21968, coord=ThreeWaySplitAndMergeTest-NodeA-41022: I'm the new coordinator
17:11:23,522 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS] ThreeWaySplitAndMergeTest-NodeA-41022: installing view [ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022, ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,523 INFO (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022, ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,526 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [TCP_NIO2] closing sockets and stopping threads
17:11:25,525 WARN (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS] ThreeWaySplitAndMergeTest-NodeA-41022: failed to collect all ACKs (expected=1) for view [ThreeWaySplitAndMergeTest-NodeA-41022|34] after 2000ms, missing 1 ACKs from (1) ThreeWaySplitAndMergeTest-NodeB-21968
17:15:23,558 ERROR (transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER] ISPN000196: Failed to recover cluster state after the current node became the coordinator (or after merge)
java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 153 from ThreeWaySplitAndMergeTest-NodeB-21968
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022) ~[?:?]
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:612) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:452) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.becomeCoordinator(ClusterTopologyManagerImpl.java:336) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:315) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:88) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:758) ~[classes/:?]
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 153 from ThreeWaySplitAndMergeTest-NodeB-21968
at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167) ~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87) ~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22) ~[classes/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
... 3 more
17:15:23,558 FATAL (transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER] ISPN100004: After merge (or coordinator change), the coordinator failed to recover cluster. Cluster members are [ThreeWaySplitAndMergeTest-NodeA-41022, ThreeWaySplitAndMergeTest-NodeB-21968].
{noformat}
was (Author: dan.berindei):
We assumed that upgrading to JGroups 4.0.18.Final, which includex JGRP-2324, would fix the issue.
Apparently not, as a 9.4.x pull request build hanged in {{ThreeWaySplitAndMergeTest.clearContent}} with the same symptoms:
{noformat}
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [DefaultCacheManager] Stopping cache manager ThreeWaySplitAndMergeTest-NodeC on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,512 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache org.infinispan.CONFIG on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,514 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [CacheImpl] Stopping cache ___defaultcache on ThreeWaySplitAndMergeTest-NodeC-48886
17:11:23,515 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS] ThreeWaySplitAndMergeTest-NodeA-41022: members are (3) ThreeWaySplitAndMergeTest-NodeC-48886,ThreeWaySplitAndMergeTest-NodeA-41022,ThreeWaySplitAndMergeTest-NodeB-21968, coord=ThreeWaySplitAndMergeTest-NodeA-41022: I'm the new coordinator
17:11:23,522 DEBUG (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS] ThreeWaySplitAndMergeTest-NodeA-41022: installing view [ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022, ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,523 INFO (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [ThreeWaySplitAndMergeTest-NodeA-41022|34] (2) [ThreeWaySplitAndMergeTest-NodeA-41022, ThreeWaySplitAndMergeTest-NodeB-21968]
17:11:23,526 DEBUG (testng-ThreeWaySplitAndMergeTest:[]) [TCP_NIO2] closing sockets and stopping threads
17:11:25,525 WARN (jgroups-21,ThreeWaySplitAndMergeTest-NodeA-41022:[]) [GMS] ThreeWaySplitAndMergeTest-NodeA-41022: failed to collect all ACKs (expected=1) for view [ThreeWaySplitAndMergeTest-NodeA-41022|34] after 2000ms, missing 1 ACKs from (1) ThreeWaySplitAndMergeTest-NodeB-21968
17:15:23,558 ERROR (transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER] ISPN000196: Failed to recover cluster state after the current node became the coordinator (or after merge)
java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 153 from ThreeWaySplitAndMergeTest-NodeB-21968
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022) ~[?:?]
at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:612) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:452) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.becomeCoordinator(ClusterTopologyManagerImpl.java:336) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:315) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:88) ~[classes/:?]
at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:758) ~[classes/:?]
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 153 from ThreeWaySplitAndMergeTest-NodeB-21968
at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167) ~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87) ~[classes/:?]
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22) ~[classes/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
... 3 more
17:15:23,558 FATAL (transport-thread-ThreeWaySplitAndMergeTest-NodeA-p2798-t1:[ViewHandling]) [CLUSTER] ISPN100004: After merge (or coordinator change), the coordinator failed to recover cluster. Cluster members are [ThreeWaySplitAndMergeTest-NodeA-41022, ThreeWaySplitAndMergeTest-NodeB-21968].
{noformat}
> JGroups fails to install new cluster view after coordinator leave
> -----------------------------------------------------------------
>
> Key: ISPN-9588
> URL: https://issues.jboss.org/browse/ISPN-9588
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 9.4.0.CR3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Attachments: ISPN-9127_ScatteredCrashInSequenceTest-infinispan-core.log.gz, ISPN-9127_ThreeWaySplitAndMergeTest-infinispan-core.log.gz
>
>
> The core test suite normally shuts down cache managers in the reverse order of their start, so that the coordinator stops last. This should be just an optimization, to reduce the number of coordinator changes, and with it the test suite duration and log size.
> Unfortunately, the few tests that stop the coordinator first are routinely failing to stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node doesn't receive the view:
> {noformat}
> 10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to Test-NodeA-8936
> 10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3) Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new coordinator
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[], suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request 140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG, type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=33}
> 10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave request for cache org.infinispan.CONFIG to coordinator
> org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
> {noformat}
> Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the stopped node and fails:
> {noformat}
> 10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to Test-NodeB-39137
> 10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message garbage collection
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume task started, max_suspend_time=220000
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task Test-NodeB-39137::7 started with 2 participants
> 10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
> 10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected 1 merge response(s) in 5153 ms
> 10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing merge view [Test-NodeB-39137|10] (2 members) in 1 coords
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge Test-NodeB-39137::7 took 5155 ms
> 10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 9 months