[JBoss JIRA] (ISPN-9517) State transfer times out if initiated with yet to be verified suspected member and reincarnated member

Tuesday, 18 September 2018

    [
https://issues.jboss.org/browse/ISPN-9517?page=com.atlassian.jira.plugin....
] 

Paul Ferraro edited comment on ISPN-9517 at 9/18/18 7:04 AM:
-------------------------------------------------------------

Reproducer attached.  The test starts channels/cache managers on 2 servers.  1st server
stops its cache after the 2nd server starts and receives state.
Then 2nd server is killed and restarted.  Upon restart, 2nd server (which uses a distinct
logical name) fails to start its cache due to state transfer timeout.

We should expect the state transfer to ultimately abort when the killed server leaves.

was (Author: pferraro):
Reproducer attached.  The test starts channels/cache managers on 2 servers.  1st server
stops its cache after the 2nd server starts and receives state.
Then 2nd server is killed and restarted.  Upon restart, 2nd server (which uses a distinct
logical name) fails to start its cache due to state transfer timeout.

...
 State transfer times out if initiated with yet to be verified
suspected member and reincarnated member

------------------------------------------------------------------------------------------------------

                 Key: ISPN-9517
                 URL: https://issues.jboss.org/browse/ISPN-9517
             Project: Infinispan
          Issue Type: Bug
          Components: State Transfer
    Affects Versions: 9.3.3.Final
            Reporter: Paul Ferraro
            Assignee: Paul Ferraro
         Attachments: Test.java, node-1.zip, node-2.zip

 Here's the scenario:
 1. Cluster contains caches on 2 members, node-1 and node-2
 2. node-2 is killed
 3. node-2 is restarted (using the same physical address)
 4. State transfer initiates; the view contains node-1, the suspected node-2, and the
reincarnated node-2
 5. State transfer times out
 Log of node-1 includes:
 {noformat}
 12:09:51,882 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p14-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 3 from node-2
 	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_181]
 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_181]
 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_181]
 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_181]
 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_181]
 	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_181]
 	Suppressed: org.infinispan.util.logging.TraceException
 		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:75)
 		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:525)
 		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:508)
 		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:321)
 		at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:87)
 		at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:731)
 		at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
 		at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
 		at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_181]
 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_181]
 		at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
 		... 1 more
 {noformat}
 I've attached trace logs from node-1 and node-2.
 Changing ClusterTopologyManagerImpl.confirmMembersAvailable() to use
ResponseMode.SYNCHRONOUS_IGNORE_LEAVERS instead of ResponseMode.SYNCHRONOUS allows state
transfer to complete successfully. 
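
 To illustrate why the response mode matters here, the following is a toy model only; it
is not Infinispan code and does not use the Infinispan API.  The class ResponseModeSketch,
its confirmMembers() helper, the nested Mode enum, and the member labels "node-2 (old)" and
"node-2 (new)" are all hypothetical names invented for this sketch.  It assumes that the
killed incarnation of node-2 never answers and is known to have left, and it does not
distinguish a timeout from any other failure.

```java
import java.util.List;
import java.util.Set;

// Toy model of "wait for a response from every target"; NOT Infinispan's implementation.
public class ResponseModeSketch {

    // Hypothetical enum that only borrows the two constant names mentioned in the report.
    enum Mode { SYNCHRONOUS, SYNCHRONOUS_IGNORE_LEAVERS }

    /**
     * Returns true if the confirmation round can complete, false if the caller
     * would be left waiting on at least one target.
     */
    static boolean confirmMembers(List<String> targets,
                                  Set<String> responders,
                                  Set<String> leavers,
                                  Mode mode) {
        for (String target : targets) {
            if (responders.contains(target)) {
                continue; // this member answered
            }
            if (mode == Mode.SYNCHRONOUS_IGNORE_LEAVERS && leavers.contains(target)) {
                continue; // member is gone; do not wait for it
            }
            return false; // still waiting on this target
        }
        return true;
    }

    public static void main(String[] args) {
        // View as in step 4: node-1, the suspected node-2, and the reincarnated node-2.
        List<String> view = List.of("node-1", "node-2 (old)", "node-2 (new)");
        Set<String> responders = Set.of("node-1", "node-2 (new)");
        Set<String> leavers = Set.of("node-2 (old)");

        for (Mode mode : Mode.values()) {
            boolean ok = confirmMembers(view, responders, leavers, mode);
            System.out.println(mode + " completes: " + ok);
        }
    }
}
```

 In this model the strict mode never completes because the dead incarnation cannot reply,
while the leaver-tolerant mode completes once the two live members have answered.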

--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
