Dan Berindei updated ISPN-9544:
-------------------------------
Summary: Server cache manager stop/restart breaks cache topology management (was:
Asymmetric ForkChannels break cache topology management)
Server cache manager stop/restart breaks cache topology management
------------------------------------------------------------------
Key: ISPN-9544
URL: https://issues.jboss.org/browse/ISPN-9544
Project: Infinispan
Issue Type: Bug
Components: Server
Affects Versions: 9.4.0.CR3
Reporter: Dan Berindei
Assignee: Dan Berindei
Before {{FORK}} was introduced, {{ClusterTopologyManagerImpl}} and
{{LocalTopologyManagerImpl}} assumed that the coordinator would always reply to other
members' requests. After the introduction of {{FORK}} we added some hacks to work
around the fact that the coordinator may not have a {{ForkChannel}} with our ID
running **yet**, but we still expect the {{FORK}} setup to become symmetric after a
reasonable amount of time.
Stopping a {{FORK}} and starting it again without restarting the underlying channel also
doesn't work, because stopping or starting a {{FORK}} does not trigger a new view. When a
node sends a request to the coordinator and receives back a {{CacheNotFoundResponse}}, it
assumes that the coordinator is leaving and that a new view will follow. If the
{{CacheNotFoundResponse}} was instead a consequence of stopping a single
{{DefaultCacheManager}}/{{ForkChannel}}, that view never arrives and the node waits until
it times out.
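To make the failure mode concrete, here is a minimal, self-contained sketch of the wait. It is a simplified model under stated assumptions, not the actual Infinispan code: {{ViewWaitSketch}}, {{Coordinator}}, {{Member}}, {{queryStatus}} and {{run}} are hypothetical names, the real retry logic lives in {{LocalTopologyManagerImpl}}, and the timeouts are shortened so the example finishes quickly.

```java
import java.util.concurrent.TimeoutException;

// Simplified, hypothetical model of the wait described above; not Infinispan code.
final class ViewWaitSketch {
    enum Response { OK, CACHE_NOT_FOUND }

    /** Stand-in for the coordinator: it answers only while its fork channel is running. */
    static final class Coordinator {
        private volatile boolean forkRunning;
        Coordinator(boolean forkRunning) { this.forkRunning = forkRunning; }
        void setForkRunning(boolean running) { forkRunning = running; }
        Response handleStatusRequest() {
            return forkRunning ? Response.OK : Response.CACHE_NOT_FOUND;
        }
    }

    /** Stand-in for a non-coordinator member that tracks the current view id. */
    static final class Member {
        private int viewId;
        Member(int viewId) { this.viewId = viewId; }

        synchronized int currentViewId() { return viewId; }

        synchronized void installView(int newViewId) {
            viewId = newViewId;
            notifyAll();
        }

        /** Blocks until a view with id >= expected is installed, or the timeout expires. */
        synchronized void waitForView(int expected, long timeoutMillis)
                throws TimeoutException, InterruptedException {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (viewId < expected) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    throw new TimeoutException(
                            "Timed out waiting for view " + expected + ", current view is " + viewId);
                }
                wait(remainingMillis);
            }
        }

        Response queryStatus(Coordinator coordinator, long timeoutMillis)
                throws TimeoutException, InterruptedException {
            int viewAtSend = currentViewId();
            Response response = coordinator.handleStatusRequest();
            if (response == Response.CACHE_NOT_FOUND) {
                // Assumption being modelled: "cache not found" means the coordinator
                // left, so a newer view must be on its way. Wait for it, then retry once.
                waitForView(viewAtSend + 1, timeoutMillis);
                response = coordinator.handleStatusRequest();
            }
            return response;
        }
    }

    /** Runs one scenario starting from view 6 and returns the response or the timeout message. */
    static String run(boolean forkRunning, boolean newViewArrives) {
        Coordinator coordinator = new Coordinator(forkRunning);
        Member member = new Member(6);
        if (newViewArrives) {
            // Models a real membership change: the fork comes back and view 7 is installed.
            new Thread(() -> {
                coordinator.setForkRunning(true);
                member.installView(7);
            }).start();
        }
        try {
            return member.queryStatus(coordinator, newViewArrives ? 5_000 : 100).toString();
        } catch (TimeoutException e) {
            return e.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(run(true, false));   // fork running: OK
        System.out.println(run(false, false));  // fork stopped, no new view: times out
        System.out.println(run(false, true));   // new view arrives: OK after the retry
    }
}
```

The second scenario is the one this issue is about: the fork on the coordinator is stopped, no membership change happens, and the member can only give up once its timeout expires.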
We don't restart individual cache managers in our own tests, but the Spark connector test
suite does, and it sometimes fails as a result:
{noformat}
2018-09-26 21:18:03,035 INFO [org.infinispan.CLUSTER] (MSC service thread 1-4) ISPN000094: Received new cluster view for channel cluster: [server2|6] (3) [server2, server0, server1]
2018-09-26 21:18:05,778 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) server1 sending request 37 to server2: CacheTopologyControlCommand{cache=org.infinispan.spark.suites.DistributedSuite, type=POLICY_GET_STATUS, sender=server1, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=6}
2018-09-26 21:18:05,795 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (jgroups-4,server1) server1 received response for request 37 from server2: CacheNotFoundResponse
2018-09-26 21:18:05,798 TRACE [org.infinispan.topology.LocalTopologyManagerImpl] (MSC service thread 1-5) Coordinator left the cluster while querying rebalancing status, retrying
2018-09-26 21:18:05,823 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) server1 sending request 41 to server2: CacheTopologyControlCommand{cache=org.infinispan.spark.suites.DistributedSuite, type=POLICY_GET_STATUS, sender=server1, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=6}
2018-09-26 21:18:05,841 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (jgroups-19,server1) server1 received response for request 41 from server2: CacheNotFoundResponse
2018-09-26 21:18:05,846 TRACE [org.infinispan.topology.LocalTopologyManagerImpl] (MSC service thread 1-5) Coordinator left the cluster while querying rebalancing status, retrying
2018-09-26 21:18:05,871 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) Waiting for transaction data for view 7, current view is 6
2018-09-26 21:19:05,779 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-5) MSC000001: Failed to start service jboss.datagrid-infinispan.clustered."org.infinispan.spark.suites.DistributedSuite": org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.clustered."org.infinispan.spark.suites.DistributedSuite": Failed to start service
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.execute(ServiceControllerImpl.java:1728)
    at org.jboss.msc.service.ServiceControllerImpl$ControllerTask.run(ServiceControllerImpl.java:1556)
    at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
    at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1364)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000451: Timed out waiting for view 7, current view is 6
    at org.infinispan.topology.LocalTopologyManagerImpl.waitForView(LocalTopologyManagerImpl.java:558)
    at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinatorRetry(LocalTopologyManagerImpl.java:598)
    at org.infinispan.topology.LocalTopologyManagerImpl.isCacheRebalancingEnabled(LocalTopologyManagerImpl.java:580)
    at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:233)
    at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:1056)
    at org.infinispan.cache.impl.AbstractDelegatingCache.start(AbstractDelegatingCache.java:451)
    at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:653)
    at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:598)
    at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:481)
    at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:465)
    at org.infinispan.manager.impl.AbstractDelegatingEmbeddedCacheManager.getCache(AbstractDelegatingEmbeddedCacheManager.java:157)
    at org.infinispan.server.infinispan.SecurityActions.lambda$startCache$4(SecurityActions.java:122)
    at org.infinispan.security.Security.doPrivileged(Security.java:44)
    at org.infinispan.server.infinispan.SecurityActions.doPrivileged(SecurityActions.java:69)
    at org.infinispan.server.infinispan.SecurityActions.startCache(SecurityActions.java:126)
    at org.jboss.as.clustering.infinispan.subsystem.CacheService.start(CacheService.java:87)
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1736)
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.execute(ServiceControllerImpl.java:1698)
    ... 6 more
{noformat}