[infinispan-issues] [JBoss JIRA] (ISPN-9544) Server cache manager stop/restart breaks cache topology management
Dan Berindei (JIRA)
issues at jboss.org
Thu Sep 27 12:07:00 EDT 2018
[ https://issues.jboss.org/browse/ISPN-9544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dan Berindei updated ISPN-9544:
-------------------------------
Summary: Server cache manager stop/restart breaks cache topology management (was: Asymmetric ForkChannels break cache topology management)
> Server cache manager stop/restart breaks cache topology management
> ------------------------------------------------------------------
>
> Key: ISPN-9544
> URL: https://issues.jboss.org/browse/ISPN-9544
> Project: Infinispan
> Issue Type: Bug
> Components: Server
> Affects Versions: 9.4.0.CR3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
>
> Before {{FORK}} was introduced, {{ClusterTopologyManagerImpl}} and {{LocalTopologyManagerImpl}} assumed that the coordinator would always reply to other members' requests. After the introduction of {{FORK}} we added some hacks to work around the fact that the coordinator may not have a {{ForkChannel}} with our ID running **yet**, but we still expect the {{FORK}} setup to become symmetric after a reasonable amount of time.
> Stopping a {{FORK}} and starting it again without restarting the underlying channel also doesn't work, because a {{FORK}} start/stop does not trigger a new view. When a node sends a request to the coordinator and receives back a {{CacheNotFoundResponse}}, it assumes that it will also receive a new view, but if the {{CacheNotFoundResponse}} was a consequence of stopping a single {{DefaultCacheManager}}/{{ForkChannel}}, that view will never arrive.
> We don't restart individual cache managers in our own tests, but the Spark connector test suite does, and it sometimes fails because of this:
> {noformat}
> 2018-09-26 21:18:03,035 INFO [org.infinispan.CLUSTER] (MSC service thread 1-4) ISPN000094: Received new cluster view for channel cluster: [server2|6] (3) [server2, server0, server1]
> 2018-09-26 21:18:05,778 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) server1 sending request 37 to server2: CacheTopologyControlCommand{cache=org.infinispan.spark.suites.DistributedSuite, type=POLICY_GET_STATUS, sender=server1, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=6}
> 2018-09-26 21:18:05,795 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (jgroups-4,server1) server1 received response for request 37 from server2: CacheNotFoundResponse
> 2018-09-26 21:18:05,798 TRACE [org.infinispan.topology.LocalTopologyManagerImpl] (MSC service thread 1-5) Coordinator left the cluster while querying rebalancing status, retrying
> 2018-09-26 21:18:05,823 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) server1 sending request 41 to server2: CacheTopologyControlCommand{cache=org.infinispan.spark.suites.DistributedSuite, type=POLICY_GET_STATUS, sender=server1, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=6}
> 2018-09-26 21:18:05,841 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (jgroups-19,server1) server1 received response for request 41 from server2: CacheNotFoundResponse
> 2018-09-26 21:18:05,846 TRACE [org.infinispan.topology.LocalTopologyManagerImpl] (MSC service thread 1-5) Coordinator left the cluster while querying rebalancing status, retrying
> 2018-09-26 21:18:05,871 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) Waiting for transaction data for view 7, current view is 6
> 2018-09-26 21:19:05,779 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-5) MSC000001: Failed to start service jboss.datagrid-infinispan.clustered."org.infinispan.spark.suites.DistributedSuite": org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.clustered."org.infinispan.spark.suites.DistributedSuite": Failed to start service
> at org.jboss.msc.service.ServiceControllerImpl$StartTask.execute(ServiceControllerImpl.java:1728)
> at org.jboss.msc.service.ServiceControllerImpl$ControllerTask.run(ServiceControllerImpl.java:1556)
> at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
> at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985)
> at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487)
> at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1364)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000451: Timed out waiting for view 7, current view is 6
> at org.infinispan.topology.LocalTopologyManagerImpl.waitForView(LocalTopologyManagerImpl.java:558)
> at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinatorRetry(LocalTopologyManagerImpl.java:598)
> at org.infinispan.topology.LocalTopologyManagerImpl.isCacheRebalancingEnabled(LocalTopologyManagerImpl.java:580)
> at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:233)
> at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:1056)
> at org.infinispan.cache.impl.AbstractDelegatingCache.start(AbstractDelegatingCache.java:451)
> at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:653)
> at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:598)
> at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:481)
> at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:465)
> at org.infinispan.manager.impl.AbstractDelegatingEmbeddedCacheManager.getCache(AbstractDelegatingEmbeddedCacheManager.java:157)
> at org.infinispan.server.infinispan.SecurityActions.lambda$startCache$4(SecurityActions.java:122)
> at org.infinispan.security.Security.doPrivileged(Security.java:44)
> at org.infinispan.server.infinispan.SecurityActions.doPrivileged(SecurityActions.java:69)
> at org.infinispan.server.infinispan.SecurityActions.startCache(SecurityActions.java:126)
> at org.jboss.as.clustering.infinispan.subsystem.CacheService.start(CacheService.java:87)
> at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1736)
> at org.jboss.msc.service.ServiceControllerImpl$StartTask.execute(ServiceControllerImpl.java:1698)
> ... 6 more
> {noformat}
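
The failure mode described above can be modelled with a small, self-contained sketch. This is not Infinispan code: ViewWaitSketch, ViewTracker, Response and ViewTimeoutException are hypothetical names invented for illustration, and the retry logic is a simplification of what LocalTopologyManagerImpl.executeOnCoordinatorRetry appears to do in the log. It assumes that a CACHE_NOT_FOUND reply is always treated as "the coordinator left, a new view is coming", which is exactly the assumption that breaks when only a ForkChannel was stopped.

```java
import java.util.function.Supplier;

// Hypothetical model of the retry loop; none of these types exist in Infinispan.
class ViewWaitSketch {
    enum Response { SUCCESS, CACHE_NOT_FOUND }

    // Unchecked stand-in for the timeout reported as ISPN000451 in the log.
    static final class ViewTimeoutException extends RuntimeException {
        ViewTimeoutException(String message) { super(message); }
    }

    // Tracks the view id of the underlying channel. A fork start/stop never calls installView.
    static final class ViewTracker {
        private int viewId;

        ViewTracker(int initialViewId) { this.viewId = initialViewId; }

        synchronized int current() { return viewId; }

        synchronized void installView(int newViewId) {
            viewId = newViewId;
            notifyAll();
        }

        // Blocks until the view id reaches 'expected' or the timeout elapses.
        synchronized void waitForView(int expected, long timeoutMillis) {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (viewId < expected) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    throw new ViewTimeoutException(
                        "Timed out waiting for view " + expected + ", current view is " + viewId);
                }
                try {
                    wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for view", e);
                }
            }
        }
    }

    // Sends the request; on CACHE_NOT_FOUND, waits for the next view before retrying.
    static Response executeOnCoordinatorRetry(Supplier<Response> coordinator,
                                              ViewTracker views, long timeoutMillis) {
        while (true) {
            int viewBefore = views.current();
            Response response = coordinator.get();
            if (response != Response.CACHE_NOT_FOUND) {
                return response;
            }
            // Assumption under test: a missing cache means the coordinator left,
            // so view viewBefore + 1 must be on its way.
            views.waitForView(viewBefore + 1, timeoutMillis);
        }
    }

    public static void main(String[] args) {
        ViewTracker views = new ViewTracker(6);
        // The coordinator's fork is stopped, but the main channel view never changes.
        Supplier<Response> stoppedFork = () -> Response.CACHE_NOT_FOUND;
        try {
            executeOnCoordinatorRetry(stoppedFork, views, 100);
        } catch (ViewTimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the caller only recovers if something calls installView with a higher view id. Stopping and restarting a single fork never does that, so the wait can only end in a timeout, as in the stack trace above.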
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)