Dan Berindei reopened ISPN-4996:
--------------------------------
Still a problem in 12.0.x.
Can also be reproduced by creating a new cache via the admin API. This inserts the cache
configuration in the {{org.infinispan.CONFIG}} cache, and listeners on all the nodes
(including zero-capacity nodes) race to create the cache locally. If a zero-capacity node
"wins" the race and sends a {{TopologyJoinCommand}} to the coordinator before
the others, the creation of the initial consistent hash will fail with the same
{{IllegalArgumentException}}.
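For reference, a minimal sketch of that reproducer, assuming the Infinispan 12 embedded API (the cache name and configuration values here are only illustrative):
{code:java}
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

// Sketch only: a zero-capacity node creates a cache through the admin API.
// The configuration is written into the internal org.infinispan.CONFIG cache and
// every node's listener races to create the cache locally; if this node's join
// reaches the coordinator first, the IllegalArgumentException above is triggered.
GlobalConfiguration zeroCapacityGlobal = new GlobalConfigurationBuilder()
      .clusteredDefault()
      .zeroCapacityNode(true)   // behaves like capacityFactor = 0 for every cache
      .build();
DefaultCacheManager manager = new DefaultCacheManager(zeroCapacityGlobal);

Configuration distConfig = new ConfigurationBuilder()
      .clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(1)
      .build();

Cache<String, String> cache =
      manager.administration().getOrCreateCache("shared", distConfig);
{code}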
Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0
------------------------------------------------------------------------------
Key: ISPN-4996
URL: https://issues.redhat.com/browse/ISPN-4996
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 7.0.2.Final
Reporter: Enrico Olivelli
Assignee: Dan Berindei
Priority: Blocker
I have only one DIST_SYNC cache. Most of the JVMs in the cluster are configured with
capacityFactor=0 (like the {{distributed.localstorage=false}} property of Coherence) and some
nodes are configured with capacityFactor>0 (for instance 1000). We are talking about 100
nodes with capacityFactor=0 and 4 nodes of the other kind; the whole cluster is inside one
single "site/rack". Partition handling is off, numOwners is 1.
When all the nodes with capacityFactor > 0 are down, the cluster comes to a degraded
state. The problem is that even when the nodes with capacityFactor > 0 are up again, the
cluster does not recover; a full restart is needed.
If I enable partition handling, AvailabilityExceptions start to be thrown, and I think that
is the expected behaviour (see the "Infinispan User Guide").
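For completeness, a sketch of how partition handling was enabled in that test, mirroring the builder chain of the configuration block further down (imports and surrounding setup omitted as in that block):
{code:java}
// Sketch only: same cache configuration as below, but with partition handling
// enabled, so writes during the degraded state fail fast with an
// AvailabilityException instead of timing out.
Configuration withPartitionHandling = new ConfigurationBuilder()
      .clustering().cacheMode(CacheMode.DIST_SYNC)
      .hash().numOwners(1).capacityFactor(0f)   // a "simple" node
      .partitionHandling().enabled(true)
      .build();
{code}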
I think this is the problem and it is a bug:
{noformat}
14/11/17 09:27:25 WARN topology.CacheTopologyControlCommand: ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=shared, type=JOIN, sender=testserver1@xxxxxxx-22311, site-id=xxx, rack-id=xxx, machine-id=24 bytes, joinInfo=CacheJoinInfo{consistentHashFactory=org.infinispan.distribution.ch.impl.TopologyAwareConsistentHashFactory@78b791ef, hashFunction=MurmurHash3, numSegments=60, numOwners=1, timeout=120000, totalOrder=false, distributed=true}, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, throwable=null, viewId=3}
java.lang.IllegalArgumentException: A cache topology's pending consistent hash must contain all the current consistent hash's members
	at org.infinispan.topology.CacheTopology.<init>(CacheTopology.java:48)
	at org.infinispan.topology.CacheTopology.<init>(CacheTopology.java:43)
	at org.infinispan.topology.ClusterCacheStatus.startQueuedRebalance(ClusterCacheStatus.java:631)
	at org.infinispan.topology.ClusterCacheStatus.queueRebalance(ClusterCacheStatus.java:85)
	at org.infinispan.partionhandling.impl.PreferAvailabilityStrategy.onJoin(PreferAvailabilityStrategy.java:22)
	at org.infinispan.topology.ClusterCacheStatus.doJoin(ClusterCacheStatus.java:540)
	at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:123)
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:158)
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:140)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:278)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}
After that error every "put" results in:
{noformat}
14/11/17 09:27:27 ERROR interceptors.InvocationContextInterceptor: ISPN000136: Execution error
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for topology 1
	at org.infinispan.statetransfer.StateTransferLockImpl.waitForTransactionData(StateTransferLockImpl.java:93)
	at org.infinispan.interceptors.base.BaseStateTransferInterceptor.waitForTransactionData(BaseStateTransferInterceptor.java:96)
	at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:188)
	at org.infinispan.statetransfer.StateTransferInterceptor.visitPutKeyValueCommand(StateTransferInterceptor.java:95)
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
	at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98)
	at org.infinispan.interceptors.CacheMgmtInterceptor.updateStoreStatistics(CacheMgmtInterceptor.java:148)
	at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:134)
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
	at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98)
	at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:102)
	at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:71)
	at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:35)
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
	at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:333)
	at org.infinispan.cache.impl.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1576)
	at org.infinispan.cache.impl.CacheImpl.putInternal(CacheImpl.java:1054)
	at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1046)
	at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1646)
	at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:245)
{noformat}
This is the actual configuration:
{code:java}
GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
    .globalJmxStatistics()
        .allowDuplicateDomains(true)
        .cacheManagerName(instanceName)
    .transport()
        .defaultTransport()
        .clusterName(clustername)
        // UDP stack for my cluster, approx. 100 machines
        .addProperty("configurationFile", configurationFile)
        .machineId(instanceName)
        .siteId("site1")
        .rackId("rack1")
        .nodeName(serviceName + "@" + instanceName)
    .remoteCommandThreadPool().threadPoolFactory(CachedThreadPoolExecutorFactory.create())
    .build();

Configuration wildcard = new ConfigurationBuilder()
    .locking().lockAcquisitionTimeout(lockAcquisitionTimeout)
        .concurrencyLevel(10000).isolationLevel(IsolationLevel.READ_COMMITTED).useLockStriping(true)
    .clustering()
        .cacheMode(CacheMode.DIST_SYNC)
        .l1().lifespan(l1ttl)
        .hash().numOwners(numOwners).capacityFactor(capacityFactor)
        .partitionHandling().enabled(false)
        .stateTransfer().awaitInitialTransfer(false).timeout(initialTransferTimeout).fetchInMemoryState(false)
    .storeAsBinary().enabled(true).storeKeysAsBinary(false).storeValuesAsBinary(true)
    .jmxStatistics().enable()
    .unsafe().unreliableReturnValues(true)
    .build();
{code}
One workaround is to set capacityFactor = 1 instead of 0, but I do not want the
"simple" nodes (with less RAM) to become key owners.
For me this is a showstopper problem.