[infinispan-issues] [JBoss JIRA] (ISPN-2415) Initial state transfer timed out - Fail to start 2 nodes after they were killed inside 8-node cluster

Wed Oct 17 11:17:01 EDT 2012

     [ https://issues.jboss.org/browse/ISPN-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Berindei reassigned ISPN-2415:
----------------------------------

    Assignee: Dan Berindei  (was: Mircea Markus)

    
> Initial state transfer timed out - Fail to start 2 nodes after they were killed inside 8-node cluster
> -----------------------------------------------------------------------------------------------------
>
>                 Key: ISPN-2415
>                 URL: https://issues.jboss.org/browse/ISPN-2415
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.2.0.Beta2
>            Reporter: Martin Gencur
>            Assignee: Dan Berindei
>            Priority: Critical
>
> We start 8 nodes, keep them under load, then kill 2 nodes and later start them again. However, when we try to start them, the following exception is thrown and the test fails:
> {code}
> 10:47:52,830 ERROR [org.radargun.stages.helpers.StartHelper] (pool-1-thread-1) Issues while instantiating/starting cache wrapper
> org.infinispan.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.InterruptedException on object of type StateTransferManagerImpl
> 	at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
> 	at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:879)
> 	at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:650)
> 	at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:639)
> 	at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:542)
> 	at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:198)
> 	at org.infinispan.CacheImpl.start(CacheImpl.java:517)
> 	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:689)
> 	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:652)
> 	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:548)
> 	at org.radargun.cachewrappers.InfinispanWrapper.setUpCache(InfinispanWrapper.java:125)
> 	at org.radargun.cachewrappers.InfinispanWrapper.setUp(InfinispanWrapper.java:74)
> 	at org.radargun.stages.helpers.StartHelper.start(StartHelper.java:63)
> 	at org.radargun.stages.StartClusterStage.executeOnSlave(StartClusterStage.java:47)
> 	at org.radargun.Slave$2.run(Slave.java:103)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: org.infinispan.CacheException: Initial state transfer timed out for cache testCache on edg-perf02-25863
> 	at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:202)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:203)
> 	... 20 more
> {code}
> The problem happens on nodes edg-perf02 and edg-perf03 in this Jenkins run: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/ispn-52-radargun-resilience-8-6/29/
> Debug logs can be found on those machines.
> A few more hints:
> - there are individual exceptions/errors extracted from the log - available in the "Build artifacts"
> - this job has passed only once; every other run has failed
> - the state transfer timeout is left at its default (4 min?)
> - version of Infinispan: 5.2.0-SNAPSHOT, HEAD=d4581e570 - ISPN-2387 ClusteredGetCommand should not be a VisitableCommand
> Infinispan configuration:
> {code:xml}
> <infinispan
>       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>       xsi:schemaLocation="urn:infinispan:config:5.2 http://www.infinispan.org/schemas/infinispan-config-5.2.xsd"
>       xmlns="urn:infinispan:config:5.2">
>    <global>
>       <globalJmxStatistics
>             enabled="true"
>             jmxDomain="jboss.infinispan" 
>             cacheManagerName="default"/>
>       <transport clusterName="default" distributedSyncTimeout="600000">
>          <properties>
>             <property name="configurationFile" value="jgroups-udp-custom.xml" />
>          </properties>
>       </transport>
>    </global>
>    <default>
>       <transaction
>           transactionManagerLookupClass="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"
>           transactionMode="TRANSACTIONAL" />
>       <jmxStatistics enabled="true"/>
>       <clustering mode="distribution">
>          <l1 enabled="false" />
>          <hash numOwners="3" numSegments="512" />
>          <sync replTimeout="60000"/>
>       </clustering>
>       <locking lockAcquisitionTimeout="3000" concurrencyLevel="1000" />
>    </default>
>    
>    <namedCache name="testCache" />
>    <namedCache name="memcachedCache" />
> </infinispan>
> {code}
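> The configuration above sets no state transfer timeout, so the default applies. As a minimal sketch only, and assuming the 5.2 schema accepts a stateTransfer element with a timeout attribute (in milliseconds) inside clustering, the timeout could be made explicit as shown here. The value 240000 simply mirrors the unconfirmed 4-minute guess above and is not a recommended setting:
> {code:xml}
> <clustering mode="distribution">
>    <l1 enabled="false" />
>    <hash numOwners="3" numSegments="512" />
>    <sync replTimeout="60000"/>
>    <!-- Assumption: element and attribute names as in the 5.2 schema; value is illustrative -->
>    <stateTransfer timeout="240000" />
> </clustering>
> {code}
> Setting the value explicitly would at least remove the uncertainty about which timeout is in effect when reproducing the failure.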
> Test scenario (description of RadarGun's job):
> {code:xml}
> <bench-config>
>    <master bindAddress="${127.0.0.1:master.address}" port="${2103:master.port}" />
>    <benchmark initSize="${8:slaves}" maxSize="${8:slaves}" increment="1">
>       <DestroyWrapper runOnAllSlaves="true" />
>       <StartCluster
>          staggerSlaveStartup="true"
>          delayAfterFirstSlaveStarts="5000"
>          delayBetweenStartingSlaves="500" />
>       <ClusterValidation
>          partialReplication="false" />
>       <StartBackgroundStats
>          numThreads="10"
>          numEntries="${1000:numEntries}"
>          entrySize="1024"
>          puts="1"
>          gets="2"
>          statsIterationDuration="${1000:statsIterationDuration}"
>          delayBetweenRequests="100"
>          transactionSize="${30:transactionSize}"
>          startStressors="true" />
>       <!-- Synchronously start stat threads -->
>       <StartBackgroundStats
>          startStats="true" />
>       <Sleep
>          time="120000" />
>       <Kill
>          slaves="1,2" />
>       <Sleep
>          time="120000" />
>       <StartCluster
>          slaves="1,2"
>          staggerSlaveStartup="false" />
>       <Sleep
>          time="120000" />
>       <StopBackgroundStats />
>       <ReportBackgroundStats />
>    </benchmark>
>    <products>
>       <infinispan52>
>           <config name="distributed-udp-numowners-3.xml" cache="testCache"/>
>       </infinispan52>
>    </products>
>    <reports />
> </bench-config>
> {code} 
> If any further information is needed, let me know.
