[https://issues.jboss.org/browse/ISPN-9512?page=com.atlassian.jira.plugin....]
Dan Berindei commented on ISPN-9512:
------------------------------------
Oops, I reduced the number of transport threads and the size of the queue in ISPN-9067
(https://github.com/infinispan/infinispan/commit/ef9735d778a9e687f272b8db3...),
and that's what's causing the tests to hang.
I see 2 problems that we should tackle here:
# {{ClusterTopologyManagerImpl.executeOnClusterAsync()}} invokes the command on the local
node before sending it to the other nodes. This probably made sense back when the remote
invocation could only be synchronous, but now it would be better to send the command
first, then execute on the local node, and then wait for responses.
# {{ClusterCacheStatus.doMergePartitions()}} should not hold the {{ClusterCacheStatus}}
monitor while broadcasting the rebalance command, because there's always a chance that
the local invocation will happen on the caller thread.
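The ordering in the first point can be sketched as below. This is only an illustration of "send first, execute locally second, wait last"; {{sendAsync()}} and {{executeLocally()}} are hypothetical stand-ins, not the real Infinispan API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class BroadcastOrder {
    // Stand-in for a non-blocking remote invocation (hypothetical).
    static CompletableFuture<String> sendAsync(String node, String command) {
        return CompletableFuture.completedFuture(node + ":" + command);
    }

    // Stand-in for the local invocation, which may run on the caller thread.
    static CompletableFuture<String> executeLocally(String command) {
        return CompletableFuture.completedFuture("local:" + command);
    }

    static CompletableFuture<Void> executeOnClusterAsync(List<String> remotes, String command) {
        // 1. Send to the other nodes first, so they can make progress even if
        //    the local invocation blocks or gets rejected by a full thread pool.
        List<CompletableFuture<String>> futures = remotes.stream()
                .map(node -> sendAsync(node, command))
                .collect(Collectors.toCollection(ArrayList::new));
        // 2. Only then execute on the local node.
        futures.add(executeLocally(command));
        // 3. Finally wait for all responses together.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
    }

    public static void main(String[] args) {
        executeOnClusterAsync(List.of("NodeA", "NodeB"), "REBALANCE_START").join();
        System.out.println("done");
    }
}
```

The point is that the broadcast is no longer gated on the local invocation completing (or even starting).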
Going further, we need to move away from blocking RPCs both in
{{ClusterTopologyManagerImpl}} (ISPN-8955) and in {{StateConsumerImpl}}, and also think
about unifying the thread pools while using {{LimitedExecutor}} to cap concurrency,
e.g. when applying state.
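The idea behind that cap can be sketched generically with a semaphore in front of a shared pool. This is an illustration of the concept only, not the actual {{org.infinispan.executors.LimitedExecutor}} implementation (which queues tasks rather than blocking the submitter):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedExecutor {
    private final Executor delegate;
    private final Semaphore permits;

    public BoundedExecutor(Executor delegate, int maxConcurrent) {
        this.delegate = delegate;
        this.permits = new Semaphore(maxConcurrent);
    }

    // Blocks the submitter until one of the maxConcurrent slots frees up,
    // then hands the task to the shared pool. The permit is released only
    // when the task finishes, bounding how many run at once.
    public void execute(Runnable task) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException("interrupted", e);
        }
        delegate.execute(() -> {
            try {
                task.run();
            } finally {
                permits.release();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService shared = Executors.newFixedThreadPool(8);
        BoundedExecutor stateApply = new BoundedExecutor(shared, 2); // at most 2 tasks at once
        AtomicInteger running = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        for (int i = 0; i < 20; i++) {
            stateApply.execute(() -> {
                maxSeen.accumulateAndGet(running.incrementAndGet(), Math::max);
                try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                running.decrementAndGet();
            });
        }
        shared.shutdown();
        shared.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("max concurrent: " + maxSeen.get()); // never exceeds 2
    }
}
```

With a shared pool, each subsystem gets its own small concurrency budget instead of its own dedicated (and often idle) threads.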
*TxPartitionAndMerge*Test tests hang during teardown
----------------------------------------------------
Key: ISPN-9512
URL: https://issues.jboss.org/browse/ISPN-9512
Project: Infinispan
Issue Type: Bug
Components: Test Suite - Core
Reporter: Dan Berindei
Assignee: Dan Berindei
Labels: testsuite_stability
Fix For: 9.4.0.CR3
Attachments:
master_20180913-1119_PessimisticTxPartitionAndMergeDuringRollbackTest-infinispan-core.log.gz,
threaddump-org_infinispan_partitionhandling_PessimisticTxPartitionAndMergeDuringRollbackTest_clearContent-2018-09-13-13828.log
Not sure what changed recently, but the thread dumps show a state transfer executor
thread blocked waiting for a clustered listeners response. The stack includes two
instances of {{ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution()}}, which suggests
that at some point all the state transfer executor threads (6) and async transport threads
(4) were busy, and the transport thread pool queue (10) was also full.
{noformat}
"stateTransferExecutor-thread-PessimisticTxPartitionAndMergeDuringRollbackTest-NodeC-p57758-t1" #192601 daemon prio=5 os_prio=0 tid=0x00007f7094031800 nid=0x5b27 waiting on condition [0x00007f70190ce000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00000000d470b0f8> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
	at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:93)
	at org.infinispan.remoting.rpc.RpcManagerImpl.blocking(RpcManagerImpl.java:262)
	at org.infinispan.statetransfer.StateConsumerImpl.getClusterListeners(StateConsumerImpl.java:895)
	at org.infinispan.statetransfer.StateConsumerImpl.fetchClusterListeners(StateConsumerImpl.java:453)
	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:309)
	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:197)
	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:54)
	at org.infinispan.statetransfer.StateTransferManagerImpl$1.rebalance(StateTransferManagerImpl.java:117)
	at org.infinispan.topology.LocalTopologyManagerImpl.doHandleRebalance(LocalTopologyManagerImpl.java:517)
	- locked <0x00000000cc304f88> (a org.infinispan.topology.LocalCacheStatus)
	at org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleRebalance$3(LocalTopologyManagerImpl.java:475)
	at org.infinispan.topology.LocalTopologyManagerImpl$$Lambda$429/1368424830.run(Unknown Source)
	at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
	at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
	at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
	at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2038)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at org.infinispan.executors.LazyInitializingExecutorService.execute(LazyInitializingExecutorService.java:121)
	at org.infinispan.executors.LimitedExecutor.tryExecute(LimitedExecutor.java:151)
	at org.infinispan.executors.LimitedExecutor.executeInternal(LimitedExecutor.java:118)
	at org.infinispan.executors.LimitedExecutor.execute(LimitedExecutor.java:108)
	at org.infinispan.topology.LocalTopologyManagerImpl.handleRebalance(LocalTopologyManagerImpl.java:473)
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:199)
	at org.infinispan.topology.CacheTopologyControlCommand.invokeAsync(CacheTopologyControlCommand.java:160)
	at org.infinispan.commands.ReplicableCommand.invoke(ReplicableCommand.java:44)
	at org.infinispan.topology.ClusterTopologyManagerImpl.lambda$executeOnClusterAsync$5(ClusterTopologyManagerImpl.java:600)
	at org.infinispan.topology.ClusterTopologyManagerImpl$$Lambda$304/909965247.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2038)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
	at org.infinispan.executors.LazyInitializingExecutorService.submit(LazyInitializingExecutorService.java:91)
	at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterAsync(ClusterTopologyManagerImpl.java:596)
	at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastRebalanceStart(ClusterTopologyManagerImpl.java:437)
	at org.infinispan.topology.ClusterCacheStatus.startQueuedRebalance(ClusterCacheStatus.java:903)
	- locked <0x00000000cc305138> (a org.infinispan.topology.ClusterCacheStatus)
	at org.infinispan.topology.ClusterCacheStatus.queueRebalance(ClusterCacheStatus.java:140)
	- locked <0x00000000cc305138> (a org.infinispan.topology.ClusterCacheStatus)
	at org.infinispan.partitionhandling.impl.PreferConsistencyStrategy.updateMembersAndRebalance(PreferConsistencyStrategy.java:299)
	at org.infinispan.partitionhandling.impl.PreferConsistencyStrategy.onPartitionMerge(PreferConsistencyStrategy.java:245)
	at org.infinispan.topology.ClusterCacheStatus.doMergePartitions(ClusterCacheStatus.java:642)
	- locked <0x00000000cc305138> (a org.infinispan.topology.ClusterCacheStatus)
	at org.infinispan.topology.ClusterTopologyManagerImpl.lambda$recoverClusterStatus$4(ClusterTopologyManagerImpl.java:494)
	at org.infinispan.topology.ClusterTopologyManagerImpl$$Lambda$578/46555845.run(Unknown Source)
	at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
	at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
	at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}
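The nested {{rejectedExecution()}} frames in the dump can be reproduced in isolation: once a pool's workers and queue are both full, {{CallerRunsPolicy}} runs the next task synchronously on the submitting thread, stacking the task's frames on top of the caller's. A minimal demonstration (not Infinispan's configuration, just the JDK mechanism):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {
    static boolean rejectedTaskRunsOnCaller() throws InterruptedException {
        // 1 worker, 1-slot queue, caller-runs rejection policy.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1), new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        pool.execute(() -> { try { release.await(); } catch (InterruptedException ignored) {} }); // occupies the only worker
        pool.execute(() -> {});                                       // fills the one-slot queue
        Thread caller = Thread.currentThread();
        boolean[] ranOnCaller = {false};
        // Rejected by the saturated pool, so CallerRunsPolicy runs it right here.
        pool.execute(() -> ranOnCaller[0] = Thread.currentThread() == caller);
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return ranOnCaller[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rejected task ran on caller: " + rejectedTaskRunsOnCaller());
        // prints: rejected task ran on caller: true
    }
}
```

In the hang above this happened twice in a row, so a state transfer executor thread ended up directly running topology commands that were supposed to go to the transport pool, and then blocked waiting for a response that needed those same saturated threads.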
All partition and merge tests seem to be affected:
PessimisticTxPartitionAndMergeDuringPrepareTest,
PessimisticTxPartitionAndMergeDuringRollbackTest,
PessimisticTxPartitionAndMergeDuringRuntimeTest,
OptimisticTxPartitionAndMergeDuringCommitTest,
OptimisticTxPartitionAndMergeDuringPrepareTest, and
OptimisticTxPartitionAndMergeDuringRollbackTest.