]
Dan Berindei commented on ISPN-9988:
------------------------------------
{{StateConsumerImpl}} can now leak the exclusive topology lock even in a dist/repl cache,
because we have some new code executed while holding the lock:
{code}
CompletionStages.join(persistenceManager.addSegments(newWriteSegments));
{code}
Sometimes it causes tests to hang during cluster shutdown, sometimes it shows only as a
thread leak:
{noformat}
org.infinispan.commons.test.ThreadLeakChecker$LeakException: Leaked thread:
transport-thread-ClusteredListenerJoinsTest-NodeA-p42036-t1 <<
testng-ClusteredListenerJoinsTest <<
org.infinispan.notifications.cachelistener.cluster.ClusteredListenerJoinsTest
at java.base(a)11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
at
java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
at
java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
at
java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
at
java.base@11.0.3/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
at
app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:45)
at
app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:307)
at
app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:106)
at
app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:210)
at
app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:66)
at
app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:122)
at
app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:365)
at
app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:286)
at
app//org.infinispan.topology.LocalTopologyManagerImpl$$Lambda$1134/0x0000000100685840.run(Unknown
Source)
at app//org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
at app//org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
at app//org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
at
java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
{noformat}
I believe both the scattered cache-specific cancelling of inbound transfers and the adding
of segments can be done **before** acquiring the exclusive topology lock, minimizing the
lock duration so that we don't need a {{CompletionStage}}-based API for it.
ScatteredStateConsumerImpl can leak the exclusive topology lock
---------------------------------------------------------------
Key: ISPN-9988
URL:
https://issues.jboss.org/browse/ISPN-9988
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 9.4.7.Final, 10.0.0.Beta1
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Major
Fix For: 10.1.0.Final
When an exception happens in {{ScatteredStateConsumerImpl.beforeTopologyInstalled}}, the
exclusive topology lock is not released in {{StateConsumerImpl.onTopologyUpdate}}:
{noformat}
15:21:54,783 ERROR
(transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t5:[Topology-scattered])
[LocalTopologyManagerImpl] ISPN000230: Failed to start rebalance for cache scattered
java.lang.IllegalArgumentException: The task is already cancelled.
at
org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:172)
~[classes/:?]
at
org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:959)
~[classes/:?]
at
org.infinispan.scattered.impl.ScatteredStateConsumerImpl.beforeTopologyInstalled(ScatteredStateConsumerImpl.java:115)
~[classes/:?]
at
org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:292)
~[classes/:?]
at
org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102)
~[classes/:?]
at
org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200)
~[classes/:?]
{noformat}
Because the exclusive topology lock is not released, threads that try to apply a new
topology update block forever. This causes random failures with the ISPN-9863 thread leak
checker:
{noformat}
15:26:25,922 WARN (testng-RehashClusterPublisherManagerTest:[]) [ThreadLeakChecker]
Possible leaked thread:
"transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3" daemon
prio=5 tid=0x236fd nid=NA waiting
java.lang.Thread.State: WAITING
java.base(a)11/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
java.base@11/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:42)
app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:291)
app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102)
app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200)
app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:57)
app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:113)
app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:353)
app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:275)
15:26:25,923 ERROR (testng-RehashClusterPublisherManagerTest:[]) [TestSuiteProgress] Test
configuration failed:
org.infinispan.reactive.publisher.impl.RehashClusterPublisherManagerTest.testClassFinished
java.lang.AssertionError: Leaked threads:
{transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3: possible sources
[org.infinispan.functional.FunctionalScatteredInMemoryTest[bias=ON_WRITE],
org.infinispan.statetransfer.ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false],
org.infinispan.functional.FunctionalCachestoreTest[passivation=true],
org.infinispan.functional.distribution.rehash.FunctionalNonTxBackupOwnerBecomingPrimaryOwnerTest,
org.infinispan.functional.distribution.rehash.FunctionalNonTxJoinerBecomingBackupOwnerTest,
org.infinispan.api.mvcc.PutForExternalReadTest[REPL_SYNC, tx=false],
org.infinispan.functional.distribution.rehash.FunctionalTxTest,
org.infinispan.functional.FunctionalEncodingTypeTest[tx=true]]}
at
org.infinispan.commons.test.ThreadLeakChecker.performCheck(ThreadLeakChecker.java:148)
~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
at
org.infinispan.commons.test.ThreadLeakChecker.testFinished(ThreadLeakChecker.java:109)
~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
at
org.infinispan.test.fwk.TestResourceTracker.testFinished(TestResourceTracker.java:112)
~[test-classes/:?]
at
org.infinispan.test.AbstractInfinispanTest.testClassFinished(AbstractInfinispanTest.java:142)
~[test-classes/:?]
{noformat}
The fix should address both the exclusive topology lock itself, by releasing it in a
finally block, and the {{IllegalArgumentException}}, either by ignoring already cancelled
transfers or by only cancelling transfers while holding {{transferMapsLock}}.