[JBoss JIRA] (ISPN-10363) LazyInitializingExecutorService.shutdown() is not thread-safe
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10363?page=com.atlassian.jira.plugin... ]
Dan Berindei commented on ISPN-10363:
-------------------------------------
It was not the missing synchronization after all, as {{executor}} is a volatile field.
The real problem is that {{shutdown()}} does nothing if the executor was not already initialized, so {{initIfNeeded()}} can start an executor even after the component was stopped.
> LazyInitializingExecutorService.shutdown() is not thread-safe
> -------------------------------------------------------------
>
> Key: ISPN-10363
> URL: https://issues.jboss.org/browse/ISPN-10363
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 10.0.0.Beta3, 9.4.15.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: testsuite_stability
> Fix For: 10.0.0.Beta4
>
>
> {{LazyInitializingExecutorService.shutdown()}} is not synchronized, so it may miss the executor created on another thread.
> Normally the long time between the first use and stop is long enough for this to be a non-issue, but it does cause random thread leaks in {{PersistenceManagerTest}}:
> {noformat}
> 17:57:06,421 DEBUG (testng-PersistenceManagerTest:[]) [DefaultCacheManager] Started cache manager PersistenceManagerTest-NodeB on null
> 17:57:06,423 INFO (testng-PersistenceManagerTest:[]) [TestSuiteProgress] Test starting: org.infinispan.persistence.PersistenceManagerTest.testProcessAfterStop
> 17:57:06,446 INFO (testng-PersistenceManagerTest:[]) [TestSuiteProgress] Test succeeded: org.infinispan.persistence.PersistenceManagerTest.testProcessAfterStop
> 17:57:06,447 DEBUG (testng-PersistenceManagerTest:[]) [DefaultCacheManager] Stopped cache manager PersistenceManagerTest-NodeB
> 17:58:38,062 WARN (main:[]) [ThreadLeakChecker] Possible leaked thread:
> "async-thread-PersistenceManagerTest-NodeB-p54399-t1" daemon prio=5 tid=0x2ab4c nid=NA waiting
> java.lang.Thread.State: WAITING
> at java.base(a)11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
> at java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
> at java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
> at java.base@11.0.3/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
> at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
> at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
> at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 6 months
[JBoss JIRA] (ISPN-10363) LazyInitializingExecutorService.shutdown() is not thread-safe
by Dan Berindei (Jira)
Dan Berindei created ISPN-10363:
-----------------------------------
Summary: LazyInitializingExecutorService.shutdown() is not thread-safe
Key: ISPN-10363
URL: https://issues.jboss.org/browse/ISPN-10363
Project: Infinispan
Issue Type: Bug
Components: Core, Test Suite - Core
Affects Versions: 9.4.15.Final, 10.0.0.Beta3
Reporter: Dan Berindei
Assignee: Dan Berindei
Fix For: 10.0.0.Beta4
{{LazyInitializingExecutorService.shutdown()}} is not synchronized, so it may miss the executor created on another thread.
Normally the long time between the first use and stop is long enough for this to be a non-issue, but it does cause random thread leaks in {{PersistenceManagerTest}}:
{noformat}
17:57:06,421 DEBUG (testng-PersistenceManagerTest:[]) [DefaultCacheManager] Started cache manager PersistenceManagerTest-NodeB on null
17:57:06,423 INFO (testng-PersistenceManagerTest:[]) [TestSuiteProgress] Test starting: org.infinispan.persistence.PersistenceManagerTest.testProcessAfterStop
17:57:06,446 INFO (testng-PersistenceManagerTest:[]) [TestSuiteProgress] Test succeeded: org.infinispan.persistence.PersistenceManagerTest.testProcessAfterStop
17:57:06,447 DEBUG (testng-PersistenceManagerTest:[]) [DefaultCacheManager] Stopped cache manager PersistenceManagerTest-NodeB
17:58:38,062 WARN (main:[]) [ThreadLeakChecker] Possible leaked thread:
"async-thread-PersistenceManagerTest-NodeB-p54399-t1" daemon prio=5 tid=0x2ab4c nid=NA waiting
java.lang.Thread.State: WAITING
at java.base(a)11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
at java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
at java.base@11.0.3/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
{noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 6 months
[JBoss JIRA] (ISPN-10363) LazyInitializingExecutorService.shutdown() is not thread-safe
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10363?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10363:
--------------------------------
Status: Open (was: New)
> LazyInitializingExecutorService.shutdown() is not thread-safe
> -------------------------------------------------------------
>
> Key: ISPN-10363
> URL: https://issues.jboss.org/browse/ISPN-10363
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 10.0.0.Beta3, 9.4.15.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: testsuite_stability
> Fix For: 10.0.0.Beta4
>
>
> {{LazyInitializingExecutorService.shutdown()}} is not synchronized, so it may miss the executor created on another thread.
> Normally the long time between the first use and stop is long enough for this to be a non-issue, but it does cause random thread leaks in {{PersistenceManagerTest}}:
> {noformat}
> 17:57:06,421 DEBUG (testng-PersistenceManagerTest:[]) [DefaultCacheManager] Started cache manager PersistenceManagerTest-NodeB on null
> 17:57:06,423 INFO (testng-PersistenceManagerTest:[]) [TestSuiteProgress] Test starting: org.infinispan.persistence.PersistenceManagerTest.testProcessAfterStop
> 17:57:06,446 INFO (testng-PersistenceManagerTest:[]) [TestSuiteProgress] Test succeeded: org.infinispan.persistence.PersistenceManagerTest.testProcessAfterStop
> 17:57:06,447 DEBUG (testng-PersistenceManagerTest:[]) [DefaultCacheManager] Stopped cache manager PersistenceManagerTest-NodeB
> 17:58:38,062 WARN (main:[]) [ThreadLeakChecker] Possible leaked thread:
> "async-thread-PersistenceManagerTest-NodeB-p54399-t1" daemon prio=5 tid=0x2ab4c nid=NA waiting
> java.lang.Thread.State: WAITING
> at java.base(a)11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
> at java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
> at java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
> at java.base@11.0.3/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
> at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
> at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
> at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 6 months
[JBoss JIRA] (ISPN-9988) ScatteredStateConsumerImpl can leak the exclusive topology lock
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9988?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-9988:
-------------------------------
Status: Open (was: New)
> ScatteredStateConsumerImpl can leak the exclusive topology lock
> ---------------------------------------------------------------
>
> Key: ISPN-9988
> URL: https://issues.jboss.org/browse/ISPN-9988
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.4.7.Final, 10.0.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta4
>
>
> When an exception happens in {{ScatteredStateConsumerImpl.beforeTopologyInstalled}}, the exclusive topology lock is not released in {{StateConsumerImpl.onTopologyUpdate}}:
> {noformat}
> 15:21:54,783 ERROR (transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t5:[Topology-scattered]) [LocalTopologyManagerImpl] ISPN000230: Failed to start rebalance for cache scattered
> java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:172) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:959) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.beforeTopologyInstalled(ScatteredStateConsumerImpl.java:115) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:292) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102) ~[classes/:?]
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200) ~[classes/:?]
> {noformat}
> Because the exclusive topology lock is not released, threads that try to apply a new topology update block forever. This causes random failures with the ISPN-9863 thread leak checker:
> {noformat}
> 15:26:25,922 WARN (testng-RehashClusterPublisherManagerTest:[]) [ThreadLeakChecker] Possible leaked thread:
> "transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3" daemon prio=5 tid=0x236fd nid=NA waiting
> java.lang.Thread.State: WAITING
> java.base(a)11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
> java.base@11/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
> app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:42)
> app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:291)
> app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:57)
> app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:113)
> app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:353)
> app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:275)
> 15:26:25,923 ERROR (testng-RehashClusterPublisherManagerTest:[]) [TestSuiteProgress] Test configuration failed: org.infinispan.reactive.publisher.impl.RehashClusterPublisherManagerTest.testClassFinished
> java.lang.AssertionError: Leaked threads:
> {transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3: possible sources [org.infinispan.functional.FunctionalScatteredInMemoryTest[bias=ON_WRITE], org.infinispan.statetransfer.ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false], org.infinispan.functional.FunctionalCachestoreTest[passivation=true], org.infinispan.functional.distribution.rehash.FunctionalNonTxBackupOwnerBecomingPrimaryOwnerTest, org.infinispan.functional.distribution.rehash.FunctionalNonTxJoinerBecomingBackupOwnerTest, org.infinispan.api.mvcc.PutForExternalReadTest[REPL_SYNC, tx=false], org.infinispan.functional.distribution.rehash.FunctionalTxTest, org.infinispan.functional.FunctionalEncodingTypeTest[tx=true]]}
> at org.infinispan.commons.test.ThreadLeakChecker.performCheck(ThreadLeakChecker.java:148) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.commons.test.ThreadLeakChecker.testFinished(ThreadLeakChecker.java:109) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.test.fwk.TestResourceTracker.testFinished(TestResourceTracker.java:112) ~[test-classes/:?]
> at org.infinispan.test.AbstractInfinispanTest.testClassFinished(AbstractInfinispanTest.java:142) ~[test-classes/:?]
> {noformat}
> The fix should address both the exclusive topology lock itself, by releasing it in a finally block, and the {{IllegalArgumentException}}, either by ignoring already cancelled transfers or by only cancelling transfers while holding {{transferMapsLock}}.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 6 months
[JBoss JIRA] (ISPN-8240) Coordinator sends REBALANCE_START command when there is already a rebalance in progress
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-8240?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-8240:
------------------------------------
Cf. JDG-2856 the duplicate {{REBALANCE_START}} can also lead to the installation of a "fake topology" and the cancellation of inbound transfers from the previous rebalance, effectively blocking the rebalance forever.
In addition to removing {{REBALANCE_START}} we should make the coordinator wait for responses for any {{CacheTopologyControlCommand}}s before sending a new one, so we don't need the fake topology any more.
> Coordinator sends REBALANCE_START command when there is already a rebalance in progress
> ---------------------------------------------------------------------------------------
>
> Key: ISPN-8240
> URL: https://issues.jboss.org/browse/ISPN-8240
> Project: Infinispan
> Issue Type: Bug
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
>
> Normally the {{REBALANCE_START}} command should only be sent at the start of a rebalance, and any topology updates sent before all the nodes confirm the rebalance phase should have {{CH_UPDATE}}.
> Since the change to 4 phases, this is no longer true: first {{ClusterCacheStatus.updateTopologyMembers}} first clears the {{RebalanceConfirmationCollector}}, then it broadcasts a {{CH_UPDATE}}. Then {{queueRebalance}} immediately creates a new {{RCC}} and broadcasts a {{REBALANCE_START}}, instead of waiting for the current rebalance to finish.
> I propose we remove {{REBALANCE_START}}, as it was just a crude version of {{CacheTopology.Phase}}. We should also remove the {{isRebalance}} parameter from {{StateConsumerImpl.onTopologyUpdate()}}.
> I'm still not sure if rebalancing the pending CH immediately is ok. On the one hand, I would like the rebalance to finish with {{updateMembers(union(currentCH, pendingCH))}} as the new pending CH, so that segments that were already transferred keep an extra copy. On the other hand, that would only help for segments that have at least on owner in the current CH: if the current CH has 0 owners and {{updateMembers}} allocates new ones, those new owners won't request data from the pending CH owners anyway. Fixing that case would require the coordinator to fetch the transfer status from all the nodes before removing a node from the topology. But if the coordinator knew exactly which segments were transferred, it could finish the rebalance immediately and start a new one -- so it would be more similar to the current approach.
> Note: the {{SyncConsistentHashFactory}} allocation is not 100% stable, even when nodes are not added, so A ∈ owners(segment) in topology ABCD does not guarantee that A ∈ owners(segment) in topology ABC. But it should be good enough to keep A an owner in 90% of the cases.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 6 months