[JBoss JIRA] (ISPN-10093) PersistenceManagerImpl stop deadlock with topology update
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-10093?page=com.atlassian.jira.plugin... ]
Tristan Tarrant updated ISPN-10093:
-----------------------------------
Fix Version/s: 10.0.0.Beta5
(was: 10.0.0.Beta4)
> PersistenceManagerImpl stop deadlock with topology update
> ---------------------------------------------------------
>
> Key: ISPN-10093
> URL: https://issues.jboss.org/browse/ISPN-10093
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 10.0.0.Beta3
> Reporter: Dan Berindei
> Assignee: Will Burns
> Priority: Major
> Labels: testsuite_stability
> Fix For: 10.0.0.Beta5
>
> Attachments: threaddump.txt
>
>
> {{DistSyncStoreNotSharedTest.clearContent}} hanged in CI recently:
> {noformat}
> "testng-DistSyncStoreNotSharedTest" #16 prio=5 os_prio=0 cpu=11511.26ms elapsed=435.14s tid=0x00007fdb710b6000 nid=0x3222 waiting on condition [0x00007fdb352d3000]
> java.lang.Thread.State: WAITING (parking)
> at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
> - parking to wait for <0x00000000c8a22450> (a java.util.concurrent.Semaphore$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11/AbstractQueuedSynchronizer.java:885)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11/AbstractQueuedSynchronizer.java:1009)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11/AbstractQueuedSynchronizer.java:1324)
> at java.util.concurrent.Semaphore.acquireUninterruptibly(java.base@11/Semaphore.java:504)
> at org.infinispan.persistence.manager.PersistenceManagerImpl.stop(PersistenceManagerImpl.java:222)
> at jdk.internal.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
> at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11/DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(java.base@11/Method.java:566)
> at org.infinispan.commons.util.SecurityActions.lambda$invokeAccessibly$0(SecurityActions.java:79)
> at org.infinispan.commons.util.SecurityActions$$Lambda$237/0x0000000100661c40.run(Unknown Source)
> at org.infinispan.commons.util.SecurityActions.doPrivileged(SecurityActions.java:71)
> at org.infinispan.commons.util.SecurityActions.invokeAccessibly(SecurityActions.java:76)
> at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:181)
> at org.infinispan.factories.impl.BasicComponentRegistryImpl.performStop(BasicComponentRegistryImpl.java:601)
> at org.infinispan.factories.impl.BasicComponentRegistryImpl.stopWrapper(BasicComponentRegistryImpl.java:590)
> at org.infinispan.factories.impl.BasicComponentRegistryImpl.stop(BasicComponentRegistryImpl.java:461)
> at org.infinispan.factories.AbstractComponentRegistry.internalStop(AbstractComponentRegistry.java:431)
> at org.infinispan.factories.AbstractComponentRegistry.stop(AbstractComponentRegistry.java:366)
> at org.infinispan.cache.impl.CacheImpl.performImmediateShutdown(CacheImpl.java:1160)
> at org.infinispan.cache.impl.CacheImpl.stop(CacheImpl.java:1125)
> at org.infinispan.cache.impl.AbstractDelegatingCache.stop(AbstractDelegatingCache.java:521)
> at org.infinispan.manager.DefaultCacheManager.terminate(DefaultCacheManager.java:747)
> at org.infinispan.manager.DefaultCacheManager.stopCaches(DefaultCacheManager.java:799)
> at org.infinispan.manager.DefaultCacheManager.stop(DefaultCacheManager.java:775)
> at org.infinispan.test.TestingUtil.killCacheManagers(TestingUtil.java:846)
> at org.infinispan.test.MultipleCacheManagersTest.clearContent(MultipleCacheManagersTest.java:158)
> "persistence-thread-DistSyncStoreNotSharedTest-NodeB-p16432-t1" #53654 daemon prio=5 os_prio=0 cpu=1.26ms elapsed=301.93s tid=0x00007fdb3c3d8000 nid=0x8ef waiting on condition [0x00007fdb00055000]
> java.lang.Thread.State: WAITING (parking)
> at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
> - parking to wait for <0x00000000c8b1fb88> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11/AbstractQueuedSynchronizer.java:885)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11/AbstractQueuedSynchronizer.java:1009)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11/AbstractQueuedSynchronizer.java:1324)
> at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11/ReentrantReadWriteLock.java:738)
> at org.infinispan.persistence.manager.PersistenceManagerImpl.pollStoreAvailability(PersistenceManagerImpl.java:196)
> at org.infinispan.persistence.manager.PersistenceManagerImpl$$Lambda$492/0x00000001007fb440.run(Unknown Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11/Executors.java:515)
> at java.util.concurrent.FutureTask.runAndReset(java.base@11/FutureTask.java:305)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11/ScheduledThreadPoolExecutor.java:305)
> "transport-thread-DistSyncStoreNotSharedTest-NodeB-p16424-t5" #53646 daemon prio=5 os_prio=0 cpu=3.15ms elapsed=301.94s tid=0x00007fdb2007a000 nid=0x8e8 waiting on condition [0x00007fdb0b406000]
> java.lang.Thread.State: WAITING (parking)
> at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
> - parking to wait for <0x00000000c8d2abb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11/AbstractQueuedSynchronizer.java:2081)
> at io.reactivex.internal.operators.flowable.BlockingFlowableIterable$BlockingFlowableIterator.hasNext(BlockingFlowableIterable.java:94)
> at io.reactivex.Flowable.blockingForEach(Flowable.java:5682)
> at org.infinispan.statetransfer.StateConsumerImpl.removeStaleData(StateConsumerImpl.java:1011)
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:453)
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:202)
> at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:58)
> at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:114)
> at org.infinispan.topology.LocalTopologyManagerImpl.resetLocalTopologyBeforeRebalance(LocalTopologyManagerImpl.java:437)
> at org.infinispan.topology.LocalTopologyManagerImpl.doHandleRebalance(LocalTopologyManagerImpl.java:519)
> - locked <0x00000000c8b30b30> (a org.infinispan.topology.LocalCacheStatus)
> at org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleRebalance$3(LocalTopologyManagerImpl.java:484)
> at org.infinispan.topology.LocalTopologyManagerImpl$$Lambda$574/0x000000010089a040.run(Unknown Source)
> at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175){noformat}
> [Full thread dump|https://ci.infinispan.org/job/Infinispan/job/master/1133/artifact/core/]
> Somehow the producer thread for the transport-thread iteration is blocked, but without waiting for the persistence mutex. Maybe it's waiting for a topology? Not sure if it's relevant, but the last test to run was {{testClearWithFlag}}, so the data container was empty and the store had 5 entries.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 9 months
[JBoss JIRA] (ISPN-10041) Locking interceptor should check the topology before acquiring locks
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-10041?page=com.atlassian.jira.plugin... ]
Tristan Tarrant updated ISPN-10041:
-----------------------------------
Fix Version/s: 10.0.0.Beta5
(was: 10.0.0.Beta4)
> Locking interceptor should check the topology before acquiring locks
> --------------------------------------------------------------------
>
> Key: ISPN-10041
> URL: https://issues.jboss.org/browse/ISPN-10041
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.11.Final, 9.3.6.Final, 9.4.9.Final, 10.0.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta5
>
>
> The distribution interceptors check the command topology is the same as the current topology before sending a command to remote nodes, but the locking interceptors do not have any check.
> On a remote node, this means the inbound invocation handler acquires some locks in topology {{T}}, then the locking interceptor acquires other locks in topology {{T+1}}, and finally the distribution interceptor throws {{OutdatedTopologyException}} and releases the locks. In older versions there is also a potential for blocking a remote executor thread while waiting for the lock, but luckily that is not a problem in 9.4+. It would be more efficient if the locking interceptor was throwing {{OutdatedTopologyException}} instead.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 9 months
[JBoss JIRA] (ISPN-9988) ScatteredStateConsumerImpl can leak the exclusive topology lock
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-9988?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-9988:
----------------------------------
Fix Version/s: 10.0.0.Beta5
(was: 10.0.0.Beta4)
> ScatteredStateConsumerImpl can leak the exclusive topology lock
> ---------------------------------------------------------------
>
> Key: ISPN-9988
> URL: https://issues.jboss.org/browse/ISPN-9988
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.4.7.Final, 10.0.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta5
>
>
> When an exception happens in {{ScatteredStateConsumerImpl.beforeTopologyInstalled}}, the exclusive topology lock is not released in {{StateConsumerImpl.onTopologyUpdate}}:
> {noformat}
> 15:21:54,783 ERROR (transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t5:[Topology-scattered]) [LocalTopologyManagerImpl] ISPN000230: Failed to start rebalance for cache scattered
> java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:172) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:959) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.beforeTopologyInstalled(ScatteredStateConsumerImpl.java:115) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:292) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102) ~[classes/:?]
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200) ~[classes/:?]
> {noformat}
> Because the exclusive topology lock is not released, threads that try to apply a new topology update block forever. This causes random failures with the ISPN-9863 thread leak checker:
> {noformat}
> 15:26:25,922 WARN (testng-RehashClusterPublisherManagerTest:[]) [ThreadLeakChecker] Possible leaked thread:
> "transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3" daemon prio=5 tid=0x236fd nid=NA waiting
> java.lang.Thread.State: WAITING
> java.base(a)11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
> java.base@11/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
> app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:42)
> app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:291)
> app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:57)
> app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:113)
> app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:353)
> app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:275)
> 15:26:25,923 ERROR (testng-RehashClusterPublisherManagerTest:[]) [TestSuiteProgress] Test configuration failed: org.infinispan.reactive.publisher.impl.RehashClusterPublisherManagerTest.testClassFinished
> java.lang.AssertionError: Leaked threads:
> {transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3: possible sources [org.infinispan.functional.FunctionalScatteredInMemoryTest[bias=ON_WRITE], org.infinispan.statetransfer.ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false], org.infinispan.functional.FunctionalCachestoreTest[passivation=true], org.infinispan.functional.distribution.rehash.FunctionalNonTxBackupOwnerBecomingPrimaryOwnerTest, org.infinispan.functional.distribution.rehash.FunctionalNonTxJoinerBecomingBackupOwnerTest, org.infinispan.api.mvcc.PutForExternalReadTest[REPL_SYNC, tx=false], org.infinispan.functional.distribution.rehash.FunctionalTxTest, org.infinispan.functional.FunctionalEncodingTypeTest[tx=true]]}
> at org.infinispan.commons.test.ThreadLeakChecker.performCheck(ThreadLeakChecker.java:148) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.commons.test.ThreadLeakChecker.testFinished(ThreadLeakChecker.java:109) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.test.fwk.TestResourceTracker.testFinished(TestResourceTracker.java:112) ~[test-classes/:?]
> at org.infinispan.test.AbstractInfinispanTest.testClassFinished(AbstractInfinispanTest.java:142) ~[test-classes/:?]
> {noformat}
> The fix should address both the exclusive topology lock itself, by releasing it in a finally block, and the {{IllegalArgumentException}}, either by ignoring already cancelled transfers or by only cancelling transfers while holding {{transferMapsLock}}.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 9 months
[JBoss JIRA] (ISPN-9620) Rolling Upgrade Marshaller Changes
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-9620?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-9620:
----------------------------------
Fix Version/s: 10.0.0.Beta5
(was: 10.0.0.Beta4)
> Rolling Upgrade Marshaller Changes
> ----------------------------------
>
> Key: ISPN-9620
> URL: https://issues.jboss.org/browse/ISPN-9620
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core, Loaders and Stores
> Affects Versions: 9.4.0.Final
> Reporter: Ryan Emerson
> Assignee: Ryan Emerson
> Priority: Major
> Fix For: 10.0.0.Beta5
>
>
> In order to allow for compatibility between infinispan versions it is necessary for us to utilise a marshalling implementation at both the cluster (internal node-to-node communication) and persistence layer that is strictly defined but allows for future changes. This is necessary in order to facilitate both rolling and start/stop upgrades. Protocol buffers should be utilised as the wire/storage format, with protostream providing the implementation.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 9 months
[JBoss JIRA] (ISPN-8329) ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus2 random failures with scattered cache
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-8329?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-8329:
----------------------------------
Fix Version/s: 10.0.0.Beta5
(was: 10.0.0.Beta4)
> ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus2 random failures with scattered cache
> ----------------------------------------------------------------------------------------------
>
> Key: ISPN-8329
> URL: https://issues.jboss.org/browse/ISPN-8329
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core
> Reporter: Tristan Tarrant
> Assignee: Dan Berindei
> Priority: Major
> Labels: testsuite_stability
> Fix For: 10.0.0.Beta5, 9.4.16.Final
>
>
> Error Message
> Timed out waiting for rebalancing to complete on node ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false]-NodeB-47665, current topology is CacheTopology{id=8, rebalanceId=4, currentCH=PartitionerConsistentHash:ScatteredConsistentHash{ns=256, rebalanced=false, owners = (1)[ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false]-NodeB-47665: 85]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false]-NodeB-47665], persistentUUIDs=[63b3a997-f229-475b-a14c-9c892f608ba0]}. rebalanceInProgress=false, currentChIsBalanced=false
> Stacktrace
> java.lang.RuntimeException: Timed out waiting for rebalancing to complete on node ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false]-NodeB-47665, current topology is CacheTopology{id=8, rebalanceId=4, currentCH=PartitionerConsistentHash:ScatteredConsistentHash{ns=256, rebalanced=false, owners = (1)[ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false]-NodeB-47665: 85]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false]-NodeB-47665], persistentUUIDs=[63b3a997-f229-475b-a14c-9c892f608ba0]}. rebalanceInProgress=false, currentChIsBalanced=false
> at org.infinispan.test.TestingUtil.waitForNoRebalance(TestingUtil.java:386)
> at org.infinispan.statetransfer.ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus2(ClusterTopologyManagerTest.java:430)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> ... Removed 16 stack frames
>
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 9 months