[JBoss JIRA] (ISPN-9988) ScatteredStateConsumerImpl can leak the exclusive topology lock
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9988?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-9988:
-------------------------------
Sprint: DataGrid Sprint #30
> ScatteredStateConsumerImpl can leak the exclusive topology lock
> ---------------------------------------------------------------
>
> Key: ISPN-9988
> URL: https://issues.jboss.org/browse/ISPN-9988
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.4.7.Final, 10.0.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta4
>
>
> When an exception happens in {{ScatteredStateConsumerImpl.beforeTopologyInstalled}}, the exclusive topology lock is not released in {{StateConsumerImpl.onTopologyUpdate}}:
> {noformat}
> 15:21:54,783 ERROR (transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t5:[Topology-scattered]) [LocalTopologyManagerImpl] ISPN000230: Failed to start rebalance for cache scattered
> java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:172) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:959) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.beforeTopologyInstalled(ScatteredStateConsumerImpl.java:115) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:292) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102) ~[classes/:?]
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200) ~[classes/:?]
> {noformat}
> Because the exclusive topology lock is not released, threads that try to apply a new topology update block forever. This causes random failures with the ISPN-9863 thread leak checker:
> {noformat}
> 15:26:25,922 WARN (testng-RehashClusterPublisherManagerTest:[]) [ThreadLeakChecker] Possible leaked thread:
> "transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3" daemon prio=5 tid=0x236fd nid=NA waiting
> java.lang.Thread.State: WAITING
> java.base(a)11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
> java.base@11/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
> app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:42)
> app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:291)
> app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:57)
> app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:113)
> app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:353)
> app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:275)
> 15:26:25,923 ERROR (testng-RehashClusterPublisherManagerTest:[]) [TestSuiteProgress] Test configuration failed: org.infinispan.reactive.publisher.impl.RehashClusterPublisherManagerTest.testClassFinished
> java.lang.AssertionError: Leaked threads:
> {transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3: possible sources [org.infinispan.functional.FunctionalScatteredInMemoryTest[bias=ON_WRITE], org.infinispan.statetransfer.ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false], org.infinispan.functional.FunctionalCachestoreTest[passivation=true], org.infinispan.functional.distribution.rehash.FunctionalNonTxBackupOwnerBecomingPrimaryOwnerTest, org.infinispan.functional.distribution.rehash.FunctionalNonTxJoinerBecomingBackupOwnerTest, org.infinispan.api.mvcc.PutForExternalReadTest[REPL_SYNC, tx=false], org.infinispan.functional.distribution.rehash.FunctionalTxTest, org.infinispan.functional.FunctionalEncodingTypeTest[tx=true]]}
> at org.infinispan.commons.test.ThreadLeakChecker.performCheck(ThreadLeakChecker.java:148) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.commons.test.ThreadLeakChecker.testFinished(ThreadLeakChecker.java:109) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.test.fwk.TestResourceTracker.testFinished(TestResourceTracker.java:112) ~[test-classes/:?]
> at org.infinispan.test.AbstractInfinispanTest.testClassFinished(AbstractInfinispanTest.java:142) ~[test-classes/:?]
> {noformat}
> The fix should address both the exclusive topology lock itself, by releasing it in a finally block, and the {{IllegalArgumentException}}, either by ignoring already cancelled transfers or by only cancelling transfers while holding {{transferMapsLock}}.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 5 months
[JBoss JIRA] (ISPN-10041) Locking interceptor should check the topology before acquiring locks
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10041?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10041:
--------------------------------
Sprint: DataGrid Sprint #30
> Locking interceptor should check the topology before acquiring locks
> --------------------------------------------------------------------
>
> Key: ISPN-10041
> URL: https://issues.jboss.org/browse/ISPN-10041
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.11.Final, 9.3.6.Final, 9.4.9.Final, 10.0.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta4
>
>
> The distribution interceptors check the command topology is the same as the current topology before sending a command to remote nodes, but the locking interceptors do not have any check.
> On a remote node, this means the inbound invocation handler acquires some locks in topology {{T}}, then the locking interceptor acquires other locks in topology {{T+1}}, and finally the distribution interceptor throws {{OutdatedTopologyException}} and releases the locks. In older versions there is also a potential for blocking a remote executor thread while waiting for the lock, but luckily that is not a problem in 9.4+. It would be more efficient if the locking interceptor was throwing {{OutdatedTopologyException}} instead.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 5 months
[JBoss JIRA] (ISPN-10070) DefaultCacheManager should stop components after start failure
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10070?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10070:
--------------------------------
Sprint: DataGrid Sprint #29, DataGrid Sprint #30 (was: DataGrid Sprint #29)
> DefaultCacheManager should stop components after start failure
> --------------------------------------------------------------
>
> Key: ISPN-10070
> URL: https://issues.jboss.org/browse/ISPN-10070
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.4.10.Final, 10.0.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta4, 9.4.16.Final
>
>
> Currently it is impossible to release all the resources allocated during startup if the {{DefaultCacheManager}} instance was created with {{start=true}}. The user has to do something like this:
> {code:java}
> DefaultCacheManager manager = new DefaultCacheManager(..., false);
> try {
> manager.start();
> } catch (Throwable t) {
> manager.stop();
> throw t;
> }
> {code}
> Both the constructor and the public {{start()}} method should clean up the started components after a startup failure, so that the user doesn't have to call {{stop()}} explicitly.
> Our tests do not currently call {{stop()}} explicitly, so they leak threads and sockets when a manager fails to start (e.g. because something went wrong with the {{CONFIG}} cache).
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 5 months
[JBoss JIRA] (ISPN-10124) transport lock-timeout description is slightly wrong in XSD schema
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10124?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10124:
--------------------------------
Sprint: DataGrid Sprint #30
> transport lock-timeout description is slightly wrong in XSD schema
> ------------------------------------------------------------------
>
> Key: ISPN-10124
> URL: https://issues.jboss.org/browse/ISPN-10124
> Project: Infinispan
> Issue Type: Bug
> Components: Configuration, Core
> Affects Versions: 9.4.12.Final, 10.0.0.Beta3
> Reporter: Wolf-Dieter Fink
> Assignee: Dan Berindei
> Priority: Minor
>
> Form the infinispan-core xsd the transport lock-timeout description is like followed.
> <xs:complexType name="transport">
> <xs:attribute name="lock-timeout" type="xs:long" default="240000">
> <xs:annotation>
> <xs:documentation>
> Infinispan uses a distributed lock to maintain a coherent transaction log during state transfer or rehashing, which means that only one cache can be doing state transfer or rehashing at the same time.
> This constraint is in place because more than one cache could be involved in a transaction.
> This timeout controls the time to wait to acquire a distributed lock.
> </xs:documentation>
> </xs:annotation>
> </xs:attribute>
> This does not reflect the latest changes and should be updated
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 5 months
[JBoss JIRA] (ISPN-10185) hotrod-client tests hang on Windows
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10185?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10185:
--------------------------------
Sprint: DataGrid Sprint #30
> hotrod-client tests hang on Windows
> -----------------------------------
>
> Key: ISPN-10185
> URL: https://issues.jboss.org/browse/ISPN-10185
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Server
> Affects Versions: 9.4.13.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: testsuite_stability
> Fix For: 10.0.0.Beta4
>
> Attachments: ReplFailOverRemoteIteratorTest.txt
>
>
> All the {{ForkJoin.commonPool}} in the thread dump are busy doing blocking operations:
> {noformat}
> "ForkJoinPool.commonPool-worker-1" #66 prio=0 tid=0x42 nid=NA timed_waiting
> java.lang.Thread.State: TIMED_WAITING
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x75d9efe2> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
> at org.infinispan.client.hotrod.impl.Util.await(Util.java:21)
> at org.infinispan.client.hotrod.impl.RemoteCacheImpl.put(RemoteCacheImpl.java:335)
> at org.infinispan.client.hotrod.impl.RemoteCacheSupport.put(RemoteCacheSupport.java:79)
> at org.infinispan.client.hotrod.impl.iteration.AbstractRemoteIteratorTest.lambda$populateCache$0(AbstractRemoteIteratorTest.java:34)
> {noformat}
> {{ForkJoinPool.commonPool}} is supposed to add new threads to maintain the default parallelism level (= number of cpus), but that doesn't seem to work because {{RemoteCacheManagerTest}} is trying to start a cache and it didn't get a {{commonPool}} thread:
> {noformat}
> "testng-RemoteCacheManagerTest" #61 prio=0 tid=0x3d nid=NA waiting
> java.lang.Thread.State: WAITING
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x23a92f0a> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.infinispan.client.hotrod.RemoteCacheManagerTest.testStartStopAsync(RemoteCacheManagerTest.java:55)
> {noformat}
> Server threads are all free. The first test to time out was {{ProtobufRemoteIteratorIndexingTest}}, but most put operations on {{ForkJoinPool.commonPool}} are issued by {{ReplFailOverRemoteIteratorTest}}:
> {noformat}
> "testng-ReplFailOverRemoteIteratorTest" #62 prio=0 tid=0x3e nid=NA waiting
> java.lang.Thread.State: WAITING
> at java.lang.Object.wait(Native Method)
> - waiting on <0x404a8e70> (a java.util.stream.ForEachOps$ForEachTask)
> at java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:334)
> at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
> at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
> at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
> at java.util.stream.ForEachOps$ForEachOp$OfInt.evaluateParallel(ForEachOps.java:189)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
> at java.util.stream.IntPipeline.forEach(IntPipeline.java:404)
> at java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:560)
> at org.infinispan.client.hotrod.impl.iteration.AbstractRemoteIteratorTest.populateCache(AbstractRemoteIteratorTest.java:34)
> at org.infinispan.client.hotrod.impl.iteration.BaseIterationFailOverTest.testFailOver(BaseIterationFailOverTest.java:38)
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 5 months
[JBoss JIRA] (ISPN-5159) Make concurrent startup smooth
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-5159?page=com.atlassian.jira.plugin.... ]
Dan Berindei resolved ISPN-5159.
--------------------------------
Fix Version/s: 7.1.0.Final
Resolution: Done
> Make concurrent startup smooth
> ------------------------------
>
> Key: ISPN-5159
> URL: https://issues.jboss.org/browse/ISPN-5159
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core
> Affects Versions: 7.1.0.Beta1
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 7.1.0.Final
>
>
> When starting many instances in parallel, it often happens that the node does not detect its neighborhood very well and this results in many subclusters, merging views etc.
> Merging two available partitions has undefined results (AFAIK). While we can expect that there are no requests to the cluster from the application ^1^, Infinispan itself uses some caches to store internal information (HotRod routing, Protobuf etc...). It would be better if the available-available merge would provide hooks for rebuilding this info.
> ^1^) Being able to start the cluster with reads/writes disabled and enable them only when the cache has expected number of members would be convenient, too.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
6 years, 5 months