[JBoss JIRA] (ISPN-9988) ScatteredStateConsumerImpl can leak the exclusive topology lock
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9988?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-9988:
-------------------------------
Sprint: DataGrid Sprint #30
> ScatteredStateConsumerImpl can leak the exclusive topology lock
> ---------------------------------------------------------------
>
> Key: ISPN-9988
> URL: https://issues.jboss.org/browse/ISPN-9988
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.4.7.Final, 10.0.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta4
>
>
> When an exception happens in {{ScatteredStateConsumerImpl.beforeTopologyInstalled}}, the exclusive topology lock is not released in {{StateConsumerImpl.onTopologyUpdate}}:
> {noformat}
> 15:21:54,783 ERROR (transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t5:[Topology-scattered]) [LocalTopologyManagerImpl] ISPN000230: Failed to start rebalance for cache scattered
> java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:172) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:959) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.beforeTopologyInstalled(ScatteredStateConsumerImpl.java:115) ~[classes/:?]
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:292) ~[classes/:?]
> at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102) ~[classes/:?]
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200) ~[classes/:?]
> {noformat}
> Because the exclusive topology lock is not released, threads that try to apply a new topology update block forever. This causes random failures with the ISPN-9863 thread leak checker:
> {noformat}
> 15:26:25,922 WARN (testng-RehashClusterPublisherManagerTest:[]) [ThreadLeakChecker] Possible leaked thread:
> "transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3" daemon prio=5 tid=0x236fd nid=NA waiting
> java.lang.Thread.State: WAITING
> java.base@11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
> java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
> java.base@11/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
> app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:42)
> app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:291)
> app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200)
> app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:57)
> app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:113)
> app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:353)
> app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:275)
> 15:26:25,923 ERROR (testng-RehashClusterPublisherManagerTest:[]) [TestSuiteProgress] Test configuration failed: org.infinispan.reactive.publisher.impl.RehashClusterPublisherManagerTest.testClassFinished
> java.lang.AssertionError: Leaked threads:
> {transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3: possible sources [org.infinispan.functional.FunctionalScatteredInMemoryTest[bias=ON_WRITE], org.infinispan.statetransfer.ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false], org.infinispan.functional.FunctionalCachestoreTest[passivation=true], org.infinispan.functional.distribution.rehash.FunctionalNonTxBackupOwnerBecomingPrimaryOwnerTest, org.infinispan.functional.distribution.rehash.FunctionalNonTxJoinerBecomingBackupOwnerTest, org.infinispan.api.mvcc.PutForExternalReadTest[REPL_SYNC, tx=false], org.infinispan.functional.distribution.rehash.FunctionalTxTest, org.infinispan.functional.FunctionalEncodingTypeTest[tx=true]]}
> at org.infinispan.commons.test.ThreadLeakChecker.performCheck(ThreadLeakChecker.java:148) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.commons.test.ThreadLeakChecker.testFinished(ThreadLeakChecker.java:109) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
> at org.infinispan.test.fwk.TestResourceTracker.testFinished(TestResourceTracker.java:112) ~[test-classes/:?]
> at org.infinispan.test.AbstractInfinispanTest.testClassFinished(AbstractInfinispanTest.java:142) ~[test-classes/:?]
> {noformat}
> The fix should address both the exclusive topology lock itself, by releasing it in a finally block, and the {{IllegalArgumentException}}, either by ignoring already cancelled transfers or by only cancelling transfers while holding {{transferMapsLock}}.
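> A minimal, self-contained sketch of the try/finally part of the fix (method names mirror the stack traces above, but the class is simplified; the real {{onTopologyUpdate}} does much more between acquire and release):
> {code:java}
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> public class TopologyLockSketch {
>    // StateTransferLockImpl parks on a ReentrantReadWriteLock write lock,
>    // as visible in the leaked thread's stack trace above
>    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
>
>    void onTopologyUpdate() {
>       lock.writeLock().lock();
>       try {
>          beforeTopologyInstalled(); // may throw, as in the log above
>          // ... install the new topology ...
>       } finally {
>          // releasing in finally prevents later updates from blocking forever
>          lock.writeLock().unlock();
>       }
>    }
>
>    void beforeTopologyInstalled() {
>       throw new IllegalArgumentException("The task is already cancelled.");
>    }
> }
> {code}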
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
[JBoss JIRA] (ISPN-10041) Locking interceptor should check the topology before acquiring locks
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10041?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10041:
--------------------------------
Sprint: DataGrid Sprint #30
> Locking interceptor should check the topology before acquiring locks
> --------------------------------------------------------------------
>
> Key: ISPN-10041
> URL: https://issues.jboss.org/browse/ISPN-10041
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.11.Final, 9.3.6.Final, 9.4.9.Final, 10.0.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta4
>
>
> The distribution interceptors check that the command topology is the same as the current topology before sending a command to remote nodes, but the locking interceptors do not have any such check.
> On a remote node, this means the inbound invocation handler acquires some locks in topology {{T}}, then the locking interceptor acquires other locks in topology {{T+1}}, and finally the distribution interceptor throws {{OutdatedTopologyException}} and releases the locks. In older versions there is also the potential for blocking a remote executor thread while waiting for the lock, but luckily that is not a problem in 9.4+. It would be more efficient if the locking interceptor threw {{OutdatedTopologyException}} itself, as sketched below.
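> A self-contained sketch of such a check (the exception class here is a stand-in; the real {{OutdatedTopologyException}} lives in {{org.infinispan.statetransfer}}, and the interceptor would read the topology ids from the command and the current cache topology):
> {code:java}
> class OutdatedTopologyException extends RuntimeException {
>    OutdatedTopologyException(String msg) { super(msg); }
> }
>
> class LockingInterceptorSketch {
>    void acquireLocks(int commandTopologyId, int currentTopologyId) {
>       // Fail fast before taking any locks, mirroring the check the
>       // distribution interceptors already perform before remote invocation
>       if (commandTopologyId != currentTopologyId) {
>          throw new OutdatedTopologyException("Command topology " +
>                commandTopologyId + " != current topology " + currentTopologyId);
>       }
>       // ... acquire the keys' locks here ...
>    }
> }
> {code}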
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
[JBoss JIRA] (ISPN-10070) DefaultCacheManager should stop components after start failure
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10070?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10070:
--------------------------------
Sprint: DataGrid Sprint #29, DataGrid Sprint #30 (was: DataGrid Sprint #29)
> DefaultCacheManager should stop components after start failure
> --------------------------------------------------------------
>
> Key: ISPN-10070
> URL: https://issues.jboss.org/browse/ISPN-10070
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.4.10.Final, 10.0.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta4, 9.4.16.Final
>
>
> Currently it is impossible to release all the resources allocated during startup if the {{DefaultCacheManager}} instance was created with {{start=true}}. The user has to do something like this:
> {code:java}
> DefaultCacheManager manager = new DefaultCacheManager(..., false);
> try {
>    manager.start();
> } catch (Throwable t) {
>    manager.stop();
>    throw t;
> }
> {code}
> Both the constructor and the public {{start()}} method should clean up the started components after a startup failure, so that the user doesn't have to call {{stop()}} explicitly.
> Our tests do not currently call {{stop()}} explicitly, so they leak threads and sockets when a manager fails to start (e.g. because something went wrong with the {{CONFIG}} cache).
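> A minimal sketch of the proposed behaviour ({{startComponents()}} is a hypothetical stand-in for the component registry start sequence):
> {code:java}
> public class ManagerSketch {
>    public void start() {
>       boolean success = false;
>       try {
>          startComponents(); // may fail half-way, e.g. on the CONFIG cache
>          success = true;
>       } finally {
>          if (!success) {
>             // release the threads and sockets of components that did start,
>             // so callers no longer need the try/catch workaround above
>             stop();
>          }
>       }
>    }
>
>    private void startComponents() { /* ... */ }
>
>    public void stop() { /* ... */ }
> }
> {code}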
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
[JBoss JIRA] (ISPN-10124) transport lock-timeout description is slightly wrong in XSD schema
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10124?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10124:
--------------------------------
Sprint: DataGrid Sprint #30
> transport lock-timeout description is slightly wrong in XSD schema
> ------------------------------------------------------------------
>
> Key: ISPN-10124
> URL: https://issues.jboss.org/browse/ISPN-10124
> Project: Infinispan
> Issue Type: Bug
> Components: Configuration, Core
> Affects Versions: 9.4.12.Final, 10.0.0.Beta3
> Reporter: Wolf-Dieter Fink
> Assignee: Dan Berindei
> Priority: Minor
>
> From the infinispan-core XSD, the transport lock-timeout description reads as follows:
> {noformat}
> <xs:complexType name="transport">
>   <xs:attribute name="lock-timeout" type="xs:long" default="240000">
>     <xs:annotation>
>       <xs:documentation>
>         Infinispan uses a distributed lock to maintain a coherent transaction log during state transfer or rehashing, which means that only one cache can be doing state transfer or rehashing at the same time.
>         This constraint is in place because more than one cache could be involved in a transaction.
>         This timeout controls the time to wait to acquire a distributed lock.
>       </xs:documentation>
>     </xs:annotation>
>   </xs:attribute>
> </xs:complexType>
> {noformat}
> This description does not reflect the latest changes and should be updated.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
[JBoss JIRA] (ISPN-10185) hotrod-client tests hang on Windows
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-10185?page=com.atlassian.jira.plugin... ]
Dan Berindei updated ISPN-10185:
--------------------------------
Sprint: DataGrid Sprint #30
> hotrod-client tests hang on Windows
> -----------------------------------
>
> Key: ISPN-10185
> URL: https://issues.jboss.org/browse/ISPN-10185
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Server
> Affects Versions: 9.4.13.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: testsuite_stability
> Fix For: 10.0.0.Beta4
>
> Attachments: ReplFailOverRemoteIteratorTest.txt
>
>
> All the {{ForkJoinPool.commonPool}} threads in the thread dump are busy doing blocking operations:
> {noformat}
> "ForkJoinPool.commonPool-worker-1" #66 prio=0 tid=0x42 nid=NA timed_waiting
> java.lang.Thread.State: TIMED_WAITING
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x75d9efe2> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
> at org.infinispan.client.hotrod.impl.Util.await(Util.java:21)
> at org.infinispan.client.hotrod.impl.RemoteCacheImpl.put(RemoteCacheImpl.java:335)
> at org.infinispan.client.hotrod.impl.RemoteCacheSupport.put(RemoteCacheSupport.java:79)
> at org.infinispan.client.hotrod.impl.iteration.AbstractRemoteIteratorTest.lambda$populateCache$0(AbstractRemoteIteratorTest.java:34)
> {noformat}
> {{ForkJoinPool.commonPool}} is supposed to add new threads to maintain the default parallelism level (= number of CPUs), but that doesn't seem to work here: {{RemoteCacheManagerTest}} is trying to start a cache and never got a {{commonPool}} thread:
> {noformat}
> "testng-RemoteCacheManagerTest" #61 prio=0 tid=0x3d nid=NA waiting
> java.lang.Thread.State: WAITING
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x23a92f0a> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.infinispan.client.hotrod.RemoteCacheManagerTest.testStartStopAsync(RemoteCacheManagerTest.java:55)
> {noformat}
> Server threads are all free. The first test to time out was {{ProtobufRemoteIteratorIndexingTest}}, but most put operations on {{ForkJoinPool.commonPool}} are issued by {{ReplFailOverRemoteIteratorTest}}:
> {noformat}
> "testng-ReplFailOverRemoteIteratorTest" #62 prio=0 tid=0x3e nid=NA waiting
> java.lang.Thread.State: WAITING
> at java.lang.Object.wait(Native Method)
> - waiting on <0x404a8e70> (a java.util.stream.ForEachOps$ForEachTask)
> at java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:334)
> at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
> at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
> at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
> at java.util.stream.ForEachOps$ForEachOp$OfInt.evaluateParallel(ForEachOps.java:189)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
> at java.util.stream.IntPipeline.forEach(IntPipeline.java:404)
> at java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:560)
> at org.infinispan.client.hotrod.impl.iteration.AbstractRemoteIteratorTest.populateCache(AbstractRemoteIteratorTest.java:34)
> at org.infinispan.client.hotrod.impl.iteration.BaseIterationFailOverTest.testFailOver(BaseIterationFailOverTest.java:38)
> {noformat}
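> The hang pattern, reduced to a standalone example (the never-completing future is a stand-in for a Hot Rod response that doesn't arrive; the real tests block inside {{RemoteCacheImpl.put}} via {{Util.await}}):
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.stream.IntStream;
>
> public class CommonPoolStarvation {
>    public static void main(String[] args) {
>       CompletableFuture<Void> pending = new CompletableFuture<>();
>
>       // Parallel streams run on ForkJoinPool.commonPool. Every worker ends up
>       // parked in get(), and if the pool does not compensate with new threads,
>       // unrelated tasks submitted to commonPool (like the async cache manager
>       // start in RemoteCacheManagerTest) never get to run. This main method
>       // hangs forever, by design.
>       IntStream.range(0, 1000).parallel().forEach(i -> {
>          try {
>             pending.get(); // stand-in for Util.await(cache.putAsync(...))
>          } catch (Exception e) {
>             throw new RuntimeException(e);
>          }
>       });
>    }
> }
> {code}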
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
[JBoss JIRA] (ISPN-5159) Make concurrent startup smooth
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-5159?page=com.atlassian.jira.plugin.... ]
Dan Berindei resolved ISPN-5159.
--------------------------------
Fix Version/s: 7.1.0.Final
Resolution: Done
> Make concurrent startup smooth
> ------------------------------
>
> Key: ISPN-5159
> URL: https://issues.jboss.org/browse/ISPN-5159
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core
> Affects Versions: 7.1.0.Beta1
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 7.1.0.Final
>
>
> When starting many instances in parallel, it often happens that a node does not detect its neighbours properly, and this results in many subclusters, merging views etc.
> Merging two available partitions has undefined results (AFAIK). While we can expect that there are no requests to the cluster from the application ^1^, Infinispan itself uses some caches to store internal information (HotRod routing, Protobuf etc.). It would be better if the available-available merge provided hooks for rebuilding this info.
> ^1^) Being able to start the cluster with reads/writes disabled and enable them only when the cache has the expected number of members would be convenient, too.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
[JBoss JIRA] (ISPN-10309) Convert Remaining Parts to Non Blocking & Reduce Thread Pools
by Will Burns (Jira)
[ https://issues.jboss.org/browse/ISPN-10309?page=com.atlassian.jira.plugin... ]
Will Burns updated ISPN-10309:
------------------------------
Sprint: DataGrid Sprint #29, DataGrid Sprint #30 (was: DataGrid Sprint #29)
> Convert Remaining Parts to Non Blocking & Reduce Thread Pools
> -------------------------------------------------------------
>
> Key: ISPN-10309
> URL: https://issues.jboss.org/browse/ISPN-10309
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core
> Reporter: Will Burns
> Assignee: Will Burns
> Priority: Major
> Fix For: 10.0.0.Final
>
>
> We would love to get our thread pools down to a single CPU thread pool (size = numCores) and a blocking thread pool (arbitrarily large). We may also require a scheduler pool for various tasks as well (limited to size 1-2?).
> To do this we need to remove as many remnants of our blocking code as possible. The likely sources of blocking are locks and I/O operations.
> The persistence layer was completed with ISPN-9722, so that is not an issue.
> The requirement around locking can be relaxed if the locks are guaranteed to be small in scope and do not wrap other blocking operations. An example would be a lock such as the ones in {{ConcurrentHashMap}}, as long as the functional arguments we pass in don't contain large blocking sections.
> If code cannot be made non-blocking, we must offload the operation to the blocking thread pool. Care must be taken to ensure that once the blocking portion of the code has completed, we switch back to the CPU thread pool as soon as possible. The listener API, for example, violates this: it runs Infinispan code from whatever thread completes the listener, which could be a user thread. A sketch of the offload pattern follows.
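> A sketch of the offload-and-return pattern (executor wiring is illustrative; the real pools come from Infinispan's component registry):
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.CompletionStage;
> import java.util.concurrent.Executor;
>
> public class OffloadSketch {
>    private final Executor blockingExecutor; // arbitrarily large pool
>    private final Executor cpuExecutor;      // size = numCores
>
>    OffloadSketch(Executor blockingExecutor, Executor cpuExecutor) {
>       this.blockingExecutor = blockingExecutor;
>       this.cpuExecutor = cpuExecutor;
>    }
>
>    CompletionStage<String> read() {
>       // run the blocking I/O on the blocking pool...
>       return CompletableFuture.supplyAsync(this::blockingRead, blockingExecutor)
>             // ...then hop back to the CPU pool as soon as it completes, so
>             // continuations never run blocking work on a CPU thread and never
>             // run CPU work on a thread we don't control (the listener problem)
>             .thenApplyAsync(String::trim, cpuExecutor);
>    }
>
>    private String blockingRead() {
>       return " data from disk "; // stand-in for a real blocking read
>    }
> }
> {code}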
--
This message was sent by Atlassian Jira
(v7.12.1#712002)