[JBoss JIRA] (ISPN-2240) Per-key lock container leads to superfluous TimeoutExceptions on concurrent access to same key
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-2240?page=com.atlassian.jira.plugin.... ]
Dan Berindei closed ISPN-2240.
------------------------------
Resolution: Cannot Reproduce Bug
I ran the test on 6.0.x and 5.3.x and I didn't get any TimeoutExceptions.
> Per-key lock container leads to superfluous TimeoutExceptions on concurrent access to same key
> ----------------------------------------------------------------------------------------------
>
> Key: ISPN-2240
> URL: https://issues.jboss.org/browse/ISPN-2240
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public (Everyone can see)
> Components: Transactions
> Affects Versions: 4.0.0.ALPHA1, 5.1.6.FINAL
> Reporter: Robert Stupp
> Assignee: Mircea Markus
> Priority: Critical
> Fix For: 7.0.0.Beta1, 7.0.0.Final
>
> Attachments: ISPN-2240_fix_TimeoutExceptions.patch, somehow.zip
>
>
> Hi,
> I've encountered a lot of TimeoutExceptions just by running a load test against an Infinispan cluster.
> I tracked down the reason and found that the code in org.infinispan.util.concurrent.locks.containers.AbstractPerEntryLockContainer#releaseLock() causes these superfluous TimeoutExceptions.
> A small test case is attached; it prints out timeouts and too-late timeouts, and "paints" a lot of dots to the console - more dots per second on the console means better throughput ;-)
> In a short test I extended the class ReentrantPerEntryLockContainer and changed the implementation of releaseLock() as follows:
> {noformat}
> public void releaseLock(Object lockOwner, Object key) {
>     ReentrantLock l = locks.get(key);
>     if (l != null) {
>         if (!l.isHeldByCurrentThread())
>             throw new IllegalStateException("Lock for [" + key + "] not held by current thread " + Thread.currentThread());
>         while (l.isHeldByCurrentThread())
>             unlock(l, lockOwner);
>         if (!l.hasQueuedThreads())
>             locks.remove(key);
>     } else {
>         throw new IllegalStateException("No lock for [" + key + ']');
>     }
> }
> {noformat}
> The main improvement is that locks are not removed from the concurrent map as long as other threads are waiting on that lock.
> If the lock is removed from the map while other threads are still waiting for it, they may run into timeouts and surface TimeoutExceptions to the client.
> The above method "paints more dots per second" - that is, it gives better throughput for concurrent accesses to the same key.
> The re-implemented method should also fix some replication timeout exceptions.
> Please, please add this to 5.1.7, if possible.
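To make the failure mode concrete: a minimal, hypothetical sketch (class and method names are illustrative, not Infinispan code) of what can go wrong when release removes the per-key lock from the map while other threads are still queued on it - a waiter wakes up holding a lock object that is no longer in the map, while another thread installs a fresh lock for the same key, which is one way superfluous timeouts and broken mutual exclusion can arise.
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not Infinispan code: a per-key lock map where
// release removes the lock even while other threads are queued on it.
public class StaleLockDemo {
    static final ConcurrentMap<String, ReentrantLock> locks =
            new ConcurrentHashMap<String, ReentrantLock>();

    static ReentrantLock acquire(String key) {
        ReentrantLock l = new ReentrantLock();
        ReentrantLock prev = locks.putIfAbsent(key, l);
        if (prev != null) l = prev;
        l.lock();            // a waiter can block here on a lock the owner is about to remove
        return l;
    }

    static void releaseUnsafe(String key, ReentrantLock l) {
        locks.remove(key);   // BUG: removed even if threads are queued - a waiter then
        l.unlock();          // acquires this stale lock while another thread installs
    }                        // a fresh lock for the same key: two "owners" at once
}
{noformat}
The reporter's releaseLock() above avoids exactly this by removing the lock from the map only when no threads are queued on it.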
[JBoss JIRA] (ISPN-2240) Per-key lock container leads to superfluous TimeoutExceptions on concurrent access to same key
by Robert Stupp (JIRA)
[ https://issues.jboss.org/browse/ISPN-2240?page=com.atlassian.jira.plugin.... ]
Robert Stupp commented on ISPN-2240:
------------------------------------
We don't need it at the moment - thanks.
[JBoss JIRA] (ISPN-4424) getCacheEntry is not safe
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4424?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4424:
-----------------------------------------------
Martin Gencur <mgencur@redhat.com> changed the Status of [bug 1110647|https://bugzilla.redhat.com/show_bug.cgi?id=1110647] from ON_QA to VERIFIED
> getCacheEntry is not safe
> -------------------------
>
> Key: ISPN-4424
> URL: https://issues.jboss.org/browse/ISPN-4424
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public (Everyone can see)
> Components: Remote Protocols
> Affects Versions: 6.0.2.Final, 7.0.0.Alpha4
> Reporter: Galder Zamarreño
> Assignee: Galder Zamarreño
> Fix For: 7.0.0.Alpha5, 7.0.0.Final
>
>
> A versioned update with a multi-threaded Hot Rod client results in inconsistency: some replaceWithVersion calls return true, ignoring a version update executed in another thread. Here's a log excerpt from a concurrency stress test:
> {noformat}
> 2014-06-20 16:16:56,798 INFO [PutFromNull] (pool-7-thread-10) count=462,prev=462,new=463
> 2014-06-20 16:16:56,820 INFO [PutFromNull] (pool-7-thread-9) count=463,prev=463,new=464
> 2014-06-20 16:16:56,831 INFO [PutFromNull] (pool-7-thread-2) count=464,prev=463,new=464
> 2014-06-20 16:16:56,845 INFO [PutFromNull] (pool-7-thread-9) count=465,prev=464,new=465
> {noformat}
> Here you can see two threads applying the same replacement, from 463 to 464.
> The issue appears to be the result of a race condition in the Hot Rod server's protocol decoder. When replaceIfUnmodified is received, the cache entry is retrieved to verify whether the version in the server and the version sent in the command match. However, the retrieved cache entry is mutable, and its value can change midway through this operation as a result of another thread updating it. Please find below some log snippets showing this.
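In essence this is a check-then-act race on shared mutable state. A minimal, hypothetical sketch (store layout and names are illustrative, not the actual decoder code):
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the check-then-act race, not the actual decoder
// code: the version check and the write are two separate steps, and the
// entry is shared mutable state in between.
public class ReplaceIfUnmodifiedSketch {
    static class VersionedValue {
        volatile long version;
        volatile String value;
    }

    static final ConcurrentMap<String, VersionedValue> store =
            new ConcurrentHashMap<String, VersionedValue>();

    // Two threads that both read the same version can both return true,
    // exactly the double "success" visible in the log above.
    static boolean replaceIfUnmodified(String key, long clientVersion, String newValue) {
        VersionedValue entry = store.get(key);        // mutable view of the live entry
        if (entry == null || entry.version != clientVersion) {
            return false;                             // step 1: check
        }
        // <-- another thread can update entry.version/value right here
        entry.value = newValue;                       // step 2: act on a stale decision
        entry.version = clientVersion + 1;
        return true;
    }
}
{noformat}
An atomic fix would perform the check and the write as a single step, for example under one lock or via an atomic compare operation against an immutable snapshot of the entry.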
[JBoss JIRA] (ISPN-2240) Per-key lock container leads to superfluous TimeoutExceptions on concurrent access to same key
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-2240?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-2240:
------------------------------------
[~snazy] there seems to be a problem with the attached test: it locks a key twice, but only releases it once. ReentrantPerEntryLockContainer expects every {{lock(owner, key, timeout, unit)}} call to be paired with an {{unlock(owner, key)}} call, so the test never actually releases any locks (a usage sketch follows below).
Once I changed that, the test worked beautifully with 7.0.0.Alpha4. I haven't checked older versions yet; do you still need to work with 5.3/6.0 for some reason?
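For reference, a usage sketch of the pairing described above. The interface here only models the API named in the comment ({{lock(owner, key, timeout, unit)}} / {{unlock(owner, key)}}); the real Infinispan signatures and return types may differ.
{noformat}
import java.util.concurrent.TimeUnit;

// Hypothetical model of the container API per the comment above; the point
// is the pairing discipline: one unlock per successful lock, via try/finally.
public class PairedLockUsage {
    interface PerEntryLockContainer {
        Object lock(Object owner, Object key, long timeout, TimeUnit unit) throws InterruptedException;
        void unlock(Object owner, Object key);
    }

    static void withLock(PerEntryLockContainer c, Object owner, Object key, Runnable work)
            throws InterruptedException {
        if (c.lock(owner, key, 10, TimeUnit.SECONDS) != null) {  // bounded acquire (non-null assumed to mean success)
            try {
                work.run();
            } finally {
                c.unlock(owner, key);  // exactly one unlock per successful lock
            }
        }
    }
}
{noformat}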
[JBoss JIRA] (ISPN-4484) Outbound transfers can be cancelled by old CANCEL_STATE_TRANSFER command
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4484?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4484:
-----------------------------------------------
Dan Berindei <dberinde@redhat.com> changed the Status of [bug 1116969|https://bugzilla.redhat.com/show_bug.cgi?id=1116969] from NEW to MODIFIED
> Outbound transfers can be cancelled by old CANCEL_STATE_TRANSFER command
> ------------------------------------------------------------------------
>
> Key: ISPN-4484
> URL: https://issues.jboss.org/browse/ISPN-4484
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public (Everyone can see)
> Components: Core, State Transfer
> Affects Versions: 6.0.2.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 7.0.0.Alpha5
>
>
> This appeared during the 32-nodes elasticity test in the Hyperion environment.
> Just as apex947 left, a rebalance started, which apex948 dutifully cancelled when it became the new coordinator. apex949 had already requested segments from apex959, so it sent a StateRequestCommand(CANCEL_STATE_TRANSFER) asynchronously to apex959. Then apex948 started a new rebalance, and apex949 asked apex959 for the same segments. When apex959 finally received the cancel request, it didn't check the topology id and incorrectly cancelled the outbound transfer to apex949.
> The solution would be to verify the topology id in the CANCEL_STATE_TRANSFER command before cancelling the transfer (see the sketch below). I also think we can avoid sending the cancel command entirely in this case, and only send it when we are about to stop.
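A minimal, hypothetical sketch of the proposed guard (all names illustrative, not actual Infinispan code): a cancel request is honoured only if its topology id is at least as new as the transfer it targets.
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: drop CANCEL_STATE_TRANSFER requests whose topology
// id is older than the outbound transfer they would cancel.
public class CancelGuardSketch {
    interface OutboundTransfer { int topologyId(); void cancel(); }
    interface Address {}

    final ConcurrentMap<Address, OutboundTransfer> transfers =
            new ConcurrentHashMap<Address, OutboundTransfer>();

    void onCancelStateTransfer(Address requestor, int requestTopologyId) {
        OutboundTransfer transfer = transfers.get(requestor);
        if (transfer == null)
            return;
        if (requestTopologyId < transfer.topologyId())
            return;          // stale cancel from an older rebalance: ignore it
        transfer.cancel();   // the cancel matches the current transfer
    }
}
{noformat}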
[JBoss JIRA] (ISPN-4480) Messages sent to leavers can clog the JGroups bundler thread
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4480?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4480:
-----------------------------------------------
Dan Berindei <dberinde@redhat.com> changed the Status of [bug 1116965|https://bugzilla.redhat.com/show_bug.cgi?id=1116965] from NEW to MODIFIED
> Messages sent to leavers can clog the JGroups bundler thread
> ------------------------------------------------------------
>
> Key: ISPN-4480
> URL: https://issues.jboss.org/browse/ISPN-4480
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public (Everyone can see)
> Components: Core
> Affects Versions: 6.0.2.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
>
> In a stress test that repeatedly kills nodes while performing read/write operations, the TransferQueueBundler thread seems to spend a lot of time waiting for physical addresses:
> {noformat}
> 06:40:10,316 WARN [org.radargun.utils.Utils] (pool-5-thread-1) Stack for thread TransferQueueBundler,default,apex953-14666:
> java.lang.Thread.sleep(Native Method)
> org.jgroups.util.Util.sleep(Util.java:1504)
> org.jgroups.util.Util.sleepRandom(Util.java:1574)
> org.jgroups.protocols.TP.sendToSingleMember(TP.java:1685)
> org.jgroups.protocols.TP.doSend(TP.java:1670)
> org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2476)
> org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2392)
> org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2383)
> java.lang.Thread.run(Thread.java:744)
> {noformat}
> Two bugs related to this are already fixed in JGroups 3.5.0.Beta2+: JGRP-1814 and JGRP-1815.
> There is also a special case where the physical address can be removed from the cache too soon, exacerbating the effect of JGRP-1815: JGRP-1858.
> We can work around the problem by changing the JGroups configuration (a configuration sketch follows the list):
> * TP.logical_addr_cache_expiration=86400000
> ** Only expire addresses after 1 day
> * TP.physical_addr_max_fetch_attempts=1
> ** Sleep for only 20ms waiting for the physical address (default: 3 attempts / 1500ms)
> * UNICAST3.conn_close_timeout=10000
> ** Drop pending messages to leavers sooner
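A sketch of how those three attributes could look in a JGroups XML stack, assuming a UDP transport; all other protocols and attributes are elided, so this fragment is not a complete stack.
{noformat}
<!-- Sketch of the workaround values above (UDP transport assumed;
     the rest of the protocol stack is elided). -->
<config xmlns="urn:org:jgroups">
    <UDP logical_addr_cache_expiration="86400000"
         physical_addr_max_fetch_attempts="1"/>
    <UNICAST3 conn_close_timeout="10000"/>
</config>
{noformat}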
[JBoss JIRA] (ISPN-4154) Cancelled segment transfer causes future entry transfer to be ignored
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4154?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4154:
-----------------------------------------------
Dan Berindei <dberinde@redhat.com> changed the Status of [bug 1116963|https://bugzilla.redhat.com/show_bug.cgi?id=1116963] from NEW to MODIFIED
> Cancelled segment transfer causes future entry transfer to be ignored
> ---------------------------------------------------------------------
>
> Key: ISPN-4154
> URL: https://issues.jboss.org/browse/ISPN-4154
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public (Everyone can see)
> Components: State Transfer
> Affects Versions: 7.0.0.Alpha1
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Critical
>
> Distributed transactional cache.
> 1) The coordinator is gracefully leaving the cluster and sends a REBALANCE_START with topologyId 14; state transfer begins.
> 2) The node receives a chunk from segment X and writes entry K=V to the data container.
> 3) The new coordinator jumps in with a CH_UPDATE for topology 16.
> 4) The node receives CANCEL_STATE_TRANSFER and cancels the transfer of segment X, invalidating K. In CommitManager this operation is tracked and the DiscardPolicy is set to DISCARD_STATE_TRANSFER for key K.
> 5) The new coordinator starts a rebalance with topology 17.
> 6) The node starts a new state transfer for segment X.
> 7) The node receives K=V for segment X, but CommitManager finds the policy set to DISCARD_STATE_TRANSFER and ignores the update.
> Result: the entry value is lost on some node.
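A minimal, hypothetical sketch of the flawed tracking in steps 4-7 (names illustrative, not the actual CommitManager): the discard flag set when the transfer is cancelled is never cleared, so the next rebalance's chunk for the same key is silently dropped.
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, not the actual CommitManager code.
public class DiscardPolicySketch {
    enum DiscardPolicy { NONE, DISCARD_STATE_TRANSFER }

    static final ConcurrentMap<Object, DiscardPolicy> tracked =
            new ConcurrentHashMap<Object, DiscardPolicy>();
    static final ConcurrentMap<Object, Object> container =
            new ConcurrentHashMap<Object, Object>();

    static void onSegmentCancelled(Object key) {
        container.remove(key);                                   // step 4: invalidate K...
        tracked.put(key, DiscardPolicy.DISCARD_STATE_TRANSFER);  // ...and mark it discarded (never cleared)
    }

    static boolean applyStateTransferWrite(Object key, Object value) {
        if (tracked.get(key) == DiscardPolicy.DISCARD_STATE_TRANSFER) {
            return false;   // step 7: the valid chunk from the *new* rebalance is ignored
        }
        container.put(key, value);
        return true;
    }
}
{noformat}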