[
https://issues.jboss.org/browse/JGRP-1634?page=com.atlassian.jira.plugin....
]
Manuel Dominguez Sarmiento updated JGRP-1634:
---------------------------------------------
Attachment: AbstractJdkLockManager.java
            DistributedJGroupsLockManager.java
            LockingTest.java
Hi Bela, I have been able to reproduce both problems (a) and (b) as described above, with
a simple test. I have attached our own classes that wrap JGroups locking functionality, as
well as LockingTest.java which contains a main method.
The test uses the attached jgroups.xml and generates many locking requests for different
locks simultaneously. With 3.3.0.CR1 it works fine, and locks are always acquired (the
lock names are randomly generated strings).
With 3.3.0.Final, after a while, the locks are randomly no longer acquired (tryLock
fails), and eventually no locks are obtained at all. This reproduces issue (b) as
originally described.
If you modify DistributedJGroupsLockManager.DEFAULT_TRY_LOCK_TIMEOUT to zero, our class
uses regular tryLock() instead of tryLock(timeout), which allows testing issue (a) as
described above. When you launch main, locks are at first acquired normally; after a
while, threads hang on Object.wait(), as shown in the Eclipse debugger.
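For reference, the core of the test follows roughly this pattern (a minimal local sketch
using plain java.util.concurrent locks in place of the distributed LockService; the class
and names below are illustrative, not the attached LockingTest.java):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockingSketch {
    // Stand-in for the distributed lock table; with JGroups this would be
    // LockService.getLock(name) on a connected JChannel.
    static final ConcurrentHashMap<String, Lock> LOCKS = new ConcurrentHashMap<>();

    static Lock getLock(String name) {
        return LOCKS.computeIfAbsent(name, n -> new ReentrantLock());
    }

    public static void main(String[] args) throws Exception {
        // Generate many requests for randomly named locks, as the test does.
        for (int i = 0; i < 1000; i++) {
            String name = "lock-" + ThreadLocalRandom.current().nextInt(100);
            Lock lock = getLock(name);
            // tryLock(timeout): with a single requester this must always succeed;
            // in the bug report it eventually times out even though the lock is free.
            if (lock.tryLock(100, TimeUnit.MILLISECONDS)) {
                try {
                    // critical section
                } finally {
                    lock.unlock();
                }
            } else {
                throw new IllegalStateException("free lock not acquired: " + name);
            }
        }
        System.out.println("all locks acquired");
    }
}
```

With the JDK locks every acquisition succeeds; against 3.3.0.Final the equivalent
distributed version starts failing after a while.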
LockingService.tryLock() randomly hangs, and tryLock(timeout) does
not acquire the lock even though it is free to be taken
--------------------------------------------------------------------------------------------------------------------------
Key: JGRP-1634
URL: https://issues.jboss.org/browse/JGRP-1634
Project: JGroups
Issue Type: Bug
Affects Versions: 3.3
Environment: JGroups 3.3.0 Final
Reporter: Manuel Dominguez Sarmiento
Assignee: Bela Ban
Fix For: 3.4
Attachments: AbstractJdkLockManager.java, DistributedJGroupsLockManager.java,
jgroups.xml, LockingTest.java
We upgraded from 3.3.0.CR1 to 3.3.0.Final and began to experience all sorts of weird lock
acquisition issues. The symptoms are:
(a) tryLock() randomly hangs
(b) tryLock(timeout) times out, without acquiring the lock (even though it should, as the
lock is only requested from a single node)
This happens both with CENTRAL_LOCK as well as PEER_LOCK. I have attached the
configuration we are using.
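For context, the locking protocol sits at the top of the protocol stack in jgroups.xml.
The attached configuration is not reproduced here; a minimal CENTRAL_LOCK stack looks
roughly like the following (protocols and attribute values illustrative only):

```xml
<config xmlns="urn:org:jgroups">
    <UDP/>
    <PING/>
    <pbcast.NAKACK2/>
    <UNICAST3/>
    <pbcast.GMS/>
    <!-- Replace with PEER_LOCK to test the peer-based variant -->
    <CENTRAL_LOCK num_backups="1"/>
</config>
```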
3.3.0.CR1 worked fine. This bug seems to have been introduced by JGRP-1610. I have
carefully reviewed the code changes introduced by that fix, and they seem to be:
(i) OOB used for lock messages. This should not be causing problems.
(ii) Use of a striped ReentrantLock table instead of synchronized blocks. By itself, this
change alone should not be causing problems.
(iii) Much tighter locking around the server lock table. I think this is where something
goes wrong and deadlocks end up occurring.
The following methods on Locking.java did not even have a synchronized block before, and
now they are protected with the striped ReentrantLocks:
- handleLockRequest()
- handleAwaitRequest()
- handleDeleteAwaitRequest()
- handleSignalRequest()
This is the major change I see that could introduce deadlocks. Other methods that were
already synchronized before (handleCreateLockRequest, handleDeleteLockRequest,
handleCreateAwaitingRequest, handleDeleteAwaitingRequest) are now stripe-locked, which
should not be the cause of problems.
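To illustrate the suspect pattern, a striped lock table works roughly like this (a
minimal sketch, not the actual Locking.java code; class and method names are made up).
The key property is that distinct lock names can hash to the same stripe, so if any
handler blocks, e.g. on Object.wait(), while holding a stripe, every request for every
name on that stripe stalls, which would match the symptoms above:

```java
import java.util.concurrent.locks.ReentrantLock;

public class StripedLocks {
    private final ReentrantLock[] stripes;

    StripedLocks(int numStripes) {
        stripes = new ReentrantLock[numStripes];
        for (int i = 0; i < numStripes; i++)
            stripes[i] = new ReentrantLock();
    }

    // Map a lock name onto a fixed stripe; distinct names may share a stripe.
    ReentrantLock stripeFor(String name) {
        return stripes[(name.hashCode() & 0x7fffffff) % stripes.length];
    }

    // Handlers such as handleLockRequest() now take the stripe lock
    // before touching the server lock table.
    void withStripe(String name, Runnable action) {
        ReentrantLock lock = stripeFor(name);
        lock.lock();
        try {
            action.run(); // must never block indefinitely while holding the stripe
        } finally {
            lock.unlock();
        }
    }
}
```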
I would have liked to provide exact steps to reproduce, but the failure is quite random.
Still, the bug is consistent enough that we see it every single time we deploy our app.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
http://www.atlassian.com/software/jira