]
Daniel Klosinski commented on JGRP-2360:
----------------------------------------
Hi [~belaban],
Unfortunately, we can't upgrade to 4.x as versions 3&4 are not compatible and we
cannot afford downtime.
I've created a pull request. Could you please review it?
DeadLock while acqiring a distributed lock consecutively by the same
thread in a loop
-------------------------------------------------------------------------------------
Key: JGRP-2360
URL:
https://issues.jboss.org/browse/JGRP-2360
Project: JGroups
Issue Type: Bug
Affects Versions: 3.6.18, 4.1.1
Environment: JGroups-4.1.1-Final
Red Hat 4.4.7-23
JDK 1.8.0_202
Reporter: Daniel Klosinski
Assignee: Bela Ban
Priority: Major
Fix For: 4.1.2
Attachments: DLTest.java, DistributedLockRepoducer.zip, log.log
Deadlock intermittently happens when trying to acquire a distributed lock by the same VM,
consecutively by the same thread in a loop. Here is a code snippet for which this issue
can occur :
{code}
for(String s : list){
Lock lock=lock_service.getLock("test_lock_name");
lock.lock();
//perform bussines logic
lock.unlock();
}
{code}
Running such loop I am getting dead look after a few loop iterations. In the attached
logs program hanged after 3 iterations
During the troubleshooting, I found out that lock_id is not being incremented for the new
distributed lock. In the first two loop iterations everything was fine. At the third
iteration lock_id didn't get increased:
{code}
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
GRANT_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
LOCK_GRANTED[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
RELEASE_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
RELEASE_LOCK_OK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1,
sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
RELEASE_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
RELEASE_LOCK_OK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1,
sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
CREATE_LOCK[test_lock_name, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
{code}
I've added few extra loggers into Jgroups-4.1.1.Final code and I realized that the
second client lock was not removed from the client lock table before the creation of 3rd
client lock. The issue lays in below piece of code. Owner consists of address and
threadID. If the same thread, on the same VM, creates distributed lock consecutively and
if there is an existing entry in the client lock table for the same owner, the new lock
won't be created. The old client lock will be used to acquire a new distributed lock
:
{code}
protected synchronized ClientLock getLock(String name, Owner owner, boolean
create_if_absent) {
Map<Owner,ClientLock> owners=table.get(name);
if(owners == null) {
if(!create_if_absent)
return null;
owners=Util.createConcurrentMap(20);
Map<Owner,ClientLock> existing=table.putIfAbsent(name,owners);
if(existing != null)
owners=existing;
}
ClientLock lock=owners.get(owner);
if(lock == null) {
if(!create_if_absent)
return null;
lock=createLock(name, owner);
owners.put(owner, lock);
}
return lock;
}
{code}
I believe that this issue was introduced by the fix for JGRP-2234 and it is caused by the
race condition. The logic that deletes client lock from the client lock table is now
executed when the client's VM receives RELEASE_LOCK_OK message from the coordinator.
Previously this deletion was executed by the thread in which unlock() method was called.
Now, it is executed by the separate thread which handles RELEASE_LOCK_OK from the
coordinator and this is why we have a race condition here. Here is a sequence which leads
to deadlock:
1. Create client lock (lock_id=2)
2. Send GRANT_LOCK (lock_id=2) to coordinator
3. Receive LOCK_GRANTED (lock_id=2) from coordinator
4. Send RELEASE_LOCK (lock_id=2) to coordinator
5. Call look() method in the same thread (new client lock won't be created as there
is an existing entry in the client lock table for this owner)
6. Receive RELEASE_LOCK_OK and delete client lock from client lock table.
7. Send GRANT_LOCK (lock_id=2) to coordinator
8. Receive LOCK_GRANTED (lock_id=2) from coordinator
9. No entry in the client lock table. It's not possible to get the thread which needs
to be notified.
I am attaching a simple program which can be used to reproduce and generated logs.