[jboss-jira] [JBoss JIRA] (JGRP-2360) DeadLock while acquiring a distributed lock consecutively by the same thread in a loop

Mon Jul 15 16:16:01 EDT 2019

Daniel Klosinski created JGRP-2360:
--------------------------------------

             Summary: DeadLock while acquiring a distributed lock consecutively by the same thread in a loop
                 Key: JGRP-2360
                 URL: https://issues.jboss.org/browse/JGRP-2360
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 4.1.1, 3.6.18
         Environment: JGroups-4.1.1-Final
Red Hat 4.4.7-23
JDK 1.8.0_202
            Reporter: Daniel Klosinski
            Assignee: Bela Ban
         Attachments: DLTest.java, DistributedLockRepoducer.zip, log.log

A deadlock occurs intermittently when the same thread on the same VM acquires a distributed lock repeatedly in a loop. The following snippet can trigger the issue:

{code}
for(String s : list){
   Lock lock=lock_service.getLock("test_lock_name");
   lock.lock();
   // perform business logic
   lock.unlock();
}
{code}

While troubleshooting, I found that lock_id is not incremented for every new distributed lock. The first two loop iterations were fine, but in the third iteration lock_id was not increased and stayed at 2:

{code}
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: RELEASE_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: RELEASE_LOCK_OK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]

2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: RELEASE_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: RELEASE_LOCK_OK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]

2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: CREATE_LOCK[test_lock_name, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
{code}
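A quick way to spot this pattern in a longer trace is to look for GRANT_LOCK requests that repeat the same lock name, owner, and lock_id. The following is a minimal sketch of such a check. `LockIdReuseDetector` and `findReusedGrants` are hypothetical names, not part of JGroups, and the regular expression assumes the message layout shown in the trace above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of JGroups): scans CENTRAL_LOCK trace lines and
// reports GRANT_LOCK requests that reuse a (lock name, owner, lock_id) triple.
class LockIdReuseDetector {
    // Assumed layout: GRANT_LOCK[<name>, lock_id=<n>, owner=<owner>]
    private static final Pattern GRANT = Pattern.compile(
        "GRANT_LOCK\\[([^,\\]]+), lock_id=(\\d+), owner=([^,\\]]+)\\]");

    // Returns one "name|owner|lock_id" entry for each repeated GRANT_LOCK request.
    static List<String> findReusedGrants(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> reused = new ArrayList<>();
        for (String line : lines) {
            Matcher m = GRANT.matcher(line);
            if (!m.find())
                continue; // not a GRANT_LOCK request (e.g. LOCK_GRANTED, RELEASE_LOCK)
            String key = m.group(1) + "|" + m.group(3) + "|" + m.group(2);
            if (!seen.add(key))
                reused.add(key); // same lock_id requested again for this name and owner
        }
        return reused;
    }

    public static void main(String[] args) {
        List<String> trace = Arrays.asList(
            "--> GRANT_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]",
            "<-- LOCK_GRANTED[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]",
            "--> GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]",
            "--> GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]");
        // prints [test_lock_name|svc-2-sps-34594::1|2]
        System.out.println(findReusedGrants(trace));
    }
}
```

A non-empty result only indicates that a client lock was reused for a new acquisition; it does not by itself prove that a deadlock followed.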

I added a few extra log statements to the JGroups-4.1.1.Final code and found that the second client lock had not been removed from the client lock table before the third client lock was created. The issue lies in the code below. An Owner consists of an address and a thread ID. If the same thread on the same VM creates distributed locks consecutively and the client lock table still holds an entry for the same owner, no new lock is created; the old client lock is reused to acquire the new distributed lock:

{code}
        protected synchronized ClientLock getLock(String name, Owner owner, boolean create_if_absent) {
            Map<Owner,ClientLock> owners=table.get(name);
            if(owners == null) {
                if(!create_if_absent)
                    return null;
                owners=Util.createConcurrentMap(20);
                Map<Owner,ClientLock> existing=table.putIfAbsent(name,owners);
                if(existing != null)
                    owners=existing;
            }
            ClientLock lock=owners.get(owner);
            if(lock == null) {
                if(!create_if_absent)
                    return null;
                lock=createLock(name, owner);
                owners.put(owner, lock);
            }
            return lock;
        }
{code}

I believe this issue was introduced by the fix for JGRP-2234 and is caused by a race condition. The logic that deletes the client lock from the client lock table is now executed when the client's VM receives the RELEASE_LOCK_OK message from the coordinator. Previously, this deletion was performed by the thread that called unlock(). Now it is performed by a separate thread that handles RELEASE_LOCK_OK from the coordinator, which is why we have a race condition here.
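The ordering problem can be illustrated with a small, deterministic model. This is only a sketch and not the JGroups implementation: `StaleClientLockModel` and its members are hypothetical, the table is keyed by owner only, and it assumes that lock_id is assigned when the client lock object is created, as the trace suggests. The delayed handling of RELEASE_LOCK_OK is simulated by simply calling `onReleaseLockOk()` late instead of using a second thread.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of the client lock table; NOT the JGroups code.
class StaleClientLockModel {
    static final class ClientLock {
        final int lockId; // assumption: assigned once, when the client lock is created
        ClientLock(int lockId) { this.lockId = lockId; }
    }

    private final Map<String, ClientLock> table = new HashMap<>(); // keyed by owner
    private int nextLockId = 1;

    // Same lookup-or-create shape as getLock() above: an existing entry wins.
    synchronized ClientLock getLock(String owner) {
        ClientLock lock = table.get(owner);
        if (lock == null) {
            lock = new ClientLock(nextLockId++);
            table.put(owner, lock);
        }
        return lock;
    }

    // Stands in for the receiver thread that handles RELEASE_LOCK_OK.
    synchronized void onReleaseLockOk(String owner) {
        table.remove(owner);
    }

    public static void main(String[] args) {
        StaleClientLockModel t = new StaleClientLockModel();
        String owner = "svc-2-sps-34594::1";

        int first = t.getLock(owner).lockId;   // iteration 1: new client lock
        t.onReleaseLockOk(owner);              // removal processed in time

        int second = t.getLock(owner).lockId;  // iteration 2: new client lock
        // RELEASE_LOCK_OK for iteration 2 has not been processed yet ...
        int third = t.getLock(owner).lockId;   // iteration 3: stale client lock is reused
        t.onReleaseLockOk(owner);              // ... and arrives only now

        System.out.println(first + " " + second + " " + third); // prints 1 2 2
    }
}
```

In this model the third acquisition sees the same lock_id as the second, which matches the GRANT_LOCK sequence in the trace. Whether the late removal then discards state that the new acquisition depends on is not modeled here.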

I am attaching a simple program that can be used to reproduce the issue, along with the generated logs.
