[JBoss JIRA] (JGRP-2360) DeadLock while acqiring a distributed lock consecutively by the same thread in a loop

Friday, 30 August 2019

    [
https://issues.jboss.org/browse/JGRP-2360?page=com.atlassian.jira.plugin....
] 

Daniel Klosinski commented on JGRP-2360:
----------------------------------------

Hi [~belaban],

Unfortunately, we can't upgrade to 4.x as versions 3&4 are not compatible and we
cannot afford downtime.
I've created a pull request. Could you please review it?

...
 DeadLock while acqiring a distributed lock consecutively by the same
thread in a loop
 -------------------------------------------------------------------------------------

                 Key: JGRP-2360
                 URL: https://issues.jboss.org/browse/JGRP-2360
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.6.18, 4.1.1
         Environment: JGroups-4.1.1-Final
 Red Hat 4.4.7-23
 JDK 1.8.0_202
            Reporter: Daniel Klosinski
            Assignee: Bela Ban
            Priority: Major
             Fix For: 4.1.2

         Attachments: DLTest.java, DistributedLockRepoducer.zip, log.log

 Deadlock intermittently happens when trying to acquire a distributed lock by the same VM,
consecutively by the same thread in a loop. Here is a code snippet for which this issue
can occur :
 {code}
 for(String s : list){
    Lock lock=lock_service.getLock("test_lock_name");
    lock.lock();
    //perform bussines logic
    lock.unlock();
 }
 {code}
 Running such loop I am getting dead look after a few loop iterations. In the attached
logs program hanged after 3 iterations
 During the troubleshooting, I found out that lock_id is not being incremented for the new
distributed lock. In the first two loop iterations everything was fine. At the third
iteration lock_id didn't get increased:
 {code}
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
GRANT_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
LOCK_GRANTED[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
RELEASE_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
RELEASE_LOCK_OK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1,
sender=svc-1-sps-4688]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
RELEASE_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
RELEASE_LOCK_OK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1,
sender=svc-1-sps-4688]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688:
GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
CREATE_LOCK[test_lock_name, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
 2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688:
LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
 {code}
 I've added few extra loggers into Jgroups-4.1.1.Final code and I realized that the
second client lock was not removed from the client lock table before the creation of 3rd
client lock. The issue lays in below piece of code. Owner consists of address and
threadID. If the same thread, on the same VM, creates distributed lock consecutively and
if there is an existing entry in the client lock table for the same owner, the new lock
won't be created. The old client lock will be used to acquire a new distributed lock
:
 {code}
         protected synchronized ClientLock getLock(String name, Owner owner, boolean
create_if_absent) {
             Map<Owner,ClientLock> owners=table.get(name);
             if(owners == null) {
                 if(!create_if_absent)
                     return null;
                 owners=Util.createConcurrentMap(20);
                 Map<Owner,ClientLock> existing=table.putIfAbsent(name,owners);
                 if(existing != null)
                     owners=existing;
             }
             ClientLock lock=owners.get(owner);
             if(lock == null) {
                 if(!create_if_absent)
                     return null;
                 lock=createLock(name, owner);
                 owners.put(owner, lock);
             }
             return lock;
         }
 {code}
 I believe that this issue was introduced by the fix for JGRP-2234 and it is caused by the
race condition. The logic that deletes client lock from the client lock table is now
executed when the client's VM receives RELEASE_LOCK_OK message from the coordinator.
Previously this deletion was executed by the thread in which unlock() method was called.
Now, it is executed by the separate thread which handles RELEASE_LOCK_OK from the
coordinator and this is why we have a race condition here. Here is a sequence which leads
to deadlock:
 1. Create client lock (lock_id=2) 
 2. Send GRANT_LOCK (lock_id=2) to coordinator
 3. Receive LOCK_GRANTED (lock_id=2) from coordinator
 4. Send RELEASE_LOCK (lock_id=2) to coordinator
 5. Call look() method in the same thread (new client lock won't be created as there
is an existing entry in the client lock table for this owner)
 6. Receive RELEASE_LOCK_OK and delete client lock from client lock table.
 7. Send GRANT_LOCK (lock_id=2) to coordinator
 8. Receive LOCK_GRANTED (lock_id=2) from coordinator
 9. No entry in the client lock table. It's not possible to get the thread which needs
to be notified.
 I am attaching a simple program which can be used to reproduce and generated logs.

--
This message was sent by Atlassian Jira
(v7.13.5#713005)

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006