Bela Ban updated JGRP-2520:
---------------------------
Fix Version/s: 4.2.12
(was: 4.2.11)
CENTRAL_LOCK2: locks not released on kill
-----------------------------------------
Key: JGRP-2520
URL: https://issues.redhat.com/browse/JGRP-2520
Project: JGroups
Issue Type: Bug
Reporter: Bela Ban
Assignee: Bela Ban
Priority: Major
Fix For: 5.2, 4.2.12
2 emails from D. White:
When a node thread is killed, the JChannel/LockService still remains active because the
node JVM is not killed. A new worker thread is created to replace the one that was
killed. In this case, the cluster view has not changed, and therefore the locks remain.
When the node JVM process is killed, that action triggers a cluster view change which is
received by the Coordinator. In this case, the server lock state is rebuilt and the locks
are released.
I think the following will help:
Setup: 3-node cluster, each node with two worker threads. Each set of worker threads has
access to the parent node's JChannel (see the sketch after the lock sequence below).
dwhite-jgroups-node1 (Coordinator)
dwhite-jgroups-node2
dwhite-jgroups-node3
dwhite-jgroups-node2 thread1 acquires lock on resource ENV:ISA_IEA:1
dwhite-jgroups-node2 thread1 acquires lock on resource ENV:GS_GE:1
dwhite-jgroups-node2 thread2 requests lock on resource ENV:ISA_IEA:1
dwhite-jgroups-node1 thread1 requests lock on resource ENV:ISA_IEA:1
dwhite-jgroups-node1 thread2 requests lock on resource ENV:ISA_IEA:1
dwhite-jgroups-node3 thread1 requests lock on resource ENV:ISA_IEA:1
dwhite-jgroups-node3 thread2 requests lock on resource ENV:ISA_IEA:1
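For context, a minimal sketch of this setup, assuming JGroups 4.x and a stack XML
("locking.xml", name illustrative) that includes CENTRAL_LOCK2; the cluster name is
also illustrative:

    import java.util.concurrent.locks.Lock;

    import org.jgroups.JChannel;
    import org.jgroups.blocks.locking.LockService;

    public class Node2 {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel("locking.xml"); // stack includes CENTRAL_LOCK2
            ch.setName("dwhite-jgroups-node2");
            ch.connect("dwhite-cluster");              // illustrative cluster name
            LockService lockService = new LockService(ch); // shared by both workers

            // thread1 acquires both resources (lock() blocks until granted)
            new Thread(() -> {
                Lock isaIea = lockService.getLock("ENV:ISA_IEA:1");
                Lock gsGe   = lockService.getLock("ENV:GS_GE:1");
                isaIea.lock();
                gsGe.lock();
                // ... long-running work ...
            }, "thread1").start();

            // thread2 here, and the worker threads on node1/node3, then call
            // lockService.getLock("ENV:ISA_IEA:1").lock(), queuing GRANT_LOCK
            // requests on the coordinator until the lock becomes free.
        }
    }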
Scenario #1:
dwhite-jgroups-node2 thread1 runs too long, does not respond to soft shutdown, and the
node JVM process is killed by the watchdog service.
[SPEChannelAdapter] viewAccepted received by Coordinator dwhite-jgroups-node1.
Both locks are released.
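A minimal sketch of the view callback such an adapter would implement (SPEChannelAdapter
is the reporter's class and isn't shown here; ReceiverAdapter is the standard JGroups 4.x
hook):

    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    // Stand-in for the reporter's SPEChannelAdapter: logs accepted views.
    // When the node2 JVM dies, the coordinator receives a view without
    // node2, and CENTRAL_LOCK2 rebuilds its lock table, dropping node2's
    // locks.
    public class SPEChannelAdapterSketch extends ReceiverAdapter {
        @Override
        public void viewAccepted(View view) {
            System.out.println("[SPEChannelAdapter] viewAccepted: " + view);
        }
    }
    // installed with ch.setReceiver(new SPEChannelAdapterSketch())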
Scenario #2:
dwhite-jgroups-node2 thread1 runs too long, and the soft shutdown kills thread1, leaving
the server locks in place and the node2 JVM process running.
The watchdog detects locks held too long for ENV:ISA_IEA:1 and ENV:GS_GE:1, and issues
RELEASE_LOCK messages from the Coordinator with the proper Owner.
ENV:GS_GE:1 is released.
ENV:ISA_IEA:1 remains locked, seemingly due to the presence of a GRANT_LOCK request from
dwhite-jgroups-node2 thread2.
Scenario #3 (slight variation on #2):
dwhite-jgroups-node2 thread1 runs too long, and the soft shutdown kills thread1, leaving
the server locks in place and the node2 JVM process running.
The watchdog detects locks held too long on ENV:ISA_IEA:1 and ENV:GS_GE:1 and issues
RELEASE_LOCK from the Coordinator with the proper Owner.
The watchdog also removes the GRANT_LOCK request for ENV:ISA_IEA:1 from dwhite-jgroups-node2
thread2.
Now both locks are released.
The presence of GRANT_LOCK requests from node1 and node3 does not prevent the release of
the lock for ENV:ISA_IEA:1 held by node2.
Email 2:
Yes, we acquire a lock within a try/catch block and release it in a finally block.
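I.e. the standard pattern, sketched here with one of the resource names from above:

    import java.util.concurrent.locks.Lock;

    import org.jgroups.blocks.locking.LockService;

    public class Flow {
        // lockService is the node's shared LockService (see the setup sketch above)
        static void runFlow(LockService lockService) {
            Lock lock = lockService.getLock("ENV:ISA_IEA:1");
            lock.lock();
            try {
                // ... business flow ...
            } catch (RuntimeException e) {
                // handle / log
            } finally {
                lock.unlock(); // never executes when the JVM is force-killed mid-flow
            }
        }
    }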
In production, each JVM has two worker threads. If any of the threads runs for too long, a
monitor task force-kills the JVM process. If there are acquired locks, they do not get
released by the unlock call in the finally block. Usually a JVM is killed because a bad
customer map runs too long, and the other thread, with acquired locks, becomes
"collateral damage". Not every business scenario uses locks. Therefore, the
"orphan lock" scenario doesn't happen every time a JVM process is killed.
Also, both threads are not always active.
We use the CENTRAL_LOCK2 protocol. For some reason the locks acquired by the killed
process may remain in the server locks table. On occasion, the existing Coordinator
doesn't detect the "orphan" locks and revoke them.
Does a view change where the Coordinator has not changed cause that Coordinator to
rebuild the lock state? In a view change where the Coordinator does change, that seems to
fix the problem because the new Coordinator rebuilds the lock state table.
In the case where a new Coordinator is assigned, do the state transfer protocols need to
be in the configuration (e.g. BARRIER, pbcast.STATE_TRANSFER) in order for the new
Coordinator to correctly re-establish the lock state? I don't think so because
CENTRAL_LOCK2 does not use state-transfer; the Coordinator rebuilds the lock state.
To alleviate this problem, we have a lock monitor thread which runs on the Coordinator
node and keeps track of how long each lock has been held. Since no flow can run for more
than an hour, any lock held longer than that is definitely an orphan. The lock monitor
task issues RELEASE_LOCK requests using the owner address of the orphan lock. The
RELEASE_LOCK message works in all cases except where there are pending GRANT_LOCK
requests in the queue from the same owner address as the held lock. If the GRANT_LOCK
requests are from other addresses, the RELEASE_LOCK request works.
In order to simulate the problem, a test application ignores the unlock operation in the
finally block, purposefully creating the "orphan" in the server locks table.
Other instances of the test application are running with normal lock/unlock operations.
The lock monitor thread on the Coordinator subsequently detects the "lock held too
long" orphan condition and issues the RELEASE_LOCK request on behalf of the orphan
lock owner. Whenever a lock is successfully acquired, the lock monitor task internally
keeps track of the acquired timestamp, owner, and lock ID.
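A condensed, application-level sketch of that monitor (names invented here; the actual
revoke, injecting a RELEASE_LOCK with the recorded owner, goes through CENTRAL_LOCK2
internals and is elided):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class LockMonitor {
        // Recorded on every successful acquire: owner and timestamp per lock ID
        static final class Held {
            final String owner;
            final long acquiredMs;
            Held(String owner, long acquiredMs) { this.owner = owner; this.acquiredMs = acquiredMs; }
        }

        private static final long MAX_HOLD_MS = TimeUnit.HOURS.toMillis(1); // no flow runs longer

        private final Map<String, Held> held = new ConcurrentHashMap<>(); // key: lock ID

        public void recordAcquire(String lockId, String owner) {
            held.put(lockId, new Held(owner, System.currentTimeMillis()));
        }

        public void recordRelease(String lockId) {
            held.remove(lockId);
        }

        // Runs on the coordinator node and flags locks held past the ceiling
        public void start() {
            Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
                long now = System.currentTimeMillis();
                held.forEach((lockId, h) -> {
                    if (now - h.acquiredMs > MAX_HOLD_MS) {
                        // Orphan: issue RELEASE_LOCK on behalf of h.owner (elided)
                        System.out.println("orphan lock " + lockId + " held by " + h.owner);
                    }
                });
            }, 1, 1, TimeUnit.MINUTES);
        }
    }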
I'd love to get rid of the complex lock monitor and ensure lock revoke operations are
initiated by the Coordinator via the CENTRAL_LOCK2 protocol.
Another enhancement that would completely solve this problem: Allow a timeout to be
specified for holding a lock. The JGroups protocol would then revoke the lock if the
timeout threshold were reached.
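Purely as a hypothetical sketch of that enhancement; neither the attribute nor the
automatic revocation exists in CENTRAL_LOCK2 today, and the attribute name is invented:

    // HYPOTHETICAL: sketch of the proposed feature, not real JGroups code.
    // The idea is a protocol attribute such as
    //   <CENTRAL_LOCK2 lock_hold_timeout="3600000"/>
    // after which the coordinator revokes a held lock on its own.
    public class HoldTimeoutProposal {
        /** True once a lock granted at grantTimeMs has exceeded the proposed timeout. */
        static boolean expired(long grantTimeMs, long lockHoldTimeoutMs) {
            return lockHoldTimeoutMs > 0
                && System.currentTimeMillis() - grantTimeMs > lockHoldTimeoutMs;
        }
    }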