[
https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin....
]
Bram Klein Gunnewiek commented on JGRP-2234:
--------------------------------------------
Does that solve the problem? E.g., in a cluster [A,B,C] with A as coordinator I could
imagine the following: B sends RELEASE_LOCK to A. A receives the unlock request and sends
back RELEASE_LOCK_OK to B. B receives it, but A dies right after the reply was sent. C
becomes the new coordinator.
Is it guaranteed that C receives/has a lock table in which B's unlock has been processed,
or does the solution only make the failure window smaller?
I was also (briefly) thinking about a solution; mine would be that the new coordinator
asks for confirmation of all locks after a coordinator change. Let's say C has 3 locks in
the lock table after it becomes coordinator, all marked as locked by B. Only 2 of the 3
locks are actually locked; 1 lock was unlocked by B through node A. C would simply ask B
to confirm that the lock table is still up-to-date (e.g. "is it correct that you hold
locks 1, 2 and 3?") and unlock the (already unlocked) lock after B replies "locks 1 and 2
are held by me, lock 3 isn't".
Your solution is quicker and has less overhead; if it is 100% correct, I guess that's the
better option (although acquiring locks in a normal situation becomes a bit slower and
incurs more overhead).
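To make the confirmation idea concrete, here is a minimal, self-contained sketch (plain Java, no JGroups; all names such as reconcile and the map-based lock table are hypothetical, not the actual JGroups implementation): after a coordinator change, the new coordinator asks each recorded owner which locks it still holds and drops every entry the owner no longer claims.

```java
import java.util.*;

public class LockTableReconciliation {

    // Hypothetical model of the coordinator's lock table: lock name -> owning node.
    // confirmedByOwner holds each owner's reply to "which locks do you still hold?".
    static Map<String, String> reconcile(Map<String, String> lockTable,
                                         Map<String, Set<String>> confirmedByOwner) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : lockTable.entrySet()) {
            String lock = e.getKey(), owner = e.getValue();
            Set<String> confirmed =
                confirmedByOwner.getOrDefault(owner, Collections.emptySet());
            // Keep the entry only if the owner confirms it; anything it no longer
            // claims was presumably unlocked via the dead coordinator.
            if (confirmed.contains(lock))
                result.put(lock, owner);
        }
        return result;
    }

    public static void main(String[] args) {
        // C became coordinator with 3 locks recorded for B, but B had already
        // released lock3 through the old coordinator A before A died.
        Map<String, String> table = new HashMap<>();
        table.put("lock1", "B");
        table.put("lock2", "B");
        table.put("lock3", "B");

        Map<String, Set<String>> confirmations = new HashMap<>();
        confirmations.put("B", new HashSet<>(Arrays.asList("lock1", "lock2")));

        // Only lock1 and lock2 survive; the stale lock3 entry is dropped.
        System.out.println(new TreeSet<>(reconcile(table, confirmations).keySet()));
    }
}
```

The cost of this approach is one extra round of confirmation messages per view change, but only on coordinator failover rather than on every lock acquisition.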
Unlocked locks stay locked forever
----------------------------------
Key: JGRP-2234
URL:
https://issues.jboss.org/browse/JGRP-2234
Project: JGroups
Issue Type: Bug
Reporter: Bram Klein Gunnewiek
Assignee: Bela Ban
Fix For: 4.0.11
Attachments: ClusterSplitLockTest.java, jg_clusterlock_output_testfail.txt
As discussed on the mailing list, we have issues where locks from the central lock
protocol stay locked forever when the coordinator of the cluster disconnects. We can
reproduce this with the attached ClusterSplitLockTest.java. It's a race condition, and we
need to run the test many times (sometimes > 20) before we encounter a failure.
What we think is happening:
In a three-node cluster (nodes A, B and C, where node A is the coordinator), unlock
requests from B and/or C can be missed when node A leaves and B and/or C don't have the
new view installed yet. When, for example, node B takes over coordination, it recreates
the lock table from the backups. Let's say node C holds the lock named 'lockX'. Node C
unlocks 'lockX' just after node A (gracefully) leaves and sends the unlock request to
node A, since node C doesn't have the new view installed yet. Node B has recreated the
lock table in which 'lockX' is held by node C. Node C doesn't resend the unlock request,
so 'lockX' stays locked forever.
Attached are the TestNG test we wrote and the output of a failing run.
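The suspected timeline can be sketched as a toy simulation (plain Java, no JGroups; the Node class and simulate method are illustrative, not JGroups code): C's unlock goes to the node it still believes is the coordinator, and since A is gone the request is simply lost, so B's rebuilt lock table keeps 'lockX'.

```java
import java.util.*;

public class LostUnlockSketch {

    // Toy node: holds a lock table only while acting as coordinator.
    static class Node {
        final String name;
        Map<String, String> lockTable;   // lock name -> owner
        boolean alive = true;
        Node(String name) { this.name = name; }

        void handleUnlock(String lock) {
            if (alive && lockTable != null)
                lockTable.remove(lock);
            // A dead node never processes the request -- this is the race.
        }
    }

    // Returns the new coordinator's lock table after the race has played out.
    static Map<String, String> simulate() {
        Node a = new Node("A"), b = new Node("B");

        // A is coordinator; C holds lockX (recorded in A's table).
        a.lockTable = new HashMap<>();
        a.lockTable.put("lockX", "C");

        // A leaves gracefully; B recreates the lock table from the backups.
        b.lockTable = new HashMap<>(a.lockTable);
        a.alive = false;

        // C hasn't installed the new view yet, so its unlock is sent to A
        // and is lost; nobody resends it to B.
        a.handleUnlock("lockX");

        return b.lockTable;
    }

    public static void main(String[] args) {
        // B's table still shows lockX as held by C, forever.
        System.out.println(simulate());   // {lockX=C}
    }
}
```

This matches the failure the test provokes: nothing in the sequence ever delivers the unlock to the new coordinator, so no amount of waiting releases the lock.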
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)