[jboss-jira] [JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever

Tue Jan 23 04:37:00 EST 2018

    [ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13522315#comment-13522315 ] 

Bela Ban commented on JGRP-2234:
--------------------------------

No, I haven't fixed this yet. As per my other comment on a related issue, I'm not too happy about {{CENTRAL_LOCK}}, as it doesn't handle network partitions (split brain scenarios) properly...
I've been thinking of replacing / complementing the backups with a reconciliation protocol that runs after the coordinator leaves: the new coord fetches lock information from all lock holders and builds its lock table from that information. This means it would take a bit longer to serve lock requests after a coord change, but we wouldn't need to send requests from the coord to all backups.
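
A minimal sketch of what such a reconciliation step could look like, assuming each member can report the names of the locks it currently holds. The class {{LockReconciliation}} and the method {{rebuild}} are hypothetical names for illustration only; they are not part of JGroups or {{CENTRAL_LOCK}}, and the message exchange that would collect the reports is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are JGroups API. */
class LockReconciliation {

    /**
     * Builds the new coordinator's lock table (lock name -> owner) from what
     * each member reports (member -> names of the locks it holds).
     * Locks claimed by more than one member are recorded in 'conflicts'
     * instead of being resolved silently.
     */
    static Map<String, String> rebuild(Map<String, List<String>> heldByMember,
                                       Map<String, List<String>> conflicts) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> report : heldByMember.entrySet()) {
            String member = report.getKey();
            for (String lock : report.getValue()) {
                String previous = table.putIfAbsent(lock, member);
                if (previous != null && !previous.equals(member)) {
                    // Assumption: the first claimant stays in the table, all claimants are listed.
                    conflicts.computeIfAbsent(lock, k -> new ArrayList<>(List.of(previous)))
                             .add(member);
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, List<String>> reports = new LinkedHashMap<>();
        reports.put("B", List.of());          // B holds nothing
        reports.put("C", List.of("lockX"));   // C still holds lockX
        Map<String, List<String>> conflicts = new LinkedHashMap<>();
        System.out.println(rebuild(reports, conflicts) + " conflicts=" + conflicts);
    }
}
```

Because the table is built from what the holders themselves report, a lock that its owner has already released would simply not show up, whatever the old backups said. The conflicts map only makes duplicate claims visible; deciding who keeps the lock is a separate question.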

OTOH, this doesn't solve the issue of split brain scenarios and the {{Lock}} abstraction, which is a bad abstraction when locks have to be taken away from someone (after a split heals), and which allows for multiple holders of the same lock during the split...

> Unlocked locks stay locked forever
> ----------------------------------
>
>                 Key: JGRP-2234
>                 URL: https://issues.jboss.org/browse/JGRP-2234
>             Project: JGroups
>          Issue Type: Bug
>            Reporter: Bram Klein Gunnewiek
>            Assignee: Bela Ban
>             Fix For: 4.0.10
>
>         Attachments: ClusterSplitLockTest.java, jg_clusterlock_output_testfail.txt
>
>
> As discussed on the mailing list, we have issues where locks from the central lock protocol stay locked forever when the coordinator of the cluster disconnects. We can reproduce this with the attached ClusterSplitLockTest.java. It's a race condition, and we need to run the test many times (sometimes more than 20) before we encounter a failure.
> What we think is happening: 
> In a three-node cluster (nodes A, B and C, where node A is the coordinator), unlock requests from B and/or C can be missed when node A leaves and B and/or C don't have the new view installed yet. When, for example, node B takes over coordination, it creates the lock table based on the backups. Let's say node C has locked the lock with name 'lockX'. Node C performs an unlock of 'lockX' just after node A (gracefully) leaves, and sends the unlock request to node A since node C doesn't have the correct view installed yet. Node B does have the new view and has recreated the lock table, in which 'lockX' is locked by node C. Node C doesn't resend the unlock request, so 'lockX' stays locked forever.
> Attached is the TestNG test we wrote and the output of a test failure.
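
The sequence described in the report can be replayed in a small single-threaded model. This is only an illustration of the suspected race under simplifying assumptions (no real views, messages or threads); {{StaleViewUnlockRace}}, {{Coordinator}} and {{replay}} are hypothetical names, not JGroups code, and the resend branch is just one conceivable remedy, not the fix chosen for this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the reported race; not JGroups code. */
class StaleViewUnlockRace {

    /** Minimal coordinator: a lock table mapping lock name -> owner. */
    static class Coordinator {
        final Map<String, String> lockTable = new HashMap<>();
        boolean left = false;

        void unlock(String lock, String owner) {
            if (left) return;               // a departed coordinator drops the request
            lockTable.remove(lock, owner);  // remove only if 'owner' really holds it
        }
    }

    /** Replays the reported sequence; returns the owner B still records for lockX. */
    static String replay(boolean resendAfterViewChange) {
        Coordinator a = new Coordinator();
        Coordinator b = new Coordinator();
        a.lockTable.put("lockX", "C");      // C holds lockX at coordinator A
        b.lockTable.putAll(a.lockTable);    // B is a backup and has a copy

        a.left = true;                      // A leaves; B takes over using its backup copy
        Coordinator coordSeenByC = a;       // C has not installed the new view yet
        coordSeenByC.unlock("lockX", "C");  // the unlock goes to A and is lost

        coordSeenByC = b;                   // C installs the new view
        if (resendAfterViewChange) {
            coordSeenByC.unlock("lockX", "C"); // assumed remedy: resend the pending unlock
        }
        return b.lockTable.get("lockX");
    }

    public static void main(String[] args) {
        System.out.println("without resend: owner=" + replay(false));
        System.out.println("with resend:    owner=" + replay(true));
    }
}
```

In this model, B keeps recording C as the owner of 'lockX' unless C repeats the unlock after it has installed the new view, which matches the "locked forever" symptom in the report.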

--
This message was sent by Atlassian JIRA
(v7.5.0#75005)