[JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.... ]
Bela Ban edited comment on JGRP-2234 at 2/6/18 7:23 AM:
--------------------------------------------------------
Clients need to have the following information:
* Locks they acquired
* Pending lock requests; locks which they want to acquire but for which they haven't yet received a LOCK_GRANTED response
* Pending lock release requests; locks that have been released, but for which no RELEASE_LOCK_OK response has been received
* Ditto for conditions, but we'll tackle them in a second stage
The reconciliation protocol queues all new requests on the coord and asks all members for their lock information. Once the coord has received this information from all members, it applies this and then drains the queue of pending requests.
It is important that the requests are ordered per member, i.e. a release(L) cannot come before a lock(L).
Since {{CENTRAL_LOCK}} allows for multiple members to hold the same lock in a split brain scenario, we need to think about how to handle merging where the coord detects that multiple members hold the same lock...
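A rough sketch of how this could look on the new coord; all class and method names below are invented for illustration and are not the actual {{CENTRAL_LOCK}} implementation:
{code:java}
import org.jgroups.Address;

import java.util.*;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only; invented names, not the real CENTRAL_LOCK code.
public class ReconcilingCoord {
    // one lock(L) or release(L) operation of a member
    record Request(String lock, boolean acquire) {}
    // what each client reports: held locks plus pending acquires/releases, in its own order
    record ClientLockInfo(List<Request> orderedRequests) {}

    private final Queue<Request> queuedRequests = new ConcurrentLinkedQueue<>();
    private volatile boolean reconciling; // while true, new requests go into queuedRequests

    public void becomeCoordinator(Collection<Address> members) {
        reconciling = true;                           // 1. queue all new requests from now on
        for (Address mbr : members) {                 // 2. ask every member for its lock info
            ClientLockInfo info = fetchLockInfo(mbr);
            // 3. apply in the member's own order, so release(L) never precedes lock(L)
            info.orderedRequests().forEach(this::apply);
        }
        reconciling = false;
        Request req;
        while ((req = queuedRequests.poll()) != null) // 4. drain the queue of pending requests
            apply(req);
    }

    private ClientLockInfo fetchLockInfo(Address mbr) { /* RPC to mbr; omitted */ return new ClientLockInfo(List.of()); }
    private void apply(Request req) { /* update the coord's lock table; omitted */ }
}
{code}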
was (Author: belaban):
Clients need to have the following information:
* Locks they acquired
* Pending lock requests; locks which they want to acquire but for which they haven't yet received a LOCK_GRANTED response
* Pending lock release requests; locks that have been released, but for which no RELEASE_LOCK_OK response has been received
* Ditto for conditions, but we'll tackle them in a second stage
The reconciliation protocol queues all new requests on the coord and asks all members for their lock information. Once the coord has received this information from all members, it applies this and then drains the queue of pending requests.
It is important that the requests are ordered per member, i.e. a release(L) cannot come before a lock(L).
Since {{CENTRAL_LOCK}} allows for multiple members to hold the same lock in a split brain scenario, we need to think about how to handle merging where the coord detects that multiple members hold the same lock...
> Unlocked locks stay locked forever
> ----------------------------------
>
> Key: JGRP-2234
> URL: https://issues.jboss.org/browse/JGRP-2234
> Project: JGroups
> Issue Type: Bug
> Reporter: Bram Klein Gunnewiek
> Assignee: Bela Ban
> Fix For: 4.0.11
>
> Attachments: ClusterSplitLockTest.java, jg_clusterlock_output_testfail.txt
>
>
> As discussed on the mailing list, we have issues where locks from the central lock protocol stay locked forever when the coordinator of the cluster disconnects. We can reproduce this with the attached ClusterSplitLockTest.java. It's a race condition, and we need to run the test many times (sometimes > 20) before we encounter a failure.
> What we think is happening:
> In a three-node cluster (nodes A, B and C, where node A is the coordinator), unlock requests from B and/or C can be missed when node A leaves and B and/or C don't have the new view installed yet. When, for example, node B takes over coordination, it creates the lock table based on the backups. Let's say node C has locked the lock named 'lockX'. Node C performs an unlock of 'lockX' just after node A (gracefully) leaves and sends the unlock request to node A, since node C doesn't have the correct view installed yet. Node B has recreated the lock table in which 'lockX' is locked by node C. Node C doesn't resend the unlock request, so 'lockX' stays locked forever.
> Attached are the TestNG test we wrote and the output of a test failure.
[JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.... ]
Bela Ban edited comment on JGRP-2234 at 2/6/18 7:23 AM:
--------------------------------------------------------
Clients need to have the following information:
* Locks they acquired
* Pending lock requests; locks which they want to acquire but for which they haven't yet received a LOCK_GRANTED response
* Pending lock release requests; locks that have been released, but for which no RELEASE_LOCK_OK response has been received
* Ditto for conditions, but we'll tackle them in a second stage
The reconciliation protocol queues all new requests on the coord and asks all members for their lock information. Once the coord has received this information from all members, it applies this and then drains the queue of pending requests.
It is important that the requests are ordered per member, i.e. a release(L) cannot come before a lock(L).
Since {{CENTRAL_LOCK}} allows for multiple members to hold the same lock in a split brain scenario, we need to think about how to handle merging where the coord detects that multiple members hold the same lock...
was (Author: belaban):
Clients need to have the following information:
* Locks they acquired
* Pending lock requests; locks which they want to acquire but for which they haven't yet received a LOCK_GRANTED response
* Pending lock release requests; locks that have been released, but for which no RELEASE_LOCK_OK response has been received
The reconciliation protocol queues all new requests on the coord and asks all members for their lock information. Once the coord has received this information from all members, it applies this and then drains the queue of pending requests.
It is important that the requests are ordered per member, i.e. a release(L) cannot come before a lock(L).
Since {{CENTRAL_LOCK}} allows for multiple members to hold the same lock in a split brain scenario, we need to think about how to handle merging where the coord detects that multiple members hold the same lock...
[JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on JGRP-2234:
--------------------------------
Clients need to have the following information:
* Locks they acquired
* Pending lock requests; locks which they want to acquire but for which they haven't yet received a LOCK_GRANTED response
* Pending lock release requests; locks that have been released, but for which no RELEASE_LOCK_OK response has been received
The reconciliation protocol queues all new requests on the coord and asks all members for their lock information. Once the coord has received this information from all members, it applies this and then drains the queue of pending requests.
It is important that the requests are ordered per member, i.e. a release(x) cannot come before a lock(x).
Since {{CENTRAL_LOCK}} allows for multiple members to hold the same lock in a split brain scenario, we need to think about how to handle merging where the coord detects that multiple members hold the same lock...
[JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.... ]
Bela Ban edited comment on JGRP-2234 at 2/6/18 7:22 AM:
--------------------------------------------------------
Clients need to have the following information:
* Locks they acquired
* Pending lock requests; locks which they want to acquire but for which they haven't yet received a LOCK_GRANTED response
* Pending lock release requests; locks that have been released, but for which no RELEASE_LOCK_OK response has been received
The reconciliation protocol queues all new requests on the coord and asks all members for their lock information. Once the coord has received this information from all members, it applies this and then drains the queue of pending requests.
It is important that the requests are ordered per member, i.e. a release(L) cannot come before a lock(L).
Since {{CENTRAL_LOCK}} allows for multiple members to hold the same lock in a split brain scenario, we need to think about how to handle merging where the coord detects that multiple members hold the same lock...
was (Author: belaban):
Clients need to have the following information:
* Locks they acquired
* Pending lock requests; locks which they want to acquire but for which they haven't yet received a LOCK_GRANTED response
* Pending lock release requests; locks that have been released, but for which no RELEASE_LOCK_OK response has been received
The reconciliation protocol queues all new requests on the coord and asks all members for their lock information. Once the coord has received this information from all members, it applies this and then drains the queue of pending requests.
It is important that the requests are ordered per member, i.e. a release(x) cannot come before a lock(x).
Since {{CENTRAL_LOCK}} allows for multiple members to hold the same lock in a split brain scenario, we need to think about how to handle merging where the coord detects that multiple members hold the same lock...
[JBoss JIRA] (WFLY-9763) Move legacy subsystems to a legacy feature pack
by Jeff Mesnil (JIRA)
[ https://issues.jboss.org/browse/WFLY-9763?page=com.atlassian.jira.plugin.... ]
Jeff Mesnil updated WFLY-9763:
------------------------------
Description:
This issue is the result of the proposal to move legacy subsystems into a legacy feature pack, as explained in http://wildfly-development.1055759.n5.nabble.com/proposal-Move-legacy-ext...
Impacted subsystems:
* cmp
* configadmin
* jaxr
* messaging
* web
GA of the created feature-pack: org.wildfly:wildfly-legacy-feature-pack
The version will be final and will start at 12.0.0.Alpha1, so that there is no confusion with the GAV of the subsystems.
Proposed GitHub repository: github.com/wildfly/wildfly-legacy
All the subsystems will keep the same versions as WildFly for the time being.
The main functional change is that migration tests are no longer in the legacy subsystems but in the new subsystems that take responsibility for the migration (e.g. undertow for web, messaging-activemq for messaging).
was:
This issue is the result of the proposal to move legacy subsystems into a legacy feature pack, as explained in http://wildfly-development.1055759.n5.nabble.com/proposal-Move-legacy-ext...
Impacted subsystems:
* cmp
* configadmin
* jaxr
* messaging
* web
GAV of the created feature-pack: org.wildfly:wildfly-legacy-feature-pack:${wildfly.version}
Proposed GitHub repository: github.com/wildfly/wildfly-legacy
All the subsystems will keep the same versions as WildFly for the time being.
The main functional change is that migration tests are no longer in the legacy subsystems but in the new subsystems that take responsibility for the migration (e.g. undertow for web, messaging-activemq for messaging).
> Move legacy subsystems to a legacy feature pack
> -----------------------------------------------
>
> Key: WFLY-9763
> URL: https://issues.jboss.org/browse/WFLY-9763
> Project: WildFly
> Issue Type: Task
> Affects Versions: 11.0.0.Final
> Reporter: Jeff Mesnil
> Assignee: Jeff Mesnil
>
> This issue is the result of the proposal to move legacy subsystems into a legacy feature pack, as explained in http://wildfly-development.1055759.n5.nabble.com/proposal-Move-legacy-ext...
> Impacted subsystems:
> * cmp
> * configadmin
> * jaxr
> * messaging
> * web
> GA of the created feature-pack: org.wildfly:wildfly-legacy-feature-pack
> The version will be final and will start at 12.0.0.Alpha1, so that there is no confusion with the GAV of the subsystems.
> Proposed GitHub repository: github.com/wildfly/wildfly-legacy
> All the subsystems will keep the same versions as WildFly for the time being.
> The main functional change is that migration tests are no longer in the legacy subsystems but in the new subsystems that take responsibility for the migration (e.g. undertow for web, messaging-activemq for messaging).
[JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.... ]
Bela Ban edited comment on JGRP-2234 at 2/6/18 6:49 AM:
--------------------------------------------------------
Yes, I guess my solution would require lock acquire/release acks to be sent back to the requesters only after successfully copying to the backup member(s). This increases latency quite a bit, and that's why I tend to favor your proposal.
If you read my comment dated Jan 23, this is actually what I had in mind with the reconciliation protocol I mentioned. This would reduce latency for lock acquisition/release, which is critical, at the cost of a more involved reconciliation protocol when the coord changes. I do like the tradeoff though...
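For comparison, a sketch of the first alternative (acking the requester only after the backups have been updated); again, the names are invented for illustration, not actual JGroups API:
{code:java}
import org.jgroups.Address;

import java.util.List;

// Illustrative sketch only; invented names, not the real CENTRAL_LOCK code.
class SyncBackupCoord {
    private final List<Address> backups;

    SyncBackupCoord(List<Address> backups) { this.backups = backups; }

    void handleLockRequest(Address requester, String lock) {
        grantLocally(lock, requester);                       // update the coord's own lock table
        for (Address backup : backups)                       // replicate synchronously:
            replicateAndWaitForAck(backup, lock, requester); // one extra round trip per backup...
        sendLockGranted(requester, lock);                    // ...before the requester sees LOCK_GRANTED
    }

    private void grantLocally(String lock, Address owner) { /* omitted */ }
    private void replicateAndWaitForAck(Address backup, String lock, Address owner) { /* omitted */ }
    private void sendLockGranted(Address requester, String lock) { /* omitted */ }
}
{code}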
was (Author: belaban):
Yes, I guess my solution would require lock acquire/release acks to be sent back to the requesters only after successfully copying to the backup member(s). This increases latency quite a bit, and that's why I tend to favor your proposal.
If you read my comment dated Jan 23, this is actually what I had in mind with the reconciliation protocol I mentioned. This would reduce latency for lock acquisition/release, which is critical, at the cost of a more involved reconciliation protocol when the coord changes. I do like the tradeoff though...
[JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on JGRP-2234:
--------------------------------
Yes, I guess my solution would require lock acquire/release acks to be sent back to the requesters only after successfully copying to the backup member(s). This increases latency quite a bit, and that's why I tend to favor your proposal.
If you read my comment dated Jan 23, this is actually what I had in mind with the reconciliation protocol I mentioned. This would reduce latency for lock acquisition/release, which is critical, at the cost of a more involved reconciliation protocol when the coord changes. I do like the tradeoff though...
[JBoss JIRA] (JGRP-2234) Unlocked locks stay locked forever
by Bram Klein Gunnewiek (JIRA)
[ https://issues.jboss.org/browse/JGRP-2234?page=com.atlassian.jira.plugin.... ]
Bram Klein Gunnewiek commented on JGRP-2234:
--------------------------------------------
Does that solve the problem? E.g., in a cluster [A,B,C] with A as coordinator, I could imagine the following: B sends RELEASE_LOCK to A. A receives the unlock request and sends RELEASE_LOCK_OK back to B. B receives it, but A dies right after the reply was sent. C becomes the new coordinator.
Is it guaranteed that C receives/has a lock table in which the unlock from B has been processed, or does the solution only make the failure window smaller?
I was also (briefly) thinking about a solution; mine would be for the new coordinator to ask for confirmation of all locks after a coordinator change. Let's say C has 3 locks in the lock table after it became the coordinator, all marked as locked by B. Only 2 of the 3 locks are actually locked; 1 lock was unlocked by B through node A. C would simply ask B to confirm that the lock table is still up to date (e.g. "is it correct that you have locked locks 1, 2 and 3?") and unlock the (already unlocked) lock after B replies "locks 1 and 2 are locked by me, lock 3 isn't". See the sketch below.
Your solution is quicker and has less overhead; if it is 100% correct, I guess that's the better option (although acquiring locks in a normal situation is a bit slower and incurs more overhead).
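A sketch of this confirmation idea; the names are invented for illustration, not actual JGroups API:
{code:java}
import org.jgroups.Address;

import java.util.*;

// Illustrative sketch only; invented names, not the real CENTRAL_LOCK code.
class ConfirmingCoord {
    // lock name -> recorded holder, as rebuilt from the backups
    private final Map<String,Address> lockTable = new HashMap<>();

    void confirmLockTableAfterCoordChange() {
        Map<Address,Set<String>> byHolder = new HashMap<>();  // group recorded locks by holder
        lockTable.forEach((lock, holder) ->
            byHolder.computeIfAbsent(holder, h -> new HashSet<>()).add(lock));

        byHolder.forEach((holder, recorded) -> {
            // "is it correct that you have locked locks 1, 2 and 3?"
            Set<String> confirmed = askHolder(holder, recorded);
            recorded.removeAll(confirmed);        // whatever the holder no longer claims...
            recorded.forEach(lockTable::remove);  // ...was already unlocked (e.g. via the old coord)
        });
    }

    private Set<String> askHolder(Address holder, Set<String> locks) {
        /* RPC to the holder; omitted. Stub: pretend everything is confirmed. */
        return new HashSet<>(locks);
    }
}
{code}
The confirmation round only runs on a coordinator change, so the common lock/unlock path stays as cheap as it is today.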