[infinispan-issues] [JBoss JIRA] (ISPN-2966) NBST: Concurrent leavers can lead to deadlock

Wed Mar 27 08:52:43 EDT 2013

    [ https://issues.jboss.org/browse/ISPN-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763445#comment-12763445 ] 

Dan Berindei commented on ISPN-2966:
------------------------------------

[~mircea.markus] I think this is pretty critical, because the transactions on NodeE are blocked waiting for NodeE to receive transaction data from NodeH.

[~pruivo] After NodeH sends the LEAVE command to the coordinator, it may no longer receive any transactions, so it would be wrong for it to reply with a successful response to a GET_TRANSACTIONS command. Instead of accepting the REBALANCE_START command and replying to GET_TRANSACTIONS normally, we should wake up the GET_TRANSACTIONS thread and send an exception back to NodeE and NodeG.
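
To make the "wake up the GET_TRANSACTIONS thread and send an exception" idea concrete, here is a minimal sketch. It is not Infinispan's actual state transfer code: the class TopologyWaiter and its methods are hypothetical names, and a plain IllegalStateException stands in for whatever exception the real provider would send back to the requestor. The only point it illustrates is that a thread waiting for a topology to be installed is signalled and fails as soon as the node marks itself as leaving, instead of waiting for a topology that will never arrive.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper for illustration only; not part of Infinispan.
final class TopologyWaiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private int installedTopologyId;
    private boolean leaving;

    TopologyWaiter(int initialTopologyId) {
        this.installedTopologyId = initialTopologyId;
    }

    // Called when a new topology has been installed locally.
    void topologyInstalled(int topologyId) {
        lock.lock();
        try {
            installedTopologyId = Math.max(installedTopologyId, topologyId);
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Called once the node has sent its LEAVE command; wakes up all waiters.
    void markLeaving() {
        lock.lock();
        try {
            leaving = true;
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Called by the thread handling a GET_TRANSACTIONS request.
    // Fails fast if the node is leaving, even if the topology is installed.
    void awaitTopology(int requestedTopologyId, long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        long remainingNanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (true) {
                if (leaving) {
                    throw new IllegalStateException(
                            "Node is leaving, cannot serve transactions for topology " + requestedTopologyId);
                }
                if (installedTopologyId >= requestedTopologyId) {
                    return;
                }
                if (remainingNanos <= 0) {
                    throw new TimeoutException("Topology " + requestedTopologyId + " was not installed in time");
                }
                remainingNanos = changed.awaitNanos(remainingNanos);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch the leaving check comes before the topology check on purpose, so a leaver never returns a successful response, which matches the reasoning above.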

There are some steps missing from the description:
{noformat}
8) NodeE delivers the LEAVE message from NodeH
9) NodeE broadcasts a CH_UPDATE message to NodeG and to itself (in the same thread)
10) While processing the CH_UPDATE message, NodeE tries to lock the LocalCacheStatus object. But it can't because the thread that's processing the REBALANCE_START command is already holding the lock.
11a) Eventually the GET_TRANSACTIONS commands time out, NodeE requests and receives transactions from NodeG, and then it can process the LEAVE command from NodeH. 
11b) Alternatively, the LEAVE command times out on NodeH, and when the JGroups channel on NodeH shuts down, the GET_TRANSACTIONS commands on NodeE and NodeG fail with a SuspectException.
{noformat}

If we change step 9) to process the CH_UPDATE message asynchronously, that will be enough to allow NodeH to stop and the GET_TRANSACTIONS commands to fail.
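
As a rough illustration of why the asynchronous hand-off helps, the sketch below models steps 8) to 10) with plain Java threads. Everything in it is an assumption made for the example: AsyncChUpdateSketch, handleLeave and applyChUpdate are hypothetical names, a plain Object monitor stands in for the LocalCacheStatus lock, and a single-thread executor stands in for whatever asynchronous transport or executor the real code would use. The thread that delivers LEAVE submits the CH_UPDATE work and returns immediately, even while another thread (the REBALANCE_START handler in the scenario above) still holds the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model for illustration only; not part of Infinispan.
final class AsyncChUpdateSketch {
    // Stands in for the LocalCacheStatus object that is locked in step 10.
    private final Object localCacheStatus = new Object();
    private final ExecutorService asyncExecutor = Executors.newSingleThreadExecutor();
    private volatile int installedTopologyId = 8;

    // Simulates the REBALANCE_START thread: holds the lock until 'release' is counted down.
    Thread holdStatusLockUntil(CountDownLatch release) throws InterruptedException {
        CountDownLatch acquired = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (localCacheStatus) {
                acquired.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.setDaemon(true);
        holder.start();
        acquired.await(); // return only once the lock is really held
        return holder;
    }

    // LEAVE handler: hands the local CH_UPDATE to another thread instead of
    // processing it inline, so the caller never blocks on the status lock.
    Future<?> handleLeave(int newTopologyId) {
        return asyncExecutor.submit(() -> applyChUpdate(newTopologyId));
    }

    // CH_UPDATE processing: needs the status lock, exactly like step 10.
    private void applyChUpdate(int topologyId) {
        synchronized (localCacheStatus) {
            installedTopologyId = topologyId;
        }
    }

    int installedTopologyId() {
        return installedTopologyId;
    }

    void shutdown() {
        asyncExecutor.shutdown();
    }
}
```

With an inline call to applyChUpdate, the LEAVE handler would block for as long as the lock is held; with the hand-off, only the executor thread waits, and the update is applied once the lock is released.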

> NBST: Concurrent leavers can lead to deadlock
> ---------------------------------------------
>
>                 Key: ISPN-2966
>                 URL: https://issues.jboss.org/browse/ISPN-2966
>             Project: Infinispan
>          Issue Type: Bug
>            Reporter: Pedro Ruivo
>            Assignee: Dan Berindei
>              Labels: state_transfer
>             Fix For: 5.3.0.Final
>
>         Attachments: thread-dump.txt, trace.log
>
>
> This sequence of events leads to a thread deadlock in the coordinator:
> {code}
> 1) NodeF sends LEAVE message. new topologyId=8
> 2) NodeE delivers REBALANCE_START(8)
> 3) NodeF and NodeG deliver REBALANCE_START(8)
> 4) NodeH delivers GET_TRANSACTION(8) from NodeE ==> Transactions were requested by node ConcurrentNonOverlappingLeaveTest-NodeE-28744 with topology 8, greater than the local topology (7). Waiting for topology 8 to be installed locally.
> 5) NodeH sends LEAVE message. new topologyId=9
> 6) NodeH delivers REBALANCE_START(8) ==> Ignoring rebalance 8 for cache dist that doesn't exist locally
> 7) NodeH delivers GET_TRANSACTION(8) from NodeG ==> Transactions were requested by node ConcurrentNonOverlappingLeaveTest-NodeG-31669 with topology 8, greater than the local topology (7). Waiting for topology 8 to be installed locally.
> {code}
> Possible solutions are:
>  - send the REBALANCE_START/CH_UPDATE async
>  - throw an exception when a GET_TRANSACTION is received and the node is shutting down.
