[JBoss JIRA] (ISPN-2966) NBST: Concurrent leavers can lead to deadlock

Wednesday, 27 March 2013

    [ https://issues.jboss.org/browse/ISPN-2966?page=com.atlassian.jira.plugin.... ]

Dan Berindei commented on ISPN-2966:
------------------------------------

[~mircea.markus] I think this is pretty critical, because the transactions on NodeE are
blocked waiting for NodeE to receive transaction data from NodeH.

[~pruivo] After NodeH sends the LEAVE command to the coordinator, it may no longer receive
any transactions, so it would be wrong for it to reply to a GET_TRANSACTIONS command with a
successful response. Instead of accepting the REBALANCE_START command and replying to
GET_TRANSACTIONS normally, we should wake up the GET_TRANSACTIONS thread and send an
exception back to NodeE and NodeG.
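
A rough sketch of the "wake up the waiting thread and fail it" idea, to make the intent concrete. All names here (TopologyGate, markLeaving, awaitTopology) are hypothetical and are not taken from the Infinispan code base. The sketch only assumes that the GET_TRANSACTIONS handler waits on a condition until the requested topology is installed, and that sending LEAVE can flip a flag on the same lock.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper, not Infinispan's actual state transfer code.
final class TopologyGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private int topologyId;
    private boolean leaving;

    TopologyGate(int initialTopologyId) {
        this.topologyId = initialTopologyId;
    }

    // Called when a new topology is installed locally.
    void installTopology(int newTopologyId) {
        lock.lock();
        try {
            topologyId = Math.max(topologyId, newTopologyId);
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Called right after the node sends LEAVE: wakes every waiting request.
    void markLeaving() {
        lock.lock();
        try {
            leaving = true;
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Blocks until the requested topology is installed. Throws if the node
    // is leaving, so the requestor gets an exception instead of a reply.
    // Interrupt handling is left out to keep the sketch short.
    void awaitTopology(int requestedTopologyId) {
        lock.lock();
        try {
            while (topologyId < requestedTopologyId && !leaving) {
                changed.awaitUninterruptibly();
            }
            if (leaving) {
                throw new IllegalStateException(
                        "Node is leaving, cannot serve topology " + requestedTopologyId);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

With something like this, a GET_TRANSACTIONS request for topology 8 that is parked on NodeH would be released as soon as NodeH calls markLeaving(), and NodeE and NodeG would see a failure they can react to instead of waiting for a timeout.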

There are some steps missing from the description:
{noformat}
8) NodeE delivers the LEAVE message from NodeH
9) NodeE broadcasts a CH_UPDATE message to NodeG and to itself (in the same thread)
10) While processing the CH_UPDATE message, NodeE tries to lock the LocalCacheStatus
object. But it can't because the thread that's processing the REBALANCE_START
command is already holding the lock.
11a) Eventually the GET_TRANSACTIONS commands time out, NodeE requests and receives
transactions from NodeG, and then it can process the LEAVE command from NodeH. 
11b) Alternatively, the LEAVE command times out on NodeH, and when the JGroups channel on
NodeH shuts down the GET_TRANSACTIONS commands on NodeE and NodeG fail with a
SuspectException.
{noformat}

If we change step 9) to process the CH_UPDATE message asynchronously, that will be enough
to allow NodeH to stop and for the GET_TRANSACTIONS commands to fail.
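
To illustrate the asynchronous variant of step 9), here is a minimal, self-contained sketch. The class and method names are made up for the example; a plain Object stands in for the LocalCacheStatus lock and a single-threaded executor stands in for whatever thread pool would really be used. The point is only that the thread handling LEAVE hands the local CH_UPDATE off and returns, even while another thread still holds the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the actual Infinispan topology manager.
final class AsyncChUpdateSketch {
    // Stands in for the LocalCacheStatus object used as a lock.
    private final Object localCacheStatus = new Object();
    private final ExecutorService asyncExecutor = Executors.newSingleThreadExecutor();
    private volatile int installedTopologyId = 7;

    // Called on the thread that delivered LEAVE. The remote send is omitted;
    // the local CH_UPDATE is queued instead of being processed in-thread.
    void broadcastChUpdate(int newTopologyId) {
        asyncExecutor.execute(() -> handleChUpdate(newTopologyId));
    }

    private void handleChUpdate(int newTopologyId) {
        synchronized (localCacheStatus) {
            installedTopologyId = newTopologyId;
        }
    }

    // Simulates a REBALANCE_START thread holding the lock while LEAVE arrives.
    static boolean demo() {
        AsyncChUpdateSketch node = new AsyncChUpdateSketch();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch releaseLock = new CountDownLatch(1);
        Thread rebalanceThread = new Thread(() -> {
            synchronized (node.localCacheStatus) {
                lockHeld.countDown();
                try {
                    releaseLock.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        rebalanceThread.start();
        try {
            lockHeld.await();
            node.broadcastChUpdate(9); // returns although the lock is still held
            boolean notYetApplied = node.installedTopologyId == 7;
            releaseLock.countDown();   // the rebalance thread lets go of the lock
            node.asyncExecutor.shutdown();
            boolean finished = node.asyncExecutor.awaitTermination(5, TimeUnit.SECONDS);
            return notYetApplied && finished && node.installedTopologyId == 9;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            releaseLock.countDown();
            node.asyncExecutor.shutdown();
        }
    }
}
```

If handleChUpdate were called directly from broadcastChUpdate, the calling thread in demo() would block on the lock, which is the situation described in step 10).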

...
 NBST: Concurrent leavers can lead to deadlock
 ---------------------------------------------

                 Key: ISPN-2966
                 URL: https://issues.jboss.org/browse/ISPN-2966
             Project: Infinispan
          Issue Type: Bug
            Reporter: Pedro Ruivo
            Assignee: Dan Berindei
              Labels: state_transfer
             Fix For: 5.3.0.Final

         Attachments: thread-dump.txt, trace.log

 This sequence of events leads to a thread deadlock in the coordinator:
 {code}
 1) NodeF sends LEAVE message. new topologyId=8
 2) NodeE delivers REBALANCE_START(8)
 3) NodeF and NodeG deliver REBALANCE_START(8)
 4) NodeH delivers GET_TRANSACTIONS(8) from NodeE ==> Transactions were requested by
node ConcurrentNonOverlappingLeaveTest-NodeE-28744 with topology 8, greater than the local
topology (7). Waiting for topology 8 to be installed locally.
 5) NodeH sends LEAVE message. new topologyId=9
 6) NodeH delivers REBALANCE_START(8) ==> Ignoring rebalance 8 for cache dist that
doesn't exist locally
 7) NodeH delivers GET_TRANSACTIONS(8) from NodeG ==> Transactions were requested by
node ConcurrentNonOverlappingLeaveTest-NodeG-31669 with topology 8, greater than the local
topology (7). Waiting for topology 8 to be installed locally.
 {code}
 Possible solutions are:
  - send the REBALANCE_START/CH_UPDATE commands asynchronously
  - throw an exception when a GET_TRANSACTIONS command is received while the node is
    shutting down.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
