[
https://issues.jboss.org/browse/ISPN-2966?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-2966:
------------------------------------
[~mircea.markus] I think this is pretty critical, because the transactions on NodeE are
blocked waiting for NodeE to receive transaction data from NodeH.
[~pruivo] After NodeH sends the LEAVE command to the coordinator, it may no longer receive
any transactions. So it would be wrong for it to reply with a successful response
to a GET_TRANSACTIONS command. Instead of accepting the REBALANCE_START command and
replying to GET_TRANSACTIONS normally, we should wake up the GET_TRANSACTIONS thread and
send an exception back to NodeE and NodeG.
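A rough sketch of what "wake up the GET_TRANSACTIONS thread and send an exception back" could look like. This is only an illustration: TopologyWaiter, markLeaving() and awaitTopology() are hypothetical names, not the actual state transfer classes, and IllegalStateException stands in for whatever exception would really be sent back to the requestor.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; these are not Infinispan's real classes. */
class TopologyWaiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private int installedTopologyId;
    private boolean leaving;

    TopologyWaiter(int initialTopologyId) {
        this.installedTopologyId = initialTopologyId;
    }

    /** Called when a new topology has been installed locally. */
    void installTopology(int topologyId) {
        lock.lock();
        try {
            installedTopologyId = Math.max(installedTopologyId, topologyId);
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Called right after this node sends LEAVE; wakes up every waiting handler. */
    void markLeaving() {
        lock.lock();
        try {
            leaving = true;
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Blocks a GET_TRANSACTIONS handler until the requested topology is installed.
     * Throws if the node is leaving (even if the topology is installed) or on timeout.
     */
    void awaitTopology(int requestedTopologyId, long timeoutMillis) throws InterruptedException {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (!leaving && installedTopologyId < requestedTopologyId) {
                if (remainingNanos <= 0) {
                    throw new IllegalStateException(
                            "Timed out waiting for topology " + requestedTopologyId);
                }
                remainingNanos = changed.awaitNanos(remainingNanos);
            }
            if (leaving) {
                // A leaving node must not answer with a successful response.
                throw new IllegalStateException(
                        "Node is leaving, cannot provide transactions for topology "
                                + requestedTopologyId);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

With something like this, the handlers blocked in steps 4) and 7) would fail as soon as NodeH calls markLeaving(), instead of waiting for topology 8 to be installed locally.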
There are some steps missing from the description:
{noformat}
8) NodeE delivers the LEAVE message from NodeH
9) NodeE broadcasts a CH_UPDATE message to NodeG and to itself (in the same thread)
10) While processing the CH_UPDATE message, NodeE tries to lock the LocalCacheStatus
object. But it can't because the thread that's processing the REBALANCE_START
command is already holding the lock.
11a) Eventually the GET_TRANSACTIONS commands time out, NodeE requests and receives
transactions from NodeG, and then it can process the LEAVE command from NodeH.
11b) Alternatively, the LEAVE command times out on NodeH, and when the JGroups channel on
NodeH shuts down the GET_TRANSACTIONS commands on NodeE and NodeG fail with a
SuspectException.
{noformat}
If we change step 9) to process the CH_UPDATE message asynchronously, that will be enough
to allow NodeH to stop and the GET_TRANSACTIONS commands to fail.
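To illustrate the asynchronous variant of step 9), here is a minimal standalone sketch. All names are hypothetical: cacheStatusLock stands in for the lock on the LocalCacheStatus object, and a plain single-thread executor stands in for whatever executor would really be used. The thread handling LEAVE hands the local CH_UPDATE to the executor instead of running it inline, so it returns even though the REBALANCE_START thread still holds the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; names do not match Infinispan's real classes. */
class AsyncChUpdateSketch {
    // Stands in for the LocalCacheStatus lock held by the REBALANCE_START thread.
    static final ReentrantLock cacheStatusLock = new ReentrantLock();

    /** Local CH_UPDATE handler: it needs the cache status lock. */
    static void handleChUpdate(CountDownLatch applied) {
        cacheStatusLock.lock();
        try {
            applied.countDown();
        } finally {
            cacheStatusLock.unlock();
        }
    }

    /** Returns true if the LEAVE handler got past the CH_UPDATE without blocking. */
    static boolean leaveHandlerDoesNotBlock() throws InterruptedException {
        ExecutorService asyncExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch releaseLock = new CountDownLatch(1);
        CountDownLatch applied = new CountDownLatch(1);

        // Simulates the REBALANCE_START thread: holds the lock while it waits
        // for the GET_TRANSACTIONS responses.
        Thread rebalance = new Thread(() -> {
            cacheStatusLock.lock();
            try {
                lockHeld.countDown();
                releaseLock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                cacheStatusLock.unlock();
            }
        });
        rebalance.start();
        lockHeld.await();

        // LEAVE handler: queue the local CH_UPDATE instead of processing it inline.
        asyncExecutor.execute(() -> handleChUpdate(applied));
        // We got here while the lock is still held, so the update must be pending.
        boolean pendingWhileLocked = applied.getCount() == 1;

        // GET_TRANSACTIONS fails or times out, the rebalance thread lets go of the lock.
        releaseLock.countDown();
        rebalance.join();
        boolean appliedAfterRelease = applied.await(5, TimeUnit.SECONDS);
        asyncExecutor.shutdown();
        return pendingWhileLocked && appliedAfterRelease;
    }
}
```

If handleChUpdate() were called directly from the LEAVE handler instead, that thread would block on cacheStatusLock, which is the situation described in step 10).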
NBST: Concurrent leavers can lead to deadlock
---------------------------------------------
Key: ISPN-2966
URL: https://issues.jboss.org/browse/ISPN-2966
Project: Infinispan
Issue Type: Bug
Reporter: Pedro Ruivo
Assignee: Dan Berindei
Labels: state_transfer
Fix For: 5.3.0.Final
Attachments: thread-dump.txt, trace.log
This sequence of events leads to a thread deadlock in the coordinator:
{code}
1) NodeF sends LEAVE message. new topologyId=8
2) NodeE delivers REBALANCE_START(8)
3) NodeF and NodeG deliver REBALANCE_START(8)
4) NodeH delivers GET_TRANSACTION(8) from NodeE ==> Transactions were requested by
node ConcurrentNonOverlappingLeaveTest-NodeE-28744 with topology 8, greater than the local
topology (7). Waiting for topology 8 to be installed locally.
5) NodeH sends LEAVE message. new topologyId=9
6) NodeH delivers REBALANCE_START(8) ==> Ignoring rebalance 8 for cache dist that
doesn't exist locally
7) NodeH delivers GET_TRANSACTION(8) from NodeG ==> Transactions were requested by
node ConcurrentNonOverlappingLeaveTest-NodeG-31669 with topology 8, greater than the local
topology (7). Waiting for topology 8 to be installed locally.
{code}
Possible solutions are:
- send the REBALANCE_START/CH_UPDATE commands asynchronously
- throw an exception when a GET_TRANSACTION command is received while the node is shutting down.