[infinispan-issues] [JBoss JIRA] (ISPN-2316) Distributed deadlock in StateTransferInterceptor
Radim Vansa (JIRA)
jira-events at lists.jboss.org
Wed Sep 19 10:00:34 EDT 2012
[ https://issues.jboss.org/browse/ISPN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719925#comment-12719925 ]
Radim Vansa commented on ISPN-2316:
-----------------------------------
In the Hudson job above there is another interesting detail. When slave3 requests GET_TRANSACTIONS in thread OOB-23 at 11:12:50,997, the request is processed on slave2 by OOB-52 at 11:12:51,002 but is delayed until 11:13:51,003. At that point StressorThread-7 times out performing the RPC back to slave3 and releases its locks, and only then are the transactions retrieved, as shown by 11:13:51,003 TRACE [org.infinispan.statetransfer.StateProviderImpl] (OOB-52,edg-perf02-60212) Found 20 transaction(s) to transfer. However, thread OOB-23 on slave3 never receives the response and stays blocked (see the stack traces for this thread after 11:13:51, by which time it should already have received the response).
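The timestamps quoted in the comment can be checked with simple arithmetic. The Python sketch below is only a calculator over those log times (the log lines carry no date, so only the time of day is parsed). It shows that the GET_TRANSACTIONS request reached slave2 about 5 ms after it was sent and then sat blocked there for about 60 s. That this 60 s equals the configured commit RPC timeout is an assumption; the comment does not state the timeout value.

```python
from datetime import datetime

# Log timestamps look like "HH:MM:SS,mmm"; %f accepts 1-6 digits and
# right-pads them, so "997" is read as 0.997 s.
LOG_FMT = "%H:%M:%S,%f"


def parse(ts: str) -> datetime:
    """Parse a time-of-day log timestamp (the date is not in the log line)."""
    return datetime.strptime(ts, LOG_FMT)


sent = parse("11:12:50,997")      # slave3, OOB-23 sends GET_TRANSACTIONS
received = parse("11:12:51,002")  # slave2, OOB-52 starts processing it
released = parse("11:13:51,003")  # StressorThread-7 times out; OOB-52 finds 20 transactions

network_delay = (received - sent).total_seconds()
blocked_on_slave2 = (released - received).total_seconds()
total_wait = (released - sent).total_seconds()

print(f"request in flight:  {network_delay:.3f} s")
print(f"blocked on slave2:  {blocked_on_slave2:.3f} s")
print(f"total before reply: {total_wait:.3f} s")
```

So the reply could not have been produced before 11:13:51,003; the point of the comment is that OOB-23 was still blocked after that moment.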
> Distributed deadlock in StateTransferInterceptor
> ------------------------------------------------
>
> Key: ISPN-2316
> URL: https://issues.jboss.org/browse/ISPN-2316
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer, Transactions
> Affects Versions: 5.2.0.Alpha3
> Reporter: Radim Vansa
> Assignee: Mircea Markus
> Priority: Critical
>
> When using transactions, a distributed deadlock may occur when a node is joining under these circumstances:
> 1) the new node requests transactions using GET_TRANSACTIONS
> 2) the old node tries to commit a transaction, broadcasting PrepareCommand - in StateTransferInterceptor it locks the transactionLock in a shared way
> 3) the GET_TRANSACTIONS request arrives at the old node, which waits for the transactionLock (it requires it exclusively)
> 4) the transaction commit on the new node waits for the commandsLock (it requires it in a shared way), but that lock is held exclusively by onTopologyUpdate -> addTransfer -> requestTransactions (= the synchronous GET_TRANSACTIONS).
> Observed in some traces, but not required for the deadlock:
> After the transaction commit times out on the old node and releases the lock, the GET_TRANSACTIONS request may proceed, but the state transfer itself can also time out unless its timeout is set sufficiently longer.
> The transaction commit continues on the new node after the state transfer (ST) times out, until it is found invalid (rolled back).
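The four steps of the issue form a cycle in a wait-for graph that spans both nodes, so no single node holds the whole cycle. The Python sketch below models each blocked party as a graph node and each "waits for" relation as an edge, then follows the edges to find the cycle. The labels are hypothetical descriptions, not Infinispan thread or class names, and the model is a simplification (one thread per role, synchronous RPCs treated as plain waits). The GET_TRANSACTIONS handler is placed on the old node, matching the comment, where it runs on slave2. Removing the commit's edge models the old node's commit RPC timing out and releasing transactionLock, which is how the traces show the deadlock being broken.

```python
from typing import Dict, List, Optional

# Hypothetical labels for the parties in steps 1-4 (not real thread names).
OLD_COMMIT = "old: tx commit (holds transactionLock shared)"
OLD_GET_TX = "old: GET_TRANSACTIONS handler"
NEW_PREPARE = "new: PrepareCommand handler"
NEW_TOPOLOGY = "new: onTopologyUpdate (holds commandsLock exclusive)"

# waits_for[a] == b means "a cannot proceed until b makes progress".
WAITS_FOR: Dict[str, str] = {
    OLD_COMMIT: NEW_PREPARE,    # synchronous PrepareCommand broadcast awaits the joiner's reply
    NEW_PREPARE: NEW_TOPOLOGY,  # needs commandsLock shared, held exclusively by the topology update
    NEW_TOPOLOGY: OLD_GET_TX,   # requestTransactions waits for the GET_TRANSACTIONS reply
    OLD_GET_TX: OLD_COMMIT,     # needs transactionLock exclusive, held shared by the commit
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from `start`; return the cycle reached, or None."""
    path: List[str] = []
    position: Dict[str, int] = {}
    node: Optional[str] = start
    while node is not None:
        if node in position:        # revisited: everything since then is the cycle
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)  # None when this party is not waiting on anyone
    return None


deadlock = find_cycle(WAITS_FOR, OLD_COMMIT)
print(" -> ".join(deadlock or []))

# The commit RPC on the old node times out and releases transactionLock:
# the commit stops waiting on the joiner, so its edge disappears.
after_timeout = {k: v for k, v in WAITS_FOR.items() if k != OLD_COMMIT}
print(find_cycle(after_timeout, NEW_TOPOLOGY))
```

Each node on its own sees only a chain (old: GET_TRANSACTIONS handler waits on the commit; new: PrepareCommand handler waits on the topology update); the cycle closes only through the two remote calls. Breaking it through the commit timeout is what then leaves the state transfer racing its own timeout, as the issue describes.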