[
https://issues.jboss.org/browse/ISPN-2316?page=com.atlassian.jira.plugin....
]
Radim Vansa commented on ISPN-2316:
-----------------------------------
The Hudson job above shows another interesting fact. slave3 requests GET_TRANSACTIONS in
thread OOB-23 at 11:12:50,997. The request is processed on slave2 by OOB-52 at
11:12:51,002 and is delayed until 11:13:51,003, when StressorThread-7 times out performing
the RPC back to slave3 and releases its locks. The transactions are then retrieved, as
indicated by 11:13:51,003 TRACE
[org.infinispan.statetransfer.StateProviderImpl] (OOB-52,edg-perf02-60212) Found 20
transaction(s) to transfer. However, thread OOB-23 on slave3 does not receive the
response and stays blocked (see the stack traces for this thread after 11:13:51, by which
time it should already have received the response).
Distributed deadlock in StateTransferInterceptor
------------------------------------------------
Key: ISPN-2316
URL: https://issues.jboss.org/browse/ISPN-2316
Project: Infinispan
Issue Type: Bug
Components: State transfer, Transactions
Affects Versions: 5.2.0.Alpha3
Reporter: Radim Vansa
Assignee: Mircea Markus
Priority: Critical
When using transactions, a distributed deadlock may occur while a node is joining, under
the following circumstances:
1) the new node requests transactions using GET_TRANSACTIONS
2) the old node tries to commit a transaction, broadcasting PrepareCommand - in
StateTransferInterceptor it acquires the transactionLock in shared mode
3) the GET_TRANSACTIONS request arrives on the old node, where it waits for the
transactionLock (it requires it exclusively)
4) the transaction commit on the new node waits for the commandsLock (which it requires
in shared mode), but that lock is held exclusively by the call chain onTopologyUpdate -
addTransfer - requestTransactions (= the synchronous GET_TRANSACTIONS).
Found in some traces, but not required for the deadlock:
After the transaction commit times out on the old node and releases the lock, the
GET_TRANSACTIONS request may continue, but the state transfer itself can also time out
unless its timeout is set sufficiently longer than the commit timeout.
The transaction commit continues on the new node after the state transfer times out,
until the transaction is found to be invalid (rolled back).
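
The lock cycle in steps 1-4, and the way a commit timeout breaks it, can be sketched in a
single JVM. This is only an illustration under stated assumptions, not Infinispan code:
the class StateTransferDeadlockSketch, the method simulate, and the Outcome type are
hypothetical, the two ReentrantReadWriteLock instances merely stand in for transactionLock
and commandsLock, and each synchronous RPC is collapsed into a direct timed lock attempt
by the calling thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical single-JVM model of the two-node lock cycle; not Infinispan code.
class StateTransferDeadlockSketch {

    /** Result of one simulated run. */
    static final class Outcome {
        final boolean commitSucceeded;
        final boolean transferSucceeded;

        Outcome(boolean commitSucceeded, boolean transferSucceeded) {
            this.commitSucceeded = commitSucceeded;
            this.transferSucceeded = transferSucceeded;
        }
    }

    static Outcome simulate(long commitTimeoutMs, long transferTimeoutMs)
            throws InterruptedException {
        // Stand-ins for the locks named in the report.
        ReentrantReadWriteLock transactionLock = new ReentrantReadWriteLock(); // old node
        ReentrantReadWriteLock commandsLock = new ReentrantReadWriteLock();    // new node
        // Ensures both threads hold their first lock before trying the second one.
        CountDownLatch firstLocksHeld = new CountDownLatch(2);
        AtomicBoolean commitOk = new AtomicBoolean();
        AtomicBoolean transferOk = new AtomicBoolean();

        // Old node: the commit holds transactionLock (shared), then needs commandsLock
        // (shared) on the new node to finish the PrepareCommand.
        Thread commit = new Thread(() -> {
            transactionLock.readLock().lock();
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();
                if (commandsLock.readLock().tryLock(commitTimeoutMs, TimeUnit.MILLISECONDS)) {
                    commandsLock.readLock().unlock();
                    commitOk.set(true);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                transactionLock.readLock().unlock(); // timing out releases the shared lock
            }
        }, "old-node-commit");

        // New node: the topology update holds commandsLock (exclusive), then the synchronous
        // GET_TRANSACTIONS needs transactionLock (exclusive) on the old node.
        Thread transfer = new Thread(() -> {
            commandsLock.writeLock().lock();
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();
                if (transactionLock.writeLock().tryLock(transferTimeoutMs, TimeUnit.MILLISECONDS)) {
                    transactionLock.writeLock().unlock();
                    transferOk.set(true);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                commandsLock.writeLock().unlock();
            }
        }, "new-node-state-transfer");

        commit.start();
        transfer.start();
        commit.join();
        transfer.join();
        return new Outcome(commitOk.get(), transferOk.get());
    }

    public static void main(String[] args) throws InterruptedException {
        Outcome o = simulate(200, 5000);
        System.out.println("commit succeeded: " + o.commitSucceeded);
        System.out.println("GET_TRANSACTIONS succeeded: " + o.transferSucceeded);
    }
}
```

With the commit timeout shorter than the transfer timeout, the commit attempt in this
model is expected to fail, while the GET_TRANSACTIONS stand-in proceeds once the shared
lock is released. Without timeouts on either side, both threads would wait on each other
indefinitely.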