[
https://issues.jboss.org/browse/ISPN-2966?page=com.atlassian.jira.plugin....
]
Pedro Ruivo commented on ISPN-2966:
-----------------------------------
[~anistor] I don't know whether anyone has noticed this, because the test doesn't fail.
The only visible effect is that the core test suite takes 6-7 min instead of 4 min (in my case).
However, the leave() in LocalTopologyManager does the following:
1) runningCaches.remove(cacheName);
2) send LEAVE to coordinator
When the leaving node then receives the REBALANCE_START, it discards it because the
cache no longer exists locally, which leads to the deadlock. Isn't it possible to remove
the cache only after sending the LEAVE to the coordinator?
1) send LEAVE to coordinator
2) runningCaches.remove(cacheName);
This way, the leaving node will still process the REBALANCE_START(8) (assuming the example
in the issue description), which avoids the deadlock (because the LEAVE is blocked due to
the REBALANCE_START).
Is it possible to solve it this way? Any thoughts?
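
To make the ordering argument concrete, here is a toy single-threaded model of the two
orderings. It is only a sketch, not the Infinispan code: LeaveOrderSketch, Coordinator,
handleRebalanceStart, leaveRemoveFirst and leaveSendFirst are hypothetical names, and the
assumption that the REBALANCE_START is delivered while the LEAVE call is still in progress
is a simplification of the scenario described in the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of a leaving node. Hypothetical names, not the real LocalTopologyManager. */
class LeaveOrderSketch {

    /** Stand-in for "send LEAVE to coordinator" (hypothetical interface). */
    interface Coordinator {
        void leave(String cacheName);
    }

    /** cacheName -> topology id currently installed on this node. */
    final Map<String, Integer> runningCaches = new ConcurrentHashMap<>();

    /** Records what happened to each REBALANCE_START this node received. */
    final List<String> log = new ArrayList<>();

    /** Installs the topology if the cache is still registered, otherwise ignores it. */
    void handleRebalanceStart(String cacheName, int topologyId) {
        Integer installed = runningCaches.computeIfPresent(cacheName, (k, v) -> topologyId);
        log.add((installed == null ? "ignored " : "installed ") + topologyId);
    }

    /** Current order from the comment: remove the cache, then send LEAVE. */
    void leaveRemoveFirst(String cacheName, Coordinator coordinator) {
        runningCaches.remove(cacheName);
        coordinator.leave(cacheName);
    }

    /** Proposed order: send LEAVE, then remove the cache. */
    void leaveSendFirst(String cacheName, Coordinator coordinator) {
        coordinator.leave(cacheName);
        runningCaches.remove(cacheName);
    }
}
```

If the coordinator stand-in delivers REBALANCE_START(8) back to the node while leave() is
in progress, leaveRemoveFirst records "ignored 8", whereas leaveSendFirst records
"installed 8" and only then drops the cache.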
NBST: Concurrent leavers can lead to deadlock
---------------------------------------------
Key: ISPN-2966
URL: https://issues.jboss.org/browse/ISPN-2966
Project: Infinispan
Issue Type: Bug
Reporter: Pedro Ruivo
Assignee: Pedro Ruivo
Labels: state_transfer
Fix For: 5.3.0.Final
Attachments: thread-dump.txt, trace.log
This sequence of events leads to a thread deadlock in the coordinator:
{code}
1) NodeF sends LEAVE message. new topologyId=8
2) NodeE delivers REBALANCE_START(8)
3) NodeF and NodeG deliver REBALANCE_START(8)
4) NodeH delivers GET_TRANSACTION(8) from NodeE ==> Transactions were requested by
node ConcurrentNonOverlappingLeaveTest-NodeE-28744 with topology 8, greater than the local
topology (7). Waiting for topology 8 to be installed locally.
5) NodeH sends LEAVE message. new topologyId=9
6) NodeH delivers REBALANCE_START(8) ==> Ignoring rebalance 8 for cache dist that
doesn't exist locally
7) NodeH delivers GET_TRANSACTION(8) from NodeG ==> Transactions were requested by
node ConcurrentNonOverlappingLeaveTest-NodeG-31669 with topology 8, greater than the local
topology (7). Waiting for topology 8 to be installed locally.
{code}
Possible solutions are:
- send the REBALANCE_START/CH_UPDATE async
- throw an exception when a GET_TRANSACTION is received and the node is shutting down.
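
As a rough illustration of the second option, the sketch below rejects a transaction
request as soon as the node is marked as stopping, instead of letting the request wait
for a topology that will never be installed locally. All names (ShutdownAwareProvider,
Outcome, markStopping, onGetTransactions) are hypothetical, and the use of
IllegalStateException is an assumption; this is not the actual Infinispan state-provider API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical request handler; illustrates the idea only. */
class ShutdownAwareProvider {

    /** What the handler decides for a request that is not rejected. */
    enum Outcome { SERVE, WAIT_FOR_TOPOLOGY }

    private final AtomicBoolean stopping = new AtomicBoolean(false);
    private final int localTopologyId;

    ShutdownAwareProvider(int localTopologyId) {
        this.localTopologyId = localTopologyId;
    }

    /** Called when the local cache starts shutting down (e.g. before sending LEAVE). */
    void markStopping() {
        stopping.set(true);
    }

    /**
     * Handles a transaction request tagged with the requestor's topology id.
     * Throws instead of waiting when the node is stopping, so the requestor
     * gets an error back rather than blocking indefinitely.
     */
    Outcome onGetTransactions(int requestedTopologyId) {
        if (stopping.get()) {
            throw new IllegalStateException(
                "Cache is stopping; cannot serve transactions for topology " + requestedTopologyId);
        }
        return requestedTopologyId > localTopologyId
            ? Outcome.WAIT_FOR_TOPOLOGY
            : Outcome.SERVE;
    }
}
```

In the sequence above, NodeH would answer the GET_TRANSACTION(8) requests from NodeE and
NodeG with an error after step 5, and the requestors would have to treat that error as
"this node is leaving" rather than as a failure of the state transfer.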