[infinispan-issues] [JBoss JIRA] (ISPN-2966) NBST: Concurrent leavers can lead to deadlock

Wed Mar 27 07:23:42 EDT 2013

    [ https://issues.jboss.org/browse/ISPN-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763417#comment-12763417 ] 

Pedro Ruivo commented on ISPN-2966:
-----------------------------------

[~anistor] I don't know if anyone has noticed this, because the test doesn't fail. The only visible effect is that the core test suite takes 6-7 min instead of 4 min (in my case).

However, the leave() in LocalTopologyManager does the following:

1) runningCaches.remove(cacheName);
2) send LEAVE to coordinator

When the leaving node then receives the REBALANCE_START, it discards it because the cache no longer exists locally, which leads to the deadlock. Isn't it possible to remove the cache after sending the LEAVE to the coordinator?

1) send LEAVE to coordinator
2) runningCaches.remove(cacheName);

This way, the leaving node will process the REBALANCE_START(8) (assuming the example in the issue description) and avoid the deadlock, because the LEAVE is blocked until the REBALANCE_START has been handled.
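
To make the difference between the two orderings concrete, here is a minimal single-threaded sketch. It is not Infinispan's LocalTopologyManager: the class LeaveOrderingSketch, its methods, and the sendLeave callback are hypothetical, and the callback simply assumes the REBALANCE_START is delivered while the LEAVE is still in flight.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified, single-threaded model of a leaving node.
// Hypothetical names throughout; this is NOT Infinispan's LocalTopologyManager.
class LeaveOrderingSketch {
    private final Set<String> runningCaches = new HashSet<>();
    private int localTopologyId;

    LeaveOrderingSketch(String cacheName, int topologyId) {
        runningCaches.add(cacheName);
        localTopologyId = topologyId;
    }

    int localTopologyId() {
        return localTopologyId;
    }

    // Models delivery of REBALANCE_START: ignored if the cache is not running locally.
    boolean onRebalanceStart(String cacheName, int topologyId) {
        if (!runningCaches.contains(cacheName)) {
            return false; // "Ignoring rebalance ... for cache that doesn't exist locally"
        }
        localTopologyId = topologyId;
        return true;
    }

    // Current order: remove the cache, then send LEAVE.
    void leaveRemoveFirst(String cacheName, Runnable sendLeave) {
        runningCaches.remove(cacheName);
        sendLeave.run();
    }

    // Proposed order: send LEAVE, then remove the cache.
    void leaveSendFirst(String cacheName, Runnable sendLeave) {
        sendLeave.run();
        runningCaches.remove(cacheName);
    }

    public static void main(String[] args) {
        // Assumption: REBALANCE_START(8) arrives while the LEAVE is being sent.
        LeaveOrderingSketch current = new LeaveOrderingSketch("dist", 7);
        current.leaveRemoveFirst("dist", () -> current.onRebalanceStart("dist", 8));
        System.out.println("remove first: topology " + current.localTopologyId());

        LeaveOrderingSketch proposed = new LeaveOrderingSketch("dist", 7);
        proposed.leaveSendFirst("dist", () -> proposed.onRebalanceStart("dist", 8));
        System.out.println("send first: topology " + proposed.localTopologyId());
    }
}
```

With the current order the node stays at topology 7, so anything waiting for topology 8 on it waits forever; with the proposed order topology 8 is installed before the cache is removed.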

Is it possible to solve it this way? Any thoughts?

> NBST: Concurrent leavers can lead to deadlock
> ---------------------------------------------
>
>                 Key: ISPN-2966
>                 URL: https://issues.jboss.org/browse/ISPN-2966
>             Project: Infinispan
>          Issue Type: Bug
>            Reporter: Pedro Ruivo
>            Assignee: Pedro Ruivo
>              Labels: state_transfer
>             Fix For: 5.3.0.Final
>
>         Attachments: thread-dump.txt, trace.log
>
>
> This sequence of events leads to a thread deadlock in the coordinator:
> {code}
> 1) NodeF sends LEAVE message. new topologyId=8
> 2) NodeE delivers REBALANCE_START(8)
> 3) NodeF and NodeG deliver REBALANCE_START(8)
> 4) NodeH delivers GET_TRANSACTION(8) from NodeE ==> Transactions were requested by node ConcurrentNonOverlappingLeaveTest-NodeE-28744 with topology 8, greater than the local topology (7). Waiting for topology 8 to be installed locally.
> 5) NodeH sends LEAVE message. new topologyId=9
> 6) NodeH delivers REBALANCE_START(8) ==> Ignoring rebalance 8 for cache dist that doesn't exist locally
> 7) NodeH delivers GET_TRANSACTION(8) from NodeG ==> Transactions were requested by node ConcurrentNonOverlappingLeaveTest-NodeG-31669 with topology 8, greater than the local topology (7). Waiting for topology 8 to be installed locally.
> {code}
> Possible solutions are:
>  - send the REBALANCE_START/CH_UPDATE async
>  - throw an exception when a GET_TRANSACTION is received and the node is shutting down.
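
For the second option in the description quoted above, a rough sketch of what "throw an exception when the node is shutting down" could look like. Everything here is hypothetical (the class TransactionRequestSketch, its methods, the leaving flag and the timeout); it only illustrates failing fast instead of waiting for a topology that will never be installed locally.

```java
// Hypothetical model of a node handling GET_TRANSACTION; not Infinispan code.
class TransactionRequestSketch {
    private int localTopologyId;
    private boolean leaving;

    TransactionRequestSketch(int topologyId) {
        localTopologyId = topologyId;
    }

    synchronized void installTopology(int topologyId) {
        localTopologyId = topologyId;
        notifyAll(); // wake up requests waiting for this topology
    }

    synchronized void markLeaving() {
        leaving = true;
        notifyAll(); // wake up waiters so they can fail fast
    }

    // Waits until the requested topology is installed, but gives up at once
    // if the node is leaving. Returns the local topology id on success.
    synchronized int onGetTransactions(int requestedTopologyId, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (localTopologyId < requestedTopologyId) {
            if (leaving) {
                throw new IllegalStateException(
                        "node is leaving; topology " + requestedTopologyId + " will not be installed");
            }
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                throw new IllegalStateException(
                        "timed out waiting for topology " + requestedTopologyId);
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for topology", e);
            }
        }
        return localTopologyId;
    }
}
```

The requester would then have to handle the exception, for example by retrying against another owner; that part is not covered here.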
