[infinispan-issues] [JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock

Radim Vansa (JIRA) issues at jboss.org
Wed Jan 8 08:52:33 EST 2014


    [ https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934441#comment-12934441 ] 

Radim Vansa commented on ISPN-3878:
-----------------------------------

I think I've found why the RSVP response does not arrive: TxCompletionNotificationCommand is sent as a non-OOB (ordered) message, and its processing waits until the topology is installed, so the ordered RSVP response cannot be delivered in the meantime. Only after the cancel command times out does the topology change finish (in the finally { ... } block of SCI.onTopologyUpdate), and only then can the ordered commands arrive.
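To make the ordering problem concrete, here is a minimal sketch in plain java.util.concurrent. It is a toy model, not Infinispan or JGroups code: the single-threaded executor stands in for ordered (non-OOB) delivery, the latch stands in for "topology installed", and all names (OrderedDeliverySketch, simulate, the event strings) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model only: every name here is an assumption made for illustration.
final class OrderedDeliverySketch {

    /** Runs the scenario and returns the events in the order they were observed. */
    static List<String> simulate(long cancelTimeoutMillis) {
        List<String> events = new CopyOnWriteArrayList<>();
        // One thread models ordered (non-OOB) delivery: messages are handled strictly in turn.
        ExecutorService orderedDelivery = Executors.newSingleThreadExecutor();
        CountDownLatch topologyInstalled = new CountDownLatch(1);
        CompletableFuture<Void> cancelResponse = new CompletableFuture<>();
        try {
            // An ordered command whose processing blocks until the topology is installed.
            orderedDelivery.submit(() -> {
                topologyInstalled.await();
                events.add("ordered command processed");
                return null;
            });
            // The ordered response to the cancel command is stuck behind it.
            orderedDelivery.submit(() -> {
                events.add("cancel response delivered");
                cancelResponse.complete(null);
                return null;
            });
            // The topology update waits for the cancel response before it can finish.
            try {
                cancelResponse.get(cancelTimeoutMillis, TimeUnit.MILLISECONDS);
                events.add("cancel acknowledged in time");
            } catch (TimeoutException e) {
                events.add("cancel timed out");
            } finally {
                // Models the finally block that completes the topology change.
                topologyInstalled.countDown();
            }
            // Only now can the ordered messages drain.
            cancelResponse.get(5, TimeUnit.SECONDS);
            return new ArrayList<>(events);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            orderedDelivery.shutdownNow();
        }
    }
}
```

Calling OrderedDeliverySketch.simulate(100) in this model records the timeout first and the two ordered deliveries only afterwards, which mirrors the sequence described above.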
                
> Unhandled failing ST cancel leads to deadlock
> ---------------------------------------------
>
>                 Key: ISPN-3878
>                 URL: https://issues.jboss.org/browse/ISPN-3878
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 6.0.1.Final
>            Reporter: Radim Vansa
>            Assignee: Dan Berindei
>            Priority: Critical
>
> Two concurrent rebalances can lead to a deadlock. For example, two rebalances can run in parallel when the coordinator is leaving the cluster: it sends REBALANCE_START and then goes away. The new coordinator then recovers the cluster status and sends REBALANCE_START as well.
> 1. The node is requesting segments for the old topology; StateConsumerImpl.isTransferThreadRunning is set to true.
> 2. The node waits for StateResponseCommand in SCI: InboundTransferTask.awaitCompletion()
> 3. A new rebalance is started, changing the CH; the requested segment is not in the new CH.
> 4. Some state transfers (ST) are cancelled; the cancel command is sent and takes a long time.
> 5. StateResponseCommand is received, but SCI.applyState finds that this segment is no longer owned, so the task is neither completed nor cancelled.
> 6. Later, InboundTransferTask.sendCancelCommand throws a TimeoutException, and no more cancellations are executed.
> Result: the inbound transfer thread is stuck and the rebalance never completes.
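The steps quoted above can be modelled with a small, self-contained sketch. This is not the Infinispan implementation: the Task class, its methods, and the terminateOnFailure flag are hypothetical stand-ins, and terminating the task when the cancel fails is only one conceivable way to handle the failure, not necessarily the fix adopted for this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model only: names and behaviour are assumptions made for illustration.
final class StuckTransferSketch {

    /** Hypothetical stand-in for an inbound transfer task. */
    static final class Task {
        private final CountDownLatch done = new CountDownLatch(1);

        /** Models the remote cancel call that times out (step 6). */
        void sendCancelCommand() throws TimeoutException {
            throw new TimeoutException("cancel command timed out");
        }

        /** Cancels the task; on failure it is terminated only if the caller asks for that. */
        void cancel(boolean terminateOnFailure) {
            try {
                sendCancelCommand();
                done.countDown();
            } catch (TimeoutException e) {
                if (terminateOnFailure) {
                    done.countDown(); // handle the failure so that waiters are released
                }
                // Otherwise the failure is unhandled and the task stays pending.
            }
        }

        /** Step 5: state for a segment that is no longer owned does not complete the task. */
        void applyState(boolean segmentStillOwned) {
            if (segmentStillOwned) {
                done.countDown();
            }
        }

        /** Step 2: the transfer thread waits here; the bound exists only so the sketch terminates. */
        boolean awaitCompletion(long millis) throws InterruptedException {
            return done.await(millis, TimeUnit.MILLISECONDS);
        }
    }

    /** Returns true if the waiting transfer thread would be released. */
    static boolean transferCompletes(boolean terminateOnFailure) {
        Task task = new Task();
        task.applyState(false);          // step 5: segment no longer owned
        task.cancel(terminateOnFailure); // steps 4 and 6: cancel is sent and times out
        try {
            return task.awaitCompletion(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In this model, transferCompletes(false) reproduces the stuck waiter, while transferCompletes(true) shows the waiter being released once the failed cancel is handled.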
