[infinispan-issues] [JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock
Dan Berindei (JIRA)
issues at jboss.org
Wed Jan 8 07:46:33 EST 2014
[ https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934423#comment-12934423 ]
Dan Berindei edited comment on ISPN-3878 at 1/8/14 7:45 AM:
------------------------------------------------------------
We should also change {{InboundTransferTask.sendCancelCommand()}} to send the cancel commands as OOB.
was (Author: dan.berindei):
Summarizing the discussion with Radim on IRC: the cancel command is already sent asynchronously, but it also uses the RSVP flag, which makes it half-synchronous. The RSVP ACK message is not tagged as OOB, so it can easily be delayed by some unrelated asynchronous command that takes too long (maybe because it's waiting for the new topology).
With UNICAST3, RSVP should no longer be necessary, so we can fix this by removing the code that sets the RSVP flag automatically for all the state transfer commands. We should also change {{InboundTransferTask.sendCancelCommand()}} to send the cancel commands as OOB.
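To make the proposed flag change concrete, here is a minimal sketch of the before/after flag selection. Everything in it is an assumption for illustration: the {{Flag}} and {{CommandKind}} enums and the {{flagsBefore}}/{{flagsAfter}} methods are hypothetical stand-ins, not the JGroups or Infinispan API, and the real change may be structured quite differently.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model only: the flag names mirror the OOB and RSVP flags
// discussed above, but none of these types exist in JGroups or Infinispan.
final class CancelFlagsSketch {
    enum Flag { OOB, RSVP }

    enum CommandKind { STATE_REQUEST, STATE_RESPONSE, CANCEL_STATE_TRANSFER }

    // Behaviour described in the comment: RSVP is set automatically
    // for every state transfer command, including the cancel command.
    static Set<Flag> flagsBefore(CommandKind kind) {
        return EnumSet.of(Flag.RSVP);
    }

    // Proposed behaviour: no automatic RSVP, and the cancel command
    // is sent as OOB so it cannot queue behind slow regular commands.
    static Set<Flag> flagsAfter(CommandKind kind) {
        return kind == CommandKind.CANCEL_STATE_TRANSFER
                ? EnumSet.of(Flag.OOB)
                : EnumSet.noneOf(Flag.class);
    }
}
```

With this model, only the cancel command carries a flag after the change, and nothing waits for an RSVP ACK any more.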
> Unhandled failing ST cancel leads to deadlock
> ---------------------------------------------
>
> Key: ISPN-3878
> URL: https://issues.jboss.org/browse/ISPN-3878
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 6.0.1.Final
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Critical
>
> Two concurrent rebalances can lead to deadlock. For example, two rebalances can run in parallel when the coordinator is leaving the cluster: it sends REBALANCE_START and then leaves. The new coordinator then recovers the cluster status and sends REBALANCE_START as well.
> 1. Node is requesting segments for the old topology, StateConsumerImpl.isTransferThreadRunning is set to true
> 2. Node waits for StateResponseCommand in SCI: InboundTransferTask.awaitCompletion()
> 3. New rebalance is started, changing the CH - requested segment is not in the new CH
> 4. Some state transfers are canceled; the cancel command is sent but takes a long time
> 5. StateResponseCommand is received, but SCI.applyState finds that this segment is no longer owned, so the task is neither completed nor cancelled
> 6. Later, we get TimeoutException from InboundTransferTask.sendCancelCommand, and no more cancellations are executed
> Result: the inbound transfer thread is stuck and rebalance is never completed.
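Step 6 above is where the remaining cancellations get skipped. As an illustration only, the sketch below shows one defensive way to handle that: catch the timeout per task, always release the waiter, and keep iterating. The {{Task}} class, its {{cancelTimesOut}} switch and {{cancelAll}} are hypothetical stand-ins written for this example; this is not the Infinispan code and not necessarily how the issue was fixed.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the cancellation loop; not Infinispan code.
final class CancelLoopSketch {
    static final class Task {
        final String source;
        final boolean cancelTimesOut; // test switch simulating a slow cancel command
        final CountDownLatch completion = new CountDownLatch(1);

        Task(String source, boolean cancelTimesOut) {
            this.source = source;
            this.cancelTimesOut = cancelTimesOut;
        }

        // Simulates sending the cancel command to the source node.
        void sendCancelCommand() throws TimeoutException {
            if (cancelTimesOut) {
                throw new TimeoutException("cancel to " + source + " timed out");
            }
        }

        // Returns false if the cancel command failed, but always marks the
        // task as completed so a thread blocked on the latch is released.
        boolean cancel() {
            try {
                sendCancelCommand();
                return true;
            } catch (TimeoutException e) {
                return false;
            } finally {
                completion.countDown();
            }
        }

        boolean isCompleted() {
            return completion.getCount() == 0;
        }
    }

    // Cancels every task; one failure does not abort the remaining ones.
    static int cancelAll(List<Task> tasks) {
        int failures = 0;
        for (Task task : tasks) {
            if (!task.cancel()) {
                failures++;
            }
        }
        return failures;
    }
}
```

The design point is that completion is signalled in a {{finally}} block, so whatever waits on the task is released even when the cancel command itself fails.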
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira