[
https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-3878:
------------------------------------
I think the cancel command can't be sent asynchronously, because we want to know that
nobody is sending state by the time the new rebalance starts. (The cancelling of the
transfer tasks should happen during the handling of the CH_UPDATE that's sent by the
new coordinator, not during the REBALANCE_START that follows.)
On the other hand, perhaps we don't need the CANCEL_STATE_TRANSFER commands at all,
and we could just cancel all outbound transfer tasks when we install a new cache topology
without a pending CH in StateProviderImpl.
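
The alternative in the last paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not Infinispan's actual StateProviderImpl: the class StateProviderSketch, the OutboundTask interface and the onTopologyUpdate method are hypothetical names, and the pending CH is modelled as a plain Object that is null when no rebalance is in progress.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a state provider; not Infinispan's real class. */
final class StateProviderSketch {

    /** Hypothetical view of an outbound transfer task: all we need is cancel(). */
    interface OutboundTask {
        void cancel();
    }

    private final List<OutboundTask> outboundTasks = new ArrayList<>();

    synchronized void addTask(OutboundTask task) {
        outboundTasks.add(task);
    }

    /**
     * Called when a new cache topology is installed.
     * Assumption: pendingCH == null means the topology has no pending CH,
     * i.e. no rebalance is in progress, so nobody should be sending state.
     * Returns the number of tasks that were cancelled.
     */
    synchronized int onTopologyUpdate(Object pendingCH) {
        if (pendingCH != null) {
            return 0; // a rebalance is in progress, keep the outbound tasks
        }
        int cancelled = 0;
        for (OutboundTask task : outboundTasks) {
            task.cancel(); // purely local, no CANCEL_STATE_TRANSFER round trip
            cancelled++;
        }
        outboundTasks.clear();
        return cancelled;
    }
}
```

In this sketch the cancellation is a local side effect of installing the topology, so it cannot time out the way a remote cancel command can.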
Unhandled failing ST cancel leads to deadlock
---------------------------------------------
Key: ISPN-3878
URL: https://issues.jboss.org/browse/ISPN-3878
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 6.0.1.Final
Reporter: Radim Vansa
Assignee: Dan Berindei
Priority: Critical
Two concurrent rebalances can lead to a deadlock. One situation in which two rebalances can
run in parallel is when the coordinator leaves the cluster: it sends REBALANCE_START and
then goes away. The new coordinator then recovers the cluster status and sends
REBALANCE_START as well.
1. A node is requesting segments for the old topology;
StateConsumerImpl.isTransferThreadRunning is set to true
2. The node waits for a StateResponseCommand in SCI: InboundTransferTask.awaitCompletion()
3. A new rebalance is started, changing the CH; the requested segment is not in the new CH
4. Some state transfers are cancelled; the cancel command is sent but takes a long time
5. The StateResponseCommand is received, but SCI.applyState finds that the segment is no
longer owned, so the task is neither completed nor cancelled
6. Later, InboundTransferTask.sendCancelCommand throws a TimeoutException, and no more
cancellations are executed
Result: the inbound transfer thread is stuck and rebalance is never completed.
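
The failure pattern in the steps above can be illustrated with a small sketch. It is not the actual InboundTransferTask code: InboundTaskSketch, CancelSender, cancelFragile and cancelSafe are hypothetical names, and the assumption is that a waiter blocks on a latch which is only released once the task is completed or cancelled.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for an inbound transfer task; not Infinispan's real class. */
final class InboundTaskSketch {

    /** Hypothetical hook for the remote cancel call, which may time out. */
    interface CancelSender {
        void sendCancel() throws TimeoutException;
    }

    private final CountDownLatch completion = new CountDownLatch(1);
    private final CancelSender sender;

    InboundTaskSketch(CancelSender sender) {
        this.sender = sender;
    }

    /** Fragile variant: if the remote cancel throws, the latch is never released. */
    void cancelFragile() throws TimeoutException {
        sender.sendCancel();
        completion.countDown();
    }

    /** Defensive variant: the local waiter is released even if the remote cancel fails. */
    void cancelSafe() {
        try {
            sender.sendCancel();
        } catch (TimeoutException e) {
            // Assumption: late state responses for this task are ignored elsewhere.
        } finally {
            completion.countDown();
        }
    }

    /** Returns true if the task finished within the timeout, false otherwise. */
    boolean awaitCompletion(long timeout, TimeUnit unit) {
        try {
            return completion.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a sender that always times out, a thread waiting on the fragile variant is never released, while the defensive variant still releases it. Whether swallowing the timeout is acceptable depends on how late state responses are handled, which this sketch does not model.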