[
https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin....
]
Radim Vansa edited comment on ISPN-3878 at 1/8/14 8:53 AM:
-----------------------------------------------------------
I think I've found why the RSVP response does not arrive:
TxCompletionNotificationCommand (from another node that has already installed the new
topology) is sent as non-OOB, and it waits until the topology is installed, so the ordered
RSVP response queued behind it cannot be delivered. After the cancel command times out, the
topology change finishes (in SCI.onTopologyUpdate: finally { ... }), and only then can the
ordered commands arrive.
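The ordering problem described above can be illustrated with a minimal sketch. It is a simplified model, not JGroups or Infinispan code: the class OrderedDeliverySketch and all of its methods are hypothetical names, and a single-threaded executor stands in for the per-sender ordered delivery of regular (non-OOB) messages.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of ordered delivery; not the actual transport implementation.
class OrderedDeliverySketch {
    // One thread delivers messages strictly in submission order (assumption of this model).
    private final ExecutorService orderedChannel = Executors.newSingleThreadExecutor();
    private final CountDownLatch topologyInstalled = new CountDownLatch(1);

    // Models a non-OOB command that blocks until the new topology is installed.
    Future<?> deliverBlockingCommand() {
        return orderedChannel.submit(() -> {
            try {
                topologyInstalled.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Models the ordered RSVP response, queued behind the blocking command.
    Future<String> deliverResponse() {
        return orderedChannel.submit(() -> "rsvp-response");
    }

    // Models the end of the topology update, which unblocks the first command.
    void installTopology() {
        topologyInstalled.countDown();
    }

    // Returns true if the future completed normally within the given time.
    static boolean arrivesWithin(Future<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void shutdown() {
        orderedChannel.shutdownNow();
    }
}
```

In this model the response cannot be delivered while the earlier command is blocked, and it arrives as soon as installTopology() is called. An out-of-band message would correspond to submitting the task to a separate executor, which this sketch does not model.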
was (Author: rvansa):
I think I've found why the RSVP response does not arrive: as
TxCompletionNotificationCommand is sent as non-OOB and this waits until the topology is
installed, the ordered RSVP response cannot be delivered. After the cancel command times
out, the topology change is finished (in SCI.onTopologyUpdate: finally { ... }) and only
then the ordered commands can arrive.
Unhandled failing ST cancel leads to deadlock
---------------------------------------------
Key: ISPN-3878
URL: https://issues.jboss.org/browse/ISPN-3878
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 6.0.1.Final
Reporter: Radim Vansa
Assignee: Dan Berindei
Priority: Critical
Two concurrent rebalances can lead to a deadlock. One situation in which two rebalances
can run in parallel is when the coordinator leaves the cluster: it sends REBALANCE_START
and then goes away. The new coordinator then recovers the cluster status and sends
REBALANCE_START as well.
1. The node is requesting segments for the old topology;
StateConsumerImpl.isTransferThreadRunning is set to true.
2. The node waits for StateResponseCommand in SCI: InboundTransferTask.awaitCompletion().
3. A new rebalance is started, changing the CH; the requested segment is not in the new CH.
4. Some STs are canceled; the cancel command is sent but takes a long time.
5. StateResponseCommand is received, but SCI.applyState finds that this segment is no
longer owned, so the task is not completed/cancelled.
6. Later, we get a TimeoutException from InboundTransferTask.sendCancelCommand, and no
more cancellations are executed.
Result: the inbound transfer thread is stuck and the rebalance never completes.
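The failure mode in steps 2 to 6 can be sketched in a few lines. This is a simplified model under stated assumptions, not Infinispan's InboundTransferTask: the class InboundTaskSketch, its constructor flag, and the cancelHandled() variant are hypothetical, and the try/finally pattern is only one possible defensive approach, not the fix adopted for this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for an inbound transfer task; not the real class.
class InboundTaskSketch {
    private final CountDownLatch completion = new CountDownLatch(1);
    private final boolean cancelTimesOut;

    InboundTaskSketch(boolean cancelTimesOut) {
        this.cancelTimesOut = cancelTimesOut;
    }

    // Models sending the cancel command to the remote state provider.
    private void sendCancelCommand() throws TimeoutException {
        if (cancelTimesOut) {
            throw new TimeoutException("cancel command timed out");
        }
    }

    // Unhandled failure: the exception escapes before the waiter is released.
    void cancelUnhandled() throws TimeoutException {
        sendCancelCommand();
        completion.countDown();
    }

    // Defensive variant (assumption): always release the waiter, even on failure.
    void cancelHandled() {
        try {
            sendCancelCommand();
        } catch (TimeoutException e) {
            // The remote cancel failed; the local waiter is still released below.
        } finally {
            completion.countDown();
        }
    }

    // Models the transfer thread waiting for the task; false means still blocked.
    boolean awaitCompletion(long timeoutMillis) {
        try {
            return completion.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With cancelUnhandled(), a timed-out cancel leaves awaitCompletion() blocked, which corresponds to the stuck inbound transfer thread. With cancelHandled(), the waiter is released regardless of whether the remote cancel succeeded.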