[JBoss JIRA] (ISPN-3878) Unhandled failing ST cancel leads to deadlock

Wednesday, 8 January 2014

    [ https://issues.jboss.org/browse/ISPN-3878?page=com.atlassian.jira.plugin.... ]

Radim Vansa commented on ISPN-3878:
-----------------------------------

The TimeoutException is caused by the RSVP protocol. The RSVP response was delayed for more
than 60s (our replication timeout). This response is NOT sent with the OOB flag, so the
processing of other messages can delay its delivery.
As Dan recommended, with UNICAST3 the RSVP protocol should, in theory, no longer be
necessary. I am now trying to run the tests without RSVP.
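
To illustrate why a response without the OOB flag can miss a timeout, here is a minimal sketch using plain java.util.concurrent executors. It is only an analogy and does not use JGroups: the single-threaded executor stands in for ordered (regular) delivery, the cached pool stands in for out-of-band delivery, and the class name, method name and durations are all hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class OrderedDeliverySketch {

    // Returns true if the "response" is delivered before the caller's timeout.
    // outOfBand == false: the response queues behind a slow regular message.
    // outOfBand == true:  the response is handled on a separate pool.
    static boolean responseArrivesInTime(boolean outOfBand) {
        ExecutorService ordered = Executors.newSingleThreadExecutor(); // stand-in for ordered delivery
        ExecutorService oob = Executors.newCachedThreadPool();         // stand-in for OOB delivery
        try {
            // A slow regular message occupies the ordered delivery thread.
            ordered.submit(() -> {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // The response is submitted after the slow message.
            Future<String> response = (outOfBand ? oob : ordered).submit(() -> "ack");

            try {
                response.get(200, TimeUnit.MILLISECONDS); // stand-in for the replication timeout
                return true;
            } catch (TimeoutException e) {
                return false; // the response was stuck behind other message processing
            } catch (InterruptedException | ExecutionException e) {
                return false;
            }
        } finally {
            ordered.shutdownNow();
            oob.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("ordered delivery in time: " + responseArrivesInTime(false));
        System.out.println("out-of-band delivery in time: " + responseArrivesInTime(true));
    }
}
```

In this sketch the ordered response times out only because it shares a queue with unrelated slow work, which is the same shape as the delayed RSVP response described above.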

...
 Unhandled failing ST cancel leads to deadlock
 ---------------------------------------------

                 Key: ISPN-3878
                 URL: https://issues.jboss.org/browse/ISPN-3878
             Project: Infinispan
          Issue Type: Bug
          Components: State transfer
    Affects Versions: 6.0.1.Final
            Reporter: Radim Vansa
            Assignee: Dan Berindei
            Priority: Critical

 Two concurrent rebalances can lead to a deadlock. One situation in which two rebalances
can be executed in parallel is when the coordinator is leaving the cluster: it sends
REBALANCE_START and then leaves. The new coordinator then recovers the cluster status and
sends REBALANCE_START as well.
 1. The node is requesting segments for the old topology;
StateConsumerImpl.isTransferThreadRunning is set to true
 2. The node waits for a StateResponseCommand in SCI (StateConsumerImpl):
InboundTransferTask.awaitCompletion()
 3. A new rebalance is started, changing the CH (consistent hash); the requested segment
is not in the new CH
 4. Some state transfers (ST) are cancelled; the cancel command is sent and takes a long time
 5. The StateResponseCommand is received, but SCI.applyState finds that the segment is no
longer owned, so the task is neither completed nor cancelled
 6. Later, InboundTransferTask.sendCancelCommand throws a TimeoutException, and no further
cancellations are executed
 Result: the inbound transfer thread is stuck and the rebalance never completes.
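
The failure pattern in steps 4 to 6 can be sketched in isolation. The code below is not Infinispan code: TransferTask, cancelUnsafe, cancelSafe and waiterReleased are hypothetical names, and a CountDownLatch stands in for whatever InboundTransferTask.awaitCompletion() actually waits on. It only shows that when the cancel command throws and nothing else marks the task as finished, a waiter is never released, and that releasing the waiter in a finally block is one possible way to avoid that.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancelDeadlockSketch {

    // Hypothetical stand-in for an inbound transfer task.
    static class TransferTask {
        private final CountDownLatch completion = new CountDownLatch(1);

        // Simulates a remote cancel command that always times out.
        private void sendCancel() throws TimeoutException {
            throw new TimeoutException("cancel command timed out");
        }

        // Problematic variant: if sendCancel() throws, the latch is never released.
        void cancelUnsafe() throws TimeoutException {
            sendCancel();
            completion.countDown();
        }

        // Defensive variant: the waiter is released even if the cancel command fails.
        void cancelSafe() {
            try {
                sendCancel();
            } catch (TimeoutException e) {
                // A real system would log this; the task is still treated as finished locally.
            } finally {
                completion.countDown();
            }
        }

        // Bounded wait so the sketch terminates; an unbounded wait would block forever.
        boolean awaitCompletion(long millis) {
            try {
                return completion.await(millis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Returns true if a waiter on the task is released after the cancel attempt.
    static boolean waiterReleased(boolean safe) {
        TransferTask task = new TransferTask();
        try {
            if (safe) {
                task.cancelSafe();
            } else {
                task.cancelUnsafe();
            }
        } catch (TimeoutException e) {
            // Mirrors the failure in step 6: the exception surfaces and nothing completes the task.
        }
        return task.awaitCompletion(200);
    }

    public static void main(String[] args) {
        System.out.println("unsafe cancel released waiter: " + waiterReleased(false));
        System.out.println("safe cancel released waiter: " + waiterReleased(true));
    }
}
```

Whether completing the task locally after a failed cancel is the right behaviour for the actual state transfer code is a design question for the fix; the sketch only demonstrates the stuck waiter.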
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
