[infinispan-dev] Blocking issue in TO State Transfer

Bela Ban bban at redhat.com
Tue Feb 26 11:31:31 EST 2013


On 2/26/13 5:14 PM, Pedro Ruivo wrote:
> So, in this case, the regular message will block until the OOB message is delivered.


No, the regular message should get delivered as soon as the OOB message 
has been *received* (not *delivered*). Unless there are previous regular 
messages from the same sender which are delivered in the same thread, 
and one of them is blocked in application code...
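
To make the received-versus-delivered distinction concrete, here is a minimal sketch of a per-sender receive table. It is a toy model under my own assumptions, not the UNICAST2/UNICAST3 code: the class ReceiverTableSketch and its methods add() and removeDeliverable() are hypothetical names. The only point it illustrates is that the regular thread waits for the gap at seqno #x to be filled, not for the OOB message to finish processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of a per-sender receive table; NOT the JGroups code.
class ReceiverTableSketch {
    private static final class Entry {
        final String payload;
        final boolean oob;
        Entry(String payload, boolean oob) { this.payload = payload; this.oob = oob; }
    }

    private final TreeMap<Long, Entry> received = new TreeMap<>();
    private long nextSeqno; // lowest seqno not yet handed to the regular thread

    ReceiverTableSketch(long firstSeqno) { this.nextSeqno = firstSeqno; }

    // Records a message as *received*. In this model an OOB message is passed
    // up by the thread that added it, so in the table it only fills the gap.
    synchronized void add(long seqno, String payload, boolean oob) {
        received.put(seqno, new Entry(payload, oob));
    }

    // Called by the regular thread: returns the regular messages that are
    // deliverable in order, stopping at the first seqno not yet received.
    synchronized List<String> removeDeliverable() {
        List<String> deliverable = new ArrayList<>();
        Entry entry;
        while ((entry = received.remove(nextSeqno)) != null) {
            nextSeqno++;
            if (!entry.oob) {
                deliverable.add(entry.payload); // OOB entries are skipped here
            }
        }
        return deliverable;
    }
}
```

With a table starting at seqno 1, adding the regular response as #2 leaves removeDeliverable() empty because #1 is missing. Once the OOB rebalance_confirm is added as #1, the next call returns the response, regardless of whether the OOB handler is still blocked in application code.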


> however, the OOB message is being blocked in the application
> until the regular message is delivered. And there is no way to pick the
> regular message from the window list while the OOB is blocked, right?
> (assuming no more incoming messages)

Picking up the regular message actually should happen, as the two
messages are delivered by different threads!
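
As a rough illustration of why the delivering thread matters, the sketch below runs a handler that blocks until a response arrives, either on its own thread or on the same thread as the response. Everything in it is an assumption made for illustration (the class SeparateThreadDelivery, the run() method, the single-thread executors standing in for the OOB and regular pools); it does not use any JGroups or Infinispan API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo; the executors only stand in for the OOB and regular pools.
class SeparateThreadDelivery {

    // Returns true if the blocked "confirm" handler saw the response in time.
    static boolean run(boolean sharedThread, long timeoutMillis) {
        ExecutorService oobPool = Executors.newSingleThreadExecutor();
        ExecutorService regularPool =
                sharedThread ? oobPool : Executors.newSingleThreadExecutor();
        CountDownLatch responseDelivered = new CountDownLatch(1);

        // Stand-in for rebalance_confirm: blocks until the response arrives.
        Callable<Boolean> confirmHandler =
                () -> responseDelivered.await(timeoutMillis, TimeUnit.MILLISECONDS);
        // Stand-in for the regular rebalance_start response.
        Runnable responseHandler = responseDelivered::countDown;

        try {
            Future<Boolean> confirm = oobPool.submit(confirmHandler);
            regularPool.submit(responseHandler);
            return confirm.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            oobPool.shutdownNow();
            regularPool.shutdownNow();
        }
    }
}
```

With sharedThread set to false the handler completes, because the response is delivered by a different thread. With sharedThread set to true the response is queued behind the blocked handler, so the handler only returns once its timeout expires.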


> so, if everybody agrees, if I move the OOB message to another thread,
> everything should work fine...
>
> On 02/26/2013 03:50 PM, Bela Ban wrote:
>> On 2/26/13 4:15 PM, Dan Berindei wrote:
>>>
>>> On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo <pedro at infinispan.org
>>> <mailto:pedro at infinispan.org>> wrote:
>>>
>>>      hi,
>>>
>>>      I found the blocking problem with the state transfer this morning.
>>>      It happens because of the reordering of a regular and OOB message.
>>>
>>>      Below is a simplification of what is happening for two nodes:
>>>
>>>      A: total order broadcasts rebalance_start
>>>
>>>      B: (incoming thread) delivers rebalance_start
>>>      B: has no segments to request so the rebalance is done
>>>      B: sends async request with rebalance_confirm (unicast #x)
>>>      B: sends the rebalance_start response (unicast #x+1) (the response
>>>      is a regular message)
>>>
>>>      A: receives rebalance_start response (unicast #x+1)
>>>      A: in UNICAST2, it detects the message is out-of-order and blocks
>>>      the response in the sender window (i.e. the message #x is missing)
>>>      A: receives the rebalance_confirm (unicast #x)
>>>      A: delivers rebalance_confirm. Infinispan blocks this command
>>>      until all the rebalance_start responses are received ==> this
>>>      causes a deadlock! (because the response is blocked in the unicast
>>>      layer)
>>>
>>>      Question: can the request's response message always be sent as
>>>      OOB? (I think the answer should be no...)
>>>
>>>
>>> We could, if Bela adds the send(Message) method to the Response
>>> interface...
>> I created a JIRA yesterday: https://issues.jboss.org/browse/JGRP-1602.
>> I'm wondering though if you *really* need it, as making all responses
>> OOB is a bad idea IMO, see below...
>>
>>
>>> and personally I think it would be better to make all responses OOB
>>> (as in JGroups 3.2.x). I don't have any data to back this up, though...
>> Intuitively, I think indiscriminately marking all responses as OOB
>> is bad, especially in the light of the async invocation API which will
>> make all messages non-blocking, at least in the OOB or reg thread pools.
>>
>> The code in 3.3 *does* actually copy the flags of the request into the
>> response, so if the request is sync (OOB), the response will be OOB too.
>> For async RPCs (regular messages), you're not getting any response
>> anyway, so no worries here...
>>
>>
>>>      My suggestion: when I deliver a rebalance_confirm command (which
>>>      is sent async), can I move it to a thread in
>>>      async_thread_pool_executor?
>>>
>>>
>>> I have a WIP fix for https://issues.jboss.org/browse/ISPN-2825, which
>>> should stop blocking the REBALANCE_CONFIRM commands on the
>>> coordinator: https://github.com/danberindei/infinispan/tree/t_2825_m
>>>
>>> I haven't issued a PR yet because I'm still getting a failure in
>>> ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP
>>> not receiving an ACK from itself). I'll let you know when I find out...
>>
>>
>> Yes, please do that. I saw in London that you could reproduce it in
>> your test, so it should be simple to find the root cause.
>>
>>
>>
>>>      Weird thing: last night I tried more than 5 times in a row with
>>>      UNICAST3 and it never blocked. Can this mean a problem with
>>>      UNICAST3, or was I just lucky?
>>>
>>>
>>> Even though the REBALANCE_CONFIRM command is sent async, the message
>>> is still OOB. I think UNICAST/2/3 should not block any regular
>>> message waiting for the processing of an OOB message, as long as that
>>> message was received, so maybe the problem is in UNICAST2?
>> If the OOB thread added the OOB message, then it will simply pass it
>> up. However, the regular thread needs to wait for gaps in the receiver
>> table to fill, as it doesn't know what type of message will be
>> received (could be regular).
>>
>> As soon as the OOB message has been added to the table, the regular
>> message will get delivered.
>>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)


