[infinispan-dev] Blocking issue in TO State Transfer

Pedro Ruivo pedro at infinispan.org
Wed Feb 27 04:55:09 EST 2013


OK, thanks for the feedback Bela =)

Pedro

On 02/27/2013 09:13 AM, Bela Ban wrote:
> OK, here's what happens:
>
> - A's receiver table for B is at #6, which means that the next message from B
> must be #7
> - A receives B#8 (regular message from B)
> - A adds B#8 to B's receiver table, but doesn't deliver it (not OOB, and
> not #7)
> - A receives OOB message B#7 from B
> - The OOB thread delivers B#7 immediately
> - Infinispan blocks on B#7
> - Unless another message from B is received, B#8 will *not* get
> delivered: as you can see in the code below, the OOB thread would check
> *after* delivering B#7 if there are more messages to be delivered, but
> because it is blocked by Infinispan, it cannot deliver B#8.
>
> This is one of the rare cases where an OOB thread gets to deliver
> regular messages.
>
> The root cause is that Infinispan blocks on an OOB message; but OOB
> messages should never block! This is another reason why an Infinispan
> application thread pool makes a lot of sense !
>
>
>           // An OOB message is passed up immediately. Later, when remove() is
>           // called, we discard it. This affects ordering!
>           // http://jira.jboss.com/jira/browse/JGRP-377
>           if(msg.isFlagSet(Message.OOB) && added) {
>               try {
>                   up_prot.up(evt);
>               }
>               catch(Throwable t) {
>                   log.error("couldn't deliver OOB message " + msg, t);
>               }
>           }
>
>           // The OOB thread never gets here as it is blocked in
>           // up_prot.up() by Infinispan.
>
>           final AtomicBoolean processing=win.getProcessing();
>           if(!processing.compareAndSet(false, true))
>               return true;
>
>
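The B#7/B#8 sequence described above can be replayed with a toy model of a per-sender receiver table. This is only a minimal sketch, not the UNICAST2 code: the class ReceiverTableSketch and its methods add() and removeDeliverable() are hypothetical names invented for illustration. It assumes that OOB messages are passed up at receive time and are therefore skipped when the contiguous run is removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a per-sender receiver table (hypothetical, NOT the JGroups implementation).
class ReceiverTableSketch {
    // seqno -> true if the message is OOB
    private final TreeMap<Long, Boolean> pending = new TreeMap<>();
    private long highestDelivered;

    ReceiverTableSketch(long highestDelivered) {
        this.highestDelivered = highestDelivered;
    }

    // Stores a received message; returns false for duplicates or already delivered seqnos.
    boolean add(long seqno, boolean oob) {
        if (seqno <= highestDelivered || pending.containsKey(seqno)) {
            return false;
        }
        pending.put(seqno, oob);
        return true;
    }

    // Removes the contiguous run that starts at highestDelivered + 1 and returns
    // the seqnos of the regular messages in it. OOB entries are dropped here
    // because this model assumes they were passed up when they were received.
    List<Long> removeDeliverable() {
        List<Long> regular = new ArrayList<>();
        while (!pending.isEmpty() && pending.firstKey() == highestDelivered + 1) {
            Map.Entry<Long, Boolean> entry = pending.pollFirstEntry();
            highestDelivered = entry.getKey();
            if (!entry.getValue()) {
                regular.add(entry.getKey());
            }
        }
        return regular;
    }

    public static void main(String[] args) {
        ReceiverTableSketch table = new ReceiverTableSketch(6); // B's table is at #6
        table.add(8, false);                                    // regular B#8 arrives first
        System.out.println(table.removeDeliverable());          // prints [] because #7 is missing
        table.add(7, true);                                     // OOB B#7 fills the gap
        // In the real scenario the OOB thread would now pass B#7 up. If the
        // application blocks in that call, the next line is never reached.
        System.out.println(table.removeDeliverable());          // prints [8]
    }
}
```

In this model B#8 only becomes deliverable through the second removeDeliverable() call, which is the step the blocked OOB thread never reaches.
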
>
> On 2/26/13 7:35 PM, Pedro Ruivo wrote:
>> On 02/26/2013 04:31 PM, Bela Ban wrote:
>>> On 2/26/13 5:14 PM, Pedro Ruivo wrote:
>>>> So, in this case, the regular message will block until the OOB
>>>> message is delivered.
>>> No, the regular message should get delivered as soon as the OOB message
>>> has been *received* (not *delivered*). Unless there are previous regular
>>> messages from the same sender which are delivered in the same thread,
>>> and one of them is blocked in application code...
>> In attachment is part of the log. I only know that the response is
>> disappearing between UNICAST2 and the ISPN unmarshaller.
>>
>> could you please take a look?
>>
>> the response is being sent and received and I don't understand why
>> ISPN is not receiving it
>>
>> Thanks
>> Pedro
>>>
>>>> however, the OOB message is being blocked in the application
>>>> until the regular message is delivered. And there is no way to pick the
>>>> regular message from the window list while the OOB is blocked, right?
>>>> (assuming no more incoming messages)
>>> This actually should happen, as they're delivered by different threads !
>>>
>>>
>>>> so, if everybody agrees, if I move the OOB message to another thread,
>>>> everything should work fine...
>>>>
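The proposal to move the blocking command to another thread can be sketched as follows. This is a hedged illustration, not Infinispan code: OobHandoffSketch, run() and the handOff flag are hypothetical names, the CountDownLatch stands in for "all rebalance_start responses received", and the timeout only exists so that the inline (deadlocking) variant terminates.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical model of one receiver thread that delivers an OOB command and
// then a regular response from the same sender.
class OobHandoffSketch {

    // Returns true if the blocking command completed within the timeout.
    static boolean run(boolean handOff, long timeoutMillis) {
        CountDownLatch responseReceived = new CountDownLatch(1);
        // Stands in for rebalance_confirm: it waits until the response is delivered.
        Callable<Boolean> confirmCommand =
                () -> responseReceived.await(timeoutMillis, TimeUnit.MILLISECONDS);
        // Stands in for an application-level thread pool.
        ExecutorService appPool = Executors.newSingleThreadExecutor();
        try {
            if (!handOff) {
                // The receiver thread runs the command itself, so the countDown()
                // further down is never reached and the wait times out.
                return confirmCommand.call();
            }
            // The receiver thread only submits the command and returns at once.
            Future<Boolean> confirm = appPool.submit(confirmCommand);
            // The same receiver thread is now free to deliver the regular response.
            responseReceived.countDown();
            return confirm.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            return false;
        } finally {
            appPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline:   " + run(false, 200));
        System.out.println("hand-off: " + run(true, 2000));
    }
}
```

With the hand-off, the thread that received the OOB message never waits on application state, so it can go on to deliver the regular message queued behind it.
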
>>>> On 02/26/2013 03:50 PM, Bela Ban wrote:
>>>>> On 2/26/13 4:15 PM, Dan Berindei wrote:
>>>>>> On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo <pedro at infinispan.org>
>>>>>> wrote:
>>>>>>
>>>>>>        hi,
>>>>>>
>>>>>>        I found the blocking problem with the state transfer this
>>>>>> morning.
>>>>>>        It happens because of the reordering of a regular and OOB
>>>>>> message.
>>>>>>
>>>>>>        Below, is a simplification of what is happening for two nodes
>>>>>>
>>>>>>        A: total order broadcasts rebalance_start
>>>>>>
>>>>>>        B: (incoming thread) delivers rebalance_start
>>>>>>        B: has no segments to request so the rebalance is done
>>>>>>        B: sends async request with rebalance_confirm (unicast #x)
>>>>>>        B: sends the rebalance_start response (unicast #x+1) (the
>>>>>> response
>>>>>>        is a regular message)
>>>>>>
>>>>>>        A: receives rebalance_start response (unicast #x+1)
>>>>>>        A: in UNICAST2, it detects the message is out-of-order and
>>>>>> blocks
>>>>>>        the response in the sender window (i.e. the message #x is
>>>>>> missing)
>>>>>>        A: receives the rebalance_confirm (unicast #x)
>>>>>>        A: delivers rebalance_confirm. Infinispan blocks this command
>>>>>>        until all the rebalance_start responses are received ==> this
>>>>>>        causes a deadlock! (because the response is blocked in the
>>>>>>        unicast layer)
>>>>>>
>>>>>>        Question: can the request's response message be sent always as
>>>>>>        OOB? (I think the answer should be no...)
>>>>>>
>>>>>>
>>>>>> We could, if Bela adds the send(Message) method to the Response
>>>>>> interface...
>>>>> I created a JIRA yesterday: https://issues.jboss.org/browse/JGRP-1602.
>>>>> I'm wondering though if you *really* need it, as making all responses
>>>>> OOB is a bad idea IMO, see below...
>>>>>
>>>>>
>>>>>> and personally I think it would be better to make all responses OOB
>>>>>> (as in JGroups 3.2.x). I don't have any data to back this up,
>>>>>> though...
>>>>> Intuitively, I think indiscriminately marking all responses as OOB
>>>>> is bad, especially in the light of the async invocation API which will
>>>>> make all messages non-blocking, at least in the OOB or reg thread
>>>>> pools.
>>>>>
>>>>> The code in 3.3 *does* actually copy the flags of the request into the
>>>>> response, so if the request is sync (OOB), so will the response be.
>>>>> For async RPCs (regular messages), you're not getting any response
>>>>> anyway, so no worries here...
>>>>>
>>>>>
>>>>>>        My suggestion: when I deliver a rebalance_confirm command
>>>>>>        (which is sent async), can I move it to a thread in
>>>>>>        async_thread_pool_executor?
>>>>>>
>>>>>>
>>>>>> I have WIP fix for https://issues.jboss.org/browse/ISPN-2825, which
>>>>>> should stop blocking the REBALANCE_CONFIRM commands on the
>>>>>> coordinator: https://github.com/danberindei/infinispan/tree/t_2825_m
>>>>>>
>>>>>> I haven't issued a PR yet because I'm still getting a failure in
>>>>>> ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP
>>>>>> not receiving an ACK from itself). I'll let you know when I find
>>>>>> out...
>>>>> Yes, please do that. I saw in London that you could reproduce it in
>>>>> your test, so it should be simple to find the root cause.
>>>>>
>>>>>
>>>>>
>>>>>>        Weird thing: last night I tried more than 5 times in a row with
>>>>>>        UNICAST3 and it never blocked. Could this mean a problem with
>>>>>>        UNICAST3, or was I just lucky?
>>>>>>
>>>>>>
>>>>>> Even though the REBALANCE_CONFIRM command is sent async, the message
>>>>>> is still OOB. I think UNICAST/2/3 should not block any regular
>>>>>> message waiting for the processing of an OOB message, as long as that
>>>>>> message was received, so maybe the problem is in UNICAST2?
>>>>> If the OOB thread added the OOB message, then it will simply pass it
>>>>> up. However, the regular thread needs to wait for gaps in the receiver
>>>>> table to fill, as it doesn't know what type of message will be
>>>>> received (could be regular).
>>>>>
>>>>> As soon as the OOB message has been added to the table, the regular
>>>>> message will get delivered
>>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>>
