[infinispan-dev] Blocking issue in TO State Transfer

Bela Ban bban at redhat.com
Wed Feb 27 04:13:20 EST 2013


OK, here's what happens:

- A's receiver table for B is at #6, which means that the next message
from B must be #7
- A receives B#8 (regular message from B)
- A adds B#8 to B's receiver table, but doesn't deliver it (not OOB, and 
not #7)
- A receives OOB message B#7 from B
- The OOB thread delivers B#7 immediately
- Infinispan blocks on B#7
- Unless another message from B is received, B#8 will *not* get 
delivered: as you can see in the code below, the OOB thread would check 
*after* delivering B#7 whether there are more messages to be delivered, 
but because it is blocked by Infinispan, it cannot deliver B#8.

This is one of the rare cases where an OOB thread gets to deliver 
regular messages.
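
The sequence above can be reproduced with a small toy model. This is
only a sketch under assumptions, not the UNICAST2 code: the class
ReceiverTableSketch, its methods and the callback are hypothetical
names, and the "blocking" is modelled by looking at the table from
inside the delivery callback rather than by a real blocked thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.function.LongConsumer;

// Toy model of A's receiver table for one sender (B). Hypothetical names,
// NOT the JGroups implementation.
class ReceiverTableSketch {
    // seqno -> true if the message is OOB (and so was already passed up)
    private final TreeMap<Long, Boolean> table = new TreeMap<>();
    private long highestDelivered;
    private final LongConsumer application; // stands in for up_prot.up()

    ReceiverTableSketch(long highestDelivered, LongConsumer application) {
        this.highestDelivered = highestDelivered;
        this.application = application;
    }

    // Called by whichever thread received the message.
    void receive(long seqno, boolean oob) {
        if (seqno <= highestDelivered || table.containsKey(seqno))
            return; // duplicate
        table.put(seqno, oob);
        if (oob)
            application.accept(seqno); // passed up immediately, may block
        drain(); // only reached once the OOB delivery has returned
    }

    // Removes messages in seqno order; OOB ones were delivered on receipt.
    private void drain() {
        while (table.containsKey(highestDelivered + 1)) {
            boolean alreadyDelivered = table.remove(++highestDelivered);
            if (!alreadyDelivered)
                application.accept(highestDelivered);
        }
    }

    boolean isQueued(long seqno) {
        return table.containsKey(seqno);
    }

    // Replays the scenario: table at #6, regular B#8 arrives, then OOB B#7.
    static List<String> run() {
        List<String> events = new ArrayList<>();
        ReceiverTableSketch[] ref = new ReceiverTableSketch[1];
        ref[0] = new ReceiverTableSketch(6, seqno -> {
            events.add("deliver B#" + seqno);
            if (seqno == 7) // what the application sees while handling B#7
                events.add("B#8 still queued: " + ref[0].isQueued(8));
        });
        ref[0].receive(8, false); // gap at #7, so B#8 is only stored
        ref[0].receive(7, true);  // OOB: delivered at once, then drain
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this model B#8 is delivered only after the callback for B#7 returns.
If that callback waited for B#8, as Infinispan does here, drain() would
never be reached.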

The root cause is that Infinispan blocks on an OOB message, but OOB 
messages should never block! This is another reason why an Infinispan 
application thread pool makes a lot of sense!
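
A minimal sketch of that idea, assuming a plain java.util.concurrent
pool: the delivering thread hands the potentially blocking handler to an
application pool and returns at once, so it is free to deliver the
regular message the handler is waiting for. The class OobHandoffSketch,
the pool size and the latch are illustrative assumptions, not Infinispan
or JGroups code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of handing a blocking handler to an
// application thread pool instead of running it on the delivering thread.
class OobHandoffSketch {

    // Returns true if the blocking handler completed (no deadlock).
    static boolean run() {
        ExecutorService appPool = Executors.newFixedThreadPool(2);
        // Stands in for "the response (regular message B#8) has arrived".
        CountDownLatch responseReceived = new CountDownLatch(1);
        try {
            // Handler for the OOB message B#7: blocks until the response
            // is in, with a timeout so the sketch can never hang.
            Callable<Boolean> confirmHandler =
                    () -> responseReceived.await(5, TimeUnit.SECONDS);

            // Delivering thread: hand off and return immediately.
            Future<Boolean> confirm = appPool.submit(confirmHandler);

            // The same thread can now go on to deliver B#8.
            responseReceived.countDown();

            return confirm.get(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("handler did not complete", e);
        } finally {
            appPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("handler completed: " + run());
    }
}
```

Had the delivering thread called the handler directly, countDown() would
only be reached after the await() had timed out.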


         // An OOB message is passed up immediately. Later, when remove() is
         // called, we discard it. This affects ordering!
         // http://jira.jboss.com/jira/browse/JGRP-377
         if(msg.isFlagSet(Message.OOB) && added) {
             try {
                 up_prot.up(evt);
             }
             catch(Throwable t) {
                 log.error("couldn't deliver OOB message " + msg, t);
             }
         }

         // The OOB thread never gets here, as it is blocked in
         // up_prot.up() by Infinispan.

         final AtomicBoolean processing=win.getProcessing();
         if(!processing.compareAndSet(false, true))
             return true;



On 2/26/13 7:35 PM, Pedro Ruivo wrote:
> On 02/26/2013 04:31 PM, Bela Ban wrote:
>> On 2/26/13 5:14 PM, Pedro Ruivo wrote:
>>> So, in this case, the regular message will block until the OOB 
>>> message is delivered.
>>
>> No, the regular message should get delivered as soon as the OOB message
>> has been *received* (not *delivered*). Unless there are previous regular
>> messages from the same sender which are delivered in the same thread,
>> and one of them is blocked in application code...
> Attached is part of the log. I only know that the response is 
> disappearing between UNICAST2 and the ISPN unmarshaller.
>
> could you please take a look?
>
> the response is being sent and received and I don't understand why 
> ISPN is not receiving it
>
> Thanks
> Pedro
>>
>>
>>> however, the OOB message is being blocked in the application
>>> until the regular message is delivered. And there is no way to pick the
>>> regular message from the window list while the OOB is blocked, right?
>>> (assuming no more incoming messages)
>> This actually should happen, as they're delivered by different threads !
>>
>>
>>> so, if everybody agrees, if I move the OOB message to another thread,
>>> everything should work fine...
>>>
>>> On 02/26/2013 03:50 PM, Bela Ban wrote:
>>>> On 2/26/13 4:15 PM, Dan Berindei wrote:
>>>>> On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo
>>>>> <pedro at infinispan.org> wrote:
>>>>>
>>>>>       hi,
>>>>>
>>>>>       I found the blocking problem with the state transfer this 
>>>>> morning.
>>>>>       It happens because of the reordering of a regular and OOB 
>>>>> message.
>>>>>
>>>>>       Below, is a simplification of what is happening for two nodes
>>>>>
>>>>>       A: total order broadcasts rebalance_start
>>>>>
>>>>>       B: (incoming thread) delivers rebalance_start
>>>>>       B: has no segments to request so the rebalance is done
>>>>>       B: sends async request with rebalance_confirm (unicast #x)
>>>>>       B: sends the rebalance_start response (unicast #x+1) (the 
>>>>> response
>>>>>       is a regular message)
>>>>>
>>>>>       A: receives rebalance_start response (unicast #x+1)
>>>>>       A: in UNICAST2, it detects the message is out-of-order and 
>>>>> blocks
>>>>>       the response in the sender window (i.e. the message #x is 
>>>>> missing)
>>>>>       A: receives the rebalance_confirm (unicast #x)
>>>>>       A: delivers rebalance_confirm. Infinispan blocks this command
>>>>>       until all the rebalance_start responses are received ==> this
>>>>>       originates a deadlock! (because the response is blocked in 
>>>>> unicast
>>>>>       layer)
>>>>>
>>>>>       Question: can the request's response message be sent always as
>>>>>       OOB? (I think the answer should be no...)
>>>>>
>>>>>
>>>>> We could, if Bela adds the send(Message) method to the Response
>>>>> interface...
>>>> I created a JIRA yesterday: https://issues.jboss.org/browse/JGRP-1602.
>>>> I'm wondering though if you *really* need it, as making all responses
>>>> OOB is a bad idea IMO, see below...
>>>>
>>>>
>>>>> and personally I think it would be better to make all responses OOB
>>>>> (as in JGroups 3.2.x). I don't have any data to back this up, 
>>>>> though...
>>>> Intuitively, I think indiscriminately marking all responses as OOB
>>>> is bad, especially in the light of the async invocation API which will
>>>> make all messages non-blocking, at least in the OOB or reg thread 
>>>> pools.
>>>>
>>>> The code in 3.3 *does* actually copy the flags of the request into the
>>>> response, so if the request is async (OOB), so will the response be.
>>>> For async RPCs (regular messages), you're not getting any response
>>>> anyway, so no worries here...
>>>>
>>>>
>>>>>       My suggestion: when I deliver a rebalance_confirm command 
>>>>>       (which is sent async), can I move it to a thread in
>>>>>       async_thread_pool_executor?
>>>>>
>>>>>
>>>>> I have WIP fix for https://issues.jboss.org/browse/ISPN-2825, which
>>>>> should stop blocking the REBALANCE_CONFIRM commands on the
>>>>> coordinator: https://github.com/danberindei/infinispan/tree/t_2825_m
>>>>>
>>>>> I haven't issued a PR yet because I'm still getting a failure in
>>>>> ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP
>>>>> not receiving an ACK from itself). I'll let you know when I find 
>>>>> out...
>>>>
>>>> Yes, please do that. I saw in London that you could reproduce it in
>>>> your test, so it should be simple to find the root cause.
>>>>
>>>>
>>>>
>>>>>       Weird thing: last night I tried more than 5 times in a row with
>>>>>       UNICAST3 and it never blocked. Does this mean a problem with
>>>>>       UNICAST2, or was I just lucky?
>>>>>
>>>>>
>>>>> Even though the REBALANCE_CONFIRM command is sent async, the message
>>>>> is still OOB. I think UNICAST/2/3 should not block any regular
>>>>> message waiting for the processing of an OOB message, as long as that
>>>>> message was received, so maybe the problem is in UNICAST2?
>>>> If the OOB thread added the OOB message, then it will simply pass it
>>>> up. However, the regular thread needs to wait for gaps in the receiver
>>>> table to fill, as it doesn't know what type of message will be
>>>> received (could be regular).
>>>>
>>>> As soon as the OOB message has been added to the table, the regular
>>>> message will get delivered
>>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)


