[infinispan-dev] Blocking issue in TO State Transfer

Pedro Ruivo pedro at infinispan.org
Tue Feb 26 13:35:12 EST 2013


On 02/26/2013 04:31 PM, Bela Ban wrote:
> On 2/26/13 5:14 PM, Pedro Ruivo wrote:
>> So, in this case, the regular message will block until the OOB message is delivered.
>
> No, the regular message should get delivered as soon as the OOB message
> has been *received* (not *delivered*). Unless there are previous regular
> messages from the same sender which are delivered in the same thread,
> and one of them is blocked in application code...
In attachment is part of the log. I only know that the response is 
disappearing between UNICAST2 and the ISPN unmarshaller.

could you please take a look?

the response is being sent and received and I don't understand why ISPN 
is not receive it

Thanks
Pedro
>
>
>> however, the OOB message is being block in the application
>> until the regular message is delivered. And there is no way to pick the
>> regular message from the window list while the OOB is blocked, right?
>> (assuming no more incoming messages)
> This actually should happen, as they're delivered by different threads !
>
>
>> so, if everybody agrees, if I move the OOB message to another thread,
>> everything should work fine...
>>
>> On 02/26/2013 03:50 PM, Bela Ban wrote:
>>> On 2/26/13 4:15 PM, Dan Berindei wrote:
>>>> On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo <pedro at infinispan.org
>>>> <mailto:pedro at infinispan.org>> wrote:
>>>>
>>>>       hi,
>>>>
>>>>       I found the blocking problem with the state transfer this morning.
>>>>       It happens because of the reordering of a regular and OOB message.
>>>>
>>>>       Below, is a simplification of what is happening for two nodes
>>>>
>>>>       A: total order broadcasts rebalance_start
>>>>
>>>>       B: (incoming thread) delivers rebalance_start
>>>>       B: has no segments to request so the rebalance is done
>>>>       B: sends async request with rebalance_confirm (unicast #x)
>>>>       B: sends the rebalance_start response (unicast #x+1) (the response
>>>>       is a regular message)
>>>>
>>>>       A: receives rebalance_start response (unicast #x+1)
>>>>       A: in UNICAST2, it detects the message is out-of-order and blocks
>>>>       the response in the sender window (i.e. the message #x is missing)
>>>>       A: receives the rebalance_confirm (unicast #x)
>>>>       A: delivers rebalance_confirm. Infinispan blocks this command
>>>>       until all the rebalance_start responses are received ==> this
>>>>       originates a deadlock! (because the response is blocked in unicast
>>>>       layer)
>>>>
>>>>       Question: can the request's response message be sent always as
>>>>       OOB? (I think the answer should be no...)
>>>>
>>>>
>>>> We could, if Bela adds the send(Message) method to the Response
>>>> interface...
>>> I created a JIRA yesterday: https://issues.jboss.org/browse/JGRP-1602.
>>> I'm wondering though if you *really* need it, as making all responses
>>> OOB is a bad idea IMO, see below...
>>>
>>>
>>>> and personally I think it would be better to make all responses OOB
>>>> (as in JGroups 3.2.x). I don't have any data to back this up, though...
>>> Intuitively, I think indiscriminatingly marking all responses as OOB
>>> is bad, especially in the light of the async invocation API which will
>>> make all messages non-blocking, at least in the OOB or reg thread pools.
>>>
>>> The code in 3.3 *does* actually copy the flags of the request into the
>>> response, so if the request is async (OOB), so will the response be.
>>> For async RPCs (regular messages), you're not getting any response
>>> anyway, so no worries here...
>>>
>>>
>>>>       My suggestion: when I deliver a rebalance_confirm command (that it
>>>>       is send async), can I move it to a thread in
>>>>       async_thread_pool_executor?
>>>>
>>>>
>>>> I have WIP fix for https://issues.jboss.org/browse/ISPN-2825, which
>>>> should stop blocking the REBALANCE_CONFIRM commands on the
>>>> coordinator: https://github.com/danberindei/infinispan/tree/t_2825_m
>>>>
>>>> I haven't issued a PR yet because I'm still getting a failure in
>>>> ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP
>>>> not receiving an ACK from itself). I'll let you know when I find out...
>>>
>>> Yes, please do that. I saw in London that you could reproduce it in
>>> your test, so it should be simple to find the root cause.
>>>
>>>
>>>
>>>>       Weird thing: last night I tried more than 5x time in a row with
>>>>       UNICAST3 and it never blocks. can this meaning a problem with
>>>>       UNICAST3 or I had just lucky?
>>>>
>>>>
>>>> Even though the REBALANCE_CONFIRM command is sent async, the message
>>>> is still OOB. I think UNICAST/2/3 should not block any regular
>>>> message waiting for the processing of an OOB message, as long as that
>>>> message was received, so maybe the problem is in UNICAST2?
>>> If the OOB thread added the OOB message, then it will simply pass it
>>> up. However, the regular thread needs to wait for gaps in the receiver
>>> table to fill, as it doesn't know what type of message will be
>>> received (could be regular).
>>>
>>> As soon as the OOB message has been added to the table, the regular
>>> message will get delivered
>>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev

-------------- next part --------------
### sending the rebalance_confirm (async)

2013-02-25 19:09:16,920 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] Replication task sending CacheTopologyControlCommand{cache=___defaultcache, type=REBALANCE_CONFIRM, sender=FullSyncWriteSkewTotalOrderTest-NodeB-36894, joinInfo=null, topologyId=1, currentCH=null, pendingCH=null, throwable=null, viewId=1} to single recipient FullSyncWriteSkewTotalOrderTest-NodeA-19450 with response mode GET_NONE
2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.blocks.Request] sending request (id=44)
2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeB-36894 --> DATA(FullSyncWriteSkewTotalOrderTest-NodeA-19450: #7, conn_id=0)
2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeB-36894: sending msg to FullSyncWriteSkewTotalOrderTest-NodeA-19450, src=FullSyncWriteSkewTotalOrderTest-NodeB-36894, headers are RequestCorrelator: id=200, type=REQ, id=44, rsp_expected=false, UNICAST2: DATA, seqno=7, TCP: [channel_name=ISPN]
2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.TCP] dest=127.0.0.1:8100 (137 bytes)

### sending the response (by the same thread while processing the same command)

2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.blocks.RequestCorrelator] sending rsp for 33 to FullSyncWriteSkewTotalOrderTest-NodeA-19450
2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeB-36894 --> DATA(FullSyncWriteSkewTotalOrderTest-NodeA-19450: #8, conn_id=0)
2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeB-36894: sending msg to FullSyncWriteSkewTotalOrderTest-NodeA-19450, src=FullSyncWriteSkewTotalOrderTest-NodeB-36894, headers are RequestCorrelator: id=200, type=RSP, id=33, rsp_expected=false, UNICAST2:DATA, seqno=8, TCP: [channel_name=ISPN]
2013-02-25 19:09:16,921 TRACE (TOA-Delivery-Thread-FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.TCP] dest=127.0.0.1:8100 (82 bytes)

### in the A side, it starts to deliver the response

2013-02-25 19:09:16,922 TRACE (Incoming-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450: received [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src: FullSyncWriteSkewTotalOrderTest-NodeB-36894 (3 headers), size=6 bytes, flags=DONT_BUNDLE|NO_FC], headers are RequestCorrelator: id=200, type=RSP, id=33, rsp_expected=false, UNICAST2: DATA, seqno=8, TCP: [channel_name=ISPN]
2013-02-25 19:09:16,922 TRACE (Incoming-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450 <-- DATA(FullSyncWriteSkewTotalOrderTest-NodeB-36894: #8)

### I have no track of this response afte this point... then, the OOB message with the new request

2013-02-25 19:09:16,922 TRACE (OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450: received [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src: FullSyncWriteSkewTotalOrderTest-NodeB-36894 (3 headers), size=61 bytes, flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER|RSVP], headers are RequestCorrelator: id=200, type=REQ, id=44, rsp_expected=false, UNICAST2: DATA, seqno=7, TCP: [channel_name=ISPN]
2013-02-25 19:09:16,922 TRACE (OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450 <-- DATA(FullSyncWriteSkewTotalOrderTest-NodeB-36894: #7)
2013-02-25 19:09:16,922 TRACE (OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.FC] sender FullSyncWriteSkewTotalOrderTest-NodeB-36894 minus 61 credits, 1999492 remaining
2013-02-25 19:09:16,922 TRACE (OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.blocks.RequestCorrelator] calling (org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher) with request 44
2013-02-25 19:09:16,922 TRACE (OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.infinispan.marshall.jboss.AbstractJBossMarshaller] Start unmarshaller after retrieving marshaller from thread local

### OOB-2 is blocked until RSP for id=33 is delivered... in the meanwhile, I have a lot of Timers be triggered like these: (and the same for STABLE/NAKACK2)

2013-02-25 19:10:15,751 TRACE (Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450 --> STABLE(FullSyncWriteSkewTotalOrderTest-NodeA-19450: 2-2, conn_id=0)
2013-02-25 19:10:15,751 TRACE (Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450: sending msg to FullSyncWriteSkewTotalOrderTest-NodeA-19450, src=FullSyncWriteSkewTotalOrderTest-NodeA-19450, headers are UNICAST2: STABLE, seqno=2, TCP: [channel_name=ISPN]
2013-02-25 19:10:15,751 TRACE (Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450: looping back message [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src: FullSyncWriteSkewTotalOrderTest-NodeA-19450 (2 headers), size=0 bytes, flags=OOB]
2013-02-25 19:10:15,751 TRACE (Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450 --> STABLE(FullSyncWriteSkewTotalOrderTest-NodeB-36894: 6-8, conn_id=0)
2013-02-25 19:10:15,751 TRACE (Timer-5,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450: sending msg to FullSyncWriteSkewTotalOrderTest-NodeB-36894, src=FullSyncWriteSkewTotalOrderTest-NodeA-19450, headers are UNICAST2: STABLE, seqno=6, TCP: [channel_name=ISPN]
2013-02-25 19:10:15,751 TRACE (OOB-8,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeA-19450: received [dst: FullSyncWriteSkewTotalOrderTest-NodeA-19450, src: FullSyncWriteSkewTotalOrderTest-NodeA-19450 (2 headers), size=0 bytes, flags=OOB], headers are UNICAST2: STABLE, seqno=2, TCP: [channel_name=ISPN]
2013-02-25 19:10:15,751 TRACE (OOB-8,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeA-19450 <-- STABLE(FullSyncWriteSkewTotalOrderTest-NodeA-19450: 2-2, conn_id=0)
2013-02-25 19:10:15,782 TRACE (OOB-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.TCP] FullSyncWriteSkewTotalOrderTest-NodeB-36894: received message batch of 1 messages from FullSyncWriteSkewTotalOrderTest-NodeA-19450
2013-02-25 19:10:15,782 TRACE (OOB-1,ISPN,FullSyncWriteSkewTotalOrderTest-NodeB-36894) [org.jgroups.protocols.UNICAST2] FullSyncWriteSkewTotalOrderTest-NodeB-36894 <-- STABLE(FullSyncWriteSkewTotalOrderTest-NodeA-19450: 6-8, conn_id=0)


### timeout occurs! these are the responses received for the request:

2013-02-25 19:13:16,758 TRACE (asyncTransportThread-0,FullSyncWriteSkewTotalOrderTest-NodeA) [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] Responses: 
[sender=FullSyncWriteSkewTotalOrderTest-NodeB-36894, received=false, suspected=false]
[sender=FullSyncWriteSkewTotalOrderTest-NodeA-19450, retval=SuccessfulResponse{responseValue=null} , received=true, suspected=false]

### OOB-2 continues and I have no track in the missing response...

2013-02-25 19:13:16,759 TRACE (OOB-2,ISPN,FullSyncWriteSkewTotalOrderTest-NodeA-19450) [org.infinispan.topology.RebalanceConfirmationCollector] Rebalance confirmation collector 1 at ___de
faultcache received confirmation for FullSyncWriteSkewTotalOrderTest-NodeB-36894, remaining list is [FullSyncWriteSkewTotalOrderTest-NodeA-19450]



More information about the infinispan-dev mailing list