[infinispan-dev] Blocking issue in TO State Transfer

Dan Berindei dan.berindei at gmail.com
Tue Feb 26 10:15:31 EST 2013


On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo <pedro at infinispan.org> wrote:

> hi,
>
> I found the blocking problem with the state transfer this morning. It
> happens because of the reordering of a regular and OOB message.
>
> Below is a simplification of what is happening with two nodes:
>
> A: total order broadcasts rebalance_start
>
> B: (incoming thread) delivers rebalance_start
> B: has no segments to request so the rebalance is done
> B: sends async request with rebalance_confirm (unicast #x)
> B: sends the rebalance_start response (unicast #x+1) (the response is a
> regular message)
>
> A: receives rebalance_start response (unicast #x+1)
> A: in UNICAST2, it detects the message is out-of-order and blocks the
> response in the sender window (i.e. the message #x is missing)
> A: receives the rebalance_confirm (unicast #x)
> A: delivers rebalance_confirm. Infinispan blocks this command until all
> the rebalance_start responses are received ==> this causes a deadlock!
> (because the response is blocked in the unicast layer)
>
> Question: can the response to a request always be sent as OOB? (I
> think the answer should be no...)
>
>
We could, if Bela adds the send(Message) method to the Response
interface... and personally I think it would be better to make all
responses OOB (as in JGroups 3.2.x). I don't have any data to back this up,
though...



> My suggestion: when I deliver a rebalance_confirm command (which is sent
> async), can I move it to a thread in async_thread_pool_executor?
>
>
I have a WIP fix for https://issues.jboss.org/browse/ISPN-2825, which should
stop blocking the REBALANCE_CONFIRM commands on the coordinator:
https://github.com/danberindei/infinispan/tree/t_2825_m

I haven't issued a PR yet because I'm still getting a failure in
ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP not
receiving an ACK from itself). I'll let you know when I find out...
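
For reference, the hand-off Pedro suggests above could look roughly like the
sketch below. It is a toy model, not Infinispan or JGroups code: the class
name, the two executors and the latches are all hypothetical stand-ins, and a
bounded wait replaces the real indefinite block so the program terminates. One
thread plays the delivery thread that hands up #x (rebalance_confirm) and then
#x+1 (the rebalance_start response); the confirm handler waits for the
response.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AsyncHandoffSketch {
    /**
     * Simulates delivering rebalance_confirm followed by the rebalance_start
     * response on a single delivery thread. Returns true if the confirm
     * handler saw the response before its (bounded) wait expired.
     */
    static boolean run(boolean handOff) {
        CountDownLatch responseDelivered = new CountDownLatch(1);
        CountDownLatch confirmHandled = new CountDownLatch(1);
        // Hypothetical stand-ins for the delivery thread and the async pool.
        ExecutorService deliveryThread = Executors.newSingleThreadExecutor();
        ExecutorService asyncPool = Executors.newSingleThreadExecutor();

        // The confirm handler blocks until the response has been delivered.
        // The real code would wait indefinitely; 200 ms keeps the demo finite.
        Runnable confirmHandler = () -> {
            try {
                if (responseDelivered.await(200, TimeUnit.MILLISECONDS)) {
                    confirmHandled.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        deliveryThread.execute(() -> {
            // Step 1: hand up rebalance_confirm (#x).
            if (handOff) {
                asyncPool.execute(confirmHandler); // returns immediately
            } else {
                confirmHandler.run();              // blocks the delivery thread
            }
            // Step 2: only now can the response (#x+1) be handed up.
            responseDelivered.countDown();
        });

        try {
            return confirmHandled.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            deliveryThread.shutdownNow();
            asyncPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline handler completed:   " + run(false));
        System.out.println("handed-off handler completed: " + run(true));
    }
}
```

With the handler run inline, the delivery thread never reaches step 2 while the
handler is waiting, so the wait expires; with the hand-off, step 2 runs right
away and the handler completes.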



> Weird thing: last night I tried more than 5 times in a row with UNICAST3
> and it never blocked. Could this mean there is a problem with UNICAST3, or
> was I just lucky?
>
>
Even though the REBALANCE_CONFIRM command is sent async, the message is
still OOB. I think UNICAST/2/3 should not block any regular message waiting
for the processing of an OOB message, as long as that message was received,
so maybe the problem is in UNICAST2?
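
To make the ordering argument concrete, here is a minimal toy model of a
receiver window. It is only a sketch under my own assumptions, not the
UNICAST2 implementation: the Msg and ToyWindow classes are hypothetical, OOB
messages are handed up as soon as they arrive, and regular messages are handed
up in seqno order once the gap before them is filled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ReorderSketch {
    /** Toy message (hypothetical; not the JGroups Message class). */
    static final class Msg {
        final long seqno;
        final boolean oob;
        final String payload;

        Msg(long seqno, boolean oob, String payload) {
            this.seqno = seqno;
            this.oob = oob;
            this.payload = payload;
        }
    }

    /** Toy receiver window for one sender. */
    static final class ToyWindow {
        private long nextToDeliver;
        private final TreeMap<Long, Msg> buffered = new TreeMap<>();
        final List<String> delivered = new ArrayList<>();

        ToyWindow(long firstSeqno) {
            this.nextToDeliver = firstSeqno;
        }

        void receive(Msg m) {
            buffered.put(m.seqno, m);
            if (m.oob) {
                // OOB: handed up immediately, regardless of gaps.
                delivered.add(m.payload);
            }
            // Drain every contiguous seqno; regular messages are handed up
            // here, OOB ones were already handed up on arrival.
            Msg next;
            while ((next = buffered.remove(nextToDeliver)) != null) {
                if (!next.oob) {
                    delivered.add(next.payload);
                }
                nextToDeliver++;
            }
        }
    }

    /** Replays the arrival order from Pedro's scenario on node A. */
    static List<String> demo() {
        long x = 10; // arbitrary seqno for illustration
        ToyWindow a = new ToyWindow(x);
        a.receive(new Msg(x + 1, false, "rebalance_start response")); // buffered: #x missing
        a.receive(new Msg(x, true, "rebalance_confirm"));             // OOB, fills the gap
        return a.delivered;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model the regular response only waits for #x to be received, not for
it to be processed. If the hand-up of rebalance_confirm blocked on the same
thread that later drains the window, the response would never get out, which
is the situation described in the quoted scenario.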



> Any other suggestion?
>
> Cheers,
> Pedro
>
>
>