On Tue, Feb 26, 2013 at 12:57 PM, Pedro Ruivo <pedro(a)infinispan.org> wrote:
Hi,
I found the blocking problem with the state transfer this morning. It
happens because a regular message and an OOB message are reordered.
Below is a simplification of what is happening with two nodes:
A: total order broadcasts rebalance_start
B: (incoming thread) delivers rebalance_start
B: has no segments to request so the rebalance is done
B: sends async request with rebalance_confirm (unicast #x)
B: sends the rebalance_start response (unicast #x+1) (the response is a
regular message)
A: receives rebalance_start response (unicast #x+1)
A: UNICAST2 detects that the message is out of order and holds the
response in the sender window (i.e. message #x is missing)
A: receives the rebalance_confirm (unicast #x)
A: delivers rebalance_confirm. Infinispan blocks this command until all
the rebalance_start responses are received ==> this causes a deadlock!
(because the response is blocked in the unicast layer)
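To make the trace above concrete, here is a minimal single-threaded sketch of it. It is not JGroups or Infinispan code: OrderedWindow is a hypothetical stand-in for a per-sender window that delivers strictly in seqno order, the seqno value is arbitrary, and the "would block" condition is modelled as a boolean check rather than a real blocked thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ReorderDeadlockSketch {

    /** Hypothetical in-order window for one sender (an assumption, not the UNICAST2 implementation). */
    static final class OrderedWindow {
        private final TreeMap<Long, String> buffered = new TreeMap<>();
        private long nextToDeliver;

        OrderedWindow(long firstSeqno) {
            this.nextToDeliver = firstSeqno;
        }

        /** Buffers the message and returns everything that is now deliverable, in seqno order. */
        List<String> receive(long seqno, String payload) {
            buffered.put(seqno, payload);
            List<String> deliverable = new ArrayList<>();
            while (buffered.containsKey(nextToDeliver)) {
                deliverable.add(buffered.remove(nextToDeliver));
                nextToDeliver++;
            }
            return deliverable;
        }
    }

    /** Replays the trace on node A; returns true if rebalance_confirm would wait forever. */
    static boolean replay() {
        long x = 10; // arbitrary seqno chosen for illustration
        OrderedWindow window = new OrderedWindow(x);
        boolean responseDelivered = false;

        // A receives the rebalance_start response (#x+1) first: nothing is deliverable yet.
        List<String> first = window.receive(x + 1, "rebalance_start response");
        if (!first.isEmpty()) {
            return false; // the response was not held back, so no deadlock in this model
        }

        // A then receives rebalance_confirm (#x); the window releases #x, then #x+1.
        for (String msg : window.receive(x, "rebalance_confirm")) {
            if (msg.equals("rebalance_confirm") && !responseDelivered) {
                // The real handler would block here waiting for the response,
                // which sits behind it in the same window and is never reached.
                return true;
            }
            if (msg.equals("rebalance_start response")) {
                responseDelivered = true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock: " + replay());
    }
}
```

The point of the sketch is only the dependency cycle: #x waits for #x+1 to be delivered, while #x+1 cannot be delivered until the handling of #x returns.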
Question: can the response to a request always be sent as OOB? (I
think the answer should be no...)
We could, if Bela adds the send(Message) method to the Response
interface... and personally I think it would be better to make all
responses OOB (as in JGroups 3.2.x). I don't have any data to back this up,
though...
My suggestion: when I deliver a rebalance_confirm command (which is sent
async), can I move it to a thread in async_thread_pool_executor?
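A rough sketch of that hand-off idea, under stated assumptions: asyncPool is a plain ExecutorService standing in for async_thread_pool_executor, and a CountDownLatch stands in for "all rebalance_start responses received". None of these names come from Infinispan; the sketch only shows why the delivery thread is no longer stuck.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class AsyncHandoffSketch {

    /** Returns true if rebalance_confirm completes once it is processed off the delivery thread. */
    static boolean run() {
        // Hypothetical stand-in for async_thread_pool_executor.
        ExecutorService asyncPool = Executors.newSingleThreadExecutor();
        // Hypothetical stand-in for "all rebalance_start responses received".
        CountDownLatch responsesReceived = new CountDownLatch(1);
        try {
            // Delivery thread: deliver rebalance_confirm (#x) by handing it to the pool.
            // The pooled task does the waiting; the delivery thread returns immediately.
            Future<Boolean> confirm =
                asyncPool.submit(() -> responsesReceived.await(5, TimeUnit.SECONDS));

            // Delivery thread is free again, so it can deliver the response (#x+1).
            responsesReceived.countDown();

            // The confirm task can now finish instead of blocking forever.
            return confirm.get(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            return false;
        } finally {
            asyncPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("confirm completed: " + run());
    }
}
```

The trade-off is that the blocking wait still exists; it just moves to a pool thread, so the pool has to be large enough that waiting tasks cannot starve it.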
I have a WIP fix for
https://issues.jboss.org/browse/ISPN-2825, which should
stop blocking the REBALANCE_CONFIRM commands on the coordinator:
https://github.com/danberindei/infinispan/tree/t_2825_m
I haven't issued a PR yet because I'm still getting a failure in
ClusterTopologyManagerTest, I think because of a JGroups issue (RSVP not
receiving an ACK from itself). I'll let you know when I find out...
Weird thing: last night I tried more than 5 times in a row with UNICAST3
and it never blocked. Does this mean there is a problem with UNICAST3, or
was I just lucky?
Even though the REBALANCE_CONFIRM command is sent async, the message is
still OOB. I think UNICAST/2/3 should not block any regular message waiting
for the processing of an OOB message, as long as that message was received,
so maybe the problem is in UNICAST2?
Any other suggestions?
Cheers,
Pedro