Vladimir Blagojevic resolved JGRP-335.
--------------------------------------
Resolution: Done
Should be resolved now. Tested all flush tests with hundreds of runs - 100% passing.
Hangs with FLUSH
----------------
Key: JGRP-335
URL: http://jira.jboss.com/jira/browse/JGRP-335
Project: JGroups
Issue Type: Bug
Affects Versions: 2.3 SP1
Reporter: Bela Ban
Assigned To: Vladimir Blagojevic
Priority: Blocker
Fix For: 2.4
2 use cases where we can run into the problem (members A and B).
#1 View change
* A is running, B joins
* B is *not* blocking in FLUSH, A is blocking after START_FLUSH
* A starts the flush
* A returns the new view to B in a JOIN_RSP. This causes B's Channel.connect() to
return
* B sends a unicast message (a STATE_REQ) to A, which A services by sending a
response *in the same thread*
* A completes the flush, multicasting a STOP_FLUSH message
* The STATE_REQ at A hangs on FLUSH.down()
* The STOP_FLUSH at A can never unblock FLUSH.down() because it was received *after*
the STATE_REQ from B and is queued behind it!
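
The hang above can be sketched as a tiny single-threaded simulation. This is only an illustration, not JGroups code: the class FlushOrderingSketch, the method deliver() and the string message names are made up for the example, and the model assumes one delivery thread that handles messages strictly in arrival order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical model, not the JGroups API: one in-order delivery thread at member A.
class FlushOrderingSketch {

    /** Returns the messages that could be fully handled before delivery stalled. */
    static List<String> deliver(List<String> incoming) {
        Deque<String> queue = new ArrayDeque<>(incoming);
        List<String> delivered = new ArrayList<>();
        boolean flushBlocked = true; // A is blocking after START_FLUSH

        while (!queue.isEmpty()) {
            String msg = queue.peek();
            if (msg.equals("STOP_FLUSH")) {
                flushBlocked = false; // STOP_FLUSH unblocks FLUSH.down()
            } else if (flushBlocked) {
                // The handler has to send its response down through FLUSH on this
                // same thread, so it waits here and nothing behind it is delivered.
                break;
            }
            delivered.add(queue.poll());
        }
        return delivered;
    }

    public static void main(String[] args) {
        // STATE_REQ arrives first: the delivery thread stalls, nothing completes.
        System.out.println(deliver(Arrays.asList("STATE_REQ", "STOP_FLUSH")));
        // STOP_FLUSH arrives first: both messages are handled.
        System.out.println(deliver(Arrays.asList("STOP_FLUSH", "STATE_REQ")));
    }
}
```

The first call prints an empty list, the second prints both messages, which is the difference between the hang and the healthy case.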
SOLUTION:
1. Make B block in FLUSH.down() as soon as the client sends the JOIN_REQ to A
2. Make STOP_FLUSH *synchronous*. This means we only return from Channel.connect()
(for example) once every member has ack'ed the STOP_FLUSH. See next issue (state
transfer) for a description of what happens if we don't do this.
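
A rough sketch of what "synchronous" means in point 2, assuming an ack-counting scheme. Nothing here is the actual FLUSH implementation: SyncStopFlushSketch, Member and stopFlushAndWait() are hypothetical names, and the thread pool merely stands in for the network.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the JGroups API: the caller returns only after
// every member has ack'ed the STOP_FLUSH (or the timeout expires).
class SyncStopFlushSketch {

    /** Stand-in for a group member that unblocks on STOP_FLUSH and then acks. */
    static class Member {
        final String name;
        final boolean responsive;     // false simulates a member that never acks
        volatile boolean blocked = true;

        Member(String name, boolean responsive) {
            this.name = name;
            this.responsive = responsive;
        }

        void onStopFlush(CountDownLatch acks) {
            if (!responsive) {
                return;
            }
            blocked = false;          // unblock first ...
            acks.countDown();         // ... then ack
        }
    }

    /** Sends STOP_FLUSH to all members; true only if all of them ack'ed in time. */
    static boolean stopFlushAndWait(List<Member> members, long timeoutMs) {
        CountDownLatch acks = new CountDownLatch(members.size());
        ExecutorService network = Executors.newCachedThreadPool();
        try {
            for (Member m : members) {
                network.execute(() -> m.onStopFlush(acks));
            }
            return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            network.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Member> group = Arrays.asList(new Member("A", true), new Member("B", true));
        // Only after this returns true would connect() or getState() return.
        System.out.println(stopFlushAndWait(group, 2000));
    }
}
```

In this model, connect() or getState() would return only after stopFlushAndWait() reports that everyone is unblocked, so a message sent right afterwards cannot arrive at a member that is still blocking.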
#2 State transfer
* A and B are members of the group
* B calls Channel.getState()
* A and B receive a START_FLUSH and start to block in FLUSH
* State is transferred from A to B
* B multicasts a STOP_FLUSH and *immediately afterwards* sends a *unicast* message
(which can 'pass' multicast messages, as they're unrelated)
* A happens to receive the unicast message *before* the STOP_FLUSH. The unicast
blocks and the STOP_FLUSH, which would unblock it, cannot be delivered
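
To see why the unicast can 'pass' the multicast, it may help to enumerate the arrival orders that are legal when ordering is only guaranteed within each stream. The sketch below is a generic interleaving enumeration, not anything taken from JGroups; ArrivalOrderSketch and interleavings() are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: multicast and unicast streams are each FIFO,
// but nothing orders a message in one stream against the other.
class ArrivalOrderSketch {

    /** All merges of a and b that keep the internal order of each list. */
    static List<List<String>> interleavings(List<String> a, List<String> b) {
        List<List<String>> out = new ArrayList<>();
        merge(a, 0, b, 0, new ArrayList<>(), out);
        return out;
    }

    private static void merge(List<String> a, int i, List<String> b, int j,
                              List<String> current, List<List<String>> out) {
        if (i == a.size() && j == b.size()) {
            out.add(new ArrayList<>(current));
            return;
        }
        if (i < a.size()) {               // next message comes from stream a
            current.add(a.get(i));
            merge(a, i + 1, b, j, current, out);
            current.remove(current.size() - 1);
        }
        if (j < b.size()) {               // next message comes from stream b
            current.add(b.get(j));
            merge(a, i, b, j + 1, current, out);
            current.remove(current.size() - 1);
        }
    }

    public static void main(String[] args) {
        List<String> multicasts = Arrays.asList("STOP_FLUSH");
        List<String> unicasts = Arrays.asList("UNICAST");
        // Both orders are possible at A; the second one is the problematic case.
        System.out.println(interleavings(multicasts, unicasts));
    }
}
```

This prints both [STOP_FLUSH, UNICAST] and [UNICAST, STOP_FLUSH]; the second order is the one that leads to the hang described above.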
SOLUTION:
1. Same as solution 2 above. If we make the STOP_FLUSH phase synchronous, connect() or
getState() only return once everyone has been unblocked
LONG TERM SOLUTION:
* The much better solution of course is to make the STOP_FLUSH message *out-of-band*,
so it can be delivered in parallel to other messages, and is not blocked (e.g.) by the
unicast in the queue. So even if the unicast message was blocked waiting for STOP_FLUSH,
once STOP_FLUSH has been received, it will be delivered, causing the unicast to unblock
* Once we have this solution in place (2.5, threadless stack and out-of-band
messages), we can revert the STOP_FLUSH to only use 1 phase rather than 2
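
The out-of-band idea can be sketched with two executors: one single thread for regular in-order delivery and a separate one for out-of-band messages. This is a toy model under those assumptions, not the planned 2.5 design; OobStopFlushSketch and unicastCompletes() are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the JGroups API: compares in-band and
// out-of-band delivery of STOP_FLUSH when a unicast is already blocked.
class OobStopFlushSketch {

    /** True if the blocked unicast gets unblocked within the timeout. */
    static boolean unicastCompletes(boolean stopFlushOutOfBand, long timeoutMs) {
        CountDownLatch flushOpen = new CountDownLatch(1);
        CountDownLatch unicastDone = new CountDownLatch(1);
        ExecutorService regular = Executors.newSingleThreadExecutor(); // in-order delivery
        ExecutorService oob = Executors.newSingleThreadExecutor();     // parallel delivery
        try {
            // The unicast arrives first and blocks until the flush is stopped.
            regular.execute(() -> {
                try {
                    flushOpen.await();
                    unicastDone.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // STOP_FLUSH arrives second, either queued behind the unicast
            // or delivered in parallel on the out-of-band thread.
            Runnable stopFlush = flushOpen::countDown;
            (stopFlushOutOfBand ? oob : regular).execute(stopFlush);
            return unicastDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            regular.shutdownNow(); // interrupts the unicast if it is still blocked
            oob.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("in-band:     " + unicastCompletes(false, 200));
        System.out.println("out-of-band: " + unicastCompletes(true, 2000));
    }
}
```

With in-band delivery the STOP_FLUSH sits behind the blocked unicast and the call times out; with out-of-band delivery it is handled in parallel and the unicast unblocks.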