[jboss-jira] [JBoss JIRA] Commented: (JGRP-335) Hangs with FLUSH
Bela Ban (JIRA)
jira-events at jboss.com
Tue Oct 3 04:19:41 EDT 2006
[ http://jira.jboss.com/jira/browse/JGRP-335?page=comments#action_12344516 ]
Bela Ban commented on JGRP-335:
-------------------------------
The unit tests are:
#1: FlushTest.testJoinFollowedByUnicast()
#2: FlushTest.testStateTransferFollowedByUnicast()
Note that the second test will probably pass most of the time, because timing is critical here and the chance that the unicast is received before the STOP_FLUSH is very small. Test #1, however, will almost always fail, because the joining member can send a unicast before STOP_FLUSH has been received!
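The ordering problem behind both tests can be reproduced outside JGroups with a toy model. The sketch below is not the JGroups API: FlushHangDemo, Msg and deliverInOrder are hypothetical names, the CountDownLatch stands in for the blocking in FLUSH.down(), and a timeout replaces the real (unbounded) hang so that the example terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model of member A (hypothetical, not JGroups code):
// a single thread delivers messages strictly in arrival order.
class FlushHangDemo {
    enum Msg { UNICAST, STOP_FLUSH }

    // Returns true if the unicast handler managed to send its response,
    // false if it was still blocked when the timeout expired (the "hang").
    static boolean deliverInOrder(long timeoutMs, Msg... arrivalOrder) {
        CountDownLatch flushGate = new CountDownLatch(1); // closed, as after START_FLUSH
        boolean[] responded = {false};
        Thread delivery = new Thread(() -> {
            try {
                for (Msg m : arrivalOrder) {
                    if (m == Msg.STOP_FLUSH) {
                        flushGate.countDown(); // unblocks senders
                    } else {
                        // The response is sent in the delivery thread itself,
                        // so it waits on the gate (stand-in for FLUSH.down()).
                        if (!flushGate.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                            return; // in the real stack this wait never ends
                        }
                        responded[0] = true;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        delivery.start();
        try {
            delivery.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return responded[0];
    }
}
```

With the arrival order UNICAST, STOP_FLUSH the method returns false, because the STOP_FLUSH sits behind the blocked unicast; with the order reversed it returns true.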
> Hangs with FLUSH
> ----------------
>
> Key: JGRP-335
> URL: http://jira.jboss.com/jira/browse/JGRP-335
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.3 SP1
> Reporter: Bela Ban
> Assigned To: Bela Ban
> Fix For: 2.4
>
>
> 2 use cases where we can run into the problem (members A and B).
> #1 View change
> * A is running, B joins
> * B is *not* blocking in FLUSH, A is blocking after START_FLUSH
> * A starts the flush
> * A returns the new view to B in a JOIN_RSP. This causes B's Channel.connect() to return
> * B sends a unicast message to A, to which A sends a response *in the same thread* (service STATE_REQ)
> * A completes the flush, multicasting a STOP_FLUSH message
> * The STATE_REQ at A hangs on FLUSH.down()
> * The STOP_FLUSH at A can never unblock FLUSH.down() because it was received *after* the STATE_REQ from B!
> SOLUTION:
> 1. Make B block in FLUSH.down() as soon as the client sends the JOIN_REQ to A
> 2. Make STOP_FLUSH *synchronous*. This means we only return from Channel.connect() (for example) once every member has ack'ed the STOP_FLUSH. See next issue (state transfer) for a description of what happens if we don't do this.
> #2 State transfer
> * A and B are members of the group
> * B calls Channel.getState()
> * A and B receive a START_FLUSH and start to block in FLUSH
> * State is transferred from A to B
> * B multicasts a STOP_FLUSH and *immediately afterwards* sends a *unicast* message (which can 'pass' multicast messages, as they're unrelated)
> * A happens to receive the unicast message *before* the STOP_FLUSH. The unicast blocks and the STOP_FLUSH, which would unblock it, cannot be delivered
> SOLUTION:
> 1. Same as solution 2 above. If we make the STOP_FLUSH phase synchronous, connect() or getState() only return once everyone has been unblocked
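One way to picture the synchronous STOP_FLUSH phase is an ack collector that the caller waits on before connect() or getState() returns. This is only a sketch under assumptions: StopFlushAcks, ack() and awaitAll() are hypothetical names and not part of JGroups, and members are identified by plain strings.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: tracks which members have ack'ed the STOP_FLUSH.
class StopFlushAcks {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final CountDownLatch allAcked;

    StopFlushAcks(Set<String> members) {
        pending.addAll(members);
        allAcked = new CountDownLatch(members.size());
    }

    // Called when a member reports that it has processed STOP_FLUSH.
    void ack(String member) {
        // Duplicate acks and acks from unknown members are ignored.
        if (pending.remove(member)) {
            allAcked.countDown();
        }
    }

    // The caller (the stand-in for connect()/getState()) returns only
    // once this yields true, i.e. once every member has been unblocked.
    boolean awaitAll(long timeoutMs) {
        try {
            return allAcked.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because the caller does not regain control until every ack is in, it cannot send a unicast that overtakes the STOP_FLUSH at another member.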
> LONG TERM SOLUTION:
> * The much better solution of course is to make the STOP_FLUSH message *out-of-band*, so it can be delivered in parallel to other messages, and is not blocked (e.g.) by the unicast in the queue. So even if the unicast message was blocked waiting for STOP_FLUSH, once STOP_FLUSH has been received, it will be delivered, causing the unicast to unblock
> * Once we have this solution in place (2.5, threadless stack and out-of-band messages), we can revert the STOP_FLUSH to only use 1 phase rather than 2
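The effect of out-of-band delivery can be illustrated with two executors: one thread for regular in-order messages and a separate pool for out-of-band ones. This is a rough model and not the JGroups 2.5 design: OobStopFlushDemo and unicastCompletes are hypothetical names, the latch stands in for FLUSH.down(), and the timeout stands in for an indefinite hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model: the unicast always arrives first and blocks,
// the STOP_FLUSH arrives second.
class OobStopFlushDemo {
    static boolean unicastCompletes(boolean stopFlushIsOob, long timeoutMs) {
        CountDownLatch flushGate = new CountDownLatch(1); // closed by START_FLUSH
        CountDownLatch responded = new CountDownLatch(1);
        ExecutorService regular = Executors.newSingleThreadExecutor(); // in-order delivery
        ExecutorService oob = Executors.newCachedThreadPool();         // parallel delivery
        try {
            // Unicast handler: blocks in the delivery thread until the flush ends.
            regular.execute(() -> {
                try {
                    if (flushGate.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                        responded.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // STOP_FLUSH: queued behind the unicast unless it is out-of-band.
            (stopFlushIsOob ? oob : regular).execute(flushGate::countDown);
            return responded.await(2 * timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            regular.shutdownNow();
            oob.shutdownNow();
        }
    }
}
```

When the STOP_FLUSH goes through the regular queue, the unicast handler times out before the STOP_FLUSH is ever processed; when it is delivered out-of-band, the gate opens while the unicast is still waiting and the unicast then completes.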