[
http://jira.jboss.com/jira/browse/JGRP-335?page=all ]
Bela Ban updated JGRP-335:
--------------------------
Description:
2 use cases where we can run into the problem (members A and B).
#1 View change
* A is running, B joins
* B is *not* blocking in FLUSH, A is blocking after START_FLUSH
* A starts the flush
* A returns the new view to B in a JOIN_RSP. This causes B's Channel.connect() to
return
* B sends a unicast message to A, to which A sends a response *in the same thread*
(service STATE_REQ)
* A completes the flush, multicasting a STOP_FLUSH message
* The STATE_REQ at A hangs on FLUSH.down()
* The STOP_FLUSH at A can never unblock FLUSH.down() because it was received *after*
the STATE_REQ from B and is therefore queued behind it!
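The hang in use case #1 can be reproduced with a toy model. This is only a sketch, not JGroups code: the class FlushHangSketch, the method deliverInOrder() and the use of plain strings as messages are assumptions made for illustration. A single receiver thread delivers messages in arrival order, and handling STATE_REQ blocks on a latch that only STOP_FLUSH can open (standing in for FLUSH.down()).

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: names are hypothetical and do not mirror the JGroups API.
public class FlushHangSketch {

    /**
     * Delivers the messages in order on ONE thread. Returns true if the
     * response to STATE_REQ could be sent within the timeout.
     */
    public static boolean deliverInOrder(List<String> incoming, long timeoutMs) {
        CountDownLatch flushGate = new CountDownLatch(1);    // opened by STOP_FLUSH
        CountDownLatch responseSent = new CountDownLatch(1); // set once the response got out
        ExecutorService receiver = Executors.newSingleThreadExecutor();
        receiver.execute(() -> {
            try {
                for (String msg : incoming) {
                    if (msg.equals("STOP_FLUSH")) {
                        flushGate.countDown();
                    } else if (msg.equals("STATE_REQ")) {
                        // The response is sent in the same thread and
                        // blocks here, like FLUSH.down() during a flush.
                        flushGate.await();
                        responseSent.countDown();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            return responseSent.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            receiver.shutdownNow(); // release the stuck receiver thread, if any
        }
    }

    public static void main(String[] args) {
        // STATE_REQ arrives first: STOP_FLUSH is stuck behind it.
        System.out.println(deliverInOrder(Arrays.asList("STATE_REQ", "STOP_FLUSH"), 200));
        // STOP_FLUSH arrives first: no hang.
        System.out.println(deliverInOrder(Arrays.asList("STOP_FLUSH", "STATE_REQ"), 200));
    }
}
```

The timeout is there only so the sketch terminates; in the real scenario the receiver thread would stay blocked indefinitely.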
SOLUTION:
1. Make B block in FLUSH.down() as soon as the client sends the JOIN_REQ to A
2. Make STOP_FLUSH *synchronous*. This means we only return from Channel.connect() (for
example) once every member has ack'ed the STOP_FLUSH. See next issue (state transfer)
for a description of what happens if we don't do this.
#2 State transfer
* A and B are members of the group
* B calls Channel.getState()
* A and B receive a START_FLUSH and start blocking in FLUSH
* State is transferred from A to B
* B multicasts a STOP_FLUSH and *immediately afterwards* sends a *unicast* message
(which can 'pass' multicast messages, as they're unrelated)
* A happens to receive the unicast message *before* the STOP_FLUSH. The unicast blocks
and the STOP_FLUSH, which would unblock it, cannot be delivered
SOLUTION:
1. Same as solution 2 above. If we make the STOP_FLUSH phase synchronous, connect() or
getState() only return once everyone has been unblocked
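A synchronous STOP_FLUSH could look roughly like the following sketch. It is an illustration under assumptions, not the actual FLUSH implementation: the Member class, its blocked flag and stopFlushAndWait() are hypothetical names, and a thread pool stands in for the network. The caller (think of connect() or getState()) returns only after every member has unblocked and ack'ed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: names are hypothetical and do not mirror the JGroups API.
public class SyncStopFlushSketch {

    /** Hypothetical group member that is blocked until it sees STOP_FLUSH. */
    public static final class Member {
        public final String name;
        public volatile boolean blocked = true;

        public Member(String name) { this.name = name; }

        void onStopFlush(CountDownLatch acks) {
            blocked = false;   // unblock first ...
            acks.countDown();  // ... then acknowledge
        }
    }

    /**
     * Sends STOP_FLUSH to all members and waits for all acks.
     * Returns true only if every member ack'ed within the timeout.
     */
    public static boolean stopFlushAndWait(List<Member> members, long timeoutMs) {
        CountDownLatch acks = new CountDownLatch(members.size());
        ExecutorService network = Executors.newCachedThreadPool(); // stands in for the transport
        try {
            for (Member m : members) {
                network.execute(() -> m.onStopFlush(acks));
            }
            return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            network.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Member> group = Arrays.asList(new Member("A"), new Member("B"));
        boolean done = stopFlushAndWait(group, 1000);
        System.out.println("all acks received: " + done);
    }
}
```

Because the caller waits for the acks, a unicast sent after the call returns can no longer arrive at a member that is still blocked.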
LONG TERM SOLUTION:
* The much better solution of course is to make the STOP_FLUSH message *out-of-band*,
so it can be delivered in parallel to other messages, and is not blocked (e.g.) by the
unicast in the queue. So even if the unicast message was blocked waiting for STOP_FLUSH,
once STOP_FLUSH has been received, it will be delivered, causing the unicast to unblock
* Once we have this solution in place (2.5, threadless stack and out-of-band
messages), we can revert the STOP_FLUSH to only use 1 phase rather than 2
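The out-of-band idea can be sketched with two delivery lanes. Again this is a toy model under assumptions, not the planned 2.5 design: OobStopFlushSketch, deliver() and the stopFlushOutOfBand switch are hypothetical names. Regular messages share one in-order thread, while an out-of-band STOP_FLUSH is handed to a separate thread and can overtake the blocked unicast.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: names are hypothetical and do not mirror the JGroups API.
public class OobStopFlushSketch {

    /**
     * Delivers messages in arrival order. If stopFlushOutOfBand is true,
     * STOP_FLUSH bypasses the regular in-order lane. Returns true if the
     * blocked unicast was released within the timeout.
     */
    public static boolean deliver(List<String> incoming, boolean stopFlushOutOfBand, long timeoutMs) {
        CountDownLatch flushGate = new CountDownLatch(1);     // opened by STOP_FLUSH
        CountDownLatch unicastDone = new CountDownLatch(1);   // set when the unicast unblocks
        ExecutorService regular = Executors.newSingleThreadExecutor(); // in-order lane
        ExecutorService oob = Executors.newCachedThreadPool();         // parallel lane
        try {
            for (String msg : incoming) {
                Runnable handler = () -> {
                    if (msg.equals("STOP_FLUSH")) {
                        flushGate.countDown();
                    } else {
                        try {
                            flushGate.await(); // unicast blocks until the flush ends
                            unicastDone.countDown();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                };
                if (stopFlushOutOfBand && msg.equals("STOP_FLUSH")) {
                    oob.execute(handler);
                } else {
                    regular.execute(handler);
                }
            }
            return unicastDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            regular.shutdownNow();
            oob.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<String> arrival = Arrays.asList("UNICAST", "STOP_FLUSH");
        System.out.println("in-band STOP_FLUSH released unicast: " + deliver(arrival, false, 200));
        System.out.println("out-of-band STOP_FLUSH released unicast: " + deliver(arrival, true, 200));
    }
}
```

With the same arrival order, only the out-of-band variant releases the unicast, which is why a single-phase STOP_FLUSH would suffice once such delivery is available.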
Priority: Blocker (was: Major)
We need to do an extensive review of FLUSH, and need to describe its precise semantics, as
this issue blocks the release of 2.4.
The document needs to contain, for view changes and state transfer:
- what are the interactions between the members, e.g. START_FLUSH, STOP_FLUSH,
FLUSH_COMPLETED etc. Ideally, add this as a graphical interaction diagram!
- when do we receive a block(), when an unblock(), also with respect to view changes and
state transfer callbacks *on all members*
Hangs with FLUSH
----------------
Key: JGRP-335
URL:
http://jira.jboss.com/jira/browse/JGRP-335
Project: JGroups
Issue Type: Bug
Affects Versions: 2.3 SP1
Reporter: Bela Ban
Assigned To: Vladimir Blagojevic
Priority: Blocker
Fix For: 2.4