Guys,
Vladimir and I found a problem today with using FLUSH in a JBC cache.
Following is a description of the issue and some proposed solutions.
Comments are welcome.
Please see docs/design/FLUSH.txt in JGroups for background info on how
FLUSH works.
A) We have a problem in that the FLUSH protocol makes the decision to
shut off the ability to pass messages down the channel independently at
each node. The protocol doesn't include anything at the JGroups level
to readily support coordination between nodes as to when to shut off
down messages. But, JBC needs coordination since it needs to make RPC
calls around the cluster (e.g. commit()) as part of how it handles
FLUSH.
Basically, when the FLUSH protocol on a node receives a message telling
it to START_FLUSH, it calls block() on the JBC instance. JBC does what
it needs to do, then returns from block(). Following the return from
block() the FLUSH protocol in that channel then begins blocking any
further down() messages.
The problem is as follows. Take a 2-node REPL_SYNC cluster, A and B, where
A is just starting up and thus initiates a FLUSH:
1) JBC on B has tx in progress, just starting the 2PC. Sends out the
prepare().
2) A sends out a START_FLUSH message.
3) A gets START_FLUSH, calls block() on JBC.
4) JBC on A is new, doesn't have much going on, very quickly returns
from block(). A will no longer pass *down* any messages below FLUSH.
5) A gets the prepare() (no problem, FLUSH doesn't block up messages,
just down messages.)
6) A executes the prepare(), but can't send the response to B because
FLUSH is blocking the channel.
7) B gets the START_FLUSH, calls block() on JBC.
8) JBC B doesn't immediately return from block() as it is giving the
prepare() some time to complete (avoid unnecessary tx rollback). But
prepare() won't complete because A's channel is blocking the RPC
response!! Eventually JBC B's block() impl will have to roll back the
tx.
Basically you have a race condition between calls to block() and
prepare() calls, and can have different winners on different nodes.
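To make the sequence concrete, here is a minimal single-threaded sketch that replays steps 1-8. It is only a model: the Node class, its inbox list and the message strings are hypothetical stand-ins, not JGroups or JBC classes. The one rule it encodes is the one described above: once block() returns on a node, that node's down messages are held back, while up messages still arrive.

```java
import java.util.ArrayList;
import java.util.List;

class FlushRaceSketch {
    /** Hypothetical stand-in for one node's channel plus FLUSH layer. */
    static final class Node {
        final String name;
        boolean downBlocked = false;                  // true once block() has returned
        final List<String> inbox = new ArrayList<>(); // messages that arrived (up path)

        Node(String name) { this.name = name; }

        /** Models "JBC returned from block()": FLUSH now blocks down messages. */
        void blockReturned() { downBlocked = true; }

        /** Returns false if this node's FLUSH layer holds the message back. */
        boolean sendTo(Node target, String msg) {
            if (downBlocked) return false;            // down path is blocked
            target.inbox.add(name + ":" + msg);       // up path is never blocked
            return true;
        }
    }

    /** Replays the scenario; returns true if A's prepare response reached B. */
    static boolean replay() {
        Node a = new Node("A");
        Node b = new Node("B");
        b.sendTo(a, "prepare");                       // 1) B starts the 2PC
        a.blockReturned();                            // 3-4) A returns from block() at once
        boolean gotPrepare = a.inbox.contains("B:prepare"); // 5) A still receives it
        return gotPrepare && a.sendTo(b, "prepare-response"); // 6) response is held back
    }

    public static void main(String[] args) {
        System.out.println(replay()
                ? "prepare completed"
                : "prepare response blocked; B must eventually roll back");
    }
}
```

Run as written, replay() returns false: A sees the prepare() but its response never leaves the node, which is the situation B is stuck waiting on in step 8.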
B) A solution we discussed, rejected and then came back to this evening
(please read FLUSH.txt to understand the change we're discussing):
Channel does not block down messages when block() returns. Rather it
just sends out a FLUSH_OK message (see FLUSH.txt). It shouldn't
initiate any new cluster activity (e.g. a prepare()) after sending
FLUSH_OK, but it can respond to RPC calls. When it gets a FLUSH_OK from
all the other members, it then blocks down messages and multicasts a
FLUSH_COMPLETED to the cluster.
Differences from the current FLUSH impl:
1) Node doesn't begin blocking down messages before sending FLUSH_OK.
2) Node begins blocking down messages before sending FLUSH_COMPLETED.
3) Node multicasts FLUSH_COMPLETED, rather than unicasting to the node
that initiated the FLUSH.
4) Nodes regard the FLUSH_COMPLETED as the last message from another
node, rather than the FLUSH_OK.
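For illustration only, the proposed ordering could be modelled per node roughly as below. This is a sketch of the idea, not a patch to the FLUSH protocol: the class, the Phase enum and the method names are all made up for the example, and members are identified by plain strings.

```java
import java.util.HashSet;
import java.util.Set;

class ProposedFlushSketch {
    enum Phase { RUNNING, FLUSH_OK_SENT, DOWN_BLOCKED }

    private final Set<String> others;                  // all other members
    private final Set<String> okFrom = new HashSet<>(); // members whose FLUSH_OK we have seen
    private Phase phase = Phase.RUNNING;

    ProposedFlushSketch(Set<String> others) {
        this.others = new HashSet<>(others);
    }

    /** block() returned locally: send FLUSH_OK but keep the down path open. */
    boolean blockReturned() {
        phase = Phase.FLUSH_OK_SENT;
        return maybeComplete();
    }

    /** FLUSH_OK received from another member. */
    boolean onFlushOk(String sender) {
        okFrom.add(sender);
        return maybeComplete();
    }

    /** Returns true at the moment the node should block down messages
     *  and multicast FLUSH_COMPLETED. */
    private boolean maybeComplete() {
        if (phase == Phase.FLUSH_OK_SENT && okFrom.containsAll(others)) {
            phase = Phase.DOWN_BLOCKED;
            return true;
        }
        return false;
    }

    /** New cluster activity (e.g. a prepare()) is only allowed before FLUSH_OK. */
    boolean mayStartNewActivity() { return phase == Phase.RUNNING; }

    /** Responses to RPC calls are allowed until down messages are blocked. */
    boolean maySendResponse() { return phase != Phase.DOWN_BLOCKED; }
}
```

Note that mayStartNewActivity() is something the application would have to honour itself; nothing in this sketch enforces it.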
A downside of this idea is it changes the semantics of flush and
requires JGroups changes. We'd definitely like input from Bela on this.
Also, since we initially rejected it, we haven't fully thought it
through. (As I'm editing this to send out I see there is no way to tell
JBC after it returns from block() to not let any "new" activity through
-- big hole. I'm back to rejecting this approach.)
C) An alternative idea we discussed was to do application-level
coordination around the cluster, i.e. add something similar to the
existing FLUSH_OK/FLUSH_COMPLETED, but at the JBC level. Revising the
previous scenario:
1) JBC on B has tx in progress, just starting the 2PC. Sends out the
prepare().
2) A sends out a START_FLUSH message.
3) A gets START_FLUSH, calls block().
4) JBC on A is new, doesn't have much going on, so doesn't do cleanup
work on its own node.
4.1) JBC on A sends out an RPC call with its address as an arg to a new
"flushReady()" method added to TreeCache. (Other name for method is
fine.)
4.2) JBC on A blocks waiting for flushReady() RPC calls from all the
other members. Does not return from block().
5) A gets the prepare() (no problem, FLUSH doesn't block up messages,
just down messages.)
6) A executes the prepare(), can send the response to B because FLUSH
isn't blocking the channel.
7) B gets the START_FLUSH, calls block().
8) JBC B doesn't immediately return from block() as it detects it has a
2PC in progress and is giving the prepare() some time to complete (avoid
unnecessary tx rollback).
9) JBC B receives flushReady() call from A, adds entry to a vector
recording A is ready.
10) B receives prepare() response from A, sends commit().
11) B sends out RPC call with its address to "flushReady()" method.
12) A receives commit(), commits tx.
13) A receives flushReady() call from B. Adds entry to a vector
recording that B is ready.
14) A sees that all other nodes are ready, returns from block().
15) B sees that all other nodes are ready, returns from block().
Downside to this is complexity and requirement to add another method for
the "flushReady()" RPC.
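The bookkeeping for the steps above might look something like the sketch below. Everything in it is hypothetical: FlushReadyTracker is not an existing JBC class, members are plain strings rather than JGroups addresses, and the timeout on the wait is my own assumption (the steps above don't say how long block() should wait). It also ignores view changes while waiting.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;

class FlushReadyTracker {
    private final Set<String> waitingFor;   // members that have not yet called flushReady()

    FlushReadyTracker(Set<String> otherMembers) {
        this.waitingFor = new HashSet<>(otherMembers);
    }

    /** Handler for an incoming flushReady(sender) RPC from another member. */
    synchronized void flushReady(String sender) {
        waitingFor.remove(sender);
        if (waitingFor.isEmpty()) {
            notifyAll();                    // wake up the thread parked in block()
        }
    }

    /**
     * Called from block() after local work is done and our own flushReady()
     * has been sent. Returns true once every other member has reported ready,
     * false if the (assumed) timeout expires or the thread is interrupted.
     */
    synchronized boolean awaitAll(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (!waitingFor.isEmpty()) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return false;
                }
                TimeUnit.NANOSECONDS.timedWait(this, remaining);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

In the scenario, A would call awaitAll() in step 4.2 and be released by B's flushReady() call; B would do the same once its 2PC has finished. Because neither node returns from block() until then, A's channel is still open when it needs to send the prepare() response.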
D) A 3rd alternative is to just accept the problem. The problem is a
race condition where A blocks down events but then receives a prepare().
Its response to prepare() cannot be sent. The effect is JBC B's impl of
FLUSH will detect the prepare() isn't progressing and at some point roll
back the tx. This will result in a rollback() message being sent to A.
A can receive it and roll back the tx. IIRC a rollback() is always
async, so A does not need to send a response. A and B end up in a valid
state.
Downside of this is the tx gets rolled back. This could be a frequent
occurrence in high load scenarios because a new node in the cluster
could be expected to very quickly call blockOK(), possibly even before
the START_FLUSH message goes out on the wire.
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat