[infinispan-dev] FLUSH-medium and NBST

Dan Berindei dan.berindei at gmail.com
Fri Jun 1 09:31:39 EDT 2012


Hi Bela

Let me try to expand a bit on how I see a potential FLUSH-medium protocol.

The NBST design starts off from the premise that we can reject
commands tagged with an older cache view id and the originator will
retry the command with the proper view id (and on the proper owners).
This is quite straightforward in sync caches, but it can get messy in
async caches (or even with only an async commit phase). Note that for
now I'm only talking about a new node joining, not about a node
leaving (which is another scenario where FLUSH helps, but
unfortunately only in replicated mode).
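
To make that premise concrete, here's a minimal sketch of the
receiver-side check; the class and method names are illustrative, not
the actual Infinispan code:

    // Hypothetical sketch of the reject-and-retry premise; these are
    // not the real Infinispan classes, just an illustration of the idea.
    class ViewIdChecker {
        private volatile int currentViewId;

        // Called on the receiver for every incoming command. A command
        // tagged with an older cache view id is rejected, and the
        // originator is expected to retry it with the current view id,
        // against the current owners.
        boolean accept(int commandViewId) {
            return commandViewId >= currentViewId;
        }

        void installView(int newViewId) {
            currentViewId = newViewId;
        }
    }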


# Async commit

Let's say we have a cluster of three nodes, A (the coordinator), B
and C, running a distributed cache with an async commit phase and
numOwners = 1. The current cache view is {1|A, B, C}.
1. B starts a transaction Tx1 { put(k1, v1) }. It sends the
prepare(Tx1) command to C, which is the only owner of key k1, and gets
a success response.
2. B sends the commit(Tx1) command to C, but the message is delayed.
It then proceeds to delete Tx1 from its transaction table.
3. D joins the cluster, sends a requestJoin(D) to A, and A sends a
prepareView({2|A, B, C, D}) to B, C, and D.
4. C receives the prepareView({2|A, B, C, D}) command, sees that it's
no longer the owner of k1, and sends an applyState(prepare(Tx1))
command to D.
5. C receives the commit(Tx1) command, discards it because it has an
old view id.
6. D receives the prepareView({2|A, B, C, D}) command.
7. D receives the applyState(prepare(Tx1)) command and blocks any
other transaction from acquiring the lock on key k1. B will never send
a commit(Tx1) command to D, so the lock is held until B or D dies.

I can see two ways of solving this:
a. In step 5, instead of rejecting the commit(Tx1) command, C could
forward it to the new owner (D); see the sketch after this list.
b. In step 4, before pushing the transaction information to D, C could
wait to receive all commit commands sent by any node in the old cache
view.
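
A minimal sketch of option a), with made-up types and helpers
standing in for the real Infinispan ones:

    // Hypothetical sketch of option a): instead of discarding a commit
    // that arrives with a stale view id, C forwards it to the current
    // owner. None of these names are the real Infinispan API.
    abstract class ForwardingCommitHandler {
        protected volatile int currentViewId;
        protected Object localAddress;

        void handleCommit(int cmdViewId, Object key, Object commitCommand) {
            Object owner = currentOwner(key);
            if (cmdViewId < currentViewId && !owner.equals(localAddress)) {
                // We are no longer an owner of the key: forward
                // commit(Tx1) on to the new owner instead of rejecting it.
                send(owner, commitCommand);
                return;
            }
            applyCommit(commitCommand); // still the owner, apply as usual
        }

        abstract Object currentOwner(Object key); // from the current view
        abstract void send(Object target, Object command);
        abstract void applyCommit(Object command);
    }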

In the previous version of the NBST document I had chosen to forward
the prepare(Tx1) command from C to D, but that seemed overly
complicated. I'm not so sure any more... (It could be because I was
trying to use forwarding to "solve" the scenario where the originator
dies after sending a message, but I'll send a separate email for
that.)

In the latest version of the NBST design document I chose option b). I
had an idea for an API (whether in JGroups or in Infinispan), more or
less like a distributed CountDownLatch, that would ensure that when a
thread returns from await(), any messages sent by the other members (to
this node) before their countDown() call have already been delivered.
I'd also need to identify latches by increasing id numbers (like the
cache view id), and whenever someone calls countDown(id), everyone
blocked in await(id - 1) should unblock.
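
Something along these lines; to be clear, this API is hypothetical
and exists in neither JGroups nor Infinispan:

    // Hypothetical API, not present in JGroups or Infinispan.
    interface ClusteredCountDownLatch {

        // Each node calls this once it has sent all its messages for
        // the given id (e.g. the cache view id). countDown(id) also
        // unblocks anyone still waiting in await(id - 1).
        void countDown(int id);

        // Returns only after every member has called countDown(id) and
        // all messages they sent to this node before their countDown(id)
        // call have been delivered.
        void await(int id) throws InterruptedException;
    }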

The TO guys have an alternative that's quite a bit simpler (although
it does mean a longer "down" time): block until we receive commit
commands for all the transactions we have in our transaction table.
This means that if a 3rd-party XA resource on B takes a very long time
to prepare the transaction, C would have to wait for that XA
resource to finish preparing, and everyone else would have to wait for
C to send its transaction data.
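
In rough code the alternative looks like this; PendingTx and
awaitCompletion() are made-up names for illustration:

    // Sketch of the simpler alternative: before sending transaction
    // data to the new owners, wait for every transaction in the local
    // transaction table to finish. Both types are stand-ins.
    interface PendingTx {
        // Blocks until the commit (or rollback) command for this
        // transaction has been received and applied.
        void awaitCompletion() throws InterruptedException;
    }

    class BlockingStatePush {
        void beforePushingState(Iterable<PendingTx> txTable)
                throws InterruptedException {
            // If a 3rd-party XA resource delays one of these commits,
            // this loop (and everyone waiting on our state) blocks too.
            for (PendingTx tx : txTable) {
                tx.awaitCompletion();
            }
        }
    }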


# Async prepare

If the prepare had been asynchronous, we wouldn't have a stale
lock (because 1PC is automatically activated when prepares are async),
but we would lose the prepare on the joiner:
1. B starts a transaction Tx1 { put(k1, v1) }. It sends the
prepare(Tx1) command to C, which is the only owner of key k1, but the
message is delayed.
2. D joins the cluster, sends a requestJoin(D) to A, and A sends a
prepareView({2|A, B, C, D}) to B, C, and D.
3. C and D receive the prepareView({2|A, B, C, D}) command; C doesn't
have any transaction information to send to D.
4. C receives the prepare(Tx1) command, discards it because it has an
old view id.
5. D never receives the prepare(Tx1) command, so the entry k1->v1 is lost.

It's definitely not as bad as having a stale lock, but still, losing
data without any node leaving the cache doesn't look good.

We have the same two options here as well:
a. In step 4, forward the prepare(Tx1) command to the new owner (D).
b. Never reject prepare commands with viewId-1: block state transfer
on C as long as there are any in-flight commands; see the sketch after
this list.
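
Option b) is where the latch idea from the async commit section would
come in. A sketch, reusing the hypothetical ClusteredCountDownLatch
from above:

    // Sketch of option b) on top of the hypothetical latch: each node
    // counts down once it will tag all further commands with the new
    // view id, and C only starts pushing state after every viewId-1
    // command (including any in-flight prepares) has been delivered.
    class PrepareViewHandler {
        private final ClusteredCountDownLatch latch;

        PrepareViewHandler(ClusteredCountDownLatch latch) {
            this.latch = latch;
        }

        void onPrepareView(int newViewId) throws InterruptedException {
            latch.countDown(newViewId); // no more old-view commands from us
            latch.await(newViewId);     // all old-view commands delivered
            pushStateToNewOwners(newViewId);
        }

        void pushStateToNewOwners(int newViewId) {
            // push the transaction table and data to the new owners
        }
    }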

We can't use the TO guys' alternative for b) here, because C has no
idea that it will receive the prepare(Tx1) command.


# TOB/TOM

In TOB/TOM the prepare command is asynchronous, but it is also looped
back to the originator in total order with the prepareView command. So
it's not possible for C to reject the prepare(Tx1) command without B
also rejecting it and resending it to D.


That's about it. Thanks for looking into it Bela!

--
Dan

