[infinispan-dev] [Cloudtm-discussion] CloudTM: Additional Atomic broadcast based replication mechanism integrated in Infinispan

Paolo Romano romano at inesc-id.pt
Wed Apr 20 13:18:44 EDT 2011


On 4/17/11 7:55 PM, Manik Surtani wrote:
> Excellent stuff, Paolo and Pedro.  My comments below, inline.  Cc'ing infinispan-dev as well.
Thanks Manik!
> On 16 Apr 2011, at 21:47, Paolo Romano wrote:
>
> ...
> I presume this implementation still plays nice with XA/JTA?  In that transactions marked for rollback, etc are respected?
Kind of :)

We haven't polished the code to the point of full XA/JTA integration, 
but this could be done without too many problems.

Basically, if Infinispan is the only resource in a distributed 
transaction, and it is configured to work in replicated mode, then it 
could expose a one-phase commit interface. In this case, the outcome of 
the commit phase would be that of the AB-based certification, avoiding 
2PC entirely.

If, on the other hand, (replicated) Infinispan needs to be enlisted in 
a distributed transaction encompassing other resources, 2PC is 
unavoidable. In this case, when a (replicated) Infinispan node receives 
a prepare message from an external coordinator it could i) AB-cast the 
prepare message to the other replicas, and ii) perform lock acquisition 
and write-skew validation to determine the vote to send back to the 
external coordinator. (Note that all replicas are guaranteed to 
determine the same outcome here, given the total-order guarantees of 
the AB-cast and the determinism of the certification procedure.) The 
write-back and lock release should instead be done upon receipt of the 
final decision (2nd phase of the 2PC) from the coordinator.
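To make the determinism argument concrete, here is a minimal sketch (not Infinispan code; all names are hypothetical) of the certification step: since every replica AB-delivers prepares in the same total order and runs the same write-skew check against the same committed versions, every replica computes the same vote.

```java
import java.util.*;

// Hypothetical sketch of deterministic certification on AB-delivery of a
// prepare. "committedVersion" stands in for whatever per-key version
// metadata the replica keeps; it is an assumption, not Infinispan's API.
public class CertificationSketch {
    static final Map<String, Long> committedVersion = new HashMap<>();

    // Vote returned to the external 2PC coordinator. Deterministic: it
    // depends only on the AB-delivery order and the versions the
    // transaction read, so all replicas agree.
    static boolean certify(Map<String, Long> readVersions, Set<String> writeSet) {
        for (Map.Entry<String, Long> e : readVersions.entrySet()) {
            long current = committedVersion.getOrDefault(e.getKey(), 0L);
            if (current != e.getValue()) {
                return false; // write-skew: the key changed since it was read
            }
        }
        // Locks on writeSet would be acquired here; write-back and lock
        // release wait for the coordinator's final decision (2PC phase 2).
        return true;
    }

    static void applyWrites(Set<String> writeSet, long commitVersion) {
        for (String k : writeSet) committedVersion.put(k, commitVersion);
    }

    public static void main(String[] args) {
        Map<String, Long> reads = new HashMap<>();
        reads.put("x", 0L);
        boolean vote1 = certify(reads, Collections.singleton("x")); // true
        applyWrites(Collections.singleton("x"), 1L);
        boolean vote2 = certify(reads, Collections.singleton("x")); // false: stale read
        System.out.println(vote1 + " " + vote2);
    }
}
```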

Does this answer your question?

> ...
>> Finally, the sequencer is, in fact, a privileged node: it can commit transactions much faster than the other nodes, since it self-assigns the order in which transactions are processed. This may not be very fair at high contention rates, as the sequencer gets a higher chance of committing its transactions, but it does make the system MUCH faster overall.
>>
>> Concerning blocking scenarios in case of failures: just like 2PC is blocking upon the crash of a node coordinating a transaction, this replication scheme is also blocking, but this time upon the crash of the sequencer. The comparison in terms of liveness guarantees therefore seems quite fair. (Note that it would have been possible to make this replication mechanism non-blocking, at the cost of one extra communication step, but we opted not to, in order to compare the protocols more fairly.)
> When you say make the replication mechanism non-blocking, you are referring to asynchronous communication?
No. I was referring to the need to block waiting for the recovery of a 
crashed node in order to determine the outcome of transactions stuck in 
their commit phase. 2PC needs to block (or to resort to heuristic 
decisions, possibly violating atomicity) upon the crash of the 
corresponding coordinator node. The implemented AB-based replication 
mechanism needs to block if the sequencer crashes, as it may have 
ordered and committed transactions (before crashing) that the other 
nodes have not yet seen.
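The blocking behaviour can be illustrated with a small sketch (an assumed simplification, not the actual implementation): replicas deliver transactions strictly in sequence-number order, so a gap left by a crashed sequencer stalls everything ordered behind it.

```java
import java.util.*;

// Sketch of sequencer-style total-order delivery: the sequencer assigns
// consecutive sequence numbers, and replicas deliver only in that order.
// A gap (e.g. a number assigned by the sequencer just before it crashed,
// never received here) blocks delivery of all later transactions.
public class SequencerOrderSketch {
    private final SortedMap<Long, String> pending = new TreeMap<>();
    private long nextToDeliver = 1;
    private final List<String> delivered = new ArrayList<>();

    void onSequenced(long seqNo, String txId) {
        pending.put(seqNo, txId);
        // Deliver in sequence-number order; stop at the first gap.
        while (pending.containsKey(nextToDeliver)) {
            delivered.add(pending.remove(nextToDeliver));
            nextToDeliver++;
        }
    }

    List<String> delivered() { return delivered; }

    public static void main(String[] args) {
        SequencerOrderSketch node = new SequencerOrderSketch();
        node.onSequenced(1, "tx1");
        node.onSequenced(3, "tx3"); // seq 2 was lost with the crashed sequencer
        System.out.println(node.delivered()); // [tx1] -- tx3 blocked behind the gap
    }
}
```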
>> To evaluate performance we ran the same kind of experiments we used in our recent mail where we evaluated a primary-backup-based replication scheme. All nodes only do write operations, with transactions of 10 statements, one of which is a put. Accesses are uniformly distributed over 1K, 10K, and 100K data items. Machines are 8-core with 8GB RAM, and Radargun uses 10 threads per node.
> Is your test still forcing deadlocks by working on the same keyset on each node?
Yes. In the test, we generated transactions with 9 reads and 1 write. 
The accesses are uniformly distributed over keysets of sizes {1K, 10K, 
100K}.
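For reference, a minimal sketch of that workload shape (hypothetical code, not the actual Radargun stressor): each transaction draws 9 read keys and 1 write key uniformly from a fixed keyset.

```java
import java.util.*;

// Hypothetical sketch of the benchmark workload described above:
// 10 statements per transaction (9 GETs, 1 PUT), keys drawn uniformly
// from a keyset of configurable size (1K, 10K, or 100K in the tests).
public class WorkloadSketch {
    static List<String> nextTransaction(Random rnd, int keySetSize) {
        List<String> ops = new ArrayList<>();
        for (int i = 0; i < 9; i++) {
            ops.add("GET key" + rnd.nextInt(keySetSize)); // uniform read
        }
        ops.add("PUT key" + rnd.nextInt(keySetSize));     // uniform write
        return ops;
    }

    public static void main(String[] args) {
        List<String> tx = nextTransaction(new Random(), 1000);
        System.out.println(tx.size()); // 10 statements per transaction
    }
}
```

Since every node draws from the same keyset, conflicting (and potentially deadlocking) accesses across nodes are expected by design, which is what Manik's question refers to.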
> ...
> Cheers


Cheers

     Paolo
> Manik


