[infinispan-dev] [Cloudtm-discussion] CloudTM: Additional Atomic broadcast based replication mechanism integrated in Infinispan

Manik Surtani manik at jboss.org
Sun Apr 17 14:55:01 EDT 2011


Excellent stuff, Paolo and Pedro.  My comments below, inline.  Cc'ing infinispan-dev as well.

On 16 Apr 2011, at 21:47, Paolo Romano wrote:

> Hi,
> 
> at INESC-ID, we have just finished brewing another (full) replication protocol for Infinispan, in the context of the Cloud-TM project. The credits go to Pedro Ruivo, an MSc student working in our group, who developed this as part of his thesis work.
> 
> Like Infinispan's current 2PC-based replication mechanism, this is a multi-master scheme in which transactions are executed locally by any node, without any coordination among replicas until the commit phase.
> 
> Rather than using 2PC, however, this scheme relies on a single Atomic broadcast (AB, a.k.a. total order broadcast) to disseminate transactions' writesets while ensuring totally ordered delivery of messages across the various nodes.
> 
> The replication scheme is, in fact, quite simple. Upon delivery of the AB with the writeset of a transaction, we ensure that each replica executes the following operations in the order established by the AB layer:
> - acquiring locks on the writeset, possibly stealing the lock from any locally running transaction (and flagging it as aborted)
> - performing a write-skew check (in case Infinispan is configured to do this check)
> - applying the writeset
> - releasing the locks.

I presume this implementation still plays nicely with XA/JTA, in that transactions marked for rollback, etc., are respected?
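
Just to check I'm reading the delivery side correctly, I'm picturing it as roughly the sketch below. All names here are made up (it is not the actual implementation), and it stands in for the lock manager by simply serialising deliveries:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Illustrative sketch only: every node runs this callback in the total order
 * decided by the AB layer, so no distributed deadlock can arise.
 */
public class TotalOrderApplier {

    private final ConcurrentMap<String, Object> dataContainer =
            new ConcurrentHashMap<String, Object>();
    private final boolean writeSkewCheckEnabled;

    public TotalOrderApplier(boolean writeSkewCheckEnabled) {
        this.writeSkewCheckEnabled = writeSkewCheckEnabled;
    }

    /** Invoked once per transaction writeset, in AB delivery order. */
    public synchronized boolean deliver(Map<String, Object> writeSet,
                                        Map<String, Object> readVersions) {
        // 1. "Acquire locks": writesets are applied strictly in delivery order
        //    (serialised here by synchronized for simplicity); any conflicting
        //    local transaction would be flagged as aborted at this point.

        // 2. Optional write-skew check: abort if a value the transaction read
        //    has changed in the meantime.
        if (writeSkewCheckEnabled) {
            for (Map.Entry<String, Object> read : readVersions.entrySet()) {
                Object current = dataContainer.get(read.getKey());
                if (current != null && !current.equals(read.getValue())) {
                    return false; // abort
                }
            }
        }

        // 3. Apply the writeset.
        dataContainer.putAll(writeSet);

        // 4. The "locks" are released as soon as this method returns.
        return true; // commit
    }
}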

> 
> ABcast can be implemented in many ways, but for simplicity we used the sequencer-based implementation available in JGroups. This scheme is quite simple:
> - upon an atomic broadcast, the message is simply broadcast to all nodes
> - upon receipt of the message, however, the nodes do not deliver it immediately to the application layer (Infinispan in this case). Instead, a special node (e.g. the one with the lowest id in the group), namely the sequencer, broadcasts back the order in which processes should deliver this message.
> - finally, the nodes deliver the atomic broadcast message in the order specified by the sequencer.
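
For anyone on the list less familiar with this, the sequencer approach boils down to something like the skeleton below. This is just an illustration of the idea with invented names, not JGroups' actual SEQUENCER protocol:

import java.util.concurrent.atomic.AtomicLong;

/**
 * Bare-bones sketch of sequencer-based atomic broadcast; transport and
 * buffering are left abstract, and all names are illustrative.
 */
public abstract class SequencerAbcast {

    private final boolean sequencer;                  // e.g. the node with the lowest id
    private final AtomicLong nextSeqNo = new AtomicLong();

    protected SequencerAbcast(boolean sequencer) {
        this.sequencer = sequencer;
    }

    /** Step 1: an atomic broadcast is just a plain broadcast of the payload. */
    public void abcast(long msgId, byte[] payload) {
        broadcast(new DataMsg(msgId, payload));
    }

    /** Steps 2-3: buffer on receipt; the sequencer broadcasts the order;
     *  everyone delivers to the application layer in that order. */
    public void onReceive(Object msg) {
        if (msg instanceof DataMsg) {
            buffer((DataMsg) msg);                    // do NOT deliver yet
            if (sequencer) {
                broadcast(new OrderMsg(((DataMsg) msg).msgId,
                                       nextSeqNo.getAndIncrement()));
            }
        } else if (msg instanceof OrderMsg) {
            deliverInSeqNoOrder((OrderMsg) msg);      // hand to Infinispan here
        }
    }

    protected abstract void broadcast(Object msg);
    protected abstract void buffer(DataMsg msg);
    protected abstract void deliverInSeqNoOrder(OrderMsg order);

    public static final class DataMsg {
        final long msgId; final byte[] payload;
        DataMsg(long msgId, byte[] payload) { this.msgId = msgId; this.payload = payload; }
    }

    public static final class OrderMsg {
        final long msgId; final long seqNo;
        OrderMsg(long msgId, long seqNo) { this.msgId = msgId; this.seqNo = seqNo; }
    }
}
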
> 
> The main advantage of this replication mechanism is that it avoids distributed deadlocks, which makes its performance far better at high contention (note that local deadlocks may still occur, as Infinispan's encounter-time locking scheme does not prevent them, but these are much rarer since the number of "contending" threads is much lower).

Definitely.

> Further, locks are held for shorter periods than with 2PC. With 2PC, the cohort nodes maintain the locks on behalf of the coordinator from the reception of the prepare until they receive the commit message. This encompasses a round trip (collecting votes from all cohorts and sending back the decision). With the ABcast scheme, instead, the locks acquired on behalf of remote transactions are held only for the time strictly required to write back locally.

Very true.  Nice.

> Finally, the sequencer is, in fact, a privileged node: it can commit transactions much faster than the other nodes, since it can self-assign the order in which transactions need to be processed. This may not be very fair at high contention rates, as it gets a higher chance of committing its transactions, but it does make it MUCH faster overall.
> 
> Concerning blocking scenarios in case of failures: just like 2PC blocks if the node coordinating a transaction crashes, this replication scheme also blocks, but this time if the sequencer crashes. The comparison in terms of liveness guarantees therefore seems quite fair. (Note that it would have been possible to make this replication mechanism non-blocking, at the cost of one extra communication step, but we opted not to, in order to compare the protocols more fairly.)

When you say making the replication mechanism non-blocking, are you referring to asynchronous communication?

> To evaluate performance we ran the same kind of experiments we used in our recent mail where we evaluated a primary-backup-based replication scheme. All nodes only run write transactions, each made of 10 statements, one of which is a put. Accesses are uniformly distributed over 1K, 10K, 100K data items. Machines are 8 cores, 8GB RAM, and RadarGun is using 10 threads per node.

Is your test still forcing deadlocks by working on the same keyset on each node?
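
In other words, I'm assuming each of the 10 threads per node runs a loop along these lines. This is a hypothetical sketch of the workload as I understand it, not the actual RadarGun stage; the key-space size and key naming are made up:

import java.util.Random;

import javax.transaction.TransactionManager;

import org.infinispan.Cache;

/**
 * Hypothetical sketch of the described workload: 10 statements per
 * transaction, one of them a put, with keys drawn uniformly from a shared
 * key space (1K / 10K / 100K items).
 */
public class WriteWorkload implements Runnable {

    private final Cache<String, String> cache;
    private final int keySpace;          // 1000, 10000 or 100000
    private final Random random = new Random();

    public WriteWorkload(Cache<String, String> cache, int keySpace) {
        this.cache = cache;
        this.keySpace = keySpace;
    }

    @Override
    public void run() {
        TransactionManager tm = cache.getAdvancedCache().getTransactionManager();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                tm.begin();
                for (int i = 0; i < 9; i++) {                       // 9 reads...
                    cache.get("key-" + random.nextInt(keySpace));
                }
                cache.put("key-" + random.nextInt(keySpace), "value"); // ...and 1 put
                tm.commit();
            } catch (Exception e) {
                try { tm.rollback(); } catch (Exception ignored) { }
            }
        }
    }
}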

> Summary of results (see attached pdf):
> - Throughput, page 1: the plots show significant speedups, especially at high contention, where 2PC stumbles upon very frequent distributed deadlocks. Note that we enabled deadlock detection in all experiments running 2PC.
> - Abort rate, page 2: abort rates are similar in both protocols (note the log scale on the y-axis). But the AB-based solution, by avoiding deadlocks, detects the need to abort transactions much faster than 2PC.
> - Average commit duration, page 3: at high contention, 2PC thrashes due to deadlocks (despite the deadlock detection mechanism), and this is reflected in the avg. commit duration as well. In the other cases (10K and 100K keys), where there is less contention, the commit duration of the two protocols is similar (see next).
> - To bundle or not to bundle?, page 4: the throughput results shown on page 1 actually involve a little trick :-P We used "some" bundling ONLY in the case of transactions spanning 100K data items. We noticed that without bundling the commit duration of the AB-based scheme got much worse, and we experimented with a few bundling values until we got decent performance (see the plot on page 4). This is explainable: with 100K keys contention is low, so fewer transactions get aborted during their execution. This means that more transactions reach the commit phase, which in turn means more atomic broadcasts hit the network layer and more load on the sequencing node. Bundling, as you can see on page 4, does the trick, however. We did not retry the tests with the other protocols with and without bundling, as this would have been very time-consuming. BTW, I believe that the results would not have been very different.
> 
> Let me open a small parenthesis about bundling: this can be an extremely effective weapon, but manual configuration is a big headache and extremely time-consuming. I believe that it would be very useful to have some self-tuning mechanism for it in JGroups. In fact, we've recently got very nice results using reinforcement learning to tackle this problem (well, to be precise, not really the same problem, but a very similar one):
> 
>    http://www.inesc-id.pt/ficheiros/publicacoes/7163.pdf
> 
> ...but I've implemented the prototype of this solution in Appia, as I knew it much better than JGroups. What do you think about integrating a similar mechanism in JGroups' ergonomics?
> 
> Back to replication: we are now working on a similar solution for partial replication (a.k.a. distribution in Infinispan).

Very keen to see this work with distribution.  :-)

Cheers
Manik
--
Manik Surtani
manik at jboss.org
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org


-------------- next part --------------
A non-text attachment was scrubbed...
Name: Performance Evaluation.pdf
Type: application/pdf
Size: 76349 bytes
Desc: not available
Url : http://lists.jboss.org/pipermail/infinispan-dev/attachments/20110417/8222f444/attachment-0001.pdf 
