[infinispan-dev] CloudTM: Additional Atomic broadcast based replication mechanism integrated in Infinispan
Paolo Romano
romano at inesc-id.pt
Sat Apr 16 16:47:12 EDT 2011
Hi,
at INESC-ID, we have just finished brewing another (full) replication
protocol for Infinispan, in the context of the Cloud-TM project. The
credits go to Pedro Ruivo, an MSc student working in our group, who
developed this as part of his thesis work.
Like Infinispan's current 2PC-based replication mechanism, this is a
multi-master scheme in which transactions are executed locally by any
node, without any coordination among replicas until the commit phase.
Rather than using 2PC, however, this scheme relies on a single Atomic
Broadcast (AB, a.k.a. total order broadcast) to disseminate
transactions' writesets while ensuring totally ordered delivery of
messages across the various nodes.
The replication scheme is, in fact, quite simple. Upon delivery of the
AB carrying the writeset of a transaction, we ensure that each replica
executes the following operations in the order established by the AB
layer (a rough sketch in code follows the list):
- acquiring the locks on the writeset, possibly stealing a lock from any
locally running transaction (and flagging that transaction as aborted)
- performing a write-skew check (in case Infinispan is configured to do
this check)
- applying the writeset
- releasing the locks.
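Here is a rough Java sketch of that delivery-side logic. All class and
method names (AbDeliveryHandler, onTotalOrderDelivery, versionsUnchanged,
...) are made up for illustration and are not the actual
Infinispan/Cloud-TM code; the real implementation also has to track
versions and notify the transactions whose locks were stolen.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: every name here is hypothetical, not the actual
// Infinispan/Cloud-TM code. It shows the order in which a replica processes
// a transaction's writeset once the atomic broadcast layer delivers it.
class AbDeliveryHandler {

    private final Map<Object, Object> dataContainer = new ConcurrentHashMap<Object, Object>();
    private final Map<Object, Long> lockTable = new ConcurrentHashMap<Object, Long>(); // key -> owning tx
    private final boolean writeSkewCheckEnabled = true;

    // Invoked in total order, i.e. in the same sequence on every replica.
    void onTotalOrderDelivery(long txId, Map<Object, Object> writeSet) {
        // 1. Acquire the locks on the writeset, "stealing" them from any locally
        //    running transaction (which would then be flagged as aborted).
        for (Object key : writeSet.keySet()) {
            lockTable.put(key, txId);
        }
        try {
            // 2. Write-skew check (only if Infinispan is configured to do it).
            if (writeSkewCheckEnabled && !versionsUnchanged(writeSet)) {
                return; // abort: nothing gets applied
            }
            // 3. Apply the writeset to the local data container.
            dataContainer.putAll(writeSet);
        } finally {
            // 4. Release the locks.
            for (Object key : writeSet.keySet()) {
                lockTable.remove(key);
            }
        }
    }

    private boolean versionsUnchanged(Map<Object, Object> writeSet) {
        return true; // placeholder for the real version comparison
    }
}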
ABcast can be implemented in many ways, but for simplicity we used the
sequencer-based implementation available in JGroups. This scheme is
quite simple (a sketch follows the list):
- upon an atomic broadcast, the message is simply broadcast to all nodes
- upon receipt of the message, however, the nodes do not deliver it
immediately to the application layer (Infinispan in this case). Instead,
a special node (e.g. the one with the lowest id in the group), namely
the sequencer, broadcasts back the order in which processes should
deliver this message.
- finally, the nodes deliver the atomic broadcast message in the order
specified by the sequencer.
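For those less familiar with it, here is a minimal Java sketch of the
sequencer idea. It is not the actual JGroups implementation, and all
names (SequencerSketch, onBroadcastMessage, ...) are invented for
illustration; it only shows that messages are buffered until the
sequencer's ordering decision arrives.

import java.util.HashMap;
import java.util.Map;

// Minimal, hypothetical sketch of sequencer-based total order delivery.
// Every node buffers incoming broadcasts until the sequencer has announced
// their global order, and then delivers them in exactly that order.
class SequencerSketch {

    private final boolean iAmSequencer;
    private long nextSeqNoToAssign = 0;   // used only by the sequencer node
    private long nextSeqNoToDeliver = 0;  // used by every node
    private final Map<String, Object> pending = new HashMap<String, Object>(); // msgId -> message
    private final Map<Long, String> order = new HashMap<Long, String>();       // seqNo -> msgId

    SequencerSketch(boolean iAmSequencer) {
        this.iAmSequencer = iAmSequencer;
    }

    // Step 1 (sender side): the message is simply broadcast to all nodes (not shown here).

    // Step 2: on receipt, buffer the message instead of delivering it; the sequencer
    // additionally assigns it the next sequence number and broadcasts that decision.
    synchronized void onBroadcastMessage(String msgId, Object msg) {
        pending.put(msgId, msg);
        if (iAmSequencer) {
            long seqNo = nextSeqNoToAssign++;
            order.put(seqNo, msgId);       // the sequencer can self-assign the order...
            broadcastOrder(msgId, seqNo);  // ...and then tell the other nodes
        }
        tryDeliver();
    }

    // Step 3: the other nodes record the announced order as it arrives.
    synchronized void onOrderMessage(String msgId, long seqNo) {
        order.put(seqNo, msgId);
        tryDeliver();
    }

    // Deliver buffered messages strictly in the order chosen by the sequencer.
    private void tryDeliver() {
        String msgId;
        while ((msgId = order.get(nextSeqNoToDeliver)) != null && pending.containsKey(msgId)) {
            deliverToApplication(pending.remove(msgId));
            order.remove(nextSeqNoToDeliver);
            nextSeqNoToDeliver++;
        }
    }

    private void broadcastOrder(String msgId, long seqNo) { /* send <msgId, seqNo> to all nodes */ }

    private void deliverToApplication(Object msg) { /* hand the message to Infinispan */ }
}

Note how this sketch already hints at the point below about the sequencer
being a privileged node: it orders its own messages locally, without
waiting for a round trip.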
The main advantage of this replication mechanism is that it avoids
distributed deadlocks, which makes its performance far better at high
contention (note that local deadlocks may still occur, as Infinispan's
encounter-time locking scheme does not prevent them, but these are much
rarer since the number of "contending" threads is much lower).
Further, locks are held for a shorter time than with 2PC. With 2PC, the
cohort nodes maintain the locks on behalf of the coordinator from the
reception of the prepare message until they receive the commit message.
This spans a full round trip (collecting votes from all cohorts, and
sending back the decision). With the ABcast scheme, instead, the locks
acquired on behalf of remote transactions are held only for the time
strictly required to write back locally.
Finally, the sequencer is, in fact, a privileged node: it can commit
transactions much faster than the other nodes, since it can self-assign
the order in which its transactions need to be processed. This may not
be very fair at high contention rates, as it gets a higher chance of
committing its transactions, but it does make it MUCH faster overall.
Concerning blocking scenarios in case of failures: just as 2PC blocks if
the node coordinating a transaction crashes, this replication scheme
also blocks, but this time if the sequencer crashes. The comparison in
terms of liveness guarantees therefore seems quite fair. (Note that it
would have been possible to make this replication mechanism
non-blocking, at the cost of one extra communication step, but we opted
not to do so, in order to compare the protocols more fairly.)
To evaluate performance we ran the same kind of experiments we used in
our recent mail where we evaluated a primary-backup-based replication
scheme. All nodes execute only write transactions of 10 statements, one
of which is a put. Accesses are uniformly distributed over 1K, 10K, and
100K data items. Machines have 8 cores and 8GB of RAM, and Radargun uses
10 threads per node.
Summary of results (see attached pdf):
- Throughput, page 1: the plots show significant speedups, especially at
high contention, where 2PC stumbles upon very frequent distributed
deadlocks. Note that we enabled deadlock detection in all experiments
running 2PC.
- Abort rate, page 2: abort rates are similar in both protocols (note
the log scale on the y-axis). But the AB-based solution, by avoiding
deadlocks, detects the need to abort a transaction much faster than 2PC.
- Average commit duration, page 3: at high contention, 2PC thrashes due
to deadlocks (despite the deadlock detection mechanism), and this is
reflected in the average commit duration as well. In the other cases
(10K and 100K keys), where there is less contention, the commit duration
of the two protocols is similar (see next).
- To bundle or not to bundle?, page 4: the throughput results shown on
page 1 actually involve a little trick :-P We used "some" bundling ONLY
in the case of the 100K data items workload. We noticed that, without
bundling, the commit duration of the AB-based scheme got much worse, so
we experimented with a few bundling values until we got decent
performance (see the plot on page 4). This is explainable: with 100K
keys, contention being low, fewer transactions get aborted during their
execution. This means that more transactions reach the commit phase,
which in turn means more atomic broadcasts hit the network layer and
more load on the sequencing node. Bundling, as you can see on page 4,
does the trick however. We did not re-run the tests for the other
protocols with and without bundling, as this would have been very time
consuming; I believe the results would not have been very different
anyway.
Let me open a small parenthesis about bundling: it can be an extremely
effective weapon, but manual configuration is a big headache and
extremely time consuming. I believe it would be very useful to have some
self-tuning mechanism for it in JGroups. In fact, we've recently
obtained very nice results using reinforcement learning to tackle this
problem (well, to be precise, not really the same problem, but a very
similar one):
http://www.inesc-id.pt/ficheiros/publicacoes/7163.pdf
...but I implemented the prototype of this solution in Appia, as I knew
it much better than JGroups. What do you think about integrating a
similar mechanism in JGroups' ergonomics?
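To make concrete what kind of knobs are involved, here is a purely
illustrative Java sketch of what a bundler does. This is not the JGroups
implementation (all names are invented); the two thresholds, a maximum
bundle size and a small flush timeout, are exactly the values one ends up
tuning by hand and that a self-tuning mechanism could adapt at runtime.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of message bundling: outgoing messages are buffered and
// flushed as a single network packet once either the accumulated size or a
// small timeout threshold is reached.
class BundlerSketch {

    private final int maxBundleBytes;     // e.g. 64 KB
    private final long maxBundleMillis;   // e.g. 30 ms
    private final List<byte[]> buffer = new ArrayList<byte[]>();
    private int bufferedBytes = 0;
    private long firstMessageTime = -1;

    BundlerSketch(int maxBundleBytes, long maxBundleMillis) {
        this.maxBundleBytes = maxBundleBytes;
        this.maxBundleMillis = maxBundleMillis;
    }

    synchronized void send(byte[] msg) {
        if (firstMessageTime < 0) {
            firstMessageTime = System.currentTimeMillis();
        }
        buffer.add(msg);
        bufferedBytes += msg.length;
        // Flush when either threshold is hit; both values trade latency for
        // fewer, larger packets, and the sweet spot depends on the workload.
        if (bufferedBytes >= maxBundleBytes
                || System.currentTimeMillis() - firstMessageTime >= maxBundleMillis) {
            flush();
        }
    }

    private void flush() {
        // In a real transport this would marshal the buffered messages into one
        // packet and hand it to the network layer.
        buffer.clear();
        bufferedBytes = 0;
        firstMessageTime = -1;
    }
}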
Back to replication: we are now working on a similar solution for
partial replication (a.k.a. distribution in Infinispan).
I'll keep you posted on our progress!
Cheers,
Paolo
PS: The code is available on GitHub, if you want to take a look at it:
https://github.com/cloudtm/infinispan/tree/abcast_replication
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Performance Evaluation.pdf
Type: application/pdf
Size: 76348 bytes
Desc: not available
Url : http://lists.jboss.org/pipermail/infinispan-dev/attachments/20110416/6573002d/attachment-0001.pdf