[infinispan-dev] CloudTM: Additional Atomic broadcast based replication mechanism integrated in Infinispan
Paolo Romano
romano at inesc-id.pt
Sat Apr 16 16:47:12 EDT 2011
Hi,
at INESC-ID, we have just finished brewing another (full) replication
protocol for Infinispan, in the context of the Cloud-TM project. The
credits go to Pedro Ruivo, an MSc student working in our group, who
developed this as part of his thesis work.
Like Infinispan's current 2PC-based replication mechanism, this is a
multi-master scheme in which transactions are executed locally by any
node, without any coordination among replicas until the commit phase.
Rather than using 2PC, however, this scheme relies on a single Atomic
Broadcast (AB, a.k.a. total order broadcast) to disseminate
transactions' writesets while ensuring totally ordered delivery of
messages across the various nodes.
The replication scheme is, in fact, quite simple. Upon delivery of the
AB carrying the writeset of a transaction, we ensure that each replica
executes the following operations in the order established by the AB
layer (a rough sketch in code follows the list):
- acquiring the locks on the writeset, possibly stealing a lock from any
locally running transaction (and flagging that transaction as aborted)
- performing a write-skew check (in case Infinispan is configured to do
this check)
- applying the writeset
- releasing the locks.
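Here is a rough Java sketch of that delivery-side logic. All class and
method names (AbDeliveryHandler, onTotalOrderDelivery, versionsUnchanged,
...) are made up for illustration and are not the actual
Infinispan/Cloud-TM code; the real implementation also has to track
versions and notify the transactions whose locks were stolen.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: every name here is hypothetical, not the actual
// Infinispan/Cloud-TM code. It shows the order in which a replica processes
// a transaction's writeset once the atomic broadcast layer delivers it.
class AbDeliveryHandler {

    private final Map<Object, Object> dataContainer = new ConcurrentHashMap<Object, Object>();
    private final Map<Object, Long> lockTable = new ConcurrentHashMap<Object, Long>(); // key -> owning tx
    private final boolean writeSkewCheckEnabled = true;

    // Invoked in total order, i.e. in the same sequence on every replica.
    void onTotalOrderDelivery(long txId, Map<Object, Object> writeSet) {
        // 1. Acquire the locks on the writeset, "stealing" them from any locally
        //    running transaction (which would then be flagged as aborted).
        for (Object key : writeSet.keySet()) {
            lockTable.put(key, txId);
        }
        try {
            // 2. Write-skew check (only if Infinispan is configured to do it).
            if (writeSkewCheckEnabled && !versionsUnchanged(writeSet)) {
                return; // abort: nothing gets applied
            }
            // 3. Apply the writeset to the local data container.
            dataContainer.putAll(writeSet);
        } finally {
            // 4. Release the locks.
            for (Object key : writeSet.keySet()) {
                lockTable.remove(key);
            }
        }
    }

    private boolean versionsUnchanged(Map<Object, Object> writeSet) {
        return true; // placeholder for the real version comparison
    }
}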
ABcast can be implemented in many ways, but for simplicity we used the
sequencer-based implementation available in JGroups. This scheme is
quite simple (a sketch follows the list):
- upon an atomic broadcast, the message is simply broadcast to all nodes
- upon receipt of the message, however, the nodes do not deliver it
immediately to the application layer (Infinispan in this case). Instead,
a special node (e.g. the one with the lowest id in the group), namely
the sequencer, broadcasts back the order in which processes should
deliver this message.
- finally, the nodes deliver the atomic broadcast message in the order
specified by the sequencer.
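For those less familiar with it, here is a minimal Java sketch of the
sequencer idea. It is not the actual JGroups implementation, and all
names (SequencerSketch, onBroadcastMessage, ...) are invented for
illustration; it only shows that messages are buffered until the
sequencer's ordering decision arrives.

import java.util.HashMap;
import java.util.Map;

// Minimal, hypothetical sketch of sequencer-based total order delivery.
// Every node buffers incoming broadcasts until the sequencer has announced
// their global order, and then delivers them in exactly that order.
class SequencerSketch {

    private final boolean iAmSequencer;
    private long nextSeqNoToAssign = 0;   // used only by the sequencer node
    private long nextSeqNoToDeliver = 0;  // used by every node
    private final Map<String, Object> pending = new HashMap<String, Object>(); // msgId -> message
    private final Map<Long, String> order = new HashMap<Long, String>();       // seqNo -> msgId

    SequencerSketch(boolean iAmSequencer) {
        this.iAmSequencer = iAmSequencer;
    }

    // Step 1 (sender side): the message is simply broadcast to all nodes (not shown here).

    // Step 2: on receipt, buffer the message instead of delivering it; the sequencer
    // additionally assigns it the next sequence number and broadcasts that decision.
    synchronized void onBroadcastMessage(String msgId, Object msg) {
        pending.put(msgId, msg);
        if (iAmSequencer) {
            long seqNo = nextSeqNoToAssign++;
            order.put(seqNo, msgId);       // the sequencer can self-assign the order...
            broadcastOrder(msgId, seqNo);  // ...and then tell the other nodes
        }
        tryDeliver();
    }

    // Step 3: the other nodes record the announced order as it arrives.
    synchronized void onOrderMessage(String msgId, long seqNo) {
        order.put(seqNo, msgId);
        tryDeliver();
    }

    // Deliver buffered messages strictly in the order chosen by the sequencer.
    private void tryDeliver() {
        String msgId;
        while ((msgId = order.get(nextSeqNoToDeliver)) != null && pending.containsKey(msgId)) {
            deliverToApplication(pending.remove(msgId));
            order.remove(nextSeqNoToDeliver);
            nextSeqNoToDeliver++;
        }
    }

    private void broadcastOrder(String msgId, long seqNo) { /* send <msgId, seqNo> to all nodes */ }

    private void deliverToApplication(Object msg) { /* hand the message to Infinispan */ }
}

Note how this sketch already hints at the point below about the sequencer
being a privileged node: it orders its own messages locally, without
waiting for a round trip.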
The main advantage of this replication mechanism is that it avoids
distributed deadlocks, which makes its performance far better at high
contention (note that local deadlocks may still occur, as Infinispan's
encounter-time locking scheme does not prevent them, but these are much
rarer since the number of "contending" threads is much lower).
Further, locks are held for a shorter time than with 2PC. With 2PC, the
cohort nodes maintain the locks on behalf of the coordinator from the
reception of the prepare message until they receive the commit message.
This spans a full round trip (collecting votes from all cohorts, and
sending back the decision). With the ABcast scheme, instead, the locks
acquired on behalf of remote transactions are held only for the time
strictly required to write back locally.
Finally, the sequencer is, in fact, a privileged node: it can commit
transactions much faster than the other nodes, since it can self-assign
the order in which its transactions need to be processed. This may not
be very fair at high contention rates, as it gets a higher chance of
committing its transactions, but it does make it MUCH faster overall.
Concerning blocking scenarios in case of failures: just as 2PC blocks if
the node coordinating a transaction crashes, this replication scheme
also blocks, but this time if the sequencer crashes. The comparison in
terms of liveness guarantees therefore seems quite fair. (Note that it
would have been possible to make this replication mechanism
non-blocking, at the cost of one extra communication step, but we opted
not to do so, in order to compare the protocols more fairly.)
To evaluate performance we ran the same kind of experiments we used in
our recent mail where we evaluated a primary-backup-based replication
scheme. All nodes execute only write transactions of 10 statements, one
of which is a put. Accesses are uniformly distributed over 1K, 10K, and
100K data items. Machines have 8 cores and 8GB of RAM, and Radargun uses
10 threads per node.
Summary of results (see attached pdf):
- Throughput, page 1: the plots show significant speedups, especially at
high contention, where 2PC stumbles upon very frequent distributed
deadlocks. Note that we enabled deadlock detection in all experiments
running 2PC.
- Abort rate, page 2: abort rates are similar in both protocols (note
the log scale on the y-axis). But the AB-based solution, by avoiding
deadlocks, detects the need to abort a transaction much faster than 2PC.
- Average commit duration, page 3: at high contention, 2PC thrashes due
to deadlocks (despite the deadlock detection mechanism), and this is
reflected in the average commit duration as well. In the other cases
(10K and 100K keys), where there is less contention, the commit duration
of the two protocols is similar (see next).
- To bundle or not to bundle?, page 4: the throughput results shown on
page 1 actually involve a little trick :-P We used "some" bundling ONLY
in the case of the 100K data items workload. We noticed that, without
bundling, the commit duration of the AB-based scheme got much worse, so
we experimented with a few bundling values until we got decent
performance (see the plot on page 4). This is explainable: with 100K
keys, contention being low, fewer transactions get aborted during their
execution. This means that more transactions reach the commit phase,
which in turn means more atomic broadcasts hit the network layer and
more load on the sequencing node. Bundling, as you can see on page 4,
does the trick however. We did not re-run the tests for the other
protocols with and without bundling, as this would have been very time
consuming; I believe the results would not have been very different
anyway.
Let me open a small parenthesis about bundling: it can be an extremely
effective weapon, but manual configuration is a big headache and
extremely time consuming. I believe it would be very useful to have some
self-tuning mechanism for it in JGroups. In fact, we've recently
obtained very nice results using reinforcement learning to tackle this
problem (well, to be precise, not really the same problem, but a very
similar one):
http://www.inesc-id.pt/ficheiros/publicacoes/7163.pdf
...but I implemented the prototype of this solution in Appia, as I knew
it much better than JGroups. What do you think about integrating a
similar mechanism in JGroups' ergonomics?
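To make concrete what kind of knobs are involved, here is a purely
illustrative Java sketch of what a bundler does. This is not the JGroups
implementation (all names are invented); the two thresholds, a maximum
bundle size and a small flush timeout, are exactly the values one ends up
tuning by hand and that a self-tuning mechanism could adapt at runtime.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of message bundling: outgoing messages are buffered and
// flushed as a single network packet once either the accumulated size or a
// small timeout threshold is reached.
class BundlerSketch {

    private final int maxBundleBytes;     // e.g. 64 KB
    private final long maxBundleMillis;   // e.g. 30 ms
    private final List<byte[]> buffer = new ArrayList<byte[]>();
    private int bufferedBytes = 0;
    private long firstMessageTime = -1;

    BundlerSketch(int maxBundleBytes, long maxBundleMillis) {
        this.maxBundleBytes = maxBundleBytes;
        this.maxBundleMillis = maxBundleMillis;
    }

    synchronized void send(byte[] msg) {
        if (firstMessageTime < 0) {
            firstMessageTime = System.currentTimeMillis();
        }
        buffer.add(msg);
        bufferedBytes += msg.length;
        // Flush when either threshold is hit; both values trade latency for
        // fewer, larger packets, and the sweet spot depends on the workload.
        if (bufferedBytes >= maxBundleBytes
                || System.currentTimeMillis() - firstMessageTime >= maxBundleMillis) {
            flush();
        }
    }

    private void flush() {
        // In a real transport this would marshal the buffered messages into one
        // packet and hand it to the network layer.
        buffer.clear();
        bufferedBytes = 0;
        firstMessageTime = -1;
    }
}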
Back to replication: we are now working on a similar solution for
partial replication (a.k.a. distribution in Infinispan).
I'll keep you posted on our progress!
Cheers,
Paolo
PS: The code is available on GitHub, if you want to take a look at it:
https://github.com/cloudtm/infinispan/tree/abcast_replication
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Performance Evaluation.pdf
Type: application/pdf
Size: 76348 bytes
Desc: not available
Url : http://lists.jboss.org/pipermail/infinispan-dev/attachments/20110416/6573002d/attachment-0001.pdf