[infinispan-dev] CloudTM: Additional Atomic broadcast based replication mechanism integrated in Infinispan

Bela Ban bban at redhat.com
Sun Apr 17 06:05:54 EDT 2011



On 4/16/11 10:47 PM, Paolo Romano wrote:
> Hi,
>
> at INESC-ID, we have just finished brewing another (full) replication
> protocol for Infinispan, in the context of the Cloud-TM project. The
> credits go to Pedro Ruivo, an MSc student working in our group, who
> developed this as part of his thesis work.
>
> Like Infinispan's current 2PC-based replication mechanism, this too is
> a multi-master scheme in which transactions are executed locally by
> any node, without any coordination among replicas until the commit phase.
>
> Rather than using 2PC, however, this scheme relies on a single
> Atomic broadcast (AB, a.k.a. total order broadcast) to disseminate
> transactions' writesets while ensuring totally ordered delivery of
> messages across the various nodes.
>
> The replication scheme is, in fact, quite simple. Upon delivery of the
> AB with the writeset of a transaction, we ensure that each replica
> executes the following operations in the order established by the AB layer:
> - acquiring locks on the writeset, possibly stealing the lock from any
> locally running transaction (and flagging it as aborted)
> - performing a write-skew check (in case Infinispan is configured to do
> this check)
> - applying the writeset
> - releasing the locks.


Using atomic broadcasts instead of 2PC to disseminate updates is
something I always thought was a good idea !
(I think something similar has been implemented in the context of a
research project at ETH Zurich, in Gustavo Alonso's group, with PostgreSQL.)

I think your performance results show that this scheme results in fewer
(lock) collisions and faster updates - despite the sequencer approach
used. This *might* even be faster with a token-based total order protocol...
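
To make sure I read the deliver path correctly, this is how I picture it
in code. Just a sketch of my own with invented names (lockFor(),
versionsUnchanged() etc.), not Pedro's actual implementation:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Sketch only: all names are invented, this is not Pedro's code.
    class AbDeliveryHandler {
        final ConcurrentHashMap<String,Object> store =
            new ConcurrentHashMap<String,Object>();
        final ConcurrentHashMap<String,ReentrantLock> locks =
            new ConcurrentHashMap<String,ReentrantLock>();
        final boolean writeSkewCheck = true;

        // Invoked by the AB layer, in the same total order on every node
        void onDeliver(Map<String,Object> writeSet,
                       Map<String,Long> readVersions) {
            // 1. Acquire the locks on the writeset. A real implementation
            //    would also steal locks held by local transactions and
            //    flag those transactions as aborted.
            for (String key : writeSet.keySet())
                lockFor(key).lock();
            try {
                // 2. Write-skew check (only if configured): abort if any
                //    version read by the transaction changed since it started
                if (writeSkewCheck && !versionsUnchanged(readVersions))
                    return; // abort: nothing is written back
                // 3. Apply the writeset
                store.putAll(writeSet);
            } finally {
                // 4. Release the locks
                for (String key : writeSet.keySet())
                    lockFor(key).unlock();
            }
        }

        ReentrantLock lockFor(String key) {
            locks.putIfAbsent(key, new ReentrantLock());
            return locks.get(key);
        }

        boolean versionsUnchanged(Map<String,Long> readVersions) {
            return true; // stub: compare against the versions in the store
        }
    }

Note that since deliveries arrive sequentially in total order, these
locks really only synchronize against locally running transactions.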


> ABcast can be implemented in many ways, but for simplicity we used
> the sequencer-based implementation available in JGroups. This scheme is
> quite simple:
> - upon an atomic broadcast, the message is simply broadcast to all nodes
> - upon receipt of the message, however, the nodes do not deliver it
> immediately to the application layer (Infinispan in this case). Instead,
> a special node (e.g. the one with the lowest ID in the group), namely the
> sequencer, broadcasts back the order in which processes should deliver
> this message.
> - finally, the nodes deliver the atomic broadcast message in the order
> specified by the sequencer.
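
If anyone wants the gist of the sequencer in code, here is a made-up
sketch (this is not the actual SEQUENCER protocol from JGroups):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Invented sketch of sequencer-based total order.
    class SequencerNode {
        private long nextToDeliver;                      // next seqno to hand up
        private final Map<String,byte[]> undelivered =
            new HashMap<String,byte[]>();                // msgId -> payload
        private final SortedMap<Long,String> order =
            new TreeMap<Long,String>();                  // seqno -> msgId
        private long nextSeqno;                          // used only on the sequencer

        // on the sequencer only: assign the next global sequence number
        // and broadcast the ordering decision to everyone
        void onDataAtSequencer(String msgId) {
            broadcastOrder(msgId, nextSeqno++);
        }

        // on every node: buffer the payload until its order is known
        void onData(String msgId, byte[] payload) {
            undelivered.put(msgId, payload);
            tryDeliver();
        }

        // on every node: record the sequencer's decision
        void onOrder(String msgId, long seqno) {
            order.put(seqno, msgId);
            tryDeliver();
        }

        // deliver contiguous messages in seqno order to the application
        private void tryDeliver() {
            while (order.containsKey(nextToDeliver)
                    && undelivered.containsKey(order.get(nextToDeliver))) {
                deliverToApplication(
                    undelivered.remove(order.remove(nextToDeliver)));
                nextToDeliver++;
            }
        }

        void broadcastOrder(String msgId, long seqno) { /* transport stub */ }
        void deliverToApplication(byte[] payload)     { /* hand up to app  */ }
    }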
>
> The main advantage of this replication mechanism is that it avoids
> distributed deadlocks, which makes its performance way better at high
> contention


+1


> (note that local deadlocks may still occur, as Infinispan's
> encounter-time locking scheme does not prevent them. But these are much
> rarer, as the number of "contending" threads is much lower.)
> Further, locks are held for a shorter time than with 2PC.


Exactly !


> With 2PC, the cohort nodes maintain the locks on behalf of the
> coordinator from the reception of the prepare message until they receive
> the commit message. This encompasses a round trip (collecting votes from
> all cohorts, and sending back the decision). With the ABcast scheme,
> instead, the locks acquired on behalf of remote transactions are held
> only for the time strictly required to write back locally.
> Finally, the sequencer is, in fact, a privileged node: it can commit
> transactions much faster than the other nodes, as it can self-assign the
> order in which transactions need to be processed.


This is something I haven't thought about yet. JGroups itself doesn't
re-arrange the total order; it basically establishes it on a
first-come-first-served basis. But you're right, arranging incoming
requests into a different total order might make sense. Example: if you
have P1, R1, P2, P3, R2, it might make sense to package {P1, P2 and P3}
and {R1 and R2} together, to make this even more efficient... Interesting !
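
Something along these lines on the sequencer, with all names invented
(not existing JGroups code):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Invented sketch: group pending messages by sender before the
    // sequencer assigns the global order, so that e.g. P1, R1, P2, P3, R2
    // comes out as P1, P2, P3, R1, R2 and same-sender messages can be
    // packaged into a single ordering message.
    class ReorderingSequencer {
        static class Msg {
            final String id, sender;
            Msg(String id, String sender) { this.id = id; this.sender = sender; }
        }

        private long nextSeqno;

        List<Msg> assignOrder(List<Msg> pending) {
            Map<String,List<Msg>> bySender = new LinkedHashMap<String,List<Msg>>();
            for (Msg m : pending) {
                List<Msg> group = bySender.get(m.sender);
                if (group == null)
                    bySender.put(m.sender, group = new ArrayList<Msg>());
                group.add(m);
            }
            // flatten: all of P's messages first, then all of R's, etc.;
            // each message still gets its own increasing seqno
            List<Msg> ordered = new ArrayList<Msg>();
            for (List<Msg> group : bySender.values())
                for (Msg m : group) {
                    ordered.add(m);
                    nextSeqno++; // broadcast (m.id, seqno) in a real impl
                }
            return ordered;
        }
    }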



> This may not be very
> fair at high contention rates, as the sequencer gets a higher chance of
> committing its transactions, but it does make things MUCH faster overall.


Agreed. Compare it to java.util.concurrent's reentrant locks: fairness
slows that implementation down, so IMO non-fairness is OK as long as
there is no starvation.
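
In code, the difference is just a constructor flag:

    import java.util.concurrent.locks.ReentrantLock;

    class FairnessDemo {
        // fair = FIFO hand-off: no starvation, but measurably slower
        final ReentrantLock fair = new ReentrantLock(true);
        // default is non-fair: threads may "barge", which is faster
        // overall but offers no ordering guarantee
        final ReentrantLock nonFair = new ReentrantLock();
    }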



> - To bundle or not to bundle?, page 4: the throughput results shown on
> page 1 actually have a little trick :-P We used "some" bundling ONLY in
> the case of transactions spanning 100K data items. We noticed that w/o
> bundling, the commit phase duration of the AB-based scheme got way
> worse, so we experimented with a few bundling values till we got decent
> performance (see plot on page 4). This is explainable: with 100K
> keys, contention is low, so fewer transactions get aborted during
> their execution. This means that more transactions reach the commit
> phase, which in turn means more atomic broadcasts hit the network layer
> and more load on the sequencing node. Bundling, as you can see on page
> 4, does the trick however. We did not retry the tests with the other
> protocols with and w/o bundling, as this would have been very time
> consuming. BTW, I believe that the results would not have been very
> different.


Did you apply bundling at the configuration level, or per message ? Not
sure if you know about the ability to override bundling on a per-message
basis: the DONT_BUNDLE flag set in a message overrides the bundling
configuration, and a message with this flag set is sent immediately.
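
E.g. (using the newer JGroups API style; the cluster name is made up):

    import org.jgroups.JChannel;
    import org.jgroups.Message;

    public class DontBundleDemo {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel();               // default protocol stack
            ch.connect("demo-cluster");                 // made-up cluster name
            Message msg = new Message(null, "urgent");  // null dest = send to all
            msg.setFlag(Message.Flag.DONT_BUNDLE);      // bypass bundling: sent immediately
            ch.send(msg);
            ch.close();
        }
    }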



> Let me open a small parenthesis about bundling: this can be an extremely
> effective weapon, but manual configuration is a big headache and
> extremely time consuming. I believe that it would be very useful to have
> some self-tuning mechanism for it in JGroups. In fact, we've recently
> got very nice results using reinforcement learning to tackle this
> problem (well, to be precise, not really the same problem, but a very
> similar one):
>
> http://www.inesc-id.pt/ficheiros/publicacoes/7163.pdf


I'll take a look. In the context of ergonomics, I've been wanting to set 
the max_bundle_time value dynamically, so this input is certainly helpful !
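
Even a trivial feedback loop would already be a start. A toy sketch of
the idea, where the candidate values, the Metrics interface and
applyBundleTime() are all made up:

    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    // Toy sketch of a self-tuning loop for max_bundle_time:
    // epsilon-greedy choice among a few candidate values, rewarded by
    // observed throughput. All the glue code here is hypothetical.
    class BundleTimeTuner {
        private final long[] candidatesMs = {1, 5, 10, 20, 50};
        private final double[] avgReward = new double[candidatesMs.length];
        private final int[] tries = new int[candidatesMs.length];
        private final Random rnd = new Random();

        void run(Metrics metrics) throws InterruptedException {
            while (true) {
                // mostly exploit the best-known value, sometimes explore
                int arm = rnd.nextDouble() < 0.1
                    ? rnd.nextInt(candidatesMs.length) : bestArm();
                applyBundleTime(candidatesMs[arm]);          // reconfigure transport
                TimeUnit.SECONDS.sleep(10);                  // measurement window
                double reward = metrics.messagesPerSecond(); // observed throughput
                tries[arm]++;
                avgReward[arm] += (reward - avgReward[arm]) / tries[arm];
            }
        }

        private int bestArm() {
            int best = 0;
            for (int i = 1; i < avgReward.length; i++)
                if (avgReward[i] > avgReward[best]) best = i;
            return best;
        }

        void applyBundleTime(long ms) { /* e.g. via JMX on the transport */ }
        interface Metrics { double messagesPerSecond(); }
    }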


> ...but I've implemented the prototype of this solution in Appia, as I
> knew it much better than JGroups. What do you think about integrating a
> similar mechanism in JGroups' ergonomics?


+1. Definitely something I've been wanting to do...




-- 
Bela Ban
Lead JGroups / Clustering Team
JBoss

