On 4/16/11 10:47 PM, Paolo Romano wrote:
Hi,
at INESC-ID, we have just finished brewing another (full) replication
protocol for Infinispan, in the context of the Cloud-TM project. The
credits go to Pedro Ruivo, an MSc student working in our group, who
developed this as part of his thesis work.
Like Infinispan's current 2PC-based replication mechanism, this too is
a multi-master scheme in which transactions are executed locally by
any node, without any coordination among replicas until the commit phase.
Rather than using 2PC, however, this scheme relies on a single
Atomic broadcast (AB, a.k.a. total order broadcast) to disseminate
transactions' writesets while ensuring totally ordered delivery of
messages across the various nodes.
The replication scheme is, in fact, quite simple. Upon delivery of the
AB with the writeset of a transaction, we ensure that each replica
executes the following operations in the order established by the AB layer (a rough code sketch follows the list):
- acquiring locks on the writeset, possibly stealing the lock from any
locally running transaction (and flagging it as aborted)
- performing a write-skew check (in case Infinispan is configured to do
this check)
- applying the writeset
- releasing the locks.
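
To make the steps concrete, here is a minimal Java sketch of what the delivery callback boils down to. This is purely an illustration; the LockManager and WriteSkewChecker interfaces below are placeholders invented for the sketch, not Infinispan's actual classes.

import java.util.Map;

// Illustrative sketch (not Infinispan's actual code) of the callback that applies a
// transaction's writeset. It runs sequentially, in the total order fixed by the AB layer.
public class AbDeliverHandler {

    public interface LockManager {                          // placeholder abstraction
        void lockOrSteal(Object key, String txId);          // aborts any local owner and takes the lock
        void unlock(Object key, String txId);
    }
    public interface WriteSkewChecker {                     // placeholder abstraction
        boolean stillValid(String txId, Iterable<Object> keys);  // did the values read by tx survive?
    }

    private final LockManager locks;
    private final WriteSkewChecker writeSkewChecker;
    private final Map<Object, Object> dataContainer;        // the local key/value store
    private final boolean writeSkewCheckEnabled;

    public AbDeliverHandler(LockManager locks, WriteSkewChecker writeSkewChecker,
                            Map<Object, Object> dataContainer, boolean writeSkewCheckEnabled) {
        this.locks = locks;
        this.writeSkewChecker = writeSkewChecker;
        this.dataContainer = dataContainer;
        this.writeSkewCheckEnabled = writeSkewCheckEnabled;
    }

    // Invoked once per delivered transaction, strictly in AB delivery order.
    public void onDeliver(String txId, Map<Object, Object> writeSet) {
        for (Object key : writeSet.keySet()) {              // 1. lock, stealing from local txs
            locks.lockOrSteal(key, txId);
        }
        try {
            if (writeSkewCheckEnabled && !writeSkewChecker.stillValid(txId, writeSet.keySet())) {
                return;                                     // 2. write-skew check failed: abort
            }
            dataContainer.putAll(writeSet);                 // 3. apply the writeset
        } finally {
            for (Object key : writeSet.keySet()) {          // 4. release the locks
                locks.unlock(key, txId);
            }
        }
    }
}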
Using atomic broadcasts instead of 2PC to disseminate updates is
something I always thought was a good idea !
(I think something similar has been implemented in the context of a
research project at ETH Zurich in Gustavo Alonso's group (with PostgreSQL))
I think your performance results show that this scheme results in fewer
(lock) collisions and faster updates - despite the sequencer approach
used. This *might* even be faster with a token-based total order protocol...
ABcast can be implemented in many ways, but for simplicity we used
the sequencer-based implementation available in JGroups. This scheme is
quite simple:
- upon an atomic broadcast, the message is simply broadcast to all nodes
- upon receipt of the message, however, the nodes do not deliver it
immediately to the application layer (Infinispan in this case). Instead,
a special node (e.g. the one with the lowest id in the group), namely the
sequencer, broadcasts back the order in which processes should deliver
this message.
- finally, the nodes deliver the atomic broadcast message in the order
specified by the sequencer.
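
In code, the idea boils down to something like the following toy sketch. This is just an illustration, not JGroups' actual SEQUENCER protocol; the Network and Application interfaces and the message classes are made up for the example.

import java.util.*;

// Toy illustration of sequencer-based total order. Every node broadcasts data messages;
// one distinguished node (the sequencer) broadcasts ORDER messages that tell everybody
// in which sequence to deliver them.
public class ToySequencer {

    interface Network { void broadcast(Object msg); }        // made-up broadcast primitive
    interface Application { void deliver(Object payload); }  // delivery target (e.g. the replication layer)

    static class Data  { final long id; final Object payload; Data(long id, Object p)  { this.id = id; this.payload = p; } }
    static class Order { final long dataId; final long seqNo; Order(long d, long s)    { this.dataId = d; this.seqNo = s; } }

    private final Network net;
    private final Application app;
    private final boolean iAmSequencer;
    private long nextSeqNoToAssign = 0;                       // used only by the sequencer
    private long nextSeqNoToDeliver = 0;
    private final Map<Long, Object> pendingData = new HashMap<>();   // dataId -> payload
    private final Map<Long, Long> pendingOrder = new HashMap<>();    // seqNo  -> dataId

    ToySequencer(Network net, Application app, boolean iAmSequencer) {
        this.net = net; this.app = app; this.iAmSequencer = iAmSequencer;
    }

    // Step 1: an atomic broadcast is just a plain broadcast of the data message.
    void abcast(long dataId, Object payload) { net.broadcast(new Data(dataId, payload)); }

    // Step 2: on receipt, buffer the data; the sequencer additionally broadcasts its order.
    synchronized void onReceiveData(Data d) {
        pendingData.put(d.id, d.payload);
        if (iAmSequencer) net.broadcast(new Order(d.id, nextSeqNoToAssign++));
        tryDeliver();
    }

    // Step 3: everybody (sequencer included) delivers in the order the sequencer assigned.
    synchronized void onReceiveOrder(Order o) {
        pendingOrder.put(o.seqNo, o.dataId);
        tryDeliver();
    }

    private void tryDeliver() {
        while (pendingOrder.containsKey(nextSeqNoToDeliver)
               && pendingData.containsKey(pendingOrder.get(nextSeqNoToDeliver))) {
            long dataId = pendingOrder.remove(nextSeqNoToDeliver++);
            app.deliver(pendingData.remove(dataId));
        }
    }
}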
The main advantage of this replication mechanism is that it avoids
distributed deadlocks, which makes its performance way better at high
contention.
+1
(note that local deadlocks may still occur, as Infinispan's
encounter-time locking scheme does not prevent them. But these are much
rarer, as the number of "contending" threads is much lower).
Further, locks are held shorter with respect to 2PC.
Exactly !
With 2PC, the
cohort nodes maintain the locks on behalf of the coordinator from the
reception of the prepare message until they receive the commit message. This
encompasses a round trip (collecting votes from all cohorts and sending
back the decision). With the ABcast scheme, instead, the locks acquired
on behalf of remote transactions are held only for the time strictly
required to write back locally.
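
For contrast, here is roughly what the cohort side of 2PC looks like. Again, this is just an illustrative sketch (not Infinispan's actual code): the locks taken in onPrepare() stay held across the coordinator's round trip and are only released when the decision arrives in onCommit() or onRollback().

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative cohort-side 2PC handlers, showing why locks are held across a full round trip.
public class TwoPhaseCommitCohort {

    public interface LockManager {                           // placeholder abstraction
        boolean tryLock(Object key, String txId);
        void unlock(Object key, String txId);
    }

    private final LockManager locks;
    private final Map<Object, Object> dataContainer;         // the local key/value store
    private final Map<String, Map<Object, Object>> prepared = new ConcurrentHashMap<>();

    public TwoPhaseCommitCohort(LockManager locks, Map<Object, Object> dataContainer) {
        this.locks = locks;
        this.dataContainer = dataContainer;
    }

    // Phase 1: locks are acquired here ...
    public boolean onPrepare(String txId, Map<Object, Object> writeSet) {
        List<Object> acquired = new ArrayList<>();
        for (Object key : writeSet.keySet()) {
            if (!locks.tryLock(key, txId)) {                 // conflict: release what we took, vote NO
                for (Object k : acquired) locks.unlock(k, txId);
                return false;
            }
            acquired.add(key);
        }
        prepared.put(txId, writeSet);
        return true;  // vote YES; locks stay held while the coordinator collects votes
    }

    // Phase 2: ... and only released here, after the coordinator's round trip.
    public void onCommit(String txId) {
        Map<Object, Object> writeSet = prepared.remove(txId);
        if (writeSet == null) return;
        dataContainer.putAll(writeSet);
        for (Object key : writeSet.keySet()) locks.unlock(key, txId);
    }

    public void onRollback(String txId) {
        Map<Object, Object> writeSet = prepared.remove(txId);
        if (writeSet == null) return;
        for (Object key : writeSet.keySet()) locks.unlock(key, txId);
    }
}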
Finally, the sequencer is, in fact, a privileged node: it can commit
transactions much faster than the other nodes, since it can self-assign the
order in which transactions need to be processed.
This is something I haven't thought about yet. JGroups itself doesn't
re-arrange the total order; it basically establishes it on a
first-come-first-serve basis. But you're right, arranging incoming
requests into a different total order might make sense. Example: if you
have P1, R1, P2, P3, R2, it might make sense to package {P1, P2 and P3} and
{R1 and R2} together, to make this even more efficient... Interesting !
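
Just to sketch that thought: something like the following could sit on the sequencer side. This is purely hypothetical (nothing in JGroups does this today); the Network interface and Order message reuse the made-up names from the toy sketch above.

import java.util.*;

// Hypothetical variant of the toy sequencer: buffer incoming data ids briefly, group them
// by an application-supplied "kind" (e.g. P vs R requests), and assign consecutive sequence
// numbers per group, so requests of the same kind end up adjacent in the total order.
public class GroupingSequencer {

    interface Network { void broadcast(Object msg); }

    static class Order { final long dataId; final long seqNo; Order(long d, long s) { dataId = d; seqNo = s; } }

    private final Network net;
    private long nextSeqNo = 0;
    private final Map<String, List<Long>> pendingByKind = new LinkedHashMap<>();

    GroupingSequencer(Network net) { this.net = net; }

    synchronized void onReceiveData(long dataId, String kind) {
        pendingByKind.computeIfAbsent(kind, k -> new ArrayList<>()).add(dataId);
    }

    // Called periodically, or when enough messages are buffered: flush one group at a time.
    synchronized void flush() {
        for (List<Long> group : pendingByKind.values()) {
            for (long dataId : group) {
                net.broadcast(new Order(dataId, nextSeqNo++));
            }
        }
        pendingByKind.clear();
    }
}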
This may not be very
fair at high contention rates, as the sequencer gets a higher chance of
committing its transactions, but it does make things MUCH faster overall.
Agreed. Compare it to java.util.concurrent reentrant locks: fairness
slows down this implementation, so IMO non-fairness is ok as long as
there is no starvation.
- To bundle or not to bundle?, page 4: the throughput results shown on
page 1 actually involve a little trick :-P We used "some" bundling ONLY in
the case of transactions spanning 100K data items. We noticed that w/o
bundling, the commit times of the AB-based scheme got way
lousier, and we experimented with a few bundling values till we got decent
performance (see plot at page 4). This is explainable: with 100K
keys, contention is low, so fewer transactions get aborted during
their execution. This means that more transactions reach the commit
phase, which in turn means more atomic broadcasts hit the network layer
and more load on the sequencing node. Bundling, as you can see on page
4, does the trick, however. We did not retry the tests with the other
protocols with and w/o bundling, as this would have been very
time-consuming. BTW, I believe that the results would not have been very
different.
Did you apply bundling at the configuration level, or per message ? Not
sure if you know about the ability to override bundling on a per-message
basis: the NO_BUNDLE flag set in a message overrides the bundling
configuration, and a message with such a flag set is sent immediately.
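
For reference, setting such a flag per message is a one-liner, roughly like this. Exact constructor and flag names vary between JGroups versions (this is written against a 3.x-style API, where the constant is spelled Message.Flag.DONT_BUNDLE), so double-check the Message class in the version you ship.

import org.jgroups.JChannel;
import org.jgroups.Message;

// Sketch: send one message that bypasses the configured bundling.
public class NoBundleExample {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();            // default protocol stack
        ch.connect("demo-cluster");
        Message msg = new Message(null);         // null destination = send to the whole cluster
        msg.setObject("urgent payload");
        msg.setFlag(Message.Flag.DONT_BUNDLE);   // this message is sent immediately, unbundled
        ch.send(msg);
        ch.close();
    }
}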
Let me open a small parenthesis about bundling: this can be an
extremely
effective weapon, but manual configuration is a big headache and
extremely time-consuming. I believe that it would be very useful to have
some self-tuning mechanism for it in JGroups. In fact, we've recently
got very nice results using reinforcement learning to tackle this
problem (well, to be precise, not really the same problem, but a very
similar one):
http://www.inesc-id.pt/ficheiros/publicacoes/7163.pdf
I'll take a look. In the context of ergonomics, I've been wanting to set
the max_bundle_time value dynamically, so this input is certainly helpful !
...but I've implemented the prototype of this solution in Appia,
as I
knew it much better than JGroups. What do you think about integrating a
similar mechanism in JGroups' ergonomics?
+1. Definitely something I've been wanting to do...
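
To make that a bit more concrete, a deliberately naive sketch of such a control loop could look like the following. BundlingControl is an invented abstraction, not a JGroups interface, and a real integration would hook into the transport's ergonomics and use a smarter policy (e.g. the reinforcement-learning approach from the paper above) rather than this threshold rule.

// Hypothetical self-tuning loop for the bundling timeout.
public class BundleTimeTuner implements Runnable {

    public interface BundlingControl {           // invented abstraction, NOT a JGroups API
        long getMaxBundleTimeMs();
        void setMaxBundleTimeMs(long ms);
        double messagesPerSecond();              // observed outgoing message rate
    }

    private final BundlingControl control;

    public BundleTimeTuner(BundlingControl control) {
        this.control = control;
    }

    @Override
    public void run() {
        double rate = control.messagesPerSecond();
        long current = control.getMaxBundleTimeMs();
        if (rate > 10000 && current < 50) {
            control.setMaxBundleTimeMs(current + 5);   // heavy traffic: bundle more aggressively
        } else if (rate < 1000 && current > 5) {
            control.setMaxBundleTimeMs(current - 5);   // light traffic: favor latency
        }
    }
    // Could be scheduled e.g. once per second with a ScheduledExecutorService.
}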
--
Bela Ban
Lead JGroups / Clustering Team
JBoss