Excellent stuff, Paolo and Pedro. My comments below, inline. Cc'ing infinispan-dev
as well.
On 16 Apr 2011, at 21:47, Paolo Romano wrote:
Hi,
at INESC-ID, we have just finished brewing another (full) replication protocol for
Infinispan, in the context of the Cloud-TM project. The credits go to Pedro Ruivo, an MSc
student working in our group, who developed this as part of his thesis work.
Like Infinispan's current 2PC-based replication mechanism, this is a multi-master scheme
in which transactions are executed locally by any node, without any coordination among
replicas until the commit phase.
Rather than using 2PC, however, this scheme relies on a single Atomic Broadcast
(AB, a.k.a. total order broadcast) to disseminate transactions' writesets while
ensuring totally ordered delivery of messages across the various nodes.
The replication scheme is, in fact, quite simple. Upon delivery of the AB with the
writeset of a transaction, we ensure that each replica executes the following operations
in the order established by the AB layer:
- acquiring locks on the writeset, possibly stealing the lock from any locally running
transaction (and flagging it as aborted)
- performing a write-skew check (in case Infinispan is configured to do this check)
- applying the writeset
- releasing the locks (a rough sketch of this delivery handler follows below).
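To make this concrete, here is roughly what such a handler could look like (illustrative Java only, not the actual implementation; onAbDeliver, lockManager, dataContainer, WriteSet and Transaction are placeholder names):

    // Illustrative sketch only: executed at every replica, in the total order
    // established by the AB layer. All names below are placeholders, not
    // Infinispan's actual classes.
    void onAbDeliver(WriteSet ws) {
        Transaction remoteTx = ws.transaction();
        // 1. acquire locks on the writeset, stealing them from any conflicting
        //    local transaction, which gets flagged as aborted
        for (Object key : ws.keys()) {
            Transaction owner = lockManager.currentOwner(key);
            if (owner != null && owner.isLocal()) {
                owner.markForAbort();
                lockManager.steal(key, remoteTx);
            } else {
                lockManager.acquire(key, remoteTx);
            }
        }
        try {
            // 2. write-skew check, only if Infinispan is configured to perform it
            if (writeSkewCheckEnabled && !ws.readVersionsStillMatch(dataContainer)) {
                remoteTx.markForAbort();
                return;
            }
            // 3. apply the writeset to the local data container
            for (WriteSet.Entry e : ws.entries()) {
                dataContainer.put(e.key(), e.value());
            }
        } finally {
            // 4. release the locks
            for (Object key : ws.keys()) {
                lockManager.release(key, remoteTx);
            }
        }
    }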
I presume this implementation still plays nice with XA/JTA? In that transactions marked
for rollback, etc. are respected?
ABcast can be implemented in many ways, but for simplicity we used the sequencer-based
implementation available in JGroups. This scheme is quite simple:
- upon an atomic broadcast, the message is simply broadcast to all nodes
- upon receipt of the message, however, the nodes do not deliver it immediately to the
application layer (Infinispan in this case). Instead, a special node (e.g. the one with
the lowest id in the group), namely the sequencer, broadcasts back the order in which
processes should deliver this message.
- finally, the nodes deliver the atomic broadcast message in the order specified by the
sequencer (a minimal sketch of this idea follows below).
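Purely for illustration, a minimal sketch of this sequencer logic (this is not JGroups' actual SEQUENCER code; DataMsg, broadcastToAll(), deliverToApplication(), iAmTheSequencer() and orderMessage() are placeholders):

    // Illustrative sketch of a sequencer-based total order broadcast.
    // Assumes java.util.{HashMap, Map, SortedMap, TreeMap}.
    class SequencedDelivery {
        private final Map<Long, DataMsg> pending = new HashMap<Long, DataMsg>();  // msgId -> buffered message
        private final SortedMap<Long, Long> order = new TreeMap<Long, Long>();    // seqNo -> msgId
        private long nextToDeliver = 0;  // next sequence number to deliver locally
        private long nextSeqNo = 0;      // next sequence number to assign (sequencer only)

        // sender side: just broadcast the data message to every node
        void abcast(DataMsg msg) {
            broadcastToAll(msg);
        }

        // receiver side: buffer the message; the sequencer also announces its order
        synchronized void onDataMessage(DataMsg msg) {
            pending.put(msg.getId(), msg);
            if (iAmTheSequencer()) {
                broadcastToAll(orderMessage(msg.getId(), nextSeqNo++));
            }
            tryDeliver();
        }

        // everyone, sequencer included, delivers in the announced order
        synchronized void onOrderMessage(long msgId, long seqNo) {
            order.put(seqNo, msgId);
            tryDeliver();
        }

        private void tryDeliver() {
            // deliver consecutive sequence numbers whose data message has already arrived
            while (order.containsKey(nextToDeliver)
                    && pending.containsKey(order.get(nextToDeliver))) {
                deliverToApplication(pending.remove(order.remove(nextToDeliver)));
                nextToDeliver++;
            }
        }
    }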
The main advantage of this replication mechanism is that it avoids distributed deadlocks,
which makes its performance way better at high contention (note that local deadlocks may
still occur, as Infinispan's encounter locking scheme does not prevent them. But these
are much rarer, as the number of "contending" threads is much lower).
Definitely.
Further, locks are held for a shorter time than with 2PC. With 2PC, the cohort nodes hold
the locks on behalf of the coordinator from the reception of the prepare until they
receive the commit message, which encompasses a full round trip (collecting votes from all
cohorts and sending back the decision). With the ABcast scheme, instead, the locks
acquired on behalf of remote transactions are held only for the time strictly required to
write back locally.
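To give a purely illustrative feel for the difference, assuming a ~1 ms network round trip and a ~0.1 ms local writeback:

    lock hold time for a remote tx (2PC)    ≈ round trip + writeback ≈ 1.0 ms + 0.1 ms = 1.1 ms
    lock hold time for a remote tx (ABcast) ≈ writeback              ≈ 0.1 ms

i.e. roughly an order of magnitude shorter (the actual figures obviously depend on the network and on the writeset size).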
Very true. Nice.
Finally, the sequencer is, in fact, a privileged node: it can commit transactions much
faster than the other nodes, since it can self-assign the order in which transactions need
to be processed. This may not be very fair at high contention rates, as it gets a higher
chance of committing its transactions, but it does make it MUCH faster overall.
Concerning blocking scenarios in case of failures: just like 2PC blocks if the node
coordinating a transaction crashes, this replication scheme also blocks, but this time if
the sequencer crashes. The comparison in terms of liveness guarantees therefore seems
quite fair. (Note that it would have been possible to make this replication mechanism
non-blocking, at the cost of one extra communication step, but we opted not to do so in
order to compare the protocols more fairly.)
When you say making the replication mechanism non-blocking, are you referring to
asynchronous communication?
To evaluate performance we ran the same kind of experiments we used in our recent mail
where we evaluated a primary-backup-based replication scheme. All nodes run only write
transactions of 10 statements, one of which is a put. Accesses are uniformly distributed
across 1K, 10K and 100K data items. The machines have 8 cores and 8GB of RAM, and RadarGun
uses 10 threads per node.
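For concreteness, each benchmark transaction looks roughly like the sketch below (schematic code against the standard Infinispan Cache and JTA TransactionManager interfaces, not the actual RadarGun stage; numKeys stands for the 1K/10K/100K key-space size and randomValue() is a placeholder for a value generator):

    // Schematic version of one benchmark transaction: 10 uniformly distributed
    // accesses, one of which is a put; 10 such threads run on every node.
    // Assumes org.infinispan.Cache, javax.transaction.TransactionManager, java.util.Random.
    void runOneTransaction(Cache<String, String> cache, TransactionManager tm,
                           int numKeys, Random rnd) throws Exception {
        int putIndex = rnd.nextInt(10);   // position of the single write
        tm.begin();
        boolean success = false;
        try {
            for (int op = 0; op < 10; op++) {
                String key = "key-" + rnd.nextInt(numKeys);   // uniform access pattern
                if (op == putIndex) {
                    cache.put(key, randomValue(rnd));         // the single put
                } else {
                    cache.get(key);
                }
            }
            success = true;
        } finally {
            if (success) {
                tm.commit();   // this is where the replication protocol under test kicks in
            } else {
                tm.rollback();
            }
        }
    }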
Is your test still forcing deadlocks by working on the same keyset on each node?
Summary of results (see attached pdf):
- Throughput, page 1: the plots show significant speedups, especially at high contention,
where 2PC stumbles upon very frequent distributed deadlocks. Note that we enabled deadlock
detection in all experiments running 2PC.
- Abort rate, page 2: abort rates are similar in both protocols (note the log scale on the
y-axis). But the AB-based solution, by avoiding deadlocks, detects the need to abort
transactions much faster than 2PC.
- Average Commit duration, page 3: at high contention, 2PC thrashes due to deadlocks
(despite the deadlock detection mechanism), and this is reflected in the Avg. commit
duration as well. In the other cases (10K and 100K keys), where there is less contention,
the commit duration of the two protocols is similar (see next).
- To bundle or not to bundle?, page 4: the throughput results shown on page 1 actually
involve a little trick :-P We used "some" bundling ONLY in the case of transactions
spanning 100K data items. We noticed that without bundling the commit duration of the
AB-based scheme got way worse, and we experimented with a few bundling values until we got
decent performance (see the plot on page 4). This is explainable: with 100K keys
contention is low, so fewer transactions get aborted during their execution. This means
that more transactions reach the commit phase, which in turn means that more atomic
broadcasts hit the network layer and more load is placed on the sequencing node. Bundling,
as you can see on page 4, does the trick, however. We did not rerun the tests for the
other protocols with and without bundling, as this would have been very time consuming.
BTW, I believe that the results would not have been very different.
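For reference, the bundling knobs in question sit on the JGroups transport; in the 2.x configuration files they look roughly like the excerpt below (attribute names and defaults vary across JGroups versions, and the values here are just placeholders to be tuned per workload):

    <!-- excerpt of the transport element of the JGroups stack; other attributes omitted -->
    <UDP enable_bundling="true"
         max_bundle_size="64000"
         max_bundle_timeout="30" />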
Let me open a small parenthesis about bundling: this can be an extremely effective weapon,
but manual configuration is a big headache and extremely time consuming. I believe it
would be very useful to have some self-tuning mechanism for it in JGroups. In fact, we've
recently obtained very nice results using reinforcement learning to tackle this problem
(well, to be precise, not exactly the same problem, but a very similar one):
http://www.inesc-id.pt/ficheiros/publicacoes/7163.pdf
...but I've implemented the prototype of this solution in Appia, as I knew it much better
than JGroups. What do you think about integrating a similar mechanism into JGroups'
ergonomics?
Back to replication: we are now working on a similar solution for partial replication
(a.k.a. distribution in Infinispan).
Very keen to see this work with distribution. :-)
Cheers
Manik
--
Manik Surtani
manik@jboss.org
twitter.com/maniksurtani
Lead, Infinispan
http://www.infinispan.org