Hi
On 11 Feb 2011, at 17:56, Sebastiano Peluso wrote:
Hi all,
first let us introduce ourselves given that it's the first time we write in this
mailing list.
Our names are Sebastiano Peluso and Diego Didona, and we are working at INESC-ID Lisbon
in the context of the Cloud-TM project. Our work is framed in the context of self-adaptive
replication mechanisms (please refer to previous message by Paolo on this mailing list and
to the
www.cloudtm.eu website for additional details on this project).
Up to date we have been working on developing a relatively simple primary-backup (PB)
replication mechanism, which we integrated within Infinispan 4.2 and 5.0. In this kind of
scheme only one node (called the primary) is allowed to process update transactions,
whereas the remaining nodes only process read-only transactions. This allows coping very
efficiently with write transactions, as the primary does not have to incur in the cost of
remote coordination schemes nor in distributed deadlocks, which can hamper performance at
high contention with two phase commit schemes. With PB, in fact, the primary can serialize
transactions locally, and simply propagate the updates to the backups for fault-tolerance
as well as to allow them to process read-only transactions on fresh data of course. On the
other hand, the primary is clearly prone to become the bottleneck especially in large
clusters and write intensive workloads.
So when you say "serialize transactions locally" does this mean you don't do
distributed locking? Or is "locally" not referring simply to the primary?
Thus, this scheme does not really represent a replacement for the default 2PC protocol,
but rather an alternative approach that results particularly attractive (as we will
illustrate in the following with some radargun-based performance results) in small scale
clusters, or, in "elastic" cloud scenarios, in periods where the workload is
either read dominated or not very intense.
I'm not sure what you mean by a "replacement" for 2PC? 2PC is for atomicity
(consensus). It's the A in ACID. It's not the C, I or D.
Being this scheme particularly efficient in these scenarios, in fact,
its adoption would allow to minimize the number of resources acquired from the cloud
provider in these periods, with direct benefits in terms of cost reductions. In Cloud-TM,
in fact, we aim at designing autonomic mechanisms that would dynamically switch among
multiple replication mechanisms depending on the current workload characteristics.
Replication and transactions aren't mutually inconsistent things. They can be merged
quite well :-)
http://www.cs.ncl.ac.uk/publications/articles/papers/845.pdf
http://www.cs.ncl.ac.uk/publications/inproceedings/papers/687.pdf
http://www.cs.ncl.ac.uk/publications/inproceedings/papers/586.pdf
http://www.cs.ncl.ac.uk/publications/inproceedings/papers/596.pdf
http://www.cs.ncl.ac.uk/publications/trnn/papers/117.pdf
http://www.cs.ncl.ac.uk/publications/inproceedings/papers/621.pdf
The reason I mention these is because I don't quite understand what your goal is/was
with regards to transactions and replication.
Before discussing the results of our preliminary benchmarking study, we would like to
briefly overview how we integrated this replication mechanism within Infinispan. Any
comment/feedback is clearly highly appreciated. First of all we have defined a new
command, namely PassiveReplicationCommand, that is a subclass of PrepareCommand. We had to
define a new command because we had to design customized "visiting" methods for
the interceptors. Note that our protocol only affects the commit phase of a transaction,
specifically, during the prepare phase, in the prepare method of the TransactionXaAdapter
class, if the Primary Backup mode is enabled, then a PassiveReplicationCommand is built by
the CommandsFactory and it is passed to the invoke method of the invoker. The
PassiveReplicationCommand is then visited by the all the interceptors in the chain, by
means of the visitPassiveReplicationCommand methods. We describe more in detail the
operations performed by the non-trivial interceptors:
-TxInterceptor: like in the 2PC protocol, if the context is not originated locally, then
for each modification stored in the PassiveReplicationCommand the chain of interceptors is
invoked.
-LockingInterceptor: first the next interceptor is called, then the cleanupLocks is
performed with the second parameter set to true (commit the operations). This operation is
always safe: on the primary it is called only after that the acks from all the slaves are
received (see the ReplicationInterceptor below); on the slave there are no concurrent
conflicting writes since these are already locally serialized by the locking scheme
performed at the primary.
-ReplicationInterceptor: first it invokes the next iterceptor; then if the predicate
shouldInvokeRemoteTxCommand(ctx) is true, then the method
rpcManager.broadcastRpcCommand(command, true, false) is performed, that replicates the
modifications in a synchronous mode (waiting for an explicit ack from all the backups).
As for the commit phase:
-Given that in the Primary Backup the prepare phase works as a commit too, the commit
method on a TransactionXaAdapter object in this case simply returns.
On the resulting extended Infinispan version a subset of unit/functional tests were
executed and successfully passed:
- commands.CommandIdUniquenessTest
- replication.SyncReplicatedAPITest
- replication.SyncReplImplicitLockingTest
- replication.SyncReplLockingTest
- replication.SyncReplTest
- replication.SyncCacheListenerTest
- replication.ReplicationExceptionTest
We have tested this solution using a customized version of Radargun. Our customizations
were first of all aimed at having each thread accessing data within transactions, instead
of executing single put/get operations. In addition, now every Stresser thread accesses
with uniform probability all of the keys stored by Infinispan, thus generating conflicts
with a probability proportional to the number of concurrently active threads and inversely
proportional to the total number of keys maintained.
As already hinted, our results highlight that, depending on the current workload/number
of nodes in the system, it is possible to identify scenarios where the PB scheme
significantly outperforms the current 2PC scheme, and vice versa.
I'm obviously missing something here, because it still seems like you are attempting
to compare incomparable things. It's not like comparing apples and oranges (at least
they are both fruit), but more like apples and broccoli ;-)
Our experiments were performed in a cluster of homogeneous 8-core
(Xeon(a)2.16GhZ) nodes interconnected via a Gigabit Ethernet and running a Linux Kernel 64
bit version 2.6.32-21-server. The results were obtained by running for 30 seconds 8
parallel Stresser threads per nodes, and letting the number of node vary from 2 to 10. In
2PC, each thread executes transactions which consist of 10 (get/put) operations, with a
10% of probability of generating a put operation. With PB, the same kind of transactions
are executed on the primary, but the backups execute read-only transactions composed of 10
get operations. This allows to compare the maximum throughput of update transactions
provided by the two compared schemes, without excessively favoring PB by keeping the
backups totally idle.
The total write transactions' throughput exhibited by the cluster (i.e. not the
throughput per node) is shown in the attached plots, relevant to caches containing 1000,
10000 and 100000 keys. As already discussed, the lower the number of keys, the higher the
chance of contention and the probability of aborts due to conflicts. In particular with
the 2PC scheme the number of failed transactions steadily increases at high contention up
to 6% (to the best of our understanding in particular due to distributed deadlocks). With
PB, instead the number of failed txs due to contention is always 0.
Maybe you don't mean 2PC, but pessimistic locking? I'm still not too sure. But I
am pretty sure it's not 2PC you mean.
Note that, currently, we are assuming that backup nodes do not generate update
transactions. In practice this corresponds to assuming the presence of some load balancing
scheme which directs (all and only) update transactions to the primary node, and read
transactions to the backup.
Can I say it again? Serialisability?
In the negative case (a put operation is generated on a backup node),
we simply throw a PassiveReplicationException at the CacheDelegate level. This is probably
suboptimal/undesirable in real settings, as update transactions may be transparently
rerouted (at least in principle!) to the primary node in a RPC-style. Any suggestion on
how to implement such a redirection facility in a transparent/non-intrusive manner would
be highly appreciated of course! ;-)
To conclude, we are currently working on a statistical model that is able to predict the
best suited replication scheme given the current workload/number of machines, as well as
on a mechanism to dynamically switch from one replication scheme to the other.
We'll keep you posted on our progresses!
Regards,
Sebastiano Peluso, Diego Didona
<PCvsPR_WRITE_TX_1000keys.png><PCvsPR_WRITE_TX_10000keys.png><PCvsPR_WRITE_TX_100000keys.png>------------------------------------------------------------------------------
The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
Pinpoint memory and threading errors before they happen.
Find and fix more than 250 security defects in the development cycle.
Locate bottlenecks in serial and parallel code that limit performance.
http://p.sf.net/sfu/intel-dev2devfeb_____________________________________...
Cloudtm-discussion mailing list
Cloudtm-discussion(a)lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cloudtm-discussion
---
Mark Little
mlittle(a)redhat.com
JBoss, by Red Hat
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor,
Berkshire, SI4 1TE, United Kingdom.
Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael
Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland).