[infinispan-dev] Primary-Backup replication scheme in Infinispan
Sebastiano Peluso
peluso at gsd.inesc-id.pt
Fri Feb 11 12:56:37 EST 2011
Hi all,
first, let us introduce ourselves, given that this is the first time we
write to this mailing list.
Our names are Sebastiano Peluso and Diego Didona, and we are working at
INESC-ID Lisbon in the context of the Cloud-TM project. Our work focuses
on self-adaptive replication mechanisms (please refer to Paolo's
previous message on this mailing list and to the www.cloudtm.eu website
for additional details on this project).
So far we have been working on a relatively simple primary-backup (PB)
replication mechanism, which we integrated into Infinispan 4.2 and 5.0.
In this kind of scheme only one node (the primary) is allowed to process
update transactions, whereas the remaining nodes only process read-only
transactions. This makes write transactions very cheap to handle, as the
primary neither incurs the cost of remote coordination nor risks
distributed deadlocks, which can hamper performance at high contention
with two-phase commit schemes. With PB the primary can serialize
transactions locally and simply propagate the updates to the backups,
both for fault tolerance and to let them process read-only transactions
on fresh data. On the other hand, the primary is clearly prone to
becoming a bottleneck, especially in large clusters and with
write-intensive workloads.
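To make the idea concrete, here is a purely illustrative sketch in plain
Java (not our actual code: Primary, Backup and the method names are made
up) of the PB commit path, where the primary applies the write set
locally and then pushes it synchronously to the backups:

    import java.util.List;
    import java.util.Map;

    // Toy model of the PB commit path: update transactions are serialized at
    // the primary (plain local synchronization, no remote locks), then the
    // write set is pushed synchronously to every backup.
    class PrimaryBackupSketch {

        interface Backup {
            // blocks until the backup has applied the write set (the "ack")
            void applyWriteSet(Map<String, Object> writeSet);
        }

        static class Primary {
            private final List<Backup> backups;

            Primary(List<Backup> backups) {
                this.backups = backups;
            }

            // one update transaction at a time: no distributed deadlocks possible
            synchronized void commit(Map<String, Object> writeSet,
                                     Map<String, Object> localStore) {
                localStore.putAll(writeSet);          // apply locally
                for (Backup b : backups) {
                    b.applyWriteSet(writeSet);        // synchronous propagation
                }
            }
        }
    }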
Thus, this scheme is not really a replacement for the default 2PC
protocol, but rather an alternative approach that is particularly
attractive (as we illustrate below with some Radargun-based performance
results) in small-scale clusters, or, in "elastic" cloud scenarios,
during periods where the workload is either read-dominated or not very
intense. Since the scheme is particularly efficient in these scenarios,
adopting it would allow minimizing the number of resources acquired from
the cloud provider during such periods, with direct benefits in terms of
cost reduction. Indeed, in Cloud-TM we aim at designing autonomic
mechanisms that dynamically switch among multiple replication mechanisms
depending on the current workload characteristics.
Before discussing the results of our preliminary benchmarking study, we
would like to briefly describe how we integrated this replication
mechanism within Infinispan. Any comment/feedback is of course highly
appreciated. First of all, we defined a new command, namely
PassiveReplicationCommand, which is a subclass of PrepareCommand. We
needed a new command in order to define customized "visiting" methods
for the interceptors. Note that our protocol only affects the commit
phase of a transaction: during the prepare phase, in the prepare method
of the TransactionXaAdapter class, if Primary Backup mode is enabled, a
PassiveReplicationCommand is built by the CommandsFactory and passed to
the invoke method of the invoker. The PassiveReplicationCommand is then
visited by all the interceptors in the chain, by means of their
visitPassiveReplicationCommand methods. We describe in more detail the
operations performed by the non-trivial interceptors (a rough sketch of
the resulting interceptor code follows the list):
-TxInterceptor: as in the 2PC protocol, if the context is not locally
originated, the chain of interceptors is invoked for each modification
stored in the PassiveReplicationCommand.
-LockingInterceptor: first the next interceptor is called, then
cleanupLocks is performed with the second parameter set to true (commit
the operations). This is always safe: on the primary it happens only
after the acks from all the backups have been received (see the
ReplicationInterceptor below); on a backup there are no concurrent
conflicting writes, since these have already been serialized locally by
the locking scheme at the primary.
-ReplicationInterceptor: first it invokes the next interceptor; then, if
the predicate shouldInvokeRemoteTxCommand(ctx) is true,
rpcManager.broadcastRpcCommand(command, true, false) is invoked, which
replicates the modifications synchronously (waiting for an explicit ack
from all the backups).
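To give an idea of what the new visit method looks like, here is a rough
sketch in the style of an Infinispan CommandInterceptor (simplified with
respect to our actual patch: the shouldInvokeRemoteTxCommand predicate is
only approximated here, and PassiveReplicationCommand is of course the
new command introduced by the patch):

    import org.infinispan.context.impl.TxInvocationContext;
    import org.infinispan.factories.annotations.Inject;
    import org.infinispan.interceptors.base.CommandInterceptor;
    import org.infinispan.remoting.rpc.RpcManager;

    // Sketch of the PB-specific visit method: run the rest of the chain first,
    // then (on the primary only) broadcast the command synchronously and wait
    // for the acks from all the backups before the locks are released.
    public class PassiveReplicationInterceptorSketch extends CommandInterceptor {

        private RpcManager rpcManager;

        @Inject
        public void init(RpcManager rpcManager) {
            this.rpcManager = rpcManager;
        }

        public Object visitPassiveReplicationCommand(TxInvocationContext ctx,
                                                     PassiveReplicationCommand command)
                throws Throwable {
            Object retVal = invokeNextInterceptor(ctx, command);
            if (shouldReplicate(ctx)) {
                // synchronous, no priority queue: waits for all backups to ack
                rpcManager.broadcastRpcCommand(command, true, false);
            }
            return retVal;
        }

        // approximation of the shouldInvokeRemoteTxCommand(ctx) predicate
        private boolean shouldReplicate(TxInvocationContext ctx) {
            return ctx.isOriginLocal() && ctx.hasModifications();
        }
    }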
As for the commit phase:
-Given that in Primary Backup mode the prepare phase acts as the commit
too, the commit method on a TransactionXaAdapter object simply returns
in this case.
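In other words, the XAResource commit boils down to something like the
following (again just a sketch with made-up names; in our patch the
check lives in TransactionXaAdapter):

    import javax.transaction.xa.XAException;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;

    // Sketch of the commit-phase behaviour: when PB is enabled the commit is a
    // no-op, because the PassiveReplicationCommand sent at prepare time has
    // already applied and replicated the transaction.
    class PassiveReplicationCommitSketch {

        private final boolean passiveReplicationEnabled;  // made-up flag name
        private final XAResource twoPhaseCommitPath;      // the regular 2PC path

        PassiveReplicationCommitSketch(boolean passiveReplicationEnabled,
                                       XAResource twoPhaseCommitPath) {
            this.passiveReplicationEnabled = passiveReplicationEnabled;
            this.twoPhaseCommitPath = twoPhaseCommitPath;
        }

        void commit(Xid xid, boolean onePhase) throws XAException {
            if (passiveReplicationEnabled) {
                return;                                   // nothing left to do
            }
            twoPhaseCommitPath.commit(xid, onePhase);     // standard 2PC commit
        }
    }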
On the resulting extended Infinispan version we executed a subset of the
unit/functional tests, all of which passed successfully:
- commands.CommandIdUniquenessTest
- replication.SyncReplicatedAPITest
- replication.SyncReplImplicitLockingTest
- replication.SyncReplLockingTest
- replication.SyncReplTest
- replication.SyncCacheListenerTest
- replication.ReplicationExceptionTest
We have tested this solution using a customized version of Radargun. Our
customizations were first of all aimed at having each thread access data
within transactions, instead of executing single put/get operations. In
addition, every Stresser thread now accesses all of the keys stored by
Infinispan with uniform probability, thus generating conflicts with a
probability proportional to the number of concurrently active threads
and inversely proportional to the total number of keys maintained.
As already hinted, our results highlight that, depending on the current
workload/number of nodes in the system, it is possible to identify
scenarios where the PB scheme significantly outperforms the current 2PC
scheme, and vice versa. Our experiments were performed in a cluster of
homogeneous 8-core (Xeon at 2.16GHz) nodes interconnected via Gigabit
Ethernet and running a 64-bit Linux kernel, version 2.6.32-21-server.
The results were obtained by running 8 parallel Stresser threads per
node for 30 seconds, and letting the number of nodes vary from 2 to 10.
With 2PC, each thread executes transactions consisting of 10 (get/put)
operations, each with a 10% probability of being a put operation. With
PB, the same kind of transactions are executed on the primary, while the
backups execute read-only transactions composed of 10 get operations.
This allows comparing the maximum throughput of update transactions
provided by the two schemes, without excessively favoring PB by keeping
the backups totally idle.
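For concreteness, the per-thread workload boils down to something like
the following (a simplified sketch, not the actual Radargun Stresser
code; the TxCache facade is made up for the example):

    import java.util.Random;

    // Simplified sketch of the benchmark workload: each transaction issues 10
    // get/put operations on keys chosen uniformly at random, and each operation
    // is a put with 10% probability (gets only, on PB backups).
    class StresserWorkloadSketch {

        interface TxCache {            // minimal transactional cache facade
            void begin();
            Object get(String key);
            void put(String key, Object value);
            void commit();             // may abort under contention with 2PC
        }

        static void runOneTransaction(TxCache cache, Random rnd,
                                      int numKeys, boolean readOnly) {
            cache.begin();
            for (int i = 0; i < 10; i++) {
                String key = "key-" + rnd.nextInt(numKeys);   // uniform key selection
                if (!readOnly && rnd.nextDouble() < 0.1) {
                    cache.put(key, rnd.nextLong());           // ~10% puts
                } else {
                    cache.get(key);
                }
            }
            cache.commit();
        }
    }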
The total write-transaction throughput exhibited by the cluster (i.e.
not the throughput per node) is shown in the attached plots, referring
to caches containing 1000, 10000 and 100000 keys. As already discussed,
the lower the number of keys, the higher the chance of contention and
the probability of aborts due to conflicts. In particular, with the 2PC
scheme the fraction of failed transactions steadily increases at high
contention, up to 6% (to the best of our understanding, mainly due to
distributed deadlocks). With PB, instead, the number of transactions
failed due to contention is always 0.
Note that, currently, we are assuming that backup nodes do not generate
update transactions. In practice this corresponds to assuming the
presence of some load balancing scheme which directs (all and only)
update transactions to the primary node, and read transactions to the
backups. If this assumption is violated (i.e. a put operation is issued
on a backup node), we simply throw a PassiveReplicationException at the
CacheDelegate level. This is probably suboptimal/undesirable in real
settings, as update transactions could (at least in principle!) be
transparently rerouted to the primary node in an RPC style. Any
suggestion on how to implement such a redirection facility in a
transparent/non-intrusive manner would be highly appreciated of course! ;-)
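Just for reference, the guard we currently apply is conceptually as
simple as this (sketch only; in the patch the check is performed inside
CacheDelegate, and PassiveReplicationException is the new exception
class introduced by the patch):

    // Conceptual sketch of the check on backup nodes: updates issued on a
    // backup are rejected rather than being re-routed to the primary.
    class BackupWriteGuardSketch {

        static class PassiveReplicationException extends RuntimeException {
            PassiveReplicationException(String msg) {
                super(msg);
            }
        }

        private final boolean isPrimary;

        BackupWriteGuardSketch(boolean isPrimary) {
            this.isPrimary = isPrimary;
        }

        void checkWriteAllowed() {
            if (!isPrimary) {
                throw new PassiveReplicationException(
                    "Update transactions must be executed on the primary node");
            }
        }
    }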
To conclude, we are currently working on a statistical model able to
predict the best-suited replication scheme given the current
workload/number of machines, as well as on a mechanism to dynamically
switch from one replication scheme to the other.
We'll keep you posted on our progress!
Regards,
Sebastiano Peluso, Diego Didona
-------------- next part --------------
Attached plots (write-transaction throughput):
- PCvsPR_WRITE_TX_1000keys.png: http://lists.jboss.org/pipermail/infinispan-dev/attachments/20110211/2d382671/attachment-0003.png
- PCvsPR_WRITE_TX_10000keys.png: http://lists.jboss.org/pipermail/infinispan-dev/attachments/20110211/2d382671/attachment-0004.png
- PCvsPR_WRITE_TX_100000keys.png: http://lists.jboss.org/pipermail/infinispan-dev/attachments/20110211/2d382671/attachment-0005.png