[infinispan-dev] [Cloudtm-discussion] Re: Primary-Backup replication scheme in Infinispan

Thu Feb 17 19:59:27 EST 2011

On 2/17/11 9:02 PM, Mark Little wrote:
> Hi Paolo,
>
> Sent from my iPhone
>
> On 17 Feb 2011, at 19:51, Paolo Romano <romano at inesc-id.pt 
> <mailto:romano at inesc-id.pt>> wrote:
>
>> Hi Mark,
>>
>> let me try to clarify the scope and goals of the work Sebastiano and 
>> Diego are doing, as I think there may have been some misunderstanding 
>> here.
>>
>> First an important premise, which I think was not clear in their 
>> previous message. We are here considering the *full replication mode* 
>> of Infinispan, in which every node maintains all copies of the same 
>> data items. This implies that there is no need to fetch data from 
>> remote nodes during transaction execution. Also, Infinispan was 
>> configured NOT TO use eager locking. In other words, during 
>> transaction's execution Infinispan acquires locks only locally.
>
> Ok that was certainly not clear.
>
>>
>> What it is being compared is a single-master replication protocol 
>> (the classic primary-backup scheme) vs the 2PC based multi-master 
>> replication protocol (which is the one currently implemented by 
>> Infinispan).
>
> Since two phase commit is associated a lot with transaction systems ;) 
> and probably more so than replication, I think we probably need to be 
> more careful with the implicit assumptions that are carried with any 
> discussions in future :) It should help avoid confusion.

I see you point. But I guess that the one of main reasons why several 
details of the Infinispan's 2PC based replication algorithm were omitted 
in the mail was not make it so long to discourage reading it ;-)

>
>>
>> The two protocols behave identically during transactions' execution. 
>> What differs between the two protocols is the handling of the commit 
>> phase.
>>
>> In the single master mode, as no remote conflict can occur and local 
>> conflicts are managed by local concurrency control, data updates can 
>> be simply pushed to the backups. More precisely (this should answer 
>> also a question of Mircea), the primary DURING the commit phase, in a 
>> synchronous fashion, propagates the updates to the backups, waits for 
>> their ack, and only then returns from the commit request to the 
>> application.
>>
>> In the multi-master mode, 2PC is used to materialize conflicts among 
>> transactions executed at different nodes, and to ensure agreement on 
>> the transaction's outcome. Thus 2PC serves  to enforce both atomicity 
>> and isolation.
>
> Not quite :)
>
> 2PC is a consensus protocol, pure and simple. I think what you are 
> meaning is that some of the participants do additional work during the 
> two phases, such as happens in a TM where locks may be propagated, 
> promoted or released. But the reason I mention this is because we as 
> an industry (and academia) have confused the world by not being clear 
> on what 2PC brings. You'd be surprised by the number of people who 
> believe 2PC is responsible for all of the ACID semantics. I think we 
> need to be very clear on such distinctions.

... true, it's the lock acquisition taking place during the voting phase 
of 2PC that ensures isolation, not 2PC per se. I called this approach 
2PC based multi-master replication protocol when I Introduced it in my 
mail... 2PC here was just a shorthand.

>
>>
>> The plots attached in the mail from Diego and Sebastiano are 
>> comparing the performance of two replication strategies that provide 
>> the same degree of consistency/failure resiliency...at least assuming 
>> perfect failure detection,
>
> And we both know there is no such thing, unless you want to get into 
> the world of quantum mechanics :-)

....still there are many real systems that are actually designed 
assuming perfect failure detection... possibly relying on some dedicated 
hardware mechanism (which encapsulate the perfect failure detection 
assumption) to enforce accuracy by forcing suspected nodes to shut down 
falsely suspected nodes, e.g. http://en.wikipedia.org/wiki/STONITH.

>
>> I'd prefer to avoid delving into a (actually interesting) discussion 
>> of non-blocking guarantees of the two protocols in partially or 
>> eventually synchronous environments in this mailing list.
>
> I'd be interested in continuing off list then.

well, trying to make a very long story (probably too) short one may say 
that:

- in an asynchronous system (even if augmented with eventually perfect 
failure detection), classic 2PC blocks upon coordinator crashes. This, 
pragmatically, forces to heuristic decisions that may lead to atomicity 
violations. Using more expensive commit protocols, such as Paxos Commit 
(http://research.microsoft.com/apps/pubs/default.aspx?id=64636), one can 
enforce atomicity also in eventually synchronous systems (tolerating f 
faults out of 2f+1 replicas).

- With primary backup (PB), in an asynchronous system the issue is the 
split-brain syndrome (this wiki definition is not the best ever but it's 
the only I could rapidly find: 
http://en.wikipedia.org/wiki/Split-brain_(Computing) ). Unsurprisingly, 
also for PB, it is also possible to design (more expensive) variants 
working in partially synchronous systems. An example is Vertical Paxos 
(research.microsoft.com/pubs/80907/podc09v6.pdf). Or assuming virtual 
synchrony (http://en.wikipedia.org/wiki/Virtual_synchrony),  by 
propagating the primary's updates via uniform reliable broadcast (also 
called safe delivery in virtual synchrony's gergon, 
http://www.cs.huji.ac.il/labs/transis/lab-projects/guide/chap3.html#safe). 
Or, mapping the solution to classic Paxos, by having the role of the 
primary coincide with that of the leader in the Multi-paxos protocol 
(http://en.wikipedia.org/wiki/Paxos_algorithm#Multi-Paxos).

Note that, in an asynchronous system, the PB implementation shown in the 
plots might violate consistency in case of false failure suspicions of 
the primary, just like the current Infinispan's 2PC based replication 
scheme might do in case of false failure suspicions of the coordinator. 
So, in terms of required synchrony assumptions, the 2 considered 
protocols are fairly comparable.

>
>> Thus I disagree when you say that they are not comparable.
>
> Well hopefully you now see why I still disagree. If the term 2PC is 
> being used here to mean the atomicity/consensus protocol only, then 
> they are not comparable. If it's a shorthand for more than that then 
> we should use a different notation.

Considering that 2PC was a shorthand for "2PC based multi-master 
replication scheme" ... I hope that we agree now that the performance of 
the two protocols are comparable ;-)

>
>>
>> IMO, the only arguable unfairness in the comparison is that the 
>> multi-master scheme allows processing update transactions at every 
>> node, whereas the single-master scheme should  rely either on 1) a 
>> load balancing scheme to ensure redirecting all the update 
>> transactions to the master, or 2) rely on a redirection scheme to 
>> redirect towards the master any update transactions "originated" by 
>> the backups. The testbed considered in the current evaluation is 
>> "emulating" option  1, which does favour the PB scheme. On the other 
>> hand, we are currently studying a non-intrusive way to implement a 
>> transparent redirection scheme (option 2) in Infinispan, so 
>> suggestions are very welcome! ;-)
>>
>> I hope this post clarified your perplexities!! :-)
>
> Yes and no :)
>
>>
>> Cheers,
>>
>>     Paolo
>>
>>
>>
>> On 2/17/11 3:53 PM, Mark Little wrote:
>>> Hi
>>>
>>> On 11 Feb 2011, at 17:56, Sebastiano Peluso wrote:
>>>
>>>> Hi all,
>>>>
>>>> first let us introduce ourselves given that it's the first time we 
>>>> write in this mailing list.
>>>>
>>>> Our names are Sebastiano Peluso and Diego Didona, and we are 
>>>> working at INESC-ID Lisbon in the context of the Cloud-TM project. 
>>>> Our work is framed in the context of self-adaptive replication 
>>>> mechanisms (please refer to previous message by Paolo on this 
>>>> mailing list and to the www.cloudtm.eu <http://www.cloudtm.eu> 
>>>> website for additional details on this project).
>>>>
>>>> Up to date we have been working on developing a relatively simple 
>>>> primary-backup (PB) replication mechanism, which we integrated 
>>>> within Infinispan 4.2 and 5.0. In this kind of scheme only one node 
>>>> (called the primary) is allowed to process update transactions, 
>>>> whereas the remaining nodes only process read-only transactions. 
>>>> This allows coping very efficiently with write transactions, as the 
>>>> primary does not have to incur in the cost of remote coordination 
>>>> schemes nor in distributed deadlocks, which can hamper performance 
>>>> at high contention with two phase commit schemes. With PB, in fact, 
>>>> the primary can serialize transactions locally, and simply 
>>>> propagate the updates to the backups for fault-tolerance as well as 
>>>> to allow them to process read-only transactions on fresh data of 
>>>> course. On the other hand, the primary is clearly prone to become 
>>>> the bottleneck especially in large clusters and write intensive 
>>>> workloads.
>>>
>>> So when you say "serialize transactions locally" does this mean you 
>>> don't do distributed locking? Or is "locally" not referring simply 
>>> to the primary?
>>>
>>>>
>>>> Thus, this scheme does not really represent a replacement for the 
>>>> default 2PC protocol, but rather an alternative approach that 
>>>> results particularly attractive (as we will illustrate in the 
>>>> following with some radargun-based performance results) in small 
>>>> scale clusters, or, in "elastic" cloud scenarios, in periods where 
>>>> the workload is either read dominated or not very intense.
>>>
>>> I'm not sure what you mean by a "replacement" for 2PC? 2PC is for 
>>> atomicity (consensus). It's the A in ACID. It's not the C, I or D.
>>>
>>>> Being this scheme particularly efficient in these scenarios, in 
>>>> fact, its adoption would allow to minimize the number of resources 
>>>> acquired from the cloud provider in these periods, with direct 
>>>> benefits in terms of cost reductions. In Cloud-TM, in fact, we aim 
>>>> at designing autonomic mechanisms that would dynamically switch 
>>>> among multiple replication mechanisms depending on the current 
>>>> workload characteristics.
>>>
>>> Replication and transactions aren't mutually inconsistent things. 
>>> They can be merged quite well :-)
>>>
>>> http://www.cs.ncl.ac.uk/publications/articles/papers/845.pdf
>>> http://www.cs.ncl.ac.uk/publications/inproceedings/papers/687.pdf
>>> http://www.cs.ncl.ac.uk/publications/inproceedings/papers/586.pdf
>>> http://www.cs.ncl.ac.uk/publications/inproceedings/papers/596.pdf
>>> http://www.cs.ncl.ac.uk/publications/trnn/papers/117.pdf
>>> http://www.cs.ncl.ac.uk/publications/inproceedings/papers/621.pdf
>>>
>>> The reason I mention these is because I don't quite understand what 
>>> your goal is/was with regards to transactions and replication.
>>>
>>>>
>>>> Before discussing the results of our preliminary benchmarking 
>>>> study, we would like to briefly overview how we integrated this 
>>>> replication mechanism within Infinispan. Any comment/feedback is 
>>>> clearly highly appreciated. First of all we have defined a new 
>>>> command, namely PassiveReplicationCommand, that is a subclass of 
>>>> PrepareCommand. We had to define a new command because we had to 
>>>> design customized "visiting" methods for the interceptors. Note 
>>>> that our protocol only affects the commit phase of a transaction, 
>>>> specifically, during the prepare phase, in the prepare method of 
>>>> the TransactionXaAdapter class, if the Primary Backup mode is 
>>>> enabled, then a PassiveReplicationCommand is built by the 
>>>> CommandsFactory and it is passed to the invoke method of the 
>>>> invoker. The PassiveReplicationCommand is then visited by the all 
>>>> the interceptors in the chain, by means of the 
>>>> visitPassiveReplicationCommand methods. We describe more in detail 
>>>> the operations performed by the non-trivial interceptors:
>>>>
>>>> -TxInterceptor: like in the 2PC protocol, if the context is not 
>>>> originated locally, then for each modification stored in the 
>>>> PassiveReplicationCommand the chain of interceptors is invoked.
>>>> -LockingInterceptor: first the next interceptor is called, then the 
>>>> cleanupLocks is performed with the second parameter set to true 
>>>> (commit the operations). This operation is always safe: on the 
>>>> primary it is called only after that the acks from all the slaves 
>>>> are received  (see the ReplicationInterceptor below); on the slave 
>>>> there are no concurrent conflicting writes since these are already 
>>>> locally serialized by the locking scheme performed at the primary.
>>>> -ReplicationInterceptor: first it invokes the next iterceptor; then 
>>>> if the predicate shouldInvokeRemoteTxCommand(ctx) is true, then the 
>>>> method rpcManager.broadcastRpcCommand(command, true, false) is 
>>>> performed, that replicates the modifications in a synchronous mode 
>>>> (waiting for an explicit ack from all the backups).
>>>>
>>>> As for the commit phase:
>>>> -Given that in the Primary Backup the prepare phase works as a 
>>>> commit too, the commit method on a TransactionXaAdapter object in 
>>>> this case simply returns.
>>>>
>>>> On the resulting extended Infinispan version a subset of 
>>>> unit/functional tests were executed and successfully passed:
>>>> - commands.CommandIdUniquenessTest
>>>> - replication.SyncReplicatedAPITest
>>>> - replication.SyncReplImplicitLockingTest
>>>> - replication.SyncReplLockingTest
>>>> - replication.SyncReplTest
>>>> - replication.SyncCacheListenerTest
>>>> - replication.ReplicationExceptionTest
>>>>
>>>>
>>>> We have tested this solution using a customized version of 
>>>> Radargun. Our customizations were first of all aimed at having each 
>>>> thread accessing data within transactions, instead of executing 
>>>> single put/get operations. In addition, now every Stresser thread 
>>>> accesses with uniform probability all of the keys stored by 
>>>> Infinispan, thus generating conflicts with a probability 
>>>> proportional to the number of concurrently active threads and 
>>>> inversely proportional to the total number of keys maintained.
>>>>
>>>> As already hinted, our results highlight that, depending on the 
>>>> current workload/number of nodes in the system, it is possible to 
>>>> identify scenarios where the PB scheme significantly outperforms 
>>>> the current 2PC scheme, and vice versa.
>>>
>>> I'm obviously missing something here, because it still seems like 
>>> you are attempting to compare incomparable things. It's not like 
>>> comparing apples and oranges (at least they are both fruit), but 
>>> more like apples and broccoli ;-)
>>>
>>>> Our experiments were performed in a cluster of homogeneous 8-core 
>>>> (Xeon at 2.16GhZ) nodes interconnected via a Gigabit Ethernet and 
>>>> running a Linux Kernel 64 bit version 2.6.32-21-server. The results 
>>>> were obtained by running for 30 seconds 8 parallel Stresser threads 
>>>> per nodes, and letting the number of node vary from 2 to 10. In 
>>>> 2PC, each thread executes transactions which consist of 10 
>>>> (get/put) operations, with a 10% of probability of generating a put 
>>>> operation. With PB, the same kind of transactions are executed on 
>>>> the primary, but the backups execute read-only transactions 
>>>> composed of 10 get operations. This allows to compare the maximum 
>>>> throughput of update transactions provided by the two compared 
>>>> schemes, without excessively favoring PB by keeping the backups 
>>>> totally idle.
>>>>
>>>> The total write transactions' throughput exhibited by the cluster 
>>>> (i.e. not the throughput per node) is shown in the attached plots, 
>>>> relevant to caches containing 1000, 10000 and 100000 keys. As 
>>>> already discussed, the lower the number of keys, the higher the 
>>>> chance of contention and the probability of aborts due to 
>>>> conflicts. In particular with the 2PC scheme the number of failed 
>>>> transactions steadily increases at high contention up to 6% (to the 
>>>> best of our understanding in particular due to distributed 
>>>> deadlocks). With PB, instead the number of failed txs due to 
>>>> contention is always 0.
>>>
>>> Maybe you don't mean 2PC, but pessimistic locking? I'm still not too 
>>> sure. But I am pretty sure it's not 2PC you mean.
>>>
>>>>
>>>> Note that, currently, we are assuming that backup nodes do not 
>>>> generate update transactions. In practice this corresponds to 
>>>> assuming the presence of some load balancing scheme which directs 
>>>> (all and only) update transactions to the primary node, and read 
>>>> transactions to the backup.
>>>
>>> Can I say it again? Serialisability?
>>>
>>>> In the negative case (a put operation is generated on a backup 
>>>> node), we simply throw a PassiveReplicationException at the 
>>>> CacheDelegate level. This is probably suboptimal/undesirable in 
>>>> real settings, as update transactions may be transparently rerouted 
>>>> (at least in principle!) to the primary node in a RPC-style. Any 
>>>> suggestion on how to implement such a redirection facility in a 
>>>> transparent/non-intrusive manner would be highly appreciated of 
>>>> course! ;-)
>>>>
>>>> To conclude, we are currently working on a statistical model that 
>>>> is able to predict the best suited replication scheme given the 
>>>> current workload/number of machines, as well as on a mechanism to 
>>>> dynamically switch from one replication scheme to the other.
>>>>
>>>>
>>>> We'll keep you posted on our progresses!
>>>>
>>>> Regards,
>>>>   Sebastiano Peluso, Diego Didona
>>>> <PCvsPR_WRITE_TX_1000keys.png><PCvsPR_WRITE_TX_10000keys.png><PCvsPR_WRITE_TX_100000keys.png>------------------------------------------------------------------------------
>>>> The ultimate all-in-one performance toolkit: Intel(R) Parallel 
>>>> Studio XE:
>>>> Pinpoint memory and threading errors before they happen.
>>>> Find and fix more than 250 security defects in the development cycle.
>>>> Locate bottlenecks in serial and parallel code that limit performance.
>>>> http://p.sf.net/sfu/intel-dev2devfeb_______________________________________________
>>>> Cloudtm-discussion mailing list
>>>> Cloudtm-discussion at lists.sourceforge.net 
>>>> <mailto:Cloudtm-discussion at lists.sourceforge.net>
>>>> https://lists.sourceforge.net/lists/listinfo/cloudtm-discussion
>>>
>>> ---
>>> Mark Little
>>> mlittle at redhat.com <mailto:mlittle at redhat.com>
>>>
>>> JBoss, by Red Hat
>>> Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod 
>>> Street, Windsor, Berkshire, SI4 1TE, United Kingdom.
>>> Registered in UK and Wales under Company Registration No. 3798903 
>>> Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt 
>>> Parsons (USA) and Brendan Lane (Ireland).
>>>
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
>>> Pinpoint memory and threading errors before they happen.
>>> Find and fix more than 250 security defects in the development cycle.
>>> Locate bottlenecks in serial and parallel code that limit performance.
>>> http://p.sf.net/sfu/intel-dev2devfeb
>>>
>>>
>>> _______________________________________________
>>> Cloudtm-discussion mailing list
>>> Cloudtm-discussion at lists.sourceforge.net  <mailto:Cloudtm-discussion at lists.sourceforge.net>
>>> https://lists.sourceforge.net/lists/listinfo/cloudtm-discussion
>>
>>
>> ------------------------------------------------------------------------------
>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
>> Pinpoint memory and threading errors before they happen.
>> Find and fix more than 250 security defects in the development cycle.
>> Locate bottlenecks in serial and parallel code that limit performance.
>> http://p.sf.net/sfu/intel-dev2devfeb
>> _______________________________________________
>> Cloudtm-discussion mailing list
>> Cloudtm-discussion at lists.sourceforge.net 
>> <mailto:Cloudtm-discussion at lists.sourceforge.net>
>> https://lists.sourceforge.net/lists/listinfo/cloudtm-discussion
>
>
> ------------------------------------------------------------------------------
> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
> Pinpoint memory and threading errors before they happen.
> Find and fix more than 250 security defects in the development cycle.
> Locate bottlenecks in serial and parallel code that limit performance.
> http://p.sf.net/sfu/intel-dev2devfeb
>
>
> _______________________________________________
> Cloudtm-discussion mailing list
> Cloudtm-discussion at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/cloudtm-discussion

-- 

Paolo Romano, PhD
Senior Researcher
INESC-ID
Rua Alves Redol, 9
1000-059, Lisbon Portugal
Tel. + 351 21 3100300
Fax  + 351 21 3145843
Webpage http://www.gsd.inesc-id.pt/~romanop

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.jboss.org/pipermail/infinispan-dev/attachments/20110218/746317ff/attachment-0001.html