[infinispan-dev] [Cloudtm-discussion] [SPAM] Re: Primary-Backup replication scheme in Infinispan
Paolo Romano
romano at inesc-id.pt
Wed Mar 9 13:58:56 EST 2011
On 3/9/11 11:41 AM, Manik Surtani wrote:
> Paolo,
>
>
> On 17 Feb 2011, at 19:51, Paolo Romano wrote:
>
>> First, an important premise, which I think was not clear in their previous message. We are considering here the *full replication mode* of Infinispan, in which every node maintains a copy of every data item. This implies that there is no need to fetch data from remote nodes during transaction execution. Also, Infinispan was configured NOT to use eager locking. In other words, during a transaction's execution Infinispan acquires locks only locally.
> So are you suggesting that this scheme maintains a single, global master node for the entire cluster, for *all* keys? Doesn't this become a bottleneck, and how do you deal with the master node failing?
Hi Manik,
of course the primary (or master) can become a bottleneck if the volume
of update transactions is very large. If the percentage of write
transactions is very high, though, we have to distinguish two cases:
low vs. high contention.
At high contention, in fact, the 2PC-based replication scheme used by
Infinispan (2PC from now on, for the sake of brevity ;-) ) falls prey to
deadlocks and starts thrashing. This is the reason why 2PC's performance
is so poor in the plot attached to Diego and Sebastiano's mail for the
case of 1000 keys. With the primary-backup scheme, concurrency is
regulated locally at the primary, and much more efficiently, so overall
performance is much better.
Clearly 2PC *can* scale much better when contention is low and the
workload is write-intensive, as it can use the horsepower of more nodes
to process writes... but this does depend on the workload.
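(For reference, the setup assumed throughout, synchronous full
replication with eager locking off, corresponds roughly to the following
4.x-style XML snippet. I am quoting the element and attribute names from
memory, so treat them as approximate:)

    <namedCache name="fullyReplicated">
       <!-- every node keeps a full copy; updates are pushed synchronously -->
       <clustering mode="replication">
          <sync/>
       </clustering>
       <!-- eager locking OFF: locks are acquired only locally while the
            transaction executes -->
       <transaction useEagerLocking="false"/>
    </namedCache>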
One of the results we would like to achieve with Cloud-TM is designing
mechanisms that adaptively switch between multiple replication schemes
depending on the scale of the platform and its current workload. This is
just a first step in that direction!
Concerning failures of the master, this is not an issue. In fact, the
primary waits synchronously for the replies of the backups. Thus, if it
fails, it will be excluded from the current view, and as soon as a new
view is delivered we can elect a new primary. The only glitch that may
occur is in a more complicated failure scenario:
1. The primary sends an update, say "u", to the backups {B1, ..., Bn}.
2. The primary crashes while doing so.
3. B1 receives u, while the others B2, ..., Bn do not.
4. B1 runs some read-only transaction that sees u (and money is
   dispensed, missiles are fired and other horrible things happen as a
   result).
5. B1 crashes.
6. A new view is delivered which excludes the former primary and B1.
7. B2 (for instance) is elected the new primary, but "u" is lost.
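To make the window concrete, here is a rough sketch of the primary's
commit path in the current scheme. This is NOT Infinispan code; every
name below is made up purely for illustration:

    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    // Hypothetical sketch of the naive primary-backup commit path.
    interface Backup {
        void send(byte[] update);       // asynchronous point-to-point send
    }

    class Primary {
        void commit(byte[] update, List<Backup> backups, CountDownLatch acks)
                throws InterruptedException {
            applyLocally(update);       // locks were only ever taken here
            for (Backup b : backups) {
                b.send(update);         // a crash inside this loop leaves only
            }                           // a subset of backups holding "u"
            acks.await();               // synchronous: one countDown() per backup
            ackToClient(update);        // only now is the commit reported
        }
        void applyLocally(byte[] u) { /* write to the local store */ }
        void ackToClient(byte[] u)  { /* reply to the caller */ }
    }

Each backup applies the update as soon as it arrives, so a backup that
received it can serve reads observing it before the rest of the group
knows anything about it: exactly the window in the scenario above.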
This can be avoided by having the backups acknowledge to each other the
reception of the updates before committing them (more formally, we
should have the primary disseminate updates via Uniform Reliable
Broadcast [1])... but at the moment we are not doing this, mainly for
the sake of simplicity, and we expect the results to be very similar.
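A minimal sketch of that fix, in the same made-up style (and a
simplification: it handles a single update and ignores view changes and
retransmissions): each backup relays the update and commits it only once
it has heard from the whole group.

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical URB-style delivery at a backup; one instance per update.
    class UrbBackup {
        private final int groupSize;                  // primary + backups
        private final Set<Integer> ackedBy = new HashSet<>();
        private boolean relayed = false;

        UrbBackup(int groupSize) { this.groupSize = groupSize; }

        // Invoked when the update (or a relay of it) arrives from senderId.
        void onReceive(byte[] update, int senderId, int myId) {
            ackedBy.add(senderId);
            if (!relayed) {
                relayed = true;
                relayToAll(update);     // re-broadcast, so the update survives
                ackedBy.add(myId);      // even if the primary crashed mid-send
            }
            if (ackedBy.size() == groupSize) {
                commitLocally(update);  // safe: every member has the update,
            }                           // so no view change can lose it
        }
        void relayToAll(byte[] u)    { /* broadcast to the whole group */ }
        void commitLocally(byte[] u) { /* apply and expose to readers */ }
    }

In the scenario above, B1 would then not have committed "u" before
B2, ..., Bn had received it, so its read-only transaction could never
have observed a value that a later view forgets.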
Cheers,
Paolo
[1] Uniform Reliable Multicast in a Virtually Synchronous Environment,
citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.5421
> Cheers
> Manik
>
> --
> Manik Surtani
> manik at jboss.org
> twitter.com/maniksurtani
>
> Lead, Infinispan
> http://www.infinispan.org
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
--
Paolo Romano, PhD
Senior Researcher
INESC-ID
Rua Alves Redol, 9
1000-059, Lisbon Portugal
Tel. + 351 21 3100300
Fax + 351 21 3145843
Webpage http://www.gsd.inesc-id.pt/~romanop