#1 Virtual cluster
Rehashing:
I believe we'll have to come up with a custom consistent hash that knows
about the 2 sites, and places all of the data owners into the local site.
E.g. it would be bad if we create a key K in LON and make the primary
owner T (in SFO) and the backup owner B in LON!
I think this approach works very well if you use SFO as a hot standby for LON: the
purpose of SFO being to pick up the load in case LON goes down for whatever reason.
In a follow-the-sun approach, in which LON is up for 12h and then SFO takes over for
the next 12h, this won't work.
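To make the local-placement idea concrete, here is a rough sketch of such a site-aware
hash (the names SiteAwareHash and Node are made up for illustration; this is not the
actual Infinispan ConsistentHash interface): all owners of a key are chosen from the
site the key is written in, so a key created in LON never gets a primary or backup
owner in SFO.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: owners of a key are always picked from the local
// site, so a key created in LON never ends up with its primary owner in SFO.
public class SiteAwareHash {

    public record Node(String name, String site) {}

    private final List<Node> members;

    public SiteAwareHash(List<Node> members) {
        this.members = members;
    }

    // First node returned is the primary owner, the rest are backups.
    public List<Node> locateOwners(Object key, String localSite, int numOwners) {
        List<Node> local = new ArrayList<>();
        for (Node n : members)
            if (n.site().equals(localSite))
                local.add(n);
        List<Node> owners = new ArrayList<>();
        if (local.isEmpty())
            return owners;
        int start = Math.floorMod(key.hashCode(), local.size());
        for (int i = 0; i < Math.min(numOwners, local.size()); i++)
            owners.add(local.get((start + i) % local.size()));
        return owners;
    }

    public static void main(String[] args) {
        SiteAwareHash ch = new SiteAwareHash(List.of(
                new Node("A", "LON"), new Node("B", "LON"),
                new Node("X", "SFO"), new Node("Y", "SFO")));
        System.out.println(ch.locateOwners("K", "LON", 2)); // both owners in LON
    }
}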
This should also minimize rehashing across sites, and prefer rehashing
within sites.
Not sure about that: AFAIK the way rehashing is implemented, it is always the primary
owner that generates the state. So if the joiner is in SFO then the burden of the join
would be much higher.
In terms of data access, the idea is that all writes would only go
through 1 site, e.g. LON being active and SFO being a backup (in which
reads could happen); my design didn't assume concurrent writes in
multiple sites (at least not to the same data).
Yes, indeed this would work for the master-slave (or hot-standby) scenario. IMO we
should also consider the master-master (follow-the-sun) approach.
Locking:
Same as above
Configuration:
numOwners would have to be extended, e.g. we could introduce a property
numPrimarySiteOwners=2 and numBackupSiteOwners=1, TBD
Yes, indeed these attributes would be specific to the new CH function you added.
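For example, the new CH function could consume those two proposed attributes roughly
like this (a minimal sketch; the attribute names are the ones proposed above and the
class is hypothetical, not existing configuration code):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of how numPrimarySiteOwners/numBackupSiteOwners could drive owner
// selection: pick that many nodes from each site, primary site first.
public class TwoSiteOwnerPicker {

    public static List<String> owners(Object key,
                                      Map<String, List<String>> nodesBySite,
                                      String primarySite, int numPrimarySiteOwners,
                                      String backupSite, int numBackupSiteOwners) {
        List<String> owners = new ArrayList<>();
        owners.addAll(pick(key, nodesBySite.get(primarySite), numPrimarySiteOwners));
        owners.addAll(pick(key, nodesBySite.get(backupSite), numBackupSiteOwners));
        return owners; // first entry = primary owner, always in the primary site
    }

    private static List<String> pick(Object key, List<String> nodes, int count) {
        List<String> picked = new ArrayList<>();
        int start = Math.floorMod(key.hashCode(), nodes.size());
        for (int i = 0; i < Math.min(count, nodes.size()); i++)
            picked.add(nodes.get((start + i) % nodes.size()));
        return picked;
    }

    public static void main(String[] args) {
        // prints 2 owners in LON followed by 1 backup owner in SFO
        System.out.println(owners("K",
                Map.of("LON", List.of("A", "B", "C"), "SFO", List.of("X", "Y")),
                "LON", 2, "SFO", 1));
    }
}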
#2 Hot Rod based
Having to connect to potentially all nodes is bad, as customers possibly
only want to open 1 port for cross-datacenter traffic.
+1.
I assume the RemoteCacheStore would have to be enabled in all nodes in a
given site for this to work?
Yes. That's a con of this approach.
How would you handle requests sent during the crash of a HotRod endpoint?
Would they get queued, similar to https://issues.jboss.org/browse/JGRP-1401 ?
That would indeed need to be re-implemented. It's the first in the list of cons for
this approach.
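Roughly what would need to be re-implemented, in the spirit of JGRP-1401 (purely a
hypothetical sketch, not existing Infinispan/JGroups code): buffer the forwarded
writes while the remote Hot Rod endpoint is down and replay them when it comes back.

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: while the remote Hot Rod endpoint is unreachable,
// forwarded writes are queued; once the endpoint is back they are replayed
// in order.
public class ForwardingQueue {

    private final Deque<Runnable> pending = new ArrayDeque<>();
    private boolean remoteUp = true;

    public synchronized void forward(Runnable writeToRemoteSite) {
        if (remoteUp) {
            try {
                writeToRemoteSite.run();
                return;
            } catch (RuntimeException connectionFailed) {
                remoteUp = false; // endpoint crashed, start queueing
            }
        }
        pending.add(writeToRemoteSite);
    }

    // Called when the remote endpoint becomes reachable again.
    public synchronized void remoteSiteBackUp() {
        remoteUp = true;
        while (!pending.isEmpty())
            pending.poll().run(); // replay queued writes in order
    }
}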
How would initial state transfer be done between sites? E.g. LON has
been up for 2 days and now we start SFO. Does that mean we will
effectively have to transfer *all* of the data in LON --> SFO?
Transferring all the data is required, as the data must be mirrored between the sites.
Of course we'll only transfer every entry once, and not numOwners times.
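The "every entry once" part could simply mean that only the primary owner of a key
ships it to the other site, e.g. (illustrative sketch only; the Topology interface
and method names are made up):

import java.util.Map;
import java.util.function.BiConsumer;

// Sketch: during the initial LON --> SFO transfer, each node walks its local
// data but only ships the keys it is primary owner of, so every entry crosses
// the WAN exactly once instead of numOwners times.
public class InitialStateTransfer {

    interface Topology {
        boolean isPrimaryOwner(Object key, String localNode);
    }

    public static void pushToRemoteSite(String localNode,
                                        Map<Object, Object> localEntries,
                                        Topology topology,
                                        BiConsumer<Object, Object> remoteSitePut) {
        for (Map.Entry<Object, Object> e : localEntries.entrySet())
            if (topology.isPrimaryOwner(e.getKey(), localNode))
                remoteSitePut.accept(e.getKey(), e.getValue());
    }
}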
#3 Custom bridge
I like the fact that both sites are configured with possibly different
numOwners, e.g. LON=2 and SFO=1
This will not work if you need to invoke blocking RPCs between sites:
the copying of traffic to A is always assumed to be asynchronous.
Why is that? Async replication is intended as the default, but sync replication should
be possible as well; the bridge would just make the invocation in a synchronous way.
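In other words, the bridge could expose both modes, along these lines (hypothetical
sketch, names are made up):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: the same forwarding call can be issued asynchronously (the default,
// the local write doesn't wait) or synchronously (the caller blocks until the
// other site has applied the change).
public class SiteBridge {

    private final ExecutorService relayPool = Executors.newSingleThreadExecutor();

    public void replicateAsync(Runnable applyOnRemoteSite) {
        relayPool.submit(applyOnRemoteSite); // fire and forget
    }

    public void replicateSync(Runnable applyOnRemoteSite) throws Exception {
        Future<?> ack = relayPool.submit(applyOnRemoteSite);
        ack.get(); // block the originating write until the remote site confirms
    }
}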
Plus, the recipient in SFO wouldn't know the original sender; in the
example, the sender would always be X.
Not sure why this is bad from a functional perspective, as long as the sites end up
being in sync.
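If the original sender ever does matter (e.g. for replies or auditing), the bridge
could simply carry it inside the forwarded command, roughly like this (illustrative
only, not an existing class):

import java.io.Serializable;

// Sketch: the relayed message carries the original sender as metadata, so even
// though the JGroups-level sender seen in SFO is always the bridge node X, the
// receiving side can still tell which LON node originated the write.
public class RelayedCommand implements Serializable {

    private final String originalSender; // e.g. "LON-node-B"
    private final Object command;        // the actual replication command

    public RelayedCommand(String originalSender, Object command) {
        this.originalSender = originalSender;
        this.command = command;
    }

    public String originalSender() { return originalSender; }
    public Object command()        { return command; }
}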
How do you handle the case where the relay (A or X) crashes and messages
are sent during that time, before a new relay is elected?
A subset of RELAY can be used for this, together with JGRP-1401.
How do you do state transfer, e.g. bootstrap SFO from LON? I guess you
transfer the entire state from LON --> SFO, right?
Indeed, state transfer seems to be the biggest challenge with this approach. I have
some ideas, but I'd also like to get Dan's input on this.
SUMMARY:
I would definitely *not* do #2.
+1
I do like #3 and #1; perhaps we need to focus a bit on #3, to see if
there are other deficiencies we haven't seen yet, as I'm already
familiar with #1.
Thanks Bela!