[infinispan-dev] Use cases for x-datacenter replication
Bela Ban
bban at redhat.com
Tue Feb 14 03:18:31 EST 2012
On 2/13/12 7:07 PM, Erik Salter wrote:
> Hi all,
>
> Since I'm deploying my applications into 3 data centers, I commented on the
> wiki site. As the bulk of the discussion seems to be here, I'll offer
> some additional thoughts.
>
>
> - First and foremost, I have three data centers (JGRP-1313). All of my data
> centers follow a model where the data is partitioned between them with
> geographic failover.
Why do you have 3 data centers?
- Follow-the-sun?
- Are all of them active, each responsible for a subset of the entire
dataset?
- Does A use B as a backup location, B use C, and C use A? Or does A
place backup copies in both B and C?
- Are the 3 DCs connected to each other (mesh), or is A connected to B,
and B to C, so that A and C cannot talk directly (and have to route
traffic via B)?
> - These data centers have about 50-70 ms latency (RTT) between them. That
> means I (and the customer) can live with synchronous replication for the
> time being, especially for the transactional scenario. In fact, in one of
> my cache use cases, this is almost required.
OK. This would favor option #1. In general, if we want anything SYNC
between DCs, RELAY is required, as we need to know (1) the originator of
a request and (2) how to route a response back to it.
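To make (1) and (2) concrete, here's a minimal Java sketch of the routing
information a relayed request would have to carry. The class and field names
are purely illustrative -- this is not the actual JGroups RELAY API:

// Illustrative only, not the real JGroups RELAY classes. The point is that
// a relayed request must carry its origin site and a correlation id, so the
// target site can route the response back to the originator.
public final class RelayedRequest {
    private final String originSite; // DC the request originated in, e.g. "A"
    private final long requestId;    // correlates the response with the request
    private final byte[] payload;    // the marshalled request itself

    public RelayedRequest(String originSite, long requestId, byte[] payload) {
        this.originSite = originSite;
        this.requestId = requestId;
        this.payload = payload;
    }

    public String getOriginSite() { return originSite; }
    public long getRequestId()    { return requestId; }
    public byte[] getPayload()    { return payload; }
}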
> - In the current state transfer model, a node joining means that all the
> nodes from all the sites are involved in rehashing. The rehash operation
> should be enhanced in this case to prefer using the local nodes as state
> providers.
+1. This needs to be paired with a custom ConsistentHash implementation
which knows which keys are local and which ones are remote. Since
TopologyAwareAddresses are used, the CH function can actually make this
decision today. Of course, we need to go remote if we have no local
owners.
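Something along these lines, as a rough sketch: it assumes an address type
that exposes a site id (as Infinispan's TopologyAwareAddress does); the
wrapper itself is hypothetical, not an existing CH implementation:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch. "SiteAddress" stands in for Infinispan's
// TopologyAwareAddress, which exposes the node's site id.
interface SiteAddress {
    String getSiteId();
}

final class SiteAwareOwnerSelection {
    private final String localSite;

    SiteAwareOwnerSelection(String localSite) {
        this.localSite = localSite;
    }

    // Reorders a key's owners so that local-site nodes come first; if there
    // are no local owners, only the remote ones remain and we go remote.
    List<SiteAddress> preferLocalOwners(List<SiteAddress> owners) {
        List<SiteAddress> local = new ArrayList<SiteAddress>();
        List<SiteAddress> remote = new ArrayList<SiteAddress>();
        for (SiteAddress a : owners) {
            if (localSite.equals(a.getSiteId()))
                local.add(a);
            else
                remote.add(a);
        }
        local.addAll(remote);
        return local;
    }
}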
> I know that this cannot necessarily be guaranteed, as a loss of
> multiple nodes at a site might mean the state only exists in another data
> center. But for a minimum of 10M records, I can't afford the existing
> limitations on state transfer. (And it goes without saying that NBST is an
> absolute must.)
+1 again. This should also be paired with manual or batched state
transfer (rebalancing): we could delay state transfer to joining nodes
until (1) a number of JOINs have happened, (2) some time has elapsed, or
(3) a manual state transfer request is triggered (by a sysadmin). For
LEAVEs, state transfer should ensue relatively quickly, as a leave might
otherwise result in too few copies of the data being around, and that may
or may not be critical.
One use case is to bring up a new site (data center): here we should
disable state transfer and only do it (in one fell swoop) when all the
nodes in the new site are up, e.g. by manually triggering it.
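A small sketch of such a policy (all names hypothetical -- this is not an
existing Infinispan API): batch JOINs by count or by time, rebalance
promptly on LEAVEs, and allow a manual trigger, e.g. via JMX:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical batching rebalance trigger, illustrating the policy only.
final class BatchingRebalanceTrigger {
    private final int maxPendingJoins;   // (1) rebalance after this many JOINs
    private final long maxDelayMs;       // (2) ... or after this much time
    private final Runnable rebalance;    // starts the actual state transfer
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private int pendingJoins;

    BatchingRebalanceTrigger(int maxPendingJoins, long maxDelayMs,
                             Runnable rebalance) {
        this.maxPendingJoins = maxPendingJoins;
        this.maxDelayMs = maxDelayMs;
        this.rebalance = rebalance;
    }

    synchronized void onJoin() {
        pendingJoins++;
        if (pendingJoins == 1) {         // first JOIN of a batch starts the clock
            timer.schedule(new Runnable() {
                public void run() { onTimeout(); }
            }, maxDelayMs, TimeUnit.MILLISECONDS);
        }
        if (pendingJoins >= maxPendingJoins)
            runRebalance();
    }

    // LEAVEs rebalance promptly: too few copies of the data may be critical.
    synchronized void onLeave() { runRebalance(); }

    // (3) a sysadmin forces a rebalance, e.g. once the last node of a newly
    // brought-up site has been started.
    synchronized void triggerManually() { runRebalance(); }

    private synchronized void onTimeout() {
        if (pendingJoins > 0)            // only if JOINs are still batched up
            runRebalance();
    }

    private void runRebalance() {
        pendingJoins = 0;
        rebalance.run();
    }
}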
> - I will need a custom consistent hash that is data center-aware that can
> group keys into a local site. All of my data is site-specific.
Yes, exactly, similar to what I mentioned above.
> - I mentioned a +1 model for local key ownership on the wiki. Taking that a
> step further, has there been any thought to a quorum model? Here's what I'm
> concerned about. Most of the time, a data center won't fail -- rather,
> there'll be some intermittent packet loss between two of them. If I have 2
> owners in the key's local data center and 1 each on the backup sites, I may
> not want to rehash if I can't reach a data center. I can understand if a
> prepared tx fails -- those I can retry in application code.
I would suggest that we have a different hashwheel for each DC, so that
the disappearance of one DC doesn't lead to rehashing in another DC.
Say we have DCs A, B and C. A given key K is stored twice in A (its
primary site), once in B and once in C. When B fails, neither A nor C
should have to do any rehashing. The numOwners (in a custom CH impl on
A) would be 2 for A, 1 for B and 1 for C. On B, the same custom CH impl
would use numOwners=2 for B, 1 for A and 1 for C.
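In code, the per-site replica counts from this example could look roughly
like this (again a hypothetical sketch, not an existing CH implementation):

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-site replica policy: 2 owners in the key's primary site,
// 1 in each backup site. Losing a DC then only affects that DC's hashwheel.
final class PerSiteOwners {

    // Returns a map of site -> numOwners for a key with the given primary site.
    static Map<String, Integer> numOwners(String primarySite, String... allSites) {
        Map<String, Integer> owners = new LinkedHashMap<String, Integer>();
        for (String site : allSites)
            owners.put(site, site.equals(primarySite) ? 2 : 1);
        return owners;
    }

    public static void main(String[] args) {
        // Key K's primary site is A: {A=2, B=1, C=1}, as described above.
        System.out.println(numOwners("A", "A", "B", "C"));
        // On B, for keys whose primary site is B: {A=1, B=2, C=1}.
        System.out.println(numOwners("B", "A", "B", "C"));
    }
}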
--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat