[infinispan-dev] Use cases for x-datacenter replication

Bela Ban bban at redhat.com
Tue Feb 14 03:18:31 EST 2012



On 2/13/12 7:07 PM, Erik Salter wrote:
> Hi all,
>
> Since I'm deploying my applications into 3 data centers, I commented at the
> wiki site.  Since the bulk of the discussion seems to be here, I'll offer
> some additional thoughts.
>
>
> - First and foremost, I have three data centers (JGRP-1313).  All of my data
> centers follow a model where the data is partitioned between them with
> geographic failover.


Why do you have 3 data centers ?
- Follow-the-sun ?
- Are all of them active, responsible for a subset of the entire dataset ?
- Does A use B as backup location, B use C and C use A ? Or does A place 
backup copies in both B and C ?
- Are the 3 DCs connected to each other (mesh), or is A connected to B, 
and B to C, so A and C cannot talk directly (and have to route traffic 
via B) ?


> - These data centers have about 50-70 ms latency (RTT) between them.  That
> means I (and the customer) can live with synchronous replication for the
> time being, especially for the transactional scenario.  In fact, in one of
> my cache use cases, this is almost required.


OK. This would favor option #1. In general, if we want SYNC-anything 
between DCs, RELAY is required as we need to know (1) the originator of 
a request and (2) how to route a response back to it.


> - In the current state transfer model, a node joining means that all the
> nodes from all the sites are involved in rehashing.  The rehash operation
> should be enhanced in this case to prefer using the local nodes as state
> providers.


+1. This needs to be paired with a custom ConsistentHash implementation, 
which knows which keys are local and which ones remote. Since 
TopologyAwareAddresses are used, the CH function can actually make this 
decision, today. Of course, we need to go remote if we have no local owners.


>   I know that this cannot be necessarily guaranteed, as a loss of
> multiple nodes at a site might mean the state only exists on another data
> center.  But for a minimum of 10M records, I can't afford the existing
> limitations on state transfer.  (And it goes without saying the NBST is an
> absolute must)


+1 again. This should also be paired with manual or batching state 
transfer (rebalancing): we could delay state transfer to joining nodes 
until (1) a number of JOINs have happened, (2) some time has elapsed or 
(3) a manual state transfer request is triggered (by a sysadmin). For 
LEAVEs, state transfer should ensure relatively quickly, as this might 
lead to too few copies of data being around, and that may or may not be 
critical.
One use case is to bring up a new site (data center): here we should 
disable state transfer and only do it (in one fell swoop) when all the 
nodes in the new site are up, e.g. by manually triggering it.


> - I will need a custom consistent hash that is data center-aware that can
> group keys into a local site.  All of my data is site-specific.


Yes, exactly, similar to what I mentioned above.


> - I mentioned a +1 model for local key ownership on the wiki,  Taking that a
> step further, has there been any thought to a quorum model?  Here's what I'm
> concerned about.  Most of the time, a data center won't fail -- rather
> there'll be some intermittent packet loss between the two.  If I have 2
> owners on the key's local data center and 1 each on the backup sites, I may
> not want to rehash if I can't reach a data center.  I can understand if a
> prepared tx fails -- those I can retry in application code.


I would suggest that we have a different hashwheel for each DC, so that 
the disappearing of one DC doesn't lead to rehashing in another DC.

Say we have DCs A, B and C. A given key K is stored twice in A (primary 
owner), once in B and once in C. When B fails, neither A nor C should 
have to do any rehashing. The numOwners (in a custom CH impl on A) would 
be 2 for A, 1 for B and 1 for C. The same custom CH impl could have 
numOwners=2 for B, 1 for A and 1 for C on B.



-- 
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat


More information about the infinispan-dev mailing list