On 2/13/12 7:07 PM, Erik Salter wrote:
Hi all,
Since I'm deploying my applications into 3 data centers, I commented at the
wiki site. Since the bulk of the discussion seems to be here, I'll offer
some additional thoughts.
- First and foremost, I have three data centers (JGRP-1313). All of my data
centers follow a model where the data is partitioned between them with
geographic failover.
Why do you have 3 data centers ?
- Follow-the-sun ?
- Are all of them active, responsible for a subset of the entire dataset ?
- Does A use B as backup location, B use C and C use A ? Or does A place
backup copies in both B and C ?
- Are the 3 DCs connected to each other (mesh), or is A connected to B,
and B to C, so A and C cannot talk directly (and have to route traffic
via B) ?
- These data centers have about 50-70 ms latency (RTT) between them.
That
means I (and the customer) can live with synchronous replication for the
time being, especially for the transactional scenario. In fact, in one of
my cache use cases, this is almost required.
OK. This would favor option #1. In general, if we want SYNC-anything
between DCs, RELAY is required as we need to know (1) the originator of
a request and (2) how to route a response back to it.
- In the current state transfer model, a node joining means that all
the
nodes from all the sites are involved in rehashing. The rehash operation
should be enhanced in this case to prefer using the local nodes as state
providers.
+1. This needs to be paired with a custom ConsistentHash implementation,
which knows which keys are local and which ones remote. Since
TopologyAwareAddresses are used, the CH function can actually make this
decision, today. Of course, we need to go remote if we have no local owners.
I know that this cannot be necessarily guaranteed, as a loss of
multiple nodes at a site might mean the state only exists on another data
center. But for a minimum of 10M records, I can't afford the existing
limitations on state transfer. (And it goes without saying the NBST is an
absolute must)
+1 again. This should also be paired with manual or batching state
transfer (rebalancing): we could delay state transfer to joining nodes
until (1) a number of JOINs have happened, (2) some time has elapsed or
(3) a manual state transfer request is triggered (by a sysadmin). For
LEAVEs, state transfer should ensure relatively quickly, as this might
lead to too few copies of data being around, and that may or may not be
critical.
One use case is to bring up a new site (data center): here we should
disable state transfer and only do it (in one fell swoop) when all the
nodes in the new site are up, e.g. by manually triggering it.
- I will need a custom consistent hash that is data center-aware that
can
group keys into a local site. All of my data is site-specific.
Yes, exactly, similar to what I mentioned above.
- I mentioned a +1 model for local key ownership on the wiki, Taking
that a
step further, has there been any thought to a quorum model? Here's what I'm
concerned about. Most of the time, a data center won't fail -- rather
there'll be some intermittent packet loss between the two. If I have 2
owners on the key's local data center and 1 each on the backup sites, I may
not want to rehash if I can't reach a data center. I can understand if a
prepared tx fails -- those I can retry in application code.
I would suggest that we have a different hashwheel for each DC, so that
the disappearing of one DC doesn't lead to rehashing in another DC.
Say we have DCs A, B and C. A given key K is stored twice in A (primary
owner), once in B and once in C. When B fails, neither A nor C should
have to do any rehashing. The numOwners (in a custom CH impl on A) would
be 2 for A, 1 for B and 1 for C. The same custom CH impl could have
numOwners=2 for B, 1 for A and 1 for C on B.
--
Bela Ban
Lead JGroups (
http://www.jgroups.org)
JBoss / Red Hat