Not pretending I've thought this one through fully, but here are a few
general thoughts.
For DIST, we might have to rethink transactional guarantees. The CAP
theorem [1], for example, states that you cannot have strong
consistency, availability and partition tolerance all at once; if we
keep availability and partition tolerance, consistency becomes
eventual. We have to think about what transactions mean with DIST.
Remember our discussions in Neuchatel around Data Center Replication?
We thought hard about state transfer (with NBST) between islands, and
came up with a rather complex design. Later, we scrapped that design for
a much simpler solution, which periodically pushes changes from one
island to the other.
We already concluded that DIST doesn't need a state transfer when a new
node joins; instead, the existing nodes will rehash and the new node
will get some of their data. The same goes for a crashed (or leaving)
node.
So the question is what would break if we adopted the following simple
mechanism:
* On a view change V2, *every node* computes the current membership
ML1 and the next membership ML2
* For every key K, we compute the set of nodes on which K needs to
reside for ML1 (S1) and ML2 (S2).
* If S1 == S2, we don't need to do anything
* Else, we compute the new nodes on which K needs to reside and copy
K to them (a simple put(K,V)?). We also remove(K,V) on the nodes on
which K should NOT reside anymore (a sketch follows this list)
* This sounds like a lot of work, but remember that a good consistent
hash function should rehash at most 1/N of the keys, e.g. if we have
100'000 keys and 100 nodes, we should rehash 1'000 keys at most
* The rehashing could be done in the background: going with eventual
consistency, we could define a grace period in which this
rehashing is done. During this time, the consistent hash function
works on ML1 *and* ML2, whereas the rehashing works only on ML2!
* When the rehashing is done, every node agrees and switches to ML2
* For a cache with a repl-count of -1 (immortal), we could simply do a
full state transfer with NBST.
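To make this concrete, here is a minimal Java sketch of the per-key
check every node would run on a view change. ConsistentHash, Address,
Cache and the push/remove helpers are hypothetical placeholders, not
our actual API:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SimpleRehasher {

    interface Address {}

    interface ConsistentHash {
        // nodes on which a key should reside for a given membership list
        List<Address> locate(Object key, List<Address> members, int replCount);
    }

    interface Cache {
        Map<Object, Object> localData();                 // keys held locally
        void push(Address target, Object k, Object v);   // put(K,V) on target
        void removeFrom(Address target, Object k);       // remove(K) on target
    }

    void rehash(Cache cache, ConsistentHash ch,
                List<Address> ml1, List<Address> ml2, int replCount) {
        for (Map.Entry<Object, Object> e : cache.localData().entrySet()) {
            Object k = e.getKey();
            Set<Address> s1 = new HashSet<>(ch.locate(k, ml1, replCount));
            Set<Address> s2 = new HashSet<>(ch.locate(k, ml2, replCount));
            if (s1.equals(s2))
                continue;                                // S1 == S2: nothing to do

            List<Address> gained = new ArrayList<>(s2);  // new owners of K
            gained.removeAll(s1);
            for (Address a : gained)
                cache.push(a, k, e.getValue());

            List<Address> lost = new ArrayList<>(s1);    // nodes that lose K
            lost.removeAll(s2);
            for (Address a : lost)
                cache.removeFrom(a, k);
        }
    }
}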
My 2 cents. Want to chat over Skype today?
[1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
Manik Surtani wrote:
So the blocker for distribution now is rehashing. This is a much
trickier problem than I previously thought, since it brings in all of
the concerns we have with state transfer - the ability to generate and
apply state, preferably without stopping the cluster or overwriting
state from ongoing transactions, etc.
I've detailed two approaches - one is pessimistic, based on FLUSH,
and probably won't be implemented, but I've included it here for
completeness. The second is more optimistic, based on NBST, although
several degrees more complex. I'm still not happy with it as a solution,
but it could be a fallback. I'm still researching other alternatives as
well, and am open to suggestions and ideas.
Anyway, here we go. For each of the steps outlined below, ML1 is the
old member list prior to a topology change, and ML2 is the new one.
A. Pessimistic approach
-------------------
JOIN:
1. Joiner FLUSHes
2. Joiner requests state from *all* other members (concurrently,
using separate threads?) using channel.getState(address)
3. All other caches iterate through their local data container (the
per-entry filter is sketched after the JOIN steps).
3.1. If the cache is the primary owner based on ML1 and the joiner is
*an* owner based on ML2, this entry is written to the state stream.
3.2. If the cache is no longer an owner of the entry based on ML2,
the entry is moved to L1 cache if enabled (otherwise, removed).
4. All caches close their state streams.
5. Joiner, after applying state received, releases FLUSH and joins
the cluster.
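A rough Java sketch of step 3, i.e. the filter each existing cache
could apply to its data container while writing the joiner's state
stream. The types and helpers here are hypothetical stand-ins, not the
real API:

import java.io.ObjectOutputStream;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class JoinStateGenerator {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, List<Address> members, int replCount);
    }

    void generateState(Map<Object, Object> dataContainer,   // local entries
                       Map<Object, Object> l1,              // L1 cache, if enabled
                       ObjectOutputStream stateStream,      // stream to the joiner
                       ConsistentHash ch,
                       List<Address> ml1, List<Address> ml2,
                       Address self, Address joiner,
                       int replCount, boolean l1Enabled) throws Exception {
        Iterator<Map.Entry<Object, Object>> it = dataContainer.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Object, Object> e = it.next();
            Object k = e.getKey();
            List<Address> ownersOld = ch.locate(k, ml1, replCount);
            List<Address> ownersNew = ch.locate(k, ml2, replCount);

            // 3.1: the ML1 primary owner writes entries the joiner owns under ML2
            if (ownersOld.get(0).equals(self) && ownersNew.contains(joiner)) {
                stateStream.writeObject(k);
                stateStream.writeObject(e.getValue());
            }

            // 3.2: no longer an owner under ML2 -> move to L1 (if enabled) or drop
            if (!ownersNew.contains(self)) {
                if (l1Enabled)
                    l1.put(k, e.getValue());
                it.remove();
            }
        }
    }
}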
LEAVE:
1. Coordinator requests a FLUSH
2. All caches iterate through their data containers.
2.1. If, based on ML1, the cache is the primary owner of an entry, and,
based on ML1 and ML2, there is a new owner who does not yet have the
entry, an RPC request is sent to the new owner (sketched after the
LEAVE steps).
2.1.1. The new owner requests state from the primary owner with
channel.getState()
2.1.2. The primary owner writes this state for the new owner.
2.2. If the cache is no longer an owner based on ML2, the entry is
moved to L1 if enabled (otherwise, removed)
3. Caches close any state streams opened. Readers apply all state
received.
4. Caches send an RPC to the coordinator informing the coord of an
end-of-rehash phase.
5. Coordinator releases FLUSH
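A sketch of step 2.1 in the same vein: the ML1 primary owner works out
which members gained ownership of an entry under ML2 and notifies them
so they can pull the state. Again, the Rpc helper and the other types
are placeholders, not the real API:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LeaveRehashStep {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, List<Address> members, int replCount);
    }

    interface Rpc {
        void notifyNewOwner(Address newOwner, Object key); // triggers a getState() pull
    }

    void pushToNewOwners(Map<Object, Object> dataContainer,
                         ConsistentHash ch, Rpc rpc,
                         List<Address> ml1, List<Address> ml2,
                         Address self, int replCount) {
        for (Object k : dataContainer.keySet()) {
            List<Address> ownersOld = ch.locate(k, ml1, replCount);
            if (!ownersOld.get(0).equals(self))
                continue;                       // only the ML1 primary owner acts
            List<Address> ownersNew = new ArrayList<>(ch.locate(k, ml2, replCount));
            ownersNew.removeAll(ownersOld);     // owners that did not have K before
            for (Address a : ownersNew)
                rpc.notifyNewOwner(a, k);       // new owner then pulls the entry
        }
    }
}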
This process will involve multiple streams connecting each cache to
potentially every other cache. Perhaps streams over UDP multicast may
be more efficient here?
B. NBST approach
-------------------
This approach is very similar to NBST where the FLUSH phases defined
in the pessimistic approach are replaced by modification logging and
streaming of the modification log, using RPC to control a brief
partial lock on modifications when the last of the state is written.
Also, remote gets can race with rehashing: a request may reach a new
owner that hasn't applied state yet, or an old owner that has already
removed the state. To allow ongoing remote gets to retrieve correct
state in these cases, we should support responding to a remote get even
if you are no longer an owner - either by using L1, or by making another
remote get in turn with the view that you deem correct.
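A small sketch of that fallback read path, again with hypothetical
types (ConsistentHash, Rpc) rather than the real API:

import java.util.List;
import java.util.Map;

public class RemoteGetHandler {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, int replCount);
    }

    interface Rpc {
        Object remoteGet(Address target, Object key);
    }

    Object handleRemoteGet(Object key,
                           Map<Object, Object> dataContainer,
                           Map<Object, Object> l1Cache,
                           ConsistentHash currentView,
                           Rpc rpc, Address self, int replCount) {
        Object v = dataContainer.get(key);
        if (v != null)
            return v;                         // still have it locally
        v = l1Cache.get(key);
        if (v != null)
            return v;                         // demoted to L1 during the rehash
        // Otherwise forward the get to whoever we deem the owner under our
        // current view, skipping ourselves to avoid a trivial loop.
        for (Address owner : currentView.locate(key, replCount)) {
            if (!owner.equals(self))
                return rpc.remoteGet(owner, key);
        }
        return null;
    }
}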
This does increase complexity over NBST quite significantly though,
since each cache would need to maintain a transaction log for each and
every other cache based on which keys are mapped there, to allow state
application as per the pessimistic steps above to be accurate.
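For illustration, a minimal sketch of such a per-target modification
log: every write during the rehash is recorded against each cache that
will own the key under ML2, so the backlog can be streamed to the right
node later. All names here are hypothetical:

import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PerTargetModificationLog {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, int replCount);
    }

    record Modification(Object key, Object value) {}

    private final Map<Address, Queue<Modification>> logs = new ConcurrentHashMap<>();
    private final ConsistentHash ml2;
    private final int replCount;

    PerTargetModificationLog(ConsistentHash ml2, int replCount) {
        this.ml2 = ml2;
        this.replCount = replCount;
    }

    // Called on every write while the rehash is in progress.
    void record(Object key, Object value) {
        for (Address owner : ml2.locate(key, replCount))
            logs.computeIfAbsent(owner, a -> new ConcurrentLinkedQueue<>())
                .add(new Modification(key, value));
    }

    // Drained when the backlog for a given target is streamed to it.
    Queue<Modification> drain(Address target) {
        return logs.remove(target);
    }
}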
Like I said, not the most elegant, but I thought I'd just throw it out
there until I come up with a better approach. :-)
--
Bela Ban
Lead JGroups / Clustering Team
JBoss - a division of Red Hat