Not pretending I've thought this one through fully, but here are a few
general thoughts.
For DIST, we might have to rethink transactional guarantees. The CAP
theorem [1], for example, states that you cannot have strong
consistency, availability and partition tolerance all at once; if we
keep availability and partition tolerance, consistency becomes
eventual. We have to think about what transactions mean with DIST.
Remember our discussions in Neuchatel around Data Center Replication?
We thought hard about state transfer (with NBST) between islands, and
came up with a rather complex design. Later, we scrapped that design for
a much simpler solution, which periodically pushes changes from one
island to the other.
We already concluded that DIST doesn't need a state transfer when a new
node joins; instead, the existing nodes will rehash and the new node
will get some of their data. The same goes for a crashed (or leaving)
node.
So the question is what would break if we adopted the following simple
mechanism:
* On a view change V2, *every node* computes the current membership
ML1 and the next membership ML2
* For every key K, we compute the set of nodes on which K needs to
reside for ML1 (S1) and ML2 (S2).
* If S1 == S2, we don't need to do anything
* Else, we compute the new nodes on which K needs to reside and copy
K to them (a simple put(K,V)?). We also remove(K,V) on the nodes on
which K should NOT reside anymore (a sketch follows this list)
* This sounds like a lot of work, but remember that a good consistent
hash function should rehash at most 1/N of the keys, e.g. if we have
100'000 keys and 100 nodes, we should rehash 1'000 keys at most
* The rehashing could be done in the background: going with eventual
consistency, we could define a grace period in which this
rehashing is done. During this time, the consistent hash function
works on ML1 *and* ML2, whereas the rehashing works only on ML2!
* When the rehashing is done, every node agrees and switches to ML2
* For a cache with a repl-count of -1 (immortal), we could simply do a
full state transfer with NBST.
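To make this concrete, here is a minimal Java sketch of the per-key
check every node would run on a view change. ConsistentHash, Address,
Cache and the push/remove helpers are hypothetical placeholders, not
our actual API:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SimpleRehasher {

    interface Address {}

    interface ConsistentHash {
        // nodes on which a key should reside for a given membership list
        List<Address> locate(Object key, List<Address> members, int replCount);
    }

    interface Cache {
        Map<Object, Object> localData();                 // keys held locally
        void push(Address target, Object k, Object v);   // put(K,V) on target
        void removeFrom(Address target, Object k);       // remove(K) on target
    }

    void rehash(Cache cache, ConsistentHash ch,
                List<Address> ml1, List<Address> ml2, int replCount) {
        for (Map.Entry<Object, Object> e : cache.localData().entrySet()) {
            Object k = e.getKey();
            Set<Address> s1 = new HashSet<>(ch.locate(k, ml1, replCount));
            Set<Address> s2 = new HashSet<>(ch.locate(k, ml2, replCount));
            if (s1.equals(s2))
                continue;                                // S1 == S2: nothing to do

            List<Address> gained = new ArrayList<>(s2);  // new owners of K
            gained.removeAll(s1);
            for (Address a : gained)
                cache.push(a, k, e.getValue());

            List<Address> lost = new ArrayList<>(s1);    // nodes that lose K
            lost.removeAll(s2);
            for (Address a : lost)
                cache.removeFrom(a, k);
        }
    }
}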
My 2 cents. Want to chat over Skype today?
[1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
Manik Surtani wrote:
So the blocker for distribution now is rehashing. This is a much
trickier problem than I previously thought, since it brings in all of
the concerns we have with state transfer - the ability to generate and
apply state, preferably without stopping the cluster or overwriting
state from ongoing transactions, etc.
I've detailed two approaches - one is pessimistic, based on FLUSH,
and probably won't be implemented, but I've included it here for
completeness. The second is more optimistic, based on NBST, although
several degrees more complex. I'm still not happy with it as a solution,
but it could be a fallback. I'm still researching other alternatives as
well, and am open to suggestions and ideas.
Anyway, here we go. For each of the steps outlined below, ML1 is the
old member list prior to a topology change, and ML2 is the new one.
A. Pessimistic approach
-------------------
JOIN:
1. Joiner FLUSHes
2. Joiner requests state from *all* other members (concurrently,
using separate threads?) using channel.getState(address)
3. All other caches iterate through their local data container (the
per-entry filter is sketched after the JOIN steps).
3.1. If the cache is the primary owner based on ML1 and the joiner is
*an* owner based on ML2, this entry is written to the state stream.
3.2. If the cache is no longer an owner of the entry based on ML2,
the entry is moved to L1 cache if enabled (otherwise, removed).
4. All caches close their state streams.
5. Joiner, after applying state received, releases FLUSH and joins
the cluster.
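A rough Java sketch of step 3, i.e. the filter each existing cache
could apply to its data container while writing the joiner's state
stream. The types and helpers here are hypothetical stand-ins, not the
real API:

import java.io.ObjectOutputStream;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class JoinStateGenerator {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, List<Address> members, int replCount);
    }

    void generateState(Map<Object, Object> dataContainer,   // local entries
                       Map<Object, Object> l1,              // L1 cache, if enabled
                       ObjectOutputStream stateStream,      // stream to the joiner
                       ConsistentHash ch,
                       List<Address> ml1, List<Address> ml2,
                       Address self, Address joiner,
                       int replCount, boolean l1Enabled) throws Exception {
        Iterator<Map.Entry<Object, Object>> it = dataContainer.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Object, Object> e = it.next();
            Object k = e.getKey();
            List<Address> ownersOld = ch.locate(k, ml1, replCount);
            List<Address> ownersNew = ch.locate(k, ml2, replCount);

            // 3.1: the ML1 primary owner writes entries the joiner owns under ML2
            if (ownersOld.get(0).equals(self) && ownersNew.contains(joiner)) {
                stateStream.writeObject(k);
                stateStream.writeObject(e.getValue());
            }

            // 3.2: no longer an owner under ML2 -> move to L1 (if enabled) or drop
            if (!ownersNew.contains(self)) {
                if (l1Enabled)
                    l1.put(k, e.getValue());
                it.remove();
            }
        }
    }
}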
LEAVE:
1. Coordinator requests a FLUSH
2. All caches iterate through their data containers.
2.1. If, based on ML1, the cache is the primary owner of an entry, and,
based on ML1 and ML2, there is a new owner who does not yet have the
entry, an RPC request is sent to the new owner (sketched after the
LEAVE steps).
2.1.1. The new owner requests state from the primary owner with
channel.getState()
2.1.2. The primary owner writes this state for the new owner.
2.2. If the cache is no longer an owner based on ML2, the entry is
moved to L1 if enabled (otherwise, removed)
3. Caches close any state streams opened. Readers apply all state
received.
4. Caches send an RPC to the coordinator informing the coord of an
end-of-rehash phase.
5. Coordinator releases FLUSH
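A sketch of step 2.1 in the same vein: the ML1 primary owner works out
which members gained ownership of an entry under ML2 and notifies them
so they can pull the state. Again, the Rpc helper and the other types
are placeholders, not the real API:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LeaveRehashStep {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, List<Address> members, int replCount);
    }

    interface Rpc {
        void notifyNewOwner(Address newOwner, Object key); // triggers a getState() pull
    }

    void pushToNewOwners(Map<Object, Object> dataContainer,
                         ConsistentHash ch, Rpc rpc,
                         List<Address> ml1, List<Address> ml2,
                         Address self, int replCount) {
        for (Object k : dataContainer.keySet()) {
            List<Address> ownersOld = ch.locate(k, ml1, replCount);
            if (!ownersOld.get(0).equals(self))
                continue;                       // only the ML1 primary owner acts
            List<Address> ownersNew = new ArrayList<>(ch.locate(k, ml2, replCount));
            ownersNew.removeAll(ownersOld);     // owners that did not have K before
            for (Address a : ownersNew)
                rpc.notifyNewOwner(a, k);       // new owner then pulls the entry
        }
    }
}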
This process will involve multiple streams connecting each cache to
potentially every other cache. Perhaps streams over UDP multicast may
be more efficient here?
B. NBST approach
-------------------
This approach is very similar to NBST where the FLUSH phases defined
in the pessimistic approach are replaced by modification logging and
streaming of the modification log, using RPC to control a brief
partial lock on modifications when the last of the state is written.
Also, remote gets can race with rehashing: a request may reach a new
owner that hasn't applied state yet, or an old owner that has already
removed the state. To allow ongoing remote gets to retrieve correct
state in these cases, we should support responding to a remote get even
if you are no longer an owner - either by using L1, or by making another
remote get in turn with the view that you deem correct.
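A small sketch of that fallback read path, again with hypothetical
types (ConsistentHash, Rpc) rather than the real API:

import java.util.List;
import java.util.Map;

public class RemoteGetHandler {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, int replCount);
    }

    interface Rpc {
        Object remoteGet(Address target, Object key);
    }

    Object handleRemoteGet(Object key,
                           Map<Object, Object> dataContainer,
                           Map<Object, Object> l1Cache,
                           ConsistentHash currentView,
                           Rpc rpc, Address self, int replCount) {
        Object v = dataContainer.get(key);
        if (v != null)
            return v;                         // still have it locally
        v = l1Cache.get(key);
        if (v != null)
            return v;                         // demoted to L1 during the rehash
        // Otherwise forward the get to whoever we deem the owner under our
        // current view, skipping ourselves to avoid a trivial loop.
        for (Address owner : currentView.locate(key, replCount)) {
            if (!owner.equals(self))
                return rpc.remoteGet(owner, key);
        }
        return null;
    }
}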
This does increase complexity over NBST quite significantly though,
since each cache would need to maintain a transaction log for each and
every other cache based on which keys are mapped there, to allow state
application as per the pessimistic steps above to be accurate.
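For illustration, a minimal sketch of such a per-target modification
log: every write during the rehash is recorded against each cache that
will own the key under ML2, so the backlog can be streamed to the right
node later. All names here are hypothetical:

import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PerTargetModificationLog {

    interface Address {}

    interface ConsistentHash {
        List<Address> locate(Object key, int replCount);
    }

    record Modification(Object key, Object value) {}

    private final Map<Address, Queue<Modification>> logs = new ConcurrentHashMap<>();
    private final ConsistentHash ml2;
    private final int replCount;

    PerTargetModificationLog(ConsistentHash ml2, int replCount) {
        this.ml2 = ml2;
        this.replCount = replCount;
    }

    // Called on every write while the rehash is in progress.
    void record(Object key, Object value) {
        for (Address owner : ml2.locate(key, replCount))
            logs.computeIfAbsent(owner, a -> new ConcurrentLinkedQueue<>())
                .add(new Modification(key, value));
    }

    // Drained when the backlog for a given target is streamed to it.
    Queue<Modification> drain(Address target) {
        return logs.remove(target);
    }
}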
Like I said, not the most elegant, but I thought I'd just throw it out
there until I come up with a better approach. :-)
--
Bela Ban
Lead JGroups / Clustering Team
JBoss - a division of Red Hat