[infinispan-dev] Rehashing for DIST

Bela Ban bban at redhat.com
Wed May 6 01:22:49 EDT 2009


Not pretending I've thought this one through fully, but here are a few 
general thoughts.

For DIST, we might have to rethink transactional guarantees. The CAP 
theorem [1] for example states that you cannot have strong consistency, 
availability and partition tolerance all at once; if we keep 
availability under partitions, consistency can only be eventual. We 
have to think about what transactions mean with DIST.

Remember our discussions in Neuchatel around Data Center Replication ? 
We thought hard about state transfer (with NBST) between islands, and 
came up with a complex design. Later, we scrapped that design in favor 
of a much simpler solution, which periodically pushes changes from one 
island to the other.

We already concluded that DIST doesn't need a full state transfer when 
a new node joins; the existing nodes rehash, and the new node receives 
some of their data. The same applies to a crashed (or leaving) node.

So the question is what would break if we adopted the following simple 
mechanism:

    * On a view change V2, *every node* computes the current membership
      ML1 and the next membership ML2
    * For every key K, we compute the set of nodes on which K needs to
      reside for ML1 (S1) and ML2 (S2)
    * If S1 == S2, we don't need to do anything
    * Else, we compute the new nodes on which K needs to reside and copy
      K to them (a simple put(K,V) ?). We also remove(K,V) on the nodes
      on which K should NOT reside anymore
    * This sounds like a lot of work, but remember that a good consistent
      hash function should rehash at most 1/N of the keys, e.g. if we
      have 100'000 keys and 100 nodes, we should rehash 1'000 keys at
      most
    * The rehashing could be done in the background: going with eventual
      consistency, we could define a grace period in which this
      rehashing is done. During this time, the consistent hash function
      works on ML1 *and* ML2, whereas the rehashing works only on ML2 !
    * When the rehashing is done, every node agrees and switches to ML2
    * For a cache with a repl-count of -1 (immortal), we could simply do
      a full state transfer with NBST.
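
The mechanism above can be sketched roughly as follows. This is an 
illustrative toy, not Infinispan code: the names (ring, ownersFor, 
keysToRehash), the virtual-node count and the hash choice are all 
assumptions, used only to show the per-key S1/S2 comparison.

```java
import java.util.*;

// Hypothetical sketch: on a view change, compute the owner set for each
// key under the old membership (ML1) and the new one (ML2); only keys
// whose owner sets differ need to be pushed or removed.
public class RehashSketch {

    // Consistent hash: place each node at several points on a ring.
    static NavigableMap<Integer, String> ring(List<String> members) {
        NavigableMap<Integer, String> ring = new TreeMap<>();
        for (String node : members)
            for (int i = 0; i < 100; i++)          // 100 virtual nodes each
                ring.put((node + "#" + i).hashCode() & 0x7fffffff, node);
        return ring;
    }

    // A key's owners are the first replCount distinct nodes found
    // walking clockwise from the key's hash.
    static Set<String> ownersFor(String key, NavigableMap<Integer, String> ring,
                                 int replCount) {
        Set<String> owners = new LinkedHashSet<>();
        int h = key.hashCode() & 0x7fffffff;
        for (String node : ring.tailMap(h).values()) {
            owners.add(node);
            if (owners.size() == replCount) return owners;
        }
        for (String node : ring.values()) {        // wrap around the ring
            owners.add(node);
            if (owners.size() == replCount) return owners;
        }
        return owners;
    }

    // Returns the keys whose owner set changes between ML1 and ML2,
    // i.e. the keys that must be moved during rehashing.
    static Set<String> keysToRehash(Collection<String> keys,
                                    List<String> ml1, List<String> ml2,
                                    int replCount) {
        NavigableMap<Integer, String> r1 = ring(ml1), r2 = ring(ml2);
        Set<String> moved = new HashSet<>();
        for (String k : keys) {
            Set<String> s1 = ownersFor(k, r1, replCount);
            Set<String> s2 = ownersFor(k, r2, replCount);
            if (!s1.equals(s2)) moved.add(k);  // S1 != S2: put on new owners,
        }                                      // remove from old ones
        return moved;
    }
}
```

With a reasonable ring, adding one node to an N-node cluster should 
flag only a small fraction of the keys, in line with the ~1/N estimate 
above.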

My 2 cents, want to chat over skype today ?

[1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html


Manik Surtani wrote:
> So the blocker for distribution now is rehashing.  This is a much 
> trickier problem than I previously thought, since it brings all of 
> the concerns we have with state transfer - the ability to generate and 
> apply state, preferably while not stopping the cluster, not 
> overwriting state from ongoing transactions, etc.
>
> I've detailed two approaches.  One is pessimistic, based on FLUSH, 
> and probably won't be implemented, but I've included it here for 
> completeness.  The second is more optimistic, based on NBST, although 
> several degrees more complex.  Still not happy with it as a solution, 
> but it could be a fallback.  Still researching other alternatives as 
> well, and am open to suggestions and ideas.
>
> Anyway, here we go.  For each of the steps outlined below, ML1 is the 
> old member list prior to a topology change, and ML2 is the new one.
>
> A. Pessimistic approach
> -------------------
>
> JOIN:
>
> 1.  Joiner FLUSHes
> 2.  Joiner requests state from *all* other members (concurrently, 
> using separate threads?) using channel.getState(address)
> 3.  All other caches iterate through their local data container.
> 3.1.  If the cache is the primary owner based on ML1 and the joiner is 
> *an* owner based on ML2, this entry is written to the state stream.
> 3.2.  If the cache is no longer an owner of the entry based on ML2, 
> the entry is moved to L1 cache if enabled (otherwise, removed).
> 4.  All caches close their state streams.
> 5.  Joiner, after applying state received, releases FLUSH and joins 
> the cluster.
>
> LEAVE:
>
> 1.  Coordinator requests a FLUSH
> 2.  All caches iterate through their data containers.
> 2.1.  If based on ML1, the cache is the primary owner of an entry, and 
> based on ML1 and ML2, there is a new owner who does not have the 
> entry, an RPC request is sent to the new owner.
> 2.1.1.  The new owner requests state from the primary owner with 
> channel.getState()
> 2.1.2.  The primary owner writes this state for the new owner.
> 2.2.  If the owner is no longer an owner based on ML2, the entry is 
> moved to L1 if enabled (otherwise, removed)
> 3.  Caches close any state streams opened.  Readers apply all state 
> received.
> 4.  Caches send an RPC to the coordinator informing the coord of an 
> end-of-rehash phase.
> 5.  Coordinator releases FLUSH
>
> This process will involve multiple streams connecting each cache to 
> potentially every other cache.  Perhaps streams over UDP multicast may 
> be more efficient here?
>
> B. Optimistic (NBST) approach
> -----------------------
>
> This approach is very similar to NBST: the FLUSH phases defined in 
> the pessimistic approach are replaced by modification logging and 
> streaming of the modification log, using RPC to control a brief 
> partial lock on modifications when the last of the state is written.  
> Also, there is a race when a remote get hits a new owner that hasn't 
> applied state yet, or an old owner that has already removed state.  
> To return correct state in that case, we should support responding to 
> a remote get even if you are no longer an owner - either by using L1, 
> or by making another remote get in turn to the node that you deem 
> correct.
>
> This does increase complexity over NBST quite significantly though, 
> since each cache would need to maintain a transaction log for each and 
> every other cache based on which keys are mapped there, to allow state 
> application as per the pessimistic steps above to be accurate.
>
> Like I said, not the most elegant, but I thought I'd just throw it out 
> there until I come up with a better approach. :-)
>
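
The per-entry decision in steps 3.1/3.2 of the pessimistic JOIN could 
be sketched roughly like this. Everything here (the OwnerFn 
abstraction, the Action names, decide()) is hypothetical, not 
Infinispan API; it only illustrates the filtering logic.

```java
import java.util.*;

// Hypothetical sketch of pessimistic JOIN steps 3.1/3.2: for each local
// entry, stream it to the joiner if this cache is the primary owner
// under ML1 and the joiner is *an* owner under ML2; demote the entry to
// L1 (or drop it) if this cache is no longer an owner under ML2.
public class JoinStateFilter {

    interface OwnerFn {          // (key, member list) -> ordered owner list
        List<String> owners(String key, List<String> members);
    }

    enum Action { STREAM_AND_KEEP, STREAM_AND_DEMOTE, KEEP, DEMOTE }

    static Action decide(String key, String self, String joiner,
                         List<String> ml1, List<String> ml2, OwnerFn fn) {
        List<String> s1 = fn.owners(key, ml1);
        List<String> s2 = fn.owners(key, ml2);
        boolean primaryInMl1 = !s1.isEmpty() && s1.get(0).equals(self);
        boolean joinerOwns   = s2.contains(joiner);  // step 3.1
        boolean stillOwner   = s2.contains(self);    // step 3.2
        if (primaryInMl1 && joinerOwns)
            return stillOwner ? Action.STREAM_AND_KEEP : Action.STREAM_AND_DEMOTE;
        return stillOwner ? Action.KEEP : Action.DEMOTE;
    }
}
```

The same decide() shape would also fit the LEAVE case (step 2.1 above), 
with the joiner replaced by whichever new owner lacks the entry.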

-- 
Bela Ban
Lead JGroups / Clustering Team
JBoss - a division of Red Hat



