[infinispan-dev] Rehashing for DIST

Manik Surtani manik at jboss.org
Wed May 6 05:09:51 EDT 2009


On 6 May 2009, at 06:22, Bela Ban wrote:

> Not pretending I've thought this one through fully, but here are a  
> few general thoughts.

No worries, that's what brainstorming is all about.  :-)

> For DIST, we might have to rethink transactional guarantees. The CAP
> theorem [1], for example, states that you cannot have strong
> consistency, availability and partition tolerance all at once; at
> best, consistency becomes eventual. We have to think about what
> transactions mean with DIST.
>
> Remember our discussions in Neuchatel around Data Center
> Replication? We thought hard about state transfer (with NBST)
> between islands and came up with a hard, complex design. Later, we
> scrapped that design for a much simpler solution, which periodically
> pushes changes from one island to the other.
>
> We already concluded that DIST doesn't need a full state transfer
> when a new node joins; instead, the existing nodes rehash and the
> new node receives some of their data. The same applies to a crashed
> (or leaving) node.
>
> So the question is what would break if we adopted the following  
> simple mechanism:
>
>   * On a view change V2, *every node* computes the current membership
>     ML1 and the next membership ML2
>   * For every key K, we compute the set of nodes on which K needs to
>     reside under ML1 (S1) and under ML2 (S2).
>   * If S1 == S2, we don't need to do anything
>   * Else, we compute the new nodes on which K needs to reside and copy
>     K to them (a simple put(K,V)?). We also remove(K,V) on the nodes on
>     which K should NOT reside anymore
>   * This sounds like a lot of work, but remember that a good consistent
>     hash function should rehash at most 1/N of the keys, e.g. if we have
>     100'000 keys and 100 nodes, we should rehash 1'000 keys at most
>   * The rehashing could be done in the background: going with eventual
>     consistency, we could define a grace period in which this
>     rehashing is done. During this time, the consistent hash function
>     works on ML1 *and* ML2, whereas the rehashing works only on ML2!
>   * When the rehashing is done, every node agrees and switches to ML2
>   * For a cache with repl-count -1 (i.e. full replication), we could
>     simply do a full state transfer with NBST.

This is similar to what I originally had in mind as well, but here are
the tripping points I encountered:

1.  Concurrent state change.  E.g., node S1 determines that K, which
originally mapped to {S1, S2}, should now only exist on {S2, S3}.  At
the same time, a thread is updating K on S1.  We end up with a race on
reading the value of K to stream to S3.  I suppose that if we obtain
write locks for this read, we should be OK though (since, after all,
it is a write - on S3).
2.  Multiple view changes.  During the course of rehashing, the view
could change again, widening the membership list until the rehashes
are complete.  I suppose this could still work if we have a
RehashComplete message that is broadcast by everyone in the old view,
containing a view ID so that we know which view needs to be removed
from the ML.
3.  Repl count of -1 is not a problem; that is effectively REPL rather
than DIST, so it is not handled here.
4.  +1 to rehashing being a background process
5.  At the moment we only perform a remote get if a) the entry is not
in the data container (which holds both cached entries and L1) _and_
b) the cache instance in question is not an owner of the key.
Otherwise the entry is assumed to be null.  This will need to change
with the above algorithm: since we now have 2 MLs, we could be deemed
an owner of a key even though the key has yet to be migrated to us.
So this would need to be taken into account when deciding whether or
not to do a remote get - and indeed who to send the remote get call
to.

Let me have a think, this may well still be workable.
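
To make this concrete, here is a very rough sketch of the per-key
check in Bela's outline plus the two-view ownership test from point 5.
ConsistentHash, Address and the pushTo/dropFrom hooks are placeholders
for illustration only, not our actual API:

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Rough sketch only: Address and ConsistentHash are placeholders,
// not the real Infinispan/JGroups types.
public abstract class RehashSketch<K, V> {

   public interface Address {}

   // Hypothetical consistent hash view: maps a key to its owner nodes.
   public interface ConsistentHash {
      List<Address> locate(Object key, int replCount);
   }

   protected final ConsistentHash oldCH;   // built from ML1
   protected final ConsistentHash newCH;   // built from ML2
   protected final int replCount;
   protected final Address self;

   protected RehashSketch(ConsistentHash oldCH, ConsistentHash newCH,
                          int replCount, Address self) {
      this.oldCH = oldCH;
      this.newCH = newCH;
      this.replCount = replCount;
      this.self = self;
   }

   // Stand-ins for the actual RPCs: a put(K,V) on a node that gains K
   // and a remove(K,V) (or demotion to L1) on a node that loses it.
   protected abstract void pushTo(Address target, K key, V value);
   protected abstract void dropFrom(Address target, K key);

   // Bela's per-key loop, run against the local data container.
   public void rehash(Map<K, V> localData) {
      for (Map.Entry<K, V> e : localData.entrySet()) {
         List<Address> oldOwners = oldCH.locate(e.getKey(), replCount);
         List<Address> newOwners = newCH.locate(e.getKey(), replCount);

         Set<Address> s1 = new HashSet<Address>(oldOwners);
         Set<Address> s2 = new HashSet<Address>(newOwners);
         if (s1.equals(s2)) continue;          // S1 == S2: nothing to do

         // Let only the old primary owner push, so each key moves once.
         if (!oldOwners.get(0).equals(self)) continue;

         Set<Address> added = new HashSet<Address>(s2);
         added.removeAll(s1);                  // nodes that gain K
         Set<Address> removed = new HashSet<Address>(s1);
         removed.removeAll(s2);                // nodes that lose K

         for (Address a : added) pushTo(a, e.getKey(), e.getValue());
         for (Address a : removed) dropFrom(a, e.getKey());
      }
   }

   // Point 5 above: while both MLs are in play, ownership has to be
   // checked against ML1 *and* ML2, and a remote get should go to the
   // union of both owner sets, since the key may not have migrated yet.
   public boolean isOwnerDuringRehash(K key) {
      return oldCH.locate(key, replCount).contains(self)
            || newCH.locate(key, replCount).contains(self);
   }

   public Set<Address> remoteGetTargets(K key) {
      Set<Address> targets =
            new HashSet<Address>(oldCH.locate(key, replCount));
      targets.addAll(newCH.locate(key, replCount));
      targets.remove(self);
      return targets;
   }
}

I've let only the old primary owner do the pushing so that each key is
transferred once; the grace-period switch back to a single ML would
then just swap newCH in as the only view once every node agrees the
rehash is done.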

> My 2 cents, want to chat over Skype today?

Sure - you around in the afternoon?  I need to finish some other stuff  
I am in the middle of this AM.

> [1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
>
>
> Manik Surtani wrote:
>> So the blocker for distribution now is rehashing.  This is a much
>> trickier problem than I previously thought, since it brings with it
>> all of the concerns we have with state transfer - the ability to
>> generate and apply state, preferably without stopping the cluster,
>> without overwriting state from ongoing transactions, etc.
>>
>> I've detailed two approaches - one is pessimistic, based on FLUSH,
>> and probably won't be implemented, but I've included it here for
>> completeness.  The second is more optimistic, based on NBST,
>> although several degrees more complex.  I'm still not happy with it
>> as a solution, but it could be a fallback.  I'm still researching
>> other alternatives as well, and am open to suggestions and ideas.
>>
>> Anyway, here we go.  For each of the steps outlined below, ML1 is  
>> the old member list prior to a topology change, and ML2 is the new  
>> one.
>>
>> A. Pessimistic approach
>> -------------------
>>
>> JOIN:
>>
>> 1.  Joiner FLUSHes
>> 2.  Joiner requests state from *all* other members (concurrently,  
>> using separate threads?) using channel.getState(address)
>> 3.  All other caches iterate through their local data container.
>> 3.1.  If the cache is the primary owner based on ML1 and the joiner  
>> is *an* owner based on ML2, this entry is written to the state  
>> stream.
>> 3.2.  If the cache is no longer an owner of the entry based on ML2,  
>> the entry is moved to L1 cache if enabled (otherwise, removed).
>> 4.  All caches close their state streams.
>> 5.  Joiner, after applying state received, releases FLUSH and joins  
>> the cluster.
>>
>> LEAVE:
>>
>> 1.  Coordinator requests a FLUSH
>> 2.  All caches iterate through their data containers.
>> 2.1.  If, based on ML1, the cache is the primary owner of an entry
>> and, comparing ML1 and ML2, there is a new owner that does not yet
>> have the entry, an RPC request is sent to the new owner.
>> 2.1.1.  The new owner requests state from the primary owner with  
>> channel.getState()
>> 2.1.2.  The primary owner writes this state for the new owner.
>> 2.2.  If the cache is no longer an owner based on ML2, the entry is
>> moved to L1 if enabled (otherwise, removed)
>> 3.  Caches close any state streams opened.  Readers apply all state  
>> received.
>> 4.  Caches send an RPC to the coordinator informing it of the end of
>> the rehash phase.
>> 5.  Coordinator releases FLUSH
>>
>> This process will involve multiple streams connecting each cache to  
>> potentially every other cache.  Perhaps streams over UDP multicast  
>> may be more efficient here?
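
For illustration, the per-entry decision in steps 3.1/3.2 of the JOIN
case might look roughly like this (placeholder types rather than the
real Infinispan classes; the LEAVE case in 2.1/2.2 is analogous, with
the new owner pulling state via channel.getState() instead of the
joiner):

import java.util.List;
import java.util.Map;

// Sketch only: Address, ConsistentHash and the hooks below are
// placeholders, not the real Infinispan/JGroups types.
public abstract class JoinStateProviderSketch<K, V> {

   public interface Address {}

   public interface ConsistentHash {
      List<Address> locate(Object key, int replCount);
   }

   protected final ConsistentHash oldCH;   // built from ML1
   protected final ConsistentHash newCH;   // built from ML2
   protected final int replCount;
   protected final Address self;
   protected final boolean l1Enabled;

   protected JoinStateProviderSketch(ConsistentHash oldCH,
                                     ConsistentHash newCH, int replCount,
                                     Address self, boolean l1Enabled) {
      this.oldCH = oldCH;
      this.newCH = newCH;
      this.replCount = replCount;
      this.self = self;
      this.l1Enabled = l1Enabled;
   }

   // Stand-ins for writing to the joiner's state stream and for the
   // local container operations.
   protected abstract void writeToStateStream(Address joiner, K key, V value);
   protected abstract void moveToL1(K key);
   protected abstract void removeLocally(K key);

   // Step 3: run while the cluster is FLUSHed, over the local container.
   public void provideStateTo(Address joiner, Map<K, V> localData) {
      for (Map.Entry<K, V> e : localData.entrySet()) {
         List<Address> oldOwners = oldCH.locate(e.getKey(), replCount);
         List<Address> newOwners = newCH.locate(e.getKey(), replCount);

         // 3.1: the primary owner under ML1 streams the entry if the
         //      joiner becomes *an* owner under ML2.
         if (oldOwners.get(0).equals(self) && newOwners.contains(joiner)) {
            writeToStateStream(joiner, e.getKey(), e.getValue());
         }

         // 3.2: if this cache is no longer an owner under ML2, demote
         //      the entry to L1 (if enabled) or drop it.
         if (!newOwners.contains(self)) {
            if (l1Enabled) moveToL1(e.getKey());
            else removeLocally(e.getKey());
         }
      }
   }
}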
>>
>> B. NBST approach
>> ----------------
>>
>> This approach is very similar to NBST: the FLUSH phases defined in
>> the pessimistic approach are replaced by modification logging and
>> streaming of the modification log, using RPC to impose a brief
>> partial lock on modifications while the last of the state is
>> written.  Also, ongoing remote gets may race with the rehash - a
>> request may hit a new owner that hasn't applied state yet, or an old
>> owner that has already removed it.  To return correct state in these
>> cases, a cache should respond to a remote get even if it is no
>> longer an owner - either by using L1, or by making another remote
>> get in turn against the view it deems correct.
>>
>> This does increase complexity over NBST quite significantly though,
>> since each cache would need to maintain a transaction log for every
>> other cache, based on which keys map there, so that state can be
>> applied accurately as per the pessimistic steps above.
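
A rough idea of the per-destination modification log this would need
(again with placeholder types, and glossing over the locking,
streaming and replay details):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: Address is a placeholder type.
public class ModificationLogSketch<K, V> {

   public interface Address {}

   // A single logged modification: a put, or a remove if value is null.
   public static class Modification<K, V> {
      public final K key;
      public final V value;
      public Modification(K key, V value) {
         this.key = key;
         this.value = value;
      }
   }

   // One log per destination cache, so each destination's backlog can
   // be streamed and replayed independently while the rehash runs.
   private final Map<Address, List<Modification<K, V>>> logs =
         new HashMap<Address, List<Modification<K, V>>>();

   // Record a modification against every node that will own the key
   // under the new consistent hash.
   public synchronized void log(K key, V valueOrNull,
                                List<Address> newOwners) {
      for (Address owner : newOwners) {
         List<Modification<K, V>> log = logs.get(owner);
         if (log == null) {
            log = new ArrayList<Modification<K, V>>();
            logs.put(owner, log);
         }
         log.add(new Modification<K, V>(key, valueOrNull));
      }
   }

   // Drain one destination's backlog, e.g. just before the brief
   // partial lock while the last of the state is written.
   public synchronized List<Modification<K, V>> drain(Address destination) {
      List<Modification<K, V>> log = logs.remove(destination);
      return log != null ? log : new ArrayList<Modification<K, V>>();
   }
}

Each cache would hold one such log per destination, which is exactly
where the extra complexity over plain NBST comes from.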
>>
>> Like I said, not the most elegant, but I thought I'd just throw it  
>> out there until I come up with a better approach. :-)
>>
>
> -- 
> Bela Ban
> Lead JGroups / Clustering Team
> JBoss - a division of Red Hat
>

--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org
