[infinispan-dev] Rehashing for DIST

Tue May 5 12:19:05 EDT 2009

So the blocker for distribution now is rehashing.  This is much  
trickier a problem than I previously thought, since it brings all of  
the concerns we have with state transfer - the ability to generate and  
apply state, preferably while not stopping the cluster, not  
overwriting state from ongoing transactions, etc.

I've detailed out two approaches - one is pessimistic, based on FLUSH,  
and probably won't be implemented but I've included it here for  
completeness.  The second is more optimistic, based on NBST, although  
several degrees more complex.  Still not happy with it as a solution  
though, but it could be a fallback.  Still researching other  
alternatives as well, and an open to suggestions and ideas.

Anyway, here we go.  For each of the steps outlined below, ML1 is the  
old member list prior to a topology change, and ML2 is the new one.

A. Pessimistic approach
-------------------

JOIN:

1.  Joiner FLUSHes
2.  Joiner requests state from *all* other members (concurrently,  
using separate threads?) using channel.getState(address)
3.  All other caches iterate through their local data container.
3.1.  If the cache is the primary owner based on ML1 and the joiner is  
*an* owner based on ML2, this entry is written to the state stream.
3.2.  If the cache is no longer an owner of the entry based on ML2,  
the entry is moved to L1 cache if enabled (otherwise, removed).
4.  All caches close their state streams.
5.  Joiner, after applying state received, releases FLUSH and joins  
the cluster.

LEAVE:

1.  Coordinator requests a FLUSH
2.  All caches iterate through their data containers.
2.1.  If based on ML1, the cache is the primary owner of an entry, and  
based on ML1 and ML2, there is a new owner who does not have the  
entry, an RPC request is sent to the new owner.
2.1.1.  The new owner requests state from the primary owner with  
channel.getState()
2.1.2.  The primary owner writes this state for the new owner.
2.2.  If the owner is no longer an owner based on ML2, the entry is  
moved to L1 if enabled (otherwise, removed)
3.  Caches close any state streams opened.  Readers apply all state  
received.
4.  Caches send an RPC to the coordinator informing the coord of an  
end-of-rehash phase.
5.  Coordinator releases FLUSH

This process will involve multiple streams connecting each cache to  
potentially every other cache.  Perhaps streams over UDP multicast may  
be more efficient here?

NBST approach
-----------------------

This approach is very similar to NBST where the FLUSH phases defined  
in the pessimistic approach are replaced by modification logging and  
streaming of the modification log, using RPC to control a brief  
partial lock on modifications when the last of the state is written.   
Also, to allow for ongoing remote gets to retrieve correct state when  
dealing with a race of making a request from a new owner when the new  
owner hasn't applied state, or an old owner when the old owner has  
already removed state, we should support responding to a remote get  
even if you are no longer an owner - either by using L1, or my making  
another remote get in turn with the view that you deem correct.

This does increase complexity over NBST quite significantly though,  
since each cache would need to maintain a transaction log for each and  
every other cache based on which keys are mapped there, to allow state  
application as per the pessimistic steps above to be accurate.

Like I said, not the most elegant, but I thought I'd just throw it out  
there until I come up with a better approach. :-)

Cheers
--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org