So the blocker for distribution now is rehashing. This is a much
trickier problem than I previously thought, since it brings in all of
the concerns we have with state transfer: the ability to generate and
apply state, preferably without stopping the cluster, not
overwriting state from ongoing transactions, etc.
I've detailed two approaches below. The first is pessimistic, based on
FLUSH; it probably won't be implemented, but I've included it here for
completeness. The second is more optimistic, based on NBST, although
several degrees more complex. I'm still not happy with it as a
solution, but it could serve as a fallback. I'm still researching
other alternatives as well, and am open to suggestions and ideas.
Anyway, here we go. For each of the steps outlined below, ML1 is the
old member list prior to a topology change, and ML2 is the new one.
A. Pessimistic approach
-----------------------
JOIN:
1. Joiner FLUSHes
2. Joiner requests state from *all* other members (concurrently,
using separate threads?) using channel.getState(address)
3. All other caches iterate through their local data container.
3.1. If the cache is the primary owner based on ML1 and the joiner is
*an* owner based on ML2, this entry is written to the state stream.
3.2. If the cache is no longer an owner of the entry based on ML2,
the entry is moved to L1 cache if enabled (otherwise, removed).
4. All caches close their state streams.
5. Joiner, after applying state received, releases FLUSH and joins
the cluster.
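To make the ownership checks in steps 3.1 and 3.2 concrete, here is a
rough sketch in Java. The owners() function below (owners picked by
walking the member list from keyHash mod its size) is a stand-in I've
invented for the example, not Infinispan's actual consistent hash;
decide() just encodes the two rules above for one entry.

```java
import java.util.ArrayList;
import java.util.List;

public class JoinStateFilter {

    enum Action { WRITE_TO_STREAM, MOVE_TO_L1, KEEP }

    // Stand-in consistent hash: the owners of a key are numOwners members,
    // starting at keyHash mod the member list size. Infinispan's real hash
    // function differs; this is only here to make the example runnable.
    static List<String> owners(int keyHash, List<String> members, int numOwners) {
        int start = Math.floorMod(keyHash, members.size());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < numOwners && i < members.size(); i++)
            result.add(members.get((start + i) % members.size()));
        return result;
    }

    // Applies steps 3.1 and 3.2 to one entry of this cache's data container.
    static Action decide(int keyHash, String self, String joiner,
                         List<String> ml1, List<String> ml2, int numOwners) {
        List<String> oldOwners = owners(keyHash, ml1, numOwners);
        List<String> newOwners = owners(keyHash, ml2, numOwners);
        if (oldOwners.get(0).equals(self) && newOwners.contains(joiner))
            return Action.WRITE_TO_STREAM;              // step 3.1
        if (!newOwners.contains(self))
            return Action.MOVE_TO_L1;                   // step 3.2 (or remove)
        return Action.KEEP;
    }
}
```

Note that only the ML1 primary writes each entry to the stream, so the
joiner never receives the same entry from two senders.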
LEAVE:
1. Coordinator requests a FLUSH
2. All caches iterate through their data containers.
2.1. If based on ML1, the cache is the primary owner of an entry, and
based on ML1 and ML2, there is a new owner who does not have the
entry, an RPC request is sent to the new owner.
2.1.1. The new owner requests state from the primary owner with
channel.getState()
2.1.2. The primary owner writes this state for the new owner.
2.2. If the cache is no longer an owner based on ML2, the entry is
moved to L1 if enabled (otherwise, removed).
3. Caches close any state streams opened. Readers apply all state
received.
4. Caches send an RPC to the coordinator informing the coord of an
end-of-rehash phase.
5. Coordinator releases FLUSH
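The core of step 2.1, computing which members newly own an entry and
therefore need it pushed, can be sketched as follows. Same caveat as
before: owners() is a hypothetical stand-in for the real consistent
hash, used only to make the sketch runnable.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LeavePushTargets {

    // Same stand-in consistent hash as before: pick numOwners members
    // starting from keyHash mod the list size. Not Infinispan's real hash.
    static List<String> owners(int keyHash, List<String> members, int numOwners) {
        int start = Math.floorMod(keyHash, members.size());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < numOwners && i < members.size(); i++)
            result.add(members.get((start + i) % members.size()));
        return result;
    }

    // Step 2.1: if this cache is the primary owner under ML1, return the
    // members that become owners under ML2 but were not owners under ML1;
    // those are the nodes the primary must push this entry's state to.
    static Set<String> pushTargets(int keyHash, String self,
                                   List<String> ml1, List<String> ml2,
                                   int numOwners) {
        List<String> oldOwners = owners(keyHash, ml1, numOwners);
        if (!oldOwners.get(0).equals(self))
            return Set.of();                        // only the primary pushes
        Set<String> targets = new LinkedHashSet<>(owners(keyHash, ml2, numOwners));
        targets.removeAll(oldOwners);
        return targets;
    }
}
```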
This process will involve multiple streams connecting each cache to
potentially every other cache. Perhaps streams over UDP multicast
would be more efficient here?
B. NBST approach
----------------
This approach is very similar to NBST: the FLUSH phases defined in the
pessimistic approach are replaced by modification logging and
streaming of the modification log, using RPC to impose a brief
partial lock on modifications while the last of the state is written.
Also, remote gets can race with rehashing: a request may reach a new
owner that hasn't yet applied state, or an old owner that has already
removed it. To return correct results in these cases, a cache should
respond to a remote get even if it is no longer an owner, either by
using L1, or by making another remote get in turn, using the view
that it deems correct.
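A non-owner's remote-get handling could then look roughly like this.
The Forwarder interface and the stillOwner flag are simplifications
I've invented for the sketch; in practice the ownership check would
come from the consistent hash and the current view.

```java
import java.util.Map;

public class RemoteGetHandler {

    // Hypothetical hook for chaining the get on to the nodes this cache
    // currently believes are the owners.
    interface Forwarder {
        String fetch(String key);
    }

    // Answer a remote get even when no longer an owner: serve the L1 copy
    // if we kept one, otherwise re-issue the get using the view we deem
    // correct. Owners just read their data container as usual.
    static String handleRemoteGet(String key, boolean stillOwner,
                                  Map<String, String> dataContainer,
                                  Map<String, String> l1, Forwarder forward) {
        if (stillOwner)
            return dataContainer.get(key);
        String cached = l1.get(key);
        if (cached != null)
            return cached;
        return forward.fetch(key);
    }
}
```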
This does increase complexity over NBST quite significantly, though,
since each cache would need to maintain a transaction log for every
other cache, based on which keys map there, so that state application
(as in the pessimistic steps above) remains accurate.
Like I said, not the most elegant, but I thought I'd just throw it out
there until I come up with a better approach. :-)
Cheers
--
Manik Surtani
manik(a)jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org