Thinking more about the wait time, this might be problematic:
* V1={A,B,C,D}
* K maps to C and D, repl-count=2
* V2={A,B,C}
* V3={A,B}
* If C and D crash one after the other (say within 10 seconds), then we
need to rebalance K immediately upon receiving V2, because K is already
down to a single copy (below repl-count=2) and is lost once C goes too
Why not handle views immediately and rebalance elements? We can always
think of optimizations later...
Rebalancing should occur a lot when starting the cluster, but at this
time, we don't have many elements in the cache anyway. During operation,
views should be infrequent. And, if we pick a good consistent hashing
algorithm, rehashing should only affect 1/N of all elements anyway.
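
To make the 1/N point concrete, here is a rough sketch of a hash ring with virtual nodes. Everything in it (the HashRing class, MD5 as the hash, 100 virtual nodes per member) is an assumption for illustration, not the algorithm we'd actually ship:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; not an existing Infinispan or JGroups class.
final class HashRing {
    private static final int VIRTUAL_NODES = 100; // assumed value, tune as needed
    private final TreeMap<Long, String> ring = new TreeMap<>();

    HashRing(List<String> nodes) {
        // Each member is placed on the ring many times to even out the load.
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    /** Returns the first replCount distinct members clockwise from the key. */
    List<String> owners(String key, int replCount) {
        Set<String> owners = new LinkedHashSet<>();
        long h = hash(key);
        collect(ring.tailMap(h, true).values(), owners, replCount);
        collect(ring.headMap(h, false).values(), owners, replCount); // wrap around
        return new ArrayList<>(owners);
    }

    private static void collect(Iterable<String> nodes, Set<String> owners, int replCount) {
        for (String node : nodes) {
            if (owners.size() >= replCount) {
                return;
            }
            owners.add(node);
        }
    }

    /** First 8 bytes of the MD5 digest, read as a long. */
    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a ring like this, going from {A,B,C,D} to {A,B,C,D,E} only changes the first owner of keys that now land on E; keys never move between the old members.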
And remember: at the JGroups level, we can bundle views (= process
multiple LEAVE or JOIN requests together and generate only 1 view) with
GMS.view_bundling. This might help reduce churn without any code
changes, at least for the moment.
Manik Surtani wrote:
On 15 Apr 2009, at 13:49, Bela Ban wrote:
> Looks good. A few suggestions though:
>
> * I'd move the old part (Data Partitioning) to a separate wiki and
> reference it rather than including it
This is the old wiki. :-) It is the same thing, I have just updated
it and this is why I don't want to remove the old stuff in case people
have bookmarked it or are still referencing it. Let's see though, if
it gets too cluttered I'll archive it somewhere.
> * The wait time for rehashing is problematic
> o On the good side, we don't have too much rehashing, but...
> o Clients using the old and new view might get stale data when
> we've just rehashed but the client fetched the data from the
> old view
The way I see it a union of views will be used.
E.g.,
V1: {A, B, C, D}
V2: {A, B, C, D, E}
K maps to {A, D} in V1 and {D, E} in V2.
The moment the view change is detected, a timer is started that runs
for rehashWait millis. During this period, all puts, gets, etc. go to
a union of where keys are hashed (minus dead hosts), e.g., in this
case, {A, D, E}. This way these 3 nodes are kept in sync for all new
puts. Could be more than 2 views by this stage.
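
A minimal sketch of that union, assuming the owners of K under each pending view have already been computed. UnionOwners and targets are made-up names, and members are plain strings here rather than JGroups addresses:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not an existing Infinispan class.
final class UnionOwners {
    /**
     * Union of the owners of a key across all views still in play,
     * minus members that are no longer alive.
     */
    static Set<String> targets(List<List<String>> ownersPerView, Set<String> liveMembers) {
        Set<String> union = new LinkedHashSet<>();
        for (List<String> owners : ownersPerView) {
            union.addAll(owners);      // works for 2 views or more
        }
        union.retainAll(liveMembers);  // drop dead hosts
        return union;
    }
}
```

For the example above, owners {A, D} under V1 and {D, E} under V2 with all five members alive give {A, D, E}.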
The only tricky situation is dealing with a nonexistent value. (How
do we know whether K is null, or whether it just hasn't been moved
over to the new owner yet?). To deal with this, the only thing I can
think of is to use a RspFilter and "double-check" all nulls by waiting
for others in the replication group to respond as well.
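
Roughly what I have in mind for the double-check. The ResponseFilter interface below is a local stand-in with the same two-method shape as JGroups' RspFilter (isAcceptable / needMoreResponses), using strings instead of Address so the sketch stands alone; NullDoubleCheckFilter is a made-up name:

```java
import java.util.HashSet;
import java.util.Set;

// Local stand-in shaped like JGroups' RspFilter; declared here as an assumption.
interface ResponseFilter {
    boolean isAcceptable(Object response, String sender);
    boolean needMoreResponses();
}

// Hypothetical filter: a null is only believed once everyone has said null.
final class NullDoubleCheckFilter implements ResponseFilter {
    private final Set<String> pending;
    private Object value; // first non-null value seen, if any

    NullDoubleCheckFilter(Set<String> replicationGroup) {
        this.pending = new HashSet<>(replicationGroup);
    }

    @Override
    public boolean isAcceptable(Object response, String sender) {
        if (!pending.remove(sender)) {
            return false; // duplicate, or not part of the replication group
        }
        if (response != null && value == null) {
            value = response;
        }
        return true;
    }

    @Override
    public boolean needMoreResponses() {
        // Keep waiting while all we have are nulls and someone has yet to answer.
        return value == null && !pending.isEmpty();
    }

    Object result() {
        return value;
    }
}
```

The cost is that a genuine miss is as slow as the slowest member of the group, which seems acceptable if it only applies during the rehash window.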
Now at what point do we switch over to the new view completely?
Rather than broadcast a "new view installed" message or something of
the sort, I reckon we just use the timeout. From the time the new
view was received, everyone starts a timer (they all have to rehash
anyway) and at the end of this, we stop using a combined view and
switch to a single, latest view.
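
A sketch of the timeout-based switch. ViewWindow is a made-up name, and time is passed in explicitly so the logic is easy to test. One assumption that is not settled above: each new view restarts the timer, so overlapping view changes extend the combined-view period.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not an existing Infinispan class.
final class ViewWindow {
    private final long rehashWaitMillis;
    private final List<List<String>> views = new ArrayList<>();
    private long deadline = Long.MIN_VALUE;

    ViewWindow(long rehashWaitMillis, List<String> initialView) {
        this.rehashWaitMillis = rehashWaitMillis;
        views.add(initialView);
    }

    /** Called when a new view arrives; starts (or restarts) the timer. */
    void onViewChange(List<String> newView, long nowMillis) {
        collapseIfExpired(nowMillis);
        views.add(newView);
        deadline = nowMillis + rehashWaitMillis; // assumption: timer restarts
    }

    /** Views to hash against: all pending ones during the wait, else the latest. */
    List<List<String>> viewsInUse(long nowMillis) {
        collapseIfExpired(nowMillis);
        return new ArrayList<>(views);
    }

    private void collapseIfExpired(long nowMillis) {
        if (nowMillis >= deadline && views.size() > 1) {
            List<String> latest = views.get(views.size() - 1);
            views.clear();
            views.add(latest);
        }
    }
}
```

Since every node runs its own timer, nodes will not switch at exactly the same instant, which is the race discussed next.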
When dealing with the race in switching to a single view, I suggest a
new response type (NOT_OWNER?) which a node can reply with, causing
the requestor to wait for the next response. Similar to receiving a
null in the case above.
> o In general, I like the idea of 'bundling' views, to avoid
> potentially unnecessary rehashing
> o In case of a GET during a rehashing, maybe simply execute a
> multicast and pick the response with the highest view ID... ?
A possible fallback for when a null response is seen. Is view id
available in a RspFilter? Since this is where we decide whether to
accept the answer or wait for a better one.
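
If the view id turns out not to be available there, one option is to carry it in the response itself. A sketch under that assumption (VersionedResponse and HighestViewPicker are made-up names, and nothing like this exists in the code today):

```java
import java.util.List;

// Hypothetical response wrapper: the responder stamps its current view id.
final class VersionedResponse {
    final String sender;
    final long viewId;
    final Object value;

    VersionedResponse(String sender, long viewId, Object value) {
        this.sender = sender;
        this.viewId = viewId;
        this.value = value;
    }
}

// Hypothetical helper for the multicast-GET fallback.
final class HighestViewPicker {
    /** Returns the response with the highest view id; the first one wins ties. */
    static VersionedResponse pick(List<VersionedResponse> responses) {
        VersionedResponse best = null;
        for (VersionedResponse r : responses) {
            if (best == null || r.viewId > best.viewId) {
                best = r;
            }
        }
        return best; // null if nobody answered
    }
}
```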
Cheers
--
Manik Surtani
manik(a)jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org
--
Bela Ban
Lead JGroups / Clustering Team
JBoss - a division of Red Hat