On Tue, Mar 20, 2012 at 2:51 PM, Mircea Markus <mircea.markus@jboss.com> wrote:
Hi Dan,
Some notes:
Ownership Information:
- I remember the discussion with Sanne about an algorithm that wouldn't
require view prepare/rollback, but I think it would be very interesting to
see it described in detail somewhere, as all the points you raised in the
document are very closely tied to it
Yes, actually I'm trying to describe precisely that. Of course, there
are some complications...
- "However, it's not that clear what to do when a node
leaves while the
cache is in "steady state"" --> if (numOwners-1) nodes crash a state
transfer is (most likely) wanted in order not to loose consistency in the
eventuality another node goes down. By default numOwners is 2, so in the
first release, can't we just assume that for every leave we'd want
immediately issue a state transfer? I think this would cover most of our
user's needs and simplify the problem considerably.
Not necessarily: if the user has a TopologyAwareConsistentHash and has
split his cluster in two "sites" or "racks", he can bring down an
entire site/rack before rebalancing, without the risk of losing any
data.
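For example, something along these lines (just a sketch; the builder and
method names are from memory and may not match the configuration API
exactly) - each node declares its topology on the transport, and the
TopologyAwareConsistentHash then spreads the numOwners copies across sites:

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;

public class TopologyAwareConfigSketch {
   public static void main(String[] args) {
      // Each node says where it lives; the topology-aware CH tries to place
      // the numOwners copies of a key on different sites/racks/machines.
      GlobalConfiguration global = new GlobalConfigurationBuilder()
            .transport().defaultTransport()
            .siteId("siteA").rackId("rack1").machineId("node1")
            .build();

      Configuration cfg = new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.DIST_SYNC)
            .hash().numOwners(2)
            .build();

      // With one copy of every key on each site, all of siteB can be shut
      // down before rebalancing without losing any data.
   }
}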
Recovery
- I agree with your point re: recovery. This shouldn't be considered high
prio in the first release. The recovery info is kept in an independent
cache, which allows a lot of flexibility: e.g. it can point to a shared
cache store so that recovery caches on other nodes can read that info when
needed.
I don't agree about the shared cache store, because then every cache
transaction would have to write to that store. I think that would be
cost-prohibitive.
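To make that concrete, today the recovery info goes into a dedicated cache
configured roughly like this (builder method names from memory, so treat it
as a sketch):

import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class RecoveryConfigSketch {
   public static void main(String[] args) {
      // Recovery state is kept in a separate cache, named here for clarity.
      Configuration cfg = new ConfigurationBuilder()
            .transaction().recovery().enable()
            .recoveryInfoCacheName("recoveryInfoCache")
            .build();
      // Pointing that cache at a shared cache store would let other nodes
      // read the info when needed, but every transaction would then also pay
      // for a write to that store.
   }
}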
Locking/sync..
"The state transfer task will acquire the DCL in exclusive mode after
updating the cache view id and release it after obtaining the data container
snapshot. This will ensure the state transfer task can't proceed on the old
owner until all write commands that knew about the old CH have finished
executing." <-- that means that incoming writes would block for the duration
of iterating the DataContainer? That shouldn't be too bad, unless a cache
store is present..
Ok, first of all, in the new draft I merged the data container lock and
the cache view lock; I don't think keeping them separate actually helped
with anything, but it did make things harder to explain.
The idea was that incoming writes would only block while state
transfer gets the iterator to the data container, which is very short.
But the state transfer will also block on all the writes to the data
container - to make sure that we won't miss any writes.
The algorithm is like this:
Write command
===
1. acquire read lock
2. check cache view id
3. write entry to data container
4. release read lock
State Transfer
===
1. acquire write lock
2. update ch
3. update cache view id
4. release write lock
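In Java it would look roughly like this (just a sketch with made-up class
names and a ConcurrentMap standing in for the data container, not the real
code):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class CacheViewLockSketch {
   private final ReentrantReadWriteLock viewLock = new ReentrantReadWriteLock();
   private final ConcurrentMap<Object, Object> dataContainer =
         new ConcurrentHashMap<Object, Object>();
   private int cacheViewId;

   // Write command: many writers share the read lock, so they only block
   // while state transfer holds the write lock (a very short window).
   void writeCommand(Object key, Object value, int commandViewId) {
      viewLock.readLock().lock();                 // 1. acquire read lock
      try {
         if (commandViewId != cacheViewId) {      // 2. check cache view id
            throw new IllegalStateException("Stale view, retry the command");
         }
         dataContainer.put(key, value);           // 3. write entry to data container
      } finally {
         viewLock.readLock().unlock();            // 4. release read lock
      }
   }

   // State transfer: the write lock waits for every in-flight write that saw
   // the old view, then installs the new view id before releasing it.
   void installView(int newViewId) {
      viewLock.writeLock().lock();                // 1. acquire write lock
      try {
         // 2. update the consistent hash (omitted in this sketch)
         cacheViewId = newViewId;                 // 3. update cache view id
         // ... obtain the data container snapshot/iterator here ...
      } finally {
         viewLock.writeLock().unlock();           // 4. release write lock
      }
   }
}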
"CacheViewInterceptor [..] also needs to block in the interval
between a
node noticing a leave and actually receiving the new cache view from the
coordinator" <-- why can't the local cache star computing the new CH
independently and not wait for the coordinator..?
Because the local cache doesn't listen for join and leave requests,
only the coordinator does, and the cache view is computed based on the
requests that the coordinator has seen (and not on the JGroups cluster
joins and leaves).
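Conceptually the coordinator side is just this (names are invented for the
sketch, it's not the actual cache view manager code); the other nodes never
see the pending join/leave requests, so they have to wait for the view the
coordinator broadcasts:

import java.util.LinkedHashSet;
import java.util.Set;

class ViewCoordinatorSketch {
   private final Set<String> pendingMembers = new LinkedHashSet<String>();
   private int lastViewId;

   // Caches send explicit join/leave requests to the coordinator; JGroups
   // cluster view changes alone are not enough to build the cache view.
   synchronized void requestJoin(String node) {
      pendingMembers.add(node);
   }

   synchronized void requestLeave(String node) {
      pendingMembers.remove(node);
   }

   // Called when the coordinator decides to install a new view; the result
   // is broadcast to the cluster, never computed locally on the members.
   synchronized PendingViewSketch computeNextView() {
      return new PendingViewSketch(++lastViewId,
            new LinkedHashSet<String>(pendingMembers));
   }
}

class PendingViewSketch {
   final int viewId;
   final Set<String> members;

   PendingViewSketch(int viewId, Set<String> members) {
      this.viewId = viewId;
      this.members = members;
   }
}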
Cheers
Dan