On Fri, Oct 12, 2012 at 7:05 PM, Sanne Grinovero <sanne@infinispan.org> wrote:
> Hi all,
> during design documents review I remember having some objections on
> the wording used between "performance", "time" and "state transfer",
> but since I then stopped following this aspect I don't know how far
> the implementation picked this up.
> I have been pointing out that the goal should not be to minimize the
> time it takes to complete the state transfer, but to minimize the
> performance impact during state transfers.
Yep, this is indeed the goal of NBST. Mircea's email is a bit confusing: he
starts by saying we're trying to minimize the impact on throughput during
state transfer, but then he switches to how long state transfer takes.
Because not all nodes receive the rebalance start event at the same time,
and because the forwarding of commands to new owners is not optimal, I
expect to see a visible drop in throughput during state transfer.
Maybe we should measure something like "total drop in throughput compared
to the initial cluster, from the moment a node joined to the moment it
became part of the read consistent hash (i.e. the joiner finished receiving
state + we finished our bookkeeping on the coordinator)".
I'm not sure how easy it would be to do this, but I'd also like to see how
old transactions (with topologyId < current topology id) affect
performance, because they need special handling vis-a-vis locking. So
perhaps we can also measure the throughput in the post-join cluster while
TransactionTable.minTxViewId (soon to be called minTxTopologyId) <
StateConsumerImpl.cacheTopology.topologyId and compare it to the throughput
in the cluster after everything has settled.
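
Something like this could do the bucketing (a sketch only; topologyIds()
stands in for whatever hook we'd use to read those two internal fields,
probably reflection or a test-scoped accessor):

   // Sketch: tag each throughput sample by whether transactions started
   // before the rebalance are still live. topologyIds() is a placeholder
   // for a test hook reading TransactionTable.minTxViewId and
   // StateConsumerImpl.cacheTopology.topologyId, which are internal.
   public abstract class BucketedSampler {
      private long opsWithOldTxs;
      private long opsSettled;

      /** Returns {minTxTopologyId, currentTopologyId}. */
      protected abstract int[] topologyIds();

      public void record(long opsInWindow) {
         int[] ids = topologyIds();
         if (ids[0] < ids[1]) {
            // old transactions still running; locking needs special handling
            opsWithOldTxs += opsInWindow;
         } else {
            opsSettled += opsInWindow;
         }
      }

      public long getOpsWithOldTxs() { return opsWithOldTxs; }
      public long getOpsSettled() { return opsSettled; }
   }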
> The design should be able to deal with multiple state transfers being
> triggered while the current one(s) haven't completed; this is doable
> as long as we don't abort partial state transfers and "move on" from
> the intermediate states towards the stable state. I think it's also
> easy to demonstrate that such a design minimizes the bytes flying
> around the network, as the intermediate states are certainly "closer"
> (in terms of distribution of data) to the end state than the initial
> state.
We don't do this (yet). Actually, interrupting state transfers seemed like
the trickiest part of the "tree of cache views" NBST design, so in the
latest version we don't interrupt state transfer at all.
In order to really minimize transfers in such a design you'd have to
compute the consistent hash based on the previous consistent hash *and* on
what segments the new owners have already received - otherwise each
incremental change could undo all the progress made during the previous
state transfer. But I don't think we'll do that very soon...
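
To illustrate what I mean, a hypothetical sketch (all names are made up; it
also deliberately ignores the hard part, which is keeping the distribution
balanced at the same time):

   // When rebalancing again before the previous transfer has finished,
   // keep a node as owner of a segment if it already received that
   // segment, so earlier progress isn't thrown away.
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;

   public class IncrementalRebalance {
      /**
       * @param previousOwners   owner list per segment in the pending CH
       * @param receivedSegments segments each node has fully received
       * @param numOwners        required copies of each segment
       * @param members          current members, in preference order
       */
      public static List<List<String>> computeOwners(
            List<List<String>> previousOwners,
            Map<String, Set<Integer>> receivedSegments,
            int numOwners, List<String> members) {
         List<List<String>> newOwners = new ArrayList<List<String>>();
         for (int segment = 0; segment < previousOwners.size(); segment++) {
            List<String> owners = new ArrayList<String>();
            // 1. keep previous owners that already hold this segment's data
            for (String node : previousOwners.get(segment)) {
               Set<Integer> received = receivedSegments.get(node);
               if (owners.size() < numOwners && members.contains(node)
                     && received != null && received.contains(segment)) {
                  owners.add(node);
               }
            }
            // 2. top up with other members until we reach numOwners
            for (String node : members) {
               if (owners.size() >= numOwners) break;
               if (!owners.contains(node)) owners.add(node);
            }
            newOwners.add(owners);
         }
         return newOwners;
      }
   }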
Our solution at the moment is to ignore any joins or leaves during state
transfer, and only start a new state transfer once the current one is done.
This is possible because when the primary owner of a segment leaves, the
first backup automatically becomes the new primary owner, so transactions
can go on without any state transfer. In the end, I think the best solution
to avoid wasted transfers is to allow the administrator to suspend
rebalancing for a period of time.
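
The promotion itself is trivial if the primary is just the first element of
each segment's owner list; roughly (illustrative, not the actual code):

   // Minimal sketch of the "first backup becomes primary" rule: the
   // primary owner is simply the first element of each segment's owner
   // list, so removing the leaver promotes the next owner in line.
   import java.util.List;

   public class LeaveHandler {
      public static void handleLeave(List<List<String>> ownersPerSegment,
            String leaver) {
         for (List<String> owners : ownersPerSegment) {
            // if the leaver was at index 0 (the primary), the first
            // backup shifts into index 0 and becomes the new primary
            owners.remove(leaver);
         }
      }
   }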
> A consequence is that the longer it takes to finalize a state
> transfer, the higher the chances of aggregating multiple view changes
> into fewer data-shipping events, and so the less work is done overall.
We do have this, but if a second node leaves while we're still transferring
state, the user had better have configured numOwners > 2...
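
I.e. to tolerate a second failure while a rebalance is still in flight, you
need at least three copies of each segment; with the programmatic
configuration API that would be something like:

   import org.infinispan.configuration.cache.CacheMode;
   import org.infinispan.configuration.cache.Configuration;
   import org.infinispan.configuration.cache.ConfigurationBuilder;

   public class NumOwnersConfig {
      public static Configuration threeOwners() {
         // numOwners = 3 keeps two backups per segment, so a second
         // node failure during state transfer doesn't lose data
         return new ConfigurationBuilder()
               .clustering().cacheMode(CacheMode.DIST_SYNC)
               .hash().numOwners(3)
               .build();
      }
   }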
> In short, it's good to see that the state transfer doesn't take too
> long - as data might be lost if multiple failures happen in that
> timespan - but I definitely think we should *not* have minimizing this
> time as a goal; the goal should be to minimize response time. Ideally
> this should be as good during state transfer as when we're stable:
> clients expect some degree of reliability measured in terms of maximum
> & average response times.
The response time is definitely not going to be the same during state
transfers, except maybe for reads. But we do need to measure how bad it is.
Cheers
Dan