[infinispan-dev] State transfer performance

Dan Berindei dan.berindei at gmail.com
Sun Oct 14 09:40:35 EDT 2012


On Fri, Oct 12, 2012 at 7:05 PM, Sanne Grinovero <sanne at infinispan.org> wrote:

> Hi all,
> during design documents review I remember having some objections on
> the wording used between "performance" "time" and "state transfer",
> but since I then stopped following this aspect I don't know how far
> the implementation picked this up.
>
>
> I have been pointing out that the goal should not be to minimize
> the time it takes to complete the state transfer, but to minimize the
> performance impact during state transfers.
>
>
Yep, this is indeed the goal of NBST. Mircea's email is a bit confusing: he
starts by saying we try to minimize the impact on throughput during state
transfer, but then he switches to how long state transfer takes.

Because not all nodes receive the rebalance start event at the same time,
and because the forwarding of commands to new owners is not optimal, I
expect to see a visible drop in throughput during state transfer.

Maybe we should define it as something like "the total drop in throughput,
compared to the initial cluster, from the moment a node joins to the moment it
becomes part of the read consistent hash (i.e. the joiner has finished
receiving state and the coordinator has finished its bookkeeping)".

I'm not sure how easy it would be to do this, but I'd also like to see how
old transactions (with topologyId < current topology id) affect
performance, because they need special handling vis-a-vis locking. So
perhaps we can also measure the throughput in the post-join cluster while
TransactionTable.minTxViewId (soon to be called minTxTopologyId) <
StateConsumerImpl.cacheTopology.topologyId and compare it to the throughput
in the cluster after everything has settled.
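
Something along these lines is what I have in mind for the measurement
harness; just a sketch, and isOldTxWindowOpen() is a placeholder for the
minTxTopologyId < topologyId check, which we'd still have to expose somehow
(e.g. via JMX):

    import java.util.concurrent.atomic.AtomicLong;
    import org.infinispan.Cache;

    public class ThroughputProbe {
        private final AtomicLong opsDuringWindow = new AtomicLong();
        private final AtomicLong nanosDuringWindow = new AtomicLong();
        private final AtomicLong opsAfterSettled = new AtomicLong();
        private final AtomicLong nanosAfterSettled = new AtomicLong();

        public void runLoad(Cache<String, String> cache, int iterations) {
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                cache.put("key-" + (i % 1000), "value-" + i);
                long elapsed = System.nanoTime() - start;
                // Attribute each operation to the phase it ran in.
                if (isOldTxWindowOpen()) {
                    opsDuringWindow.incrementAndGet();
                    nanosDuringWindow.addAndGet(elapsed);
                } else {
                    opsAfterSettled.incrementAndGet();
                    nanosAfterSettled.addAndGet(elapsed);
                }
            }
        }

        /** Operations per second while old transactions were still around. */
        public double throughputDuringWindow() {
            long nanos = nanosDuringWindow.get();
            return nanos == 0 ? 0.0 : opsDuringWindow.get() * 1e9 / nanos;
        }

        // Placeholder for "minTxTopologyId < cacheTopology.topologyId"; we
        // would have to expose that from the transaction table / state
        // consumer before a benchmark can read it.
        private boolean isOldTxWindowOpen() {
            return false;
        }
    }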



> The design should be able to deal with multiple state transfers
> being triggered while the current one(s) haven't completed; this is doable as
> long as we don't abort partial state transfers and "move on" from the
> intermediate states towards the stable state. I think it's also easy
> to demonstrate that such a design minimizes the bytes flying around
> the network, as the intermediate states are certainly "closer" (in
> terms of distribution of data) to the end state than the initial
> state.
>
>
We don't do this (yet). Actually, interrupting in-progress state transfers
seemed like the trickiest part of the "tree of cache views" NBST design, so in
the latest version we don't interrupt state transfer at all.

In order to really minimize transfers in such a design you'd have to
compute the consistent hash based on the previous consistent hash *and* on
what segments the new owners have already received - otherwise each
incremental change could undo all the progress made during the previous
state transfer. But I don't think we'll do that very soon...
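
Just to make that concrete, the rebalance step would need both inputs,
something along these lines (completely hypothetical, this is not the
existing ConsistentHashFactory contract):

    // Hypothetical sketch only, NOT the current ConsistentHashFactory API.
    // The point is that the rebalance step would have to know which segments
    // the pending owners already received, so an incremental change doesn't
    // throw that work away.
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import org.infinispan.distribution.ch.ConsistentHash;
    import org.infinispan.remoting.transport.Address;

    interface IncrementalRebalanceFunction {
        /**
         * @param currentCH        the consistent hash we are rebalancing from
         * @param newMembers       the members after the latest join/leave
         * @param receivedSegments for each node, the segments it has already
         *                         finished receiving in the in-progress transfer
         * @return a balanced CH that prefers owners which already hold the data
         */
        ConsistentHash rebalance(ConsistentHash currentCH,
                                 List<Address> newMembers,
                                 Map<Address, Set<Integer>> receivedSegments);
    }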

Our solution at the moment is to ignore any joins or leaves during state
transfer, and only start a new state transfer once the current one is done.
This is possible because when the primary owner of a segment leaves, the
first backup automatically becomes the new primary owner, so transactions
can go on without any state transfer. In the end, I think the best solution
to avoid wasted transfers is to allow the administrator to suspend
rebalancing for a period of time.
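
For illustration only, the switch could be as simple as something like this
on the coordinator side (nothing like it exists yet, and the JMX wiring is
hand-waved):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical admin switch, only to illustrate the proposal above.
    // While the gate is closed the coordinator would just record topology
    // changes; when it reopens, a single rebalance towards the latest
    // topology is enough.
    public class RebalanceGate {
        private final AtomicBoolean suspended = new AtomicBoolean(false);

        /** Called by the administrator, e.g. via a JMX operation we'd add. */
        public void suspendRebalancing() {
            suspended.set(true);
        }

        /** Re-enables rebalancing for new joins/leaves. */
        public void resumeRebalancing() {
            suspended.set(false);
        }

        /** The coordinator would check this before triggering a rebalance. */
        public boolean isRebalancingAllowed() {
            return !suspended.get();
        }
    }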


> A consequence is that the longer it takes to finalize a state
> transfer, the higher the chances of aggregating multiple view changes
> into fewer data-shipping events, and the less work is done overall.
>
>
We do have this, but if a second node leaves while we're still transferring
state, the user had better have configured numOwners > 2...
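
For reference, that's the numOwners setting on the hash configuration; with
the programmatic API it would look roughly like this (the XML equivalent is
the numOwners attribute on the hash element):

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;

    public class NumOwnersExample {
        public static Configuration threeOwners() {
            // Keep three copies of each entry, so that two nodes leaving
            // during the same state transfer still leave one owner.
            return new ConfigurationBuilder()
                  .clustering()
                     .cacheMode(CacheMode.DIST_SYNC)
                     .hash().numOwners(3)
                  .build();
        }
    }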


> In short it's good to see that the state transfer doesn't take too long -
> as data might be lost if multiple failures happen in that timespan -
> but I definitely think we should *not* have it as a goal to minimize
> this time; the goal should be to minimize response time. Ideally this
> should be as good during ST as when we're stable: clients expect some
> degree of reliability measured in terms of maximum & average response
> times.
>
>
The response time is definitely not going to be the same during state
transfers, except maybe for reads. But we do need to measure how bad it is.

Cheers
Dan
