<br><div class="gmail_quote">On Fri, Oct 12, 2012 at 7:05 PM, Sanne Grinovero <span dir="ltr"><<a href="mailto:sanne@infinispan.org" target="_blank">sanne@infinispan.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi all,<br>
during the design document review I remember having some objections about<br>
the wording used around "performance", "time" and "state transfer",<br>
but since I then stopped following this aspect I don't know how far<br>
the implementation picked this up.<br>
<br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I have been pointing out that the goal should not be to minimize<br>
the time it takes to complete the state transfer, but to minimize the<br>
performance impact during state transfers.<br>
<br></blockquote><div><br>Yep, this is indeed the goal of NBST. Mircea's email is a bit confusing: he starts by saying we try to
minimize the impact on throughput during state transfer, but then he
switches to how long state transfer takes.<br><br>Because not all nodes
receive the rebalance start event at the same time, and because the
forwarding of commands to new owners is not optimal, I expect to see a
visible drop in throughput during state transfer. <br><br>Maybe we
should use something like "total drop in throughput compared to the
initial cluster from the moment a node joined to the moment it became
part of the read consistent hash (i.e. the joiner finished receiving
state + we finished our bookkeeping on the coordinator)".<br><br>I'm not
sure how easy it would be to do this, but I'd also like to see how old
transactions (with topologyId < current topology id) affect
performance, because they need special handling vis-a-vis locking. So
perhaps we can also measure the throughput in the post-join cluster
while TransactionTable.minTxViewId (soon to be called minTxTopologyId)
< StateConsumerImpl.cacheTopology.topologyId and compare it to the
throughput in the cluster after everything has settled.<br><br></div>
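<div>To make that comparison concrete, here is a rough sketch (plain JDK, not tied to any Infinispan internals) of how a test driver could record throughput per phase. The class name, the phase names and the points at which the driver calls enterPhase() are all hypothetical; the driver would have to detect the phase boundaries itself, e.g. via a topology listener and by polling the transaction table.</div>
<pre>
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper for comparing throughput across the phases discussed
 * above: BASELINE (initial cluster), JOIN (node joined but not yet in the
 * read CH), OLD_TX (in the read CH but transactions older than the current
 * topology are still running) and SETTLED.
 */
public class PhaseThroughput {

   public enum Phase { BASELINE, JOIN, OLD_TX, SETTLED }

   private static class Sample {
      final Phase phase;
      final long ops;
      final long nanos;
      Sample(Phase phase, long ops, long nanos) {
         this.phase = phase;
         this.ops = ops;
         this.nanos = nanos;
      }
   }

   private final AtomicLong opsInPhase = new AtomicLong();
   private final List<Sample> samples = new ArrayList<Sample>();
   private Phase currentPhase = Phase.BASELINE;
   private long phaseStartNanos = System.nanoTime();

   /** Called by worker threads after every successful cache operation. */
   public void recordOp() {
      opsInPhase.incrementAndGet();
   }

   /** Called by the test driver when the cluster moves to the next phase. */
   public synchronized void enterPhase(Phase next) {
      long now = System.nanoTime();
      samples.add(new Sample(currentPhase, opsInPhase.getAndSet(0), now - phaseStartNanos));
      currentPhase = next;
      phaseStartNanos = now;
   }

   /** Prints ops/s per phase so the drop vs. BASELINE can be computed. */
   public synchronized void report() {
      enterPhase(currentPhase); // flush the phase that is still open
      for (Sample s : samples) {
         double seconds = s.nanos / 1e9;
         System.out.printf("%-8s %10d ops in %6.1fs = %8.0f ops/s%n",
               s.phase, s.ops, seconds, s.ops / seconds);
      }
   }
}
</pre>
<div>The interesting numbers would be the ops/s of the JOIN and OLD_TX phases relative to BASELINE, plus the total duration of those two phases.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">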
The design should be able to deal with multiple state transfers<br>
being triggered while the current one(s) haven't completed; this is doable as<br>
long as we don't abort partial state transfers and "move on" from the<br>
intermediate states towards the stable state. I think it's also easy<br>
to demonstrate that such a design minimizes the bytes flying around<br>
the network, as the intermediate states are certainly "closer" (in<br>
terms of distribution of data) to the end state than the initial<br>
state.<br>
<br></blockquote><div><br>We don't do this (yet). Actually the part about interrupting state transfers seemed the most tricky part in the "tree of cache views" NBST design, so in the last version we don't interrupt state transfer at all.<br>
<br>In order to really minimize transfers in such a design you'd have to compute the consistent hash based on the previous consistent hash *and* on what segments the new owners have already received - otherwise each incremental change could undo all the progress made during the previous state transfer. But I don't think we'll do that very soon...<br>
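<br>For illustration only, here is a naive sketch of what "prefer owners from the previous CH and nodes that already received the segment" could look like; this is not the actual ConsistentHashFactory code, Strings stand in for addresses, and balancing is ignored:<br>
<pre>
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Naive illustration: pick numOwners owners for one segment, preferring
 * nodes that already own it or that already received it during the
 * interrupted state transfer, so earlier progress is not thrown away.
 */
public class StickyOwnerChooser {

   public static List<String> chooseOwners(int numOwners,
                                           List<String> members,          // current cluster members
                                           List<String> previousOwners,   // owners in the previous CH
                                           Set<String> alreadyReceived) { // nodes that already got this segment
      Set<String> owners = new LinkedHashSet<String>();
      // 1. Keep previous owners that are still members.
      for (String node : previousOwners)
         if (members.contains(node) && owners.size() < numOwners) owners.add(node);
      // 2. Then reuse nodes that already received the segment mid-transfer.
      for (String node : alreadyReceived)
         if (members.contains(node) && owners.size() < numOwners) owners.add(node);
      // 3. Only then fall back to other members (capacity/balance is ignored here).
      for (String node : members)
         if (owners.size() < numOwners) owners.add(node);
      return new ArrayList<String>(owners);
   }
}
</pre>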
<br>Our solution at the moment is to ignore any joins or leaves during state transfer, and only start a new state transfer once the current one is done. This is possible because when the primary owner of a segment leaves, the first backup automatically becomes the new primary owner, so transactions can go on without any state transfer. In the end, I think the best solution to avoid wasted transfers is to allow the administrator to suspend rebalancing for a period of time.<br>
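<br>As a rough sketch of that "one rebalance at a time" policy (hypothetical code, not the real coordinator logic): membership changes that arrive while a rebalance is running are queued and coalesced, and the next rebalance only starts once every node has confirmed the current one.<br>
<pre>
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

/** Hypothetical coordinator-side gate that defers rebalances. */
public class RebalanceGate {

   private boolean rebalanceInProgress;
   private final Queue<List<String>> pendingMemberships = new ArrayDeque<List<String>>();

   /** Called when a node joins or leaves; members is the new member list. */
   public synchronized void onMembershipChange(List<String> members) {
      if (rebalanceInProgress) {
         // Coalesce: only the latest membership matters for the next rebalance.
         pendingMemberships.clear();
         pendingMemberships.add(members);
      } else {
         startRebalance(members);
      }
   }

   /** Called when the last node confirms it has finished receiving state. */
   public synchronized void onRebalanceConfirmed() {
      rebalanceInProgress = false;
      List<String> next = pendingMemberships.poll();
      if (next != null) {
         startRebalance(next);
      }
   }

   private void startRebalance(List<String> members) {
      rebalanceInProgress = true;
      // broadcast a rebalance-start command with a CH computed for 'members' (omitted)
   }
}
</pre>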
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
A consequence is that the longer it takes to finalize a state<br>
transfer, the higher the chances to aggregate multiple view changes into<br>
fewer data shipping events, and so the less work is done.<br>
<br></blockquote><div><br>We do have this, but if a second node leaves while we're still transferring state, the user had better have configured numOwners > 2...<br></div>
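<div>For reference, that would be something along these lines with the programmatic configuration (assuming the 5.x ConfigurationBuilder API; the XML equivalent is the numOwners attribute on the hash element):</div>
<pre>
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class ThreeOwnersConfig {
   public static Configuration create() {
      // Distributed cache keeping 3 copies of each entry, so losing two
      // nodes during a single state transfer still leaves a live owner.
      return new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.DIST_SYNC)
            .hash().numOwners(3)
            .build();
   }
}
</pre>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">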
In short it's good to see the state transfer doesn't take too long -<br>
as data might be lost if multiple failures happen in that timespan -<br>
but I definitely think we should *not* have it as a goal to minimize<br>
this time; the goal should be to minimize response time. Ideally this<br>
should be as good during ST as when we're stable: clients expect some<br>
degree of reliability measured in terms of maximum & average response<br>
times.<br>
<br></blockquote><div><br>The response time is definitely not going to be the same during state transfers, except maybe for reads. But we do need to measure how bad it is.<br> <br></div></div>Cheers<br>Dan<br><br>