<br><div class="gmail_quote">On Fri, Oct 12, 2012 at 7:05 PM, Sanne Grinovero <span dir="ltr"><<a href="mailto:sanne@infinispan.org" target="_blank">sanne@infinispan.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi all,<br>
during the design document review I remember having some objections about<br>
the wording used around "performance", "time" and "state transfer",<br>
but since I then stopped following this aspect I don't know how far<br>
the implementation picked this up.<br>
<br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I have been pointing out that the goal should not be to minimize<br>
the time it takes to complete the state transfer, but to minimize the<br>
performance impact during state transfers.<br>
<br></blockquote><div><br>Yep, this is indeed the goal of NBST. Mircea's email is a bit confusing: he starts by saying we try to
minimize the impact on throughput during state transfer, but then he
switches to how long state transfer takes.<br><br>Because not all nodes
receive the rebalance start event at the same time, and because the
forwarding of commands to new owners is not optimal, I expect to see a
visible drop in throughput during state transfer. <br><br>Maybe we
should use something like "total drop in throughput compared to the
initial cluster from the moment a node joined to the moment it became
part of the read consistent hash (i.e. the joiner finished receiving
state + we finished our bookkeeping on the coordinator)".<br><br>I'm not
sure how easy it would be to do this, but I'd also like to see how old
transactions (with topologyId < current topology id) affect
performance, because they need special handling vis-a-vis locking. So
perhaps we can also measure the throughput in the post-join cluster
while TransactionTable.minTxViewId (soon to be called minTxTopologyId)
< StateConsumerImpl.cacheTopology.topologyId and compare it to the
throughput in the cluster after everything has settled.<br><br></div>
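<div>To make that comparison concrete, here is a rough sketch (plain JDK, not tied to any Infinispan internals) of how a test driver could record throughput per phase. The class name, the phase names and the points at which the driver calls enterPhase() are all hypothetical; the driver would have to detect the phase boundaries itself, e.g. via a topology listener and by polling the transaction table.</div>
<pre>
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper for comparing throughput across the phases discussed
 * above: BASELINE (initial cluster), JOIN (node joined but not yet in the
 * read CH), OLD_TX (in the read CH but transactions older than the current
 * topology are still running) and SETTLED.
 */
public class PhaseThroughput {

   public enum Phase { BASELINE, JOIN, OLD_TX, SETTLED }

   private static class Sample {
      final Phase phase;
      final long ops;
      final long nanos;
      Sample(Phase phase, long ops, long nanos) {
         this.phase = phase;
         this.ops = ops;
         this.nanos = nanos;
      }
   }

   private final AtomicLong opsInPhase = new AtomicLong();
   private final List<Sample> samples = new ArrayList<Sample>();
   private Phase currentPhase = Phase.BASELINE;
   private long phaseStartNanos = System.nanoTime();

   /** Called by worker threads after every successful cache operation. */
   public void recordOp() {
      opsInPhase.incrementAndGet();
   }

   /** Called by the test driver when the cluster moves to the next phase. */
   public synchronized void enterPhase(Phase next) {
      long now = System.nanoTime();
      samples.add(new Sample(currentPhase, opsInPhase.getAndSet(0), now - phaseStartNanos));
      currentPhase = next;
      phaseStartNanos = now;
   }

   /** Prints ops/s per phase so the drop vs. BASELINE can be computed. */
   public synchronized void report() {
      enterPhase(currentPhase); // flush the phase that is still open
      for (Sample s : samples) {
         double seconds = s.nanos / 1e9;
         System.out.printf("%-8s %10d ops in %6.1fs = %8.0f ops/s%n",
               s.phase, s.ops, seconds, s.ops / seconds);
      }
   }
}
</pre>
<div>The interesting numbers would be the ops/s of the JOIN and OLD_TX phases relative to BASELINE, plus the total duration of those two phases.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">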
The design should be able to deal with multiple state transfers<br>
being triggered while the current one(s) haven't completed; this is doable as<br>
long as we don't abort partial state transfers and "move on" from the<br>
intermediate states towards the stable state. I think it's also easy<br>
to demonstrate that such a design minimizes the bytes flying around<br>
the network, as the intermediate states are certainly "closer" (in<br>
terms of distribution of data) to the end state than the initial<br>
state.<br>
<br></blockquote><div><br>We don't do this (yet). Actually the part about interrupting state transfers seemed the most tricky part in the "tree of cache views" NBST design, so in the last version we don't interrupt state transfer at all.<br>
<br>In order to really minimize transfers in such a design you'd have to compute the consistent hash based on the previous consistent hash *and* on what segments the new owners have already received - otherwise each incremental change could undo all the progress made during the previous state transfer. But I don't think we'll do that very soon...<br>
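<br>For illustration only, here is a naive sketch of what "prefer owners from the previous CH and nodes that already received the segment" could look like; this is not the actual ConsistentHashFactory code, Strings stand in for addresses, and balancing is ignored:<br>
<pre>
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Naive illustration: pick numOwners owners for one segment, preferring
 * nodes that already own it or that already received it during the
 * interrupted state transfer, so earlier progress is not thrown away.
 */
public class StickyOwnerChooser {

   public static List<String> chooseOwners(int numOwners,
                                           List<String> members,          // current cluster members
                                           List<String> previousOwners,   // owners in the previous CH
                                           Set<String> alreadyReceived) { // nodes that already got this segment
      Set<String> owners = new LinkedHashSet<String>();
      // 1. Keep previous owners that are still members.
      for (String node : previousOwners)
         if (members.contains(node) && owners.size() < numOwners) owners.add(node);
      // 2. Then reuse nodes that already received the segment mid-transfer.
      for (String node : alreadyReceived)
         if (members.contains(node) && owners.size() < numOwners) owners.add(node);
      // 3. Only then fall back to other members (capacity/balance is ignored here).
      for (String node : members)
         if (owners.size() < numOwners) owners.add(node);
      return new ArrayList<String>(owners);
   }
}
</pre>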
<br>Our solution at the moment is to ignore any joins or leaves during state transfer, and only start a new state transfer once the current one is done. This is possible because when the primary owner of a segment leaves, the first backup automatically becomes the new primary owner, so transactions can go on without any state transfer. In the end, I think the best solution to avoid wasted transfers is to allow the administrator to suspend rebalancing for a period of time.<br>
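<br>As a rough sketch of that "one rebalance at a time" policy (hypothetical code, not the real coordinator logic): membership changes that arrive while a rebalance is running are queued and coalesced, and the next rebalance only starts once every node has confirmed the current one.<br>
<pre>
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

/** Hypothetical coordinator-side gate that defers rebalances. */
public class RebalanceGate {

   private boolean rebalanceInProgress;
   private final Queue<List<String>> pendingMemberships = new ArrayDeque<List<String>>();

   /** Called when a node joins or leaves; members is the new member list. */
   public synchronized void onMembershipChange(List<String> members) {
      if (rebalanceInProgress) {
         // Coalesce: only the latest membership matters for the next rebalance.
         pendingMemberships.clear();
         pendingMemberships.add(members);
      } else {
         startRebalance(members);
      }
   }

   /** Called when the last node confirms it has finished receiving state. */
   public synchronized void onRebalanceConfirmed() {
      rebalanceInProgress = false;
      List<String> next = pendingMemberships.poll();
      if (next != null) {
         startRebalance(next);
      }
   }

   private void startRebalance(List<String> members) {
      rebalanceInProgress = true;
      // broadcast a rebalance-start command with a CH computed for 'members' (omitted)
   }
}
</pre>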
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
A consequence is that the longer it takes to finalize a state<br>
transfer, the higher the chances to aggregate multiple view changes into<br>
fewer data shipping events, and so the less work is done.<br>
<br></blockquote><div><br>We do have this, but if a second node leaves while we're still transferring state, the user had better have configured numOwners > 2...<br></div>
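<div>For reference, that would be something along these lines with the programmatic configuration (assuming the 5.x ConfigurationBuilder API; the XML equivalent is the numOwners attribute on the hash element):</div>
<pre>
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class ThreeOwnersConfig {
   public static Configuration create() {
      // Distributed cache keeping 3 copies of each entry, so losing two
      // nodes during a single state transfer still leaves a live owner.
      return new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.DIST_SYNC)
            .hash().numOwners(3)
            .build();
   }
}
</pre>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">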
In short it's good to see the state transfer doesn't take too long -<br>
as data might be lost if multiple failures happen in that timespan -<br>
but I definitely think we should *not* have it as a goal to minimize<br>
this time; the goal should be to minimize response time. Ideally this<br>
should be as good during ST as when we're stable: clients expect some<br>
degree of reliability measured in terms of maximum & average response<br>
times.<br>
<br></blockquote><div><br>The response time is definitely not going to be the same during state transfers, except maybe for reads. But we do need to measure how bad it is.<br> <br></div></div>Cheers<br>Dan<br><br>