<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 10, 2014 at 6:49 PM, Emmanuel Bernard <span dir="ltr">&lt;<a href="mailto:emmanuel@hibernate.org" target="_blank">emmanuel@hibernate.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">When wrestling with the subject, here is what I had in mind.<br>

<br>

The M/R coordinator node sends the M task per segment on the node where<br>

the segment is primary.<br></blockquote><div><br></div><div>What&#39;s M? Is it just a shorthand for &quot;map&quot;, or is it a new parameter that controls the number of map/combine tasks sent at once?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

Each &quot;per-segment&quot; M task is executed and is offered the way to push<br>

intermediary results in a temp cache.<br></blockquote><div><br></div><div>Just to be clear, the user-provided mapper and combiner don&#39;t know anything about the intermediary cache (which doesn&#39;t have to be temporary, if it&#39;s shared by all M/R tasks). They only interact with the Collector interface.</div><div>The map/combine task on the other hand is our code, and it deals with the intermediary cache directly.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The intermediary results are stored with a composite key [imtermKey-i, seg-j].<br>

The M/R coordinator waits for all M tasks to return. If one does not<br>

(timeout, rehash), the following happens:<br></blockquote><div><br></div><div>We can&#39;t allow time out map tasks, or they will keep writing to the intermediate cache in parallel with the retried tasks. So the originator has to wait for a response from each node to which it sent a map task.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

- delete [intermKey-i, seg-i] (that operation could be handled by the<br>

  new per-segment M before the map task is effectively started)<br>

- ship the M task for that segment-i to the new primary owner of<br>

  segment-i<br>

<br>

When all M tasks are received the Reduce phase will read all [intermKey-i, *]<br>

keys and reduce them.<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Note that if the reduction phase is itself distributed, we could apply<br>

the same key per segment and shipping split for these.<br></blockquote><div><br></div><div>Sure, we have to retry reduce tasks when the primary owner changes, and it makes sense to retry as little as possible.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

Again the tricky part is to expose the ability to write to intermediary<br>

caches per segment without exposing segments per se as well as let<br>

someone see a concatenated view if intermKey-i from all segments subkeys<br>

during reduction.<br></blockquote><div><br></div><div>Writing to and reading from the intermediate cache is already abstracted from user code (in the Mapper and Reducer interfaces). So we don&#39;t need to worry about exposing extra details to the user.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

Thoughts?<br>

<br>

Dan, I did not quite get what alternative approach you wanted to<br>

propose. Care to respin it for a slow brain? :)<br></blockquote><div><br></div><div>I think where we differ is that I don&#39;t think user code needs to know about how we store the intermediate values and what we retry, as long as their mappers/combiners/reducers don&#39;t have side effects.</div><div><br></div><div>Otherwise I was thinking on the same lines: send 1 map/combine task for each segment (maybe with a cap on the number of segments being processed at the same time on each node), split the intermediate values per input segment, cancel+retry each map task if the topology changes and the executing node is no longer an owner. If the reduce phase is distributed, run 1 reduce task per segment as well, and cancel+retry the reduce task if the executing node is no longer an owner.</div><div><br></div><div>I had some ideas about assigning each map/combine phase a UUID and making the intermediate keys [intermKey, seg, mctask] to allow the originator to retry a map/combine task without waiting for the previous one to finish, but I don&#39;t think I mentioned that before :)<br></div><div><br></div><div>There are also some details that I&#39;m worried about:</div><div><br></div><div>1) If the reduce phase is distributed, and the intermediate cache is non-transactional, any topology change in the intermediate cache will require us to retry all the map/combine tasks that were running at the time on any node (even if some nodes did not detect the topology change yet). So it would make sense to limit the number of map/combine tasks that are processed at one time, in order to limit the amount of tasks we retry (OR require the intermediate cache to be transactional).</div><div><br></div><div>2) Running a separate map/combine task for each segment is not really an option until we implement the the segment-aware data container and cache stores. Without that change, it will make everything much slower, because of all the extra iterations for each segment.</div><div><br class="">3) And finally, all this will be overkill when the input cache is small, and the time needed to process the data is comparable to the time needed to send all those extra RPCs.<br></div><div><br></div><div>So I&#39;m thinking it might be better to adopt Vladimir&#39;s suggestion to retry everything if we detect a topology change in the input and/or intermediate cache at the end of the M/R task, at least in the first phase. </div><div><br></div><div>Cheers</div><div>Dan</div><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span class=""><font color="#888888"><br>

Emmanuel<br>

</font></span><div class=""><div class="h5"><br>

On Fri 2014-10-10 10:03, Dan Berindei wrote:<br>

&gt; &gt; &gt; I&#39;d rather not expose this to the user. Instead, we could split the<br>

&gt; &gt; &gt; intermediary values for each key by the source segment, and do the<br>

&gt; &gt; &gt; invalidation of the retried segments in our M/R framework (e.g. when we<br>

&gt; &gt; &gt; detect that the primary owner at the start of the map/combine phase is<br>

&gt; &gt; &gt; not an owner at all at the end).<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; I think we have another problem with the publishing of intermediary<br>

&gt; &gt; &gt; values not being idempotent. The default configuration for the<br>

&gt; &gt; &gt; intermediate cache is non-transactional, and retrying the put(delta)<br>

&gt; &gt; &gt; command after a topology change could add the same intermediate values<br>

&gt; &gt; &gt; twice. A transactional intermediary cache should be safe, though,<br>

&gt; &gt; &gt; because the tx won&#39;t commit on the old owner until the new owner knows<br>

&gt; &gt; &gt; about the tx.<br>

&gt; &gt;<br>

&gt; &gt; can you elaborate on it?<br>

&gt; &gt;<br>

&gt;<br>

&gt; say we have a cache with numOwners=2, owners(k) = [A, B]<br>

&gt; C will become the primary owner of k, but for now owners(k) = [A, B, C]<br>

&gt; O sends put(delta) to A (the primary)<br>

&gt; A sends put(delta) to B, C<br>

&gt; B sees a topology change (owners(k) = [C, B]), doesn&#39;t apply the delta and<br>

&gt; replies with an OutdatedTopologyException<br>

&gt; C applies the delta<br>

&gt; A resends put(delta) to C (new primary)<br>

&gt; C sends put(delta) to B, applies the delta again<br>

&gt;<br>

&gt; I think it could be solved with versions, I just wanted to point out that<br>

&gt; we don&#39;t do that now.<br>

&gt;<br>

&gt;<br>

&gt; &gt;<br>

&gt; &gt; anyway, I think the retry mechanism should solve it. If we detect a<br>

&gt; &gt; topology change (during the iteration of segment _i_) and the segment<br>

&gt; &gt; _i_ is moved, then we can cancel the iteration, remove all the<br>

&gt; &gt; intermediate values generated in segment _i_ and restart (on the primary<br>

&gt; &gt; owner).<br>

&gt; &gt;<br>

&gt;<br>

&gt; The problem is that the intermediate keys aren&#39;t in the same segment: we<br>

&gt; want the reduce phase to access only keys local to the reducing node, and<br>

&gt; keys in different input segments can yield values for the same intermediate<br>

&gt; key. So like you say, we&#39;d have to retry on every topology change in the<br>

&gt; intermediary cache, not just the ones affecting segment _i_.<br>

&gt;<br>

&gt; There&#39;s another complication: in the scenario above, O may only get the<br>

&gt; topology update with owners(k) = [C, B] after the map/combine phase<br>

&gt; completed. So the originator of the M/R job would have to watch for<br>

&gt; topology changes seen by any node, and invalidate/retry any input segments<br>

&gt; that could have been affected. All that without slowing down the<br>

&gt; no-topology-change case too much...<br>

&gt;<br>

&gt; &gt;<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt;     &gt;<br>

&gt; &gt; &gt;     &gt; But before getting ahead of ourselves, what do you thing of the<br>

&gt; &gt; general idea? Even without retry framework, this approach would be more<br>

&gt; &gt; stable than our current per node approach during topology changes and<br>

&gt; &gt; improve dependability.<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt;     Doing it solely based on segment would remove the possibility of<br>

&gt; &gt; &gt;     having duplicates.  However without a mechanism to send a new request<br>

&gt; &gt; &gt;     on rehash it would be possible to only find a subset of values (if a<br>

&gt; &gt; &gt;     segment is removed while iterating on it).<br>

</div></div><div class=""><div class="h5">_______________________________________________<br>

infinispan-dev mailing list<br>

<a href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a><br>

<a href="https://lists.jboss.org/mailman/listinfo/infinispan-dev" target="_blank">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a><br>

</div></div></blockquote></div><br></div></div>