<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Feb 24, 2014 at 10:55 PM, Vladimir Blagojevic <span dir="ltr"><<a href="mailto:vblagoje@redhat.com" target="_blank">vblagoje@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">See inline<br>
<div class="">On 2/24/2014, 12:57 PM, Mircea Markus wrote:<br>
> On Feb 19, 2014, at 8:45 PM, Vladimir Blagojevic <<a href="mailto:vblagoje@redhat.com">vblagoje@redhat.com</a>> wrote:<br>
><br>
>> Hey guys,<br>
>><br>
>> As some of you might know we have received additional requirements from<br>
>> community and internally to add a few things to dist.executors and<br>
>> map/reduce API. On distributed executors front we need to enable<br>
>> distributed executors to store results into cache directly rather than<br>
>> returning them to invoker [1]. As soon as we introduce this API we also<br>
>> need an async mechanism to allow notifications of subtask<br>
>> completion/failure.<br>
> I think we need both in at the same time :-)<br>
</div>Yes, that is what I actually meant. Poor wording.<br></blockquote><div><br></div><div>Do we really need special support for distributed tasks to write results to another cache? We already allow a task to do<br><br>
cache.getCacheManager().getCache("outputCache").put(k, v)<br><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div class="">><br>
>> I was thinking we add a concept of<br>
>> DistributedTaskExecutionListener which can be specified in<br>
>> DistributedTaskBuilder:<br>
>><br>
>> DistributedTaskBuilder<T><br>
>> executionListener(DistributedTaskExecutionListener<K, T> listener);<br>
>><br>
>><br>
>> We needed DistributedTaskExecutionListener anyway. All distributed tasks<br>
>> might use some feedback about task progress, completion/failure and so on.<br>
>> My proposal is roughly:<br>
>><br>
>><br>
>> public interface DistributedTaskExecutionListener<K, T> {<br>
>><br>
>> void subtaskSent(Address node, Set<K> inputKeys);<br>
>> void subtaskFailed(Address node, Set<K> inputKeys, Exception e);<br>
>> void subtaskSucceded(Address node, Set<K> inputKeys, T result);<br>
>> void allSubtasksCompleted();<br>
>><br>
>> }<br>
>><br>
>> So much for that.<br>
> I think it would make sense to add this logic for monitoring, plus additional info such as average execution time, etc. I'm not sure if this is a generally useful API though, unless there were people asking for it already?<br>
</div>Ok, noted. If you remember any references about this let me know and<br>
I'll incorporate what people actually asked for rather than guess.<br></blockquote><div><br></div><div>Ok, let's wait until we get some actual requests from users then. TBH I don't think distributed tasks with subtasks are something that users care about. E.g. with Map/Reduce the reduce tasks are not subtasks of the map/combine tasks, so this API wouldn't help.<br>
<br></div><div></div><div>Hadoop has a Reporter interface that allows you to report "ticks" and increment counters, maybe we should add something like that instead?<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div><div class="h5"><br>
><br>
>> If tasks do not use input keys these parameters would<br>
>> be empty sets. Now for [1] we need to add additional methods to<br>
>> DistributedExecutorService. We can not specify result cache in<br>
>> DistributedTaskBuilder as we are still bound to only submit methods in<br>
>> DistributedExecutorService that return futures and we don't want that.<br>
>> We need two new void methods:<br>
>><br>
>> <T, K> void submitEverywhere(DistributedTask<T> task,<br>
>> Cache<DistExecResultKey<K>, T> result);<br>
>> <T, K > void submitEverywhere(DistributedTask<T> task,<br>
>> Cache<DistExecResultKey<K>, T> result, K... input);<br>
>><br>
>><br>
>> Now, why bother with DistExecResultKey? Well we have tasks that use<br>
>> input keys and tasks that don't. So results cache could only be keyed by<br>
>> either keys or execution address, or combination of those two.<br>
>> Therefore, DistExecResultKey could be something like:<br>
>><br>
>> public interface DistExecResultKey<K> {<br>
>><br>
>> Address getExecutionAddress();<br>
>> K getKey();<br>
>><br>
>> }<br>
>><br>
>> If you have a better idea how to address this aspect let us know. So<br>
>> much for distributed executors.<br>
>><br></div></div></blockquote><div><br></div><div>I think we should allow each distributed task to deal with output in its own way, the existing API should be enough.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div><div class="h5">
>><br>
>> For map/reduce we also have to enable storing of map reduce task results<br>
>> into cache [2] and allow users to specify custom cache for intermediate<br>
>> results[3]. Part of task [2] is to allow notification about map/reduce<br>
>> task progress and completion. Just as in dist.executor I would add<br>
>> MapReduceTaskExecutionListener interface:<br>
>><br>
>><br>
>> public interface MapReduceTaskExecutionListener {<br>
>><br>
>> void mapTaskInitialized(Address executionAddress);<br>
>> void mapTaskSucceeded(Address executionAddress);<br>
>> void mapTaskFailed(Address executionTarget, Exception cause);<br>
>> void mapPhaseCompleted();<br>
>><br>
>> void reduceTaskInitialized(Address executionAddress);<br>
>> void reduceTaskSucceeded(Address executionAddress);<br>
>> void reduceTaskFailed(Address address, Exception cause);<br>
>> void reducePhaseCompleted();<br>
>><br>
>> }<br>
> IMO - in the first stage at least - I would rather use a simpler (Notifying)Future, on which the user can wait until the computation completes: it's simpler and more aligned with the rest of our async API.<br>
><br>
</div></div>What do you mean? We already have futures in MapReduceTask API. This API<br>
is more fine grained and allows monitoring/reporting of task progress.<br>
Please clarify.<br></blockquote><div><br></div><div>I'm not sure about the usefulness of an API like this either... if the intention is to allow the user to collect statistics about duration of various phases, then I think exposing the durations via MapReduceTasks would be better.<br>
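<br>To make the "expose the durations" idea concrete, here is a rough sketch of a per-phase timing holder. Everything in it (PhaseTimings, Phase, time, duration) is a hypothetical name made up for illustration, not existing Infinispan API, and it only assumes that the task could wrap each phase and keep the totals for the caller to read afterwards.<br>

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, NOT part of the Infinispan API: collects per-phase
// durations that a MapReduceTask-like object could expose once it has run.
class PhaseTimings {
    enum Phase { MAP, REDUCE }

    private final Map<Phase, Long> nanos = new EnumMap<>(Phase.class);

    // Runs the phase body and adds its elapsed time to that phase's total,
    // even if the body throws.
    void time(Phase phase, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            nanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Returns the accumulated duration of a phase, or 0 if it never ran.
    long duration(Phase phase, TimeUnit unit) {
        return unit.convert(nanos.getOrDefault(phase, 0L), TimeUnit.NANOSECONDS);
    }
}
```

A caller would then read something like timings.duration(Phase.MAP, TimeUnit.MILLISECONDS) after the task finishes, with no listener to implement.<br>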
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div class=""><br>
>> while MapReduceTask would have an additional method:<br>
>><br>
>> public void execute(Cache<KOut, VOut> resultsCache);<br>
> you could overload it with cache name only method.<br>
</div>Yeah, good idea. Same for usingIntermediateCache? I actually asked you<br>
this here <a href="https://issues.jboss.org/browse/ISPN-4021" target="_blank">https://issues.jboss.org/browse/ISPN-4021</a><br></blockquote><div><br></div><div>+1 to allow a cache name only. For the intermediate cache I don't think it makes sense to allow a Cache version at all.<br>
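<br>For the name-only overload, something along these lines is what I have in mind. This is only a sketch under loud assumptions: a plain Map stands in for Cache and a Function stands in for the cache manager lookup so that it compiles without Infinispan, and ResultStoringTask is a hypothetical name, not the real MapReduceTask.<br>

```java
import java.util.Map;
import java.util.function.Function;

// Stand-in sketch, NOT the real MapReduceTask: Map plays the role of Cache and
// the Function plays the role of looking a cache up by name.
class ResultStoringTask<K, V> {
    private final Function<String, Map<K, V>> cacheLookup;
    private final Map<K, V> computedResults;

    ResultStoringTask(Function<String, Map<K, V>> cacheLookup, Map<K, V> computedResults) {
        this.cacheLookup = cacheLookup;
        this.computedResults = computedResults;
    }

    // Form proposed in the thread: store the results in the given cache.
    void execute(Map<K, V> resultsCache) {
        resultsCache.putAll(computedResults);
    }

    // Name-only overload: resolve the cache by name, then delegate.
    void execute(String resultsCacheName) {
        Map<K, V> cache = cacheLookup.apply(resultsCacheName);
        if (cache == null) {
            throw new IllegalArgumentException("No cache named " + resultsCacheName);
        }
        execute(cache);
    }
}
```

The overload only resolves the name and delegates, so both entry points behave the same once the cache is found.<br>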
</div><div> <br></div></div><br></div></div>