On Tue, Feb 25, 2014 at 9:31 PM, Vladimir Blagojevic <vblagoje@redhat.com> wrote:
Hey,

I am starting to like this thread more and more :-) In conclusion, for
distributed executors we are not adding any new APIs, because Callable
implementers can already write to a cache using the existing API. We don't
have to add an elaborate callback/listener API either, as users have
not requested one, but we should investigate a Hadoop Reporter-like
interface to give users some sense of a task's current execution phase.

For map/reduce we will add a new method:

public void execute(Cache<KOut, VOut> resultsCache);

Using fluent MapReduceTask API users would be able to specify an
intermediate cache:

public MapReduceTask<KIn, VIn, KOut, VOut> usingIntermediateCache(String cacheName);
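
To make the two proposed methods concrete, here is a toy, self-contained sketch of how they could be used together. SketchMapReduceTask is a hypothetical stand-in, not the real MapReduceTask: a plain Map plays the role of the results Cache, the "__tmp__" default name is made up, and the word count is only there to give execute() something to do.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the proposed API; NOT Infinispan's MapReduceTask.
class SketchMapReduceTask {
    private final Map<String, String> input;
    // Hypothetical default name, used only for illustration.
    private String intermediateCacheName = "__tmp__";

    SketchMapReduceTask(Map<String, String> input) {
        this.input = input;
    }

    // Mirrors the proposed fluent usingIntermediateCache(String cacheName).
    SketchMapReduceTask usingIntermediateCache(String cacheName) {
        this.intermediateCacheName = cacheName;
        return this;
    }

    String intermediateCacheName() {
        return intermediateCacheName;
    }

    // Mirrors the proposed execute(Cache<KOut, VOut> resultsCache):
    // results are written into the caller-supplied map, nothing is returned.
    void execute(Map<String, Integer> resultsCache) {
        for (String text : input.values()) {
            for (String word : text.split("\\s+")) {
                if (!word.isEmpty()) {
                    resultsCache.merge(word, 1, Integer::sum);
                }
            }
        }
    }
}
```

A caller would chain usingIntermediateCache("tmp") on the task and then pass its own results map to execute(), reading the counts from that map afterwards.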

We are not adding MapReduceTaskExecutionListener, but rather JMX stats
for MapReduce tasks in general, such as average execution time, count,
etc. The ability to cancel a running task through JMX/JON would also be
nice.

For statistics, I was thinking of adding a getStatistics() method to MapReduceTask that would return an object with the duration of each phase and the number of keys processed on each node, after the M/R task is done. This could probably be extended such that it gives the user in-progress information as well.
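
A rough sketch of what such a statistics object might look like, just to make the idea concrete. Everything here is hypothetical: the class name MapReduceStatistics, the record* methods, and the choice of milliseconds and string node names are my assumptions, not an existing API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical return type of getStatistics(); none of these names exist yet.
final class MapReduceStatistics {
    private final Map<String, Long> phaseDurationMillis = new LinkedHashMap<>();
    private final Map<String, Long> keysProcessedPerNode = new LinkedHashMap<>();

    // Called by the task as each phase finishes (phase names are free-form here).
    void recordPhase(String phase, long millis) {
        phaseDurationMillis.merge(phase, millis, Long::sum);
    }

    // Called as a node reports keys it has processed; repeated calls accumulate.
    void recordKeys(String node, long keys) {
        keysProcessedPerNode.merge(node, keys, Long::sum);
    }

    Map<String, Long> getPhaseDurationMillis() {
        return Collections.unmodifiableMap(phaseDurationMillis);
    }

    Map<String, Long> getKeysProcessedPerNode() {
        return Collections.unmodifiableMap(keysProcessedPerNode);
    }

    long getTotalDurationMillis() {
        long total = 0;
        for (long d : phaseDurationMillis.values()) total += d;
        return total;
    }

    long getTotalKeysProcessed() {
        long total = 0;
        for (long k : keysProcessedPerNode.values()) total += k;
        return total;
    }
}
```

Because recordKeys() accumulates, the same object could in principle be read while the task is still running, which is the in-progress extension mentioned above.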

The in-progress information would also tie in nicely with a progress listener, but I feel the events you proposed are too coarse. If the user wanted to display a progress bar in their application, and the cluster only had 2 nodes, the progress bar would hover around 0% for half of the time and around 50% for the other half. So we'd need to keep reporting something while a phase is in progress (e.g. by splitting a node's keys into more than one mapping task, and reporting the end of each subtask), otherwise the listener wouldn't be of much use.
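
The arithmetic behind the progress bar complaint can be sketched as follows. This is only an illustration: ProgressSketch and its methods are hypothetical, and it assumes every node (and every subtask) does an equal share of the work and reports once when it finishes.

```java
// Hypothetical helper showing how coarse per-node events look on a progress bar.
final class ProgressSketch {

    // Percentage shown once completedUnits of totalUnits equal units have finished.
    static int percent(int completedUnits, int totalUnits) {
        if (totalUnits <= 0) {
            throw new IllegalArgumentException("totalUnits must be positive");
        }
        return (int) (100L * completedUnits / totalUnits);
    }

    // Every value the bar can display when each node is split into
    // subtasksPerNode subtasks and one event is reported per finished subtask.
    static int[] reportedValues(int nodes, int subtasksPerNode) {
        int total = nodes * subtasksPerNode;
        int[] values = new int[total + 1];
        for (int done = 0; done <= total; done++) {
            values[done] = percent(done, total);
        }
        return values;
    }
}
```

With reportedValues(2, 1) the bar can only show 0, 50 and 100, which is the problem described above. With reportedValues(2, 10) it moves in steps of 5%.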

Anyway, this would be something nice to have, but I don't think it's very important, so supplying some global statistics via JMX should be enough for now.

Cheers
Dan