Re: [infinispan-dev] Distributed execution framework - API proposal(s)

Tuesday, 11 January 2011

On 11 Jan 2011, at 15:02, Vladimir Blagojevic wrote:

...
 On 11-01-11 11:37 AM, Manik Surtani wrote:
> What's an intermediate key?  :-)  For each K, V pair, we call map(K, V), which
> transforms V to a T.  So all we expect from map() should be T.
> 
>>> * Reducer.reduce()'s return type should not necessarily be the same
>>> type as the mapped transformation.  Also, does the reducer really need
>>> the key? Surely just the transformed version of each K/V pair.
>> Yeah, when we solve the above this should fall into place.
> Yep.  So we then just call reduce(T) for each value of T that we get from the mapping
> phase (along with the reduced result R from the previous invocation).  We may also
> need to pass in the key for which the mapping took place, in which case
> reduce(T, R) would look like reduce(K, T, R).  But TBH I don't really see the
> use of passing in K.
> 
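
A minimal sketch of the simple model described above, where map(K, V) yields a single T and reduce(T, R) folds each T into the previous result. The Mapper and Reducer interfaces, the run() helper, and the word-count example are all hypothetical names for illustration, not the proposed Infinispan API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleMapReduceSketch {

    /** Hypothetical: transforms one cache entry (K, V) into a single T. */
    interface Mapper<K, V, T> {
        T map(K key, V value);
    }

    /** Hypothetical: folds one mapped value into the running result R. */
    interface Reducer<T, R> {
        R reduce(T mapped, R previous);
    }

    /** Calls map() once per entry and reduce() once per mapped value. */
    static <K, V, T, R> R run(Map<K, V> entries, Mapper<K, V, T> mapper,
                              Reducer<T, R> reducer, R initial) {
        R result = initial;
        for (Map.Entry<K, V> e : entries.entrySet()) {
            T mapped = mapper.map(e.getKey(), e.getValue());
            result = reducer.reduce(mapped, result);
        }
        return result;
    }

    /** Example task: map each value to its word count, then sum the counts. */
    static int countWords(Map<String, String> entries) {
        Mapper<String, String, Integer> wordCount = (k, v) -> v.split("\\s+").length;
        Reducer<Integer, Integer> sum = (count, total) -> total + count;
        return run(entries, wordCount, sum, 0);
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("a", "hello world");
        entries.put("b", "distributed execution framework");
        System.out.println(countWords(entries));
    }
}
```

Note that in this shape the number of reduce() calls equals the number of map() calls, and the key is not needed by the reducer.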

 The way I understood map/reduce is that for every map(k1,v1) we get a
 transformed result k2,v2. Then we match all k2's that are identical and 
 invoke reduce(k2, collection(v2)) once for each identical k2. If we 
 follow your suggestion then the number of invocations in the map phase is
 the same as in the reduce phase, which does not make sense - that is not
 map/reduce.

 Would you please have a look at chapter 2 of 
 http://labs.google.com/papers/mapreduce.html
 Maybe I completely lost it? :-) 
You can do that but, even as the paper suggests, you usually end up with just 1 result.
Essentially it is a question of how simple we want to make the API.  I think for a large
number of the use cases we've come across, mapping a K and V to a single T works fine
(this is the simple value mapping you see in many functional languages).  The alternative is
to map an entire tuple (e.g., k1, v1 becomes k2, v2), but as Java doesn't support
multiple return values, you end up with Mapper.map() returning a tuple
(Map.Entry<K2, V2> in Java), which I find clunky.

Perhaps, if there is adequate demand, we could extend Mapper with TupleMapper, which would
support this.
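
For comparison, here is a rough sketch of what the tuple-based alternative might look like, with map() returning a Map.Entry<K2, V2> and reduce() invoked once per distinct k2. TupleMapper is only the name floated above; its signature, the GroupReducer interface, and the run() helper are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TupleMapperSketch {

    /** Hypothetical: maps (k1, v1) to an intermediate (k2, v2) pair. */
    interface TupleMapper<K1, V1, K2, V2> {
        Map.Entry<K2, V2> map(K1 key, V1 value);
    }

    /** Hypothetical: called once per distinct k2 with all of its v2 values. */
    interface GroupReducer<K2, V2, R> {
        R reduce(K2 key, List<V2> values);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, R> Map<K2, R> run(
            Map<K1, V1> entries,
            TupleMapper<K1, V1, K2, V2> mapper,
            GroupReducer<K2, V2, R> reducer) {
        // Map phase: one map() call per entry, grouped by intermediate key.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> e : entries.entrySet()) {
            Map.Entry<K2, V2> t = mapper.map(e.getKey(), e.getValue());
            grouped.computeIfAbsent(t.getKey(), k -> new ArrayList<>()).add(t.getValue());
        }
        // Reduce phase: one reduce() call per distinct intermediate key.
        Map<K2, R> results = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> g : grouped.entrySet()) {
            results.put(g.getKey(), reducer.reduce(g.getKey(), g.getValue()));
        }
        return results;
    }

    /** Example task: count how many values have each length. */
    static Map<Integer, Integer> countByLength(Map<String, String> entries) {
        TupleMapper<String, String, Integer, Integer> byLength =
                (k, v) -> new SimpleEntry<>(v.length(), 1);
        GroupReducer<Integer, Integer, Integer> sum =
                (len, ones) -> ones.stream().mapToInt(Integer::intValue).sum();
        return run(entries, byLength, sum);
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("a", "foo");
        entries.put("b", "bar");
        entries.put("c", "quux");
        System.out.println(countByLength(entries));
    }
}
```

The extra type parameters and the Map.Entry wrapper are the cost of this variant compared with mapping to a single T.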

...
>>> * Similarly, Collator.add() should just need the address and the
>>> reduced result from each node (each node would only produce 1 result!)
>>> 
>>> DistExec
>>> * Why do you have execute() and executeAsync() with no params?  What
>>> do these methods do?
>> In case the user provided a Factory for DistributedCallable. We have to
>> handle cases where a simple no-parameter constructor for callables is not
>> sufficient - hence the factory.
> Why are we constructing DistributedCallables?  Surely the user passes in an
> instance?
> 

 Ok, so we need to serialize/migrate this instance of DistributedCallable
 across JVMs and invoke it on a target Infinispan node, right? If 
 so, then we need to add K... input as a parameter to the call function,
 since each migrated instance (on the execution node) needs to have
 different input for invocation. 
Yeah good point.  My bad.  :)  We'd need to document this appropriately in the
DistributedCallable javadocs.
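
To illustrate the point, a rough sketch assuming a hypothetical DistributedCallable<K, T> whose call() takes K... input. The same user-supplied instance is copied to each node (simulated here with plain Java serialization) and each copy is invoked with that node's own keys. The interface shape, the migrate() helper, and KeyLengthSum are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class DistributedCallableSketch {

    /** Hypothetical shape: serializable task, per-node input passed to call(). */
    interface DistributedCallable<K, T> extends Serializable {
        @SuppressWarnings("unchecked")
        T call(K... input);
    }

    /** Example task: sums the lengths of the keys it is given. */
    static class KeyLengthSum implements DistributedCallable<String, Integer> {
        private static final long serialVersionUID = 1L;

        @Override
        public Integer call(String... input) {
            int total = 0;
            for (String key : input) {
                total += key.length();
            }
            return total;
        }
    }

    /** Stand-in for shipping the task to another JVM: serialize, then deserialize. */
    static <K, T> DistributedCallable<K, T> migrate(DistributedCallable<K, T> task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                DistributedCallable<K, T> copy = (DistributedCallable<K, T>) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not copy task", e);
        }
    }

    public static void main(String[] args) {
        DistributedCallable<String, Integer> task = new KeyLengthSum();
        // One user-supplied instance, different input on each (simulated) node.
        int nodeA = migrate(task).call("k1", "k2");
        int nodeB = migrate(task).call("key-3");
        System.out.println(nodeA + " " + nodeB);
    }
}
```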

...
 Also, isn't the name DistributedRemoteTask redundant?  :-)
> 
 Just DistributedTask should suffice. 
+1.

--
Manik Surtani
manik(a)jboss.org
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org
