[infinispan-dev] Comments on latest iteration of Distributed Task Execution design

Fri Jul 30 11:36:08 EDT 2010

Hey Manik,

On 2010-07-30, at 11:07 AM, Manik Surtani wrote:

> Vladimir,
> 
> Here are some thoughts on your latest iteration (Version 8) of this page:
> 
> 	http://community.jboss.org/wiki/InfinispanDistributedExecutionFramework
> 
> There are actually 2 distinct use cases that this feature would touch: 
> 
> 	1) pushing a Callable to the node where state is located and pull back a result
> 	2) breaking up a task into Callables and then pushing the Callables out as per (1).
> 
> Now only (2) is true map/reduce (and relevant to fork/join) but for many cases, (1) alone is enough.  So whatever API we propose should support both.  Simple remote code execution to leverage data locality where the user doesn't care about breaking up tasks into subtasks, as well as proper task decomposition.

Yes, that is the intention: build support for 2) but allow 1). I will add interfaces to define 1). I wanted to get some consensus on how 2) should look and then focus on 1).

> 
> But we should be clear about these differences - even though on an impl level they are very closely related.  The stuff you have so far handles case (2) pretty well but we should also expose (1).

Will make it clear.

> 
> Some other feedback:
> 
> * I presume distributed task monitoring and annotations are sections that come under "outside of scope"?  They seem to be on the same level so I wasn't sure of your intentions here.

OK, will make this distinction clearer.

> * Proposed interfaces - Not sure if I understand the purpose of DistributedCallable#mapped().  You already assign a cache to the callable in DistributedCallable#initialized(), right?

The intent of the initialized() call is to bootstrap the task into the environment at the master node, i.e. the node that invokes the distribution of callables. Its purpose is twofold. First, the task implementor can look up the input passed to the DistributedCallable constructor and interpret it (e.g. which keys to use, which particular cache, etc.). Second, the task informs the execution environment which cache it will use for data input, so that the environment can pass the proper ConsistentHash into preferredExecutionNodes. I wanted to spare the task implementor from digging deep into Infinispan internals to figure out how to fulfill the preferredExecutionNodes contract. preferredExecutionNodes is invoked while the task is still at the master node, before migration to an execution node.

The purpose of mapped() is to signal to the task that it has been migrated to an execution node and that execution is about to begin. Here implementors would actually load the data that is local to that JVM and needed for the task's computation.
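
To make the order of the calls concrete, here is a rough sketch of the lifecycle described above. It is only an illustration: the method signatures, the KeyLocator type (a stand-in for ConsistentHash), SumTask, and the node names are all assumptions, not what is on the wiki page. Migration is simulated by calling mapped() once per node inside a single JVM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical stand-in for ConsistentHash: maps a key to the name of its owner node.
interface KeyLocator {
    String ownerOf(Object key);
}

// Assumed shape of the callable discussed in this thread; signatures are illustrative.
interface DistributedCallable<K, V, R> extends Callable<R> {
    // Master node: interpret constructor input, return the name of the input cache.
    String initialized();
    // Master node, before migration: choose target nodes using the supplied locator.
    List<String> preferredExecutionNodes(KeyLocator locator);
    // Execution node: the task has arrived; load the data local to this node.
    void mapped(Map<K, V> localData);
}

// Example task: sums the values of the requested keys that live on the current node.
class SumTask implements DistributedCallable<String, Integer, Integer> {
    private final List<String> keys;
    private int localSum;

    SumTask(List<String> keys) { this.keys = keys; }

    public String initialized() { return "numbers"; }

    public List<String> preferredExecutionNodes(KeyLocator locator) {
        List<String> nodes = new ArrayList<>();
        for (String key : keys) {
            String owner = locator.ownerOf(key);
            if (!nodes.contains(owner)) nodes.add(owner);
        }
        return nodes;
    }

    public void mapped(Map<String, Integer> localData) {
        localSum = 0;
        for (String key : keys) {
            Integer value = localData.get(key);
            if (value != null) localSum += value;
        }
    }

    public Integer call() { return localSum; }
}

class LifecycleSketch {
    static int run() {
        // Two simulated nodes, each holding part of the data.
        Map<String, Integer> a = new HashMap<>();
        a.put("k1", 1);
        a.put("k2", 2);
        Map<String, Integer> b = new HashMap<>();
        b.put("k3", 3);
        Map<String, Map<String, Integer>> cluster = new HashMap<>();
        cluster.put("nodeA", a);
        cluster.put("nodeB", b);
        KeyLocator locator = key -> a.containsKey(key) ? "nodeA" : "nodeB";

        SumTask task = new SumTask(Arrays.asList("k1", "k3"));
        task.initialized();                        // step 1, at the master node
        int total = 0;
        for (String node : task.preferredExecutionNodes(locator)) { // step 2, still at master
            task.mapped(cluster.get(node));        // step 3, at the execution node
            total += task.call();                  // step 4, compute on local data
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the split is that SumTask never touches the hashing internals itself; it only consumes whatever locator the environment hands to it.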

> * DistributedCallable#preferredExecutionNodes() - do we really want to support this at this stage?  Or is it better to not support arbitrary node selection by end-user code?  

A task can still do arbitrary node selection in DistributedTask#map(). This goes toward fulfilling use case 1) from above.

> Simpler would be to add a DistributedCallable#getRelatedKeys() which returns a Set of keys which the callable would be expected to touch, so we can decide where to route the task.  Or maybe you want to offer both forms, so that a DistributedCallable could *either* provide a set of nodes to execute on or a set of keys which it expects to touch.

I still like the original approach better :) Maybe it makes more sense with this elaboration now....
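
For comparison, the "either nodes or keys" form suggested in the quoted text could be sketched roughly as follows. Everything here is hypothetical: RoutableCallable, Router, and the modulo-based ownerOf are illustrative stand-ins and not Infinispan APIs; a real implementation would consult the cache's consistent hash.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;

// Hypothetical callable that may name target nodes, or the keys it expects to touch.
interface RoutableCallable<R> extends Callable<R> {
    default Set<Object> getRelatedKeys() { return Collections.emptySet(); }
    default Set<String> preferredExecutionNodes() { return Collections.emptySet(); }
}

// Hypothetical router: explicit nodes win, otherwise route to the owners of the keys.
class Router {
    private final List<String> nodes;

    Router(List<String> nodes) { this.nodes = nodes; }

    // Toy ownership function (assumption); stands in for a consistent hash lookup.
    String ownerOf(Object key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    Set<String> route(RoutableCallable<?> task) {
        Set<String> explicit = task.preferredExecutionNodes();
        if (!explicit.isEmpty()) return explicit;
        Set<String> targets = new TreeSet<>();
        for (Object key : task.getRelatedKeys()) targets.add(ownerOf(key));
        return targets;
    }
}

class RouterSketch {
    static String demo() {
        Router router = new Router(Arrays.asList("n0", "n1", "n2"));

        // Task that only declares the keys it expects to touch.
        RoutableCallable<Integer> byKeys = new RoutableCallable<Integer>() {
            public Integer call() { return 0; }
            public Set<Object> getRelatedKeys() {
                return new HashSet<Object>(Arrays.asList(0, 4));
            }
        };

        // Task that names its execution nodes directly.
        RoutableCallable<Integer> byNodes = new RoutableCallable<Integer>() {
            public Integer call() { return 0; }
            public Set<String> preferredExecutionNodes() {
                return Collections.singleton("n2");
            }
        };

        return router.route(byKeys) + "|" + router.route(byNodes);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this form the routing decision lives in the environment rather than in the task, which is the trade-off being debated above.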

Regards,
Vladimir