Hey Manik,
On 2010-07-30, at 11:07 AM, Manik Surtani wrote:
Vladimir,
Here are some thoughts on your latest iteration (Version 8) of this page:
http://community.jboss.org/wiki/InfinispanDistributedExecutionFramework
There are actually 2 distinct use cases that this feature would touch:
1) pushing a Callable to the node where state is located and pull back a result
2) breaking up a task into Callables and then pushing the Callables out as per (1).
Now only (2) is true map/reduce (and relevant to fork/join) but for many cases, (1) alone
is enough. So whatever API we propose should support both. Simple remote code execution
to leverage data locality where the user doesn't care about breaking up tasks into
subtasks, as well as proper task decomposition.
Yes, that the intention. Build support for 2) but allow 1). Will add interfaces to define
1. I wanted to get some consensus on how 2) should look like and then focus on 1).
But we should be clear about these differences - even though on an impl level they are
very closely related. The stuff you have so far handles case (2) pretty well but we
should also expose (1).
Will make it clear.
Some other feedback:
* I presume distributed task monitoring and annotations are sections that come under
"outside of scope"? They seem to be on the same level so I wasn't sure of
your intentions here.
Ok will make this distinction more clear.
* Proposed interfaces - Not sure if I understand the purpose of
DistributedCallable#mapped(). You already assign a cache to the callable in
DistributedCallable#initialized(), right?
The intent of initialized call was to bootstrap task into environment at the master node,
the node invoking the distribution of callables. The purpose of initialize is two fold.
Task implementor can look up input that was given to it in constructor of
DistributedCallable and interpret this input (e.g which keys to use, which particular
cache etc etc) and in turn inform execution environment which cache it would use for data
input so proper ConsistentHash can be passed in preferredExecutionNodes by the
environment. I wanted to isolate task implementor from digging deep into Infinispan
internals to figure out how to fulfill preferredExecutionNodes contract.
preferredExecutionNodes is invoked while task is still at master node before migration to
execution node.
The purpose of mapped is to signal task that it has been migrated to execution node and
that execution is about to begin. Here implementors would actually load the data which is
local to JVM and needed for task computation.
* DistributedCallable#preferredExecutionNodes() - do we really want
to support this at this stage? Or is it better to not support arbitrary node selection by
end-user code?
Task can still do arbitrary node selection in DistributedTask#map(). This goes into
fulfilling objective of 1 from above.
Simpler would be to add a DistributedCallable#getRelatedKeys() which
returns a Set of keys which the callable would be expected to touch, so we can decide
where to route the task. Or maybe you want to offer both forms, so that a
DistributedCallable could *either* provide a set of nodes to execute on or a set of keys
which it expects to touch.
I still like the original approach better :) Maybe it makes more sense with this
elaboration now....
Regards,
Vladimir