[infinispan-dev] New draft for distributed execution framework

"이희승 (Trustin Lee)" trustin at gmail.com
Tue Nov 30 07:56:46 EST 2010


On 11/30/2010 07:11 PM, Manik Surtani wrote:
> Hi Trustin
> 
> Thanks for this.  A few comments:
> 
> * 1st section, by node affinity, I presume you mean data locality?

Yes.  Let me fix it.

> * 2nd section: data processing tasks would be sent out to specific nodes where required data is available.  Not to every node.

You're right again. :-)

> * Do we want to support this for replicated mode?  Typically, replicated clusters will be small and won't really benefit from task execution.  Maybe we add in support for replication later, if there is adequate demand?  It shouldn't change the API, etc.

For example, a user might want to run a cluster of 100 nodes with 2
replicas, just like the Google File System does.  Why is such a cluster
expected to be small?  Could you elaborate?

> * +1 re: multi-cores on each node.  I presume this would rely on an Executor set up on each node for this purpose?  And a user would have the opportunity to configure this executor (e.g., based on the type of hardware in use)?

Yes, that's what I intended.
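
For example, a minimal sketch of sizing the per-node executor to the
hardware (the ExecutorService part is standard JDK; how it would
actually be handed over to the framework is just an assumption here):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  // Size the pool to the hardware of this particular node.
  int cores = Runtime.getRuntime().availableProcessors();
  ExecutorService taskExecutor = Executors.newFixedThreadPool(cores);

  // Hypothetical: register the executor with the distributed
  // execution framework when it is set up on this node.
  // executionConfig.taskExecutor(taskExecutor);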

> * Could you define where each of these tasks are executed?  E.g., in a grid of {A, B, C, D} and you invoke your example on A, the local tasks happen on any node which contains the data it needs, while the global task happens on A?  Surely in many cases it makes sense to "reduce" before pushing results (final or intermediate) on the network?

Each node runs the first two sub-tasks, and then one node is chosen to
run the last one.  The timeline will look like this:

  A: Task1 .. Task2 ... Task3
  B: Task1 .. Task2 /
  C: Task1 .. Task2 /
  D: Task1 .. Task2 /

Because Task1 and Task2 are executed on the same node, they can be
chained via simple method invocation.  The transition from Task2 to
Task3 will involve network communication.  In the example, the output
is only one entry, so it will be kept in memory until node A's Task3
picks it up.  If Task2 generates many entries, each node will hold them
in memory and block task execution on ctx.write() until Task3 consumes
them.
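
Conceptually, that blocking write could be modeled as a bounded queue
on each node.  This is only a sketch of the idea in plain JDK terms,
not the draft API; ctx.write() would delegate to something like
write() below:

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  // Bounded buffer for Task2's intermediate output on one node.
  // put() blocks when the buffer is full, which models the
  // back-pressure described above: Task2 stalls on ctx.write()
  // until Task3 drains entries over the network.
  class IntermediateBuffer<E> {
      private final BlockingQueue<E> queue =
              new ArrayBlockingQueue<>(1024);

      // Producer side: called from Task2 via ctx.write(entry).
      void write(E entry) throws InterruptedException {
          queue.put(entry);  // blocks until Task3 consumes
      }

      // Consumer side: node A's Task3 pulls entries from each node.
      E take() throws InterruptedException {
          return queue.take();
      }
  }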

One potential problem is node A failing while running Task3.  Because
the intermediate entries are held in memory and discarded as soon as
they are consumed, there's no way to retrieve the intermediate output
again, so the entire task needs to be re-run.  A user can avoid this by
splitting the task into two parts and storing the intermediate output
in the cache.
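
A rough sketch of such a split (the Cache API is real Infinispan; how
the two parts would be submitted to the framework is a hypothetical
placeholder):

  import org.infinispan.Cache;

  // Hypothetical two-part split: part one persists Task2's output
  // into a distributed cache instead of keeping it in memory, so a
  // failure in part two only forces part two to be re-run.
  class TwoPartTask {
      private final Cache<String, Long> intermediate;

      TwoPartTask(Cache<String, Long> intermediate) {
          this.intermediate = intermediate;
      }

      // Part 1: Task1 + Task2 on every node, writing to the cache.
      void storeIntermediate(String key, long task2Result) {
          intermediate.put(key, task2Result);
      }

      // Part 2: Task3 reads the intermediate output back.  If node A
      // fails here, part 1 does not need to be repeated.
      long readIntermediate(String key) {
          return intermediate.get(key);
      }
  }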

> * Non-Java language support - this should be out of scope for now, as it adds a layer of complexity we don't need.  We will certainly add support for map/reduce over Hot Rod in future, perhaps using JSON/JavaScript, but not for now.  :)
> * Client/Server development: again, out of scope for now.  See above.  Once we have map/reduce over Hot Rod, client/server would be trivial.

Agreed.  I didn't mean to implement it in 5.0 unless there is a good
deal of demand. :-)

More feedback appreciated in advance, folks! :-)

-- 
Trustin Lee, http://gleamynode.net/
