On 11/30/2010 07:11 PM, Manik Surtani wrote:
> Hi Trustin
> Thanks for this. A few comments:
> * 1st section, by node affinity, I presume you mean data locality?
Yes. Let me fix it.
> * 2nd section: data processing tasks would be sent out to specific
> nodes where required data is available. Not to every node.
You're right again. :-)
> * Do we want to support this for replicated mode? Typically,
> replicated clusters will be small and won't really benefit from task execution. Maybe
> we add in support for replication later, if there is adequate demand? It shouldn't
> change the API, etc.
For example, a user might want to run a cluster of 100 nodes with 2
replicas, just like the Google File System does. Why would such a
cluster necessarily be small? Could you elaborate?
> * +1 re: multi-cores on each node. I presume this would rely on an
> Executor set up on each node for this purpose? And a user would have the opportunity to
> configure this executor (e.g., based on the type of hardware in use)?
Yes, that's what I intended.
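To make the idea concrete, here is a minimal sketch of what a per-node, hardware-aware executor setup could look like. This is purely illustrative; `NodeExecutorConfig` and `createTaskExecutor` are hypothetical names, not part of any existing Infinispan API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NodeExecutorConfig {
    // Hypothetical per-node executor factory; the pool size is derived
    // from the hardware, but a user could override it in configuration.
    static ExecutorService createTaskExecutor(int coresPerTask) {
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors() / coresPerTask);
        return Executors.newFixedThreadPool(threads);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService exec = createTaskExecutor(1);
        // Each sub-task submitted to this node runs on the local pool.
        Future<Integer> f = exec.submit(() -> 21 * 2);
        System.out.println(f.get()); // prints 42
        exec.shutdown();
    }
}
```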
> * Could you define where each of these tasks are executed? E.g., in
> a grid of {A, B, C, D} and you invoke your example on A, the local tasks happen on any
> node which contains the data it needs, while the global task happens on A? Surely in many
> cases it makes sense to "reduce" before pushing results (final or intermediate)
> on the network?
Each node runs the first two sub-tasks, and then one node is chosen to
run the last one. The timeline will look like this:
A: Task1 .. Task2 ... Task3
B: Task1 .. Task2 /
C: Task1 .. Task2 /
D: Task1 .. Task2 /
Because Task1 and Task2 are executed on the same node, they can be
chained via simple method invocation. The transition from Task2 to
Task3 will involve network communication. In the example, the output is
only one entry, so it will be kept in memory until node A's Task3 picks
it up. If Task2 generates a lot of entries, each node will hold them in
memory and block the task execution on ctx.write() until Task3 picks
them up.
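The blocking handoff between Task2 and Task3 can be sketched with a bounded queue: producers (Task2 on each of the four nodes) block when the buffer is full, just as ctx.write() would, until the single consumer (Task3) drains the entries. This is only an analogy of the described behavior, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        // Small bounded buffer: a producer's put() blocks when the
        // buffer is full, mimicking ctx.write() blocking until Task3
        // consumes the intermediary entries.
        BlockingQueue<Integer> handoff = new ArrayBlockingQueue<>(4);
        ExecutorService nodes = Executors.newFixedThreadPool(4);
        List<Future<?>> producers = new ArrayList<>();
        for (int node = 0; node < 4; node++) {
            final int base = node * 10;
            // Task1 + Task2 chained locally on each "node".
            producers.add(nodes.submit(() -> {
                for (int i = 0; i < 3; i++) {
                    try {
                        handoff.put(base + i); // blocks when buffer is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }));
        }
        // Task3 runs on the one chosen node and drains all entries.
        int sum = 0;
        for (int i = 0; i < 12; i++) {
            sum += handoff.take();
        }
        for (Future<?> f : producers) {
            f.get();
        }
        nodes.shutdown();
        System.out.println(sum);
    }
}
```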
One potential problem is when node A fails while running Task3. Because
the intermediary entries are loaded into memory and lost as soon as they
are consumed, there is no way to retrieve the intermediary output again,
and the entire task needs to be re-run. A user can avoid this by
splitting the task into two parts so that the intermediary output is
stored in the cache.
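The split-into-two-parts workaround could look roughly like the following sketch, where a plain `Map` stands in for the cache and all names (`runFirstJob`, `runSecondJob`, the `"intermediate:"` key scheme) are hypothetical:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

public class CheckpointSketch {
    // Stand-in for the distributed cache holding intermediary output.
    static final Map<String, Integer> cache = new HashMap<>();

    // Job 1: Task1 + Task2 run and persist their output in the cache,
    // so it survives a failure of the node running Task3.
    static void runFirstJob(List<Integer> data) {
        for (int i = 0; i < data.size(); i++) {
            int v = data.get(i) * 2;            // chained Task1 + Task2
            cache.put("intermediate:" + i, v);  // checkpointed output
        }
    }

    // Job 2: Task3 reads only from the cache, so after a crash it can
    // be re-run alone without repeating Task1 and Task2.
    static int runSecondJob(int count) {
        return IntStream.range(0, count)
                        .map(i -> cache.get("intermediate:" + i))
                        .sum();
    }

    public static void main(String[] args) {
        runFirstJob(Arrays.asList(1, 2, 3));
        System.out.println(runSecondJob(3)); // 2 + 4 + 6
    }
}
```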
> * Non-Java language support - this should be out of scope for now, as
> it adds a layer of complexity we don't need. We will certainly add support for
> map/reduce over Hot Rod in future, perhaps using JSON/JavaScript, but not for now. :)
> * Client/Server development: again, out of scope for now. See above. Once we have
> map/reduce over Hot Rod, client/server would be trivial.
Agreed. I didn't mean to implement it in 5.0 unless there is a good
deal of demand. :-)
More feedback appreciated in advance, folks! :-)
--
Trustin Lee,
http://gleamynode.net/