On 12-02-20 3:43 PM, Manik Surtani wrote:
> I was under the impression reduce is distributed too? Don't we
> do the mapping on each node, then a first-pass reduce on each node too, before streaming
> results back to the caller node?
What we do in the first-pass reduce is essentially a combine, and we
should not do it blindly: this eager reduction/combine is only correct
when the reduce function is both /commutative/ and /associative/. It can
produce wrong results when it is not:
http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4...
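To make the associativity problem concrete, here is a minimal, self-contained Java sketch (the class and method names are mine, not Infinispan API): an average-style reduce function applied eagerly per node gives a different answer than a single reduce over all values for the key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EagerCombineDemo {

    // A reduce function that averages its inputs. It is neither
    // associative nor commutative over partial results, so it must
    // NOT be used as an eager per-node combiner.
    static double avg(List<Double> values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return values.isEmpty() ? 0.0 : sum / values.size();
    }

    public static void main(String[] args) {
        List<Double> nodeA = Arrays.asList(1.0, 2.0, 3.0); // values for KOut on node A
        List<Double> nodeB = Arrays.asList(10.0);          // values for KOut on node B

        // Correct: one reduce over all values for the key.
        List<Double> all = new ArrayList<>(nodeA);
        all.addAll(nodeB);
        double correct = avg(all); // (1+2+3+10)/4 = 4.0

        // Wrong: eagerly "combine" (reduce) on each node, then reduce the partials.
        double eager = avg(Arrays.asList(avg(nodeA), avg(nodeB))); // (2+10)/2 = 6.0

        System.out.println(correct);
        System.out.println(eager);
    }
}
```

A sum or a max would survive this eager combining; the average does not, which is exactly why the combine step has to be opt-in.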
So yes, the first-pass reduce is distributed, but the second-pass reduce
should be distributed as well! Currently it is not!
> This all makes sense as well, however one problem with the consistent
> hash approach is that it is prone to change when there is a topology change. How would
> you deal with that? Would you maintain a history of consistent hashes?
I don't think I understand. Even if there is a topology change, the
intermediate results Map<KOut, List<VOut>> will be migrated too. All we
need are the KOut keys: we can hash each one and find out where its
List<VOut> lives, no?
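The idea above can be sketched in a few lines of Java. This is a toy owner lookup, not Infinispan's real ConsistentHash API: the point is only that the lookup always hashes against the *current* topology, and since migrated List<VOut> entries moved according to the same hash, the owner found for a KOut is still correct after a topology change.

```java
import java.util.List;

public class KOutRouting {

    // Hypothetical owner lookup: hash the intermediate key KOut onto
    // the current node list. (Real consistent hashing moves far fewer
    // keys on a topology change; plain modulo is just for illustration.)
    static <KOut> int ownerIndex(KOut key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    public static void main(String[] args) {
        List<String> before = List.of("nodeA", "nodeB", "nodeC");
        List<String> after  = List.of("nodeA", "nodeB", "nodeC", "nodeD");

        String kOut = "some-intermediate-key";

        // Before and after the change we simply re-hash against whatever
        // topology is current; the migrated List<VOut> followed the same
        // placement rule, so the lookup finds it either way.
        System.out.println(before.get(ownerIndex(kOut, before.size())));
        System.out.println(after.get(ownerIndex(kOut, after.size())));
    }
}
```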
Vladimir