On 12-02-20 3:43 PM, Manik Surtani wrote:
I was under the impression reduce is distributed too? Don't we do the mapping on each node, then a first-pass reduce on each node too, before streaming results back to the caller node?
What we do in the first-pass reduce is essentially a combine, and we
should not do that blindly, because this eager reduction/combine only
works when the reduce function is both commutative and associative!
This can lead to problems when it is not:
http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/
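
To make the problem concrete, here is a minimal sketch in plain Java. It does not use the Infinispan or Hadoop APIs; the class CombinerSketch, its methods, and the sample values are all made up for illustration. It compares reducing everything in one pass with reducing per node first and then reducing the partial results, once with a sum and once with a mean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only, not Infinispan or Hadoop code.
final class CombinerSketch {

    // Sum is commutative and associative, so reducing partial results is safe.
    static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Mean is commutative but not associative: a mean of means is not the overall mean.
    static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    // Made-up intermediate values for one key, spread unevenly over two nodes.
    static List<List<Double>> sampleInput() {
        return Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0),
                Arrays.asList(10.0));
    }

    // Ship all values to the caller and reduce them once.
    static double reduceOnce(Function<List<Double>, Double> reducer,
                             List<List<Double>> perNode) {
        List<Double> all = new ArrayList<>();
        for (List<Double> nodeValues : perNode) {
            all.addAll(nodeValues);
        }
        return reducer.apply(all);
    }

    // First-pass reduce on each node, then reduce the partial results on the caller.
    static double reduceWithCombine(Function<List<Double>, Double> reducer,
                                    List<List<Double>> perNode) {
        List<Double> partials = new ArrayList<>();
        for (List<Double> nodeValues : perNode) {
            partials.add(reducer.apply(nodeValues));
        }
        return reducer.apply(partials);
    }

    public static void main(String[] args) {
        List<List<Double>> perNode = sampleInput();
        System.out.println(reduceOnce(CombinerSketch::sum, perNode));         // 16.0
        System.out.println(reduceWithCombine(CombinerSketch::sum, perNode));  // 16.0
        System.out.println(reduceOnce(CombinerSketch::mean, perNode));        // 4.0
        System.out.println(reduceWithCombine(CombinerSketch::mean, perNode)); // 6.0
    }
}
```

With the sum both paths agree, but with the mean the eager per-node reduce silently changes the answer from 4.0 to 6.0.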
So yes, the first-pass reduce is distributed, but the second-phase
reduce should be distributed as well! Currently it is not!