What we do in the first-pass reduce is essentially a combine, and we should not do that blindly, because this eager reduction/combine only works when the reduce function is both commutative and associative! It can lead to problems when it is not: http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/

I was under the impression reduce is distributed too? Don't we do the mapping on each node, then a first-pass reduce on each node too, before streaming the results back to the caller node?
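A minimal sketch of why the eager first-pass reduce only works for commutative and associative functions; the class and the sample values below are hypothetical, not part of any framework discussed here. Summing per node and then summing the partials gives the same answer as summing everything at the caller, while averaging partial averages does not.

```java
import java.util.List;

// Hypothetical sketch: when is a per-node first-pass reduce (combine) safe?
public class CombinerSafety {

    // Safe to combine: integer sum is commutative and associative, so reducing
    // per-node partials and then reducing the partials equals a global reduce.
    static int sum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    // NOT safe to combine: averaging partial averages is not the same as a
    // global average, because the reduce is not associative over partials.
    static double average(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Sum: combining per node first gives the same global result.
        int combined = sum(List.of(sum(List.of(1, 2)), sum(List.of(9)))); // 12
        int global   = sum(List.of(1, 2, 9));                             // 12
        System.out.println(combined == global);                           // true

        // Average: combining per node first changes the answer.
        double combinedAvg = average(List.of(average(List.of(1.0, 2.0)),
                                             average(List.of(9.0))));     // (1.5 + 9.0) / 2 = 5.25
        double globalAvg   = average(List.of(1.0, 2.0, 9.0));             // 4.0
        System.out.println(combinedAvg == globalAvg);                     // false
    }
}
```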
I don't think I understand! Even if there is a topology change, the intermediate results Map&lt;KOut, List&lt;VOut&gt;&gt; will be migrated; all we need is the KOut's, which we can hash to find out where the List&lt;VOut&gt; are, no?

This all makes sense as well; however, one problem with the consistent hash approach is that it is prone to change when there is a topology change. How would you deal with that? Would you maintain a history of consistent hashes?
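A minimal sketch of the consistent-hash lookup being described, assuming the intermediate Map&lt;KOut, List&lt;VOut&gt;&gt; entries are migrated along with the rebalance. This is a toy ring (no virtual nodes) with made-up node names, not the real ConsistentHash API of any product; the point is that after a topology change you re-hash KOut against the new ring rather than keeping a history of old hashes.

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical consistent-hash ring: hash a KOut to find the node that
// currently owns its List<VOut>. On a topology change, only keys whose
// owner actually changed need to move.
public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    private long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    // Owner of a key = first node clockwise from the key's hash position.
    public String ownerOf(String kOut) {
        SortedMap<Long, String> tail = ring.tailMap(hash(kOut));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ch = new ConsistentHashRing();
        ch.addNode("nodeA");
        ch.addNode("nodeB");
        ch.addNode("nodeC");

        // Owner of this KOut under the current topology.
        System.out.println(ch.ownerOf("someKOut"));

        // Topology change: a node leaves. Re-hashing the same KOut against the
        // new ring yields its new owner; no history of old hashes is consulted.
        ch.removeNode("nodeB");
        System.out.println(ch.ownerOf("someKOut"));
    }
}
```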