What we do in the first-pass reduce is essentially a combine, and we should not do that blindly, because this eager reduction/combine only works when the reduce function is both commutative and associative! It can lead to problems when it is not: http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/

I was under the impression reduce is distributed too? Don't we do the mapping on each node, then a first-pass reduce on each node too, before streaming the results back to the caller node?
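A minimal sketch of why the eager first-pass reduce only works for commutative and associative functions; the class and the sample values below are hypothetical, not part of any framework discussed here. Summing per node and then summing the partials gives the same answer as summing everything at the caller, while averaging partial averages does not.

```java
import java.util.List;

// Hypothetical sketch: when is a per-node first-pass reduce (combine) safe?
public class CombinerSafety {

    // Safe to combine: integer sum is commutative and associative, so reducing
    // per-node partials and then reducing the partials equals a global reduce.
    static int sum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    // NOT safe to combine: averaging partial averages is not the same as a
    // global average, because the reduce is not associative over partials.
    static double average(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Sum: combining per node first gives the same global result.
        int combined = sum(List.of(sum(List.of(1, 2)), sum(List.of(9)))); // 12
        int global   = sum(List.of(1, 2, 9));                             // 12
        System.out.println(combined == global);                           // true

        // Average: combining per node first changes the answer.
        double combinedAvg = average(List.of(average(List.of(1.0, 2.0)),
                                             average(List.of(9.0))));     // (1.5 + 9.0) / 2 = 5.25
        double globalAvg   = average(List.of(1.0, 2.0, 9.0));             // 4.0
        System.out.println(combinedAvg == globalAvg);                     // false
    }
}
```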
I don't think I understand! Even if there is a topology change, the intermediate results Map&lt;KOut, List&lt;VOut&gt;&gt; will be migrated; all we need is the KOut's, which we can hash to find out where the List&lt;VOut&gt; are, no?

This all makes sense as well; however, one problem with the consistent hash approach is that it is prone to change when there is a topology change. How would you deal with that? Would you maintain a history of consistent hashes?
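A minimal sketch of the consistent-hash lookup being described, assuming the intermediate Map&lt;KOut, List&lt;VOut&gt;&gt; entries are migrated along with the rebalance. This is a toy ring (no virtual nodes) with made-up node names, not the real ConsistentHash API of any product; the point is that after a topology change you re-hash KOut against the new ring rather than keeping a history of old hashes.

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical consistent-hash ring: hash a KOut to find the node that
// currently owns its List<VOut>. On a topology change, only keys whose
// owner actually changed need to move.
public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    private long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    // Owner of a key = first node clockwise from the key's hash position.
    public String ownerOf(String kOut) {
        SortedMap<Long, String> tail = ring.tailMap(hash(kOut));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ch = new ConsistentHashRing();
        ch.addNode("nodeA");
        ch.addNode("nodeB");
        ch.addNode("nodeC");

        // Owner of this KOut under the current topology.
        System.out.println(ch.ownerOf("someKOut"));

        // Topology change: a node leaves. Re-hashing the same KOut against the
        // new ring yields its new owner; no history of old hashes is consulted.
        ch.removeNode("nodeB");
        System.out.println(ch.ownerOf("someKOut"));
    }
}
```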