On 13-09-17 11:50 AM, Shane Johnson wrote:
Right. I'm familiar with the map/reduce process and the proposed
improvements.
This part of the blog threw me off:
"as the map/reduce tasks now run in parallel over both the nodes in the cluster and
within the same node (multiple threads)"
To me, it implies that there are now multiple map threads per node. Further, I thought
that the map / reduce 'working set' was limited to what was in memory. I did not
realize that map / reduce would iterate over all of the data both in memory and on disk.
That is good to hear, though I'm curious if it will apply to all cache stores (e.g.
LevelDB) and how ISPN map / reduce handles a data set that is greater than the available
memory. A lot of in-memory stores face this limitation when backed by on-disk stores. If the
data is retrieved one entry at a time, I don't see how multiple threads will help.
However, if it is retrieved in bulk I can see how it might. Not entirely sure.
The implementation in MapReduceManagerImpl.java is cache store agnostic.
The algorithm loads all keys (pinned to that owner node) and iterates over
all values one value at a time.
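For illustration only, here is a minimal sketch of that store-agnostic loop. It is not the actual MapReduceManagerImpl code: the EntryStore interface, the mapAll helper, and the in-memory stand-in are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

public class LocalMapPhaseSketch {

    // Hypothetical abstraction over "memory plus whatever cache store is configured".
    // The map phase only sees this interface, so it does not care which store backs it.
    interface EntryStore<K, V> {
        Set<K> keysOwnedLocally(); // all keys pinned to this owner node
        V load(K key);             // fetch a single value for one key
    }

    // Load the local key set, then visit the values one at a time.
    // Returns the number of entries handed to the mapper.
    static <K, V> int mapAll(EntryStore<K, V> store, BiConsumer<K, V> mapper) {
        int visited = 0;
        for (K key : store.keysOwnedLocally()) {
            V value = store.load(key); // one value fetched per iteration
            if (value != null) {
                mapper.accept(key, value);
                visited++;
            }
        }
        return visited;
    }

    // Trivial in-memory stand-in so the sketch can run without any real store.
    static <K, V> EntryStore<K, V> inMemory(Map<K, V> data) {
        return new EntryStore<K, V>() {
            @Override public Set<K> keysOwnedLocally() { return data.keySet(); }
            @Override public V load(K key) { return data.get(key); }
        };
    }

    public static void main(String[] args) {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("k1", "hello world");
        data.put("k2", "hello again");

        List<String> emitted = new ArrayList<>();
        int visited = mapAll(inMemory(data), (k, v) -> emitted.add(k + "=" + v));
        System.out.println(visited + " entries mapped: " + emitted);
    }
}
```

In this sketch the whole key set is materialized up front while values are fetched one by one, so the loop itself is sequential per node.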
Now that we are breaking this down into details, I am not sure how
multiple threads in the cache store would help either. Mircea?
Regards,
Vladimir