Thanks Vladimir, I like the hands-on approach!
Adding -dev: there's a lot of interest around the parallel M/R, so I think others will
have some thoughts on it as well.
So what you're basically doing in your branch is iterating over all the keys in the
cache and then, for each key, invoking the mapping in a separate thread. Whilst this would
work, I think it has some drawbacks:
- the iteration over the keys in the container happens in sequence, even though the mapping
phases happen in parallel. This speeds things up a bit, but not as much as having the
iteration itself happen in parallel, especially when the mapper is fast, which I think is
pretty common.
- the StatelessTask + some smaller objects are being created for each iterated key.
That's a lot of noise for the GC IMO.
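To make the two drawbacks concrete, here is a rough sketch of the pattern I mean. It is not the code from the branch: PerKeyTaskSketch, mapWithPerKeyTasks and the plain Runnable standing in for StatelessTask are hypothetical names I made up for illustration. A single thread walks the entries, and one task object (plus its Future) is allocated per key:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// Hypothetical illustration only; not the code from the branch.
class PerKeyTaskSketch {

    // Walks the entries on the calling thread and submits one task per entry.
    // Returns the number of task objects that were created.
    static <K, V> int mapWithPerKeyTasks(Map<K, V> entries,
                                         BiConsumer<K, V> mapper,
                                         ExecutorService executor) {
        List<Future<?>> pending = new ArrayList<>();
        for (Map.Entry<K, V> entry : entries.entrySet()) {   // sequential iteration
            // One short-lived task object per key (stand-in for StatelessTask).
            Runnable task = () -> mapper.accept(entry.getKey(), entry.getValue());
            pending.add(executor.submit(task));              // mapping runs in the pool
        }
        try {
            for (Future<?> f : pending) {
                f.get();                                     // wait for every mapping
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return pending.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> data = new ConcurrentHashMap<>();
        for (int i = 0; i < 100; i++) {
            data.put("k" + i, i);
        }
        AtomicLong sum = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        int tasks = mapWithPerKeyTasks(data, (k, v) -> sum.addAndGet(v), pool);
        pool.shutdown();
        System.out.println(tasks + " tasks created, sum = " + sum.get());
    }
}
```

With a fast mapper, the submitting loop and the per-key allocations are likely to dominate, which is the concern above.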
I think delegating the parallel iteration to the DataContainer (similar to
AdvancedCacheLoader.process(Executor)) would be a better approach:
- the logic is reusable for other components as well, such as querying (to implement
full-scan-like search), or as a general-purpose parallel iterator over the keys
- object creation is reduced
- the DefaultDataContainer uses an EquivalentConcurrentHashMapV8 for holding the entries,
which already supports parallel iteration, so the heavy lifting is already in place
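A minimal sketch of what container-side parallel iteration could look like. As an assumption, it uses the JDK 8 java.util.concurrent.ConcurrentHashMap as a stand-in for EquivalentConcurrentHashMapV8; ParallelContainerSketch and forEachParallel are hypothetical names, not the DataContainer API. Note that the JDK bulk operation runs on the common ForkJoinPool rather than on a caller-supplied Executor:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.BiConsumer;

// Hypothetical container wrapper; not Infinispan's DataContainer.
class ParallelContainerSketch<K, V> {

    // Stand-in for the map that holds the container's entries.
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();

    void put(K key, V value) {
        entries.put(key, value);
    }

    // Delegates the iteration itself to the map. A parallelism threshold of 1
    // asks the map to split the traversal across threads as much as it can.
    // No per-key task object is created by the caller.
    void forEachParallel(BiConsumer<? super K, ? super V> action) {
        entries.forEach(1L, action);
    }

    public static void main(String[] args) {
        ParallelContainerSketch<String, Integer> container = new ParallelContainerSketch<>();
        for (int i = 0; i < 1000; i++) {
            container.put("k" + i, i);
        }
        LongAdder sum = new LongAdder();
        // The action may run concurrently on several threads, so it must be thread-safe.
        container.forEachParallel((key, value) -> sum.add(value));
        System.out.println("sum = " + sum.sum());
    }
}
```

The same entry point could serve a mapper, a full-scan query or any other per-entry visitor, which is the reuse argument above.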
On Dec 4, 2013, at 5:16 PM, Vladimir Blagojevic <vblagoje(a)redhat.com> wrote:
Here is my M/R parallel execution solution updated to master
https://github.com/vblagoje/infinispan/tree/t_2284_new
Now I'll work on your solution, which I am actually starting to like the more I think
about it. Although I have to admit that I would move some of your interfaces, like
these KeyFilters, into more prominent packages so we can all use the same interfaces. Also,
I would see if we can genericize some of your interfaces and implementations.
Will keep you updated.
Vladimir
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)