[infinispan-dev] Parallel M/R

Radim Vansa rvansa at redhat.com
Mon Dec 9 03:09:53 EST 2013


There is one thing I really don't like about the current implementation: 
DefaultCollector, and any other collection that keeps one (or more) 
objects per entry.
We can't assume that if you double the number of objects in memory (and 
if you map each entry to a bigger object, you do at least that), they'd 
still fit into the heap. Even less so if you map the entries from the 
cache store as well.
I believe we have to implement the Collector as a bounded queue, and 
start the reduction phase on the entries that have already been mapped, 
in parallel with the map phase. Otherwise, say hello to OOME.
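
To make the bounded-queue idea concrete, here is a rough sketch using only JDK classes. It is not the Infinispan Collector API: the class name BoundedCollectorSketch, the countWords method and the END marker are all made up for illustration, and the "mapper" is just a word count. The point is only that put() blocks once the queue is full, so the mapped-but-not-yet-reduced entries never exceed the given capacity.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Infinispan code: the mapper emits into a bounded
// queue while a reducer thread drains it concurrently.
final class BoundedCollectorSketch {
    // Marker instance (compared by identity) telling the reducer that mapping is over.
    private static final Map.Entry<String, Integer> END = new SimpleImmutableEntry<>("", 0);

    static Map<String, Integer> countWords(Iterable<String> words, int capacity) {
        BlockingQueue<Map.Entry<String, Integer>> queue = new ArrayBlockingQueue<>(capacity);
        Map<String, Integer> reduced = new ConcurrentHashMap<>();
        ExecutorService reducer = Executors.newSingleThreadExecutor();
        try {
            // Reduce phase: starts immediately and runs alongside the map phase.
            Future<Void> done = reducer.submit(() -> {
                for (Map.Entry<String, Integer> e = queue.take(); e != END; e = queue.take()) {
                    reduced.merge(e.getKey(), e.getValue(), Integer::sum);
                }
                return null;
            });
            // Map phase: put() blocks while the queue is full, which bounds memory use.
            for (String word : words) {
                queue.put(new SimpleImmutableEntry<>(word, 1));
            }
            queue.put(END);
            done.get(); // wait until the reducer has drained everything
            return reduced;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            reducer.shutdownNow(); // also unblocks the reducer if mapping failed
        }
    }
}
```

A real version would have several mapper threads feeding the queue and would need to handle a failing reducer (otherwise a mapper could block on put() forever); this sketch leaves that out.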

Cheers

Radim

PS: And don't keep all the futures just to check that all tasks have 
finished - use ExecutorAllCompletionService instead.
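
For illustration, this is roughly what "wait for everything without holding the futures" looks like with plain JDK classes. It is a stand-in, not ExecutorAllCompletionService itself: the class AllCompletionSketch and its submit, awaitAll and demo methods are invented for this sketch. It keeps one counter and the first failure instead of one Future per task.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not the Infinispan class: tracks completion of all
// submitted tasks with a counter rather than a list of futures.
final class AllCompletionSketch {
    private final Executor executor;
    private final AtomicReference<Throwable> firstFailure = new AtomicReference<>();
    private long pending; // guarded by "this"

    AllCompletionSketch(Executor executor) {
        this.executor = executor;
    }

    void submit(Runnable task) {
        synchronized (this) { pending++; }
        try {
            executor.execute(() -> {
                try {
                    task.run();
                } catch (Throwable t) {
                    firstFailure.compareAndSet(null, t); // remember only the first failure
                } finally {
                    taskDone();
                }
            });
        } catch (RuntimeException rejected) {
            taskDone(); // the task will never run, so undo the increment
            throw rejected;
        }
    }

    private synchronized void taskDone() {
        if (--pending == 0) notifyAll();
    }

    // Blocks (uninterruptibly, for simplicity) until every submitted task has
    // finished; returns the first failure, or null if all tasks succeeded.
    synchronized Throwable awaitAll() {
        boolean interrupted = false;
        while (pending > 0) {
            try { wait(); } catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();
        return firstFailure.get();
    }

    // Usage example: sums 1..tasks on a small pool.
    static long demo(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            AllCompletionSketch all = new AllCompletionSketch(pool);
            LongAdder sum = new LongAdder();
            for (int i = 1; i <= tasks; i++) {
                final int n = i;
                all.submit(() -> sum.add(n));
            }
            Throwable failure = all.awaitAll();
            if (failure != null) throw new IllegalStateException(failure);
            return sum.sum();
        } finally {
            pool.shutdown();
        }
    }
}
```

Memory use stays constant no matter how many tasks are submitted, which matters when there is one task per key.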

On 12/06/2013 05:18 PM, Mircea Markus wrote:
> Thanks Vladimir, I like the hands on approach!
> Adding -dev, there's a lot of interest around the parallel M/R so I think others will have some thoughts on it as well.
>
> So what you're basically doing in your branch is iterate over all the keys in the cache and then for each key invoke the mapping in a separate thread. Whilst this would work, I think it has some drawbacks:
> - the iteration over the keys in the container happens in sequence, even though the mapping phases happen in parallel. This speeds things up a bit, but not as much as having the iteration happen in parallel, especially when the mapper is fast, which I think is pretty common.
> - a StatelessTask plus some smaller objects are created for each iterated key. That's a lot of noise for the GC IMO.
>
> I think delegating the parallel iteration to the DataContainer (similar to AdvancedCacheLoader.process(Executor)) would be a better approach:
> - the logic is reusable for other components as well, such as querying (to implement a full-scan-like search) or a general purpose parallel iterator over the keys
> - object creation is reduced
> - the DefaultDataContainer uses an EquivalentConcurrentHashMapV8 for holding the entries, which already supports parallel iteration, so the heavy lifting is already in place
>
> On Dec 4, 2013, at 5:16 PM, Vladimir Blagojevic <vblagoje at redhat.com> wrote:
>
>> Here is my M/R parallel execution solution updated to master https://github.com/vblagoje/infinispan/tree/t_2284_new
>>
>> Now, I'll work on your solution, which I am actually starting to like the more I think about it. Although I have to admit that I would pull out some of your interfaces, like these KeyFilters, into more prominent packages so we can all use the same interfaces. Also, I would see if we can genericize some of your interfaces and implementations.
>>
>> Will keep you updated.
>>
>> Vladimir
> Cheers,


-- 
Radim Vansa <rvansa at redhat.com>
JBoss DataGrid QA


