[infinispan-dev] Parallel M/R

Mircea Markus mmarkus at redhat.com
Fri Dec 13 05:06:19 EST 2013


> On 9 Dec 2013, at 08:10, Radim Vansa <rvansa at redhat.com> wrote:
> 
> There is one thing I really don't like about the current implementation: 
> DefaultCollector. And any other collection that keeps one (or more) 
> object per entry.
> We can't assume that if you double the number of objects in memory (which 
> is in fact what happens when you map each entry to a bigger object), they'd 
> still fit. It gets worse if you map the objects from the cache store as well.
> I believe we have to use a Collector implemented as a bounded queue, and 
> start the reduction phase on the entries that have already been mapped, in 
> parallel with the mapper phase. Otherwise, say hello to OOME.

Agreed, that's indeed a problem. I'm not sure it's related to parallel iteration, though :-)
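
To make the bounded-queue idea concrete, here is a minimal sketch in plain JDK Java, not Infinispan code. The class name BoundedCollectorSketch, the DONE marker and the word-count mapper are all hypothetical; the only point is that the mapper blocks once the queue is full, so memory stays bounded while the reduction runs alongside it.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BoundedCollectorSketch {
    // Hypothetical end-of-stream marker, recognised by identity.
    private static final Map.Entry<String, Integer> DONE = new SimpleImmutableEntry<>("", 0);

    static Map<String, Integer> wordCount(List<String> lines, int capacity) {
        // The queue plays the role of the collector: at most 'capacity' mapped pairs are pending.
        BlockingQueue<Map.Entry<String, Integer>> queue = new ArrayBlockingQueue<>(capacity);
        ExecutorService reducerThread = Executors.newSingleThreadExecutor();
        try {
            // Reduction starts immediately and drains pairs while mapping is still running.
            Future<Map<String, Integer>> reduced = reducerThread.submit(() -> {
                Map<String, Integer> totals = new HashMap<>();
                for (Map.Entry<String, Integer> pair = queue.take(); pair != DONE; pair = queue.take()) {
                    totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
                }
                return totals;
            });
            // Map phase on the calling thread: put() blocks whenever the queue is full.
            for (String line : lines) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        queue.put(new SimpleImmutableEntry<>(word, 1));
                    }
                }
            }
            queue.put(DONE);
            return reduced.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            reducerThread.shutdownNow();
        }
    }
}
```

This sketch does not handle a reducer that fails early (the mapper would then block on put()); a real collector would need a timeout or cancellation path for that.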

> 
> Cheers
> 
> Radim
> 
> PS: And don't keep all the futures just to check that all tasks have 
> been finished - use ExecutorAllCompletionService instead.
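
As an illustration of the PS above, a minimal sketch using the JDK's ExecutorCompletionService as a stand-in (the signatures of Infinispan's ExecutorAllCompletionService are not shown in this thread, so it is not used here). The class CompletionCountSketch and method runAll are hypothetical names; the caller keeps only a counter of submitted tasks rather than a list of futures.

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CompletionCountSketch {
    // Runs every task and returns how many completed; fails fast on the first task error.
    static int runAll(List<Runnable> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<Void> done = new ExecutorCompletionService<>(pool);
        try {
            int submitted = 0;
            for (Runnable task : tasks) {
                done.submit(task, null); // no reference to the returned Future is kept
                submitted++;
            }
            int completed = 0;
            while (completed < submitted) {
                // take() blocks until some task finishes; get() surfaces its exception, if any.
                done.take().get();
                completed++;
            }
            return completed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```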
> 
>> On 12/06/2013 05:18 PM, Mircea Markus wrote:
>> Thanks Vladimir, I like the hands on approach!
>> Adding -dev, there's a lot of interest around the parallel M/R so I think others will have some thoughts on it as well.
>> 
>> So what you're basically doing in your branch is iterating over all the keys in the cache and then, for each key, invoking the mapping in a separate thread. Whilst this would work, I think it has some drawbacks:
>> - the iteration over the keys in the container happens in sequence, albeit with the mapping phase happening in parallel. This speeds things up a bit, but not as much as having the iteration happen in parallel, especially when the mapper is fast, which I think is pretty common.
>> - a StatelessTask plus some smaller objects are created for each iterated key. That's a lot of noise for the GC IMO.
>> 
>> I think delegating the parallel iteration to the DataContainer (similar to AdvancedCacheLoader.process(Executor)) would be a better approach:
>> - the logic is reusable for other components as well, such as querying (to implement full-scan-like search, or a general purpose parallel iterator over the keys)
>> - object creation is reduced
>> - the DefaultDataContainer uses an EquivalentConcurrentHashMapV8 for holding the entries, which already supports parallel iteration, so the heavy lifting is already in place
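
A rough sketch of what delegating the iteration to the container could look like. It assumes the Java 8 java.util.concurrent.ConcurrentHashMap as a stand-in for EquivalentConcurrentHashMapV8; ParallelContainerSketch, forEachParallel and totalLength are hypothetical names, not the DataContainer API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.BiConsumer;

final class ParallelContainerSketch<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();

    void put(K key, V value) {
        entries.put(key, value);
    }

    // Hypothetical hook: the container owns the traversal, so the caller creates no per-key task objects.
    void forEachParallel(BiConsumer<? super K, ? super V> action) {
        // A parallelism threshold of 1 asks the map to split the traversal as much as it can.
        // The call returns only after every entry has been visited.
        entries.forEach(1L, action);
    }

    // Example "mapper": sums value lengths; the action must be thread-safe, hence LongAdder.
    static long totalLength(ParallelContainerSketch<String, String> container) {
        LongAdder total = new LongAdder();
        container.forEachParallel((key, value) -> total.add(value.length()));
        return total.sum();
    }
}
```

The same hook could serve a full-scan query or any other component that needs to visit every entry, which is the reuse argument made in the list above.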
>> 
>>> On Dec 4, 2013, at 5:16 PM, Vladimir Blagojevic <vblagoje at redhat.com> wrote:
>>> 
>>> Here is my M/R parallel execution solution updated to master https://github.com/vblagoje/infinispan/tree/t_2284_new
>>> 
>>> Now, I'll work on your solution, which I am actually starting to like the more I think about it. Although I have to admit that I would move some of your interfaces, like these KeyFilters, into more prominent packages so we can all use the same interfaces. I would also see whether we can genericize some of your interfaces and implementations.
>>> 
>>> Will keep you updated.
>>> 
>>> Vladimir
>> Cheers,
> 
> 
> -- 
> Radim Vansa <rvansa at redhat.com>
> JBoss DataGrid QA


