[infinispan-dev] MapReduce limitations and suggestions.

Etienne Riviere etienne.riviere at unine.ch
Mon Feb 17 03:18:38 EST 2014


Hi Radim,

I might misunderstand your suggestion, but many M/R jobs actually require running the two phases one after the other, and hence storing the intermediate results somewhere. While some jobs may slightly reduce intermediate memory usage by using a combiner function (e.g., the word-count example), I don’t see how we can avoid intermediate storage altogether.
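
To make the distinction concrete, here is a minimal sketch in plain Java. It does not use the Infinispan MapReduce API; the class and method names (CombinerSketch, mapAndCombine, reduce, groupByColour) are hypothetical and only illustrate the idea. With word count, a combiner keeps one running count per distinct word on each node. With a group-by, the combined value is the list of all entries for a key, so combining does not shrink the intermediate data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Map + combine on one node: instead of emitting a (word, 1) pair per
    // occurrence, keep a single running count per distinct word.
    static Map<String, Integer> mapAndCombine(List<String> localWords) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : localWords) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce: merge the per-node partial counts into the final result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    // Group-by: the value kept per key is the list of all matching entries,
    // so the intermediate data stays as large as the input.
    static Map<String, List<String>> groupByColour(List<String[]> cars) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String[] car : cars) { // car = {id, colour}
            groups.computeIfAbsent(car[1], c -> new ArrayList<>()).add(car[0]);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two "nodes", each combining its local words before the reduce step.
        Map<String, Integer> node1 = mapAndCombine(Arrays.asList("a", "b", "a"));
        Map<String, Integer> node2 = mapAndCombine(Arrays.asList("b", "b", "c"));
        System.out.println(reduce(Arrays.asList(node1, node2)));

        // All cars share one colour, so a single group holds every entry.
        List<String[]> cars = Arrays.asList(
                new String[] {"1", "red"},
                new String[] {"2", "red"},
                new String[] {"3", "red"});
        System.out.println(groupByColour(cars));
    }
}
```

In the word-count case each node hands over at most one entry per distinct word, whereas in the group-by case the single "red" group still contains every car, which is the situation where some form of intermediate storage seems unavoidable.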

Thanks,
Etienne (LEADS project, as is Evangelos, who initiated the thread)

On 17 Feb 2014, at 08:48, Radim Vansa <rvansa at redhat.com> wrote:

> I think that the intermediate cache is not required at all. The M/R 
> algorithm itself can (and should!) run with only the memory occupied by 
> the result of the reduction. The current implementation, with Map first 
> and Reduce after that, will always have these problems; using a cache to 
> temporarily hold the result is only a workaround.
> 
> The only situation where a temporary cache could be useful is when the 
> result grows linearly (or close to that, or even faster) with the number 
> of reduced entries. This would be the case for a groupBy producing 
> Map<Color, List<Entry>> from all entries in the cache. In that case the 
> task does not scale and should be redesigned anyway, but flushing the 
> results into a cache backed by a cache store could help.
> 
> Radim
> 
> On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>> Tristan,
>> 
>> Actually, they are not addressed in this pull request, but the feature
>> where a custom output cache is used instead of results being returned
>> is next in the implementation pipeline.
>> 
>> Evangelos, indeed, depending on the reducer function, all intermediate
>> KOut/VOut pairs might be moved to a single node. How would a custom
>> cache help in this case?
>> 
>> Regards,
>> Vladimir
>> 
>> 
>> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>> Hi Evangelos,
>>> 
>>> You might be interested in a current pull request which addresses
>>> some (all?) of these issues:
>>> 
>>> https://github.com/infinispan/infinispan/pull/2300
>>> 
>>> Tristan
>>> 
>>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>> Hello everyone,
>>>> 
>>>> I started using the MapReduce implementation of Infinispan and I came
>>>> across some possible limitations, so I want to make some suggestions
>>>> about the MapReduce (MR) implementation of Infinispan.
>>>> Depending on the algorithm, there might be some memory problems,
>>>> especially for intermediate results.
>>>> An example of such a case is group by. Suppose that we have a cluster
>>>> of 2 nodes with 2 GB available each. Consider a distributed cache in
>>>> which simple car objects (id, brand, colour) are stored and the total
>>>> size of the data is 3.5 GB. If all objects have the same colour, then
>>>> all 3.5 GB would go to only one reducer and, as a result, an
>>>> OutOfMemoryError would be thrown.
>>>> 
>>>> To overcome these limitations, I propose adding the name of the
>>>> intermediate cache to be used as a parameter. This would enable the
>>>> creation of a custom-configured cache that deals with the memory
>>>> limitations.
>>>> 
>>>> Another feature that I would like to have is the ability to set the
>>>> name of the output cache. The reasoning behind this is similar to the
>>>> one mentioned above.
>>>> 
>>>> I look forward to your thoughts on these two suggestions.
>>>> 
>>>> Regards,
>>>> Evangelos
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> 
> 
> -- 
> Radim Vansa <rvansa at redhat.com>
> JBoss DataGrid QA
> 



