[infinispan-dev] MapReduce limitations and suggestions.
Etienne Riviere
etienne.riviere at unine.ch
Mon Feb 17 03:18:38 EST 2014
Hi Radim,
I might misunderstand your suggestion but many M/R jobs actually require to run the two phases one after the other, and henceforth to store the intermediate results somewhere. While some may slightly reduce intermediate memory usage by using a combiner function (e.g., the word-count example), I don’t see how we can avoid intermediate storage altogether.
Thanks,
Etienne (leads project — as Evangelos who initiated the thread)
On 17 Feb 2014, at 08:48, Radim Vansa <rvansa at redhat.com> wrote:
> I think that the intermediate cache is not required at all. The M/R
> algorithm itself can (and should!) run with memory occupied by the
> result of reduction. The current implementation with Map first and
> Reduce after that will always have these problems, using a cache for
> temporary caching the result is only a workaround.
>
> The only situation when temporary cache could be useful is when the
> result grows linearly (or close to that or even more) with the amount of
> reduced entries. This would be the case for groupBy producing Map<Color,
> List<Entry>> from all entries in cache. Then the task does not scale and
> should be redesigned anyway, but flushing the results into cache backed
> by cache store could help.
>
> Radim
>
> On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>> Tristan,
>>
>> Actually they are not addressed in this pull request but the feature
>> where custom output cache is used instead of results being returned is
>> next in the implementation pipeline.
>>
>> Evangelos, indeed, depending on a reducer function all intermediate
>> KOut/VOut pairs might be moved to a single node. How would custom cache
>> help in this case?
>>
>> Regards,
>> Vladimir
>>
>>
>> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>> Hi Evangelos,
>>>
>>> you might be interested in looking into a current pull request which
>>> addresses some (all?) of these issues
>>>
>>> https://github.com/infinispan/infinispan/pull/2300
>>>
>>> Tristan
>>>
>>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>> Hello everyone,
>>>>
>>>> I started using the MapReduce implementation of Infinispan and I came
>>>> across some possible limitations. Thus, I want to make some suggestions
>>>> about the MapReduce (MR) implementation of Infinispan.
>>>> Depending on the algorithm, there might be some memory problems,
>>>> especially for intermediate results.
>>>> An example of such a case is group by. Suppose that we have a cluster
>>>> of 2 nodes with 2 GB available. Let a distributed cache, where simple
>>>> car objects (id,brand,colour) are stored and the total size of data is
>>>> 3.5GB. If all objects have the same colour , then all 3.5 GB would go to
>>>> only one reducer, as a result an OutOfMemoryException will be thrown.
>>>>
>>>> To overcome these limitations, I propose to add as parameter the name of
>>>> the intermediate cache to be used. This will enable the creation of a
>>>> custom configured cache that deals with the memory limitations.
>>>>
>>>> Another feature that I would like to have is to set the name of the
>>>> output cache. The reasoning behind this is similar to the one mentioned
>>>> above.
>>>>
>>>> I wait for your thoughts on these two suggestions.
>>>>
>>>> Regards,
>>>> Evangelos
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
> --
> Radim Vansa <rvansa at redhat.com>
> JBoss DataGrid QA
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
More information about the infinispan-dev
mailing list