[infinispan-dev] MapReduce limitations and suggestions.

Radim Vansa rvansa at redhat.com
Tue Feb 18 03:59:37 EST 2014


Hi Etienne,

how does the requirement that all data for a key be provided to the 
Reducer as a whole work for distributed caches? There you'd get only a 
subset of the whole mapped set on each node (AFAIK each node maps its 
entries locally and performs a local reduction before executing the 
"global" reduction). Or are these M/R jobs applicable only to local 
caches?
I have to admit I have only a limited knowledge of M/R; could you give 
me an example where the algorithm works in a distributed environment 
and still cannot be parallelized?
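To make my question concrete, here is how I picture the current flow, sketched in plain Java (just the concept, not the actual Infinispan API; all names here are illustrative): each node maps its local entries and combines them locally, and only the per-node partial results are merged in the "global" reduction. This only works when the reduce function can be applied to partial results:

```java
import java.util.*;

class WordCountSketch {
    // "map" phase: emit (word, 1) pairs for the entries stored on one node
    static Map<String, List<Integer>> map(List<String> localValues) {
        Map<String, List<Integer>> out = new HashMap<>();
        for (String line : localValues)
            for (String w : line.split("\\s+"))
                out.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        return out;
    }

    // local "combine": collapse a node's pairs before they leave the node
    static Map<String, Integer> combine(Map<String, List<Integer>> mapped) {
        Map<String, Integer> out = new HashMap<>();
        mapped.forEach((w, ones) ->
            out.put(w, ones.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    // "global" reduce: merge the per-node partial sums
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> out = new HashMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> out.merge(w, c, Integer::sum));
        return out;
    }

    public static void main(String[] args) {
        // two "nodes", each holding a subset of the cache
        Map<String, Integer> n1 = combine(map(List.of("a b a")));
        Map<String, Integer> n2 = combine(map(List.of("b b c")));
        System.out.println(reduce(List.of(n1, n2)).get("b")); // 3
    }
}
```

Word count works because summing partial sums gives the total sum. If the Reducer genuinely needs all values of a key at once, the local combine step is not sound, which is exactly what I am asking about.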

Thanks

Radim

On 02/17/2014 09:18 AM, Etienne Riviere wrote:
> Hi Radim,
>
> I might misunderstand your suggestion, but many M/R jobs actually require running the two phases one after the other, and hence storing the intermediate results somewhere. While some jobs may slightly reduce intermediate memory usage by using a combiner function (e.g., the word-count example), I don’t see how we can avoid intermediate storage altogether.
>
> Thanks,
> Etienne (leads project — as Evangelos who initiated the thread)
>
> On 17 Feb 2014, at 08:48, Radim Vansa <rvansa at redhat.com> wrote:
>
>> I think that the intermediate cache is not required at all. The M/R
>> algorithm itself can (and should!) run within the memory occupied by the
>> result of the reduction. The current implementation, with Map first and
>> Reduce after that, will always have these problems; using a cache to
>> temporarily store the results is only a workaround.
>>
>> The only situation where a temporary cache could be useful is when the
>> result grows linearly (or close to that, or even faster) with the number
>> of reduced entries. This would be the case for a groupBy producing
>> Map<Color, List<Entry>> from all entries in the cache. Such a task does
>> not scale and should be redesigned anyway, but flushing the results into
>> a cache backed by a cache store could help.
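To illustrate the scaling argument in plain Java (a sketch of the concept only, not the Infinispan API; the Car shape is borrowed from Evangelos' example): a count-per-color reduction keeps a single accumulator per distinct key, while a groupBy keeps every entry, so its memory footprint grows with the input size:

```java
import java.util.*;

class GroupByVsFold {
    record Car(int id, String brand, String color) {}

    // scales: memory is one counter per distinct color, regardless of input size
    static Map<String, Integer> countByColor(List<Car> cars) {
        Map<String, Integer> counts = new HashMap<>();
        for (Car c : cars)
            counts.merge(c.color(), 1, Integer::sum); // incremental reduce
        return counts;
    }

    // does not scale: memory grows linearly with the number of entries
    static Map<String, List<Car>> groupByColor(List<Car> cars) {
        Map<String, List<Car>> groups = new HashMap<>();
        for (Car c : cars)
            groups.computeIfAbsent(c.color(), k -> new ArrayList<>()).add(c);
        return groups;
    }

    public static void main(String[] args) {
        List<Car> cars = List.of(
            new Car(1, "a", "red"), new Car(2, "b", "red"), new Car(3, "c", "blue"));
        System.out.println(countByColor(cars).get("red"));        // 2 (one int per key)
        System.out.println(groupByColor(cars).get("red").size()); // 2 (holds the entries themselves)
    }
}
```

The first shape can always run within the memory of the final result; the second is the case where only flushing to a cache-store-backed cache saves you.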
>>
>> Radim
>>
>> On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>>> Tristan,
>>>
>>> Actually, they are not addressed in this pull request, but the feature
>>> where a custom output cache is used instead of the results being
>>> returned is next in the implementation pipeline.
>>>
>>> Evangelos, indeed, depending on the reducer function, all intermediate
>>> KOut/VOut pairs might be moved to a single node. How would a custom
>>> cache help in this case?
>>>
>>> Regards,
>>> Vladimir
>>>
>>>
>>> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>>> Hi Evangelos,
>>>>
>>>> you might be interested in looking into a current pull request which
>>>> addresses some (all?) of these issues
>>>>
>>>> https://github.com/infinispan/infinispan/pull/2300
>>>>
>>>> Tristan
>>>>
>>>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>>> Hello everyone,
>>>>>
>>>>> I started using the MapReduce implementation of Infinispan and I came
>>>>> across some possible limitations. Thus, I want to make some
>>>>> suggestions about the MapReduce (MR) implementation of Infinispan.
>>>>> Depending on the algorithm, there might be some memory problems,
>>>>> especially for intermediate results.
>>>>> An example of such a case is group by. Suppose that we have a cluster
>>>>> of 2 nodes with 2 GB of memory available on each. Consider a
>>>>> distributed cache storing simple car objects (id, brand, colour),
>>>>> with a total data size of 3.5 GB. If all objects have the same
>>>>> colour, then all 3.5 GB would go to only one reducer, and as a result
>>>>> an OutOfMemoryError would be thrown.
>>>>>
>>>>> To overcome these limitations, I propose adding a parameter for the
>>>>> name of the intermediate cache to be used. This would enable the
>>>>> creation of a custom-configured cache that deals with the memory
>>>>> limitations.
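One way such a custom configuration might look: bound the in-memory size of the intermediate cache with eviction and passivate the overflow to a cache store. This is a rough sketch loosely modeled on Infinispan's declarative configuration; the element and attribute names here are illustrative and would need to be checked against the actual 6.0 schema:

```xml
<namedCache name="mr-intermediate">
   <!-- keep at most 100k intermediate pairs in memory -->
   <eviction strategy="LRU" maxEntries="100000"/>
   <!-- passivate evicted pairs to disk instead of dropping them -->
   <persistence passivation="true">
      <singleFile location="/tmp/mr-intermediate"/>
   </persistence>
</namedCache>
```

The M/R job would then be pointed at "mr-intermediate" by name, so the memory/disk trade-off is decided by whoever configures the cache, not by the M/R framework.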
>>>>>
>>>>> Another feature that I would like to have is the ability to set the
>>>>> name of the output cache. The reasoning behind this is similar to the
>>>>> one mentioned above.
>>>>>
>>>>> I look forward to your thoughts on these two suggestions.
>>>>>
>>>>> Regards,
>>>>> Evangelos
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> infinispan-dev at lists.jboss.org
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>
>>>>>
>>>>
>>
>> -- 
>> Radim Vansa <rvansa at redhat.com>
>> JBoss DataGrid QA
>>
>


-- 
Radim Vansa <rvansa at redhat.com>
JBoss DataGrid QA


