On 02/17/2014 10:42 AM, infinispan-dev-request(a)lists.jboss.org wrote:
Hi Etienne
I was going to suggest using a combiner - the combiner would process the
mapper results from just one node, so you should need at most double the
memory on that node. I guess we could reduce the memory requirements even
more if the combiner could run concurrently with the mapper... Vladimir,
does it sound like a reasonable feature request?
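To make the combiner idea concrete, here is a plain-Java sketch (no Infinispan API involved; all names are made up for illustration) of why a local combine step bounds per-node memory: instead of shipping every mapped pair to the reducers, each node first collapses its own pairs, so at most one entry per distinct key leaves the node.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Word-count-style illustration of the map -> combine -> reduce flow.
public class CombinerSketch {

    // Local "combine" step: fold one node's mapped keys into partial counts,
    // so the node holds one entry per distinct key instead of one per pair.
    static Map<String, Integer> combine(List<String> mappedKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String k : mappedKeys) {
            partial.merge(k, 1, Integer::sum);
        }
        return partial;
    }

    // Global "reduce" step: merge the per-node partial results.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((k, v) -> result.merge(k, v, Integer::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> node1 = combine(List.of("red", "red", "blue"));
        Map<String, Integer> node2 = combine(List.of("red", "blue"));
        Map<String, Integer> total = reduce(List.of(node1, node2));
        System.out.println(total.get("red"));  // 3
        System.out.println(total.get("blue")); // 2
    }
}
```

This only works because counting is associative and commutative; as noted below, there are algorithms (e.g. computing a median per key) where no such combine step exists.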
There are algorithms where combiners cannot be applied.
I'm afraid in your situation using a cache store wouldn't help, as the
intermediate values for the same key are stored as a list in a single
entry. So if all cars are red, there would be just one intermediate key in
the intermediate cache, and there would be nothing to evict to the cache
store. Vladimir, do you think we could somehow "chunk" the intermediary
values into multiple entries grouped by the intermediary key, to support
this scenario?
I was thinking of a custom cache implementation that maintains the overall
size of the cache and of each key individually, and when a threshold is
reached it spills things to disk. Note that I am not familiar with the
internals of Infinispan, but I think it is doable. Such a cache solves the
problem in both cases (when one key is too large to fit in memory, as in my
example, and when the keys assigned to one reducer exceed its memory).
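A minimal sketch of that spill-on-threshold idea, in plain Java rather than Infinispan internals (the class and its behaviour are hypothetical, purely to illustrate the mechanism): intermediate values are buffered per key in memory and appended to a per-key temp file once a key's buffer passes a threshold.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical spilling store: not Infinispan code, just the mechanism.
public class SpillingStore {
    private final int maxInMemory;
    private final Path dir;
    private final Map<String, List<String>> buffer = new HashMap<>();

    public SpillingStore(int maxInMemory, Path dir) {
        this.maxInMemory = maxInMemory;
        this.dir = dir;
    }

    public void put(String key, String value) {
        List<String> values = buffer.computeIfAbsent(key, k -> new ArrayList<>());
        values.add(value);
        if (values.size() >= maxInMemory) {
            spill(key, values); // threshold reached: move this key's values to disk
        }
    }

    private void spill(String key, List<String> values) {
        try {
            Path file = dir.resolve(key + ".spill");
            Files.write(file, values, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            values.clear(); // free the heap this key was occupying
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The reduce phase reads a key's values back: spilled lines first,
    // then whatever is still buffered in memory.
    public List<String> valuesFor(String key) {
        try {
            List<String> all = new ArrayList<>();
            Path file = dir.resolve(key + ".spill");
            if (Files.exists(file)) {
                all.addAll(Files.readAllLines(file));
            }
            all.addAll(buffer.getOrDefault(key, List.of()));
            return all;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        SpillingStore store = new SpillingStore(2, Files.createTempDirectory("spill"));
        for (int i = 0; i < 5; i++) {
            store.put("red", "car" + i);
        }
        System.out.println(store.valuesFor("red").size()); // 5
    }
}
```

With a threshold of 2, only two values per key ever sit in memory at once; the rest live on disk until the reduce phase streams them back.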
For reference, though, a limited version of what you're asking for is
already available. You can change the configuration of the intermediary
cache by defining a "__tmpMapReduce" cache in your configuration. That
configuration will be used for all M/R tasks, whether they use the shared
intermediate cache or they create their own.
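For illustration, such a "__tmpMapReduce" definition might look roughly like the following in Infinispan-6-style XML. This is a hedged sketch: the cache name comes from the text above, but the schema details and the eviction/store values are assumptions to be checked against the schema of the Infinispan version in use.

```xml
<!-- Illustrative only: element names and values must be verified against
     your Infinispan version's configuration schema. -->
<namedCache name="__tmpMapReduce">
   <!-- Cap the number of in-memory intermediate entries. -->
   <eviction strategy="LRU" maxEntries="10000"/>
   <!-- Passivate evicted intermediate entries to a file-based store. -->
   <persistence passivation="true">
      <singleFile location="/tmp/mapreduce-intermediate"/>
   </persistence>
</namedCache>
```

As Dan notes above, this only helps if the intermediate values are spread over many keys; with a single huge key there is nothing to evict.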
I have one question about this: if I start two MR tasks at once, will
these tasks use the same cache? If so, could the intermediate results get
mixed? This could also serve as a test case.
Regards,
Evangelos
Cheers
Dan
On Mon, Feb 17, 2014 at 10:18 AM, Etienne Riviere
<etienne.riviere(a)unine.ch>wrote:
> > Hi Radim,
> >
> > I might misunderstand your suggestion, but many M/R jobs actually require
> > running the two phases one after the other, and hence storing the
> > intermediate results somewhere. While some may slightly reduce intermediate
> > memory usage by using a combiner function (e.g., the word-count example), I
> > don't see how we can avoid intermediate storage altogether.
> >
> > Thanks,
> > Etienne (leads project -- as Evangelos who initiated the thread)
> >
> > On 17 Feb 2014, at 08:48, Radim Vansa <rvansa(a)redhat.com> wrote:
> >
>> > > I think that the intermediate cache is not required at all. The M/R
>> > > algorithm itself can (and should!) run within the memory occupied by the
>> > > result of the reduction. The current implementation, with Map first and
>> > > Reduce after that, will always have these problems; using a cache for
>> > > temporarily storing the result is only a workaround.
>> > >
>> > > The only situation where a temporary cache could be useful is when the
>> > > result grows linearly (or close to that, or even faster) with the number
>> > > of reduced entries. This would be the case for a groupBy producing a
>> > > Map<Color, List<Entry>> from all entries in the cache. Then the task does
>> > > not scale and should be redesigned anyway, but flushing the results into
>> > > a cache backed by a cache store could help.
>> > >
>> > > Radim
>> > >
>> > > On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>>> > >> Tristan,
>>> > >>
>>> > >> Actually, they are not addressed in this pull request, but the feature
>>> > >> where a custom output cache is used instead of the results being
>>> > >> returned is next in the implementation pipeline.
>>> > >>
>>> > >> Evangelos, indeed, depending on the reducer function, all intermediate
>>> > >> KOut/VOut pairs might be moved to a single node. How would a custom
>>> > >> cache help in this case?
>>> > >>
>>> > >> Regards,
>>> > >> Vladimir
>>> > >>
>>> > >>
>>> > >> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>>> > >>> Hi Evangelos,
>>>> > >>>
>>>> > >>> you might be interested in looking into a current pull request which
>>>> > >>> addresses some (all?) of these issues:
>>>> > >>>
>>>> > >>> https://github.com/infinispan/infinispan/pull/2300
>>>> > >>>
>>>> > >>> Tristan
>>>> > >>>
>>>> > >>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>>> > >>>> Hello everyone,
>>>>> > >>>>
>>>>> > >>>> I started using the MapReduce implementation of Infinispan and I
>>>>> > >>>> came across some possible limitations. Thus, I want to make some
>>>>> > >>>> suggestions about the MapReduce (MR) implementation of Infinispan.
>>>>> > >>>> Depending on the algorithm, there might be some memory problems,
>>>>> > >>>> especially with intermediate results.
>>>>> > >>>> An example of such a case is group-by. Suppose that we have a
>>>>> > >>>> cluster of 2 nodes with 2 GB available each, and a distributed
>>>>> > >>>> cache where simple car objects (id, brand, colour) are stored,
>>>>> > >>>> the total size of the data being 3.5 GB. If all objects have the
>>>>> > >>>> same colour, then all 3.5 GB would go to only one reducer, and as
>>>>> > >>>> a result an OutOfMemoryError would be thrown.
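The skew described above can be sketched in plain Java (names are illustrative, not Infinispan API): when every entry shares the same grouping key, the entire data set collapses under one intermediate key, which is exactly the portion a single reducer, and hence a single node's heap, must hold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group-by over cars where every car has the same colour: one key, one
// reducer, all the data on one node.
public class GroupBySkew {
    record Car(int id, String brand, String colour) {}

    // The "map + shuffle" outcome of a group-by: values bucketed by key.
    static Map<String, List<Car>> groupByColour(List<Car> cars) {
        Map<String, List<Car>> groups = new HashMap<>();
        for (Car c : cars) {
            groups.computeIfAbsent(c.colour(), k -> new ArrayList<>()).add(c);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Car> cars = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            cars.add(new Car(i, "brand" + (i % 7), "red")); // all the same colour
        }
        Map<String, List<Car>> groups = groupByColour(cars);
        System.out.println(groups.size());            // 1: a single intermediate key
        System.out.println(groups.get("red").size()); // 1000: the whole data set
    }
}
```

Scaled up from 1,000 toy objects to 3.5 GB of real ones, that single "red" bucket is what overflows a 2 GB heap.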
>>>>> > >>>>
>>>>> > >>>> To overcome these limitations, I propose adding a parameter for
>>>>> > >>>> the name of the intermediate cache to be used. This will enable
>>>>> > >>>> the creation of a custom-configured cache that deals with the
>>>>> > >>>> memory limitations.
>>>>> > >>>>
>>>>> > >>>> Another feature that I would like to have is the ability to set
>>>>> > >>>> the name of the output cache. The reasoning behind this is
>>>>> > >>>> similar to the one mentioned above.
>>>>> > >>>>
>>>>> > >>>> I look forward to your thoughts on these two suggestions.
>>>>> > >>>>
>>>>> > >>>> Regards,
>>>>> > >>>> Evangelos
>>>>> > >>>>
>>>>> > >>>> _______________________________________________
>>>>> > >>>> infinispan-dev mailing list
>>>>> > >>>> infinispan-dev(a)lists.jboss.org
>>>>> > >>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>> > >>>>
>>>> > >>>
>>>> > >>>
>> > >
>> > >
>> > > --
>> > > Radim Vansa <rvansa(a)redhat.com>
>> > > JBoss DataGrid QA
>> > >