[infinispan-dev] MapReduce limitations and suggestions.

Fri Feb 14 10:10:55 EST 2014

Hello everyone,

I have started using Infinispan's MapReduce (MR) implementation and have
come across some possible limitations, so I would like to make two
suggestions.
Depending on the algorithm, there can be memory problems, especially
with intermediate results.
A group-by is one example. Suppose we have a cluster of 2 nodes, each
with 2 GB of memory available, and a distributed cache that stores
simple car objects (id, brand, colour) with a total data size of 3.5 GB.
If all objects have the same colour, then all 3.5 GB of intermediate
values go to a single reducer, and an OutOfMemoryError is thrown.
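
To make the skew concrete, here is a small stand-alone sketch in plain
Java. It does not use the Infinispan API at all; the class name, the
helper methods and the hash-modulo assignment of keys to reducers are
assumptions made only for illustration. It shows that a group-by on a
single distinct key sends every intermediate value to one reducer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a plain-Java simulation, not Infinispan code.
class GroupBySkewSketch {

    /** Simple value object matching the (id, brand, colour) example. */
    static final class Car {
        final int id;
        final String brand;
        final String colour;

        Car(int id, String brand, String colour) {
            this.id = id;
            this.brand = brand;
            this.colour = colour;
        }
    }

    /** Map phase of a group-by: collect every car under its colour. */
    static Map<String, List<Car>> groupByColour(List<Car> cars) {
        Map<String, List<Car>> intermediate = new HashMap<>();
        for (Car car : cars) {
            intermediate.computeIfAbsent(car.colour, k -> new ArrayList<>()).add(car);
        }
        return intermediate;
    }

    /**
     * Assigns each intermediate key to one reducer (hash modulo, an
     * assumption for this sketch) and counts the values each one receives.
     */
    static int[] valuesPerReducer(Map<String, List<Car>> intermediate, int reducers) {
        int[] load = new int[reducers];
        for (Map.Entry<String, List<Car>> entry : intermediate.entrySet()) {
            int owner = Math.floorMod(entry.getKey().hashCode(), reducers);
            load[owner] += entry.getValue().size();
        }
        return load;
    }

    public static void main(String[] args) {
        List<Car> cars = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            cars.add(new Car(i, "brand-" + (i % 5), "red")); // every car has the same colour
        }
        int[] load = valuesPerReducer(groupByColour(cars), 2);
        System.out.println("reducer 0: " + load[0] + ", reducer 1: " + load[1]);
    }
}
```

With only one distinct colour, one reducer receives all the values and
the other receives none, regardless of how many nodes are in the
cluster.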

To overcome this limitation, I propose adding the name of the
intermediate cache as a parameter of the MapReduce task. This would
allow users to supply a custom-configured cache that copes with the
memory constraints.

Another feature I would like to have is the ability to set the name of
the output cache. The reasoning behind this is similar to the one given
above.
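
A rough sketch of how the two suggestions could look from the caller's
side. Everything here is hypothetical: MapReduceTaskSettings,
withIntermediateCache and withOutputCache are names invented for this
example and are not existing Infinispan API, and the cache names are
placeholders. The sketch only models the two proposed parameters; it
does not run a MapReduce task.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical settings holder; none of these names exist in Infinispan.
final class MapReduceTaskSettings {
    // null means "use whatever intermediate cache the framework picks" (assumption)
    private final String intermediateCacheName;
    // null means "no named output cache was requested" (assumption)
    private final String outputCacheName;

    private MapReduceTaskSettings(String intermediateCacheName, String outputCacheName) {
        this.intermediateCacheName = intermediateCacheName;
        this.outputCacheName = outputCacheName;
    }

    /** Settings with neither cache name set. */
    static MapReduceTaskSettings defaults() {
        return new MapReduceTaskSettings(null, null);
    }

    /** First suggestion: name the cache that holds intermediate results. */
    MapReduceTaskSettings withIntermediateCache(String cacheName) {
        return new MapReduceTaskSettings(Objects.requireNonNull(cacheName), outputCacheName);
    }

    /** Second suggestion: name the cache that receives the final results. */
    MapReduceTaskSettings withOutputCache(String cacheName) {
        return new MapReduceTaskSettings(intermediateCacheName, Objects.requireNonNull(cacheName));
    }

    Optional<String> intermediateCacheName() {
        return Optional.ofNullable(intermediateCacheName);
    }

    Optional<String> outputCacheName() {
        return Optional.ofNullable(outputCacheName);
    }

    public static void main(String[] args) {
        MapReduceTaskSettings settings = MapReduceTaskSettings.defaults()
                .withIntermediateCache("mr-intermediate")   // placeholder cache name
                .withOutputCache("cars-by-colour");         // placeholder cache name
        System.out.println("intermediate cache: " + settings.intermediateCacheName().get());
        System.out.println("output cache: " + settings.outputCacheName().get());
    }
}
```

The named caches themselves would be defined in the normal cache
configuration, so that memory-related settings stay under the user's
control rather than inside the MapReduce code.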

I look forward to your thoughts on these two suggestions.

Regards,
Evangelos