[infinispan-dev] MapReduce limitations and suggestions.
Evangelos Vazaios
vagvaz at gmail.com
Fri Feb 14 10:10:55 EST 2014
Hello everyone,
I have started using Infinispan's MapReduce (MR) implementation and
have run into some possible limitations, so I would like to make a few
suggestions about it.
Depending on the algorithm, a job can run into memory problems,
especially when holding intermediate results.
An example of such a case is group by. Suppose that we have a cluster
of 2 nodes with 2 GB of memory available, and a distributed cache that
stores simple car objects (id, brand, colour) with a total data size of
3.5 GB. If we group by colour and all objects have the same colour, then
all 3.5 GB go to a single reducer, and as a result an OutOfMemoryError
is thrown.
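To make the skew concrete, here is a minimal plain-Java sketch of the
grouping step. It does not use the Infinispan API at all: the Car class,
the groupByColour and largestGroup helpers, and the 1000-object data set
are hypothetical stand-ins chosen for illustration, and a HashMap plays
the role of the intermediate results.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupBySkewSketch {

    /** Hypothetical stand-in for the car objects (id, brand, colour). */
    static final class Car {
        final int id;
        final String brand;
        final String colour;

        Car(int id, String brand, String colour) {
            this.id = id;
            this.brand = brand;
            this.colour = colour;
        }
    }

    /** Collects cars per colour, the way intermediate values are collected per key. */
    static Map<String, List<Car>> groupByColour(List<Car> cars) {
        Map<String, List<Car>> groups = new HashMap<>();
        for (Car car : cars) {
            groups.computeIfAbsent(car.colour, k -> new ArrayList<>()).add(car);
        }
        return groups;
    }

    /** Largest number of values that a single key (and so a single reducer) receives. */
    static int largestGroup(Map<String, List<Car>> groups) {
        int max = 0;
        for (List<Car> group : groups.values()) {
            max = Math.max(max, group.size());
        }
        return max;
    }

    public static void main(String[] args) {
        // Assumed toy data: 1000 cars, ten brands, but only one colour.
        List<Car> cars = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            cars.add(new Car(i, "brand" + (i % 10), "red"));
        }
        Map<String, List<Car>> groups = groupByColour(cars);
        System.out.println("groups: " + groups.size());
        System.out.println("largest group: " + largestGroup(groups) + " of " + cars.size());
    }
}
```

With a single colour there is only one group, so the whole data set ends
up behind one key no matter how many nodes the cluster has.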
To overcome this limitation, I propose adding the name of the
intermediate cache as a parameter. This would allow the creation of a
custom-configured cache that copes with the memory limitations.
Another feature that I would like to have is the ability to set the
name of the output cache. The reasoning behind this is the same as
above.
I look forward to your thoughts on these two suggestions.
Regards,
Evangelos