[infinispan-dev] Map/Reduce or other batch processing on CacheLoader stored entries

Manik Surtani manik at jboss.org
Fri May 25 07:10:27 EDT 2012


Well, the start() bit may be less useful, but how do you know when the processor has been fed everything it needs, so that it can clean up?
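
One way to picture the question above: if the driver brackets the feed with the lifecycle callbacks, stop() becomes the "you have been fed everything" signal. The following is only a minimal sketch under stated assumptions. Lifecycle, Processor and processEntriesWith here are hypothetical stand-ins written for illustration, not the actual Infinispan types, and a plain String stands in for a cache entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LifecycleSketch {

    // Hypothetical stand-in for the Lifecycle interface mentioned in the thread.
    interface Lifecycle {
        void start();
        void stop();
    }

    // Hypothetical processor contract; String stands in for a cache entry.
    interface Processor extends Lifecycle {
        void process(String entry);
    }

    // Hypothetical driver standing in for processEntriesWith(Processor).
    static void processEntriesWith(Iterable<String> storedEntries, Processor p) {
        p.start(); // set up internal structures
        try {
            for (String entry : storedEntries) {
                p.process(entry);
            }
        } finally {
            // Reached once every entry has been fed (or the feed failed),
            // so this is where clean-up or completion messages would go.
            p.stop();
        }
    }

    // Records the callback order so the sequence is visible.
    static final class RecordingProcessor implements Processor {
        final List<String> events = new ArrayList<>();

        @Override public void start() { events.add("start"); }
        @Override public void process(String entry) { events.add("process:" + entry); }
        @Override public void stop() { events.add("stop"); }
    }

    public static void main(String[] args) {
        RecordingProcessor p = new RecordingProcessor();
        processEntriesWith(Arrays.asList("a", "b"), p);
        System.out.println(p.events); // [start, process:a, process:b, stop]
    }
}
```

With this shape the processor never has to guess when the feed is over; the trade-off is that every implementation must carry start() and stop(), even when they are empty.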

On 25 May 2012, at 12:02, Sanne Grinovero wrote:

> On 25 May 2012 11:33, Manik Surtani <manik at jboss.org> wrote:
>> Yes, as a one-off, but there should be a mechanism to set up internal structures and clean up/send finalisation messages to Hibernate Search or completion RPCs, etc.
> 
> Ah, got it. Maybe we need it, but I was initially (maybe naively)
> expecting to deal with initialization myself:
> 
> MassIndexingWorkCollector pwc = new MassIndexingWorkCollector(); // implements Processor
> pwc.initialize(.custom stuff..); // not defined on Processor
> // Blocking! So we know when we have finished loading all entries.
> [cacheLoader?].processEntriesWith(pwc);
> pwc.shutdownWorkers(); // not defined on Processor
> 
> minimal API ;-)
> But I guess when implementing for real I might need something like that.
> 
> Sanne
> 
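
The one-off style quoted above could look roughly like the following. This is a sketch under assumptions, not the real MassIndexingWorkCollector (which is not shown in the thread): WorkCollector, its thread pool, and the processEntriesWith stand-in are all hypothetical, and a String stands in for a cache entry. The point it illustrates is that the caller owns setup and teardown, and that the blocking feed call tells it when all entries have been handed over.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class OneOffSketch {

    // Minimal processor contract: no lifecycle methods at all.
    interface Processor {
        void process(String entry);
    }

    // Hypothetical collector that hands each entry to a worker pool.
    static final class WorkCollector implements Processor {
        private ExecutorService workers;
        final AtomicInteger indexed = new AtomicInteger();

        // Not part of Processor: the caller performs setup itself.
        void initialize(int threads) {
            workers = Executors.newFixedThreadPool(threads);
        }

        @Override
        public void process(String entry) {
            // Placeholder for real indexing work: just count the entry.
            workers.execute(() -> indexed.incrementAndGet());
        }

        // Not part of Processor: wait for queued work, then release the threads.
        void shutdownWorkers() {
            workers.shutdown();
            try {
                if (!workers.awaitTermination(10, TimeUnit.SECONDS)) {
                    workers.shutdownNow();
                }
            } catch (InterruptedException e) {
                workers.shutdownNow();
                Thread.currentThread().interrupt();
            }
        }
    }

    // Hypothetical stand-in for the blocking feed: returns only after
    // every stored entry has been passed to the processor.
    static void processEntriesWith(Iterable<String> storedEntries, Processor p) {
        for (String entry : storedEntries) {
            p.process(entry);
        }
    }

    public static void main(String[] args) {
        WorkCollector pwc = new WorkCollector();
        pwc.initialize(2);
        processEntriesWith(Arrays.asList("k1", "k2", "k3"), pwc);
        pwc.shutdownWorkers();
        System.out.println(pwc.indexed.get()); // 3
    }
}
```

Note that the blocking call only guarantees the entries were handed over; the asynchronous work itself is complete only once shutdownWorkers() returns.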
>> 
>> On 25 May 2012, at 11:31, Sanne Grinovero wrote:
>> 
>>> On 25 May 2012 10:57, Manik Surtani <manik at jboss.org> wrote:
>>>> #processEntriesWith(Processor p)
>>>> 
>>>> interface Processor extends Lifecycle { // Lifecycle for start() and stop() methods…
>>>>   void process(CacheEntry e);
>>>>   void process(Collection<CacheEntry> e);
>>>>   boolean processMoreEntries();
>>>> }
>>> 
>>> Why the Lifecycle start/stop?
>>> I expect to use it as a one-off, not as something which is
>>> "permanently hooked": looks like you're thinking about a different
>>> problem?
>>> 
>>> The use case I'm thinking of is when we need to iterate over all entries
>>> in the cache store, such as:
>>> - Map/Reduce
>>>  - evaluating the average value of some attribute
>>>  - word counting
>>> - MassIndexer
>>> 
>>> In all the use cases I have in mind, you want to process all
>>> entries, and only once.
>>> So #processMoreEntries would be redundant, and I think we should
>>> choose just one of CacheEntry or Collection<CacheEntry>. Let's
>>> go with the simple CacheEntry?
>>> That should avoid creating short-lived collections, and when
>>> passing collections one would likely need to iterate over each element
>>> anyway to route the invocation to some internal process(CacheEntry e).
>>> 
>>> -- Sanne
>>> 
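
To make the single-entry variant concrete, here is a rough sketch of the word-counting use case from the list above, with each stored entry visited exactly once. Everything in it is an assumption made for illustration: the Processor shape, the processEntriesWith stand-in, and the use of a plain Map as the "store" are hypothetical and are not the Infinispan CacheLoader API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {

    // Hypothetical single-entry contract: no collection overload and no
    // processMoreEntries(), since every entry is visited once.
    interface Processor {
        void process(Map.Entry<String, String> entry);
    }

    // Counts the words found in the values of the processed entries.
    static final class WordCounter implements Processor {
        // TreeMap only so that the result prints in a stable order.
        final Map<String, Integer> counts = new TreeMap<>();

        @Override
        public void process(Map.Entry<String, String> entry) {
            String text = entry.getValue().trim();
            if (text.isEmpty()) {
                return; // nothing to count in an empty value
            }
            for (String word : text.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Hypothetical stand-in for iterating over all stored entries, once each.
    static void processEntriesWith(Map<String, String> store, Processor p) {
        for (Map.Entry<String, String> entry : store.entrySet()) {
            p.process(entry);
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = new LinkedHashMap<>();
        store.put("k1", "to be or");
        store.put("k2", "not to be");

        WordCounter counter = new WordCounter();
        processEntriesWith(store, counter);
        System.out.println(counter.counts); // {be=2, not=1, or=1, to=2}
    }
}
```

An average over some attribute could be kept the same way, with a running sum and count updated in process() and divided once the feed returns.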
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> --
>> Manik Surtani
>> manik at jboss.org
>> twitter.com/maniksurtani
>> 
>> Lead, Infinispan
>> http://www.infinispan.org
> 

--
Manik Surtani
manik at jboss.org
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org





