[hibernate-dev] Batch indexing API

Tue Jul 7 05:02:50 EDT 2009

OK, go ahead, whatever that means :).

On Jul 7, 2009, at 10:49, Sanne Grinovero wrote:

> inline:
>
> 2009/7/5 Emmanuel Bernard <emmanuel at hibernate.org>:
>>
>>>> cacheMode //when would you need something different than Ignore? Also,
>>>> I'd rather have CacheMode be a Search class to keep the independence
>>>> wrt Hibernate Core
>>>
>>> Depending on the model, it might be much faster to use the cache when
>>> the indexed entity has a @ManyToOne + @IndexedEmbedded relation to some
>>> entity that has a high probability of having been indexed already.
>>> Take book -> nation of publishing: you might have millions of books but
>>> just some hundreds of nations; if these nations need to be reloaded
>>> lazily over and over with a second query, a cache helps.
>>> I'll wrap it in a Search-specific enum, like I've seen in Annotations?
>>
>> Did you try? It seems that the first level cache would load the nation
>> object once per iteration. Given that cacheMode is unfortunately a
>> global setting for all entities, I'm wondering what's more efficient in
>> the end.
>>
>
> Well, I'm sure that it's not the best setting for most cases, but yes, I
> have tried it, and there are some situations in which it gives a major
> performance boost, especially on complex models having many relations of
> this type.
> Also, in this case the "first level cache" is very short-lived, and every
> thread has its own; being short-lived, there's not a big chance of a
> "first level cache hit". In contrast, the second level cache makes sure
> all "lookup tables" are loaded once for the whole process.
> Also, it only makes sense when using a real cache, properly configured,
> not the Hashtable one.
>
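The difference between a short-lived per-session cache and a process-wide cache can be illustrated with a small, self-contained sketch. This is not Hibernate code: LookupCacheSketch, loadNationFromDatabase and the two index* methods are hypothetical names, and plain maps stand in for the first and second level caches. It only counts how many simulated lookups reach the "database" in each case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the book -> nation example; not Hibernate code.
final class LookupCacheSketch {

    // Counts simulated "second query" round trips to the database.
    static final AtomicInteger databaseLoads = new AtomicInteger();

    // Simulates lazily loading one nation row by its code.
    static String loadNationFromDatabase(String code) {
        databaseLoads.incrementAndGet();
        return "Nation(" + code + ")";
    }

    // A new map per batch plays the role of the short-lived first level cache.
    static int indexWithSessionScopedCache(String[][] batches) {
        databaseLoads.set(0);
        for (String[] batch : batches) {
            Map<String, String> sessionCache = new HashMap<>(); // discarded after the batch
            for (String nationCode : batch) {
                sessionCache.computeIfAbsent(nationCode, LookupCacheSketch::loadNationFromDatabase);
            }
        }
        return databaseLoads.get();
    }

    // One map for the whole run plays the role of the second level cache.
    static int indexWithSharedCache(String[][] batches) {
        databaseLoads.set(0);
        Map<String, String> sharedCache = new ConcurrentHashMap<>();
        for (String[] batch : batches) {
            for (String nationCode : batch) {
                sharedCache.computeIfAbsent(nationCode, LookupCacheSketch::loadNationFromDatabase);
            }
        }
        return databaseLoads.get();
    }
}
```

With many small batches and few distinct nations, the session-scoped variant reloads the same nations in every batch, while the shared variant loads each nation once. Real numbers depend on the model and on the cache provider's configuration.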
>>>
>>>> optimizeAtEnd => optimizeOnFinish
>>>> optimizeAfterPurge
>>>> purgeAllAtStart => purgeBeforeIndexing ?, purgeAllOnStart, purgeOnStart
>>>
>>> I vote for "purgeAllOnStart"; I like "purgeAll" for consistency.
>>
>> That's reasonable. My idea was to remove "All" to allow the
>> implementation to evolve down the road, should Lucene provide a more
>> efficient solution to purge and create a new object, but that's a far-off
>> bet.
>>
>
> Ah well, I suppose this is not the only API you'll have to change,
> should that happen :-)
>
>>>
>>>> limitObjects => indexObjectsUpTo, indexFirstObjectsUpTo,
>>>> limitIndexedObjectsTo  //what's the use case?
>>>
>>> Mostly testing and development. I didn't have this in the first design,
>>> but having to test it often, it turned out to be quite useful to try
>>> the effect of some new Analyzer without having to reindex millions of
>>> records. Also, during changes to the entities you might want to see the
>>> effect of adding some new field / search option without having to wait
>>> for hours.
>>> I could have deleted data from the dev database, but I consider this
>>> option a bit more flexible.
>>> Actually, I can foresee some feature requests to be able to restrict
>>> the data, but we can think about that later. By the same reasoning we
>>> could leave this out for the moment, but it has been very useful for me.
>>
>> OK, let's mark this one as experimental; you seem to want more of the
>> API.
>>
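To see how the option names discussed so far might read together, here is a minimal sketch of a fluent options holder. It is not the Hibernate Search API: BatchIndexerOptionsSketch and plan are hypothetical, the defaults are assumptions made for the example, and the class only lists the steps a run would perform instead of touching any index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical options holder using the names from this thread; not the real API.
final class BatchIndexerOptionsSketch {
    // Defaults are assumptions made for this sketch only.
    private boolean purgeAllOnStart = true;
    private boolean optimizeAfterPurge = true;
    private boolean optimizeOnFinish = true;
    private long limitObjects = Long.MAX_VALUE; // effectively "no limit"

    BatchIndexerOptionsSketch purgeAllOnStart(boolean enabled) {
        this.purgeAllOnStart = enabled;
        return this;
    }

    BatchIndexerOptionsSketch optimizeAfterPurge(boolean enabled) {
        this.optimizeAfterPurge = enabled;
        return this;
    }

    BatchIndexerOptionsSketch optimizeOnFinish(boolean enabled) {
        this.optimizeOnFinish = enabled;
        return this;
    }

    // Experimental: index at most this many objects, e.g. to try a new Analyzer quickly.
    BatchIndexerOptionsSketch limitObjects(long maximum) {
        if (maximum < 0) {
            throw new IllegalArgumentException("maximum must not be negative");
        }
        this.limitObjects = maximum;
        return this;
    }

    // Returns the ordered steps a run with these options would perform.
    List<String> plan(long totalEntities) {
        List<String> steps = new ArrayList<>();
        if (purgeAllOnStart) {
            steps.add("purgeAll");
            if (optimizeAfterPurge) {
                steps.add("optimize");
            }
        }
        steps.add("index " + Math.min(totalEntities, limitObjects) + " objects");
        if (optimizeOnFinish) {
            steps.add("optimize");
        }
        return steps;
    }
}
```

For example, new BatchIndexerOptionsSketch().limitObjects(1000).optimizeOnFinish(false).plan(5000000L) describes a purge, an optimize, and an indexing pass capped at 1000 objects, which is the kind of quick development run the limit is meant for.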
>>>
>>>> start => should the Future actually return some stats? We can delay
>>>> that, but I don't like the JavaDoc claiming that we will always return
>>>> null
>>>
>>> I took that from the recommendations in the Future javadoc itself, but
>>> I agree with you that it doesn't feel very good.
>>> I could return (like you suggest) a reference to the
>>> IndexerProgressMonitor in use
>>> (see
>>> http://fisheye.jboss.org/browse/Hibernate/search/trunk/src/main/java/org/hibernate/search/batchindexing/IndexerProgressMonitor.java?r=16587
>>> )
>>> The API of the IndexerProgressMonitor will be a topic to discuss later
>>> (it is HSEARCH-370); for now there is one default impl which will log
>>> progress and some performance stats, which is why it is missing methods
>>> to retrieve the stats.
>>
>> Let's think about that and return null for now; I just want to relax
>> the strong statement in the javadoc
>>
> I'll remove the return javadoc, leaving it undocumented for now.
>
>>>
>>>
>>>> startAndWait => execute ?
>>>
>>> I preferred to stress the small difference from "start"; don't you
>>> think that having both a "start" method and an "execute" method would
>>> make it unclear which one I should call?
>>>
>>
>> I know, that's why I put a ? :)
>>
>
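The relation between "start" and "startAndWait", and the "Future returns null" point above, can be sketched with plain java.util.concurrent. This is an illustration under assumptions, not the actual implementation: BatchStartSketch is a hypothetical class and the indexing work is reduced to a Runnable. The one documented fact it relies on is that a Future obtained from ExecutorService.submit(Runnable) yields null from get() on successful completion.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the two entry points; not the Hibernate Search implementation.
final class BatchStartSketch {
    private final Runnable indexingWork;

    BatchStartSketch(Runnable indexingWork) {
        this.indexingWork = indexingWork;
    }

    // Asynchronous variant: returns immediately with a handle to the running work.
    Future<?> start() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // For a Runnable, the returned Future's get() yields null on success.
            return executor.submit(indexingWork);
        } finally {
            // No new tasks are accepted; the submitted work still runs to completion.
            executor.shutdown();
        }
    }

    // Blocking variant: the same work, but the caller waits until it has finished.
    void startAndWait() throws InterruptedException {
        try {
            start().get();
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing failed", e.getCause());
        }
    }
}
```

Written this way, "startAndWait" is just "start" followed by a blocking get(), which is the small difference the name is meant to stress. Returning a stats or progress object instead of null would mean submitting a Callable rather than a Runnable.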
> thanks for the insight,
> Sanne