[hibernate-dev] Batch indexing API

Tue Jul 7 04:49:43 EDT 2009

inline:

2009/7/5 Emmanuel Bernard <emmanuel at hibernate.org>:
>
>>> cacheMode //when would you need something different than Ignore? Also,
>>> I'd rather have CacheMode be a Search class to keep the independence
>>> wrt Hibernate Core
>>
>> Depending on the model, using the cache might be much faster when the
>> indexed entity has a @ManyToOne + @IndexedEmbedded relation to some
>> entity that has a high probability of having been indexed already.
>> Like book -> nation of publishing: you might have millions of books
>> but just some hundreds of nations; if these nations need to be
>> reloaded over and over lazily with a second query, a cache helps.
>> I'll wrap it in a Search-specific enum, like I've seen in Annotations?
>
> Did you try? It seems that the first level cache would load the nation
> object once per iteration. Provided that cacheMode is unfortunately a global
> setting for all entities, I'm wondering what's more efficient in the end.
>

Well, I'm sure it's not the best setting for most cases, but yes, I
have tried it, and there are some situations in which it gives a major
performance boost, especially on complex models having many relations
of this type.
Also, in this case the "first level cache" is very short lived, and
every thread has its own; being short lived, there's not much chance
of a "first level cache hit". In contrast, the second level cache
makes sure all "lookup tables" are loaded once for the whole process.
It also only makes sense when using a real cache, properly configured,
not the Hashtable one.
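
To make the argument concrete, here is a rough simulation in plain Java.
It is not Hibernate code: the class, the method names, and the numbers
(10000 books, 200 nations, a session cleared every 10 books) are all made
up for illustration, and the "second level cache" is only a shared map
standing in for a real cache provider.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: counts how often a "nation" lookup row is loaded.
public class LookupCacheSketch {

    // Number of simulated "second query" round trips to the database.
    static final AtomicInteger dbLoads = new AtomicInteger();

    static String loadNationFromDb(int nationId) {
        dbLoads.incrementAndGet();
        return "nation-" + nationId;
    }

    // Short-lived cache: emptied every batchSize books, like a session
    // that is cleared regularly during batch work.
    static int loadsWithShortLivedCache(int books, int nations, int batchSize) {
        dbLoads.set(0);
        Map<Integer, String> firstLevel = new HashMap<>();
        for (int book = 0; book < books; book++) {
            if (book % batchSize == 0) {
                firstLevel = new HashMap<>(); // session cleared
            }
            firstLevel.computeIfAbsent(book % nations, LookupCacheSketch::loadNationFromDb);
        }
        return dbLoads.get();
    }

    // Process-wide cache: each nation is loaded once for the whole run.
    static int loadsWithSharedCache(int books, int nations) {
        dbLoads.set(0);
        Map<Integer, String> secondLevel = new ConcurrentHashMap<>();
        for (int book = 0; book < books; book++) {
            secondLevel.computeIfAbsent(book % nations, LookupCacheSketch::loadNationFromDb);
        }
        return dbLoads.get();
    }

    public static void main(String[] args) {
        System.out.println("short-lived cache loads: " + loadsWithShortLivedCache(10000, 200, 10));
        System.out.println("shared cache loads:      " + loadsWithSharedCache(10000, 200));
    }
}
```

With these invented numbers the short-lived cache never gets a hit,
while the shared one loads each nation exactly once. Real models will
sit somewhere in between, so this only shows the shape of the effect.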

>>
>>> optimizeAtEnd => optimizeOnFinish
>>> optimizeAfterPurge
>>> purgeAllAtStart => purgeBeforeIndexing ?, purgeAllOnStart, purgeOnStart
>>
>> I vote for "purgeAllOnStart"; I like "purgeAll" to be consistent.
>
> That's reasonable, my idea was to remove All to allow the implementation to
> evolve down the road should Lucene provide a more efficient solution to
> purge and create a new object but that's a far off bet.
>

Ah well, I suppose this is not the only API you'll have to change,
should that happen :-)

>>
>>> limitObjects => indexObjectsUpTo, indexFirstObjectsUpTo,
>>> limitIndexedObjectsTo  //what's the use case?
>>
>> Mostly testing and development. I didn't have this in the first design,
>> but while testing it often I found it quite useful to be able to try
>> the effect of some new Analyzer without having to reindex millions of
>> records. Also, during changes to the entities you might want to see
>> the effect of adding some new field / search option without having to
>> wait for hours.
>> I could have deleted data from the dev database, but I consider having
>> this option a bit more flexible.
>> Actually I can foresee some feature requests to be able to restrict
>> the data, but we can think about that later. By the same reasoning we
>> could leave this out for the moment, but it has been very useful for me.
>
> OK, let's mark this one as experimental, you seem to want more of the API.
>
>>
>>> start => should the Future actually return some stats? We can delay
>>> that but I don't like the JavaDoc claiming that we will always return
>>> null
>>
>> I took that from the recommendations in the Future javadoc itself, but
>> I agree with you that it doesn't feel very good.
>> I could return (like you suggest) a reference to the
>> IndexerProgressMonitor in use
>> (see
>> http://fisheye.jboss.org/browse/Hibernate/search/trunk/src/main/java/org/hibernate/search/batchindexing/IndexerProgressMonitor.java?r=16587
>> )
>> The API of the IndexerProgressMonitor will be a topic to discuss later
>> (it is HSEARCH-370); for now there is one default impl which logs
>> progress and some performance stats, which is why it is missing
>> methods to retrieve the stats.
>
> Let's think about that and return null for now, I just want to relax the
> strong statement in the javadoc
>
I'll remove the return javadoc, leaving it undocumented for now.

>>
>>
>>> startAndWait => execute ?
>>
>> I preferred to stress the little difference from "start"; don't you
>> think that having both a "start" method and an "execute" method
>> would make it unclear which one I should call?
>>
>
> I know, that's why I put a ? :)
>
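
For reference, the difference between the two methods could be sketched
like this. It is only an illustration under assumptions: the class name,
the counter, and the single-thread executor are invented here, and the
real indexer does far more than increment a number. The part taken from
java.util.concurrent is that a Future obtained by submitting a Runnable
yields null from get() once the task has completed.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a non-blocking start() next to a blocking startAndWait().
public class StartVersusWaitSketch {

    final AtomicInteger indexed = new AtomicInteger();
    private final int total;

    StartVersusWaitSketch(int total) {
        this.total = total;
    }

    // Returns immediately; the caller decides whether and when to wait.
    Future<?> start() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Runnable work = () -> {
            for (int i = 0; i < total; i++) {
                indexed.incrementAndGet(); // stands in for indexing one entity
            }
        };
        Future<?> future = executor.submit(work);
        executor.shutdown(); // already submitted work still runs to completion
        return future;
    }

    // Same job, but the calling thread blocks until it has finished.
    void startAndWait() throws InterruptedException {
        try {
            start().get();
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing failed", e.getCause());
        }
    }

    public static void main(String[] args) throws Exception {
        StartVersusWaitSketch job = new StartVersusWaitSketch(1000);
        job.startAndWait();
        System.out.println("indexed after startAndWait: " + job.indexed.get());
    }
}
```

In this shape startAndWait is nothing more than start().get() with the
exception handling folded in, which is the "little difference" the name
is meant to stress.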

Thanks for the insight,
Sanne