Batch indexing API - hibernate-dev - Jboss List Archives


Batch indexing API


Sanne Grinovero

Tuesday, 30 June 2009

9:18 a.m.

Hello,
I need some comments about the batch indexing API, so that I can stabilize it and write the documentation; I might even blog about it :-)

Here is the current sketch:
http://fisheye.jboss.org/browse/Hibernate/search/trunk/src/main/java/org/...

Emmanuel, I remember you wanted to give me directions to make this "fluent", and you didn't like the method names. Ideas?

Sanne


Emmanuel Bernard

Friday, 3 July

4:05 a.m.

Overall I think it's good. Some proposals (when followed by a ?, my feeling is that the original name is as good or better). Generally, I've inverted the names to put the most important word first.

Indexer => MassIndexer
objectLoadingThreads => threadsToLoadObjects
objectLoadingBatchSize => batchSizeToLoadObjects
documentBuilderThreads => threadsForSubsequentFetching, threadsForFetching
indexWriterThreads => threadsIndexingToLucene
cacheMode // when would you need something different than Ignore? Also, I'd rather have CacheMode be a Search class, to keep the independence wrt Hibernate Core
optimizeAtEnd => optimizeOnFinish
optimizeAfterPurge
purgeAllAtStart => purgeBeforeIndexing ?, purgeAllOnStart, purgeOnStart
limitObjects => indexObjectsUpTo, indexFirstObjectsUpTo, limitIndexedObjectsTo // what's the use case?
start => should the Future's get() actually return some stats? We can delay that, but I don't like the JavaDoc claiming that we will always return null
startAndWait => execute ?

WTY?
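To make the "fluent" idea concrete, here is a minimal sketch of how a few of the names proposed above could chain together. Everything in it is hypothetical: the `MassIndexerSketch` class, its defaults and its `describe()` method are assumptions made for illustration, not the code in the Hibernate Search trunk, and the class only records settings without indexing anything.

```java
// Hypothetical sketch: records settings only, performs no indexing.
final class MassIndexerSketch {
    private int threadsToLoadObjects = 1;          // assumed default
    private int batchSizeToLoadObjects = 10;       // assumed default
    private int threadsForSubsequentFetching = 1;  // assumed default
    private boolean purgeAllOnStart = true;        // assumed default
    private boolean optimizeOnFinish = true;       // assumed default

    MassIndexerSketch threadsToLoadObjects(int threads) {
        this.threadsToLoadObjects = requirePositive(threads);
        return this; // returning "this" is what allows chaining
    }

    MassIndexerSketch batchSizeToLoadObjects(int size) {
        this.batchSizeToLoadObjects = requirePositive(size);
        return this;
    }

    MassIndexerSketch threadsForSubsequentFetching(int threads) {
        this.threadsForSubsequentFetching = requirePositive(threads);
        return this;
    }

    MassIndexerSketch purgeAllOnStart(boolean purge) {
        this.purgeAllOnStart = purge;
        return this;
    }

    MassIndexerSketch optimizeOnFinish(boolean optimize) {
        this.optimizeOnFinish = optimize;
        return this;
    }

    // Summarizes the recorded configuration as a single line.
    String describe() {
        return "loaders=" + threadsToLoadObjects
                + ", batch=" + batchSizeToLoadObjects
                + ", fetchers=" + threadsForSubsequentFetching
                + ", purgeAllOnStart=" + purgeAllOnStart
                + ", optimizeOnFinish=" + optimizeOnFinish;
    }

    private static int requirePositive(int value) {
        if (value < 1) {
            throw new IllegalArgumentException("value must be >= 1, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        String summary = new MassIndexerSketch()
                .threadsToLoadObjects(4)
                .batchSizeToLoadObjects(25)
                .purgeAllOnStart(false)
                .describe();
        System.out.println(summary);
    }
}
```

Putting the most important word first, as suggested, also groups related options together in IDE completion (all `threads...` setters sort next to each other).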


Sanne Grinovero

Saturday, 4 July

1:59 p.m.

Thanks, answering inline:

2009/7/3 Emmanuel Bernard <emmanuel(a)hibernate.org>:

> Overall I think it's good. some proposals (when followed by a ? my feeling is that the original name is as good or better)
> Generally, I've inversed names to make the most important word first.
> Indexer => MassIndexer

Fine.

> objectLoadingThreads => threadsToLoadObjects
> objectLoadingBatchSize => batchSizeToLoadObjects
> documentBuilderThreads => threadsForSubsequentFetching, threadsForFetching
> indexWriterThreads => threadsIndexingToLucene

Agreed on all of the above.

> cacheMode //when would you need something different than Ignore? Also, I'd rather get CacheMode be a Search class to keep the independance wrt Hibernate Core

Depending on the model, using the cache can be much faster when the indexed entity has a @ManyToOne + @IndexedEmbedded relation to an entity that has very likely been indexed already. Take book -> nation of publishing: you might have millions of books but only a few hundred nations; if those nations have to be reloaded lazily over and over with a second query, a cache helps.
I'll wrap it in a Search-specific enum, like I've seen in Annotations?

> optimizeAtEnd => optimizeOnFinish
> optimizeAfterPurge
> purgeAllAtStart => purgeBeforeIndexing ?, purgeAllOnStart, purgeOnStart

I vote for "purgeAllOnStart"; I like "purgeAll", to be consistent.

> limitObjects => indexObjectsUpTo, indexFirstObjectsUpTo, limitIndexedObjectsTo //what's the use case?

Mostly testing and development. I didn't have this in the first design, but having to test often, it turned out to be quite useful to try the effect of a new Analyzer without reindexing millions of records. Also, while changing the entities you might want to see the effect of adding a new field or search option without waiting for hours.
I could have deleted data from the dev database, but I consider this option a bit more flexible. Actually, I can foresee feature requests to be able to restrict the data, but we can think about that later. By the same reasoning we could leave this out for the moment, but it has been very useful for me.
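The use case described here amounts to capping how many entities the indexer consumes. A minimal sketch of that idea, assuming a hypothetical `indexUpTo` helper over any iterator of entities; the class, the method and the string "entities" are invented for illustration and are not part of the API under discussion.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the "limit" option: stop feeding the indexer after N entities.
final class LimitedIndexingSketch {

    // Passes at most `limit` entities to the indexing callback
    // and returns how many were actually passed.
    static <T> long indexUpTo(Iterator<T> entities, long limit, Consumer<? super T> indexer) {
        long indexed = 0;
        while (indexed < limit && entities.hasNext()) {
            indexer.accept(entities.next());
            indexed++;
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> books = List.of("book-1", "book-2", "book-3", "book-4");
        List<String> index = new ArrayList<>(); // stands in for the real index
        long count = indexUpTo(books.iterator(), 2, index::add);
        System.out.println(count + " indexed: " + index);
    }
}
```

Because the cap is applied while iterating, the remaining rows are never loaded, which is what makes the option useful for quick experiments on a large database.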

> start => Future should get actually return some stats? We can delay that but I don't like the JavaDoc claiming that we will always return null

I took that from the recommendations in the Future javadoc itself, but I agree with you that it doesn't feel very good.
I could return (as you suggest) a reference to the IndexerProgressMonitor in use (see http://fisheye.jboss.org/browse/Hibernate/search/trunk/src/main/java/org/... ).
The API of the IndexerProgressMonitor is a topic to discuss later (it is HSEARCH-370); for now there is one default implementation, which logs progress and some performance stats. That's why it is missing methods to retrieve the stats.

> startAndWait => execute ?

I preferred to stress the small difference from "start"; don't you think that having both a "start" method and an "execute" method would make it unclear which one to call?

Sanne
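The difference between the two methods can be sketched as follows: `start` hands the work to another thread and returns a `Future` immediately, while `startAndWait` returns only once the work is done. This is a minimal illustration under stated assumptions: the `StartVersusStartAndWait` class, its counter and its single-thread executor are hypothetical and stand in for the real indexing work.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a counter stands in for the real indexing work.
final class StartVersusStartAndWait {
    private final AtomicInteger indexed = new AtomicInteger();
    private final int total;

    StartVersusStartAndWait(int total) {
        this.total = total;
    }

    // Non-blocking: submits the work and returns immediately.
    // The Future of a Runnable yields null from get(), i.e. no stats are returned.
    Future<?> start() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Runnable work = this::indexEverything;
        Future<?> future = executor.submit(work);
        executor.shutdown(); // already submitted work still runs to completion
        return future;
    }

    // Blocking: same work, but the caller gets control back only when it is done.
    void startAndWait() {
        try {
            start().get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing failed", e.getCause());
        }
    }

    private void indexEverything() {
        for (int i = 0; i < total; i++) {
            indexed.incrementAndGet();
        }
    }

    int indexedCount() {
        return indexed.get();
    }

    public static void main(String[] args) {
        StartVersusStartAndWait indexer = new StartVersusStartAndWait(1_000);
        indexer.startAndWait();
        System.out.println("indexed " + indexer.indexedCount());
    }
}
```

Implementing the blocking variant on top of the non-blocking one keeps the two behaviours identical apart from when the caller regains control, which is the "little difference" the names are meant to convey.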



Emmanuel Bernard

Sunday, 5 July

4:12 a.m.

>> cacheMode //when would you need something different than Ignore? Also, I'd rather get CacheMode be a Search class to keep the independance wrt Hibernate Core
>
> Depending on the model it might be much faster using cache when the indexed entity is having a @ManyToOne+@IndexedEmbedded relation to some entity having high probability to have been indexed already.
> Like book->nation of publishing : you might have millions of books, but just some hundreds of nations, if these nations need to be reloaded over and over lazily with a second query a cache helps.
> I'll wrap it to a Search specific enum like I've seen in Annotations?

Did you try? It seems that the first-level cache would load the nation object once per iteration. Given that cacheMode is unfortunately a global setting for all entities, I'm wondering what's more efficient in the end.

>> optimizeAtEnd => optimizeOnFinish
>> optimizeAfterPurge
>> purgeAllAtStart => purgeBeforeIndexing ?, purgeAllOnStart, purgeOnStart
>
> I vote for "purgeAllOnStart", I like "purgeAll" to be consistent.

That's reasonable. My idea was to remove "All" to allow the implementation to evolve down the road, should Lucene provide a more efficient solution to purge and create a new object, but that's a far-off bet.

>> limitObjects => indexObjectsUpTo, indexFirstObjectsUpTo, limitIndexedObjectsTo //what's the use case?
>
> Mostly testing and developing; I didn't have this in first design but having to test it often it came out that it was quite useful if I could just try the effect of some new Analyzer without having to reindex millions of records; Also during changes to the entities you might want to see the effect of adding some new field / search option without having to wait for hours.
> I could have deleted data from dev database, but I consider having this option a bit more flexible; Actually I can foresee some feature request to be able to restrict the data, but we can think about that later. For same reasoning we could leave this out for the moment, but it has been very useful for me.

OK, let's mark this one as experimental, you seem to want more of the API.

>> start => Future should get actually return some stats? We can delay that but I don't like the JavaDoc claiming that we will always return null
>
> I took that from the recommendations on the Future javadoc itself, but I agree with you it doesn't feel very good.
> I could return (like you suggest) a reference to the used IndexerProgressMonitor (see http://fisheye.jboss.org/browse/Hibernate/search/trunk/src/main/java/org/... )
> The API of the IndexerProgressMonitor will be a topic to discuss later (it is HSEARCH-370), for now there is one default impl which will log progress and some performance stats; that's why it is missing methods to retrieve the stats.

Let's think about that and return null for now; I just want to relax the strong statement in the javadoc.

>> startAndWait => execute ?
>
> I preferred to stress the little difference with "start"; don't you think that having a "start" method and an "execute" method is not making it clear which one I should call?

I know, that's why I put a ? :)


Sanne Grinovero

Tuesday, 7 July

3:49 a.m.

Inline:

2009/7/5 Emmanuel Bernard <emmanuel(a)hibernate.org>:

>>> cacheMode //when would you need something different than Ignore? Also, I'd rather get CacheMode be a Search class to keep the independance wrt Hibernate Core
>>
>> Depending on the model it might be much faster using cache when the indexed entity is having a @ManyToOne+@IndexedEmbedded relation to some entity having high probability to have been indexed already.
>> Like book->nation of publishing : you might have millions of books, but just some hundreds of nations, if these nations need to be reloaded over and over lazily with a second query a cache helps.
>> I'll wrap it to a Search specific enum like I've seen in Annotations?
>
> Did you try? It seems that the first level cache would load the nation object once per iteration. Provided that cacheMode is unfortunately a global setting for all entities, I'm wondering what's more efficient in the end.

Well, I'm sure that it's not the best setting for most cases, but yes, I have tried it, and there are some situations in which it gives a major performance boost, especially on complex models having many relations of this type. Also, in this case the "first level cache" is very short-lived, and every thread has its own; being short-lived, there's not a big chance of a "first level cache hit". By contrast, the second-level cache makes sure all "lookup tables" are loaded once for the whole process. Also, it only makes sense when using a real cache, properly configured, not the Hashtable one.
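The argument above can be checked with a small simulation. The sketch below counts simulated "nation" loads while indexing books in batches: a map that is cleared at every batch stands in for the short-lived first-level cache, and an optional shared map stands in for a second-level cache. All names and numbers are assumptions chosen for illustration; no Hibernate classes are involved and no real queries are run.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of lookup-table loads with and without a shared cache.
final class LookupCacheSketch {

    // Returns how many simulated database loads of nations are needed to index
    // `books` books, where book i refers to nation (i % nations).
    // `shared` models a process-wide second-level cache; pass null to model
    // a setup where that cache is ignored.
    static int countNationLoads(int books, int nations, int batchSize,
                                Map<Integer, String> shared) {
        int loads = 0;
        Map<Integer, String> sessionCache = new HashMap<>();
        for (int book = 0; book < books; book++) {
            if (book % batchSize == 0) {
                sessionCache.clear(); // a new short-lived session per batch
            }
            int nationId = book % nations;
            if (sessionCache.containsKey(nationId)) {
                continue; // first-level cache hit
            }
            String nation = (shared == null) ? null : shared.get(nationId);
            if (nation == null) {
                nation = "nation-" + nationId; // simulated second query
                loads++;
                if (shared != null) {
                    shared.put(nationId, nation);
                }
            }
            sessionCache.put(nationId, nation);
        }
        return loads;
    }

    public static void main(String[] args) {
        int withoutShared = countNationLoads(1_000, 100, 10, null);
        int withShared = countNationLoads(1_000, 100, 10, new HashMap<>());
        System.out.println("loads without shared cache: " + withoutShared);
        System.out.println("loads with shared cache:    " + withShared);
    }
}
```

In this toy model the per-batch cache never gets a hit when batches are smaller than the number of nations, which mirrors the point that a short-lived first-level cache helps little, while the shared cache caps the loads at one per nation.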

>>> optimizeAtEnd => optimizeOnFinish
>>> optimizeAfterPurge
>>> purgeAllAtStart => purgeBeforeIndexing ?, purgeAllOnStart, purgeOnStart
>>
>> I vote for "purgeAllOnStart", I like "purgeAll" to be consistent.
>
> That's reasonable, my idea was to remove All to allow the implementation to evolve down the road should Lucene provide a more efficient solution to purge and create a new object but that's a far off bet.

Ah well, I suppose this is not the only API you'll have to change, should that happen :-)

>>> limitObjects => indexObjectsUpTo, indexFirstObjectsUpTo, limitIndexedObjectsTo //what's the use case?
>>
>> Mostly testing and developing; I didn't have this in first design but having to test it often it came out that it was quite useful if I could just try the effect of some new Analyzer without having to reindex millions of records; Also during changes to the entities you might want to see the effect of adding some new field / search option without having to wait for hours.
>> I could have deleted data from dev database, but I consider having this option a bit more flexible; Actually I can foresee some feature request to be able to restrict the data, but we can think about that later. For same reasoning we could leave this out for the moment, but it has been very useful for me.
>
> OK, let's mark this one as experimental, you seem to want more of the API.

>>> start => Future should get actually return some stats? We can delay that but I don't like the JavaDoc claiming that we will always return null
>>
>> I took that from the recommendations on the Future javadoc itself, but I agree with you it doesn't feel very good.
>> I could return (like you suggest) a reference to the used IndexerProgressMonitor (see http://fisheye.jboss.org/browse/Hibernate/search/trunk/src/main/java/org/... )
>> The API of the IndexerProgressMonitor will be a topic to discuss later (it is HSEARCH-370), for now there is one default impl which will log progress and some performance stats; that's why it is missing methods to retrieve the stats.
>
> Let's think about that and return null for now, I just want to relax the strong statement in the javadoc

I'll remove the return javadoc, leaving it undocumented for now.

>>> startAndWait => execute ?
>>
>> I preferred to stress the little difference with "start"; don't you think that having a "start" method and an "execute" method is not making it clear which one I should call?
>
> I know that's why I put a ? :)

Thanks for the insight,
Sanne


Emmanuel Bernard

4:02 a.m.

OK, go ahead, whatever that means :).

