[hibernate-dev] Re: Hibernate Search: massive batch indexing

Emmanuel Bernard emmanuel at hibernate.org
Sat Jun 7 13:22:42 EDT 2008


This sounds very promising.
I don't quite understand why you talk about loading lazy objects though?
One of the recommendations is to load the object and all its related
objects before indexing. No lazy triggering should happen.
e.g. "from User u left join fetch u.address a left join fetch a.country"
if Address and Country are embedded in the User index.

I think the E limitation is fine, we can position this API as offline
indexing of the data. That's fair enough for a start. I don't like
your block approach unless you couple it with JMS. I am uncomfortable
keeping work to be done for a few hours in a VM without a persistence
mechanism.

"this pool is usually the slowest as it has to initialize many lazy  
fields,
so there are more threads here."
I don't quite understand why this happens.

So what is the impact of your code on the current code base? Do you  
need to change a lot of things? How fast do you think you could have a  
beta in the codebase?

Let's spin off a different thread for the "in transaction" pool, I am not
entirely convinced it will actually speed things up.


On Jun 6, 2008, at 18:51, Sanne Grinovero wrote:

> Hello list,
>
> I've finally finished some performance tests on the ideas I wanted to
> double-check before writing anything stupid to this list, so I feel I
> can at last propose some code for (re)building the index for Hibernate
> Search.
>
> The present API of Hibernate Search provides a nice and safe
> transactional "index(entity)",
> but even when trying several optimizations it doesn't reach the speed
> of an unsafe (out of transaction) indexer we use in our current
> production environment.
> Also, reading the forum it appears that many people have
> difficulties using the current API; even with a good example in the
> reference documentation, some difficulties arise with Seam's
> transactions and with huge data sets.
> (I'm NOT saying something is broken, just that you need a lot of
> expertise to get it going.)
>
> SCENARIO
> =======
>
> * Developers change an entity and want to test the effect on the index
> structure;
>  they want to do search experiments with the new fields.
> * A production system is up(down)graded to a new(old) release,
> involving index changes.
>  (the system is "down for maintenance" but speed is crucial)
> * The existing index is corrupted/lost. (Again, speed to recover is
> critical.)
> * A database backup is restored, or data is changed by other jobs.
> * Some crazy developer like me prefers to disable H.Search's event
> listeners for some reason.
>  (I wouldn't generally recommend it, but I have met other people who
> have a reasonable argument for doing this. Also in our case it is a
> feature, as newly entered books will be available for loans only from
> the next day :D)
> * A Lucene update breaks the index format (not so irrational, as they
> just did on trunk).
>
> PERFORMANCE
> =======
>
> In simple use cases, such as fewer than 1000 entities and not too many
> relationships,
> the existing API outperforms my prototype, as I have some costly setup.
> In more massive tests the setup costs are easily recovered by a much
> faster indexing speed;
> I have a lot of data I could send, I'll just show some and keep the
> details simple:
>
> entity "Operator": standard complexity, involves loading of 4+ objs, 7
> fields affect the index
> entity "User": moderate complexity, involves loading of about 20 objs, 12
> affect index data
> entity "Modern": high complexity, loading of 44 entities, many are
> "manyToMany", 25 affect index data
>
> On my laptop (dual core, local MySQL db):
> type		Operator		User		Modern
> number		560			100.000		100.000
> time-current	0,23 secs		45''		270.3''
> time-new	0,43 secs		30''		190''
>
> On a staging server (4 core Xeon with lots of ram and dedicated DB  
> server):
> type		Operator		User		Modern
> number		560			200.000		4.000.000
> time-current	0,09 secs		130''		5h20'
> time-new	0,25 secs		22''		19'
>
> [benchmark disclaimer:
> These timings are meant to be relative to each other for my particular
> code version; I'm not an expert in Java benchmarking at all.
> Also, unfortunately I can't really access the same hardware for each
> test.
> I used all possible tweaks I am aware of in Hibernate Search, actually
> enabling new needed params to make the test as fair as possible.]
>
> Examining the numbers:
>   with the current recommended H.Search strategy I can index 560 simple
> entities in 0,23 seconds; quite fast, and newbie users will be
> impressed.
>   At the other extreme, we index 4 million complex items, but I need
> more than 5 hours to do that; this is more like a real use case and it
> could scare several developers.
>   Unfortunately I don't have a complete copy of the DB on my laptop,
> but looking at the numbers it seems my laptop could finish
> in 3 hours, nearly double the speed of our more-than-twice-as-fast
> server.
> (yes, I've had several memory leaks :-) but they're solved now)
>   The real advantage is in the round-trips to the database: without
> multiple threads, each lazy-loaded collection somehow annotated to be
> indexed massively slows down the whole process. If you look at both the
> DB and AS servers, they have very low resource usage, confirming this,
> while my laptop stays at 70% cpu (and kills my hard drive) because it
> has the data available locally, producing a constant feed of strings to
> my index.
>   When using the new prototype (about 20 threads in 4 different pools)
> I get the 5 hours down to less than 20 minutes; also I can start the
> indexing of all 7 indexable types in parallel and it will stay
> around 20 minutes.
> The "User" entity is not as complex as Modern (less lazy-loaded data)
> but confirms the same numbers.
>
> ISSUES
> =======
> About the version I currently have ready:
> it is not a complete substitute for the current one and is far from
> perfect; currently these limitations apply, but they could be easily
> solved (others I am not aware of are not listed :-):
>
> A) I need to "read" some hints for each entity; I tinkered with a new
> annotation;
>   configuration properties would work but are likely to be quite
> verbose (HQL);
>   basically I need some hints about fetch strategies appropriate
>   for batch indexing, which are often different from normal use cases.
>
> B) Hibernate Search's indexing of related entities was not available
> when I designed it.
>   I think this change will probably not affect my code, but I still
> need to verify the functionality of IndexEmbedded.
>
> C) It is fine-tuned for our entities and DB; many variables are
> configurable but some stuff should be made more flexible.
>
> D) Also, index sharding didn't exist at the time; I'll need to change
> some stuff to send the entities to the correct index and acquire the
> appropriate locks.
>
> The next limitation is not easy to solve; I have some ideas, but none
> I liked.
>
> E) It is not completely safe to use during other data modifications.
> It's not a problem in our current production setup, but it needs strong
> warnings in case other people want to use it.
>   The best solution I could think of is to lock the current work queue
> of H.Search,
>   so as to block execution of work objects in the queue and resume the
> execution of these work objects after batch indexing is complete.
>   If some entity disappears (removed from the DB but a reference is in
> the queue) it can easily be skipped; if I index an "old version" of
> some other data it will be fixed when scheduled updates from H.S. event
> listeners are resumed (and the same for new entities).
>   It would be nice to share the same database transaction during the
> whole process, but as I use several threads and many separate sessions
> I think this is not possible
>   (this is the best place to ask, I think ;-)
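The "lock the work queue, resume after batch indexing" idea above can be sketched with plain java.util.concurrent primitives. This is only an illustration under my own assumptions, not Hibernate Search code; all names are made up:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a single worker drains a queue of index updates,
// but its loop can be paused while a batch re-index runs and resumed
// afterwards, so queued updates are applied once the batch completes.
public class PausableWorkQueue {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final ReentrantLock pauseLock = new ReentrantLock();
    private final Condition unpaused = pauseLock.newCondition();
    private boolean paused = false;

    public PausableWorkQueue() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Runnable work = queue.take();       // next index update
                    pauseLock.lock();
                    try {
                        while (paused) unpaused.await(); // block during batch
                    } finally {
                        pauseLock.unlock();
                    }
                    work.run();                          // apply the update
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();      // shutdown
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    public void submit(Runnable work) { queue.add(work); }

    // Called just before batch indexing starts.
    public void pause() {
        pauseLock.lock();
        try { paused = true; } finally { pauseLock.unlock(); }
    }

    // Called when batch indexing is complete.
    public void resume() {
        pauseLock.lock();
        try { paused = false; unpaused.signalAll(); } finally { pauseLock.unlock(); }
    }
}
```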
>
> GOING PRACTICAL
> ===============
> if (cheater) goto :top
>
> A nice evictAll(class) exists; I would like to add indexAll(class).
> It would be nice to provide non-blocking versions, maybe overloading:
> indexAll(Class clazz, boolean block)
> or providing a Future as return object, so people could wait for one
> or more indexAll requests if they want to.
> There are many parameters to tweak the indexing process, so I'm
> not sure if we should put them in the properties, or have a
> parameters-wrapper object indexAll(Class clazz, Properties prop), or
> something like makeIndexer(Class clazz) returning a complex object
> with several setters for fine-tuning and start() and awaitTermination()
> methods.
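As a strawman, the makeIndexer(...) variant could take a builder-like shape such as the following. Every name and default here is hypothetical, not an agreed API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical API sketch: a batch indexer configured through setters,
// started asynchronously, with a Future to await completion. The actual
// pipeline is elided; only the API shape is shown.
public class MassIndexer {
    private final Class<?> entityType;
    private int fetchThreads = 4;      // threads scrolling the DB
    private int documentThreads = 8;   // threads building Lucene Documents
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public MassIndexer(Class<?> entityType) { this.entityType = entityType; }

    public MassIndexer fetchThreads(int n) { this.fetchThreads = n; return this; }
    public MassIndexer documentThreads(int n) { this.documentThreads = n; return this; }

    // Non-blocking: returns a Future the caller may wait on, or ignore.
    public Future<Long> start() {
        return executor.submit(() -> {
            // ... run the batch pipeline for entityType here ...
            return 0L; // number of entities indexed (placeholder)
        });
    }

    // Blocking convenience, equivalent to start().get().
    public long startAndWait() throws Exception {
        try {
            return start().get();
        } finally {
            executor.shutdown();
        }
    }
}
```

A caller wanting the blocking behaviour would then write something like `new MassIndexer(User.class).fetchThreads(2).startAndWait();`.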
>
> the easy part
> --------------
> This part is easy to do as I have it working well; it is a pattern
> involving several executors. The size of each thread pool and of the
> linking queues between them strikes the balance needed to achieve
> high throughput.
> First the entities are counted and divided into blocks; these ranges
> are fed to
> N scrollables opened in N threads, each thread iterating on its
> list of entities and feeding detached entities to the next pool using
> BlockingQueues.
> In the next pool the entities are re-attached using Lock.none,
> readonly, etc.
> (and many others you may want to tell me) and we get an appropriate
> DocumentBuilder from the SearchFactory to transform each into a Lucene
> Document;
> this pool is usually the slowest as it has to initialize many lazy
> fields, so there are more threads here.
> Produced documents go to a smaller pool (the best I found was 2-3
> threads)
> where data is concurrently written to the IndexWriter.
> There's an additional thread for resource monitoring to produce some
> hints about queue sizing and idle threads, to allow some fine-tuning
> and to see instant speed reports in the logs when enabled.
> For shutdown I use the "poison pill" pattern, and I usually get rid
> of all threads and executors when I'm finished.
> It needs some adaptation to take the latest Search features into
> account, such as similarity, but it is mostly beta-ready.
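Stripped of the Hibernate and Lucene calls, the executor/queue/poison-pill pattern described above looks roughly like this. Placeholders stand in for entity loading, Document building, and IndexWriter; pool sizes and queue bounds are illustrative only:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the pipeline: a producer feeds ids through a bounded queue to
// a larger "document builder" pool, which feeds a smaller writer pool;
// one poison pill per consumer shuts each stage down cleanly.
public class IndexingPipeline {
    private static final Object PILL = new Object(); // poison pill marker

    public static int run(List<Integer> ids, int builders, int writers) throws Exception {
        BlockingQueue<Object> entities = new ArrayBlockingQueue<>(100);
        BlockingQueue<Object> documents = new ArrayBlockingQueue<>(100);
        AtomicInteger written = new AtomicInteger();

        ExecutorService builderPool = Executors.newFixedThreadPool(builders);
        ExecutorService writerPool = Executors.newFixedThreadPool(writers);

        for (int i = 0; i < builders; i++) {
            builderPool.submit(() -> {
                for (Object e = entities.take(); e != PILL; e = entities.take()) {
                    documents.put("doc-" + e); // stand-in for building a Lucene Document
                }
                return null;
            });
        }
        for (int i = 0; i < writers; i++) {
            writerPool.submit(() -> {
                for (Object d = documents.take(); d != PILL; d = documents.take()) {
                    written.incrementAndGet(); // stand-in for IndexWriter.addDocument
                }
                return null;
            });
        }

        for (Integer id : ids) entities.put(id);             // "scrolling" producer
        for (int i = 0; i < builders; i++) entities.put(PILL);
        builderPool.shutdown();
        builderPool.awaitTermination(1, TimeUnit.MINUTES);    // builders drained
        for (int i = 0; i < writers; i++) documents.put(PILL);
        writerPool.shutdown();
        writerPool.awaitTermination(1, TimeUnit.MINUTES);     // writers drained
        return written.get();
    }
}
```

The bounded queues are what gives the backpressure: a slow writer pool stalls the builders instead of letting memory grow without limit.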
>
> the difficult part
> -------------------
> Integrating it with the current locking scheme is not really
> difficult,
> also because the goal is to minimize downtime, so I think some
> downtime should be acceptable.
> It would be very nice however to integrate this pattern as the default
> writer for indexes, even "in transaction"; I think it could be
> possible even in synchronous mode to split the work of a single
> transaction across the executors and wait for all the work to be done
> at commit.
> You probably don't want to see the "lots of threads" meant for batch
> indexing,
> but the pools scale quite well to adapt themselves to the load,
> and it's easy (as in clean and maintainable code) to enforce
> resource limits.
> When integrating at this level the system wouldn't need to stop
> regular Search activity.
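The "split one transaction's work across the executors and wait at commit" idea could be sketched as follows. The names are hypothetical and this is not actual Search internals, just the synchronization shape:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: indexing work generated inside one transaction is handed to a
// shared pool and runs concurrently, but commit blocks until every
// submitted piece has finished, preserving synchronous semantics.
public class TransactionalWorkSplitter {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final List<Future<?>> pending = new ArrayList<>();

    public void addWork(Runnable indexingWork) {
        pending.add(pool.submit(indexingWork)); // runs concurrently, pre-commit
    }

    // Called at commit: wait for all queued work before the tx completes.
    public void awaitAllAtCommit() throws Exception {
        for (Future<?> f : pending) f.get();
        pending.clear();
    }

    public void shutdown() { pool.shutdown(); }
}
```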
>
> Any questions? If someone wants to reproduce my benchmarks I'll
> be glad to send my current code and DB.
>
> kind regards,
> Sanne
