[hibernate-dev] Re: Hibernate Search: massive batch indexing
Emmanuel Bernard
emmanuel at hibernate.org
Mon Jun 9 14:04:52 EDT 2008
On Jun 7, 2008, at 20:14, Sanne Grinovero wrote:
> thanks for your insights :-)
> I'll try explain myself better inline:
>
> 2008/6/7 Emmanuel Bernard <emmanuel at hibernate.org>:
> This sounds very promising.
> I don't quite understand why you talk about loading lazy objects
> though?
> One of the recommendations is to load the object and all its related
> objects before indexing. No lazy triggering should happen.
> eg "from User u left join fetch u.address a left join fetch a.country"
> if Address and Country are embedded in the User index.
>
> I am talking about lazy object loading because it is not always
> possible to load the complete object graph eagerly, because of the
> cartesian product problem;
> the "hints" I mention in point A are mainly (but not limited to) the
> left join fetch instructions needed to load the root entity.
> However, if I put all needed collections in the fetch join I kill the
> DB performance and am flooded by data; I have run many experiments to
> find the "golden balance" between eager and lazy and know for sure it
> is much faster to keep most stuff out of the initial "fetch join".
> My current rule of thumb is to load no more than two additional
> collections; the rest goes lazy.
> Also we should keep in mind that the eager/lazy/subselect strategies
> chosen for the entities will probably be selected to fine-tune
> "normal" business operations, not indexing performance;
> I had to fight somewhat with other devs who needed some settings
> configured for other use cases in a different way than what I needed
> to bring indexing timings down.
I understand. You could use Hibernate.initialize and batch-size
upfront to help in this area *before* passing it to Hibernate Search.
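Roughly like this (a sketch only; the entity and association names here are made up, not from your model):

```java
// Load the roots with one fetch join, then batch-initialize the
// remaining lazy associations in the same Session before handing the
// objects to Hibernate Search. With @BatchSize(size = 50) on the
// collection mapping, the initialize() calls are grouped into a few
// SQL round-trips instead of one query per collection.
List<User> users = session
    .createQuery("from User u left join fetch u.address")
    .list();
for (User u : users) {
    Hibernate.initialize(u.getRoles()); // hypothetical lazy collection
}
```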
>
>
>
> I think the E limitation is fine, we can position this API as
> offline indexing of the data. That's fair enough for a start. I
> don't like your block approach unless you couple it with JMS. I am
> uncomfortable in keeping work to be done for a few hours in a VM
> without persistent mechanism.
>
> I am glad to hear it's fine to position it as "offline" API, as a
> start.
> Do you think we should enforce or check it somehow?
Let's add a modal box "Are you sure?" ;)
I don't think you can really enforce that (especially on a cluster).
>
> For later improvements the batching IndexWriter could be "borrowed" by
> the committing transactions to synchronously write their data away,
> we just need to avoid the need of an IndexReader for deletions;
> I've been searching for a solution in my other post... if that could
> be fixed
> and a single IndexWriter per index could be available you could
> have batch indexing and normal operation available together.
I will answer on the second post.
>
>
> "this pool is usually the slowest as it has to initialize many lazy
> fields,
> so there are more threads here."
> I don't quite understand why this happens.
>
> I suppose I should show you an ER diagram of our model; in our case,
> and I believe in most cases, people will search for an object basing
> their "fulltext" idea on many different fields which are external to
> the main entity: intersecting e.g. author nickname with historic
> period, considering book series, categories and collections, or a
> special code in one of 30 other legacy library encoding schemes.
> The use case actually shows that very few fields are read from the
> root entity, but most
> are derived from linked many-to-many entities, sometimes going to a
> second or third level
> of linked information. I don't think this is just my case, IMHO it
> is very likely most
> real world applications will have a similar problem, we have to
> encode in the root
> object many helper fields to make most external links searchable; I
> believe this is part of the
> "dealing with the mismatch between the index structure and the
> domain model"
> which is Search's slogan (pasted from homepage).
>
> So what is the impact of your code on the current code base? Do you
> need to change a lot of things? How fast do you think you could have
> a beta in the codebase?
>
> I still have not completely understood the locks around the indexes;
> I believe the impact on the current code is not so huge, but I need
> to know how I should "freeze" other activity on the indexes: indexing
> could just start, but then other threads would be waiting a long
> time; should other methods check and throw an exception when mass
> indexing is busy?
Let's not envision an exception for the moment.
The locks must be acquired in a specific order; aside from that, this
should be straightforward.
>
> Is it ok for one method to spawn 40 threads?
It's OK if there is only one call per VM doing that. If every
client does that, then that's not good :)
>
> How should the "management / progress monitor API" look like?
Maybe like the Hibernate Statistics. It depends on what the API
should do.
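For instance, a progress monitor in the Statistics spirit might look like this (a sketch only; every name below is hypothetical):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical progress monitor for mass indexing, loosely modeled on
// the style of the Hibernate Statistics interface. Counters are
// atomic because several pools update them concurrently.
class IndexingProgressMonitor {
    private final AtomicLong loaded = new AtomicLong();
    private final AtomicLong indexed = new AtomicLong();
    private volatile long totalCount = -1; // unknown until entities are counted

    void entitiesLoaded(int count)   { loaded.addAndGet(count); }
    void documentsAdded(int count)   { indexed.addAndGet(count); }
    void addToTotalCount(long count) { totalCount = count; }

    long getLoadedCount()  { return loaded.get(); }
    long getIndexedCount() { return indexed.get(); }

    /** Completed fraction in [0,1], or -1 while the total is unknown. */
    double getProgress() {
        long total = totalCount;
        return total <= 0 ? -1d : (double) indexed.get() / total;
    }
}
```

Threads in the different pools would call the `entitiesLoaded`/`documentsAdded` hooks, and the monitoring thread (or the user) polls `getProgress()`.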
>
> I didn't look at similarity and sharding; is it ok for a first beta
> to skip these features? I don't think they will be difficult to
> figure out, but I would like to show working code prototypes asap to
> get early feedback.
no problem
>
> I think that if the answers to above questions don't complicate my
> current code the effort to integrate it is less than a week of work;
> unfortunately this translates
> in 4-6 weeks of time as I have other jobs and deadlines, maybe less
> with some luck.
> How should this be managed? a branch? one commit when done?
If you don't disrupt the rest of the features, then you can apply
them in trunk; if you are afraid, then do a branch. But branches are
a pain to merge back in SVN.
>
>
>
> Let's spin a different thread for the "in transaction" pool, I am
> not entirely convinced it actually will speed up things.
> Yes, I agree there probably is not a huge advantage, if any; the main
> reason would be to keep "normal operation" available even during mass
> reindexing. Performance improvements would be limited to special
> cases such as a single thread committing several entities: the
> "several" would benefit from batch behavior.
> The other thread I had already started is linked to this: IMHO we
> should improve the deletion of entities first.
>
> On Jun 6, 2008, at 18:51, Sanne Grinovero wrote:
>
> Hello list,
>
> I've finally finished some performance test about stuff I wanted to
> double-check
> before writing stupid ideas to this list, so I feel I can at last
> propose
> some code to (re)building the index for Hibernate Search.
>
> The present API of Hibernate Search provides a nice and safe
> transactional "index(entity)",
> but even when trying several optimizations it doesn't reach the speed
> of an unsafe (out of transaction) indexer we use in our current
> production environment.
> Also, reading the forum, it appears that many people are having
> difficulties using the current API; even with a good example in the
> reference documentation, some difficulties arise with Seam's
> transactions and with huge data sets.
> (I'm NOT saying something is broken, just that you need a lot of
> expertise to get it going.)
>
> SCENARIO
> =======
>
> * Developers change an entity and want to test the effect on the
> index structure; they want to do search experiments with the new
> fields.
> * A production system is up(down)graded to a new(old) release,
> involving index changes.
> (the system is "down for maintenance" but speed is crucial)
> * Existing index is corrupted/lost. (Again, speed to recover is
> critical)
> * A Database backup is restored, or data is changed by other jobs.
> * Some crazy developer like me prefers to disable H.Search's event
> listeners for some reason.
> (I wouldn't generally recommend it, but I have met other people who
> have a reasonable argument for doing this. Also, in our case it is a
> feature, as newly entered books will be available for loans only from
> the next day :D)
> * A Lucene update breaks the index format (not so irrational, as they
> just did on trunk).
>
> PERFORMANCE
> =======
>
> In simple use cases, such as fewer than 1000 entities and not too
> many relationships, the existing API outperforms my prototype, as I
> have some costly setup. In more massive tests the setup costs are
> easily recovered by a much faster indexing speed;
> I have a lot of data I could send, but I'll just show some and keep
> the details simple:
>
> entity "Operator": standard complexity, loading 4+ extra objects,
> 7 fields affect the index
> entity "User": moderate complexity, loading ~20 extra objects,
> 12 fields affect index data
> entity "Modern": high complexity, loading 44 entities, many of them
> "manyToMany", 25 fields affect index data
>
> On my laptop (dual core, local MySQL db):
> type          Operator   User      Modern
> number        560        100,000   100,000
> time-current  0.23 s     45 s      270.3 s
> time-new      0.43 s     30 s      190 s
>
> On a staging server (4-core Xeon with lots of RAM and a dedicated DB
> server):
> type          Operator   User      Modern
> number        560        200,000   4,000,000
> time-current  0.09 s     130 s     5 h 20 min
> time-new      0.25 s     22 s      19 min
>
> [benchmark disclaimer:
> These timings are meant to be relative to each other for my particular
> code version, I'm not an expert of Java benchmarking at all.
> Also, unfortunately, I can't really access the same hardware for each
> test.
> I used all possible tweaks I am aware of in Hibernate Search, actually
> enabling new needed params to make the test as fair as possible.]
>
> Examining the numbers:
> with the currently recommended H.Search strategy I can index 560
> simple entities in 0.23 seconds; quite fast, and newbie users will be
> impressed.
> At the other extreme, we index 4 million complex items, but I need
> more than 5 hours to do that; this is more like a real use case and
> it could scare several developers.
> Unfortunately I don't have a complete copy of the DB on my laptop,
> but looking at the numbers it seems my laptop could finish in about 3
> hours, nearly double the speed of our more-than-twice-as-fast server.
> (yes, I've had several memory leaks :-) but they're solved now)
> The real advantage is in the round-trips to the database: without
> multiple threads, each lazy loaded collection annotated to be indexed
> massively slows down the whole process. If you look at both the DB
> and AS servers, they have very low resource usage, confirming this,
> while my laptop stays at 70% CPU (and kills my hard drive) because it
> has the data available locally, producing a constant feed of strings
> to my index.
> When using the new prototype (about 20 threads in 4 different pools)
> I get the 5 hours down to less than 20 minutes; also, I can start the
> indexing of all 7 indexable types in parallel and it will stay around
> 20 minutes.
> The "User" entity is not as complex as "Modern" (less lazy loaded
> data) but confirms the same numbers.
>
> ISSUES
> =======
> About the version I currently have ready:
> it is not a complete substitute for the current one and is far from
> perfect; currently these limitations apply, but they could easily be
> solved:
> (others I am not aware of are not listed :-)
>
> A) I need to "read" some hints for each entity; I tinkered with a new
> annotation; configuration properties should also work but are likely
> to be quite verbose (HQL).
> Basically I need some hints about fetch strategies appropriate
> for batch indexing, which are often different from those for normal
> use cases.
>
> B) Hibernate Search's indexing of related entities was not available
> when I designed it.
> I think this change will probably not affect my code, but I still
> need to
> verify the functionality of IndexEmbedded.
>
> C) It is fine-tuned for our entities and DB; many variables are
> configurable but some stuff should be made more flexible.
>
> D) Also index sharding didn't exist at the time, I'll need to change
> some stuff
> to send the entities to the correct index and acquire the
> appropriate locks.
>
> The next limitation is not easy to solve; I have some ideas but none
> I liked.
>
> E) It is not completely safe to use during other data modification;
> it's not a problem in our current production but needs much warning
> in case other people want to use it.
> The best solution I could think of is to lock the current work queue
> of H.Search, so as to block execution of the work objects in the
> queue and resume their execution after batch indexing is complete.
> If some entity disappears (removed from the DB but a reference is in
> the queue) it can easily be skipped; if I index an "old version" of
> some other data it will be fixed when the scheduled updates from the
> H.S. event listeners are resumed (and the same for new entities).
> It would be nice to share the same database transaction during the
> whole process, but as I use several threads and many separate
> sessions I think this is not possible
> (this is the best place to ask, I think ;-)
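The queue-freezing idea could be sketched roughly like this, in plain java.util.concurrent-style code (a toy version with hypothetical names; the real Hibernate Search work queue is more involved):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Sketch of a work queue that can be "frozen" during mass indexing:
// work submitted while frozen is buffered, then replayed in order on
// resume, matching the skip/fix-up behavior described above.
class FreezableWorkQueue<T> {
    private final Queue<T> deferred = new ArrayDeque<>();
    private final Consumer<T> executor;
    private boolean frozen = false;

    FreezableWorkQueue(Consumer<T> executor) { this.executor = executor; }

    synchronized void submit(T work) {
        if (frozen) {
            deferred.add(work);     // hold the work until batch indexing ends
        } else {
            executor.accept(work);  // normal path: execute immediately
        }
    }

    synchronized void freeze() { frozen = true; }

    synchronized void resume() {
        frozen = false;
        T work;
        while ((work = deferred.poll()) != null) {
            executor.accept(work);  // replay buffered work in order
        }
    }
}
```

The mass indexer would call `freeze()` before starting and `resume()` when done; stale or vanished entities are handled during the replay, as described above.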
>
> GOING PRACTICAL
> ===============
> if (cheater) goto :top
>
> A nice evictAll(Class) exists; I would like to add indexAll(Class).
> It would be nice to provide non-blocking versions, maybe by
> overloading:
> indexAll(Class clazz, boolean block)
> or by providing a Future as return object, so people could wait for
> one or more indexAll requests if they want to.
> There are many parameters to tweak the indexing process, so I'm
> not sure if we should put them in the properties, or have a
> parameters-wrapper object indexAll(Class clazz, Properties props), or
> something like makeIndexer(Class clazz) returning a complex object
> with several setters for finetuning and start() and
> awaitTermination() methods.
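That last shape (a makeIndexer-style handle with fine-tuning setters, start() and awaitTermination()) could look roughly like this sketch; every name here is hypothetical:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical fluent indexer handle: configure, start, then either
// block on awaitTermination() or keep working and join later.
class MassIndexer {
    private final Class<?> type;
    private int objectLoadingThreads = 4;   // one of many tuning knobs
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private Future<?> job;
    private volatile boolean finished = false;

    MassIndexer(Class<?> type) { this.type = type; }

    MassIndexer objectLoadingThreads(int n) {
        this.objectLoadingThreads = n;      // fine-tuning setter, returns this
        return this;
    }

    MassIndexer start() {
        // The real implementation would launch the multi-pool pipeline
        // for 'type' here; this sketch only simulates the batch job.
        job = pool.submit(() -> { finished = true; });
        return this;
    }

    boolean isFinished() { return finished; }

    void awaitTermination() {
        try {
            job.get();                      // wait for the batch to finish
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Usage would read like `makeIndexer(Book.class).objectLoadingThreads(8).start().awaitTermination();`, keeping the blocking and non-blocking styles in one API.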
>
> the easy part
> --------------
> This part is easy, as I have it working well; it is a pattern
> involving several executors, where the size of each thread pool and
> of the queues linking them gives the right balance to achieve high
> throughput.
> First the entities are counted and divided into blocks; these ranges
> are fed to N scrollable results opened in N threads; each thread
> iterates over its range of entities and feeds detached entities to
> the next pool using BlockingQueues.
> In the next pool the entities are re-attached (using LockMode.NONE,
> read-only, etc., and many other settings you may want to tell me
> about) and we get an appropriate DocumentBuilder from the
> SearchFactory to transform each one into a Lucene Document;
> this pool is usually the slowest, as it has to initialize many lazy
> fields, so there are more threads here.
> Produced documents go to a smaller pool (the best I found was 2-3
> threads) where data is concurrently written to the IndexWriter.
> There's an additional thread for resource monitoring to produce some
> hints about queue sizing and idle threads, to allow some finetuning
> and to see instant speed reports in the logs when enabled.
> For shutdown I use the "poison pill" pattern, and I usually get rid
> of all threads and executors when I'm finished.
> It needs some adaptation to take into account the latest Search
> features such as similarity, but it is mostly beta-ready.
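The producer/consumer shape described above, including the poison-pill shutdown, can be sketched with plain java.util.concurrent (a toy two-stage version; the real pipeline has more stages, detached entities instead of strings, and a monitoring thread):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy pipeline: a producer feeds "entities" through a bounded queue
// to N consumer threads; one poison pill per consumer stops them.
class Pipeline {
    static final String POISON_PILL = "\u0000STOP";

    static int run(int entityCount, int consumerThreads) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        AtomicInteger indexed = new AtomicInteger();

        Thread[] consumers = new Thread[consumerThreads];
        for (int i = 0; i < consumerThreads; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    String entity;
                    while (!(entity = queue.take()).equals(POISON_PILL)) {
                        indexed.incrementAndGet(); // stand-in for building a Document
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }

        try {
            for (int i = 0; i < entityCount; i++) {
                queue.put("entity-" + i);   // producer: scrolls results in real code
            }
            for (int i = 0; i < consumerThreads; i++) {
                queue.put(POISON_PILL);     // one pill per consumer
            }
            for (Thread t : consumers) {
                t.join();                   // clean shutdown, no threads left
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return indexed.get();
    }
}
```

The bounded queue is what gives the back-pressure balance between the pools that the text mentions.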
>
> the difficult part
> -------------------
> Integrating it with the current locking scheme is not really
> difficult, also because the goal is batch rebuilding, so I think some
> downtime should be acceptable.
> It would be very nice, however, to integrate this pattern as the
> default writer for indexes, even "in transaction"; I think it could
> be possible, even in synchronous mode, to split the work of a single
> transaction across the executors and wait for all the work to be done
> at commit.
> You probably don't want to see the "lots of threads" meant for batch
> indexing, but the pools scale quite well to adapt themselves to the
> load, and it's easy (as in clean and maintainable code) to enforce
> resource limits.
> When integrating at this level the system wouldn't need to stop
> regular Search activity.
>
> any questions? If someone wants to reproduce my benchmarks I'll
> be glad to send my current code and DB.
>
> kind regards,
> Sanne
>
>