[hibernate-dev] Design: HSEARCH-1032 MassIndexer with a live update mechanism

Fri Jul 12 09:38:52 EDT 2013

Another approach I could think of is related to how we would
conceptually implement transaction isolation.

We could create the new index and apply it before the old index. We
could also delete from the old index, on the fly, the data that has
already been put in the new index. While the worst-case 4x size is
still possible, I imagine, it will be much less likely if we assume
regular compaction.
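
To make the idea concrete, here is a rough sketch of one possible
reading of it: lookups consult the new index first and fall back to the
old one, and every entry written to the new index is dropped from the
old one. The class and method names are hypothetical, and plain maps
stand in for real Lucene indexes; this is a toy model, not a proposal
for the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model: document ids map to document text (assumption: no real Lucene index here). */
final class LayeredIndex {
    private final Map<String, String> oldIndex = new HashMap<>();
    private final Map<String, String> newIndex = new HashMap<>();

    /** Seeds the pre-existing (old) index. */
    void seedOld(String id, String doc) {
        oldIndex.put(id, doc);
    }

    /** Called by the reindexing job: write to the new index, drop the stale copy. */
    void reindex(String id, String doc) {
        newIndex.put(id, doc);
        oldIndex.remove(id); // the "delete on the fly" step
    }

    /** Reads consult the new index first, then fall back to the old one. */
    Optional<String> lookup(String id) {
        String doc = newIndex.get(id);
        if (doc == null) {
            doc = oldIndex.get(id);
        }
        return Optional.ofNullable(doc);
    }

    /** Combined number of entries held across both indexes. */
    int totalEntries() {
        return oldIndex.size() + newIndex.size();
    }
}
```

In this model the combined entry count never exceeds what a single
index would hold, which is the intuition behind the 4x case becoming
less likely; real segment files only shrink after compaction.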

That does not solve the problem of where to store the second index
while remaining backward compatible.

An option would be, for the FS case, to create a subdirectory in the
main index directory and, once reindexed, do a lock-protected move of
index/new/* into index/* after deleting the old contents of index/*.
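
A minimal sketch of that move, using only java.nio.file. The class
name, the "new" subdirectory name and the in-process lock are
assumptions made for illustration; a real implementation would also
have to coordinate with Lucene's own index locking and with open
readers. It assumes the index directory holds only flat files plus the
staging subdirectory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

final class IndexSwap {
    // Hypothetical in-process lock; it does not protect against other JVMs.
    private static final ReentrantLock SWAP_LOCK = new ReentrantLock();

    /** Deletes the old files in indexDir, then moves indexDir/new/* into indexDir. */
    static void promote(Path indexDir) throws IOException {
        Path staging = indexDir.resolve("new");
        SWAP_LOCK.lock();
        try {
            // 1. Delete the old index files, keeping the staging subdirectory.
            for (Path p : children(indexDir)) {
                if (!p.equals(staging)) {
                    Files.delete(p);
                }
            }
            // 2. Move the freshly built files into the configured location.
            for (Path p : children(staging)) {
                Files.move(p, indexDir.resolve(p.getFileName()));
            }
            // 3. Remove the now empty staging directory.
            Files.delete(staging);
        } finally {
            SWAP_LOCK.unlock();
        }
    }

    /** Snapshots the directory entries so the directory can be modified while looping. */
    private static List<Path> children(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                result.add(p);
            }
        }
        return result;
    }
}
```

Note that the sequence as a whole is not atomic: a crash between steps
1 and 2 leaves the configured directory without a usable index, so
some recovery logic would be needed on startup.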

Emmanuel

On Fri 2013-07-12 13:42, Sanne Grinovero wrote:
> Current priorities on Search are:
>  - Infinispan IndexManager -> me
>  - Metadata API -> Hardy
>  - Multitenancy (aka dynamic Sharding) -> me + Emmanuel + Dimitrios
> 
> Those are all important as they represent hard requirements for other
> projects, but I'd also like to consider at least the basic design for
> how the MassIndexer could operate in "update mode": a highly requested
> mode in which it re-synchronizes the index with the database without
> first wiping out the index, since the wipe creates a window of time
> in which the application's query results are incomplete.
> 
> # Reminder on current design:
>  1- deletes the current index
>  2- scrolls on all entities and uses ADD index operations to add them all again
> 
> There are two basic approaches on the table (other ideas welcome):
>   - #A Use UPDATE index operations instead, skipping the initial delete
>   - #B Rebuild the index in a secondary directory, then switch
> 
> Let's explore them:
> 
> #A Use UPDATE index operations instead, skipping the initial delete
> 
> ## what
> Technically an UPDATE operation is - in Lucene terms - an atomic
> (delete+add); the benefit is that each query will either see the
> previous document or the updated one, there is no possibility that the
> doc is skipped as there is no possibility to flush the changes between
> the delete and the add operation.
> 
> ## performance
> The reason the current design deletes all elements at the start of the
> process is that this is a very efficient operation: it targets a
> single term (the class name field) or in some cases targets the whole
> index, so it just needs to delete all segment files.
> A delete operation on a per-document basis, instead of per class,
> very likely needs a deletion on multiple terms (which is not efficient
> at all, as it needs I/O to seek across multiple disk positions), and
> of course the worst point is that it triggers a delete operation for
> each and every entity. To compare, a single ADD doesn't need any disk
> seek as we can pack multiple operations in one - until the buffer is
> full - but any single delete requires N disk seeks (N is not directly
> the number of fields but is proportional to it).
> Based on this, and on experience with benchmarking the #index()
> method, I'm expecting the UPDATE strategy to be approximately a
> thousand times slower than the current MassIndexer implementation.
> Considering that for some it takes a couple of hours, going to 2000
> hours is maybe not an option :-) (that's roughly 3 months)
> 
> ## left over entries
> Another problem is that if we scroll on all entities from the
> database, we're failing to delete documents in the index for which
> there is no match anymore.
> So we would need a final phase in which we run the inverse iteration:
> for each element in the index, verify if there is a match in the
> database; sounds like an ugly lot of queries, even if we batch it in
> verification blocks.
> 
> Bottom line: it looks messy.
> 
> #B Rebuild the index in a secondary directory, then switch
> 
> ## performance
> No big concerns, but we assume there is enough space for at least four
> times the size of the index (because we normally need twice the size
> of an index to be able to compact it, and we have two to manage).
> 
> ## design
> The good part is that we can reuse most of the existing MassIndexer;
> but transactional changes (those applied by the application during a
> reindexing) need to be redirected to both indexes: to the one being
> used until the rebuild is complete, so that the queries stay
> consistent, and also enqueued into the one being built, so that they
> don't get lost in case they apply to documents which have already been
> indexed. The queue handling is tricky, because in such a case further
> additions actually need to be updates, unless we can keep them on hold
> in a buffer to be applied on the pristine index: that could take quite
> some memory, depending on the amount of changes flying in during the
> mass indexing. If the queue grows beyond reason we'll need to either
> apply backpressure on the transactions, or offload to disk, or change
> to an update strategy for the remaining mass indexing process. None of
> these are desirable but I guess people could tune to make this
> condition unlikely.
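
For illustration only, the buffering variant described above could be
modelled roughly as follows. All names are hypothetical, plain maps
stand in for the two indexes, deletions are not modelled, and the only
overflow policy shown is backpressure (the change is refused when the
buffer is full).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of routing transactional changes while a rebuild is running. */
final class DualWriteRouter {
    private final Map<String, String> liveIndex = new HashMap<>();
    private final Map<String, String> rebuiltIndex = new HashMap<>();
    private final Queue<String[]> pending = new ArrayDeque<>();
    private final int maxPending;

    DualWriteRouter(int maxPending) {
        this.maxPending = maxPending;
    }

    /**
     * A transactional change arriving during the rebuild: applied to the live
     * index at once and buffered for the index being built.
     * Returns false (backpressure) when the buffer is full.
     */
    boolean applyChange(String id, String doc) {
        if (pending.size() >= maxPending) {
            return false; // the caller has to back off and retry
        }
        liveIndex.put(id, doc);
        pending.add(new String[] {id, doc});
        return true;
    }

    /** The mass indexer: a plain ADD into the index being built. */
    void massIndex(String id, String docFromDatabase) {
        rebuiltIndex.put(id, docFromDatabase);
    }

    /** Replays the buffered changes as updates, then hands over the rebuilt index. */
    Map<String, String> finishRebuild() {
        String[] change;
        while ((change = pending.poll()) != null) {
            // put() overwrites, which plays the role of an UPDATE (delete+add)
            rebuiltIndex.put(change[0], change[1]);
        }
        return rebuiltIndex;
    }
}
```

The buffer bound is where the tuning mentioned above would happen: a
larger bound costs memory, a smaller one triggers backpressure sooner.
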
> 
> ## SPI changes
> With this design we need to be able to:
>  - dynamically instantiate a second Directory in a different path
>  - switch to delegate writes to both directories / one directory
>  - control from where Readers are opened
>  - make sure closed Readers go back to the original pool where they
> come from as their reference source could have been changed
>  - be able to switch (permanently) to a different active index
>  - destroy old index
> 
> I'm afraid each of these can affect our SPIs; likely at least
> IndexManager. I hope we can have all the logic in "behind the scenes"
> code which drives the same SPIs as of today but I'd need a POC to
> verify this.
> 
> ## Directory index path
> If we switch from one Directory to another - thinking about the
> FSDirectory - we're either violating the path configuration options
> from the user or we need to move the new index into the configured
> position when done. If the above sounds a bit complex, I'm actually
> more concerned about implementing such an atomic move on the
> filesystem.
> I guess we could agree that if the user configured an index to be in -
> say - "/var/lucene/persons" we could store the indexes in
> "/var/lucene/persons/index-a" and "/var/lucene/persons/index-b",
> alternating in a similar way to the FSMasterDirectoryProvider, but that
> takes away some control over the index position and is not backwards
> compatible. Would this be acceptable?
> 
> # Timeline
> This might need to be moved to 5.0 because of the various backwards
> compatibility concerns - ideally, if some community users would like
> to participate, we could share some early code in experimental
> branches and work together.
> 
> Comments and better ideas welcome :)
> Sanne