[hibernate-dev] Re: improving Search

Mon Jun 9 15:00:02 EDT 2008

On  Jun 7, 2008, at 11:32, Sanne Grinovero wrote:

> Hi all,
> I've been thinking on some more design changes for Hibernate Search,
> I hope to start a creative discussion.
>
>
> A) FSMasterDirectoryProvider and Index Snapshots.
> It is currently making copies of its changing index; using Lucene's
> SnapshotDeletionPolicy
> you don't need to switch from an index to another, it is possible to
> take snapshots
> of current index even while it is in use, and copy it to another  
> destination.

How does SnapshotDeletionPolicy work exactly? I am not terribly  
familiar with DeletionPolicy yet.
Does it somehow involve not having cluster changes (i.e. an intra-VM  
policy rather than an inter-VM one)?
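
To make the question concrete, here is a toy model in plain Java of what a snapshotting deletion policy does: it pins the latest commit so that its files survive later commits until the snapshot is released. None of these types are Lucene classes; `ToySnapshotPolicy` and its methods are hypothetical names, and the real SnapshotDeletionPolicy should be checked against its javadoc. In this model the pinned state lives in one in-memory object, which is why it only protects against deletions made through that object.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are Lucene classes.
class ToySnapshotPolicy {
    // Each commit is represented by the set of file names it references.
    private final List<Set<String>> commits = new ArrayList<>();
    // Files that currently exist in the (pretend) index directory.
    private final Set<String> filesOnDisk = new HashSet<>();
    // The commit pinned by an outstanding snapshot, or null if none.
    private Set<String> snapshot;

    // Called when the writer commits: record the commit, then prune old ones.
    synchronized void onCommit(Set<String> files) {
        commits.add(new HashSet<>(files));
        filesOnDisk.addAll(files);
        prune();
    }

    // Pin the latest commit and return the file names that are safe to copy.
    synchronized Set<String> snapshot() {
        if (commits.isEmpty()) throw new IllegalStateException("no commit yet");
        if (snapshot != null) throw new IllegalStateException("snapshot already held");
        snapshot = commits.get(commits.size() - 1);
        return new HashSet<>(snapshot);
    }

    // Unpin the commit; its files become eligible for deletion again.
    synchronized void release() {
        snapshot = null;
        prune();
    }

    synchronized Set<String> filesOnDisk() {
        return new HashSet<>(filesOnDisk);
    }

    // Keep only the latest commit and the pinned one; delete all other files.
    private void prune() {
        if (commits.isEmpty()) return;
        Set<String> latest = commits.get(commits.size() - 1);
        commits.removeIf(c -> c != latest && c != snapshot);
        Set<String> live = new HashSet<>();
        for (Set<String> c : commits) live.addAll(c);
        filesOnDisk.retainAll(live);
    }
}
```

A copy to another destination would call `snapshot()`, copy the returned files while indexing continues, and then call `release()`.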

>
>
> The main problem I see with this is that the SnapshotDeletion policy
> must be given to the IndexWriter constructor, so the DirectoryProvider
> is going to need interaction with the Workspace;
> Also the DirectoryProvider would need to know about, and potentially block,
> the opening of a new IndexWriter; this would currently require many
> changes, but would be easy if solved together with the next proposal:

The DirectoryProvider has access to the SearchFactoryImplementor, so  
it can go set some flag available to other parts of the system if  
needed.
>
>
>
> B) stop using IndexReader in write mode, doing all work in a single  
> IndexWriter.
> There are currently many locks in Search to coordinate the 3 different
> operations
> done on indexes:
> 1- searching
> 2- deleting
> 3- inserting
> The last two are currently a bottleneck, as each transaction needs  
> to lock
> the index on commit, open an indexreader (expensive), use it (could be
> expensive),
> close it (flushing files), open an indexwriter (expensive), close it
> (flushing again).
> Only at the end of this sequence is the index unlocked, so that the next
> transaction commit can be processed.
> (Emmanuel, did I get this short explanation right ? Too catastrophic?)
> We discussed sharing of these elements across different committing  
> transactions,
> but it isn't easy to manage and benefits are not clear.

The explanation is correct but a bit cataclysmic. We open the reader  
*only* if there is an actual deletion for the given Directory  
provider. So in some systems we might very well almost never open a rw  
reader.
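
The sequence described above, with the reader opened only when there is a deletion, can be sketched as a toy model in plain Java. `ToyCommitSequence` and the step names are hypothetical and no Lucene types are involved. The writer is opened unconditionally here, following the description as given; the thread does not say whether it is skipped when there is nothing to add.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Toy model of the per-transaction commit sequence; no Lucene types involved.
class ToyCommitSequence {
    // One lock per index: commits on the same index are fully serialized.
    private final ReentrantLock indexLock = new ReentrantLock();

    // Returns the ordered steps performed for one transaction commit.
    List<String> commit(int deletions, int additions) {
        List<String> steps = new ArrayList<>();
        indexLock.lock();
        steps.add("lock index");
        try {
            // The read-write reader is opened only if there is an actual deletion.
            if (deletions > 0) {
                steps.add("open reader");
                steps.add("delete " + deletions);
                steps.add("close reader");
            }
            // Assumption: the writer is always opened, as in the description.
            steps.add("open writer");
            if (additions > 0) {
                steps.add("add " + additions);
            }
            steps.add("close writer");
        } finally {
            steps.add("unlock index");
            indexLock.unlock();
        }
        return steps;
    }
}
```

An insert-only transaction, `commit(0, 3)`, never touches the reader, which is the case where the sequence is cheaper than the quoted description suggests.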

>
>
> If we could avoid the need to use an IndexReader to alter the index,  
> the same
> IndexWriter could be reused, we could just flush changes when needed.
> The IndexReader appears to be used in two situations:
> the first is for "purgeAll(class)"; this can easily be replaced as the
> same functionality
> is available in IndexWriter.
> The other use is removal of an entity from the index:
>
> The current remove has many //TODO and a good explanation of why
> an IndexReader is needed; I hope we could resolve this issue, I have
> two ideas. The problem is that Lucene permits removal only by a single
> term, but we uniquely identify an entity by the pair
> entityName + dbID.
> Solutions:
> 1) next release of Lucene will provide removal by Query in  
> IndexWriter,
> we make a BooleanQuery containing both entityName and database ID
> and it should be solved.

We need to check how efficiently this has been implemented. If it is  
a mechanism similar to IndexModifier, it's not worth it.

>
> 2) We add an additional field containing a unique identifier for  
> entityname+id,
> use this for removal. This is probably the fastest way for  
> frequently changing
> indexes but it will waste a lot of space.

I don't like this much as the index structure is very natural so far,  
there is no field that is specific to HSearch. Even the class type  
makes sense for third party apps.
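
For reference, the trade-off in option 2 can be sketched in plain Java as a toy model. `ToyUniqueKeyIndex`, the `#` separator and all method names are hypothetical. The sketch assumes entity names are Java class names, which cannot contain `#`, so the combined key stays unambiguous. Removal becomes a single-key lookup, at the price of storing one extra synthetic value per document.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of option 2; the separator and all names are hypothetical.
class ToyUniqueKeyIndex {
    // Maps the extra field's value to the stored document (here just a string).
    private final Map<String, String> docsByKey = new HashMap<>();

    // Combine entity name and database id into one single-term identifier.
    static String uniqueKey(String entityName, String id) {
        return entityName + "#" + id;
    }

    void add(String entityName, String id, String document) {
        docsByKey.put(uniqueKey(entityName, id), document);
    }

    // Removal needs only the single combined term.
    boolean remove(String entityName, String id) {
        return docsByKey.remove(uniqueKey(entityName, id)) != null;
    }

    int size() {
        return docsByKey.size();
    }

    // Characters spent on the extra field across all documents.
    int extraChars() {
        return docsByKey.keySet().stream().mapToInt(String::length).sum();
    }
}
```

`extraChars()` makes the space cost visible: it grows with every indexed entity, and the field has no meaning to a third-party application reading the index.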

>
> 3) Always use sharding when we detect that multiple entities
> use the same index. We could define separate indexes for each
> different
> type of entities and transparently bridge them together when the  
> query needs
> multiple types in return, using the usual MultiReaders.
> This may result in faster queries for single return types but slower  
> queries
> for polymorphic queries.
> When using this sort of sharding deletion becomes easy by single term:
> the database identifier.

This is what happens by default already: one index directory per entity  
type. We could compute a flag at init time to know that (it's already  
in place actually) and use it to issue a Term query rather than the full  
query if only one entity is present in the index. Let's open a JIRA  
issue.
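
A minimal sketch of that flag, as a toy model in plain Java: `ToyIndex` and its members are hypothetical names, and the in-memory list stands in for a real index. The flag is computed once from the set of entity types mapped to the index. Removal then matches on the id alone when the index holds a single entity type, and on entity name plus id otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model: an in-memory list stands in for one index directory.
class ToyIndex {
    private static final class Doc {
        final String entityName;
        final String id;
        Doc(String entityName, String id) {
            this.entityName = entityName;
            this.id = id;
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    // Computed once at init time: does this index hold exactly one entity type?
    private final boolean singleEntity;

    ToyIndex(Set<String> entityTypesInThisIndex) {
        this.singleEntity = entityTypesInThisIndex.size() == 1;
    }

    boolean usesSingleTermDelete() {
        return singleEntity;
    }

    void add(String entityName, String id) {
        docs.add(new Doc(entityName, id));
    }

    // Returns the number of documents removed.
    int remove(String entityName, String id) {
        int before = docs.size();
        if (singleEntity) {
            // Single-term case: the database id alone is unique in this index.
            docs.removeIf(d -> d.id.equals(id));
        } else {
            // Shared index: both entity name and id are needed.
            docs.removeIf(d -> d.id.equals(id) && d.entityName.equals(entityName));
        }
        return before - docs.size();
    }

    int size() {
        return docs.size();
    }
}
```

In the shared case, removing Book 1 leaves Author 1 in place, which is the reason the id alone is not enough there.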

>
>
> If we could enforce the use of one of these strategies we could remove
> most locking;
> also the IndexWriter may be used concurrently by each thread needing  
> to commit;
> it would never need to be closed or reopened as it is always aware of
> the complete index state.

Not if it's updated in a cluster, right?
Plus, given the lock contention we experienced (on IndexReader)  
in the recent test case, I want to be sure it's actually faster than  
opening every time.

>
> As the Document analysis process is done in the IndexWriter it is  
> very important
> to use it concurrently; finally, having a single IndexWriter per index
> could unlock many more ideas for additional improvements.
> Integrating with the "massive batch reindexer" from my previous post  
> would
> be much easier, and it could become the standard indexing process  
> without
> having to relax any transaction requirement.
>
>
> kind regards,
> and sorry for being so verbose
>
> Sanne