[hibernate-dev] improving Search

Sanne Grinovero sanne.grinovero at gmail.com
Sat Jun 7 11:32:12 EDT 2008


Hi all,
I've been thinking about some more design changes for Hibernate Search,
and I hope to start a creative discussion.


A) FSMasterDirectoryProvider and index snapshots.
FSMasterDirectoryProvider currently makes copies of its changing index.
Using Lucene's SnapshotDeletionPolicy you don't need to switch from one
index to another: it is possible to take snapshots of the current index
even while it is in use, and copy them to another destination.

The main problem I see with this is that the SnapshotDeletionPolicy
must be given to the IndexWriter constructor, so the DirectoryProvider
is going to need interaction with the Workspace.
The DirectoryProvider would also need to know about, and potentially block,
the opening of a new IndexWriter. Today this would require many changes,
but it would be easy if solved together with the next proposal:


B) Stop using IndexReader in write mode, doing all work in a single IndexWriter.
There are currently many locks in Search to coordinate the 3 different
operations done on indexes:
1- searching
2- deleting
3- inserting
The last two are currently a bottleneck, as each transaction needs to lock
the index on commit, open an IndexReader (expensive), use it (could be
expensive), close it (flushing files), open an IndexWriter (expensive) and
close it (flushing again).
Only at the end of this sequence is the index unlocked, so that the next
transaction commit can be processed.
(Emmanuel, did I get this short explanation right? Too catastrophic?)
We discussed sharing of these elements across different committing transactions,
but it isn't easy to manage and benefits are not clear.

If we could avoid the need to use an IndexReader to alter the index, the same
IndexWriter could be reused and we could just flush changes when needed.
The IndexReader appears to be used in two situations.
The first is for "purgeAll(class)"; this can easily be replaced, as the
same functionality is available in IndexWriter.
The other use is removal of an entity from the index:

The current remove code has many //TODO comments and a good explanation of
why an IndexReader is needed. I hope we can resolve this issue; I have
a few ideas. The problem is that Lucene permits removal only by a single
term, but we uniquely identify an entity by the pair entityName + dbID.
Solutions:
1) The next release of Lucene will provide removal by Query in IndexWriter;
we make a BooleanQuery containing both entityName and database ID
and it should be solved.
2) We add an additional field containing a unique identifier for entityName+id
and use this for removal. This is probably the fastest way for frequently
changing indexes, but it will waste a lot of space.
3) We always use sharding when we detect that multiple entities
use the same index. We could define separate indexes for each different
type of entity and transparently bridge them together when the query needs
multiple types in return, using the usual MultiReaders.
This may result in faster queries for single return types but slower queries
for polymorphic queries.
When using this sort of sharding, deletion by a single term becomes easy:
the term is the database identifier.
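
To make options 2 and 3 a bit more concrete, here is a minimal sketch in
plain Java. It is only a toy model: ordinary maps stand in for Lucene
indexes, and the class name, the method names and the '#' separator are all
made up for illustration; none of this is existing Hibernate Search or
Lucene code. It only shows that in both cases removal needs a single key.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of options 2 and 3. Plain maps stand in for Lucene indexes. */
final class SingleTermRemovalSketch {

    // Hypothetical separator, assumed never to appear in an entity name.
    static final char SEPARATOR = '#';

    /** Option 2: one extra field value that is unique for entityName + id. */
    static String uniqueKey(String entityName, Object dbId) {
        return entityName + SEPARATOR + dbId;
    }

    // Option 2: a single shared index, keyed by the composite value.
    final Map<String, String> sharedIndex = new HashMap<>();

    // Option 3: one index per entity type, each keyed by the database id alone.
    final Map<String, Map<Object, String>> indexPerEntity = new HashMap<>();

    /** Adds the document to both models so the two removals can be compared. */
    void add(String entityName, Object dbId, String document) {
        sharedIndex.put(uniqueKey(entityName, dbId), document);
        indexPerEntity
                .computeIfAbsent(entityName, name -> new HashMap<>())
                .put(dbId, document);
    }

    /** Option 2: removal by the single composite term. */
    boolean removeFromShared(String entityName, Object dbId) {
        return sharedIndex.remove(uniqueKey(entityName, dbId)) != null;
    }

    /** Option 3: pick the entity's own index, then remove by database id only. */
    boolean removeFromShard(String entityName, Object dbId) {
        Map<Object, String> shard = indexPerEntity.get(entityName);
        return shard != null && shard.remove(dbId) != null;
    }
}
```

In the sketch a Book and an Author can share the database id 1 without
clashing: option 2 separates them through the composite key, option 3
through the choice of index.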

If we could enforce the use of one of these strategies we could remove
most locking.
The IndexWriter could also be used concurrently by each thread needing to
commit; it would never need to be closed or reopened, as it is always aware
of the complete index state.
As the Document analysis process is done in the IndexWriter, it is very
important to use it concurrently. Finally, having a single IndexWriter per
index could unlock many more ideas for additional improvements.
Integrating with the "massive batch reindexer" from my previous post would
be much easier, and it could become the standard indexing process without
having to relax any transaction requirement.
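
As a rough illustration of the single-writer idea, the following plain Java
sketch lets several committing threads share one long-lived writer object.
ToyWriter is a hypothetical stand-in backed by a concurrent map, not the
Lucene IndexWriter, and runCommits is an invented helper; the sketch only
shows the shape of the design: the writer is opened once, every thread
applies its changes and flushes, and no thread takes an index-wide lock or
reopens anything.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedWriterSketch {

    /** Hypothetical stand-in for a thread-safe, long-lived index writer. */
    static final class ToyWriter {
        final ConcurrentMap<String, String> docs = new ConcurrentHashMap<>();
        final AtomicInteger flushes = new AtomicInteger();

        void update(String key, String document) { docs.put(key, document); }
        void delete(String key) { docs.remove(key); }
        void flush() { flushes.incrementAndGet(); } // counts flushes only
    }

    /** Simulates concurrent transaction commits against one shared writer. */
    static ToyWriter runCommits(int threads, int commitsPerThread) {
        ToyWriter writer = new ToyWriter(); // opened once, never reopened
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int thread = t;
            pool.execute(() -> {
                for (int i = 0; i < commitsPerThread; i++) {
                    // Each "commit" applies its change, then flushes.
                    writer.update("Book#" + thread + "-" + i, "doc");
                    writer.flush();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("commits did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for commits", e);
        }
        return writer;
    }
}
```

Whether flushing on every commit is the right granularity is left open
here; the sketch flushes each time only to keep the per-transaction
behaviour visible.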


kind regards,
and sorry for being so verbose

Sanne


