[hibernate-dev] Hibernate Search 3.1

Tue Feb 26 06:41:10 EST 2008

Hi all,

I've been doing a lot of work with Hibernate Search recently and have
been pushing the Lucene search side of it pretty hard.  I've made
several changes to improve performance and functionality and have
created patches for these locally, but would like some feedback on my
approach:

1) Similarity annotation (HSEARCH-136)

I've created a patch for this which does the following:

 * Added a Similarity annotation which can be added at the class level.
 * Modified Workspace.java to set Similarities when creating an IndexWriter
 * In CacheableMultiReader changed the visibility of subReaders to
package-private so other classes in the package can use them
 * In ReaderProviderHelper added methods which can resolve underlying
IndexReaders from a Searcher or Reader passed in
 * In DocumentBuilder added code to find the similarity annotation and
store its implementation locally
 * In FullTextQueryImpl added code which works out which similarity to
use when creating a Reader and changed all the finally blocks to use a
common piece of code to close the reader.
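
To make the annotation lookup concrete, here is a minimal sketch of how
the class-level annotation and the DocumentBuilder step could fit
together.  It is not the patch itself: the "impl" attribute, the Book
entity, FlatLengthSimilarity and SimilarityResolver are names assumed
for illustration, and SimilarityStandIn replaces Lucene's Similarity
class so the snippet compiles without Lucene on the classpath.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for Lucene's Similarity class (assumption, keeps the sketch self-contained).
abstract class SimilarityStandIn {
    abstract float lengthNorm(int numTerms);
}

// Hypothetical custom similarity: ignores field length when scoring.
class FlatLengthSimilarity extends SimilarityStandIn {
    @Override
    float lengthNorm(int numTerms) {
        return 1.0f;
    }
}

// Assumed shape of the class-level annotation; the attribute name is a guess.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Similarity {
    Class<? extends SimilarityStandIn> impl();
}

// Hypothetical indexed entity carrying the annotation.
@Similarity(impl = FlatLengthSimilarity.class)
class Book {
    String title;
}

class SimilarityResolver {
    // Reads the annotation and instantiates the implementation it names.
    // Returns null when the entity is not annotated, so the caller can
    // fall back to the default similarity.
    static SimilarityStandIn resolve(Class<?> entityType) {
        Similarity annotation = entityType.getAnnotation(Similarity.class);
        if (annotation == null) {
            return null;
        }
        try {
            return annotation.impl().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "Cannot instantiate similarity for " + entityType.getName(), e);
        }
    }
}
```

The resolved instance would then be stored once per entity type and
handed to both the IndexWriter and the searcher, so indexing and
querying score consistently.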

This seems to work well in my dev environment, and I'll be sending out
a patch for 3.0.1 later today as I've already had some feedback from
Emmanuel on this one.  The solution is a lot simpler than the first
patch I uploaded to Jira.

2) Explaining results

This uses the new DOCUMENT_ID projection introduced in 3.0.1 to
explain query results (we need this so the customer can understand
their search results in the backoffice interface).  I added an explain
method to both implementations of FullTextQueryImpl which is only
available by casting (i.e. no interface changes).  I think explain()
is a fairly advanced function, so it's acceptable to access it by
casting.

3) Counting results

In the current implementation we only want to perform one Lucene query
per search (all projected).  However, in order to get both a result
count and the results themselves it is currently necessary to invoke
the Lucene query twice.

I have made changes to allow this information to propagate through to
the user whilst only making one search invocation, which has obvious
benefits for performance:

 * Created a class called SearchResultList which extends ArrayList.
This has an extra property for setting and retrieving the total
hitcount.
 * Added the method "List load(int hitCount, EntityInfo ...
entityInfos)" onto the Loader interface.  Kept the existing "List
load(EntityInfo ... entityInfos)" method, which can be stubbed by
passing a dummy hitcount if used.
 * Changed the loader implementations themselves to create a
SearchResultList with the hitcount instead of an ArrayList.

There are three ways of then accessing the hitcount:

 a) Casting the list to a SearchResultList
 b) Returning a SearchResultList instead of a List from the Loader
interface and propagating this through
 c) Creating an interface from SearchResultList that extends List and
then having a private implementation but doing as per (b)
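
As a rough illustration of option (a), the sketch below shows a list
subclass carrying the total hitcount and a caller recovering it with a
cast.  The accessor names, the generic parameter and the HitCountDemo
helper are assumptions, and plain Strings stand in for EntityInfo and
the loaded entities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// A page of results plus the total number of hits for the whole query.
class SearchResultList<T> extends ArrayList<T> {
    private static final long serialVersionUID = 1L;

    private int totalHitCount;

    SearchResultList(int totalHitCount, Collection<? extends T> pageOfResults) {
        super(pageOfResults);
        this.totalHitCount = totalHitCount;
    }

    int getTotalHitCount() {
        return totalHitCount;
    }

    void setTotalHitCount(int totalHitCount) {
        this.totalHitCount = totalHitCount;
    }
}

class HitCountDemo {
    // Stand-in for Loader.load(int hitCount, EntityInfo... entityInfos):
    // the declared return type stays List, so existing callers are unaffected.
    static List<String> load(int hitCount, String... entityInfos) {
        return new SearchResultList<String>(hitCount, Arrays.asList(entityInfos));
    }

    // Option (a): the caller casts when it wants the total hitcount.
    // Falls back to the page size if a plain List was returned.
    static int totalHits(List<?> results) {
        if (results instanceof SearchResultList) {
            return ((SearchResultList<?>) results).getTotalHitCount();
        }
        return results.size();
    }
}
```

With this shape, size() still reports the number of entities in the
returned page, while getTotalHitCount() reports how many documents
matched overall, which is what pagination needs.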

I'm particularly interested in some feedback on this as it's a big
performance gain for applications that need the total hit count, and
contains the most breaking changes of any of the things I've done.

4) Caching filter BitSets

In order to fix the problem with readers there's going to need to be a
way of accessing the underlying readers of a CacheableMultiReader in
order to store the appropriate references to cache by.  I think it's
going to be better to either make the subReaders property public or to
define an accessor for it.  I've done this locally so I can hack up a
working caching strategy based on a WeakReference to the first reader,
which works for my case but not the general case.
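
For reference, the kind of cache I mean looks roughly like the sketch
below.  It is only an approximation of the idea: FilterBitSetCache and
its method are hypothetical names, the reader is a generic type
parameter rather than a Lucene IndexReader, and a WeakHashMap is used
in place of a hand-rolled WeakReference.  Keying on a single sub-reader
is exactly the limitation mentioned above.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Caches one filter BitSet per reader instance.
class FilterBitSetCache<R> {
    // WeakHashMap holds its keys weakly, so an entry becomes collectable
    // once nothing else references the reader.  The cached value must not
    // hold a strong reference back to the key, which a BitSet does not.
    private final Map<R, BitSet> cache = new WeakHashMap<R, BitSet>();

    // Returns the cached bits for this reader, computing them on first use.
    synchronized BitSet bitsFor(R reader, Function<R, BitSet> computeBits) {
        BitSet bits = cache.get(reader);
        if (bits == null) {
            bits = computeBits.apply(reader);
            cache.put(reader, bits);
        }
        return bits;
    }
}
```

A general solution would need a key built from all the underlying
readers, which is why access to subReaders matters.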

Any feedback on these would be very useful.  I've made the changes
locally, but would like some confirmation about direction before I
start spraying patches around.

Cheers,

Nick