[hibernate-issues] [Hibernate-JIRA] Commented: (HSEARCH-194) Inconsistent performance between hibernate search and pure lucene access

Wed May 28 22:25:33 EDT 2008

    [ http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_30271 ] 

Emmanuel Bernard commented on HSEARCH-194:
------------------------------------------

I've spent a couple of hours testing all this.

My test runs 100 threads in parallel. Each thread runs the query 'name:maria OR description:long<i>' (100 different queries), accesses the first 100 matching documents, retrieves one field and puts the results in a List<String>.
The index is 5 MB on disk. The tests do not warm up by pre-opening Readers.
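
For readers who want to reproduce this kind of run, here is a minimal sketch of such a harness in plain Java. It is not the actual test code: the class name ParallelQueryBenchmark and the queryRunner hook are hypothetical, and the hook is where a real Lucene or Hibernate Search query for thread i would go. The latch releases all threads at once so that every query arrives at the same time, as in the test described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

final class ParallelQueryBenchmark {

    /**
     * Starts `threads` tasks at the same instant and returns the elapsed
     * wall-clock time in milliseconds until all of them have finished.
     * `queryRunner` is a hypothetical hook: given the thread index i, it runs
     * query number i and returns the extracted field values.
     */
    static long run(int threads, IntFunction<List<String>> queryRunner) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        List<Future<List<String>>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < threads; i++) {
                final int index = i;
                futures.add(pool.submit(() -> {
                    start.await();                  // block until every task is queued
                    return queryRunner.apply(index);
                }));
            }
            long begin = System.nanoTime();
            start.countDown();                      // release all threads together
            for (Future<List<String>> future : futures) {
                future.get();                       // wait for each query to complete
            }
            return (System.nanoTime() - begin) / 1_000_000;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in runner: a real test would execute the search here.
        long elapsed = run(100, i ->
                Collections.singletonList("result of name:maria OR description:long" + i));
        System.out.println("elapsed ms: " + elapsed);
    }
}
```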

Plain Lucene
81423 84322 65698 86481 122023 = 87989.4

Hibernate Search regular settings
151810 172000 162000 157000 154000 = 159362

Hibernate Search with DumpReaderStrategy (*)
120000 148000 137000 133000 148000 = 137200
55783 84179 53086 74069 97635 = 

Hibernate Search without the class filter clause (**)
107000 103000 132000 86000 115000 = 108600

(*) The DumpReaderStrategy stores the reader in a static field (preparing it eagerly) and reuses it: there is no lock latency.
(**) By default, Hibernate Search adds a Boolean clause to every query to restrict it to the targeted classes (and their subclasses), unless you specify no class type in createFullTextQuery. This test explicitly removes the filter clause.
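
To make the shape of that extra clause concrete, here is a small sketch that models the wrapped query as a Lucene-style query string: the user query and the class restriction both become mandatory clauses. This only illustrates the idea and is not the Hibernate Search implementation. The class name ClassFilterSketch, the example entity names and the use of "_hibernate_class" as the class field name are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class ClassFilterSketch {

    // Assumption: name of the index field that stores the entity class.
    static final String CLASS_FIELD = "_hibernate_class";

    /**
     * Wraps the user query so that a hit must match the query AND belong to
     * one of the targeted classes. With no targeted class, the query is
     * returned unchanged (no filter clause).
     */
    static String withClassFilter(String userQuery, List<String> targetedClasses) {
        if (targetedClasses.isEmpty()) {
            return userQuery;
        }
        StringJoiner classClause = new StringJoiner(" ");
        for (String className : targetedClasses) {
            classClause.add(CLASS_FIELD + ":" + className); // optional alternatives
        }
        return "+(" + userQuery + ") +(" + classClause + ")";
    }

    public static void main(String[] args) {
        // com.example.* are hypothetical entity classes.
        System.out.println(withClassFilter(
                "name:maria OR description:long1",
                Arrays.asList("com.example.Person", "com.example.Employee")));
    }
}
```

The extra mandatory clause is evaluated for every query, which is why dropping it when an index holds only one class is an attractive optimization.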

So if we use regular Hibernate Search as the baseline:

HSearch dump is 13.9% faster
HSearch without clause is 31.8% faster
Plain Lucene is 44.7% faster
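
The arithmetic behind these figures can be checked with a short sketch. The class and method names are made up for the example; the sample values are the ones listed above, and "faster" is taken to mean the relative reduction of the mean compared with the regular Hibernate Search run.

```java
final class BenchmarkSummary {

    /** Arithmetic mean of the measured samples. */
    static double mean(long... samples) {
        long sum = 0;
        for (long sample : samples) {
            sum += sample;
        }
        return (double) sum / samples.length;
    }

    /** Percentage by which `candidate` is lower than `baseline`. */
    static double percentFaster(double baseline, double candidate) {
        return (baseline - candidate) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double regular  = mean(151810, 172000, 162000, 157000, 154000);
        double dump     = mean(120000, 148000, 137000, 133000, 148000);
        double noClause = mean(107000, 103000, 132000, 86000, 115000);
        double lucene   = mean(81423, 84322, 65698, 86481, 122023);

        System.out.printf("HSearch dump:           %.1f%% faster%n", percentFaster(regular, dump));
        System.out.printf("HSearch without clause: %.1f%% faster%n", percentFaster(regular, noClause));
        System.out.printf("Plain Lucene:           %.1f%% faster%n", percentFaster(regular, lucene));
    }
}
```

Rounded to one decimal this gives 13.9, 31.9 and 44.8; the 31.8 and 44.7 quoted above are the same values truncated rather than rounded.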

I could not reproduce the 30+ seconds per query as described in the post.

Potential optimizations:
 - remove the unnecessary clause when we know that there is only one class per index (we need to add some more metadata): that is fairly easy
 - rework the SharedReaderProvider and use some of Sanne's ideas to warm it up in the background. As it is today, the SharedReaderProvider is very slow to warm up in this particular case (all queries arrive at the same time, and since we don't wait for a Reader that is already being created, we create a new one most of the time: reusability is limited)
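
As an illustration of the "wait for the reader being created" idea, here is a minimal sketch in plain Java. It is not the SharedReaderProvider code and not a proposal for its final design: the class WaitingReaderCache and the opener function are hypothetical, and the sketch ignores index changes and reader closing entirely. It relies on ConcurrentHashMap.computeIfAbsent, which applies the function at most once per key and makes concurrent callers for that key wait for the result.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class WaitingReaderCache<R> {

    private final ConcurrentMap<String, R> readers = new ConcurrentHashMap<>();
    private final Function<String, R> opener; // hypothetical: opens a reader for an index name

    WaitingReaderCache(Function<String, R> opener) {
        this.opener = opener;
    }

    /** Returns the shared reader, opening it once; concurrent callers wait instead of opening their own. */
    R get(String indexName) {
        return readers.computeIfAbsent(indexName, opener);
    }

    /** Demo: many threads request the same index at once; returns how many readers were opened. */
    static int countOpens(int threads) {
        AtomicInteger opens = new AtomicInteger();
        WaitingReaderCache<Object> cache = new WaitingReaderCache<>(name -> {
            opens.incrementAndGet();
            return new Object(); // stand-in for an expensive reader
        });
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> cache.get("people"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return opens.get();
    }

    public static void main(String[] args) {
        System.out.println("readers opened: " + countOpens(100)); // prints "readers opened: 1"
    }
}
```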

Still todo:
I still cannot completely explain why queries run in parallel take so long. Unlike with pure Lucene, the cache does not seem to kick in very well (even if the SharedReaderProvider is artificially prewarmed).

We should redo the tests with a warm-up phase. The very early tests I did show:
Lucene 7500, HSearch no class clause + pure Reader reuse 40000, Regular HSearch 120000

We have enhancements to do :)

Note that the test shown in the forum is unfair, as it compares Lucene *not* accessing the documents or the fields with Hibernate Search applying the field access logic. If you want to avoid accessing fields, you can use getResultSize() (without calling list()).

> Inconsistent performance between hibernate search and pure lucene access
> ------------------------------------------------------------------------
>
>                 Key: HSEARCH-194
>                 URL: http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-194
>             Project: Hibernate Search
>          Issue Type: Bug
>          Components: query
>    Affects Versions: 3.0.1.GA
>         Environment: Linux - Hibernate 3.2.6, Hibernate Annotations 3.3.1 - Lucene 2.3.1
>            Reporter: Stephane Nicoll
>            Priority: Critical
>         Attachments: Monitor_Usage_Statistics.html
>
>
> I have a simple index that contains:
> * id (pk of the entity)
> * keywords (a list of tokens)
> The index contains 100.000 objects and the keywords field has 2 tokens from a list of 40 different values
> What I want to do is retrieve all the IDs that match a given Lucene query on the keywords. So for that I'm doing something like:
> FullTextSession fullTextSession = Search.createFullTextSession(session);
> QueryParser parser = new QueryParser("keywords", luceneAnalyzer);
> org.apache.lucene.search.Query hibernateQuery = parser.parse("foo AND bar");
> FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(hibernateQuery, target);
> fullTextQuery.setProjection("id");
> fullTextQuery.setResultTransformer(resultTransformer);
> Iterator it = fullTextQuery.iterate();
> Where ResultTransformer is
> private static class FirstObjectResultTransformer implements ResultTransformer {
>         public Object transformTuple(Object[] objects, String[] strings) {
>             return objects[0];
>         }
>         public List transformList(List list) {
>             return list;
>         }
>     }
> If I do a load test with a single thread, the execution time of my Lucene query is around 200 msec. If I do a load test with 10 threads, the execution time is 2 sec (per user!). If I run the profiler on the service, I see lots of deadlocks on SegmentReader.
> Switching to a "non-shared" strategy removes the deadlocks but it's still slow (1.5 sec).
> Now, if I execute the same query on the same index and the same host with only the Lucene API, the query takes around 100 msec with 10 concurrent users. I tried to use the Lucene API from Hibernate Search but it did not change anything.
> What am I missing? Attached the profiling result.

-- 
This message is automatically generated by JIRA.