[infinispan-dev] [hibernate-dev] [HSearch] DSL for Lucene queries (was: Re: Query module new API and configurations)

Thu Aug 27 04:48:42 EDT 2009

On 26 Aug 09, at 22:08, Hardy Ferentschik wrote:

> On Wed, 2009-08-26 at 13:39 +0200, Emmanuel Bernard wrote:
>> I've been thinking about a DSL to build Lucene queries in the last
>> day.
>> What do you think of this proposal?
>
> What do you really gain compared to native Lucene queries?
> If your API achieves exactly the same as what's possible with Lucene
> it is just a 'useless' wrapper.
>
> A wrapper around native Lucene queries would make sense if it could
> somehow use some of the Hibernate Search specific meta data. As an
> extreme example one could generate some meta classes a la JPA2. This way
> one could ensure that you can get help with which field names are
> available.

Remember, Hibernate Search's mission is to make full-text search as  
easy to use as possible to increase the overall technology adoption.

There are several advantages to the DSL API listed below, but let's  
compare my example and the Lucene equivalent and see if you can still  
claim the API to be useless with a straight face.

SealedQueryBuilder qb = searchFactory.withEntityAnalyzer(Address.class);
Query luceneQuery =
qb.must(Occurs.MUST)
     .add(
         qb.boolean(Occurs.Should)
             .add( qb.term("city", "Atlanta").boostedTo(4).createQuery() )
             .add( qb.term("address1", "Peachtree").fuzzy().threshold(.7).createQuery() )
     )
     .add(
         qb.from("movingDate", "200604").to("201201").exclusive().createQuery()
     )
     .createQuery();

vs

BooleanQuery luceneQuery = new BooleanQuery();
BooleanQuery addressLocationQuery = new BooleanQuery();
Query city = new TermQuery( new Term("city", "Atlanta") );
city.setBoost(4f);
addressLocationQuery.add(city, BooleanClause.Occur.SHOULD);
Query address1 = new FuzzyQuery( new Term("address1", "Peachtree"), .7f );
addressLocationQuery.add(address1, BooleanClause.Occur.SHOULD);
luceneQuery.add(addressLocationQuery, BooleanClause.Occur.MUST);
// exclusive range: the last argument is the "inclusive" flag
Query range = new RangeQuery( new Term("movingDate", "200604"),
        new Term("movingDate", "201201"), false );
luceneQuery.add(range, BooleanClause.Occur.MUST);

Advantages:
  - the query is readable and understandable even to new Lucene users.
    BTW the example is quite a simple one: it does not involve filters,
    searching in multiple fields, query negation etc.
  - I have normalized some operations that require knowledge of the
    Lucene query hierarchy (e.g. ConstantScoreQuery,
    ConstantScorePrefixQuery, ConstantScoreRangeQuery or PrefixQuery vs
    WildcardQuery)
  - the API shows available options right away using IDE
    auto-completion, not by looking at the Query hierarchy and its
    implementations
  - the API takes the analyzer into account, which means that I can
    take my input and use it without thinking much about the underlying
    analyzer used at indexing time. In the example, my plain Lucene
    rewrite of the query will very likely fail because "Atlanta" and
    "Peachtree" should really be "atlanta" and "peachtree". In the API, we
    have the analyzer and can take that into account. Likewise for
    synonyms, phonetic approximation etc.
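
To make the analyzer point concrete, here is a minimal sketch that does
not depend on Lucene at all. ToyAnalyzer and AnalyzerMismatchDemo are
hypothetical names, and the "analysis" is an assumption reduced to
lowercasing and splitting on non-alphanumeric characters; a real
analyzer does much more. It only shows why a raw term lookup for
"Atlanta" misses an index that holds "atlanta", while an analyzer-aware
lookup hits.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerMismatchDemo {
    // Index time: the stored terms are the analyzed ones.
    static Set<String> index(String fieldValue) {
        return new HashSet<String>(ToyAnalyzer.analyze(fieldValue));
    }

    // Plain term lookup: the term is compared as-is, like a raw TermQuery.
    static boolean rawTermMatches(Set<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    // Analyzer-aware lookup: run the input through the same analysis first.
    static boolean analyzedTermMatches(Set<String> indexedTerms, String input) {
        List<String> terms = ToyAnalyzer.analyze(input);
        return !terms.isEmpty() && indexedTerms.containsAll(terms);
    }

    public static void main(String[] args) {
        Set<String> city = index("Atlanta");
        System.out.println(rawTermMatches(city, "Atlanta"));
        System.out.println(analyzedTermMatches(city, "Atlanta"));
    }
}

// Hypothetical stand-in for a Lucene Analyzer (an assumption for this
// sketch): lowercases and splits on anything that is not a letter or digit.
class ToyAnalyzer {
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }
}
```

The raw lookup fails for the same reason the plain Lucene rewrite
would: the query term never went through the analysis that produced the
indexed terms.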

Even worse, trying to search a user query containing several words in  
different fields is quite difficult in plain Lucene. In the new API it  
could look like:

String search = "harry potter";
SealedQueryBuilder qb = searchFactory.withEntityAnalyzer(Book.class);
Query luceneQuery =
     qb.searchInMultipleFields()
         .onField("title").boostedTo(4)
         .onField("title_ngram")
         .onField("description")
         .onField("description_ngram").boostedTo(.25)
         .forWords(search);

vs

String search = "harry potter";
Analyzer analyzer = searchFactory.getAnalyzer(Book.class);
Map<String,Float> boostPerField = new HashMap<String,Float>(2); // boost factors
boostPerField.put( "title", (float) 4);
boostPerField.put( "title_ngram", (float) 1);
boostPerField.put( "description", (float) 1);
boostPerField.put( "description_ngram", (float) .25);

BooleanQuery luceneQuery = new BooleanQuery();
for ( Map.Entry<String, Float> entry : boostPerField.entrySet() ) {
	final String fieldName = entry.getKey();
	final Float boost = entry.getValue();

	List<String> terms = new ArrayList<String>();
	try {
		Reader reader = new StringReader(search);
		TokenStream stream = analyzer.tokenStream( fieldName, reader);
		Token token = new Token();
		token = stream.next(token);
		while (token != null) {
			if (token.termLength() != 0) {
				String term = new String(token.termBuffer(), 0, token.termLength());
				terms.add( term );
			}
			token = stream.next(token);
		}
	}
	catch ( IOException e ) {
		throw new RuntimeException("IO exception while reading String stream??", e);
	}

	for (String term : terms) {
		TermQuery termQuery = new TermQuery( new Term( fieldName, term ) );
		termQuery.setBoost( boost );
		luceneQuery.add( termQuery, BooleanClause.Occur.SHOULD );
	}
}
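
For what it is worth, the fluent chain can be a thin layer over that
same loop. The following is only a rough sketch under stated
assumptions, not the proposed Hibernate Search API: MultiFieldSketch is
a hypothetical class, clauses are rendered as "field:term^boost"
strings instead of Lucene Query objects, and the analysis step is a toy
lowercase-and-split.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a fluent wrapper around the per-field, per-term loop.
class MultiFieldSketch {
    // LinkedHashMap keeps the fields in the order they were declared.
    private final Map<String, Float> boostPerField = new LinkedHashMap<String, Float>();
    private String currentField;

    static MultiFieldSketch searchInMultipleFields() {
        return new MultiFieldSketch();
    }

    // Registers a field with the default boost of 1.
    MultiFieldSketch onField(String field) {
        currentField = field;
        boostPerField.put(field, 1f);
        return this;
    }

    // Overrides the boost of the most recently added field.
    MultiFieldSketch boostedTo(float boost) {
        if (currentField == null) {
            throw new IllegalStateException("call onField() first");
        }
        boostPerField.put(currentField, boost);
        return this;
    }

    // Produces one SHOULD-style clause per (field, term) pair.
    List<String> forWords(String search) {
        List<String> clauses = new ArrayList<String>();
        for (Map.Entry<String, Float> entry : boostPerField.entrySet()) {
            for (String term : analyze(search)) {
                clauses.add(entry.getKey() + ":" + term + "^" + entry.getValue());
            }
        }
        return clauses;
    }

    // Toy analysis (an assumption): lowercase and split on whitespace.
    private static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String piece : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> clauses = searchInMultipleFields()
                .onField("title").boostedTo(4)
                .onField("description_ngram").boostedTo(.25f)
                .forWords("Harry Potter");
        for (String clause : clauses) {
            System.out.println(clause);
        }
    }
}
```

The user-facing call stays a single chain, while the token loop and the
boost bookkeeping live in one place inside the builder.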

Did I make my case?