On 26 Aug 09, at 22:08, Hardy Ferentschik wrote:
On Wed, 2009-08-26 at 13:39 +0200, Emmanuel Bernard wrote:
> I've been thinking about a DSL to build Lucene queries in the last
> day.
> What do you think of this proposal?
What do you really gain compared to native Lucene queries?
If your API achieves exactly the same as what's possible with Lucene
it is just a 'useless' wrapper.
A wrapper around native Lucene queries would make sense if it could
somehow use some of the Hibernate Search specific meta data. As an
extreme example, one could generate some meta classes à la JPA 2. This
way one could ensure that you can get help with which field names are
available.
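To make the meta class idea concrete, here is a minimal sketch of what such a generated class might look like, in the spirit of the JPA 2 static metamodel. The names Address_ and MetamodelSketch are hypothetical, and nothing in Hibernate Search generates such a class; the point is only that a mistyped field name would then fail at compile time.

```java
// Hypothetical generated meta class for the indexed entity Address.
// Each constant holds the name of an indexed field, so a typo in a
// field name becomes a compile error instead of an empty result list.
final class Address_ {
    static final String CITY = "city";
    static final String ADDRESS1 = "address1";
    static final String MOVING_DATE = "movingDate";

    private Address_() {
    }
}

public class MetamodelSketch {
    // Stand-in for a query-building call that takes a field name.
    static String describeTerm(String field, String value) {
        return field + ":" + value;
    }

    public static void main(String[] args) {
        // IDE auto-completion on Address_ lists the available fields.
        System.out.println(describeTerm(Address_.CITY, "atlanta"));
    }
}
```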
Remember, Hibernate Search's mission is to make full-text search as
easy to use as possible to increase the overall technology adoption.
There are several advantages to the DSL API listed below, but let's
compare my example and the Lucene equivalent and see if you can still
claim the API to be useless with a straight face.
SealedQueryBuilder qb = searchFactory.withEntityAnalyzer(Address.class);
Query luceneQuery =
    qb.must(Occurs.MUST)
        .add(
            qb.boolean(Occurs.SHOULD)
                .add( qb.term("city", "Atlanta").boostedTo(4).createQuery() )
                .add( qb.term("address1", "Peachtree").fuzzy().threshold(.7).createQuery() )
        )
        .add(
            qb.from("movingDate", "200604").to("201201").exclusive().createQuery()
        )
        .createQuery();
vs
BooleanQuery luceneQuery = new BooleanQuery();
BooleanQuery addressLocationQuery = new BooleanQuery();
Query city = new TermQuery( new Term("city", "Atlanta") );
city.setBoost(4f);
addressLocationQuery.add(city, BooleanClause.Occur.SHOULD);
Query address1 = new FuzzyQuery( new Term("address1", "Peachtree"), .7f );
addressLocationQuery.add(address1, BooleanClause.Occur.SHOULD);
luceneQuery.add(addressLocationQuery, BooleanClause.Occur.MUST);
Query range = new RangeQuery( new Term("movingDate", "200604"),
    new Term("movingDate", "201201"), false );
luceneQuery.add(range, BooleanClause.Occur.MUST);
Advantages:
- the query is readable and understandable even to new Lucene users.
BTW the example is a quite simple one, it does not involve filter,
search in multiple fields, query negation etc.
- I have normalized some operations that require knowledge of the
Lucene query hierarchy (e.g. ConstantScoreQuery,
ConstantScorePrefixQuery, ConstantScoreRangeQuery or PrefixQuery vs
WildcardQuery)
- the API shows available options right away using IDE
auto-completion, not by looking at the Query hierarchy and its
implementations
- the API does take the analyzer into account which means that I can
take my input and use it without thinking much about the underlying
analyzer used at indexing time. In the example, my plain Lucene
rewrite of the query will very likely fail because "Atlanta" and
"Peachtree" should really be "atlanta" and "peachtree". In the API, we
have the analyzer and can take that into account. Likewise for
synonyms, phonetic approximation etc.
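To illustrate the analyzer point without pulling in Lucene, here is a toy sketch. The analyze() method is a made-up stand-in for a lowercasing analyzer, and the "index" is just a set of terms; it is not how Lucene stores anything, it only shows why the raw input misses and the analyzed input hits.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchSketch {
    // Toy stand-in for an analyzer: split on whitespace and lowercase.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Index" the city value the way a lowercasing analyzer would.
        Set<String> indexedTerms = new HashSet<String>(analyze("Atlanta"));

        // A raw term lookup compares the exact string and misses.
        System.out.println(indexedTerms.contains("Atlanta")); // false

        // Passing the input through the same analyzer first finds it.
        System.out.println(indexedTerms.contains(analyze("Atlanta").get(0))); // true
    }
}
```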
Even worse, trying to search a user query containing several words in
different fields is quite difficult in plain Lucene. In the new API it
could look like:
String search = "harry potter";
SealedQueryBuilder qb = searchFactory.withEntityAnalyzer(Book.class);
Query luceneQuery =
    qb.searchInMultipleFields()
        .onField("title").boostedTo(4)
        .onField("title_ngram")
        .onField("description")
        .onField("description_ngram").boostedTo(.25)
        .forWords(search);
vs
String search = "harry potter";
Analyzer analyzer = searchFactory.getAnalyzer(Book.class);
// boost factors
Map<String,Float> boostPerField = new HashMap<String,Float>(4);
boostPerField.put( "title", (float) 4);
boostPerField.put( "title_ngram", (float) 1);
boostPerField.put( "description", (float) 1);
boostPerField.put( "description_ngram", (float) .25);
BooleanQuery luceneQuery = new BooleanQuery();
for ( Map.Entry<String, Float> entry : boostPerField.entrySet() ) {
    final String fieldName = entry.getKey();
    final Float boost = entry.getValue();
    // run the user input through the analyzer for this field
    List<String> terms = new ArrayList<String>();
    try {
        Reader reader = new StringReader(search);
        TokenStream stream = analyzer.tokenStream( fieldName, reader);
        Token token = new Token();
        token = stream.next(token);
        while (token != null) {
            if (token.termLength() != 0) {
                String term = new String(token.termBuffer(), 0, token.termLength());
                terms.add( term );
            }
            token = stream.next(token);
        }
    }
    catch ( IOException e ) {
        throw new RuntimeException("IO exception while reading String stream??", e);
    }
    // one boosted, optional term query per analyzed term
    for (String term : terms) {
        TermQuery termQuery = new TermQuery( new Term( fieldName, term ) );
        termQuery.setBoost( boost );
        luceneQuery.add( termQuery, BooleanClause.Occur.SHOULD );
    }
}
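For what it's worth, the fluent side of this does not need much machinery. Below is a rough, Lucene-free sketch of how the onField / boostedTo / forWords chain could collect its state. Everything in it (MultiFieldSketch, Clause, the whitespace-and-lowercase tokenizing) is a hypothetical simplification, not the proposed implementation; a real version would call the entity's analyzer per field and emit Lucene queries instead of Clause objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class MultiFieldSketch {
    // One generated clause: field name, analyzed term, boost.
    static final class Clause {
        final String field;
        final String term;
        final float boost;

        Clause(String field, String term, float boost) {
            this.field = field;
            this.term = term;
            this.boost = boost;
        }

        @Override
        public String toString() {
            return field + ":" + term + "^" + boost;
        }
    }

    // Insertion-ordered so clauses come out in the order fields were added.
    private final Map<String, Float> boosts = new LinkedHashMap<String, Float>();
    private String lastField;

    MultiFieldSketch onField(String field) {
        boosts.put(field, 1f); // default boost
        lastField = field;
        return this;
    }

    MultiFieldSketch boostedTo(float boost) {
        if (lastField == null) {
            throw new IllegalStateException("call onField() first");
        }
        boosts.put(lastField, boost); // applies to the most recent field
        return this;
    }

    // Toy analysis: lowercase and split on whitespace.
    List<Clause> forWords(String words) {
        List<Clause> clauses = new ArrayList<Clause>();
        for (Map.Entry<String, Float> entry : boosts.entrySet()) {
            for (String term : words.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    clauses.add(new Clause(entry.getKey(), term, entry.getValue()));
                }
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<Clause> clauses = new MultiFieldSketch()
            .onField("title").boostedTo(4)
            .onField("description")
            .forWords("Harry Potter");
        for (Clause clause : clauses) {
            System.out.println(clause);
        }
    }
}
```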
Did I make my case?