[HSearch] DSL for Lucene queries (was: Re: [infinispan-dev] Query module new API and configurations)


Emmanuel Bernard

Wednesday, 26 August 2009

6:39 a.m.

I've been thinking about a DSL to build Lucene queries in the last day.
What do you think of this proposal?

A few remarks:
 - it asks for the analyzer so that we correctly apply the analyzer on terms
 - it has a few query factory methods
 - it contains a few orthogonal operations
 - I am not quite satisfied with how boolean is handled, any idea?

Examples

    SealedQueryBuilder qb = searchFactory.withEntityAnalyzer(Address.class);
    Query luceneQuery = qb.must(Occurs.MUST)
        .add( qb.boolean(Occurs.Should)
            .add( qb.term("city", "Atlanta").boostedTo(4).createQuery() )
            .add( qb.term("address1", "Peachtree").fuzzy().createQuery() )
        )
        .add( qb.from("movingDate", "200604").to("201201").exclusive().createQuery() )
        .createQuery();

Analyzer choice

    queryBuilder.withAnalyzer(Analyzer)
    queryBuilder.withEntityAnalyzer(Class<?>)
    queryBuilder.basedOnEntityAnalyzer(Class<?>)
        .overridesForField(String field, Analyzer)
        .overridesForField(String field, Analyzer)
        .build() //sucky name

returns a SealedQueryBuilder //sucky name
SealedQueryBuilder contains the factory methods

Factory methods

Hosted on SealedQueryBuilder

    .term(String field, String text) //define a new query

    .term(String field, String text) //define a new query
        .ignoreAnalyzer() //ignore the analyzer, optional
        .fuzzy() //API prevents wildcard calls, optional
            .threshold() //optional
            .prefixLength() //optional

    .term(String field, String value)
        .wildcard() //API prevents fuzzy calls, optional

    //range query
    .from(String field, String text)
        .exclusive() //optional
        .to(String text)
        .exclusive() //optional
        .constantScore() //optional, due to constantScoreRangeQuery but in practice inherited from the common operations

    //match all docs
    .all()

    //phrase query
    .phrase(String field)
        .ignoreAnalyzer() //ignore the analyzer, optional
        .addWord(String text) //at least one
        .addWord(String text)
        .sentence(String text) //do we need that?
        .slop() //optional

    //search multiple fields for same value
    .searchInMultipleFields()
        .onField(String field)
            .boostedTo(float) //optional
            .ignoreAnalyzer() //optional
        .onField(String field)
        .forWords(String) //do we need that?
        .forWord(String)

Boolean operations

SealedQueryBuilder contains the boolean methods

    .boolean(Occurs occurs)
        .add( qb.from().to() )
        .add( ... )

Works on all queries

    .boostedTo()
    .constantScore()
    .filter(Filter) //filter the current query
    .scoreMultipliedByField(field) //FieldScoreQuery + FunctionQuery?? //Not backed
    .createQuery()

Todo

    Span*Queries
    MultiPhraseQuery - needs to fill up all accepted terms
    FieldScoreQuery
    ValueSourceQuery
    FuzzyLikeThis
    MoreLikeThis

On 25 Aug 09, at 16:43, Manik Surtani wrote:

...

On 25 Aug 2009, at 13:34, Emmanuel Bernard wrote:

> On 25 Aug 09, at 14:27, Manik Surtani wrote:
>
>> A DSL would work, but I'd rather not define our own language here.
>> Which is why I asked for a standard. Perhaps something based on
>> SQL/JPA-QL? Or are you thinking DSL specific to Lucene - which could
>> be used by any/all of {Lucene, Hibernate Search, Infinispan}? In
>> which case the DSL should ideally be a Lucene project.
>
> Yes I was thinking about a DSL used for Hibernate Search and maybe all
> of Lucene if the HS integration benefits offer no value towards
> simplicity (but I think i can offer value).

Ok, this should be interesting. Let's chat about this some more - have you drafted any thoughts around this DSL somewhere?
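The "API prevents wildcard calls" and "API prevents fuzzy calls" remarks in the proposal above can be illustrated with staged return types. The following is only a minimal sketch under assumptions: `TermDslSketch`, `TermContext`, `FuzzyContext` and `WildcardContext` are hypothetical names, queries are rendered as plain strings rather than Lucene `Query` objects, and the default threshold and prefix length are placeholders chosen for the sketch.

```java
// Hypothetical sketch: these types are illustrative and exist in neither
// Lucene nor Hibernate Search. Queries are rendered as plain strings.
final class TermDslSketch {

    /** Terminal step shared by every stage. */
    interface Buildable {
        String createQuery();
    }

    /** Stage returned by term(): choose fuzzy() or wildcard(), or build directly. */
    interface TermContext extends Buildable {
        FuzzyContext fuzzy();
        WildcardContext wildcard();
    }

    /** Stage after fuzzy(): wildcard() is no longer reachable. */
    interface FuzzyContext extends Buildable {
        FuzzyContext threshold(float threshold);
        FuzzyContext prefixLength(int prefixLength);
    }

    /** Stage after wildcard(): no fuzzy options are reachable. */
    interface WildcardContext extends Buildable {
    }

    static TermContext term(final String field, final String text) {
        return new TermContext() {
            public FuzzyContext fuzzy() {
                return new Fuzzy(field, text);
            }
            public WildcardContext wildcard() {
                return () -> "wildcard(" + field + ":" + text + ")";
            }
            public String createQuery() {
                return "term(" + field + ":" + text + ")";
            }
        };
    }

    private static final class Fuzzy implements FuzzyContext {
        private final String field;
        private final String text;
        private float threshold = 0.5f; // placeholder default for this sketch
        private int prefixLength = 0;   // placeholder default for this sketch

        Fuzzy(String field, String text) {
            this.field = field;
            this.text = text;
        }

        public FuzzyContext threshold(float threshold) {
            this.threshold = threshold;
            return this;
        }

        public FuzzyContext prefixLength(int prefixLength) {
            this.prefixLength = prefixLength;
            return this;
        }

        public String createQuery() {
            return "fuzzy(" + field + ":" + text + ", threshold=" + threshold
                    + ", prefixLength=" + prefixLength + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(term("city", "atlanta").createQuery());
        System.out.println(term("address1", "peachtree").fuzzy().threshold(0.7f).createQuery());
        System.out.println(term("title", "harr*").wildcard().createQuery());
        // term("title", "harr*").wildcard().threshold(0.7f); // would not compile
    }
}
```

Because `wildcard()` returns a type without `threshold()`, the invalid combination is rejected by the compiler instead of at runtime, and IDE auto-completion only lists the options that make sense at each step.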

Hardy Ferentschik

Wednesday, 26 August

3:08 p.m.

On Wed, 2009-08-26 at 13:39 +0200, Emmanuel Bernard wrote:

...

I've been thinking about a DSL to build Lucene queries in the last day. What do you think of this proposal?

What do you really gain compared to native Lucene queries? If your API achieves exactly the same as what's possible with Lucene it is just a 'useless' wrapper. A wrapper around native Lucene queries would make sense if it could somehow use some of the Hibernate Search specific metadata. As an extreme example one could generate some meta classes à la JPA2. This way one could ensure that you can get help with which field names are available.

--Hardy

Emmanuel Bernard

Thursday, 27 August

3:48 a.m.

On 26 Aug 09, at 22:08, Hardy Ferentschik wrote:

...

On Wed, 2009-08-26 at 13:39 +0200, Emmanuel Bernard wrote:
> I've been thinking about a DSL to build Lucene queries in the last
> day.
> What do you think of this proposal?

What do you really gain compared to native Lucene queries? If your API achieves exactly the same as what's possible with Lucene it is just a 'useless' wrapper. A wrapper around native Lucene queries would make sense if it could somehow use some of the Hibernate Search specific meta data. As an extreme example one could generate some meta classes a la JPA2. This way one could ensure that you can get help with which field names are available.

Remember, Hibernate Search's mission is to make full-text search as easy to use as possible to increase the overall technology adoption. There are several advantages to the DSL API listed below, but let's compare my example and the Lucene equivalent and see if you can still claim the API to be useless with a straight face.

    SealedQueryBuilder qb = searchFactory.withEntityAnalyzer(Address.class);
    Query luceneQuery = qb.must(Occurs.MUST)
        .add( qb.boolean(Occurs.Should)
            .add( qb.term("city", "Atlanta").boostedTo(4).createQuery() )
            .add( qb.term("address1", "Peachtree").fuzzy().threshold(.7f).createQuery() )
        )
        .add( qb.from("movingDate", "200604").to("201201").exclusive().createQuery() )
        .createQuery();

vs

    BooleanQuery luceneQuery = new BooleanQuery();

    // city OR address1
    BooleanQuery addressLocationQuery = new BooleanQuery();
    Query city = new TermQuery( new Term("city", "Atlanta") );
    city.setBoost(4f);
    addressLocationQuery.add(city, BooleanClause.Occur.SHOULD);
    Query address1 = new FuzzyQuery( new Term("address1", "Peachtree"), .7f );
    addressLocationQuery.add(address1, BooleanClause.Occur.SHOULD);
    luceneQuery.add(addressLocationQuery, BooleanClause.Occur.MUST);

    // exclusive range on movingDate
    Query range = new RangeQuery( new Term("movingDate", "200604"), new Term("movingDate", "201201"), false );
    luceneQuery.add(range, BooleanClause.Occur.MUST);

Advantages:
 - the query is readable and understandable even to new Lucene users. BTW the example is quite a simple one, it does not involve filters, search in multiple fields, query negation etc.
 - I have normalized some operations that require knowledge of the Lucene query hierarchy (eg. ConstantScoreQuery, ConstantScorePrefixQuery, ConstantScoreRangeQuery or PrefixQuery vs WildcardQuery)
 - the API shows available options right away using IDE auto-completion, not by looking at the Query hierarchy and its implementations
 - the API does take the analyzer into account which means that I can take my input and use it without thinking much about the underlying analyzer used at indexing time. In the example, my plain Lucene rewrite of the query will very likely fail because "Atlanta" and "Peachtree" should really be "atlanta" and "peachtree". In the API, we have the analyzer and can take that into account. Likewise for synonyms, phonetic approximation etc.

Even worse, trying to search a user query containing several words in different fields is quite difficult in plain Lucene. In the new API it could look like:

    String search = "harry potter";
    SealedQueryBuilder qb = searchFactory.withEntityAnalyzer(Book.class);
    Query luceneQuery = qb.searchInMultipleFields()
        .onField("title").boostedTo(4)
        .onField("title_ngram")
        .onField("description")
        .onField("description_ngram").boostedTo(.25f)
        .forWords(search);

vs

    String search = "harry potter";
    Analyzer analyzer = searchFactory.getAnalyzer(Book.class);

    Map<String,Float> boostPerField = new HashMap<String,Float>(2); // boost factors
    boostPerField.put( "title", (float) 4);
    boostPerField.put( "title_ngram", (float) 1);
    boostPerField.put( "description", (float) 1);
    boostPerField.put( "description_ngram", (float) .25);

    BooleanQuery luceneQuery = new BooleanQuery();
    for ( Map.Entry<String, Float> entry : boostPerField.entrySet() ) {
        final String fieldName = entry.getKey();
        final Float boost = entry.getValue();

        // run the user input through the analyzer for this field
        List<String> terms = new ArrayList<String>();
        try {
            Reader reader = new StringReader(search);
            TokenStream stream = analyzer.tokenStream( fieldName, reader);
            Token token = new Token();
            token = stream.next(token);
            while (token != null) {
                if (token.termLength() != 0) {
                    String term = new String(token.termBuffer(), 0, token.termLength());
                    terms.add( term );
                }
                token = stream.next(token);
            }
        }
        catch ( IOException e ) {
            throw new RuntimeException("IO exception while reading String stream??", e);
        }

        // one boosted optional clause per analyzed term
        for (String term : terms) {
            TermQuery termQuery = new TermQuery( new Term( fieldName, term ) );
            termQuery.setBoost( boost );
            luceneQuery.add( termQuery, BooleanClause.Occur.SHOULD );
        }
    }

Did I make my case?
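The analyzer point made above (a search for "Atlanta" has to become "atlanta" before it can match the index) can be sketched without any Lucene dependency. This is a rough illustration under assumptions: `MultiFieldSketch` and `forWords` are hypothetical names, the analyzer is stood in for by a plain function that lower-cases and splits on whitespace, and clauses are rendered as `field:term^boost` strings instead of `TermQuery` objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a plain function stands in for a Lucene Analyzer.
final class MultiFieldSketch {

    /** Stand-in analyzer: lower-cases the input and splits it on whitespace. */
    static final Function<String, List<String>> LOWERCASE_WHITESPACE = text -> {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    };

    /** Builds one boosted clause per (field, analyzed term) pair. */
    static List<String> forWords(Map<String, Float> boostPerField,
                                 Function<String, List<String>> analyzer,
                                 String search) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Float> entry : boostPerField.entrySet()) {
            for (String term : analyzer.apply(search)) {
                clauses.add(entry.getKey() + ":" + term + "^" + entry.getValue());
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>(); // keeps field order stable
        boosts.put("title", 4f);
        boosts.put("description", 0.25f);
        System.out.println(forWords(boosts, LOWERCASE_WHITESPACE, "Harry Potter"));
    }
}
```

The caller passes the raw user input; the sketch applies the analysis step once per field, which is the part the plain Lucene version has to spell out by hand.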

Hardy Ferentschik

6:14 a.m.

On Thu, 27 Aug 2009 10:48:42 +0200, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

Did I make my case?

Yes. I can agree that your code is more readable and it will help building queries. That said, why not suggest something like this to the Lucene folks directly? I agree on this one with Manik.

--Hardy

Navin Surtani

Friday, 25 September

9:12 a.m.

New subject: [infinispan-dev] [HSearch] DSL for Lucene queries (was: Re: Query module new API and configurations)

Just wanted to get this topic re-started again. Essentially what I think this project/DSL/module/thingy-bob is thought to become:
 - A simple package where a user can build Lucene queries without having to know too much about Lucene itself.

If I'm headed down the wrong thought path then just thwack me.

On 26 Aug 2009, at 21:08, Hardy Ferentschik wrote:

...

What's gained I believe is the fact that people can build complex Lucene queries more easily. Currently, it's a bit clunky imo so if we provide a cleaner way to build them it can prove beneficial to any Lucene user (myself included for querying on Infinispan). Any other thoughts?

...

If your API achieves exactly the same as what's possible with Lucene it is just a 'useless' wrapper. A wrapper around native Lucene queries would make sense if it could somehow use some of the Hibernate Search specific meta data. As an extreme example one could generate some meta classes a la JPA2. This way one could ensure that you can get help with which field names are available.

--Hardy
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Navin Surtani
Intern Infinispan
Intern JBoss Cache Searchable

Manik Surtani

Thursday, 27 August

6:07 a.m.

Very elegant. I'm generally a big fan of 'builder' patterns like this, but this really isn't a DSL, is it? :) When you first mentioned a DSL I had visions of defining a new grammar and an ANTLR parser, etc. But that is overkill.

This approach certainly works, and will almost certainly perform better too. One question: for the sake of brevity, why SealedQueryBuilder instead of QueryBuilder ? :)

Also, I still think that if this is a generic helper factory that helps you build Lucene queries - and has no knowledge of how and where the query is used (why should it?) - then this should be something people can use outside of HS or Infinispan. E.g., directly with Lucene.

On 26 Aug 2009, at 12:39, Emmanuel Bernard wrote:

...

--
Manik Surtani
manik(a)jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org

Emmanuel Bernard

7:06 a.m.

On 27 Aug 09, at 13:07, Manik Surtani wrote:

...

This is called an internal DSL, i.e. you use Java as the language, not some external representation.

...

This approach certainly works, and will almost certainly perform better too. One question: for the sake of brevity, why SealedQueryBuilder instead of QueryBuilder ? :)

The name is not right yet. There are two things:
 - the query builder that lets you define the analyzer
 - the query builder that has an analyzer assigned and lets you build queries

What name is best for each of them?

...

Also, I still think that if this is a generic helper factory that helps you build Lucene queries - and has no knowledge of how and where the query is used (why should it?) - then this should be something people can use outside of HS or Infinispan. E.g., directly with Lucene.

As of today this code is technically pure Lucene but to be honest the idea of passing an analyzer multiplexer (like the one we receive from searchFactory.getAnalyzer(Class<?>)) is not widely spread in Lucene and potentially cumbersome without the declarative approach of HSearch.

The second problem is that some potential improvements will require inner knowledge of HSearch:
 - object parameters (and not string params) do require to know the FieldBridge of the property. This is a pure HSearch notion.
 - "property literal" like JPA is introducing could be added to replace the String-based field approach in some situations. Though I don't think that it would be a perfect fit.
 - spell checker (the old idea we had)

That being said, if the API ends up being pure Lucene and once we stabilize it, we can contribute it back even though I am not necessarily a huge fan of ASL.
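The "object parameters" point above can be sketched as follows: to accept a date object instead of the string "200604", the builder has to know how that property was turned into an indexed string. This is a loose illustration under assumptions: `ObjectParameterSketch`, `register` and `range` are hypothetical names, a per-field function plays the role a FieldBridge would play in Hibernate Search, and the result is rendered as a range expression string rather than a Lucene query.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a per-field converter stands in for the object-to-string
// conversion that a FieldBridge performs; none of these names are real APIs.
final class ObjectParameterSketch {

    private static final DateTimeFormatter YEAR_MONTH = DateTimeFormatter.ofPattern("uuuuMM");

    private final Map<String, Function<Object, String>> converters = new HashMap<>();

    /** Registers how values of the given field are converted to their indexed form. */
    ObjectParameterSketch register(String field, Function<Object, String> converter) {
        converters.put(field, converter);
        return this;
    }

    /** Renders a range over the field, converting both bounds with the field's converter. */
    String range(String field, Object from, Object to) {
        Function<Object, String> converter =
                converters.getOrDefault(field, value -> String.valueOf(value));
        return field + ":[" + converter.apply(from) + " TO " + converter.apply(to) + "]";
    }

    public static void main(String[] args) {
        ObjectParameterSketch qb = new ObjectParameterSketch()
                .register("movingDate", value -> ((YearMonth) value).format(YEAR_MONTH));
        System.out.println(qb.range("movingDate", YearMonth.of(2006, 4), YearMonth.of(2012, 1)));
    }
}
```

Without the registered converter the builder could only fall back to `toString()`, which would not match the format used at indexing time; that is why this feature needs knowledge that plain Lucene does not carry.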

Manik Surtani

8:03 a.m.

On 27 Aug 2009, at 13:06, Emmanuel Bernard wrote:

...

On 27 Aug 09, at 13:07, Manik Surtani wrote:
> Very elegant. I'm generally a big fan of 'builder' patterns like
> this, but this really isn't a DSL, is it? :) When you first
> mentioned a DSL I had visions of defining a new grammar and an
> ANTLR parser, etc. But that is overkill.

This is called an internal DSL, i.e. you use Java as the language, not some external representation.

> This approach certainly works, and will almost certainly perform
> better too. One question: for the sake of brevity, why
> SealedQueryBuilder instead of QueryBuilder ? :)

The name is not right yet. There are two things:
 - the query builder that lets you define the analyzer
 - the query builder that has an analyzer assigned and lets you build queries

What name is best for each of them?

I thought this stuff you mentioned made sense:

...

    queryBuilder.withAnalyzer(Analyzer)
    queryBuilder.withEntityAnalyzer(Class<?>)
    queryBuilder.basedOnEntityAnalyzer(Class<?>)
        .overridesForField(String field, Analyzer)
        .overridesForField(String field, Analyzer)
        .build() //sucky name

Perhaps rename the static factory methods to something like:

    QueryBuilder.getQueryBuilder(Analyzer)
    QueryBuilder.getQueryBuilder(Class<?>)

and QueryBuilder instances have overrideAnalyzerForField(String, Analyzer). Why do you need the build() method at the end?

...

> Also, I still think that if this is a generic helper factory that
> helps you build Lucene queries - and has no knowledge of how and
> where the query is used (why should it?) - then this should be
> something people can use outside of HS or Infinispan. E.g.,
> directly with Lucene.

As of today this code is technically pure Lucene but to be honest the idea of passing an analyzer multiplexer (like the one we receive from searchFactory.getAnalyzer(Class<?>)) is not widely spread in Lucene and potentially cumbersome without the declarative approach of HSearch.

The second problem is that some potential improvements will require inner knowledge of HSearch:
 - object parameters (and not string params) do require to know the FieldBridge of the property. This is a pure HSearch notion.
 - "property literal" like JPA is introducing could be added to replace the String-based field approach in some situations. Though I don't think that it would be a perfect fit.
 - spell checker (the old idea we had)

That being said, if the API ends up being pure Lucene and once we stabilize it, we can contribute it back even though I am not necessarily a huge fan of ASL.

No, it doesn't have to be either ASL or even hosted at Apache. I guess what I am suggesting is perhaps even a separate project - LuceneQueryBuilder or something - which plain-old-Lucene users could use as well. Doesn't matter where it's hosted or what the license is - as long as it's ASL or LGPL :)

--
Manik Surtani
manik(a)jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org

Emmanuel Bernard

10:10 a.m.

...

> queryBuilder.withAnalyzer(Analyzer)
> queryBuilder.withEntityAnalyzer(Class<?>)
> queryBuilder.basedOnEntityAnalyzer(Class<?>)
>     .overridesForField(String field, Analyzer)
>     .overridesForField(String field, Analyzer)
>     .build() //sucky name

Perhaps rename the static factory methods to something like:

    QueryBuilder.getQueryBuilder(Analyzer)
    QueryBuilder.getQueryBuilder(Class<?>)

and QueryBuilder instances have overrideAnalyzerForField(String, Analyzer). Why do you need the build() method at the end?

If you do that, all of a sudden, a QB can change its analyzer on the fly, making it mutable. Also the overridesForField methods would pollute the API when it's time to create a query. One of the advantages of a fluent API in a strongly typed environment is that we can hide methods that are meaningless in a given context.

...

> That being said, if the API ends up being pure Lucene and once we
> stabilize it, we can contribute it back even though I am not
> necessarily a huge fan of ASL.

No, it doesn't have to be either ASL or even hosted at Apache. I guess what I am suggesting is perhaps even a separate project - LuceneQueryBuilder or something - which plain-old-Lucene users could use as well. Doesn't matter where it's hosted or what the license is - as long as it's ASL or LGPL :)

Let's start it under the Hibernate Search umbrella due to potential synergies and spin it out if needed.
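The argument for keeping build() in this message (the analyzer stage stays separate so the query-building stage cannot change its analyzer and does not expose the override methods) can be sketched with two types. This is only an illustration under assumptions: `StagedBuilderSketch`, `AnalyzerChoice` and `analyzerFor` are hypothetical names, and analyzers are represented by plain strings rather than Lucene `Analyzer` instances.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: analyzer names are plain strings, not Lucene Analyzers.
final class StagedBuilderSketch {

    /** Stage 1: analyzer choice. Mutable, but offers no query factory methods. */
    static final class AnalyzerChoice {
        private final String defaultAnalyzer;
        private final Map<String, String> overrides = new HashMap<>();

        private AnalyzerChoice(String defaultAnalyzer) {
            this.defaultAnalyzer = defaultAnalyzer;
        }

        AnalyzerChoice overridesForField(String field, String analyzer) {
            overrides.put(field, analyzer);
            return this;
        }

        /** Freezes the current choices into the second stage. */
        SealedQueryBuilder build() {
            return new SealedQueryBuilder(defaultAnalyzer, overrides);
        }
    }

    /** Stage 2: analyzers are frozen; only query-building methods would live here. */
    static final class SealedQueryBuilder {
        private final String defaultAnalyzer;
        private final Map<String, String> overrides;

        private SealedQueryBuilder(String defaultAnalyzer, Map<String, String> overrides) {
            this.defaultAnalyzer = defaultAnalyzer;
            // Defensive copy, so later changes to stage 1 cannot leak in.
            this.overrides = Collections.unmodifiableMap(new HashMap<>(overrides));
        }

        String analyzerFor(String field) {
            return overrides.getOrDefault(field, defaultAnalyzer);
        }
    }

    static AnalyzerChoice basedOnAnalyzer(String defaultAnalyzer) {
        return new AnalyzerChoice(defaultAnalyzer);
    }

    public static void main(String[] args) {
        AnalyzerChoice choice = basedOnAnalyzer("standard").overridesForField("title_ngram", "ngram");
        SealedQueryBuilder qb = choice.build();
        choice.overridesForField("title", "keyword"); // too late: qb is already sealed
        System.out.println(qb.analyzerFor("title_ngram"));
        System.out.println(qb.analyzerFor("title"));
    }
}
```

`SealedQueryBuilder` has no `overridesForField` method at all, so the override calls neither clutter auto-completion during query construction nor allow the analyzer to change once queries are being built.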

Manik Surtani

10:18 a.m.

On 27 Aug 2009, at 16:10, Emmanuel Bernard wrote:

...

>> queryBuilder.withAnalyzer(Analyzer)
>> queryBuilder.withEntityAnalyzer(Class<?>)
>> queryBuilder.basedOnEntityAnalyzer(Class<?>)
>>     .overridesForField(String field, Analyzer)
>>     .overridesForField(String field, Analyzer)
>>     .build() //sucky name
>
> Perhaps rename the static factory methods to something like:
>
> QueryBuilder.getQueryBuilder(Analyzer)
> QueryBuilder.getQueryBuilder(Class<?>)
>
> and QueryBuilder instances have overrideAnalyzerForField(String,
> Analyzer). Why do you need the build() method at the end?

If you do that, all of a sudden, a QB can change its analyzer on the fly, making it mutable. Also the overridesForField methods would pollute the API when it's time to create a query. One of the advantages of a fluent API in a strongly typed environment is that we can hide methods that are meaningless in a given context.

>> That being said, if the API ends up being pure Lucene and once we
>> stabilize it, we can contribute it back even though I am not
>> necessarily a huge fan of ASL.
>
> No, it doesn't have to be either ASL or even hosted at Apache. I
> guess what I am suggesting is perhaps even a separate project -
> LuceneQueryBuilder or something - which plain-old-Lucene users
> could use as well. Doesn't matter where it's hosted or what the
> license is - as long as it's ASL or LGPL :)

Let's start it under the Hibernate Search umbrella due to potential synergies and spin it out if needed.

Ok. Just make sure we use a different maven module or something so that there are no dependencies on the rest of HS or Hibernate. Otherwise spinning out will be a PITA. Lucene should be the only dependency of this code.

Cheers
--
Manik Surtani
manik(a)jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org

Hardy Ferentschik

10:22 a.m.

That's what I said as well. Separate maven module. However, one can feel a disturbance in the force when one mentions MAVEN MODULE within the Hibernate Team :)

On Thu, 27 Aug 2009 17:18:23 +0200, Manik Surtani <manik(a)jboss.org> wrote:

...

Sanne Grinovero

Friday, 28 August

3:37 a.m.

I've nothing against a separate maven module, still Hibernate Search already has lots of "goodies" to work with Lucene which are not necessarily linked to Hibernate (e.g. Analyzer definition helpers, pojo mapping through annotations, enhanced filtering, IndexReader pooling, nice Infinispan Directory...) so this new query builder is not much different. Just a thought.

So even if Emmanuel has shown this builder to be useful even with these limited features, it could become even more useful when strongly combined with the other features; 2 come to mind, maybe more later:

A) adding filters to the builders; I don't think it would be easy to have named filters without the full Search package

B) letting the users forget about the complexity of matching Analyzers (optionally), as by using the mapping information we could default to a reasonable Analyzer for each field. Most users on the forum are in trouble because they select the wrong analyzer / forget to use one when building the F.T.Query.

IMHO these are good reasons to couple it to the rest of the code; maybe it would be possible in future to have Hibernate optional.

Sanne

2009/8/27 Manik Surtani <manik(a)jboss.org>:

...

On 27 Aug 2009, at 16:10, Emmanuel Bernard wrote:

>> queryBuilder.withAnalyzer(Analyzer)
>> queryBuilder.withEntityAnalyzer(Class<?>)
>> queryBuilder.basedOnEntityAnalyzer(Class<?>)
>>     .overridesForField(String field, Analyzer)
>>     .overridesForField(String field, Analyzer)
>>     .build() //sucky name
>
> Perhaps rename the static factory methods to something like:
>
> QueryBuilder.getQueryBuilder(Analyzer)
> QueryBuilder.getQueryBuilder(Class<?>)
>
> and QueryBuilder instances have overrideAnalyzerForField(String,
> Analyzer). Why do you need the build() method at the end?

If you do that, all of a sudden, a QB can change its analyzer on the fly, making it mutable. Also the overridesForField methods would pollute the API when it's time to create a query. One of the advantages of a fluent API in a strongly typed environment is that we can hide methods that are meaningless in a given context.

>> That being said, if the API ends up being pure Lucene and once we
>> stabilize it, we can contribute it back even though I am not
>> necessarily a huge fan of ASL.
>
> No, it doesn't have to be either ASL or even hosted at Apache. I
> guess what I am suggesting is perhaps even a separate project -
> LuceneQueryBuilder or something - which plain-old-Lucene users
> could use as well. Doesn't matter where it's hosted or what the
> license is - as long as it's ASL or LGPL :)

Let's start it under the Hibernate Search umbrella due to potential synergies and spin it out if needed.

Ok. Just make sure we use a different maven module or something so that there are no dependencies on the rest of HS or Hibernate. Otherwise spinning out will be a PITA. Lucene should be the only dependency of this code.

Cheers
--
Manik Surtani
manik(a)jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org
