[hibernate-dev] HSEARCH - Different analyzers for Indexing and Querying

Tue Aug 13 04:13:52 EDT 2013

Hi,

Note: this is just a prospective idea I'd like to discuss. Even if
it's a good idea, it's definitely 5.0 material.

Those who have used Solr and are familiar with the Solr schema have
already seen the ability to use different analyzers for indexing and
querying.

It's usually useful when you use analyzers which return several
tokens for a given token: the QueryParser usually can't build the
correct query with these analyzers.

To take an example from my current work on HSEARCH-917 (soon to come
\o/), I have the following case. From "i-pod", the analyzer builds
"ipod", "i pod" and "i-pod". "ipod" and "i-pod" aren't the issue here,
but the fact that "i pod" spans two tokens makes the QueryParser build
an incorrect query (even if I use the Lucene 4.4 version, which is a
little bit smarter about these cases and at least makes the
"i-pod"/"ipod" case work correctly).

The fact is that if the analyzer used at indexing has correctly
indexed all the tokens, I don't need to expand the terms at querying
and it should be sufficient to use a simple analyzer to lowercase the
string and remove the accents.
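
To make this concrete without depending on Lucene, here is a minimal
plain-Java sketch. Everything in it is an assumption made for
illustration: the names (AnalyzerSplitSketch, analyzeForIndex,
analyzeForQuery, matches) are hypothetical, the index-time expansion
only imitates what a word-delimiter filter does for hyphenated words,
and token positions are ignored entirely.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: this is not Lucene or Hibernate Search code.
class AnalyzerSplitSketch {

    // Lowercase and strip accents: the "simple" normalization shared by both sides.
    static String normalize(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    // Index time: keep each word, and for hyphenated words also emit
    // the individual parts and the catenated form.
    static Set<String> analyzeForIndex(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : normalize(text).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            tokens.add(word); // original form, e.g. "i-pod"
            String[] parts = word.split("-");
            if (parts.length > 1) {
                for (String part : parts) {
                    if (!part.isEmpty()) {
                        tokens.add(part); // word parts, e.g. "i" and "pod"
                    }
                }
                tokens.add(String.join("", parts)); // catenated, e.g. "ipod"
            }
        }
        return tokens;
    }

    // Query time: normalization only, no expansion.
    static Set<String> analyzeForQuery(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : normalize(text).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // A query matches when every query token was indexed (positions ignored).
    static boolean matches(String indexedText, String query) {
        return analyzeForIndex(indexedText).containsAll(analyzeForQuery(query));
    }

    public static void main(String[] args) {
        for (String query : new String[] {"ipod", "i pod", "I-Pod", "\u00EF-pod", "iphone"}) {
            System.out.println(query + " -> " + matches("i-pod", query));
        }
    }
}
```

Since all the variants are already in the index, the query side never
has to produce more than one token per typed word, which is exactly
the situation the QueryParser handles well.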

Solr introduced this feature a long time ago (it was already there in
the good old times of 1.3) and I'm wondering if we shouldn't introduce
it in Hibernate Search too.

As for the implementation, I was thinking about adding an attribute
queryAnalyzer to the @Field annotation. I was also wondering if we
shouldn't add the ability to define an Analyzer for wildcard queries
(Lucene recently introduced an AnalyzingQueryParser to do something
like that).
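
A rough sketch of what such attributes could look like follows. It
deliberately uses a stand-in annotation, SketchField, rather than the
real @Field: the attribute names queryAnalyzer and wildcardAnalyzer,
the analyzer names, the Product class and the fallback to the indexing
analyzer when no query analyzer is given are all assumptions, not an
existing Hibernate Search API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical illustration only: SketchField stands in for @Field.
class QueryAnalyzerAttributeSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface SketchField {
        String analyzer();                     // analyzer used at indexing
        String queryAnalyzer() default "";     // assumed attribute: analyzer used at querying
        String wildcardAnalyzer() default "";  // assumed attribute: analyzer for wildcard queries
    }

    static class Product {
        @SketchField(analyzer = "text_index",
                queryAnalyzer = "text_query",
                wildcardAnalyzer = "text_wildcard")
        String name;

        @SketchField(analyzer = "text_index")
        String description;
    }

    // Assumed rule: fall back to the indexing analyzer when no query analyzer is set.
    static String resolveQueryAnalyzer(Class<?> type, String fieldName) {
        try {
            SketchField field = type.getDeclaredField(fieldName).getAnnotation(SketchField.class);
            if (field == null) {
                throw new IllegalArgumentException("Field is not annotated: " + fieldName);
            }
            return field.queryAnalyzer().isEmpty() ? field.analyzer() : field.queryAnalyzer();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("Unknown field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveQueryAnalyzer(Product.class, "name"));
        System.out.println(resolveQueryAnalyzer(Product.class, "description"));
    }
}
```

Falling back to the indexing analyzer would keep existing mappings
working unchanged, but that default is open for discussion.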

And maybe, in this case, it would be a good idea to centralize the
configuration with types as it's done in Solr? Usually, the three
analyzer definitions would come together.

As for my particular needs, most of my full text fields would be
analyzed like this:

indexing:
	@AnalyzerDef(name = HibernateSearchAnalyzer.TEXT,
			tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
			filters = {
					@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
					@TokenFilterDef(factory = WordDelimiterFilterFactory.class, params = {
									@org.hibernate.search.annotations.Parameter(name = "generateWordParts", value = "1"),
									@org.hibernate.search.annotations.Parameter(name = "generateNumberParts", value = "1"),
									@org.hibernate.search.annotations.Parameter(name = "catenateWords", value = "1"),
									@org.hibernate.search.annotations.Parameter(name = "catenateNumbers", value = "0"),
									@org.hibernate.search.annotations.Parameter(name = "catenateAll", value = "0"),
									@org.hibernate.search.annotations.Parameter(name = "splitOnCaseChange", value = "0"),
									@org.hibernate.search.annotations.Parameter(name = "splitOnNumerics", value = "0"),
									@org.hibernate.search.annotations.Parameter(name = "preserveOriginal", value = "1")
							}
					),
					@TokenFilterDef(factory = LowerCaseFilterFactory.class)
			}
	),
querying:
	@AnalyzerDef(name = HibernateSearchAnalyzer.TEXT,
			tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
			filters = {
					@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
					@TokenFilterDef(factory = LowerCaseFilterFactory.class)
			}
	),
wildcard:
	@AnalyzerDef(name = HibernateSearchAnalyzer.TEXT,
			tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
			filters = {
					@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
					@TokenFilterDef(factory = LowerCaseFilterFactory.class)
			}
	),

I could contribute time to work on this if we can agree on the way to
pursue this idea.

Thanks for your feedback.

-- 
Guillaume