[hibernate-dev] HSEARCH - Different analyzers for Indexing and Querying

Emmanuel Bernard emmanuel at hibernate.org
Wed Aug 21 11:20:11 EDT 2013


It looks like an interesting idea, especially as it keeps the simple use
case simple (i.e. simply not defining a queryAnalyzer).

Can you explain to me why you would need a different analyzer for a
wildcard query? My brain is still tanning on the beach.

Brainstorming here, we could do the following:

@AnalyzerDef.target

enum AnalyzerTarget { ALL, INDEXING, QUERY, WILDCARD }

So you could define the same @AnalyzerDef.name several times, provided
that the definitions do not share the same targets.

But that would also change the API for the dynamic analyzer I suppose.
It also does not cover the @Analyzer.impl usage.
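
To make the target idea a bit more concrete, here is a minimal sketch in
plain Java. Nothing in it is existing Hibernate Search API:
TargetedAnalyzerDef, TargetedAnalyzerDefs, AnalyzerRegistry and the
"chain" attribute are hypothetical names made up for illustration, and
"chain" is only a label standing in for the real tokenizer/filter
definitions. It only shows one way the "same name, disjoint targets"
rule could be checked and resolved.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Small demo: register the definitions found on Product and resolve them per target.
final class AnalyzerTargetSketch {
    public static void main(String[] args) {
        AnalyzerRegistry registry = new AnalyzerRegistry();
        registry.register(Product.class);
        for (AnalyzerTarget t : EnumSet.of(AnalyzerTarget.INDEXING,
                AnalyzerTarget.QUERY, AnalyzerTarget.WILDCARD)) {
            System.out.println(t + " -> " + registry.resolve("text", t));
        }
    }
}

// Mirrors the enum proposed in this message.
enum AnalyzerTarget { ALL, INDEXING, QUERY, WILDCARD }

// Hypothetical stand-in for @AnalyzerDef, reduced to a name, a label and targets.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface TargetedAnalyzerDef {
    String name();
    String chain(); // label standing in for the tokenizer/filter definitions
    AnalyzerTarget[] target() default { AnalyzerTarget.ALL };
}

// Hypothetical container so that one class can carry several definitions.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface TargetedAnalyzerDefs {
    TargetedAnalyzerDef[] value();
}

// Hypothetical registry: one chain per (name, target) pair.
final class AnalyzerRegistry {
    private final Map<String, Map<AnalyzerTarget, String>> byName = new HashMap<>();

    void register(Class<?> annotated) {
        TargetedAnalyzerDefs defs = annotated.getAnnotation(TargetedAnalyzerDefs.class);
        if (defs == null) {
            return;
        }
        for (TargetedAnalyzerDef def : defs.value()) {
            Map<AnalyzerTarget, String> chains = byName.computeIfAbsent(
                    def.name(), n -> new EnumMap<>(AnalyzerTarget.class));
            for (AnalyzerTarget t : expand(def.target())) {
                // Reject two definitions with the same name that share a target.
                if (chains.putIfAbsent(t, def.chain()) != null) {
                    throw new IllegalStateException(
                            "Analyzer '" + def.name() + "' defined twice for target " + t);
                }
            }
        }
    }

    // Returns the chain label for a concrete target, or null if none is registered.
    String resolve(String name, AnalyzerTarget target) {
        Map<AnalyzerTarget, String> chains = byName.get(name);
        return chains == null ? null : chains.get(target);
    }

    // ALL is shorthand for the three concrete targets.
    private static Set<AnalyzerTarget> expand(AnalyzerTarget[] targets) {
        EnumSet<AnalyzerTarget> result = EnumSet.noneOf(AnalyzerTarget.class);
        for (AnalyzerTarget t : targets) {
            if (t == AnalyzerTarget.ALL) {
                result.addAll(EnumSet.of(AnalyzerTarget.INDEXING,
                        AnalyzerTarget.QUERY, AnalyzerTarget.WILDCARD));
            } else {
                result.add(t);
            }
        }
        return result;
    }
}

// Same analyzer name defined twice, with disjoint targets.
@TargetedAnalyzerDefs({
    @TargetedAnalyzerDef(name = "text", chain = "whitespace+worddelimiter",
            target = AnalyzerTarget.INDEXING),
    @TargetedAnalyzerDef(name = "text", chain = "standard",
            target = { AnalyzerTarget.QUERY, AnalyzerTarget.WILDCARD })
})
final class Product { }
```

A definition that does not set a target falls back to ALL, so the
simple case stays a single definition; adding a second definition with
the same name and an overlapping target fails at registration time.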

On Tue 2013-08-13 10:13, Guillaume Smet wrote:
> Hi,
> 
> Note: this is just a prospective idea I'd like to discuss. Even if
> it's a good idea, it's definitely 5.0 material.
> 
> Those who have used Solr and are familiar with the Solr schema have
> already seen the ability to use different analyzers for indexing and
> querying.
> 
> It's usually useful when you use analyzers which return several
> tokens for a given token: the QueryParser usually can't build the
> correct query with these analyzers.
> 
> To take an example from my current work on HSEARCH-917 (soon to come
> \o/), I have the following case. From i-pod , the analyzer builds ipod
> i pod i-pod. ipod and i-pod aren't the issue here but the fact that i
> pod is on two tokens makes the QueryParser build an incorrect query
> (even if I use the Lucene 4.4 version which is a little bit smarter
> about these cases and at least makes the i-pod ipod case work
> correctly).
> 
> The fact is that if the analyzer used at indexing has correctly
> indexed all the tokens, I don't need to expand the terms at querying
> and it should be sufficient to use a simple analyzer to lowercase the
> string and remove the accents.
> 
> Solr introduced this feature a long time ago (it was already there in
> the good old times of 1.3) and I'm wondering if we shouldn't introduce
> it in Hibernate Search too.
> 
> As for the implementation, I was thinking about adding an attribute
> queryAnalyzer to the @Field annotation. I was also wondering if we
> shouldn't add the ability to define an Analyzer for wildcard queries
> (Lucene recently introduced an AnalyzingQueryParser to do something
> like that).
> 
> And maybe, in this case, it would be a good idea to centralize the
> configuration with types as it's done in Solr? Usually, the three
> analyzer definitions would come together.
> 
> As for my particular needs, most of my full text fields would be
> analyzed like this:
> 
> indexing:
> 	@AnalyzerDef(name = HibernateSearchAnalyzer.TEXT,
> 			tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
> 			filters = {
> 					@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
> 					@TokenFilterDef(factory = WordDelimiterFilterFactory.class, params = {
> 									@org.hibernate.search.annotations.Parameter(name = "generateWordParts", value = "1"),
> 									@org.hibernate.search.annotations.Parameter(name = "generateNumberParts", value = "1"),
> 									@org.hibernate.search.annotations.Parameter(name = "catenateWords", value = "1"),
> 									@org.hibernate.search.annotations.Parameter(name = "catenateNumbers", value = "0"),
> 									@org.hibernate.search.annotations.Parameter(name = "catenateAll", value = "0"),
> 									@org.hibernate.search.annotations.Parameter(name = "splitOnCaseChange", value = "0"),
> 									@org.hibernate.search.annotations.Parameter(name = "splitOnNumerics", value = "0"),
> 									@org.hibernate.search.annotations.Parameter(name = "preserveOriginal", value = "1")
> 							}
> 					),
> 					@TokenFilterDef(factory = LowerCaseFilterFactory.class)
> 			}
> 	),
> querying:
> 	@AnalyzerDef(name = HibernateSearchAnalyzer.TEXT,
> 			tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
> 			filters = {
> 					@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
> 					@TokenFilterDef(factory = LowerCaseFilterFactory.class)
> 			}
> 	),
> wildcard:
> 	@AnalyzerDef(name = HibernateSearchAnalyzer.TEXT,
> 			tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
> 			filters = {
> 					@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
> 					@TokenFilterDef(factory = LowerCaseFilterFactory.class)
> 			}
> 	),
> 
> I could contribute time to work on this if we can agree on the way to
> pursue this idea.
> 
> Thanks for your feedback.
> 
> -- 
> Guillaume
> _______________________________________________
> hibernate-dev mailing list
> hibernate-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hibernate-dev
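
For reference, the i-pod case from the quoted message can be sketched
without Lucene at all. The code below is not the real
WordDelimiterFilter or ASCIIFoldingFilter: indexTokens, queryTerm and
fold are hand-rolled, hypothetical helpers that only imitate the
described behaviour (keep the original, emit the parts, emit the
concatenation at indexing; only lowercase and strip accents at
querying). As a further simplification, the query side treats the whole
input as a single term.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class AsymmetricAnalysisSketch {

    // Index-time stand-in: whitespace split, then for each word keep the
    // folded original, its alphanumeric parts, and the parts joined together.
    static Set<String> indexTokens(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String folded = fold(word);
            tokens.add(folded); // imitates preserveOriginal
            StringBuilder joined = new StringBuilder();
            for (String part : folded.split("[^\\p{Alnum}]+")) {
                if (!part.isEmpty()) {
                    tokens.add(part); // imitates generateWordParts
                    joined.append(part);
                }
            }
            if (joined.length() > 0) {
                tokens.add(joined.toString()); // imitates catenateWords
            }
        }
        return tokens;
    }

    // Query-time stand-in: no expansion, only normalization.
    static String queryTerm(String userInput) {
        return fold(userInput.trim());
    }

    // Lowercases and removes combining accent marks after NFD decomposition.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTokens("I-Pod café");
        System.out.println("indexed: " + indexed);
        for (String q : new String[] { "ipod", "i-pod", "IPOD", "Café" }) {
            System.out.println(q + " matches: " + indexed.contains(queryTerm(q)));
        }
    }
}
```

Because every variant is already in the index, the query side never has
to produce the two-token "i pod" form that confuses the QueryParser.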

