[SEARCH] Translating analyzer definitions from HSearch to Elasticsearch

Tuesday, 13 December 2016

Hello everyone,

I'm currently working on HSEARCH-2219, "Define analyzers via the REST API",
whose purpose is to automatically translate @AnalyzerDefs in Hibernate
Search to settings in Elasticsearch, removing the need for users to
configure analyzers separately in their Elasticsearch instance.

The thing is, the structure of our configuration in Hibernate Search is
different from the one in Elasticsearch. In particular, we can't name
instances of tokenizers, token filters, char filters, etc., whereas in
Elasticsearch one *has* to name them in order to provide parameters.

See for instance:

@AnalyzerDef(
  name = "myAnalyzer",
  tokenizer = @TokenizerDef(
    factory = StandardTokenizerFactory.class,
    params = @Parameter(name = "maxTokenLength", value = "900")
  )
)

compared to the Elasticsearch way:

index :
    analysis :
        analyzer :
            myAnalyzer :
                type : custom
                tokenizer : myTokenizer1
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900

The analyzer name is there on both sides, @TokenizerDef.factory would give
me the tokenizer type, and parameters are pretty obvious too. But
"myTokenizer1", the tokenizer name, has absolutely no equivalent in
Hibernate Search.
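
To make the "obvious" part of the mapping concrete, here is a minimal sketch
of how a tokenizer's type and parameters could be turned into Elasticsearch
settings. The class and method names are hypothetical, and the sketch assumes
that a plain camelCase-to-snake_case rename is enough, which may not hold for
every factory.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hibernate Search.
class TokenizerSettingsSketch {

    // Converts a Lucene-style camelCase parameter name to snake_case.
    static String toSnakeCase(String camelCase) {
        return camelCase
                .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
                .toLowerCase(Locale.ROOT);
    }

    // Builds the settings of one tokenizer definition: its type, then its parameters.
    static Map<String, String> toSettings(String type, Map<String, String> luceneParams) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("type", type);
        for (Map.Entry<String, String> param : luceneParams.entrySet()) {
            settings.put(toSnakeCase(param.getKey()), param.getValue());
        }
        return settings;
    }
}
```

Note that the result is only the body of the definition: the key it must be
registered under (the tokenizer name) is still missing.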

I could try to generate names automatically, but those would need to be
more or less stable across multiple executions in order for schema
validation to work properly. And there's nothing we could really use as an
identifier in our annotations, at least not reliably.
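
As an illustration of the stability concern, here is a rough sketch of one
possible generation scheme, based on the analyzer name and the position of
each token filter. The scheme and all names in it are assumptions made up for
this example, not something Hibernate Search does.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical naming scheme for illustration only.
class GeneratedNamesSketch {

    // Maps each generated name to the filter it designates, in declaration order.
    static Map<String, String> names(String analyzerName, List<String> filterFactories) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < filterFactories.size(); i++) {
            result.put(analyzerName + "_tokenfilter_" + i, filterFactories.get(i));
        }
        return result;
    }
}
```

With such a scheme the names are reproducible as long as the mapping doesn't
change, but adding a filter at the beginning of the chain makes every
generated name designate a different filter than before.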

To fill the gap, I'd like to add a "name" attribute to the TokenizerDef,
CharFilterDef and TokenFilterDef annotations. This attribute would be
optional and the documentation would mention that it has no effect with
embedded Lucene.
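
A minimal sketch of what this could look like, using a stand-in annotation
so that the example is self-contained. TokenizerDefSketch is hypothetical
(it only mimics @TokenizerDef, with the factory as a plain string), and so is
the helper reading it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class NamedTokenizerSketch {

    // Hypothetical stand-in for @TokenizerDef with the proposed optional "name".
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface TokenizerDefSketch {
        String name() default "";
        String factory();
    }

    @TokenizerDefSketch(name = "myTokenizer1", factory = "StandardTokenizerFactory")
    static class Book {
    }

    // Returns the explicit tokenizer name, or null if none was given.
    static String tokenizerName(Class<?> type) {
        TokenizerDefSketch def = type.getAnnotation(TokenizerDefSketch.class);
        return def == null || def.name().isEmpty() ? null : def.name();
    }
}
```

Since the attribute defaults to an empty string, existing mappings would
keep compiling unchanged.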

Another solution would be to have a "magic" @Parameter, named after a
constant (ElasticsearchParameters.TOKENIZER_NAME for instance), and detect
that parameter automatically, but it feels wrong... mainly because
@AnalyzerDef already has its own "name" attribute, so why wouldn't
@TokenizerDef?
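
For comparison, the "magic" parameter approach could be sketched as follows.
The constant's value and the helper are assumptions for the sake of the
example; only the constant's name comes from the suggestion above.

```java
import java.util.Map;

// Hypothetical sketch of the "magic parameter" alternative.
class MagicParameterSketch {

    // Assumed value; stands in for ElasticsearchParameters.TOKENIZER_NAME.
    static final String TOKENIZER_NAME = "elasticsearch.tokenizer_name";

    // Removes the magic parameter from the given (mutable) map and returns its
    // value, or null if absent. The remaining entries are the actual parameters.
    static String extractName(Map<String, String> params) {
        return params.remove(TOKENIZER_NAME);
    }
}
```

The translation code would have to filter this parameter out before
forwarding the others, which is part of why it feels like a workaround.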

And finally, we could bring our annotations closer to the Elasticsearch
way, by providing a way to define tokenizers/char filters/token filters and
a separate way to reference those definitions, but I don't think that's 5.6
material, since we'd likely have to break things or lose consistency.

WDYT?

Yoann Rodière <yoann(a)hibernate.org>
Hibernate NoORM Team
