[hibernate-dev] [SEARCH] Query-only analyzers with Elasticsearch - new annotation?

Thu Jan 5 09:42:00 EST 2017

> I don't disagree, I'm merely aiming to have in future Analyzer(s)
> defined in a non-Lucene specific way, possibly allowing controlled
> exceptions.
> When changing the definitions API we'll be able to reconsider if we
> want Analyzer definitions to be scoped per index like Elasticsearch
> does.

Actually, for HSEARCH-2534, it would be enough if analyzer definitions were
scoped by indexing service (Lucene/ES).
But sure, that would be a good solution, if we wait for 6.0.
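
To illustrate what I mean by scoping per indexing service, here is a minimal
sketch. All names in it (ScopedAnalyzerRegistry, IndexingService) are made up
for the example and are not Hibernate Search API, and definitions are reduced
to plain strings:

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not Hibernate Search API.
final class ScopedAnalyzerRegistry {

    enum IndexingService { LUCENE, ELASTICSEARCH }

    // One independent "name -> definition" map per indexing service.
    private final Map<IndexingService, Map<String, String>> namespaces =
            new EnumMap<>(IndexingService.class);

    void register(IndexingService service, String name, String definition) {
        namespaces.computeIfAbsent(service, s -> new HashMap<>()).put(name, definition);
    }

    // Returns null when the name is not defined for that service.
    String lookup(IndexingService service, String name) {
        Map<String, String> namespace = namespaces.get(service);
        return namespace == null ? null : namespace.get(name);
    }
}
```

With this shape, the same name can resolve to different definitions for Lucene
and for ES, and a lookup in one namespace never sees the other.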

> Sure that wouldn't allow to map on Hibernate Search an existing ES
> cluster which uses conflicting names on different indexes, but reverse
> engineering of existing ES clusters isn't our focus at this time;
> people badly needing it can change their names to saner choices (as
> I'd argue that name reuse for different things wouldn't be a sane
> configuration, probably it won't be common either).

I agree with you on that. To be honest I didn't even think of such an
issue, since currently the analyzer definitions are scoped globally.

> is there something which prevents
> us from refining this decision and generating ES definitions out of
> all known Analyzer definitions, rather than just the ones being
> referred to?

Well, yes, that was in my first message; see below:

> First we'd have to have a different namespace for each indexing service,
> but I've already implemented that much.
> Second, some analyzer definitions are only valid for one indexing service,
> and not for the other.
> For instance, analyzer definitions using ElasticsearchTokenFilterFactory
> are specific to Elasticsearch. And Analyzer definitions using the
> WhitespaceTokenizerFactory with the "rule" parameter are only valid with
> embedded Lucene. And so on. To sum up, I'm not sure we can do
> something smart.

What prevents us from generating ES definitions out of all known analyzer
definitions is that there may be definitions that *cannot* be translated to
ES, simply because they are supposed to be used only with Lucene.
I guess we could say "let's try to generate ES definitions, and if it fails
just ignore it and log a warning", but it seems a bit unsafe...
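
To make the concern concrete, here is a rough sketch of that "try, and warn on
failure" approach. Everything in it is hypothetical (LenientAnalyzerTranslation,
toyTranslator, definitions reduced to a tokenizer factory name); it is not how
Hibernate Search is implemented:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these types exist in Hibernate Search.
final class LenientAnalyzerTranslation {

    /** Outcome of attempting to translate every known definition. */
    static final class Result {
        final Map<String, String> translated = new LinkedHashMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Tries to translate each definition (analyzer name -> tokenizer factory
     * name) and records the ones that fail instead of aborting.
     */
    static Result translateAll(Map<String, String> definitions,
            Function<String, String> translator) {
        Result result = new Result();
        for (Map.Entry<String, String> entry : definitions.entrySet()) {
            try {
                result.translated.put(entry.getKey(), translator.apply(entry.getValue()));
            }
            catch (IllegalArgumentException e) {
                // The failure is only logged, so nothing stops the bootstrap.
                System.err.println("WARN: skipping analyzer '" + entry.getKey()
                        + "': " + e.getMessage());
                result.skipped.add(entry.getKey());
            }
        }
        return result;
    }

    /** Toy translator that knows a single Lucene factory name. */
    static String toyTranslator(String factoryName) {
        if ("StandardTokenizerFactory".equals(factoryName)) {
            return "standard";
        }
        throw new IllegalArgumentException("no Elasticsearch equivalent for " + factoryName);
    }
}
```

The unsafe part is that a skipped definition only shows up as a warning at
startup; a query that later refers to it by name would fail much later, far
from the actual cause.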

> Let's keep in mind that we're only able to "translate" a very limited set
> of well-known Analyzer definitions [...]

For translations it's true, but any ES analyzer definition can be expressed
with Hibernate Search by using Elasticsearch*Factory. In fact, it's the
recommended approach.
See
https://docs.jboss.org/hibernate/search/5.6/reference/en-US/html_single/#_custom_analyzers
.

> In short, I think what matters most now is not how to define such
> analyzers as there are viable (better?) alternatives, but we need to
> make sure one can run a query with the right query-time overrides,
> especially be able to refer to an Analyzer which has been manually
> defined on ES but is possibly not known to us. (As discussed
> previously with the exception of More-Like-This Queries which will
> have to wait).

We have already discussed this many times, but once again: users will not
be able to define their analyzers manually on ES starting from ES 5.0 for
various reasons.
So that's clearly not a long-term solution. It's "viable" for now, but
since it's not future-proof it's certainly not better.

As for the short term, if I understand correctly, what you're proposing is
that users don't add an @AnalyzerDef for query-only analyzers, and that we
allow using unknown analyzers in queries? I guess we could do that, but
that basically amounts to solution 3 "don't do it". Which is fine as long
as we plan to fix it later.
Also note we'd still have to explain to users that query-only analyzer
definitions are not supported with Elasticsearch.
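
For what it's worth, "allow using unknown analyzers in queries" could be as
simple as passing the name through untouched. A minimal sketch, with purely
hypothetical names (QueryAnalyzerOverrides, AnalyzerReference) that are not
part of Hibernate Search:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: illustrative names only, not Hibernate Search API.
final class QueryAnalyzerOverrides {

    /** What a query builder would end up referencing in the ES query. */
    static final class AnalyzerReference {
        final String name;
        // True if we pushed the definition ourselves, false if we merely
        // trust that an analyzer with this name exists on the cluster.
        final boolean definedByUs;

        AnalyzerReference(String name, boolean definedByUs) {
            this.name = name;
            this.definedByUs = definedByUs;
        }
    }

    // Names of the definitions that were pushed to the index settings.
    private final Set<String> pushedDefinitions;

    QueryAnalyzerOverrides(Set<String> pushedDefinitions) {
        this.pushedDefinitions = new HashSet<>(pushedDefinitions);
    }

    // Never rejects a name: unknown analyzers are passed through as-is.
    AnalyzerReference resolve(String analyzerName) {
        return new AnalyzerReference(analyzerName, pushedDefinitions.contains(analyzerName));
    }
}
```

Nothing here validates the unknown name, which is exactly why this only works
as long as users can still define the analyzer on the ES side themselves.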

Yoann Rodière <yoann at hibernate.org>
Hibernate NoORM Team

On 5 January 2017 at 15:06, Sanne Grinovero <sanne at hibernate.org> wrote:

> On 5 January 2017 at 13:06, Yoann Rodiere <yoann at hibernate.org> wrote:
> >> I'm wondering how you'd all feel about the third solution:
> >> 3. don't do it.
> >> This depends of course how far it is blocking in practice.
> >
> > "Don't do it until 6.0" would be acceptable, I guess, since it's still just
> > a technical preview. Though we would introduce a limitation that would only
> > be our fault (since Elasticsearch supports query-time analyzers) and that
> > would not exist with the Lucene integration.
> >
> > "Don't do it ever" seems really bad. As we've already discussed at length
> > (multiple times), not being able to define analyzers from Hibernate Search
> > would be a real pain for users, especially in Elasticsearch 5. That's true
> > for indexing analyzers, and that's also true for querying-only analyzers.
> > I wouldn't say that query-only analyzers are widespread, but they're at
> > least useful, and I'm sure there are problems that can only be solved by
> > using a different analyzer when querying than when indexing...
>
> I don't disagree, I'm merely aiming to have in future Analyzer(s)
> defined in a non-Lucene specific way, possibly allowing controlled
> exceptions.
> When changing the definitions API we'll be able to reconsider if we
> want Analyzer definitions to be scoped per index like Elasticsearch
> does.
>
> But since today the Analyzer map is "global" (as in one map per
> SearchIntegrator), I don't see why we can't treat them consistently on
> both technologies and consider them global on one ES as well, i.e.
> we'd copy all definitions to each ES index definition.
> Sure that wouldn't allow to map on Hibernate Search an existing ES
> cluster which uses conflicting names on different indexes, but reverse
> engineering of existing ES clusters isn't our focus at this time;
> people badly needing it can change their names to saner choices (as
> I'd argue that name reuse for different things wouldn't be a sane
> configuration, probably it won't be common either).
>
> >
> >> I guess that I'm missing why you'd want to force people to express
> >> that a specific Analyzer is meant to be used only at query time
> >> differently than one which is used at indexing time.
> >> If there is need to clearly make such discrimination then this should
> >> be made very clear to our users too, so I'd prefer if we could avoid
> >> introducing new concepts for people to learn.. unless there's strong
> >> need of course.
> >
> > Analyzer definitions are interpreted as either Lucene analyzers (to be
> > instantiated) or Elasticsearch analyzers (to be pushed to the ES index
> > settings) based on where they are referenced (using @Analyzer).
> > When I say an analyzer definition is query-only, it means there is an
> > @AnalyzerDef but there isn't any @Analyzer referencing it. So
> > Hibernate Search wouldn't know how to interpret it (ES or Lucene).
> > Currently, the default for those definitions is to interpret them as Lucene
> > analyzers, which leads to HSEARCH-2534: we can't have Elasticsearch
> > query-only analyzers.
>
> Ok, I understand the status quo, but is there something which prevents
> us from refining this decision and generating ES definitions out of
> all known Analyzer definitions, rather than just the ones being
> referred to?
>
> Let's keep in mind that we're only able to "translate" a very limited set
> of well-known Analyzer definitions so - while it's cool to help
> migrations where we can - our primary focus is to make sure that people
> can use any custom Analyzer configuration which they have defined
> "manually" on ES.
>
> In short, I think what matters most now is not how to define such
> analyzers as there are viable (better?) alternatives, but we need to
> make sure one can run a query with the right query-time overrides,
> especially be able to refer to an Analyzer which has been manually
> defined on ES but is possibly not known to us. (As discussed
> previously with the exception of More-Like-This Queries which will
> have to wait).
>
> Thanks,
> Sanne
>
> >
> > Maybe with this piece of information, my original message makes more sense?
> > I.e.:
> >
> > Solution 1, interpret those definitions as both Lucene and Elasticsearch
> > analyzer (there are problems with that, see my first message)
> > Solution 2, make users "reference" those definitions using a new
> > @QueryAnalyzer annotation.
> >
> >> Maybe I'm
> >> missing something, but couldn't a user simply use an additional
> >> @AnalyzerDef, so that the analyzer definition is associated to a name,
> >> and use that?
> >
> > As mentioned above, an @AnalyzerDef that is not referenced is considered as
> > a Lucene analyzer, so it's not pushed to Elasticsearch and it can't be used
> > when querying Elasticsearch.
> > The only workaround I see would be to add a dummy, always-empty field like
> > that:
> >
> > @Transient
> > @Field(name = "__dummy", analyzer = @Analyzer(definition = "myQueryOnlyAnalyzer"))
> > public String getMyQueryOnlyAnalyzerDummyField() {
> >   return null;
> > }
> >
> > Which means there will be a useless field in the schema just to make
> > Hibernate Search happy.
> >
> >> Is this issue relating to a specific user request?
> >
> > No, it's just a feature that is available for Lucene but not for
> > Elasticsearch.
> >
> >
> > Yoann Rodière <yoann at hibernate.org>
> > Hibernate NoORM Team
> >
> > On 5 January 2017 at 13:04, Sanne Grinovero <sanne at hibernate.org> wrote:
> >>
> >> Hello,
> >>
> >> I'm wondering how you'd all feel about the third solution:
> >>
> >>  3. don't do it.
> >>
> >> This depends of course how far it is blocking in practice. Maybe I'm
> >> missing something, but couldn't a user simply use an additional
> >> @AnalyzerDef, so that the analyzer definition is associated to a name,
> >> and use that?
> >>
> >> I guess that I'm missing why you'd want to force people to express
> >> that a specific Analyzer is meant to be used only at query time
> >> differently than one which is used at indexing time.
> >> If there is need to clearly make such discrimination then this should
> >> be made very clear to our users too, so I'd prefer if we could avoid
> >> introducing new concepts for people to learn.. unless there's strong
> >> need of course.
> >>
> >> Is this issue relating to a specific user request?
> >>
> >> Thanks,
> >> Sanne
> >>
> >>
> >>
> >> On 4 January 2017 at 16:00, Yoann Rodiere <yoann at hibernate.org> wrote:
> >> > Hello team,
> >> >
> >> > I'm currently working on HSEARCH-2534, "Query-only analyzer definitions
> >> > are never added to the index settings with Elasticsearch".
> >> > This issue is about using analyzers only when querying with
> >> > Elasticsearch.
> >> > It is already possible with Lucene, but not in Elasticsearch, because we
> >> > assume that any analyzer definition that is not referenced by a
> >> > @Analyzer annotation is a Lucene analyzer [1].
> >> >
> >> > To be precise, the exact place where query-only analyzers are used is in
> >> > EntityContext.overridesForField [2], and the overrides are leveraged
> >> > even with Elasticsearch, for instance in
> >> > ConnectedMultiFieldsTermQueryBuilder [3].
> >> >
> >> > I can see two solutions to the issue:
> >> >
> >> >    1. Make all analyzer definitions available for all indexing services.
> >> >    2. Allow users to define, for each entity, which analyzer definitions
> >> >    will be necessary when querying, even though the definitions are not
> >> >    used when indexing.
> >> >
> >> > Solution 1 seems quite hard to implement correctly.
> >> > First we'd have to have a different namespace for each indexing service,
> >> > but I've already implemented that much.
> >> > Second, some analyzer definitions are only valid for one indexing
> >> > service, and not for the other.
> >> > For instance, analyzer definitions using ElasticsearchTokenFilterFactory
> >> > are specific to Elasticsearch. And Analyzer definitions using
> >> > the WhitespaceTokenizerFactory with the "rule" parameter are only valid
> >> > with embedded Lucene. And so on. To sum up, I'm not sure we can do
> >> > something smart.
> >> >
> >> > Solution 2 is easier to implement, but requires adding a bit of API: the
> >> > way for users to declare that a given analyzer definition is to be
> >> > available when querying a given entity. I would add type-level
> >> > @QueryAnalyzer(definition = "foo") and @QueryAnalyzers annotations.
> >> >
> >> > I know nobody wants to add new annotations in a minor, but right now
> >> > that seems to be the only workable solution.
> >> >
> >> > What do you think?
> >> >
> >> > [1]
> >> > https://github.com/hibernate/hibernate-search/blob/1847bd222128395056cdf6e7cfb601ceed5e40c3/engine/src/main/java/org/hibernate/search/engine/impl/ConfigContext.java#L277
> >> > [2]
> >> > https://github.com/hibernate/hibernate-search/blob/1847bd222128395056cdf6e7cfb601ceed5e40c3/engine/src/main/java/org/hibernate/search/query/dsl/EntityContext.java#L14
> >> > [3]
> >> > https://github.com/hibernate/hibernate-search/blob/1847bd222128395056cdf6e7cfb601ceed5e40c3/engine/src/main/java/org/hibernate/search/query/dsl/impl/ConnectedMultiFieldsTermQueryBuilder.java#L222
> >> >
> >> >
> >> > Yoann Rodière <yoann at hibernate.org>
> >> > Hibernate NoORM Team
> >> > _______________________________________________
> >> > hibernate-dev mailing list
> >> > hibernate-dev at lists.jboss.org
> >> > https://lists.jboss.org/mailman/listinfo/hibernate-dev
> >
> >
>