[hibernate-dev] [SEARCH] Query-only analyzers with Elasticsearch - new annotation?

Yoann Rodiere yoann at hibernate.org
Thu Jan 5 08:06:48 EST 2017


> I'm wondering how you'd all feel about the third solution:
> 3. don't do it.
> This depends of course how far it is blocking in practice.

"Don't do it until 6.0" would be acceptable, I guess, since it's still just
a technical preview. Though we would introduce a limitation that would only
be our fault (since Elasticsearch supports query-time analyzers) and that
would not exist with the Lucene integration.

"Don't do it ever" seems really bad. As we've already discussed at length
(multiple times), not being able to define analyzers from Hibernate Search
would be a real pain for users, especially in Elasticsearch 5. That's true
for indexing analyzers, and that's also true for query-only analyzers.
I wouldn't say that query-only analyzers are widespread, but they're at
least useful, and I'm sure there are problems that can *only* be solved by
using a different analyzer when querying than when indexing...

> I guess that I'm missing why you'd want to force people to express
> that a specific Analyzer is meant to be used only at query time
> differently than one which is used at indexing time.
> If there is need to clearly make such discrimination then this should
> be made very clear to our users too, so I'd prefer if we could avoid
> introducing new concepts for people to learn.. unless there's strong
> need of course.

Analyzer definitions are interpreted as either Lucene analyzers (to be
instantiated) or Elasticsearch analyzers (to be pushed to the ES index
settings) based on where they are referenced (using @Analyzer).
When I say an analyzer definition is query-only, it means there is an
@AnalyzerDefinition but there isn't any @Analyzer referencing it. So
Hibernate Search wouldn't know how to interpret it (ES or Lucene).
Currently, the default for those definitions is to interpret them as Lucene
analyzers, which leads to HSEARCH-2534: we can't have Elasticsearch
query-only analyzers.
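
To make the rule above concrete, here is a minimal toy model of it. This is
not Hibernate Search code: the class, enum and method names are made up for
illustration, and the only assumption it encodes is the one described above
(a definition takes the backend of the place that references it, and an
unreferenced definition defaults to Lucene).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not part of Hibernate Search.
class AnalyzerScopeSketch {

    enum Backend { LUCENE, ELASTICSEARCH }

    /**
     * Assigns a backend to each analyzer definition.
     * 'references' maps a definition name to the backend of the entity
     * whose @Analyzer references it; unreferenced definitions fall back
     * to LUCENE, which is the behavior behind HSEARCH-2534.
     */
    static Map<String, Backend> classify(List<String> definitions,
                                         Map<String, Backend> references) {
        Map<String, Backend> result = new LinkedHashMap<>();
        for (String name : definitions) {
            result.put(name, references.getOrDefault(name, Backend.LUCENE));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Backend> scopes = classify(
                List.of("indexAnalyzer", "myQueryOnlyAnalyzer"),
                Map.of("indexAnalyzer", Backend.ELASTICSEARCH));
        // The query-only definition ends up as LUCENE even though the
        // entity is indexed in Elasticsearch.
        System.out.println(scopes);
    }
}
```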

Maybe with this piece of information, my original message makes more sense?
I.e.:

   1. Solution 1, interpret those definitions as both Lucene and
   Elasticsearch analyzers (there are problems with that, see my first
   message)
   2. Solution 2, make users "reference" those definitions using a new
   @QueryAnalyzer annotation.
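
For solution 2, usage could look roughly like the sketch below. The
@QueryAnalyzer and @QueryAnalyzers annotations are only a proposal and do not
exist in Hibernate Search; they are declared locally here so the example
compiles on its own. The Book entity and the queryOnlyDefinitions helper are
likewise hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the definitions an entity declares as
// needed at query time, so that they could be pushed to the index settings.
class QueryAnalyzerSketch {

    static List<String> queryOnlyDefinitions(Class<?> entityType) {
        List<String> names = new ArrayList<>();
        // getAnnotationsByType also looks inside the repeatable container.
        for (QueryAnalyzer a : entityType.getAnnotationsByType(QueryAnalyzer.class)) {
            names.add(a.definition());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(queryOnlyDefinitions(Book.class));
    }
}

// Proposed type-level annotation (hypothetical, declared here for the sketch).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@Repeatable(QueryAnalyzers.class)
@interface QueryAnalyzer {
    String definition();
}

// Proposed container annotation (hypothetical as well).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface QueryAnalyzers {
    QueryAnalyzer[] value();
}

// Hypothetical entity: both definitions are only used when querying.
@QueryAnalyzer(definition = "myQueryOnlyAnalyzer")
@QueryAnalyzer(definition = "anotherQueryOnlyAnalyzer")
class Book {
}
```

The point of the type-level declaration is that no dummy field is needed:
the definition is tied to an entity, and therefore to an indexing service,
without touching the schema.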

> Maybe I'm
> missing something, but couldn't a user simply use an additional
> @AnalyzerDef, so that the analyzer definition is associated to a name,
> and use that?

As mentioned above, an @AnalyzerDef that is not referenced is treated as a
Lucene analyzer, so it's not pushed to Elasticsearch and it can't be used
when querying Elasticsearch.
The only workaround I see would be to add a dummy, always-empty field like
this:

@Transient
@Field(name = "__dummy", analyzer = @Analyzer(definition = "myQueryOnlyAnalyzer"))
public String getMyQueryOnlyAnalyzerDummyField() {
  return null;
}

Which means there will be a useless field in the schema just to make
Hibernate Search happy.

> Is this issue relating to a specific user request?

No, it's just a feature that is available for Lucene but not for
Elasticsearch.


Yoann Rodière <yoann at hibernate.org>
Hibernate NoORM Team

On 5 January 2017 at 13:04, Sanne Grinovero <sanne at hibernate.org> wrote:

> Hello,
>
> I'm wondering how you'd all feel about the third solution:
>
>  3. don't do it.
>
> This depends of course how far it is blocking in practice. Maybe I'm
> missing something, but couldn't a user simply use an additional
> @AnalyzerDef, so that the analyzer definition is associated to a name,
> and use that?
>
> I guess that I'm missing why you'd want to force people to express
> that a specific Analyzer is meant to be used only at query time
> differently than one which is used at indexing time.
> If there is need to clearly make such discrimination then this should
> be made very clear to our users too, so I'd prefer if we could avoid
> introducing new concepts for people to learn.. unless there's strong
> need of course.
>
> Is this issue relating to a specific user request?
>
> Thanks,
> Sanne
>
>
>
> On 4 January 2017 at 16:00, Yoann Rodiere <yoann at hibernate.org> wrote:
> > Hello team,
> >
> > I'm currently working on HSEARCH-2534, "Query-only analyzer definitions
> > are never added to the index settings with Elasticsearch".
> > This issue is about using analyzers only when querying with
> > Elasticsearch.
> > It is already possible with Lucene, but not in Elasticsearch, because we
> > assume that any analyzer definition that is not referenced by a @Analyzer
> > annotation is a Lucene analyzer [1].
> >
> > To be precise, the exact place where query-only analyzers are used is in
> > EntityContext.overridesForField [2], and the overrides are leveraged
> > even with Elasticsearch, for instance in
> > ConnectedMultiFieldsTermQueryBuilder [3].
> >
> > I can see two solutions to the issue:
> >
> >    1. Make all analyzer definitions available for all indexing services.
> >    2. Allow users to define, for each entity, which analyzer definitions
> >    will be necessary when querying, even though the definitions are not
> >    used when indexing.
> >
> > Solution 1 seems quite hard to implement correctly.
> > First we'd have to have a different namespace for each indexing service,
> > but I've already implemented that much.
> > Second, some analyzer definitions are only valid for one indexing
> > service, and not for the other.
> > For instance, analyzer definitions using ElasticsearchTokenFilterFactory
> > are specific to Elasticsearch. And Analyzer definitions using
> > the WhitespaceTokenizerFactory with the "rule" parameter are only valid
> > with embedded Lucene. And so on. To sum up, I'm not sure we can do
> > something smart.
> >
> > Solution 2 is easier to implement, but requires adding a bit of API: a
> > way for users to declare that a given analyzer definition is to be
> > available when querying a given entity. I would add type-level
> > @QueryAnalyzer(definition = "foo") and @QueryAnalyzers annotations.
> >
> > I know nobody wants to add new annotations in a minor, but right now that
> > seems to be the only workable solution.
> >
> > What do you think?
> >
> > [1]
> > https://github.com/hibernate/hibernate-search/blob/1847bd222128395056cdf6e7cfb601ceed5e40c3/engine/src/main/java/org/hibernate/search/engine/impl/ConfigContext.java#L277
> > [2]
> > https://github.com/hibernate/hibernate-search/blob/1847bd222128395056cdf6e7cfb601ceed5e40c3/engine/src/main/java/org/hibernate/search/query/dsl/EntityContext.java#L14
> > [3]
> > https://github.com/hibernate/hibernate-search/blob/1847bd222128395056cdf6e7cfb601ceed5e40c3/engine/src/main/java/org/hibernate/search/query/dsl/impl/ConnectedMultiFieldsTermQueryBuilder.java#L222
> >
> >
> > Yoann Rodière <yoann at hibernate.org>
> > Hibernate NoORM Team
> > _______________________________________________
> > hibernate-dev mailing list
> > hibernate-dev at lists.jboss.org
> > https://lists.jboss.org/mailman/listinfo/hibernate-dev
>

