[hibernate-dev] [Search] Elasticsearch - Thoughts about analyzers and fieldbridges

Thu Feb 25 13:01:29 EST 2016

2016-02-25 18:48 GMT+01:00 Sanne Grinovero <sanne at hibernate.org>:
> On 25 February 2016 at 17:16, Guillaume Smet <guillaume.smet at gmail.com> wrote:
>> Hi,
>>
>> Running the DSLTest is quite interesting as it runs a lot of different
>> queries with different configurations.
>>
>> Here are some thoughts while working on it.
>>
>> 1/ Analyzers
>> =========
>>
>> Gunnar, I was wondering what you had in mind for analyzers: a) deal with
>> them in Hibernate Search or b) let Elasticsearch do the analysis.
>>
>> I'm not sure we can get a) to work correctly (see below) and for b) we're
>> going to need a way to disable the analyzers in Search for the entities
>> managed by Elasticsearch.
>
> Right, we need to validate and/or push the analyzer configuration to
> Elasticsearch, and then let it do the analysis work.
>
> https://hibernate.atlassian.net/browse/HSEARCH-2108
>
> Davide is exploring this. We were just now discussing in chat how (if?) to
> validate the analyzer names which a user is attempting to use, when the named
> analyzer is not otherwise defined (on a Lucene backend this would cause an error
> during bootstrap as a "model validation failure").
>
> For example, Elasticsearch defines a bunch of default analyzers out of the box,
> such as one named "whitespace".
> We should be able to validate that this is a valid analyzer name
> even though this wasn't explicitly defined using an @AnalyzerDef.
>
> By using the "analysis test" API of ES we should be able to validate names which
> have been configured on the ES server by other means too.

That's nice, exactly what I was looking for. That way we can validate
analyzer refs.
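
To make the idea concrete, here is a minimal sketch of what such a check
could look like. It only builds the JSON body for a request to the ES
`_analyze` endpoint and interprets the HTTP status; the class and method
names are hypothetical, and treating any non-2xx answer as "name not
resolved" is an assumption that would need checking against the ES version
in use.

```java
final class AnalyzerNameCheck {

    // Escapes the characters that must not appear raw inside a JSON string.
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':
                    out.append("\\\"");
                    break;
                case '\\':
                    out.append("\\\\");
                    break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Body to send to /_analyze (or /{index}/_analyze for analyzers
    // defined in the settings of a specific index).
    static String analyzeRequestBody(String analyzerName, String sampleText) {
        return "{\"analyzer\":\"" + jsonEscape(analyzerName)
                + "\",\"text\":\"" + jsonEscape(sampleText) + "\"}";
    }

    // Assumption: a 2xx status means the server could resolve the analyzer
    // name; anything else is treated as "not validated".
    static boolean resolved(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 300;
    }

    public static void main(String[] args) {
        System.out.println(analyzeRequestBody("whitespace", "colder and whitening"));
    }
}
```

At bootstrap we could run this once per distinct analyzer name referenced
in the mapping and report unresolved names as a model validation failure,
as the Lucene backend does today.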

>
> One catch is the QueryDSL: AFAIR there are a couple of cases in which we invoke
> the analyzer directly to "preprocess" the tokens. This either needs to be killed,
> or we invoke the ES server to do the same operation, but that seems inefficient.
> We'll see; I don't remember now for which cases we need such things, and hopefully
> there will be an alternative strategy when it comes to running queries on ES.

Right, I'd disable analyzers in queries targeted at ES completely for
now. I see there is an ignoreAnalyzer() option, seems like this should
be implicitly set for ES? But then this setting seems not to be
applied in the case at hand AFAICS. Or is this even a glitch in the
current impl?

>
> As a second step, we should also see if we can "push" new analyzer definitions
> to the ES server, taking them from our @AnalyzerDef.. not sure if that's doable.

Yes, I am not sure how well @AnalyzerDef can be mapped. The attributes
are largely semantically similar to what's there in ES, but the
specific annotations on our side may still not be a good fit. We'll
see :) I don't think it's a blocker, in the worst case people can
still push definitions through native API calls, at least in the first
iteration.
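
For reference, the target of such a mapping would be the "analysis" section
of the ES index settings. A rough sketch of rendering it from a name, a
tokenizer and a list of filters follows. The class and method are
hypothetical, and the hard part is deliberately left out: translating the
Lucene factory classes referenced by @AnalyzerDef into the ES tokenizer and
filter names is assumed to have happened already.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class AnalysisSettingsSketch {

    // Renders index settings declaring one custom analyzer.
    // Assumption: all names are plain identifiers needing no JSON escaping.
    static String customAnalyzerSettings(String name, String tokenizer, List<String> filters) {
        String filterArray = filters.stream()
                .map(f -> "\"" + f + "\"")
                .collect(Collectors.joining(",", "[", "]"));
        return "{\"settings\":{\"analysis\":{\"analyzer\":{\"" + name + "\":"
                + "{\"type\":\"custom\",\"tokenizer\":\"" + tokenizer + "\","
                + "\"filter\":" + filterArray + "}}}}}";
    }

    public static void main(String[] args) {
        System.out.println(customAnalyzerSettings(
                "english_text", "standard", Arrays.asList("lowercase", "stop")));
    }
}
```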

>
>> Take a phrase query "colder and whitening" extracted from DSLTest. As "and"
>> is considered a stopword, the resulting PhraseQuery only contains colder &
>> whitening so it's not possible to rebuild the phrase before sending it to
>> Elasticsearch. We could try to use span queries but I'm not sure we'll be
>> able to get it right with synonyms and so on.
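
To illustrate the problem described above: the sketch below is a
simplified stand-in for an analyzer (whitespace split, lowercasing,
stopword removal), not the actual Lucene code path used by DSLTest. It
shows that once "and" is dropped, only the surviving terms and their
positions remain, so the original phrase text cannot be recovered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopwordGapDemo {

    // Returns the surviving terms as "term@position". Stopwords are
    // dropped but still consume a position, leaving a gap.
    static List<String> analyze(String phrase, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        int position = 0;
        for (String token : phrase.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (!stopwords.contains(token)) {
                out.add(token + "@" + position);
            }
            position++;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("and"));
        // Position 1 is missing from the output: the stopword is gone.
        System.out.println(analyze("colder and whitening", stopwords));
    }
}
```
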
>>
>> 2/ FieldBridges
>> ===========
>>
>> Currently, the conversion from Date to JSON string is managed by a
>> fieldbridge specific to the Elasticsearch backend (and it's not that easy
>> to plug it in, as we have seen in a ticket).
>>
>> I'm wondering if it's a good fit and if we should not invent an entirely
>> new thing to deal with the transformation between a given Java type and
>> what a backend expects.
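
As a point of reference for the Date -> JSON string conversion mentioned
above, a minimal sketch of such a backend-specific transformation could
look like this. The class name is hypothetical, and the choice of an
ISO-8601 UTC instant is an assumption; the format actually produced by the
Elasticsearch-specific fieldbridge may differ.

```java
import java.time.format.DateTimeFormatter;
import java.util.Date;

final class DateJsonSketch {

    // Formats a java.util.Date as an ISO-8601 instant in UTC,
    // a form ES can parse with its default date mapping.
    static String toJsonString(Date date) {
        return DateTimeFormatter.ISO_INSTANT.format(date.toInstant());
    }

    public static void main(String[] args) {
        System.out.println(toJsonString(new Date(0L)));
    }
}
```
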
>
> I agree, the current abstraction model doesn't cut it.
> Other than discussing a "FieldBridge 2", which is significantly
> different and will take quite some work
> (and probably require a 6.0 due to API changes), I don't have a
> concrete plan for the short term.
> You might want to propose something in the scope of current APIs?
>
> Keep in mind I'm working now to remove the Lucene Document notion from
> the LuceneWork types;
> in this pre-6 phase I think we'll want to have a translation phase
> between what the old fieldbridges
> are producing and the modernized backends. Essentially I hope we'll be
> rewriting it from the backend upwards,
> with the public API as the last step. Until we're done there needs to be a
> translation step, and this might live
> at a certain level initially and need to be moved as we make progress.

+1

>
>>
>> What made me think of it is that a test from DSLTest is using
>> ignoreFieldBridge() on a date field and it disables the Date -> JSON string
>> conversion entirely.

I think we can ignore this one for now.

>>
>> Comments, thoughts?
>>
>> --
>> Guillaume
>> _______________________________________________
>> hibernate-dev mailing list
>> hibernate-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/hibernate-dev