[hibernate-dev] HSEARCH: Removing dynamic analyzer mapping?

Thu Jul 2 07:20:46 EDT 2015

On 2 July 2015 at 11:25, Hardy Ferentschik <hardy at hibernate.org> wrote:
> Hi,
>
>> This means we might need to drop our "Dynamic Analyzer" feature:
>>  http://docs.jboss.org/hibernate/search/5.4/reference/en-US/html_single/#_dynamic_analyzer_selection
>
> I think that seems rather harsh.

I agree, I'd be quite unhappy if it comes to that. If we do it, we
should at least provide an alternative way to handle multi-language
indexing.
Ideally we should provide something similar to the Dynamic Analyzer
feature but which also multiplexes an entity property into multiple
fieldnames;
for example
 property "title"
    -> title_en & analyzer en
    -> title_de & analyzer de

The selection would work based on the Discriminator field, much like
the current Dynamic Analyzer.
Still, even if we were to find the bandwidth to build that, we'd need a
deprecation path for the existing feature. So for now I'm focusing on
keeping the existing feature working somehow; we can then work on the
better solution as a follow-up step.
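
To make the multiplexing idea concrete, here is a rough sketch. All the
names in it (FieldTarget, LanguageMultiplexer, register, resolve) are
hypothetical and do not exist in Hibernate Search; it only illustrates
how a discriminator value could select both the field name and the
analyzer, so that a given field name is always tied to one analyzer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Hibernate Search.
final class FieldTarget {
    final String fieldName;
    final String analyzerName;

    FieldTarget(String fieldName, String analyzerName) {
        this.fieldName = fieldName;
        this.analyzerName = analyzerName;
    }
}

final class LanguageMultiplexer {
    // Discriminator value -> analyzer definition name (assumed names).
    private final Map<String, String> analyzersByLanguage = new HashMap<>();

    void register(String language, String analyzerName) {
        analyzersByLanguage.put(language, analyzerName);
    }

    // Resolves e.g. property "title" + discriminator "de" into
    // field "title_de" indexed with the analyzer registered for "de".
    FieldTarget resolve(String property, String discriminatorValue) {
        String analyzer = analyzersByLanguage.get(discriminatorValue);
        if (analyzer == null) {
            throw new IllegalArgumentException(
                    "No analyzer registered for: " + discriminatorValue);
        }
        return new FieldTarget(property + "_" + discriminatorValue, analyzer);
    }
}
```

Since each generated field name maps to exactly one analyzer, the
field-to-analyzer mapping stays fixed for the lifetime of the
IndexWriter, which is the property the current Dynamic Analyzer lacks.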

>> So, the alternatives I'm seeing:
>>  # Dropping the Dynamic Analyzer feature
>>  # Cheat and pass in a mutable Analyzer - needs some caution re concurrent usage
>>  # Cheat and pass in a pre-analyzed Document
>>  # Fork & patch the IndexWriter
>
> What's about the alternative to close the IndexWriter and re-open it? Obviously this could be
> optimised, but storing the field to analyzer map together with the open IndexWriter and only
> re-open if the mapping changes. As long as the mapping is the same the same IndexWriter can be used.
> This way we could keep the feature with a potential performance hit for the people who are using it.
> Still better than removing it, right? That said, what are the exact performance impacts? Did you run
> a test?

The performance impact is huge, as it would prevent you from using both
NRT and the new backend strategy of packing multiple blocks into commit
cycles;
that means an impact of 3 to 4 orders of magnitude in throughput.
Another problem is that you'd have to apply such a strategy in all
cases, even for users who don't use Dynamic Analyzers, as the backend
can't really auto-detect when such a Work item is about to be processed
(I just tried it, it's getting very hairy).

I could apply your suggestion in practice if we set a flag in the
backend to change strategy depending on whether any entity is using
the Discriminator feature, but beyond that we also have the problem of
different entities sharing the same index but potentially using a
different analyzer for the same field name... I agree with the
Lucene developers that people really should not do that, but we support
it today.
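
For reference, the "re-open only if the mapping changes" idea could be
sketched roughly as below. WriterHandle is a hypothetical stand-in for
an open IndexWriter, not the Lucene class, and ReopeningWriterHolder is
an invented name; the sketch only shows the bookkeeping, not the actual
commit and close work.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an open IndexWriter; not the Lucene class.
final class WriterHandle {
    final Map<String, String> fieldToAnalyzer;

    WriterHandle(Map<String, String> fieldToAnalyzer) {
        // Defensive copy so later changes by the caller don't affect us.
        this.fieldToAnalyzer = new HashMap<>(fieldToAnalyzer);
    }
}

final class ReopeningWriterHolder {
    private WriterHandle current;
    private int reopenCount;

    // Returns a writer valid for the given field -> analyzer mapping,
    // reusing the current one as long as the mapping is unchanged.
    WriterHandle writerFor(Map<String, String> fieldToAnalyzer) {
        if (current == null || !current.fieldToAnalyzer.equals(fieldToAnalyzer)) {
            // A real implementation would have to commit and close the
            // old writer here, which is the expensive step.
            current = new WriterHandle(fieldToAnalyzer);
            reopenCount++;
        }
        return current;
    }

    int reopenCount() {
        return reopenCount;
    }
}
```

With documents alternating between two languages, every call lands in
the re-open branch, which is where the throughput loss comes from.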

>
> Funny enough, what the Lucene guys try to prevent by the API change can still be done, namely
> by just re-opening the IndexWriter. So they are effectively forcing people who want to use this
> analyzer per document feature to go down an even more slippery slope. I would not be surprised if
> this change get reverted.

Right, there are many ways around it; they are just forcing us to write
uglier and slower code. I hope they'll accommodate our use case.

>
>> My favourite long-term solution would be to do pre-analysis:
>
> How would that look like and did we not once discuss exactly the opposite (aka letting even
> the Document be built on the master)?

We discussed not creating a Document instance on the slave and only
serializing a custom serialization-friendly container, but that doesn't
prevent you from pre-tokenizing the text on the slaves.
AFAIR we discussed creating a "master node" which doesn't need the
user classes, so that it would be an easy-to-start service needing
little more than some configuration properties. If you don't
pre-tokenize, this setup would still need the classes in order to read
our analyzer definitions from annotations.

>
>> master/slave clustering approach, that would have several other
>> benefits:
>>  - move the analyzer work to the slaves
>
> Why is that a benefit?

- removes the need to have the analyzer definitions on the master (see above).
- spreads the CPU and memory allocation costs across the slave nodes:
better scalability than having it all done on the master.

>>  - reduce the network payloads
>
> Really, is it actually not increasing payloads?

No, I would expect it to reduce them: a pre-filtered token sequence is
usually smaller than the source text, often by a good margin.
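
A toy illustration of that size argument follows. PreAnalysisSketch, its
stop-word list and its splitting rule are all made up for the example
and stand in for a real analyzer chain; a real token stream also carries
positions and offsets, which this ignores, and analyzers that emit extra
tokens (n-grams, synonyms) could just as well grow the payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for slave-side analysis; not a Lucene Analyzer.
final class PreAnalysisSketch {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "a", "of", "to"));

    // Lowercases, splits on non-alphanumeric characters, drops stop words.
    static List<String> preTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Characters needed to ship the tokens with one separator between them.
    static int payloadSize(List<String> tokens) {
        int size = Math.max(0, tokens.size() - 1);
        for (String token : tokens) {
            size += token.length();
        }
        return size;
    }
}
```

The slave would then ship the output of preTokenize instead of the raw
property value, and the master would index the tokens as-is.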

>>  - remove the need to be able to serialize analyzers
>
> We don't serialize analyzers afaik

Right, not sure why I wrote that. I was probably thinking of the
Map<fieldName, AnalyzerName> which we ship with each Work to the
backend to apply custom overrides. Dropping that would be a small
simplification, but not a big win.

Sanne

>
> --Hardy