Re: [hibernate-dev] HSEARCH: Removing dynamic analyzer mapping?

Thursday, 2 July 2015

Hi,

On Thu, Jul 02, 2015 at 12:20:46PM +0100, Sanne Grinovero wrote:
...
 Ideally we should provide something similar to the Dynamic Analyzer
 feature but which also multiplexes an entity property into multiple
 fieldnames;
 for example
  property "title"
     -> title_en & analyzer en
     -> title_de & analyzer de

 The selection would work based on the Discriminator field, much like
 the current Dynamic Analyzer. 
That might be a possibility, even though I am not quite sure what exactly this would
look like. I would first need to dig more into the existing code.
Do you have a more concrete idea of what this would look like?
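
To make the question concrete, here is a minimal sketch of the multiplexing idea as I read it. All class and method names are hypothetical and not part of Hibernate Search; it only assumes that the discriminator value is used both as the field name suffix and to look up the analyzer name. Each resulting field name would then map to exactly one analyzer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical value type: the index field to write and the analyzer to use for it.
final class MultiplexedField {
    final String fieldName;
    final String analyzerName;

    MultiplexedField(String fieldName, String analyzerName) {
        this.fieldName = fieldName;
        this.analyzerName = analyzerName;
    }
}

// Hypothetical helper: maps a discriminator value (e.g. a language code)
// to an analyzer name and derives a per-language field name from it.
final class DiscriminatorMultiplexer {
    private final Map<String, String> analyzersByDiscriminator = new LinkedHashMap<>();

    DiscriminatorMultiplexer register(String discriminatorValue, String analyzerName) {
        analyzersByDiscriminator.put(discriminatorValue, analyzerName);
        return this;
    }

    // "title" + "de" -> field "title_de" with the analyzer registered for "de".
    MultiplexedField resolve(String property, String discriminatorValue) {
        String analyzer = analyzersByDiscriminator.get(discriminatorValue);
        if (analyzer == null) {
            throw new IllegalArgumentException("No analyzer registered for: " + discriminatorValue);
        }
        return new MultiplexedField(property + "_" + discriminatorValue, analyzer);
    }
}
```

With "en" and "de" registered, resolving the property "title" for the discriminator value "de" yields the field "title_de" and the analyzer "de", as in the example quoted above.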

...
 Still, even if we were to find the bandwidth to make that, we'd need a
 deprecation path for the existing feature
Well, in the above case we are not talking about deprecation, right? It would be more of
a change in behavior and use!?

...
 > What about the alternative of closing the IndexWriter and re-opening it? Obviously this could be
 > optimised by storing the field to analyzer map together with the open IndexWriter and only
 > re-opening if the mapping changes. As long as the mapping is the same, the same IndexWriter can be used.
 > This way we could keep the feature with a potential performance hit for the people who are using it.
 > Still better than removing it, right? That said, what are the exact performance impacts? Did you run
 > a test?

 The performance impact is huge as it would prevent you from using both
 NRT and the new backend strategy to pack multiple blocks in commit
 cycles;
 that means the impact is 3 to 4 orders of magnitude in throughput.
It might still be worth testing and prototyping.
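
As a starting point for such a prototype, a minimal sketch of the re-open idea could look like the following. WriterHandle and AnalyzerAwareWriterHolder are hypothetical stand-ins, not Lucene or Hibernate Search types; the factory is assumed to open a writer configured for the given field to analyzer map. The reopen counter is only there so the cost can be measured.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for an open index writer.
interface WriterHandle {
    void close();
}

// Keeps the open writer together with the field-to-analyzer map it was
// opened with, and re-opens only when a different map is requested.
final class AnalyzerAwareWriterHolder {
    private final Function<Map<String, String>, WriterHandle> writerFactory;
    private Map<String, String> currentMapping;
    private WriterHandle currentWriter;
    private int reopenCount;

    AnalyzerAwareWriterHolder(Function<Map<String, String>, WriterHandle> writerFactory) {
        this.writerFactory = writerFactory;
    }

    synchronized WriterHandle writerFor(Map<String, String> fieldToAnalyzer) {
        if (currentWriter == null || !fieldToAnalyzer.equals(currentMapping)) {
            if (currentWriter != null) {
                currentWriter.close(); // the expensive part: pending work must be flushed
            }
            currentMapping = new HashMap<>(fieldToAnalyzer); // defensive copy
            currentWriter = writerFactory.apply(currentMapping);
            reopenCount++;
        }
        return currentWriter;
    }

    // Number of times a writer was opened, including the first time.
    synchronized int reopenCount() {
        return reopenCount;
    }
}
```

Requesting the writer twice with the same mapping returns the same handle; only a changed mapping triggers a close and re-open, so entities which do not use the feature would never pay for it.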

...
 I could apply your suggestion in practice if we go for setting a flag
 in the backend to change strategy, depending on whether any entity is using
 the Discriminator feature,
That would work for me.

...
 but beyond that we also have the problem of
 different entities sharing the same index but potentially using a
 different analyzer for the same named field... I'd agree with the
 Lucene developers that people should really not do it, but we support
 that today. 
Ok. In this case I am more inclined to enforce the same analyzer.

...
 > What would that look like, and did we not once discuss exactly the opposite (aka letting even
 > the Document be built on the master)?

 We discussed not creating a Document instance on the slave, and only
 serializing a custom serializable-friendly container, but that doesn't
 prevent you from pre-tokenizing the text on the slaves.
 AFAIR we discussed creating a "master node" which doesn't need the
 user classes, so that would be an easy-to-start service w/o need for
 much more than some configuration properties. If you don't
 pre-tokenize, this configuration would still need the classes to read
 our analyzer definitions from annotations.
Ok, that is possible as well. I think we discussed both and I was indeed
referring to the approach where the master node would do the indexing using
user classes and the corresponding Search metadata.

I like the solution you are referring to much better, since it also works
better with the ideas I have regarding the clustering of the index (eg with
RAFT). As you suggest, it would be beneficial if only the slaves needed
to know about the user classes.

...
 >> master/slave clustering approach, that would have several other
 >> benefits:
 >>  - move the analyzer work to the slaves
 >
 > Why is that a benefit?

 - removes the need to have the analyzer definitions on the master (see above). 
Ok, in the light of the solution discussed above, you would not need the analyzers
on the master node. Not sure whether this is such an important thing, though.

...
 - spreads out the CPU and memory allocation costs to each slave node:
 better scalability than having it all done on the master
Well, one could also take the point of view that the slaves should do as little as
possible and let the master do the heavy lifting. It really depends on what you are
optimizing for imo.

...
 >>  - reduce the network payloads
 >
 > Really, is it actually not increasing payloads?

 I would expect so: a pre-filtered token sequence is usually smaller
 than the source text, often by a good margin. 
True, in the usual cases that is probably so.
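
A quick way to check this would be something along the lines of the sketch below. PreTokenizer, its stop word list and the space-joined payload format are assumptions made up for illustration, not an existing analyzer or wire format. It also ignores position and offset attributes, which a real token stream would have to carry as well and which could eat into the savings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-tokenizer as it might run on a slave node:
// lowercases, splits on non-alphanumeric characters and drops stop words.
final class PreTokenizer {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "is", "in", "to"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Size in bytes of a naive payload: tokens joined by a single space, UTF-8 encoded.
    static int payloadBytes(List<String> tokens) {
        return String.join(" ", tokens).getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Comparing payloadBytes(tokenize(text)) against the UTF-8 length of the source text for a representative set of documents would show how large the margin really is.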

--Hardy
