[hibernate-dev] HSEARCH: Removing dynamic analyzer mapping?

Fri Jul 3 04:53:46 EDT 2015

> Anyway, nice brainstorming but I'm not even sure how feasible it would
> be to do pre-processing without the IndexWriter :)

Where/when is that pre-processing happening today? IMO we must start
considering non-Lucene backends in all our plans.

2015-07-02 18:24 GMT+02:00 Sanne Grinovero <sanne at hibernate.org>:

> On 2 July 2015 at 12:50, Hardy Ferentschik <hardy at hibernate.org> wrote:
> > Hi,
> >
> > On Thu, Jul 02, 2015 at 12:20:46PM +0100, Sanne Grinovero wrote:
> >> Ideally we should provide something similar to the Dynamic Analyzer
> >> feature but which also multiplexes an entity property into multiple
> >> fieldnames;
> >> for example
> >>  property "title"
> >>     -> title_en & analyzer en
> >>     -> title_de & analyzer de
> >>
> >> The selection would work based on the Discriminator field, much like
> >> the current Dynamic Analyzer.
> >
> > That might be a possibility, even though I am not quite sure what exactly
> > this would look like. I would first need to dig more into the existing code.
> > Do you have a more concrete idea of what this would look like?
>
> I did not sketch an implementation.
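
To make the title_en / title_de idea above concrete, here is a minimal Java sketch. Everything in it is hypothetical (DiscriminatedFieldSketch, Target, resolve, and the assumed convention that the discriminator value doubles as field suffix and analyzer name); it is not an existing Hibernate Search API, only one way the selection could look:

```java
/** Hypothetical helper for illustration; not part of Hibernate Search. */
final class DiscriminatedFieldSketch {

    /** Field name and analyzer name selected for one property value. */
    static final class Target {
        final String fieldName;
        final String analyzerName;

        Target(String fieldName, String analyzerName) {
            this.fieldName = fieldName;
            this.analyzerName = analyzerName;
        }
    }

    /**
     * Assumed convention: the discriminator value (e.g. a language code)
     * is used both as the field suffix and as the analyzer name.
     * Without a discriminator the property keeps its name and a
     * placeholder "default" analyzer.
     */
    static Target resolve(String property, String discriminator) {
        if (discriminator == null || discriminator.isEmpty()) {
            return new Target(property, "default");
        }
        return new Target(property + "_" + discriminator, discriminator);
    }

    public static void main(String[] args) {
        Target en = resolve("title", "en");
        Target de = resolve("title", "de");
        System.out.println(en.fieldName + " -> analyzer " + en.analyzerName);
        System.out.println(de.fieldName + " -> analyzer " + de.analyzerName);
    }
}
```

The query side would have to go through the same resolution, since a query needs to target the matching field with the matching analyzer.
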
>
> >> Still, even if we were to find the bandwidth to make that, we'd need a
> >> deprecation path for the existing feature
> >
> > Well, in the above case we are not talking about deprecation right? It
> > would be more of a change in behavior and use!?
>
> Right, the above example would be quite different, for example queries
> would need to target the right field - and use the right analyzer -
> and that would need explicit user input.
> So to provide a deprecation path, we'd need a version which supports
> both approaches so that people can move from one to the other.. which
> implies keeping the existing model around for a little longer, which
> is problematic.
> In other words, discussing a better solution is good but doesn't avoid
> the need to keep the existing functionality around.
>
> >
> >> > What about the alternative of closing the IndexWriter and re-opening it?
> >> > Obviously this could be optimised by storing the field-to-analyzer map
> >> > together with the open IndexWriter and only re-opening if the mapping
> >> > changes. As long as the mapping is the same, the same IndexWriter can be
> >> > used. This way we could keep the feature with a potential performance hit
> >> > for the people who are using it. Still better than removing it, right?
> >> > That said, what are the exact performance impacts? Did you run a test?
> >>
> >> The performance impact is huge, as it would prevent you from using both
> >> NRT and the new backend strategy to pack multiple blocks in commit
> >> cycles;
> >> that means an impact of 3 to 4 orders of magnitude in throughput.
> >
> > Might be still worth testing and prototyping.
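
For such a prototype, the "re-open only when the mapping changes" idea could be sketched roughly as below. This is a plain-Java illustration under loud assumptions: Writer is a stand-in for an open Lucene IndexWriter, analyzers are represented by their names only, and WriterReuseSketch, writerFor and reopenCount are made-up names. It says nothing about the real cost of a re-open, which is the part that would need measuring.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical backend fragment; Writer stands in for a Lucene IndexWriter. */
final class WriterReuseSketch {

    /** Stand-in for an open writer bound to a fixed field -> analyzer mapping. */
    static final class Writer {
        final Map<String, String> analyzerByField;

        Writer(Map<String, String> analyzerByField) {
            this.analyzerByField = analyzerByField;
        }
    }

    private Writer current;
    private int reopenCount;

    /** Returns a writer for the requested mapping, re-opening only on change. */
    Writer writerFor(Map<String, String> requested) {
        if (current == null || !current.analyzerByField.equals(requested)) {
            // A real backend would commit and close the old writer here;
            // that is the expensive step this check tries to avoid.
            current = new Writer(new HashMap<>(requested));
            reopenCount++;
        }
        return current;
    }

    int reopenCount() {
        return reopenCount;
    }

    public static void main(String[] args) {
        WriterReuseSketch backend = new WriterReuseSketch();
        Map<String, String> en = new HashMap<>();
        en.put("title", "en");
        Map<String, String> de = new HashMap<>();
        de.put("title", "de");

        backend.writerFor(en); // first use: writer opened
        backend.writerFor(en); // same mapping: writer reused
        backend.writerFor(de); // mapping changed: writer re-opened
        System.out.println("writers opened: " + backend.reopenCount());
    }
}
```

With a workload that alternates discriminator values on every document, this degenerates to one re-open per document, which is the worst case described above.
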
> >
> >> I could apply your suggestion in practice if we go for setting a flag
> >> in the backend to change strategy, depending on whether any entity is using
> >> the Discriminator feature,
> >
> > That would work for me.
> >
> >> but beyond that we also have the problem of
> >> different entities sharing the same index but potentially using a
> >> different analyzer for the same named field... I'd agree with the
> >> Lucene developers that people should really not do it, but we support
> >> that today.
> >
> > Ok. In this case I am more inclined to enforce the same analyzer.
>
> Right, especially as we can detect the inconsistency at boot time and
> raise an appropriate warning.
> In this case I'd not expect a nice deprecation path as the existing
> usage (if any user did this) would have been problematic already.
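
The boot-time check mentioned above could look roughly like the following sketch. It assumes the mapping metadata can be reduced to "entity name -> (field name -> analyzer name)" for all entities sharing one index; AnalyzerConsistencySketch and findConflicts are hypothetical names, and a real implementation would compare analyzer definitions rather than plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical boot-time check; not an existing Hibernate Search class. */
final class AnalyzerConsistencySketch {

    /**
     * Input: entity name -> (field name -> analyzer name) for all entities
     * sharing one index. Returns one message per conflicting mapping found.
     */
    static List<String> findConflicts(Map<String, Map<String, String>> analyzersByEntity) {
        Map<String, String> analyzerSeen = new HashMap<>();
        Map<String, String> entitySeen = new HashMap<>();
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entity : analyzersByEntity.entrySet()) {
            for (Map.Entry<String, String> field : entity.getValue().entrySet()) {
                String name = field.getKey();
                String analyzer = field.getValue();
                // Remember the first analyzer seen for each field name.
                String previous = analyzerSeen.putIfAbsent(name, analyzer);
                if (previous == null) {
                    entitySeen.put(name, entity.getKey());
                } else if (!previous.equals(analyzer)) {
                    conflicts.add("field '" + name + "': analyzer '" + previous + "' on "
                            + entitySeen.get(name) + " vs '" + analyzer + "' on "
                            + entity.getKey());
                }
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        Map<String, String> book = new HashMap<>();
        book.put("title", "en");
        Map<String, String> article = new HashMap<>();
        article.put("title", "de");

        Map<String, Map<String, String>> sharedIndex = new LinkedHashMap<>();
        sharedIndex.put("Book", book);
        sharedIndex.put("Article", article);

        for (String conflict : findConflicts(sharedIndex)) {
            System.out.println("WARN " + conflict);
        }
    }
}
```

Whether a conflict should only warn or fail the boot is a policy choice this sketch leaves open.
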
>
> >
> >> > How would that look like and did we not once discuss exactly the
> >> > opposite (aka letting even the Document be built on the master)?
> >>
> >> We discussed not creating a Document instance on the slave, and only
> >> serializing a custom serialization-friendly container, but that doesn't
> >> prevent you from pre-tokenizing the text on the slaves.
> >> AFAIR we discussed creating a "master node" which doesn't need the
> >> user classes, so that it would be an easy-to-start service without need for
> >> much more than some configuration properties.. if you don't
> >> pre-tokenize, this configuration would still need the classes to read
> >> our analyzer definitions from annotations.
> >
> > Ok, that is possible as well. I think we discussed both and I was indeed
> > referring to the approach where the master node would do the indexing using
> > user classes and the corresponding Search metadata.
> >
> > I like the solution you are referring to much better, since it also works
> > better with the ideas I have regarding the clustering of the index (eg
> > with RAFT). As you suggest, it would be beneficial if only the slaves
> > needed to know about the user classes.
>
> +1
>
> >
> >> >> master/slave clustering approach, that would have several other
> >> >> benefits:
> >> >>  - move the analyzer work to the slaves
> >> >
> >> > Why is that a benefit?
> >>
> >> - removes the need to have the analyzer definitions on the master (see
> >> above).
> >
> > Ok, in the light of the above discussed solution, you would not need the
> > analyzers on the master node. Not sure whether this is such an important
> > thing though.
>
> Above you said you like it much better to not need the user classes on
> the master.
> We build the analyzers from the annotations on the user classes - not
> least we allow the user to provide custom analyzer implementations.
> So avoiding the need to have the analyzers on the master node is a
> pre-requisite to get rid of the user classes.
>
> >
> >> - spreads out the CPU and memory allocations cost to each slave node:
> >> better scalability than have it all done on the master
> >
> > Well, one could also take the point of view that the slaves should do as
> > little as possible and let the master do the heavy lifting. It really
> > depends on what you are optimizing for, imo.
>
> Good point, one might prefer the opposite. But by decoupling the chain:
>   entity -> [tokenizing && indexwriting]
> into
>   entity -> tokenizing -> indexwriting
> Then you can easily provide an option to let the user make this choice
> about where you want the tokenizing to happen.
>
> I'd wager though that most will want to favour scalability, so I'd
> implement that first.
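
The decoupled chain (entity -> tokenizing -> indexwriting) with a user-selectable tokenizing location could be sketched as follows. This is only an illustration under assumptions: the tokenizer is a trivial lowercase-and-split stand-in for a real analyzer chain, and TokenizeThenWriteSketch, TokenizeOn, Payload, onSlave and onMaster are invented names, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of separating tokenizing from index writing. */
final class TokenizeThenWriteSketch {

    /** Where the analysis work should run; the user-facing option. */
    enum TokenizeOn { SLAVE, MASTER }

    /** What a slave sends to the master: raw text or pre-analyzed tokens. */
    static final class Payload {
        final String rawText;      // set when tokenizing is left to the master
        final List<String> tokens; // set when the slave already tokenized

        Payload(String rawText, List<String> tokens) {
            this.rawText = rawText;
            this.tokens = tokens;
        }
    }

    /** Trivial stand-in analyzer: lowercase, split on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Slave side: either pre-tokenize or forward the raw text. */
    static Payload onSlave(String text, TokenizeOn where) {
        return where == TokenizeOn.SLAVE
                ? new Payload(null, tokenize(text))
                : new Payload(text, null);
    }

    /** Master side: the index-writing step only ever consumes tokens. */
    static List<String> onMaster(Payload payload) {
        return payload.tokens != null ? payload.tokens : tokenize(payload.rawText);
    }

    public static void main(String[] args) {
        String title = "Removing dynamic analyzer mapping?";
        System.out.println(onMaster(onSlave(title, TokenizeOn.SLAVE)));
        System.out.println(onMaster(onSlave(title, TokenizeOn.MASTER)));
    }
}
```

Both settings end with the same token list on the master; only the node that pays for the analysis changes. In SLAVE mode the master never needs the analyzer definitions.
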
>
> >> >>  - reduce the network payloads
> >> >
> >> > Really, is it actually not increasing payloads?
> >>
> >> I would expect so: a pre-filtered token sequence is usually smaller
> >> than the source text, often by a good margin.
> >
> > True, in the usual cases that is probably the case.
>
> I can only think of the opposite happening in the context of information
> enrichment, such as Apache UIMA or Stanbol, but in these cases the
> high level of computation would make you want to choose scalability even
> more, i.e. pre-process each case on the slave rather than
> killing your master nodes.
>
> Anyway, nice brainstorming but I'm not even sure how feasible it would
> be to do pre-processing without the IndexWriter :)
>
> Sanne