HSEARCH: Removing dynamic analyzer mapping?


Sanne Grinovero

Tuesday, 30 June 2015

6:57 a.m.

Among the many changes in Apache Lucene 5, it is no longer possible to override the Analyzer on a per-document basis. You have to pick a single Analyzer when opening the IndexWriter.

Of course the Analyzer can still return a different tokenization chain for each field, but the field->tokenizer mapping has to be consistent for the lifecycle of the IndexWriter.

This means we might need to drop our "Dynamic Analyzer" feature:
http://docs.jboss.org/hibernate/search/5.4/reference/en-US/html_single/#_...

I did ask to restore the functionality:
https://issues.apache.org/jira/browse/LUCENE-6212

So, the alternatives I'm seeing:
# Dropping the Dynamic Analyzer feature
# Cheat and pass in a mutable Analyzer - needs some caution re concurrent usage
# Cheat and pass in a pre-analyzed Document
# Fork & patch the IndexWriter

Patching the functionality back in Lucene is trivial, but the Lucene team needs to agree on the use case and then the release time will be long.

We should discuss both a short-term solution and the better long-term solution.

My favourite long-term solution would be to do pre-analysis: in our master/slave clustering approach, that would have several other benefits:
- move the analyzer work to the slaves
- reduce the network payloads
- remove the need to be able to serialize analyzers

But I'd prefer to do this in a second "polishing phase" rather than consider such a backend rewrite as a blocker for Lucene 5.

WDYT?

Thanks,
Sanne
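To illustrate the "mutable Analyzer" alternative and the concurrency caution attached to it, here is a minimal sketch in plain Java. It is a toy model, not Lucene API: a token chain is modelled as a function from text to tokens, and `MutableAnalyzerSketch`, `beginDocument` and `endDocument` are hypothetical names introduced only for this example. The assumption is that per-document overrides are kept in a `ThreadLocal`, so that a single shared instance can be handed to the writer once while concurrent indexing threads do not see each other's overrides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy model of the "mutable Analyzer" option; not Lucene API.
// A token chain is modelled as a plain function from text to tokens.
final class MutableAnalyzerSketch {
    private final Function<String, List<String>> defaultChain;

    // Per-thread overrides, so concurrent indexing threads never observe
    // each other's per-document choices.
    private final ThreadLocal<Map<String, Function<String, List<String>>>> overrides =
            ThreadLocal.withInitial(HashMap::new);

    MutableAnalyzerSketch(Function<String, List<String>> defaultChain) {
        this.defaultChain = defaultChain;
    }

    // Called right before a document is handed to the writer.
    void beginDocument(Map<String, Function<String, List<String>>> perFieldOverrides) {
        overrides.set(perFieldOverrides);
    }

    // Called right after, so the next document starts from the defaults.
    void endDocument() {
        overrides.remove();
    }

    // Uses the override for this field if the current thread set one.
    List<String> analyze(String field, String text) {
        return overrides.get().getOrDefault(field, defaultChain).apply(text);
    }

    public static void main(String[] args) {
        Function<String, List<String>> whitespace = t -> Arrays.asList(t.split("\\s+"));
        Function<String, List<String>> lowercase =
                t -> Arrays.asList(t.toLowerCase(Locale.ROOT).split("\\s+"));

        MutableAnalyzerSketch analyzer = new MutableAnalyzerSketch(whitespace);
        analyzer.beginDocument(Map.of("title", lowercase));
        System.out.println(analyzer.analyze("title", "Hello World")); // override applies
        analyzer.endDocument();
        System.out.println(analyzer.analyze("title", "Hello World")); // back to default
    }
}
```

In a real Lucene integration the analyzer's reuse strategy caches token stream components per thread, so a mutable wrapper would also have to make sure stale components are not reused across documents; the sketch deliberately ignores that part.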


Emmanuel Bernard

Tuesday, 30 June

8:12 a.m.

If we feel short-handed, we could do the following:

1. Disable the feature and raise an exception when someone uses it, with a pointer to the JIRA to restore it. That way we will know how many people we pissed off, and we can feed the use cases to our Lucene friends.
2. Work on a workaround if the JIRA becomes popular or compelling. A mutable analyzer or the pre-analyzed approach has my preference.


Hardy Ferentschik

Thursday, 2 July

5:25 a.m.

Hi,

> This means we might need to drop our "Dynamic Analyzer" feature:
> http://docs.jboss.org/hibernate/search/5.4/reference/en-US/html_single/#_...

I think that seems rather harsh.

> So, the alternatives I'm seeing:
> # Dropping the Dynamic Analyzer feature
> # Cheat and pass in a mutable Analyzer - needs some caution re concurrent usage
> # Cheat and pass in a pre-analyzed Document
> # Fork & patch the IndexWriter

What about the alternative of closing the IndexWriter and re-opening it? Obviously this could be optimised by storing the field-to-analyzer map together with the open IndexWriter and only re-opening if the mapping changes. As long as the mapping is the same, the same IndexWriter can be used. This way we could keep the feature, with a potential performance hit for the people who are using it. Still better than removing it, right? That said, what are the exact performance impacts? Did you run a test?

Funnily enough, what the Lucene guys try to prevent by the API change can still be done, namely by just re-opening the IndexWriter. So they are effectively forcing people who want to use this analyzer-per-document feature to go down an even more slippery slope. I would not be surprised if this change gets reverted.
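The re-open idea can be sketched roughly as follows. This is a plain-Java model under stated assumptions, not backend code: `WriterHolder` and `FakeWriter` are hypothetical names, `FakeWriter` merely stands in for an IndexWriter opened with one fixed field-to-analyzer mapping, and analyzers are identified by name. The holder keeps the mapping next to the open writer and only closes and re-opens when a different mapping is requested.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: keeps one open writer together with the mapping it was opened with.
final class WriterHolder {

    // Hypothetical stand-in for an IndexWriter bound to one fixed mapping.
    static final class FakeWriter {
        final Map<String, String> fieldToAnalyzer;
        boolean closed;

        FakeWriter(Map<String, String> fieldToAnalyzer) {
            this.fieldToAnalyzer = fieldToAnalyzer;
        }

        void close() {
            closed = true;
        }
    }

    private FakeWriter current;
    private int openCount;

    // Returns a writer for the requested mapping, re-opening only on change.
    synchronized FakeWriter writerFor(Map<String, String> requested) {
        if (current == null || !current.fieldToAnalyzer.equals(requested)) {
            if (current != null) {
                // With a real writer this is the expensive step:
                // pending work has to be flushed before the switch.
                current.close();
            }
            current = new FakeWriter(new HashMap<>(requested));
            openCount++;
        }
        return current;
    }

    synchronized int openCount() {
        return openCount;
    }

    public static void main(String[] args) {
        WriterHolder holder = new WriterHolder();
        holder.writerFor(Map.of("title", "en"));
        holder.writerFor(Map.of("title", "en")); // same mapping: writer is reused
        holder.writerFor(Map.of("title", "de")); // mapping changed: writer is re-opened
        System.out.println("writers opened: " + holder.openCount());
    }
}
```

The cost of this approach depends entirely on how often consecutive documents ask for different mappings; a workload that alternates languages on every document would re-open on every write.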

> My favourite long-term solution would be to do pre-analysis:

What would that look like, and did we not once discuss exactly the opposite (aka letting even the Document be built on the master)?

> master/slave clustering approach, that would have several other benefits:
> - move the analyzer work to the slaves

Why is that a benefit?

> - reduce the network payloads

Really? Is it actually not increasing payloads?

> - remove the need to be able to serialize analyzers

We don't serialize analyzers afaik

--Hardy

Sanne Grinovero

6:20 a.m.

On 2 July 2015 at 11:25, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

> Hi,
>
> > This means we might need to drop our "Dynamic Analyzer" feature:
> > http://docs.jboss.org/hibernate/search/5.4/reference/en-US/html_single/#_...
>
> I think that seems rather harsh.

I agree, I'd be quite unhappy if it comes to that. If we do it, we should at least provide an alternative way to handle multi-language indexing.

Ideally we should provide something similar to the Dynamic Analyzer feature but which also multiplexes an entity property into multiple fieldnames; for example
property "title"
 -> title_en & analyzer en
 -> title_de & analyzer de

The selection would work based on the Discriminator field, much like the current Dynamic Analyzer.

Still, even if we were to find the bandwidth to make that, we'd need a deprecation path for the existing feature. So for now I'm focusing on trying to keep the existing feature working somehow; we can then work on the better solution as a follow-up step.
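A minimal sketch of the multiplexing idea, in plain Java. No implementation was proposed in the thread, so everything here is an assumption made for illustration: the class name `FieldMultiplexer`, the `<property>_<discriminator>` naming rule, and the use of the language code as the analyzer name. The point it shows is that each generated field name maps to exactly one analyzer, so the field-to-analyzer mapping stays fixed for the lifetime of the writer while the discriminator only selects which field a value is written to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of multiplexing one property into per-language fields.
final class FieldMultiplexer {

    // Hypothetical naming rule: "<property>_<discriminator>", e.g. title_en.
    static String targetField(String property, String discriminator) {
        return property + "_" + discriminator;
    }

    // Static field -> analyzer-name mapping covering all supported languages.
    // It never changes per document, which is what Lucene 5 requires.
    static Map<String, String> mapping(String property, String... languages) {
        Map<String, String> fieldToAnalyzer = new LinkedHashMap<>();
        for (String language : languages) {
            fieldToAnalyzer.put(targetField(property, language), language);
        }
        return fieldToAnalyzer;
    }

    public static void main(String[] args) {
        System.out.println(mapping("title", "en", "de"));
        // A German document writes its title into the German field only.
        System.out.println(targetField("title", "de"));
    }
}
```

Queries would then have to target the matching field, for example `title_de` for German input, which is the part that needs explicit user input.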

> > So, the alternatives I'm seeing:
> > # Dropping the Dynamic Analyzer feature
> > # Cheat and pass in a mutable Analyzer - needs some caution re concurrent usage
> > # Cheat and pass in a pre-analyzed Document
> > # Fork & patch the IndexWriter
>
> What about the alternative of closing the IndexWriter and re-opening it? Obviously this could be optimised by storing the field-to-analyzer map together with the open IndexWriter and only re-opening if the mapping changes. As long as the mapping is the same, the same IndexWriter can be used. This way we could keep the feature, with a potential performance hit for the people who are using it. Still better than removing it, right? That said, what are the exact performance impacts? Did you run a test?

The performance impact is huge, as it would prevent you from using both NRT and the new backend strategy to pack multiple blocks in commit cycles; that means the impact is in the range of 3 to 4 orders of magnitude in throughput.

Another problem is that you'd have to apply such a strategy in all cases, even if they don't use Dynamic Analyzers, as the backend can't really auto-detect when such a Work item is about to be processed (I just tried it, it's getting very hairy).

I could apply your suggestion in practice if we go for setting a flag in the backend to change strategy, depending on whether any entity is using the Discriminator feature, but beyond that we also have the problem of different entities sharing the same index but potentially using a different analyzer for the same named field... I'd agree with the Lucene developers that people should really not do it, but we support that today.

> Funnily enough, what the Lucene guys try to prevent by the API change can still be done, namely by just re-opening the IndexWriter. So they are effectively forcing people who want to use this analyzer-per-document feature to go down an even more slippery slope. I would not be surprised if this change gets reverted.

Right, there are many ways around it; they are just forcing us to write uglier and slower code. I hope they'll accommodate it.

> > My favourite long-term solution would be to do pre-analysis:
>
> What would that look like, and did we not once discuss exactly the opposite (aka letting even the Document be built on the master)?

We discussed not creating a Document instance on the slave, and only serializing a custom serialization-friendly container, but that doesn't prevent you from pre-tokenizing the text on the slaves.

AFAIR we discussed creating a "master node" which doesn't need the user classes, so that it would be an easy-to-start service without need for much more than some configuration properties.. if you don't pre-tokenize, this configuration would still need the classes to read our analyzer definitions from annotations.

> > master/slave clustering approach, that would have several other benefits:
> > - move the analyzer work to the slaves
>
> Why is that a benefit?

- removes the need to have the analyzer definitions on the master (see above).
- spreads out the CPU and memory allocation costs to each slave node: better scalability than having it all done on the master.

> > - reduce the network payloads
>
> Really? Is it actually not increasing payloads?

I would expect so: a pre-filtered token sequence is usually smaller than the source text, often by a good margin.
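As a rough illustration of why a pre-filtered token sequence tends to be smaller than the source text, here is a toy comparison in plain Java. The analyzer below is an assumption made for this example (lowercasing, splitting on non-alphanumeric characters, dropping a tiny hypothetical stop-word list); it is not one of the Hibernate Search or Lucene analyzers, and the "payload" is simply the UTF-8 size of the tokens joined by spaces.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy comparison of raw text size versus pre-filtered token size.
final class PayloadSketch {
    // Hypothetical stop-word list, only for the example.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "of", "and", "is", "to", "in");

    // Lowercases, splits on anything that is not a letter or digit,
    // and drops stop words and empty fragments.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String source = "The title of the book is a guide to the index";
        List<String> tokens = analyze(source);
        System.out.println("source bytes: " + utf8Size(source));
        System.out.println("token bytes:  " + utf8Size(String.join(" ", tokens)));
        System.out.println(tokens);
    }
}
```

A real token stream also carries positions and offsets, and stored field values would still need the original text, so the actual saving depends on the wire format; analysis chains that expand the input (synonyms, n-grams, enrichment) can make the payload larger instead.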

> > - remove the need to be able to serialize analyzers
>
> We don't serialize analyzers afaik

Right, not sure why I wrote that. I was probably thinking of the Map<fieldName, AnalyzerName> which we ship with each Work to the backend to apply custom overrides.. a little simplification but not a big win.

Sanne

> --Hardy

Hardy Ferentschik

6:50 a.m.

Hi,

On Thu, Jul 02, 2015 at 12:20:46PM +0100, Sanne Grinovero wrote:

> Ideally we should provide something similar to the Dynamic Analyzer feature but which also multiplexes an entity property into multiple fieldnames; for example
> property "title"
>  -> title_en & analyzer en
>  -> title_de & analyzer de
>
> The selection would work based on the Discriminator field, much like the current Dynamic Analyzer.

That might be a possibility, even though I am not quite sure what exactly this would look like. I would first need to dig more into the existing code. Do you have a more concrete idea of what this would look like?

> Still, even if we were to find the bandwidth to make that, we'd need a deprecation path for the existing feature

Well, in the above case we are not talking about deprecation, right? It would be more of a change in behavior and use!?

> > What about the alternative of closing the IndexWriter and re-opening it? Obviously this could be optimised by storing the field-to-analyzer map together with the open IndexWriter and only re-opening if the mapping changes. As long as the mapping is the same, the same IndexWriter can be used. This way we could keep the feature, with a potential performance hit for the people who are using it. Still better than removing it, right? That said, what are the exact performance impacts? Did you run a test?
>
> The performance impact is huge, as it would prevent you from using both NRT and the new backend strategy to pack multiple blocks in commit cycles; that means the impact is in the range of 3 to 4 orders of magnitude in throughput.

Might be still worth testing and prototyping.

> I could apply your suggestion in practice if we go for setting a flag in the backend to change strategy, depending on whether any entity is using the Discriminator feature,

That would work for me.

> but beyond that we also have the problem of different entities sharing the same index but potentially using a different analyzer for the same named field... I'd agree with the Lucene developers that people should really not do it, but we support that today.

Ok. In this case I am more inclined to enforce the same analyzer.

> > What would that look like, and did we not once discuss exactly the opposite (aka letting even the Document be built on the master)?
>
> We discussed not creating a Document instance on the slave, and only serializing a custom serialization-friendly container, but that doesn't prevent you from pre-tokenizing the text on the slaves.
> AFAIR we discussed creating a "master node" which doesn't need the user classes, so that it would be an easy-to-start service without need for much more than some configuration properties.. if you don't pre-tokenize, this configuration would still need the classes to read our analyzer definitions from annotations.

Ok, that is possible as well. I think we discussed both, and I was indeed referring to the approach where the master node would do the indexing using user classes and the corresponding Search metadata.

I like the solution you are referring to much better, since it also works better with the ideas I have regarding the clustering of the index (eg with RAFT). As you suggest, it would be beneficial if only the slaves would need to know about the user classes.

> > > master/slave clustering approach, that would have several other benefits:
> > > - move the analyzer work to the slaves
> >
> > Why is that a benefit?
>
> - removes the need to have the analyzer definitions on the master (see above).

Ok, in the light of the above discussed solution, you would not need the analyzers on the master node. Not sure whether this is such an important thing, though.

> - spreads out the CPU and memory allocation costs to each slave node: better scalability than having it all done on the master

Well, one could also take the point of view that the slaves should do as little as possible and let the master do the heavy lifting. It really depends on what you are optimizing for, imo.

> > > - reduce the network payloads
> >
> > Really? Is it actually not increasing payloads?
>
> I would expect so: a pre-filtered token sequence is usually smaller than the source text, often by a good margin.

True, in the usual cases that is probably the case.

--Hardy

Sanne Grinovero

11:24 a.m.

On 2 July 2015 at 12:50, Hardy Ferentschik <hardy(a)hibernate.org> wrote:

> Hi,
>
> On Thu, Jul 02, 2015 at 12:20:46PM +0100, Sanne Grinovero wrote:
> > Ideally we should provide something similar to the Dynamic Analyzer feature but which also multiplexes an entity property into multiple fieldnames; for example
> > property "title"
> >  -> title_en & analyzer en
> >  -> title_de & analyzer de
> >
> > The selection would work based on the Discriminator field, much like the current Dynamic Analyzer.
>
> That might be a possibility, even though I am not quite sure what exactly this would look like. I would first need to dig more into the existing code. Do you have a more concrete idea of what this would look like?

I did not sketch an implementation.

> > Still, even if we were to find the bandwidth to make that, we'd need a deprecation path for the existing feature
>
> Well, in the above case we are not talking about deprecation, right? It would be more of a change in behavior and use!?

Right, the above example would be quite different: for example, queries would need to target the right field - and use the right analyzer - and that would need explicit user input.

So to provide a deprecation path, we'd need a version which supports both approaches so that people can move from one to the other.. which implies keeping the existing model around for a little longer, which is problematic.

In other words, discussing a better solution is good but doesn't avoid the need to keep the existing functionality around.

> > > What about the alternative of closing the IndexWriter and re-opening it? Obviously this could be optimised by storing the field-to-analyzer map together with the open IndexWriter and only re-opening if the mapping changes. As long as the mapping is the same, the same IndexWriter can be used. This way we could keep the feature, with a potential performance hit for the people who are using it. Still better than removing it, right? That said, what are the exact performance impacts? Did you run a test?
> >
> > The performance impact is huge, as it would prevent you from using both NRT and the new backend strategy to pack multiple blocks in commit cycles; that means the impact is in the range of 3 to 4 orders of magnitude in throughput.
>
> Might be still worth testing and prototyping.
>
> > I could apply your suggestion in practice if we go for setting a flag in the backend to change strategy, depending on whether any entity is using the Discriminator feature,
>
> That would work for me.
>
> > but beyond that we also have the problem of different entities sharing the same index but potentially using a different analyzer for the same named field... I'd agree with the Lucene developers that people should really not do it, but we support that today.
>
> Ok. In this case I am more inclined to enforce the same analyzer.

Right, especially as we can detect the inconsistency at boot time and raise an appropriate warning. In this case I'd not expect a nice deprecation path, as the existing usage (if any user did this) would have been problematic already.
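The boot-time detection could look roughly like the following plain-Java sketch. The metadata model is hypothetical: `EntityMapping` and `findConflicts` are names made up for this example, and analyzers are identified by name. It walks all entity mappings and reports every index/field pair for which two entities declare different analyzers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a boot-time check for conflicting analyzers on a shared index.
final class AnalyzerConsistencyCheck {

    // Hypothetical metadata: one entity's field -> analyzer-name mapping.
    static final class EntityMapping {
        final String entity;
        final String index;
        final Map<String, String> fieldToAnalyzer;

        EntityMapping(String entity, String index, Map<String, String> fieldToAnalyzer) {
            this.entity = entity;
            this.index = index;
            this.fieldToAnalyzer = fieldToAnalyzer;
        }
    }

    // Returns one message per index/field declared with two different analyzers.
    static List<String> findConflicts(List<EntityMapping> mappings) {
        Map<String, String> firstSeen = new HashMap<>(); // "index/field" -> analyzer
        List<String> conflicts = new ArrayList<>();
        for (EntityMapping mapping : mappings) {
            for (Map.Entry<String, String> entry : mapping.fieldToAnalyzer.entrySet()) {
                String key = mapping.index + "/" + entry.getKey();
                String previous = firstSeen.putIfAbsent(key, entry.getValue());
                if (previous != null && !previous.equals(entry.getValue())) {
                    conflicts.add(key + ": " + previous + " vs " + entry.getValue());
                }
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        List<EntityMapping> mappings = List.of(
                new EntityMapping("Book", "library", Map.of("title", "en")),
                new EntityMapping("Magazine", "library", Map.of("title", "de")));
        // A real bootstrap would log a warning or fail here.
        findConflicts(mappings).forEach(c -> System.out.println("WARN conflicting analyzers for " + c));
    }
}
```

Whether a conflict should only warn or should abort startup is a policy choice that the thread leaves open; the sketch just collects the conflicts.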

> > > What would that look like, and did we not once discuss exactly the opposite (aka letting even the Document be built on the master)?
> >
> > We discussed not creating a Document instance on the slave, and only serializing a custom serialization-friendly container, but that doesn't prevent you from pre-tokenizing the text on the slaves.
> > AFAIR we discussed creating a "master node" which doesn't need the user classes, so that it would be an easy-to-start service without need for much more than some configuration properties.. if you don't pre-tokenize, this configuration would still need the classes to read our analyzer definitions from annotations.
>
> Ok, that is possible as well. I think we discussed both, and I was indeed referring to the approach where the master node would do the indexing using user classes and the corresponding Search metadata.
>
> I like the solution you are referring to much better, since it also works better with the ideas I have regarding the clustering of the index (eg with RAFT). As you suggest, it would be beneficial if only the slaves would need to know about the user classes.

+1

> > > > master/slave clustering approach, that would have several other benefits:
> > > > - move the analyzer work to the slaves
> > >
> > > Why is that a benefit?
> >
> > - removes the need to have the analyzer definitions on the master (see above).
>
> Ok, in the light of the above discussed solution, you would not need the analyzers on the master node. Not sure whether this is such an important thing, though.

Above you said you like it much better to not need the user classes on the master. We build the analyzers from the annotations on the user classes - not least, we allow the user to provide custom analyzer implementations. So avoiding the need to have the analyzers on the master node is a prerequisite for getting rid of the user classes.

> > - spreads out the CPU and memory allocation costs to each slave node: better scalability than having it all done on the master
>
> Well, one could also take the point of view that the slaves should do as little as possible and let the master do the heavy lifting. It really depends on what you are optimizing for, imo.

Good point, one might prefer the opposite. But by decoupling the chain:
entity -> [tokenizing && indexwriting]
into
entity -> tokenizing -> indexwriting
you can easily provide an option to let the user choose where the tokenizing should happen.

I'd wager though that most will want to favour scalability, so I'd implement that first.
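The decoupled chain can be sketched in plain Java as two separate stages with a switch for where tokenizing runs. This is only a model under assumptions: `DecoupledPipeline`, `Work`, `TokenizeOn`, `prepare` and `tokensToIndex` are hypothetical names, the tokenizer is a trivial lowercase/whitespace splitter, and the `Work` class stands in for whatever would actually be sent from slave to master.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Toy model of: entity -> tokenizing -> indexwriting, with a configurable split.
final class DecoupledPipeline {

    // Hypothetical option: which node performs the tokenizing step.
    enum TokenizeOn { SLAVE, MASTER }

    // Hypothetical wire format: carries either raw text or tokens, never both.
    static final class Work {
        final String rawText;
        final List<String> tokens;

        Work(String rawText, List<String> tokens) {
            this.rawText = rawText;
            this.tokens = tokens;
        }
    }

    // Trivial stand-in for an analysis chain.
    static final Function<String, List<String>> TOKENIZER =
            text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));

    // Slave side: pre-tokenizes only when configured to do so.
    static Work prepare(String text, TokenizeOn where) {
        return where == TokenizeOn.SLAVE
                ? new Work(null, TOKENIZER.apply(text))
                : new Work(text, null);
    }

    // Master side: tokenizes only if the slave has not done it already.
    static List<String> tokensToIndex(Work work) {
        return work.tokens != null ? work.tokens : TOKENIZER.apply(work.rawText);
    }

    public static void main(String[] args) {
        for (TokenizeOn where : TokenizeOn.values()) {
            Work work = prepare("Hello World", where);
            System.out.println(where + " -> " + tokensToIndex(work));
        }
    }
}
```

Both settings produce the same tokens for indexing; the option only moves the CPU cost, and in the `SLAVE` case the master no longer needs the analyzer definitions.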

> > > > - reduce the network payloads
> > >
> > > Really? Is it actually not increasing payloads?
> >
> > I would expect so: a pre-filtered token sequence is usually smaller than the source text, often by a good margin.
>
> True, in the usual cases that is probably the case.

I can only think of the opposite happening in the context of information enrichment, such as Apache UIMA or Stanbol, but in these cases the high level of computation would make you want to choose scalability even more, i.e. pre-process each case on the slave rather than killing your master nodes.

Anyway, nice brainstorming but I'm not even sure how feasible it would be to do pre-processing without the IndexWriter :)

Sanne

Gunnar Morling

Friday, 3 July

3:53 a.m.

> Anyway, nice brainstorming but I'm not even sure how feasible it would be to do pre-processing without the IndexWriter :)

Where/when is that pre-processing happening today?

IMO we must start to consider non-Lucene backends in all our plans.

Sanne Grinovero

4:14 a.m.

On 3 July 2015 at 09:53, Gunnar Morling <gunnar(a)hibernate.org> wrote:

> > Anyway, nice brainstorming but I'm not even sure how feasible it would be to do pre-processing without the IndexWriter :)
>
> Where/when is that pre-processing happening today?
>
> IMO we must start to consider non-Lucene backends in all our plans.

Today the tokenization (analysis) happens during indexwriting:

  indexWriter.addDocument( Document d, Analyzer a );

We don't do pre-processing, which is what I'm suggesting we might want to do to avoid these problems, and like you say it might help with other backends.

Keep in mind that the alternative backends we're aiming at are all "Lucene based" though.. they certainly support performing the analysis on their side, so this isn't the only way.

If we decide to rely on their own analysis capabilities, we'll need to have these configured in some specific way to match our needs; since today we have a nice set of annotations to define these options, there's a risk of forcing the user to maintain duplicate declarations.

So we'll either have the user configure them manually to match our expectations, or we do pre-processing, or we generate configuration files, or maybe we can directly "set" the configuration automatically.

What they all have in common is that the mapping fieldname -> Analyzer Chain needs to be consistent, but we gave more flexibility away.

...

2015-07-02 18:24 GMT+02:00 Sanne Grinovero <sanne(a)hibernate.org>: > > On 2 July 2015 at 12:50, Hardy Ferentschik <hardy(a)hibernate.org> wrote: > > Hi, > > > > On Thu, Jul 02, 2015 at 12:20:46PM +0100, Sanne Grinovero wrote: > >> Ideally we should provide something similar to the Dynamic Analyzer > >> feature but which also multiplexes an entity property into multiple > >> fieldnames; > >> for example > >> property "title" > >> -> title_en & analyzer en > >> -> title_de & analyzer de > >> > >> The selection would work based on the Discriminator field, much like > >> the current Dynamic Analyzer. > > > > That might be a possibility, even though I am not quite sure how exactly > > this would > > look like. I would first need to dig in more into the existing code. > > Do you have a more concrete idea on how this would look like? > > I did not sketch an implementation. > > >> Still, even if we were to find the bandwidth to make that, we'd need a > >> deprecation path for the existing feature > > > > Well, in the above case we are not talking about deprecation right? It > > would be more of > > a change in behavior and use!? > > Right, the above example would be quite different, for example queries > would need to target the right field - and using the right analyzer - > and that would need explicit user input. > So to provide a deprecation path, we'd need a version which supports > both approaches so that people can move from one to the other.. which > implies keeping the existing model around for a little longer, which > is problematic. > In other words, discussing a better solution is good but doesn't avoid > the need to keep the existing functionality around. > > > > >> > What's about the alternative to close the IndexWriter and re-open it? > >> > Obviously this could be > >> > optimised, but storing the field to analyzer map together with the > >> > open IndexWriter and only > >> > re-open if the mapping changes. 
As long as the mapping is the same > >> > the same IndexWriter can be used. > >> > This way we could keep the feature with a potential performance hit > >> > for the people who are using it. > >> > Still better than removing it, right? That said, what are the exact > >> > performance impacts? Did you run > >> > a test? > >> > >> The performace impact is huge as it would prevent you from using both > >> NRT and the new backend strategy to pack multiple blocks in commit > >> cycles; > >> that means the impact is in the 3 to 4 orders of magnitude in > >> throughput. > > > > Might be still worth testing and prototyping. > > > >> I could apply your suggestion in practice if we go for setting a flag > >> in the backend to change strategy, depending if any entity is using > >> the Discriminator feature, > > > > That would work for me. > > > >> but beyond that we also have the problem of > >> different entities sharing the same index but potentially using a > >> different analyzer for the same named field... I'd agree with the > >> Lucene developers that people should really not do it, but we support > >> that today. > > > > Ok. In this case I am more inclined to enforce the same analyzer. > > Right, especially as we can detect the inconsistency at boot time and > raise an appropriate warning. > In this case I'd not expect a nice deprecation path as the existing > usage (if any user did this) would have been problematic already. > > > > >> > How would that look like and did we not once discuss exactly the > >> > opposite (aka letting even > >> > the Document be built on the master)? > >> > >> We discussed to not create a Document instance on the slave, to only > >> serialize a custom serializable-friendly container, but that doesn't > >> prevent you to pre-tokenize the text on the slaves. 
> >> AFAIR we discussed to create a "master node" which doesn't need the
> >> user classes so that would be an easy to start service w/o need for
> >> much more than some configuration properties.. if you don't
> >> pre-tokenize this configuration would still need the classes to read
> >> our analyzer definitions from annotations.
> >
> > Ok, that is possible as well. I think we discussed both and I was indeed
> > referring to the approach where the master node would do the indexing
> > using user classes and the corresponding Search metadata.
> >
> > I like the solution you are referring to much better, since it also works
> > better with the ideas I have regarding the clustering of the index (eg with
> > RAFT). As you suggest, it would be beneficial if only the slaves would need
> > to know about the user classes.
>
> +1
>
> >> >> master/slave clustering approach, that would have several other
> >> >> benefits:
> >> >> - move the analyzer work to the slaves
> >> >
> >> > Why is that a benefit?
> >>
> >> - removes the need to have the analyzer definitions on the master (see
> >> above).
> >
> > Ok, in the light of the above discussed solution, you would not need the
> > analyzers on the master node. Not sure whether this is such an important
> > thing though.
>
> Above you said you like it much better to not need the user classes on
> the master.
> We build the analyzers from the annotations on the user classes - not
> least we allow the user to provide custom analyzer implementations.
> So avoiding the need to have the analyzers on the master node is a
> pre-requisite to get rid of the user classes.
>
> >> - spreads out the CPU and memory allocations cost to each slave node:
> >> better scalability than have it all done on the master
> >
> > Well, one could also take the point of view that the slaves should do as
> > little as possible and let the master do the heavy lifting. It depends
> > really for what you are optimizing imo.
>
> Good point, one might prefer the opposite. But by decoupling the chain:
> entity -> [tokenizing && indexwriting]
> into
> entity -> tokenizing -> indexwriting
> then you can easily provide an option to let the user make this choice
> about where you want the tokenizing to happen.
>
> I'd wager though that most will want to favour scalability, so I'd
> implement that first.
>
> >> >> - reduce the network payloads
> >> >
> >> > Really, is it actually not increasing payloads?
> >>
> >> I would expect so: a pre-filtered token sequence is usually smaller
> >> than the source text, often by a good margin.
> >
> > True, in the usual cases that is probably the case.
>
> I can only think of the opposite to happen in context of information
> enrichment, such as Apache UIMA or Stanbol, but in these cases the
> high level of computation would even more want you to choose for
> scalability, i.e. pre-process each case on the slave rather than
> killing your master nodes.
>
> Anyway, nice brainstorming but I'm not even sure how feasible it would
> be to do pre-processing without the IndexWriter :)
>
> Sanne
> _______________________________________________
> hibernate-dev mailing list
> hibernate-dev(a)lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hibernate-dev
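
The multiplexing idea quoted above (one entity property written to several language-specific fields, chosen by a discriminator value) was not sketched in the thread. A rough illustration of the naming and lookup part could look like the following. It is plain Java with no Lucene or Hibernate Search types; the class `FieldMultiplexer`, its methods and the `property_discriminator` naming scheme are assumptions made up for this sketch, not an existing API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch: routes one entity property to a language-specific
 * field name, so that every field name is always bound to the same analyzer.
 */
final class FieldMultiplexer {

    // discriminator value (e.g. "en") -> analyzer name (e.g. "en")
    private final Map<String, String> analyzerByDiscriminator = new LinkedHashMap<>();

    public FieldMultiplexer register(String discriminator, String analyzerName) {
        analyzerByDiscriminator.put(discriminator, analyzerName);
        return this;
    }

    /** Field name for a property and discriminator, e.g. "title" + "de" gives "title_de". */
    public String fieldNameFor(String property, String discriminator) {
        requireKnown(discriminator);
        return property + "_" + discriminator;
    }

    /** Analyzer name bound to the discriminator; it never changes per document. */
    public String analyzerFor(String discriminator) {
        requireKnown(discriminator);
        return analyzerByDiscriminator.get(discriminator);
    }

    /** Static field -> analyzer map that could be fixed once, when the writer is opened. */
    public Map<String, String> fieldToAnalyzer(String property) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : analyzerByDiscriminator.entrySet()) {
            mapping.put(property + "_" + e.getKey(), e.getValue());
        }
        return mapping;
    }

    private void requireKnown(String discriminator) {
        if (!analyzerByDiscriminator.containsKey(discriminator)) {
            throw new IllegalArgumentException("Unknown discriminator: " + discriminator);
        }
    }
}
```

With "en" and "de" registered, the property "title" maps to the fields title_en and title_de, each with a fixed analyzer. As noted in the thread, queries would then have to target the right field explicitly, which is why this would be a change in behavior rather than a drop-in replacement for the Dynamic Analyzer.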


participants (4)

Emmanuel Bernard
Gunnar Morling
Hardy Ferentschik
Sanne Grinovero
