[hibernate-dev] [AND] Search: changing the way we search

Tue Mar 4 05:09:23 EST 2014

Thanks for that very valuable input. Let me throw out some ideas.
This email is about the AND problem.

On 03 Mar 2014, at 17:11, Guillaume Smet <guillaume.smet at hibernate.org> wrote:

> 
> 1/ the aforementioned detail about sorting: we need AND badly in plain
> text search;
> 
> 
> III. So let's add an AND option...
> -----------------------------------------------
> 
> Yeah, well, that's not so easy.
> 
> Let's take a look at the code, especially our dear friend
> ConnectedMultiFieldsTermQueryBuilder .
> 
> When I started to look at HSEARCH-917, I thought it would be quite
> easy to build lucene queries using a Lucene QueryParser instead of all
> the machinery in ConnectedMultiFieldsTermQueryBuilder. It's not.
> 
> Here are pointers to the main problems I have:
> 1/ the getAllTermsFromText is cute when you want to OR the terms but
> really bad when you need AND, especially when you use analyzers which
> returns several tokens for a term (this is the case when you use the
> SynonymFilter or the WordDelimiterFilter);

Why do you say “especially”? Isn’t it “only”?

> 2/ the fieldBridge thing is quite painful for plain text search as we
> are not sure that all the fields have the same fieldBridge and, thus,
> the search terms might be different for each fields after applying the
> fieldBridge.

That would lead to different term queries but on different fields. So I am not sure I follow the problem you are describing. Any chance you can rephrase it?

> 
> These problems are not so easy to solve in an absolute kind of way.
> That's why I haven't made any progress on this problem.
> 
> Let's illustrate the problem:
> - you search for "several words in my content" (without ", it's not a
> phrase query, just terms)
> - you search in the fields title, summary and content so you expect to
> find at least one occurrence of each term in one of these fields;

If I understand you, you want to find several AND words AND in AND my AND content in title, or in summary, or in content, with all the terms present in a single one of these fields. Is that correct? What’s the use case behind it?

Or is it that you want to find several AND words AND in AND my AND content, but across all of the fields mentioned?
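
To make the two readings concrete, here is a rough sketch. Everything in it is hypothetical (class and method names are mine, and the output is a plain string that only resembles Lucene query syntax); it just shows the shape of the boolean query each reading implies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not Hibernate Search or Lucene API.
final class AndSemanticsSketch {

    // Reading A: at least one single field contains every term.
    static String allTermsInOneField(List<String> fields, List<String> terms) {
        List<String> perField = new ArrayList<>();
        for (String field : fields) {
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                clauses.add(field + ":" + term);
            }
            perField.add("(" + String.join(" AND ", clauses) + ")");
        }
        return String.join(" OR ", perField);
    }

    // Reading B: every term is found in at least one of the fields.
    static String eachTermInAnyField(List<String> fields, List<String> terms) {
        List<String> perTerm = new ArrayList<>();
        for (String term : terms) {
            List<String> clauses = new ArrayList<>();
            for (String field : fields) {
                clauses.add(field + ":" + term);
            }
            perTerm.add("(" + String.join(" OR ", clauses) + ")");
        }
        return String.join(" AND ", perTerm);
    }

    public static void main(String[] args) {
        List<String> fields = List.of("title", "content");
        List<String> terms = List.of("several", "words");
        System.out.println(allTermsInOneField(fields, terms));
        System.out.println(eachTermInAnyField(fields, terms));
    }
}
```

Reading A is an OR of per-field ANDs, reading B is an AND of per-term ORs. A document with "several" in the title and "words" only in the content matches B but not A.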

> - for some reason, you have a different fieldBridge on one of the
> fields and it's quite hard to define "at least one occurrence of each
> term in one of these fields" as the fieldBridge might transform the
> text.
> 
> My point is that I don't see a way to fix the current DSL without
> breaking some cases (note that the current code only works because
> only the OR operator is supported) even if we might consider they are
> weird.
> 
> From my perspective, a plainText branch of the DSL could ignore the
> fieldBridge machinery but I'm not sure it's a good idea. That's why I
> would like some feedback about this before moving in this direction.

We already do some magic depending on the fieldBridge we have (especially built-in ones vs custom ones).
We might enable some features if and only if we know the fieldBridge is built-in and predictable, or simply if it is the same one on every targeted field.

So, if we go and enable different classes of analyzers per field, I think we can solve the AND problem. The stack needs to be separated into:
- a tokenizer that splits the stream into words
- a set of filters that only normalise the words (lower case, accents, stemming, stop words probably, etc.)
- a set of filters that inject multiple tokens per initial “word” (ngrams and synonyms)

It is very possible that the first set of filters always naturally comes before the second set of filters. Any counterexample?

With this, we can AND the various words and OR the second stream of tokens (or forbid them initially). We would apply the analysis in two phases.
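
As a minimal sketch of that two-phase idea, assuming a whitespace tokenizer, a lower-casing normaliser and a toy synonym table stand in for the real analyzer chain. All names below are hypothetical (nothing here is Lucene or Hibernate Search API), and the result is a plain string that only resembles Lucene query syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not Hibernate Search or Lucene API.
final class TwoPhaseAnalysisSketch {

    // Phase 1a: tokenizer, splits the input into words on whitespace.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Phase 1b: normalising filter, one token out per word in (lower-casing only here).
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    // Phase 2: expanding filter, may emit several tokens per word (toy synonym lookup).
    static Set<String> expand(String word, Map<String, List<String>> synonyms) {
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(word);
        tokens.addAll(synonyms.getOrDefault(word, List.of()));
        return tokens;
    }

    // AND across the words, OR across the tokens produced for a single word.
    static String buildQuery(String field, String text, Map<String, List<String>> synonyms) {
        List<String> clauses = new ArrayList<>();
        for (String word : tokenize(text)) {
            List<String> alternatives = new ArrayList<>();
            for (String token : expand(normalize(word), synonyms)) {
                alternatives.add(field + ":" + token);
            }
            clauses.add("(" + String.join(" OR ", alternatives) + ")");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = Map.of("several", List.of("many"));
        // Prints: (content:several OR content:many) AND (content:words)
        System.out.println(buildQuery("content", "Several Words", synonyms));
    }
}
```

The point is that the word boundary is kept between the two phases, so tokens injected by the second set of filters end up ORed inside their own word's clause instead of being flattened into the list of terms to AND.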