Thanks for that very valuable input. Let me throw out some ideas.
This email is about the AND problem.
On 03 Mar 2014, at 17:11, Guillaume Smet <guillaume.smet(a)hibernate.org> wrote:
1/ the aforementioned detail about sorting: we need AND badly in plain
text search;
III. So let's add an AND option...
-----------------------------------------------
Yeah, well, that's not so easy.
Let's take a look at the code, especially our dear friend
ConnectedMultiFieldsTermQueryBuilder .
When I started to look at HSEARCH-917, I thought it would be quite
easy to build lucene queries using a Lucene QueryParser instead of all
the machinery in ConnectedMultiFieldsTermQueryBuilder. It's not.
Here are pointers to the main problems I have:
1/ the getAllTermsFromText is cute when you want to OR the terms but
really bad when you need AND, especially when you use analyzers which
return several tokens for a term (this is the case when you use the
SynonymFilter or the WordDelimiterFilter);
Why do you say “especially”? Isn’t it “only”?
2/ the fieldBridge thing is quite painful for plain text search as we
are not sure that all the fields have the same fieldBridge and, thus,
the search terms might be different for each field after applying the
fieldBridge.
That would lead to different term queries but on different fields. So I am not sure I
follow the problem you are describing. Any chance you can rephrase it?
These problems are not so easy to solve in an absolute kind of way.
That's why I haven't made any progress on this problem.
Let's illustrate the problem:
- you search for "several words in my content" (without ", it's not a
phrase query, just terms)
- you search in the fields title, summary and content so you expect to
find at least one occurrence of each term in one of these fields;
If I understand you, you want to find several AND words AND in AND my AND content in title
or in summary or in content, but all terms should be present in one of the fields. Is that
correct? What’s the use case behind it?
Or is it that you want to find several AND words AND in AND my AND content, but across all of
the fields mentioned?
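To make the two readings concrete, here is a rough sketch. It is plain Java with no Lucene dependency: it only prints query strings in the style of Lucene's classic query syntax. The class and method names are made up for illustration, and it ignores analysis and field bridges entirely.

```java
import java.util.List;
import java.util.stream.Collectors;

public class AndReadingsSketch {

    // Reading A: all terms must occur in the same field; the fields are ORed.
    public static String allTermsInOneField(List<String> fields, List<String> terms) {
        return fields.stream()
                .map(f -> terms.stream()
                        .map(t -> f + ":" + t)
                        .collect(Collectors.joining(" AND ", "(", ")")))
                .collect(Collectors.joining(" OR "));
    }

    // Reading B: every term must occur, but each one may match in any field.
    public static String eachTermInAnyField(List<String> fields, List<String> terms) {
        return terms.stream()
                .map(t -> fields.stream()
                        .map(f -> f + ":" + t)
                        .collect(Collectors.joining(" OR ", "(", ")")))
                .collect(Collectors.joining(" AND "));
    }
}
```

Reading A gives one AND group per field, reading B gives one OR group per term. Which one we want changes what "the same term" has to mean across fields.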
- for some reason, you have a different fieldBridge on one of the
fields and it's quite hard to define "at least one occurrence of each
term in one of these fields" as the fieldBridge might transform the
text.
My point is that I don't see a way to fix the current DSL without
breaking some cases (note that the current code only works because
only the OR operator is supported) even if we might consider they are
weird.
From my perspective, a plainText branch of the DSL could ignore the
fieldBridge machinery but I'm not sure it's a good idea. That's why I
would like some feedback about this before moving in this direction.
We already do some magic depending on the fieldbridge we have (especially the built-in
ones vs custom ones).
We might enable some features only if we know the field bridge is built-in and predictable. Or,
literally, only if it is the same one.
So, if we go and enable different classes of analyzers per field, I think we can solve the
AND problem. The stack needs to be separated into:
- a tokenizer that splits the stream into words
- a set of filters that only normalise the words (lower case, accents, stemming, stop words
probably, etc).
- a set of filters that inject multiple tokens per initial “word” (ngrams and synonyms)
It is very possible that the first set of filters always comes naturally before the second set
of filters. Any counterexample?
With this, we can AND the various words and OR the second stream of tokens (or forbid them
initially). We would apply the analysis in two phases.
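A minimal sketch of the two-phase idea, under loud assumptions: this is plain Java, not the Lucene analyzer API, and the class, the method names and the synonym map are all hypothetical. Phase 1 stands in for the tokenizer plus normalising filters (one token out per word in), phase 2 stands in for the expanding filters (several tokens per word). The output is a query string in the style of Lucene's classic query syntax.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public class TwoPhaseAnalysisSketch {

    // Phase 1: split on whitespace and normalise; at most one token per input word.
    public static List<String> normalise(String text) {
        List<String> words = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                words.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    // Phase 2: expanding step; may return several tokens for a single word.
    public static List<String> expand(String word, Map<String, List<String>> synonyms) {
        List<String> tokens = new ArrayList<>();
        tokens.add(word);
        tokens.addAll(synonyms.getOrDefault(word, Collections.emptyList()));
        return tokens;
    }

    // AND across the phase-1 words, OR across the phase-2 tokens of each word.
    public static String buildQuery(String field, String text, Map<String, List<String>> synonyms) {
        return normalise(text).stream()
                .map(w -> expand(w, synonyms).stream()
                        .map(t -> field + ":" + t)
                        .collect(Collectors.joining(" OR ", "(", ")")))
                .collect(Collectors.joining(" AND "));
    }
}
```

The point of keeping the phases apart is that the AND only ever sees one unit per user word, so a synonym or ngram expansion cannot turn into extra mandatory clauses.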