[hibernate-dev] [AND] Search: changing the way we search

Tue Mar 4 06:24:09 EST 2014

Hi Emmanuel,

On Tue, Mar 4, 2014 at 11:09 AM, Emmanuel Bernard
<emmanuel at hibernate.org> wrote:
>> Here are pointers to the main problems I have:
>> 1/ the getAllTermsFromText is cute when you want to OR the terms but
>> really bad when you need AND, especially when you use analyzers which
>> return several tokens for a term (this is the case when you use the
>> SynonymFilter or the WordDelimiterFilter);
>
> Why do you say "especially"? Isn't it "only"?

It's been a while, that's why I wasn't categorical. But, as far as I
recall from my work back then, I think you're right.

>> 2/ the fieldBridge thing is quite painful for plain text search as we
>> are not sure that all the fields have the same fieldBridge and, thus,
>> the search terms might be different for each field after applying the
>> fieldBridge.
>
> That would lead to different term queries but on different fields. So I am not sure I follow the problem you are describing. Any chance you can rephrase it?

This point is related to the one below.

> If I understand you, you want to find "several AND words AND in AND my AND content" in title or in summary or in content, but all terms should be present in one of the fields. Is that correct? What's the use case behind it?

It's probably the most common use case we have.

Let's say you have an entity called "Hotel whatever" and in its
description it says it does have a swimming pool but the word "hotel"
doesn't appear in the description (my example isn't the best chosen
but I think you can easily imagine it does happen on real data).

Our user is looking for "hotel swimming pool", and we want "Hotel
whatever" to match.

Of course, if you use an OR with a sort by score, it does work (more or
less) but the main issue is that our customers don't want too many
unrelated results. They only want items which really match the query.
Moreover, they often don't want the results sorted by score so the OR
+ sort by score approach is really not acceptable.

This is why we use MultiFieldQueryParser with AND as the default
operator a lot when using Lucene directly and the (e)dismax parser
when using Solr.
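
To make the expected behaviour concrete, here is a minimal plain-Java sketch of the matching rule (AND across terms, OR across fields). It does not use Lucene at all: the class name, the method names, the naive tokenizer and the sample description text are all made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public class CrossFieldAnd {

    // Naive stand-in for an analyzer (assumption): lower-case, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // AND across terms, OR across fields:
    // every query term must occur in at least one of the fields.
    static boolean matchesAcrossFields(Map<String, String> fields, String query) {
        for (String term : tokenize(query)) {
            boolean found = false;
            for (String value : fields.values()) {
                if (tokenize(value).contains(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    // Stricter variant for comparison: all terms must occur in the same field.
    static boolean matchesWithinOneField(Map<String, String> fields, String query) {
        List<String> terms = tokenize(query);
        for (String value : fields.values()) {
            if (tokenize(value).containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> hotel = new LinkedHashMap<>();
        hotel.put("name", "Hotel whatever");
        hotel.put("description", "Quiet rooms and a heated swimming pool.");

        // "hotel" is only in the name, "swimming pool" only in the description.
        System.out.println(matchesAcrossFields(hotel, "hotel swimming pool"));   // true
        System.out.println(matchesWithinOneField(hotel, "hotel swimming pool")); // false
        System.out.println(matchesAcrossFields(hotel, "hotel sauna"));           // false
    }
}
```

The first rule is the one we want: no single field contains all three terms, yet the entity matches, and a query containing a term that appears nowhere is still rejected.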

>> From my perspective, a plainText branch of the DSL could ignore the
>> fieldBridge machinery but I'm not sure it's a good idea. That's why I
>> would like some feedback about this before moving in this direction.
>
> We already do some magic depending on the fieldbridge we have (especially the built-in ones vs custom ones).
> We might enable some features iff we know the field is built-in and predictable. Or literally if that is the same one.
>
> So, if we go and enable different classes of analyzers per field, I think we can solve the AND problem. The stack needs to be separated into:
> - a tokenizer that splits the stream into words
> - a set of filters that only normalise the words (lower case, accents, stemming, stop words probably, etc).
> - a set of filters that inject multiple tokens per initial "word" (ngrams and synonyms)
>
> It is very possible that the first set of filters is always naturally before the second set of filters. Any counter example?
>
> With this, we can AND the various words and OR the second stream of tokens (or forbid them initially). We would apply the analysis in two phases.
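
As a rough illustration of the two-phase idea quoted above, the sketch below groups the expanded tokens per word, ORs inside each group and ANDs the groups. It is plain Java, not Hibernate Search or Lucene code: the class, the method names and the one-entry synonym table are hypothetical, and it assumes the document was indexed without synonym expansion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TwoPhaseAnalysis {

    // Hypothetical synonym table standing in for the token-injecting filters.
    static final Map<String, List<String>> SYNONYMS = Map.of("pool", List.of("basin"));

    // Phase 1: tokenize and normalise, producing exactly one token per word.
    static List<String> words(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Phase 2: expand one normalised word into its alternatives (the word itself first).
    static Set<String> expand(String word) {
        Set<String> alternatives = new LinkedHashSet<>();
        alternatives.add(word);
        alternatives.addAll(SYNONYMS.getOrDefault(word, List.of()));
        return alternatives;
    }

    // One OR-group per word of the query.
    static List<Set<String>> groups(String query) {
        List<Set<String>> result = new ArrayList<>();
        for (String word : words(query)) {
            result.add(expand(word));
        }
        return result;
    }

    // Grouped semantics: AND over the groups, OR inside each group.
    static boolean matchesGrouped(Set<String> docTokens, String query) {
        return groups(query).stream()
                .allMatch(group -> group.stream().anyMatch(docTokens::contains));
    }

    // Flat semantics: AND over every token the full analysis produced.
    static boolean matchesFlat(Set<String> docTokens, String query) {
        return groups(query).stream()
                .flatMap(Set::stream)
                .allMatch(docTokens::contains);
    }

    // Renders the grouped query in a notation loosely modelled on Lucene's classic syntax.
    static String render(String query) {
        return groups(query).stream()
                .map(group -> "+(" + String.join(" ", group) + ")")
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(words("Hotel with a swimming pool"));
        String query = "hotel swimming pool";

        System.out.println(render(query));              // +(hotel) +(swimming) +(pool basin)
        System.out.println(matchesGrouped(doc, query)); // true
        System.out.println(matchesFlat(doc, query));    // false: "basin" is not in the document
    }
}
```

The flat variant is what ANDing every token of the analyzed text amounts to, which is why the injected synonym makes it fail here while the grouped variant still matches.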

See my points in my other email.

I'll try to write some code to explain what I would like to do. I'll
keep you posted when I have something consistent.

-- 
Guillaume