[hibernate-dev] [AND] Search: changing the way we search

Tue Mar 4 09:02:23 EST 2014

On Tue, Mar 4, 2014 at 1:36 PM, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> OK so you want the words hotel + swimming pool to be present somewhere in the sum of the corpus of title and description. That's the second case I was describing then. Indeed it kinda fails if you don't order by score but rather alphabetically or by distance.
> Have you considered the following: your query should only consider the top n, or the results whose score reaches 70% of the top score and then do your business sort on this subset.

That doesn't work for us. Users want exactly the results that match:
we can't have missing results or extra results.

They search for something and they want to find exactly what they are
looking for.

Note: this is really 99.9% of our use cases, probably because we
mostly develop business applications.

> Anyway, to address this, one needs to target fields that are:
> - using the same fieldbridge
> - using the same analyzer
> - do the trick I was describing around filters like ngrams (and then or)

That's when I stopped my work on HSEARCH-917. I wasn't sure I could
reasonably require such conditions, at least not in the current API.

I started to wonder if we could introduce a text() branch in parallel
to keyword() and phrase() but never really posted about it.

I would like to separate the user responsibility from the developer
responsibility:
- the user defines the search query. It's a little more clever than
just a term search: the user can use +, - and "" (quoted phrases),
which is why I would like to use a QueryParser directly (most of our
users don't use this syntax but some of them need it);
- the developer defines how the search is done. It can run on
several fields and, for each field, the developer can define a boost
(this is supported by the SimpleQueryParser) AND whether the query is
fuzzy (not supported out of the box by the SimpleQueryParser).
(We could even imagine supporting "minimum should match" as the dismax
parser does.)
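
To make the split concrete, here is a rough sketch in plain Java. It
is not the proposed text() API and it does not use Hibernate Search or
Lucene classes: TextQuerySketch, FieldConfig and expand() are
hypothetical names I made up for illustration. The user only types
terms with +, - and "", the developer supplies the per-field boost and
fuzzy flag, and the result is a string in Lucene classic query syntax
(field:term, ~ for fuzzy, ^ for boost). It assumes the input contains
no other special characters and does no escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextQuerySketch {

    /** Developer-side settings for one field (hypothetical type). */
    public static final class FieldConfig {
        final float boost;
        final boolean fuzzy;

        public FieldConfig(float boost, boolean fuzzy) {
            this.boost = boost;
            this.fuzzy = fuzzy;
        }
    }

    // An optional + or - prefix, then either a "quoted phrase" or a bare word.
    private static final Pattern CLAUSE =
            Pattern.compile("([+-]?)(\"[^\"]*\"|[^\\s\"]+)");

    /** Expands each user clause over every configured field. */
    public static String expand(String userQuery, Map<String, FieldConfig> fields) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(userQuery);
        while (m.find()) {
            String operator = m.group(1);   // "+", "-" or ""
            String text = m.group(2);       // word or quoted phrase
            boolean phrase = text.startsWith("\"");
            List<String> perField = new ArrayList<>();
            for (Map.Entry<String, FieldConfig> e : fields.entrySet()) {
                FieldConfig cfg = e.getValue();
                StringBuilder sb = new StringBuilder(e.getKey()).append(':').append(text);
                // Assumption: fuzziness is applied to single terms only, not phrases.
                if (cfg.fuzzy && !phrase) {
                    sb.append('~');
                }
                if (cfg.boost != 1.0f) {
                    sb.append('^').append(cfg.boost);
                }
                perField.add(sb.toString());
            }
            // The user's operator applies to the whole group of field alternatives.
            clauses.add(operator + "(" + String.join(" ", perField) + ")");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // Developer decision: title is boosted, description is searched fuzzily.
        Map<String, FieldConfig> fields = new LinkedHashMap<>();
        fields.put("title", new FieldConfig(2.0f, false));
        fields.put("description", new FieldConfig(1.0f, true));
        // User input: plain terms plus +, - and "".
        System.out.println(expand("hotel +\"swimming pool\" -closed", fields));
    }
}
```

A real implementation would build Query objects rather than a string,
run each term through the field's analyzer, and pick a default
operator for clauses without a + or - prefix.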

This is really what we need on a daily basis: my users don't
really know whether their search needs to be fuzzy or not. I would like
to be able to make that decision for them, because I know the corpus of
documents and I know when it's going to be needed.

Does this look like something interesting to you?