[hibernate-dev] Search: changing the way we search

Mon Mar 3 13:01:19 EST 2014

Hi Guillaume,
that's a very welcome email, and very timely set of proposals.

I'll top-post rather than try to comment inline as it's huge :)
Copying your headers for convenience:

I. How do we search at my company?

Very useful to know how you use it. I would expect most users to focus
on the full-text capabilities, but I understand your position: in
a previous project we used it for full-text, but not only for that. It
was primarily used to "boost" performance of some strictly relational
queries, and in such cases we were going to quite some extremes to
encode necessary tokens into the index to get it to behave in a
structured manner.
So that makes at least two power users using it this way. Likewise,
in the Hibernate OGM and Infinispan use cases "full text" is a nice to
have, but we expect the users to be primarily interested in a SQL
replacement.

So I think we'll have to concede that focusing on full-text is no
longer the only goal.

It quite strikes me that you mention auto-completion: it has been an
open JIRA but we never found a volunteer. Do you build it on top of
existing infrastructure? I didn't look at it yet, but if all it takes
is an example, I'd love to add it to our documentation.

II. On why we can't use the DSL out of the box

1/ Not sure how to address it; if you have patches I'll be glad to include them.
2/ Good point, it also reminds me of previous projects: lots of boilerplate
code to handle this.

III. So let's add an AND option...

You seem to suggest having a solution, although the cost would be
breaking the current API.
Since we're preparing for a new major release it would be possible to
consider this now. Don't wait too long ;-)
TBH I haven't fully understood what you are suggesting. Could we
start from a failing test to illustrate what you mean in practice?
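
To pin down what I think the expected behaviour is, here is a minimal
sketch in plain Java. It does not use Lucene or our DSL: the class and
method names are made up for illustration, and tokens() is a naive
stand-in for an analyzer. It also assumes every field is analyzed the
same way, which is exactly what the fieldBridge concern calls into
question.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustration only: models the matching rules with plain collections.
// Nothing here is an existing Hibernate Search or Lucene API.
final class AllTermsAcrossFields {

    private AllTermsAcrossFields() {
    }

    // Naive stand-in for an analyzer: lowercase, then split on anything
    // that is not a letter or a digit.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Collects the tokens of every field value of the document.
    private static Set<String> indexedTokens(Map<String, String> document) {
        Set<String> indexed = new HashSet<>();
        for (String value : document.values()) {
            indexed.addAll(tokens(value));
        }
        return indexed;
    }

    // AND across terms, OR across fields: every query term must occur
    // in at least one of the fields.
    static boolean matchesAll(Map<String, String> document, String query) {
        return indexedTokens(document).containsAll(tokens(query));
    }

    // OR across terms: a single query term found in any field is enough.
    static boolean matchesAny(Map<String, String> document, String query) {
        return !Collections.disjoint(indexedTokens(document), tokens(query));
    }

    public static void main(String[] args) {
        Map<String, String> document = new LinkedHashMap<>();
        document.put("title", "Hibernate Search");
        document.put("content", "Several words in my content");

        System.out.println(matchesAll(document, "several words search"));
        System.out.println(matchesAll(document, "several words lucene"));
        System.out.println(matchesAny(document, "several words lucene"));
    }
}
```

With the AND rule, "several words lucene" does not match because
"lucene" occurs in no field, while the OR rule still accepts it. A
failing test against the real DSL could assert the same difference.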

IV. About wildcard queries

Great point. AFAIK we had a solution for this (yes we needed the
same), which essentially required the full field to be indexed as a
single token, but we'd still lowercase it, remove accents, etc. I'm
not sure that's a good solution for people not understanding the
implications though :)

Let's start by immediately making sure we support index time vs query
time analyzer choices, that's something that has been bothering me for
a long time. Would you create a JIRA for this?
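
For the Parki* case from your example, the kind of query-time
normalization I have in mind could look roughly like the sketch below.
It is plain JDK code, not Lucene or Hibernate Search API: the class
name is hypothetical, and it only covers lowercasing and accent
removal, assuming the index-time analyzer does the same two things.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Hibernate Search or Lucene.
final class WildcardTermNormalizer {

    private WildcardTermNormalizer() {
    }

    // Lowercases the term and strips accents. The wildcard characters
    // '*' and '?' are not affected by either step.
    static String normalize(String term) {
        // NFD decomposes an accented letter into the base letter
        // followed by a combining mark.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // \p{M} matches combining marks, so this drops the accents.
        String unaccented = decomposed.replaceAll("\\p{M}+", "");
        return unaccented.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Parki*"));
        // \u00C9 is an E with an acute accent.
        System.out.println(normalize("\u00C9gli?e*"));
    }
}
```

Here "Parki*" becomes "parki*", which matches the lowercased "parking"
in the index. Anything the analyzer does beyond these two steps
(stemming, word splitting) would still need its own handling.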

V. The "don't add this clause if null/empty" problem

I like the idea, but let's split the conversation into sub-tasks. Create
a JIRA too? Feel free to optimistically mark it for 5.0, but unless
you can help drive this, anything which can be added incrementally
(i.e. without breaking the API) will likely be moved to some 5.x, so as
not to prevent us from delivering a quick 5.0 (as quick as it can be).
Still, I prefer to discuss this kind of task at least partially and
early, to make sure we won't actually need to break the API soon
after.
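
Just to have something concrete to discuss, one possible shape is a
builder method which silently skips the clause when the value is
missing. The sketch below is purely hypothetical: neither the class nor
mustIfPresent() exists in Hibernate Search, and clauses are modelled as
plain strings to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical builder, not an existing Hibernate Search API.
final class OptionalClauses {

    private final List<String> clauses = new ArrayList<>();

    // Adds a "field:value" clause only when the value is neither null
    // nor blank; otherwise the call is a no-op, so the fluent chain
    // does not need to be interrupted by if statements.
    OptionalClauses mustIfPresent(String field, Object value) {
        if (value != null) {
            String text = value.toString().trim();
            if (!text.isEmpty()) {
                clauses.add(field + ":" + text);
            }
        }
        return this;
    }

    // Read-only view of the clauses which were actually added.
    List<String> clauses() {
        return Collections.unmodifiableList(clauses);
    }

    public static void main(String[] args) {
        List<String> result = new OptionalClauses()
                .mustIfPresent("title", "parking")
                .mustIfPresent("city", null)
                .mustIfPresent("type", "   ")
                .clauses();
        System.out.println(result);
    }
}
```

Only the "title" clause survives in this example. Something like this
could probably be added without breaking the existing API.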

VI. Provide not so bad default analyzers

Ok, another great one which should be done before 5.0. Would you
propose a default one?
And track it on JIRA for 5.0 too.

Thanks!
Sanne

On 3 March 2014 16:11, Guillaume Smet <guillaume.smet at hibernate.org> wrote:
> Hi,
>
> So, it's been a long time since I threw the first idea of this (see
> HSEARCH-917) but, after a lot more thoughts, and the fact that I'm
> basically stuck for a long time on this one, it's probably better to
> agree with a plan before putting together some code.
>
> Note that this plan is based on our usage of Hibernate Search on a lot
> of applications for several years and I think our usage pattern is
> quite common. But, even so, I'm pretty sure there are other search
> patterns out there which might be interesting and it would be nice to
> include them in this proposal if they don't fit.
>
> I. How do we search at my company?
> -------------------------------------------------------
>
> We mainly use Search for 2 things:
> - autocompletion;
> - search engines: search form to filter a list of items. Usually, a
> plain text field and several structured fields (drop down choice
> mostly).
>
> We usually sort with business rules, not using score. Users usually
> like it better as it's more predictable. For example, we sort our
> autocompletion results alphabetically. An interesting note here is
> probably that we work on structured data, not on CMS content. This
> might be considered a detail but you'll see it's important.
>
> We use analyzers to:
> - split the words (the WordDelimiterFilter - yeah, I have a Solr background :));
> - filter the input (AsciiFoldingFilter, LowercaseFilter...);
> - possibly do simple stemming (with our own very minimal stemmers).
>
> We sometimes use Search to find the elements to apply business rules
> when it's really hard to use the database to do so. Search provides a
> convenient way to denormalize the data.
>
> II. On why we can't use the DSL out of the box
> --------------------------------------------------------------------
>
> The Hibernate Search DSL is great and I must admit this is the DSL
> which taught me how to build DSLs for our own usage. It's intuitive,
> well thought out, definitely a nice piece of code.
>
> So, why don't we use it for our plain text queries? (Disclaimer: we
> use it under the hood, we just have to do a lot of things manually
> outside of the DSL)
>
> Several reasons:
> 1/ the aforementioned detail about sorting: we need AND badly in plain
> text search;
> 2/ we often need to add a clause only if the text isn't empty or the
> object isn't null, and we then need to add more logic than the fluent
> approach allows (I don't have any ideas/proposals for this one but
> I think it's worth mentioning).
>
> And why is it not ideal:
> 3/ wildcard and analyzers are really a pain with Lucene and you need
> to implement your own cleaning stuff to get a working wildcard query.
>
> 1/ is definitely our biggest problem.
>
> III. So let's add an AND option...
> -----------------------------------------------
>
> Yeah, well, that's not so easy.
>
> Let's take a look at the code, especially our dear friend
> ConnectedMultiFieldsTermQueryBuilder .
>
> When I started to look at HSEARCH-917, I thought it would be quite
> easy to build lucene queries using a Lucene QueryParser instead of all
> the machinery in ConnectedMultiFieldsTermQueryBuilder. It's not.
>
> Here are pointers to the main problems I have:
> 1/ the getAllTermsFromText is cute when you want to OR the terms but
> really bad when you need AND, especially when you use analyzers which
> return several tokens for a term (this is the case when you use the
> SynonymFilter or the WordDelimiterFilter);
> 2/ the fieldBridge thing is quite painful for plain text search as we
> are not sure that all the fields have the same fieldBridge and, thus,
> the search terms might be different for each field after applying the
> fieldBridge.
>
> These problems are not so easy to solve in an absolute kind of way.
> That's why I haven't made any progress on this problem.
>
> Let's illustrate the problem:
> - you search for "several words in my content" (without ", it's not a
> phrase query, just terms)
> - you search in the fields title, summary and content so you expect to
> find at least one occurrence of each term in one of these fields;
> - for some reason, you have a different fieldBridge on one of the
> fields and it's quite hard to define "at least one occurrence of each
> term in one of these fields" as the fieldBridge might transform the
> text.
>
> My point is that I don't see a way to fix the current DSL without
> breaking some cases (note that the current code only works because
> only the OR operator is supported) even if we might consider they are
> weird.
>
> From my perspective, a plainText branch of the DSL could ignore the
> fieldBridge machinery but I'm not sure it's a good idea. That's why I
> would like some feedback about this before moving in this direction.
>
> I took a look at the new features of Lucene 4.7 and the new
> SimpleQueryParser looks kinda interesting as it's really simple and
> could be a good starting point to come up with a QueryParser which
> simply does the job for our plain text search queries.
>
> IV. About wildcard queries
> --------------------------------------
>
> Let's say it frankly: wildcard queries are a pain in Lucene.
>
> Let's take an example:
> - You index "Parking" and you have a LowerCaseFilter so your index
> contains "parking";
> - You search for Parking without wildcard, it will work;
> - You search for Parki* with wildcard, yeah, it won't work.
>
> This is due to the fact that for wildcards, the analyzers are ignored.
> Usually, because if you use ? or *, they can be replaced by the
> filters you use in your analyzers.
>
> While we all understand the Lucene point of view from a technical
> perspective, I don't think we can keep this position for Hibernate
> Search as a user friendly search framework on top of Hibernate.
>
> At Open Wide, we have a quite complex method which rewrites a search
> as a working autocompletion search which might work most of the time
> (with a high value of most...). It's kinda ugly, far from perfect and
> I'm wondering if we could have something more clever in Search. I once
> talked with Emmanuel about having different analyzers for Indexing,
> Querying (this is the Solr way) and Wildcards/Fuzzy search (this is
> IMHO a good idea as the way you want to normalize your wildcard query
> highly depends on the analyzer used to index your data).
>
> V. The "don't add this clause if null/empty" problem
> ----------------------------------------------------------------------------
>
> Ideas welcome!
>
> VI. Provide not so bad default analyzers
> ---------------------------------------------------------
>
> I think it would be nice to provide default analyzers for plain text.
> Not necessarily ones including complex/debatable things like stemmers,
> but at least something which gives a good taste of Search before going
> into more details.
>
> Why would it be interesting? As a French speaking person, I see so
> many search engines out there which don't normalize accented
> characters, it would be nice to have something working by default.
>
> VII. Conclusion
> ----------------------
>
> I would really like to make some quick progress on III. I'm pretty
> sure we're not the only ones having a lot of MultiFieldQueryParser
> instantiations in our Search code to deal with this. And I'm not even
> talking about the numerous times when one of our developers used the
> DSL without even thinking it would use the OR operator.
>
> Comments welcome.
>
> --
> Guillaume