Hi,
So, it's been a long time since I first floated this idea (see
HSEARCH-917) but, after a lot more thought, and given that I've
basically been stuck on this one for a long time, it's probably better
to agree on a plan before putting together some code.
Note that this plan is based on our usage of Hibernate Search in a lot
of applications over several years, and I think our usage pattern is
quite common. Even so, I'm pretty sure there are other search
patterns out there which might be interesting, and it would be nice to
include them in this proposal if it doesn't already cover them.
I. How do we search at my company?
-------------------------------------------------------
We mainly use Search for 2 things:
- autocompletion;
- search engines: search form to filter a list of items. Usually, a
plain text field and several structured fields (drop down choice
mostly).
We usually sort with business rules, not using score. Users usually
like it better as it's more predictable. For example, we sort our
autocompletion results alphabetically. An interesting note here is
probably that we work on structured data, not on CMS content. This
might be considered a detail but you'll see it's important.
We use analyzers to:
- split the words (the WordDelimiterFilter - yeah, I have a Solr background :));
- filter the input (ASCIIFoldingFilter, LowerCaseFilter...);
- optionally do simple stemming (with our own very minimal stemmers).
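To make the list above concrete, here is a minimal plain-Java sketch of
what such a chain does to a piece of text. It doesn't use Lucene at all:
AnalysisSketch and its methods are hypothetical stand-ins for the
WordDelimiterFilter / ASCIIFoldingFilter / LowerCaseFilter combination,
and the real filters handle many more cases (stemming is left out).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analyzer chain; this is not Lucene code.
class AnalysisSketch {

    // Strip diacritics: decompose, then drop combining marks (e-acute -> e).
    static String foldAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Fold accents, split on anything that is not a letter or digit, lowercase.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : foldAccents(text).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With this sketch, a single input word such as "Wi-Fi" comes out as the
two tokens wi and fi, which matters for the AND discussion in III.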
We sometimes use Search to find the elements that business rules
apply to when it's really hard to do so with the database. Search
provides a convenient way to denormalize the data.
II. On why we can't use the DSL out of the box
--------------------------------------------------------------------
The Hibernate Search DSL is great and I must admit this is the DSL
which taught me how to build DSLs for our own usage. It's intuitive,
well thought out, definitely a nice piece of code.
So, why don't we use it for our plain text queries? (Disclaimer: we
use it under the hood, we just have to do a lot of things manually
outside of the DSL)
Several reasons:
1/ the aforementioned detail about sorting: as we don't sort by score,
loosely matching results are not pushed to the bottom of the list, so
we badly need AND in plain text search;
2/ we often need to add a clause only if the text isn't empty or the
object isn't null, and we then need more logic than the fluent
approach allows (I don't have any ideas/proposals for this one but
I think it's worth mentioning).
And why it is not ideal:
3/ wildcards and analyzers are really a pain with Lucene and you need
to implement your own cleaning code to get a working wildcard query.
1/ is definitely our biggest problem.
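To illustrate point 2/, here is a minimal sketch of the kind of guard
logic involved. Clauses are plain strings here rather than Lucene
queries, and ConditionalClauses as well as the content and category
field names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: clauses are plain strings instead of Lucene queries.
class ConditionalClauses {

    // Collects the mandatory clauses for a search form with two optional inputs.
    static List<String> build(String text, String category) {
        List<String> must = new ArrayList<>();
        // Each optional criterion needs its own guard, which breaks a fluent chain.
        if (text != null && !text.trim().isEmpty()) {
            must.add("content:" + text.trim());
        }
        if (category != null) {
            must.add("category:" + category);
        }
        return must;
    }
}
```

Every optional form field adds one more if block like these around the
query building code.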
III. So let's add an AND option...
-----------------------------------------------
Yeah, well, that's not so easy.
Let's take a look at the code, especially our dear friend
ConnectedMultiFieldsTermQueryBuilder.
When I started to look at HSEARCH-917, I thought it would be quite
easy to build Lucene queries using a Lucene QueryParser instead of all
the machinery in ConnectedMultiFieldsTermQueryBuilder. It's not.
Here are pointers to the main problems I have:
1/ getAllTermsFromText is cute when you want to OR the terms but
really bad when you need AND, especially when you use analyzers which
return several tokens for a term (this is the case when you use the
SynonymFilter or the WordDelimiterFilter);
2/ the fieldBridge thing is quite painful for plain text search as we
are not sure that all the fields have the same fieldBridge and, thus,
the search terms might be different for each field after applying the
fieldBridge.
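Point 1/ is easier to see on a small example. The sketch below is plain
Java with no Lucene dependency, and TokenGrouping is a hypothetical
helper that only writes classic Lucene query syntax. It assumes the
analyzer turned the word "car" into the two tokens car and automobile
(as a SynonymFilter might) and compares a flat AND over all tokens with
an AND over words that ORs the tokens of each word.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds query strings only, no Lucene involved.
class TokenGrouping {

    // Flat list of tokens, all required: every synonym must be present.
    static String flatAnd(String field, List<List<String>> tokensPerWord) {
        return tokensPerWord.stream()
                .flatMap(List::stream)
                .map(token -> "+" + field + ":" + token)
                .collect(Collectors.joining(" "));
    }

    // AND between words, OR between the tokens produced for one word.
    static String groupedAnd(String field, List<List<String>> tokensPerWord) {
        return tokensPerWord.stream()
                .map(tokens -> tokens.stream()
                        .map(token -> field + ":" + token)
                        .collect(Collectors.joining(" ", "+(", ")")))
                .collect(Collectors.joining(" "));
    }
}
```

For the tokens [[car, automobile], [red]] on a title field, flatAnd
gives "+title:car +title:automobile +title:red", which misses a document
that only says "red car", while groupedAnd gives
"+(title:car title:automobile) +(title:red)".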
These problems are not so easy to solve in an absolute kind of way.
That's why I haven't made any progress on this problem.
Let's illustrate the problem:
- you search for "several words in my content" (without the quotes:
it's not a phrase query, just terms);
- you search in the fields title, summary and content so you expect to
find at least one occurrence of each term in one of these fields;
- for some reason, you have a different fieldBridge on one of the
fields and it's quite hard to define "at least one occurrence of each
term in one of these fields" as the fieldBridge might transform the
text.
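To pin down what "at least one occurrence of each term in one of these
fields" means, here is a minimal sketch that writes the expected boolean
structure in classic Lucene query syntax: one required (+) group per
term, each group OR-ing the term across the fields. AndAcrossFields is a
hypothetical helper, it only builds a string, and it assumes that the
terms are already analyzed and that every field accepts the same term
text, which is exactly the assumption a per-field fieldBridge can break.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds a query string, does not touch Lucene.
class AndAcrossFields {

    // One required group per term; inside a group, any field may match.
    static String build(List<String> fields, List<String> terms) {
        return terms.stream()
                .map(term -> fields.stream()
                        .map(field -> field + ":" + term)
                        .collect(Collectors.joining(" ", "+(", ")")))
                .collect(Collectors.joining(" "));
    }
}
```

For the fields title, summary and content and the terms several and
words, this gives "+(title:several summary:several content:several)
+(title:words summary:words content:words)" on a single line.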
My point is that I don't see a way to fix the current DSL without
breaking some cases (note that the current code only works because
only the OR operator is supported), even if we might consider those
cases weird.
From my perspective, a plainText branch of the DSL could ignore the
fieldBridge machinery but I'm not sure it's a good idea. That's why I
would like some feedback about this before moving in this direction.
I took a look at the new features of Lucene 4.7 and the new
SimpleQueryParser looks kinda interesting as it's really simple and
could be a good starting point to come up with a QueryParser which
simply does the job for our plain text search queries.
IV. About wildcard queries
--------------------------------------
Let's say it frankly: wildcard queries are a pain in Lucene.
Let's take an example:
- You index "Parking" and you have a LowerCaseFilter so your index
contains "parking";
- You search for Parking without wildcard, it will work;
- You search for Parki* with wildcard, yeah, it won't work.
This is because analyzers are ignored for wildcard queries. The usual
reason is that the ? and * characters could be altered or removed by
the filters you use in your analyzers.
While we all understand the Lucene point of view from a technical
perspective, I don't think we can keep this position for Hibernate
Search as a user friendly search framework on top of Hibernate.
At Open Wide, we have quite a complex method which rewrites a search
as a working autocompletion search and works most of the time
(with a high value of most...). It's kinda ugly, far from perfect and
I'm wondering if we could have something more clever in Search. I once
talked with Emmanuel about having different analyzers for indexing,
querying (this is the Solr way) and wildcard/fuzzy search (this is
IMHO a good idea as the way you want to normalize your wildcard query
highly depends on the analyzer used to index your data).
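As an illustration of the kind of cleaning involved, here is a minimal
sketch that lowercases and accent-folds a wildcard term; neither step
touches ? or *, so the wildcards survive. This is not our actual method:
WildcardNormalizer is hypothetical, prefixMatches only simulates a
trailing-* query against one indexed term, and the normalization assumes
the index was built with lowercasing and accent folding only. With any
other analyzer the normalization would have to change, which is the
argument for a dedicated wildcard analyzer.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper; plain Java, no Lucene involved.
class WildcardNormalizer {

    // Apply to the query what the index analyzer did: fold accents, lowercase.
    static String normalize(String term) {
        String folded = Normalizer.normalize(term, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        return folded.toLowerCase(Locale.ROOT);
    }

    // Simulates a trailing-* prefix query against a single indexed term.
    static boolean prefixMatches(String indexedTerm, String query) {
        if (!query.endsWith("*")) {
            return indexedTerm.equals(query);
        }
        String prefix = query.substring(0, query.length() - 1);
        return indexedTerm.startsWith(prefix);
    }
}
```

With the indexed term parking, prefixMatches fails for the raw query
Parki* and succeeds for normalize("Parki*"), which is parki*.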
V. The "don't add this clause if null/empty" problem
----------------------------------------------------------------------------
Ideas welcome!
VI. Provide not so bad default analyzers
---------------------------------------------------------
I think it would be nice to provide default analyzers for plain text.
Not necessarily ones including complex/debatable things like stemmers,
but at least something which gives a good taste of Search before going
into more details.
Why would it be interesting? As a French-speaking person, I see so
many search engines out there which don't normalize accented
characters that it would be nice to have something working by default.
VII. Conclusion
----------------------
I would really like to make some quick progress on III. I'm pretty
sure we're not the only ones having a lot of MultiFieldQueryParser
instantiations in our Search code to deal with this. Not to mention
the numerous times when one of our developers used the DSL
without even realizing it would use the OR operator.
Comments welcome.
--
Guillaume