[Search] The case against searching with Criteria + restrictions
by Guillaume Smet
Hi,
= Context =
So, my patch here [1] broke a test which checks that Criteria +
restrictions mostly work - even if it's documented as not supported
and not working.
"Mostly" as in "you can't get the result size but you might get the
results". See [2] for explanations.
I spent some time yesterday contemplating this issue and, while I'm
sorry for breaking this test, I still think we should apply my patch,
remove this test and make this case not supported for good.
= Why it mostly works =
In the original ObjectLoaderHelper implementation, we use
session.load: it doesn't force the proxy to be initialized. If a proxy
for an entity isn't initialized, it's filtered out from the results.
It's the job of the various implementations of ObjectsInitializer to
initialize the objects in the session so that they are later included
in the results.
In the case of Criteria + restrictions, the restrictions are applied
in the ObjectsInitializer, so the entities which don't satisfy the
restrictions are not initialized there... and thus not included in
the results.
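To make the mechanism concrete, here is a minimal, self-contained Java sketch of that behaviour. The Proxy class, the initializer and the filter are simplified stand-ins for the real Hibernate machinery, not the actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the original ObjectLoaderHelper behaviour:
// session.load() hands back lazy proxies, an ObjectsInitializer decides
// which ones to initialize, and the loader then drops the rest.
public class ProxyFilterSketch {

    // Stand-in for a Hibernate proxy: it knows whether it was initialized.
    static final class Proxy {
        final int id;
        boolean initialized;
        Proxy(int id) { this.id = id; }
    }

    // Stand-in for an ObjectsInitializer applying a Criteria restriction:
    // only entities satisfying the restriction get initialized.
    static void initializeMatching(List<Proxy> proxies, int allowedId) {
        for (Proxy p : proxies) {
            if (p.id == allowedId) {
                p.initialized = true;
            }
        }
    }

    // The filtering the loader used to do: drop uninitialized proxies.
    static List<Proxy> filterUninitialized(List<Proxy> proxies) {
        List<Proxy> results = new ArrayList<>();
        for (Proxy p : proxies) {
            if (p.initialized) {
                results.add(p);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Proxy> proxies = new ArrayList<>();
        proxies.add(new Proxy(1));
        proxies.add(new Proxy(2));
        initializeMatching(proxies, 1); // restriction: only id == 1 matches
        // Entity 2 was never initialized, so it silently drops out.
        System.out.println(filterUninitialized(proxies).size()); // 1
    }
}
```

This is exactly why loading the "excluded" entity in the session by any other means breaks the filtering, as shown in the next section.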
= Why my patch is breaking this behaviour consistently =
In my patch, I use Session.get, which forces the initialization of the
proxy, and I removed the filter that discarded the uninitialized
proxies: it became unnecessary since all proxies are now guaranteed to
be initialized.
This patch has been designed to solve HSEARCH-1448 and to simplify the
ObjectLoaderHelper code which was quite complicated.
Situation after my patch: all the results satisfying the full text
search are returned. The restrictions of the criteria are not taken
into account. In fact, it works as documented.
= Relying on the session state to filter out entities is wrong =
So the fact is that we basically rely on the session state to filter
out the results we don't want.
I had to check that my gut feeling was right so I checked out current
master, opened ResultSizeOnCriteriaTest and just added the following
lines before the search call:
//load in the session the object which shouldn't be returned just for fun
session.get( Tractor.class, 2 );
-> the object is returned and the test is broken too. This is expected
behaviour as this object has been initialized in the session and is
now considered as a valid candidate for the results.
= Conclusion =
I don't think we can have a properly working Criteria + restrictions
feature without refactoring a lot of things in Search: the Initializer +
Loader concept can't work reliably in this case.
Therefore I think we should simply remove this test and make this case
fail clearly, as it can be a potential security flaw if we return
entities the user shouldn't see just because they were initialized in
the session for another purpose.
We might revisit it later but I really think it's a lot of work to get it right.
Thoughts?
References
[1] https://github.com/hibernate/hibernate-search/pull/581
[2] https://hibernate.atlassian.net/browse/HSEARCH-753
Search: changing the way we search
by Guillaume Smet
Hi,
So, it's been a long time since I threw out the first idea of this (see
HSEARCH-917) but, after a lot more thought, and given that I've
basically been stuck on this one for a long time, it's probably better
to agree on a plan before putting together some code.
Note that this plan is based on our usage of Hibernate Search in a lot
of applications over several years, and I think our usage pattern is
quite common. Even so, I'm pretty sure there are other interesting
search patterns out there, and it would be nice to extend this proposal
to cover them if they don't fit.
I. How do we search at my company?
-------------------------------------------------------
We mainly use Search for 2 things:
- autocompletion;
- search engines: search form to filter a list of items. Usually, a
plain text field and several structured fields (drop down choice
mostly).
We usually sort with business rules, not using score. Users usually
like it better as it's more predictable. For example, we sort our
autocompletion results alphabetically. An interesting note here is
probably that we work on structured data, not on CMS content. This
might be considered a detail but you'll see it's important.
We use analyzers to:
- split the words (the WordDelimiterFilter - yeah, I have a Solr background :));
- filter the input (AsciiFoldingFilter, LowercaseFilter...);
- eventually do simple stemming (with our own very minimal stemmers).
We sometimes use Search to find the elements to apply business rules
when it's really hard to use the database to do so. Search provides a
convenient way to denormalize the data.
II. On why we can't use the DSL out of the box
--------------------------------------------------------------------
The Hibernate Search DSL is great and I must admit this is the DSL
which taught me how to build DSLs for our own usage. It's intuitive,
well thought out, definitely a nice piece of code.
So, why don't we use it for our plain text queries? (Disclaimer: we
use it under the hood, we just have to do a lot of things manually
outside of the DSL)
Several reasons:
1/ the aforementioned detail about sorting: we need AND badly in plain
text search;
2/ we often need to add a clause only if the text isn't empty or the
object isn't null, and we then need more logic than the fluent
approach allows (I don't have any ideas/proposals for this one but
I think it's worth mentioning).
And why is it not ideal:
3/ wildcard and analyzers are really a pain with Lucene and you need
to implement your own cleaning stuff to get a working wildcard query.
1/ is definitely our biggest problem.
III. So let's add an AND option...
-----------------------------------------------
Yeah, well, that's not so easy.
Let's take a look at the code, especially our dear friend
ConnectedMultiFieldsTermQueryBuilder.
When I started to look at HSEARCH-917, I thought it would be quite
easy to build lucene queries using a Lucene QueryParser instead of all
the machinery in ConnectedMultiFieldsTermQueryBuilder. It's not.
Here are pointers to the main problems I have:
1/ getAllTermsFromText is cute when you want to OR the terms but
really bad when you need AND, especially when you use analyzers which
return several tokens for a term (this is the case when you use the
SynonymFilter or the WordDelimiterFilter);
2/ the fieldBridge thing is quite painful for plain text search as we
are not sure that all the fields have the same fieldBridge and, thus,
the search terms might be different for each field after applying the
fieldBridge.
These problems are not so easy to solve in an absolute kind of way.
That's why I haven't made any progress on this problem.
Let's illustrate the problem:
- you search for "several words in my content" (without the quotes:
it's not a phrase query, just terms);
- you search in the fields title, summary and content so you expect to
find at least one occurrence of each term in one of these fields;
- for some reason, you have a different fieldBridge on one of the
fields and it's quite hard to define "at least one occurrence of each
term in one of these fields" as the fieldBridge might transform the
text.
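For reference, when no field bridge gets in the way, the expected AND semantics is a MUST clause per term wrapping a SHOULD across the fields. Here is a self-contained sketch that builds such a query as a plain Lucene-syntax string; the field and term names are illustrative, and a real implementation would build a BooleanQuery instead:

```java
// Sketch of "at least one occurrence of each term in one of these fields":
// one MUST (+) group per term, each group OR-ing the term over all fields.
// Field bridges break this because the term text may differ per field.
public class AndAcrossFieldsSketch {

    static String buildQuery(String[] terms, String[] fields) {
        StringBuilder query = new StringBuilder();
        for (String term : terms) {
            query.append("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) query.append(" ");
                query.append(fields[i]).append(":").append(term);
            }
            query.append(") ");
        }
        return query.toString().trim();
    }

    public static void main(String[] args) {
        String[] terms = { "several", "words" };
        String[] fields = { "title", "summary", "content" };
        System.out.println(buildQuery(terms, fields));
        // +(title:several summary:several content:several) +(title:words summary:words content:words)
    }
}
```

As soon as one field's bridge rewrites the term text, the inner SHOULD group can no longer use a single term string per field, which is the crux of the problem described above.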
My point is that I don't see a way to fix the current DSL without
breaking some cases (note that the current code only works because
only the OR operator is supported) even if we might consider they are
weird.
From my perspective, a plainText branch of the DSL could ignore the
fieldBridge machinery but I'm not sure it's a good idea. That's why I
would like some feedback about this before moving in this direction.
I took a look at the new features of Lucene 4.7 and the new
SimpleQueryParser looks kinda interesting as it's really simple and
could be a good starting point to come up with a QueryParser which
simply does the job for our plain text search queries.
IV. About wildcard queries
--------------------------------------
Let's say it frankly: wildcard queries are a pain in Lucene.
Let's take an example:
- You index "Parking" and you have a LowerCaseFilter so your index
contains "parking";
- You search for Parking without wildcard, it will work;
- You search for Parki* with wildcard: yeah, it won't work.
This is due to the fact that analyzers are ignored for wildcard
queries. The rationale is that if the ? or * characters went through
the analysis chain, they could be altered or removed by the filters
you use in your analyzers.
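The workaround is query-side normalization: reproduce by hand what the index-time filters did (lowercasing, accent folding) before building the wildcard query. A minimal sketch, assuming an index built with a LowerCaseFilter and ASCII folding; the method name is hypothetical:

```java
import java.text.Normalizer;

// Manually normalize user input for a wildcard query, since the
// analyzer won't be applied to it: fold accents and lowercase,
// while leaving the * and ? wildcard characters untouched.
public class WildcardNormalizeSketch {

    static String normalizeForWildcard(String input) {
        String folded = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", ""); // strip combining accent marks
        return folded.toLowerCase();       // mirror the LowerCaseFilter
    }

    public static void main(String[] args) {
        System.out.println(normalizeForWildcard("Parki*")); // parki*
        System.out.println(normalizeForWildcard("Élé*"));   // ele*
    }
}
```

This only covers simple character-level filters; anything token-level (word splitting, synonyms, stemming) can't be reproduced this way, which is why a dedicated wildcard analyzer, as suggested below, would be cleaner.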
While we all understand the Lucene point of view from a technical
perspective, I don't think we can keep this position for Hibernate
Search as a user friendly search framework on top of Hibernate.
At Open Wide, we have a quite complex method which rewrites a search
as a working autocompletion search which might work most of the time
(with a high value of most...). It's kinda ugly, far from perfect and
I'm wondering if we could have something more clever in Search. I once
talked with Emmanuel about having different analyzers for Indexing,
Querying (this is the Solr way) and Wildcards/Fuzzy search (this is
IMHO a good idea as the way you want to normalize your wildcard query
highly depends on the analyzer used to index your data).
V. The "don't add this clause if null/empty" problem
----------------------------------------------------------------------------
Ideas welcome!
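As a starting point for discussion, this is roughly what we do by hand today: collect clauses and only add one when the input is actually present. The clause representation below is just strings for illustration, not a real Lucene Query, and the field names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of conditional clause building outside the fluent DSL:
// a clause is only added when the corresponding user input exists.
public class ConditionalClausesSketch {

    static List<String> buildClauses(String text, String category) {
        List<String> clauses = new ArrayList<>();
        if (text != null && !text.trim().isEmpty()) {
            clauses.add("content:" + text.trim());
        }
        if (category != null) {
            clauses.add("category:" + category);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Blank text: only the structured clause survives.
        System.out.println(buildClauses("  ", "tractor")); // [category:tractor]
    }
}
```

A fluent chain can't easily express this "maybe add a clause" step, which is why we end up assembling the junction imperatively and only then handing it to the query builder.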
VI. Provide not so bad default analyzers
---------------------------------------------------------
I think it would be nice to provide default analyzers for plain text.
Not necessarily ones including complex/debatable things like stemmers,
but at least something which gives a good taste of Search before going
into more details.
Why would it be interesting? As a French speaking person, I see so
many search engines out there which don't normalize accented
characters; it would be nice to have something working by default.
VII. Conclusion
----------------------
I would really like to make some quick progress on III. I'm pretty
sure we're not the only ones having a lot of MultiFieldQueryParser
instantiations in our Search code to deal with this. And I'm not even
talking about the numerous times one of our developers used the DSL
without realizing it would use the OR operator.
Comments welcome.
--
Guillaume
jsr107
by Alex Snaps
Hey everyone,
I wondered if anyone had considered (even the feasibility of) moving the
Caching SPI of Hibernate to use the (now released!) jcache API of JSR107?
I was contemplating having a look at providing a "jsr107 caching
provider" first, which could then maybe be folded into Hibernate...
Anyways, random thoughts, maybe some of you already have insights.
Also, I'd expect that (some) "cache vendors" might still want to do
some tuning based on the Hibernate use case, so maybe the idea isn't
such a great one (if even feasible, as I haven't looked into that
yet). Anyways... further random thoughts on the subject welcome, even
non-random ones actually.
Alex
--
Alex Snaps <alex.snaps(a)gmail.com>
Principal Software Engineer - Terracotta
http://twitter.com/alexsnaps
http://www.linkedin.com/in/alexsnaps
http://withinthefra.me
Re: [hibernate-dev] Lucene moving to Java7
by Sanne Grinovero
Note this would affect only our upcoming Hibernate Search 5.0: it's a
major release which breaks some backwards compatibility anyway. I
guess that blasts any remaining concern?
For WFK users in maintenance mode, I expect them to stay on the
previous Search version, to which we'll backport fixes as usual.
But I also expect we'll eventually want to provide a "new" version to
deliver the goodies of EE7, JPA 2.1, etc.. which all require JDK7
anyway (in the scope of WFK or anything else coming our way).
Thanks all for the comments
Sanne
On 20 March 2014 16:04, Burr Sutter <bsutter(a)redhat.com> wrote:
> Adding the WFK Mareks :-)
>
> The only potential problem that I see is backward incompatibility with WFK 2.0.0 and its supported frameworks through June 2015.
> We do not require JVM upgrades, in production, for customers, within the "supported time window" - in our WFK case June 2012 to June 2015.
>
>
> On Mar 20, 2014, at 11:21 AM, Sanne Grinovero <sanne(a)hibernate.org> wrote:
>
>> The next minor release of Apache Lucene v. 4.8 will require Java7.
>>
>> The Lucene team has highlighted many good reasons for that, including
>> some excellent improvements in sorting performance and reliability of
>> IO operations: nice things we'd like to take advantage of.
>>
>> Objections against baselining Hibernate Search 5 to *require* Java7 too?
>> We hardly have a choice, so objections better be good ;-)
>>
>> -- Sanne
>
Re: [hibernate-dev] ci.hibernate.org and network port assignment
by Paolo Antinori
hi everyone,
I'll be happy to help with the activity of isolating build jobs in
Docker containers started directly via Jenkins.
The technology should allow concurrent build jobs to run totally
isolated, as anticipated.
I am going to start with OGM, which is the project I am most familiar
with, and I will let you know of any progress.
paolo
I've created WEBSITE-178 [1] as once again we had testsuites failing
because of a network port being used by a different job; bad luck, but
we can do better.
Assuming we want to use the Jenkins plugin "port allocator", this
would however need to be consistently honored by all builds we launch:
it passes variables which we should use, but it can't enforce them
AFAIK.
Is that going to be maintainable in the long run?
An alternative would be to consistently run each and every build in
its own isolated docker container, but
a) that's not something I can setup overnight
b) we'd need to use more complex build scripts, for example the nice
Maven and Gradle integrations with the Jenkins build configuration
would be unusable.
We have quite a list of services; even ignoring the OGM exotic
databases we have at least:
- Byteman
- JGroups
- Arquillian
- WildFly
- ActiveMQ (JMS tests in Search)
- ... ?
To address the urgent need, I'm going to disable parallelism for all
builds. Let's hope ORM's master doesn't get stuck, as everything else
would be blocked. I really hope this measure stays as temporary as
possible :-/
-- Sanne
DefaultLoadEventListener and second level cache
by Guillaume Smet
Hi,
We have a lot of second level cache misses on one of our applications
and I wanted to understand why that was the case. These cache misses
happen even after loading the exact same page twice. They come
from entities which are loaded via DefaultLoadEventListener.
I tried to debug it and was looking for the place where the entity is
put in the cache when the DefaultLoadEventListener path is used.
Could someone point me to where we put the entity in the cache so that
I can try debugging further?
Thanks in advance.
--
Guillaume