Guillaume Smet commented on HSEARCH-917:
----------------------------------------
Hi Emmanuel,
Thanks for your answer.
I came from the Solr world before I started working with Hibernate Search 3 years ago, and the
WordDelimiterFilterFactory is one of my favorite filters.
To answer your points:
* Results are on top *if* sorted by score. I must admit I often sort by name instead of
score (especially in autocomplete features, because it's much easier to use, but often
in regular searches too), and then the problem is that you can get far too many results.
In our case we had 1000 results for an autocomplete search sorted alphabetically (and no,
I cannot take the top ten results because I sort them by name, not by score) and it was
unusable due to the latency.
* Regarding your ngram point, I understand it, but I'm under the impression that OR
should be an option, not the default choice. I suppose it's too late for that, though.
Perhaps a .matchingAll[Terms](String text) at the same level as .matching() would be a
good idea? That way, people would see it in their IDE autocomplete choices and they would
think about what they really want to do. I don't know if it's compatible with your
vision of the DSL API.
From my experience with Hibernate Search, a lot of people are mistaken
about how matching() works, so it might be a good idea to give them a hint in the API.
I can work on a patch once we agree on how to solve this. I'd like to add edismax
support one day or another so I'm interested in working on this part of the code.
--
Guillaume
DSL API doesn't build the correct Lucene query
----------------------------------------------
Key: HSEARCH-917
URL: http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-917
Project: Hibernate Search
Issue Type: Bug
Components: query
Affects Versions: 3.4.1.Final
Reporter: Guillaume Smet
Priority: Critical
Hi,
A bit of context:
We are early adopters of Hibernate Search and we have had very few problems with it (apart
from the @IndexEmbedded problem we helped to fix in 3.4.1, no problems so far).
When the DSL API was introduced, I tried it and I found the problem I describe below. I
decided to use the QueryParser API (and the MultiFieldQueryParser API) as a workaround.
The fact is that:
* we use Hibernate Search in every application we have, now;
* the DSL API is really nice and, as we introduced QueryDSL in our application, we now
use a lot of DSL-like APIs and I would like to be able to use the Hibernate Search one too;
* I thought it was a deliberate choice but I recently found an example so weird that I
can't believe it's the intended behaviour.
So this problem isn't new: it has existed since the first version of the DSL API.
Now, the description of the problem:
* we use the following analyzer to index a field in our entity:
{code}
@AnalyzerDef(name = HibernateSearchAnalyzer.TEXT,
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = WordDelimiterFilterFactory.class, params = {
            @org.hibernate.search.annotations.Parameter(name = "generateWordParts", value = "1"),
            @org.hibernate.search.annotations.Parameter(name = "generateNumberParts", value = "1"),
            @org.hibernate.search.annotations.Parameter(name = "catenateWords", value = "0"),
            @org.hibernate.search.annotations.Parameter(name = "catenateNumbers", value = "0"),
            @org.hibernate.search.annotations.Parameter(name = "catenateAll", value = "0"),
            @org.hibernate.search.annotations.Parameter(name = "splitOnCaseChange", value = "0"),
            @org.hibernate.search.annotations.Parameter(name = "splitOnNumerics", value = "0"),
            @org.hibernate.search.annotations.Parameter(name = "preserveOriginal", value = "1")
        }),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class)
    }
),
{code}
* the content of the field is something like XXXX-AAAA-HAGYU-19910
* if you search for an exact match "XXXX-AAAA-HAGYU-19910" with the
QueryParser, you get a few results: namely the results which have all the different parts
(XXXX, AAAA, HAGYU and 19910) in any order. That's the behaviour I expect considering
my analyzer.
* if you search using the DSL API, you get ALL the results containing at least ONE token,
so A LOT of results in our case.
My expectation is that the DSL API should work as the Lucene parser works and it should
return the same results.
The problem is that in ConnectedMultiFieldsTermQueryBuilder, we don't use the
QueryParser to build the Lucene query but a getAllTermsFromText() method which uses the
analyzer to get all the terms, and from those we build an OR query.
So when I search for XXXX-AAAA-HAGYU-19910, the DSL API searches for "XXXX" OR
"AAAA" OR "HAGYU" OR "19910".
I really think it's a mistake and that we should use the *QueryParser API to build
the Lucene Query and have the correct behaviour.
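To make the difference concrete, here is a minimal, self-contained sketch of the two
behaviours described above. It is a toy model only: it uses no Lucene or Hibernate Search
classes. The class name OrVersusAndSketch, its methods and the sample documents are all
made up for illustration, and the "analyzer" just splits on whitespace and '-' and
lowercases (it ignores ASCII folding and preserveOriginal).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy model: not the Hibernate Search or Lucene implementation.
final class OrVersusAndSketch {

    // Made-up sample field values, loosely modelled on the reference format above.
    static final List<String> SAMPLE_DOCS = Arrays.asList(
            "XXXX-AAAA-HAGYU-19910",
            "XXXX-BBBB-OTHER-00001",
            "AAAA-CCCC-THING-42",
            "19910 HAGYU AAAA XXXX",
            "ZZZZ-YYYY-NOPE-7");

    // Simplified stand-in for the analyzer: whitespace split, then split on '-',
    // then lowercase. The original token is NOT preserved in this toy version.
    static Set<String> analyze(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.trim().split("\\s+")) {
            for (String part : token.toLowerCase(Locale.ROOT).split("-")) {
                if (!part.isEmpty()) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    // OR semantics: a document matches if it contains at least one query term.
    static boolean matchesAny(Set<String> docTerms, Set<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // AND semantics: a document matches only if it contains every query term.
    static boolean matchesAll(Set<String> docTerms, Set<String> queryTerms) {
        return !queryTerms.isEmpty() && docTerms.containsAll(queryTerms);
    }

    // Counts matching documents under either semantics.
    static long count(List<String> docs, String query, boolean requireAll) {
        Set<String> queryTerms = analyze(query);
        long hits = 0;
        for (String doc : docs) {
            Set<String> docTerms = analyze(doc);
            boolean match = requireAll
                    ? matchesAll(docTerms, queryTerms)
                    : matchesAny(docTerms, queryTerms);
            if (match) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String query = "XXXX-AAAA-HAGYU-19910";
        System.out.println("OR  hits: " + count(SAMPLE_DOCS, query, false)); // 4
        System.out.println("AND hits: " + count(SAMPLE_DOCS, query, true));  // 2
    }
}
```

With these five sample values, the OR variant returns every document sharing at least one
part with the query (4 of 5), while the AND variant returns only the two documents that
contain all four parts, in any order.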
If needed, I can provide any further information and/or a test case. I just want to be
sure you consider it a bug before working further on this. Otherwise I'll stick to
using the *QueryParser API.
Thanks for your feedback.