[JIRA] (HSEARCH-3856) Aggregations on multi-valued numeric fields for Lucene
by Yoann Rodière (JIRA)
Yoann Rodière ( https://hibernate.atlassian.net/secure/ViewProfile.jspa?accountId=557058%... ) *updated* an issue
Hibernate Search ( https://hibernate.atlassian.net/browse/HSEARCH?atlOrigin=eyJpIjoiNzljMzU4... ) / Improvement ( https://hibernate.atlassian.net/browse/HSEARCH-3856?atlOrigin=eyJpIjoiNzl... ) HSEARCH-3856 ( https://hibernate.atlassian.net/browse/HSEARCH-3856?atlOrigin=eyJpIjoiNzl... ) Aggregations on multi-valued numeric fields for Lucene ( https://hibernate.atlassian.net/browse/HSEARCH-3856?atlOrigin=eyJpIjoiNzl... )
Change By: Yoann Rodière ( https://hibernate.atlassian.net/secure/ViewProfile.jspa?accountId=557058%... )
See how {{ org.hibernate.search.integrationtest.backend.tck.search.aggregation.SingleFieldAggregationBaseIT#multiValued }} is disabled due to {{ org.hibernate.search.integrationtest.backend.lucene.testsupport.util.LuceneTckBackendFeatures#aggregationsOnMultiValuedFields }} .
Before HSEARCH-3839, we couldn't even index multiple values for numeric fields in Lucene. After HSEARCH-3839, we can, but we pick a single value when aggregating, so aggregations are still incorrect.
Ideally, when counting documents per field value, multi-valued documents should be counted once per value that appears in the field. So if a single document has values {{ 1 }} and {{ 2 }} for a single field, it should increment the count for both {{ 1 }} and {{ 2 }} . At least that's what happens on Elasticsearch.
How to test the behavior on Elasticsearch:
{code}
curl -XDELETE -H "Content-Type: application/json" localhost:9200/mytest1/ 1>&2 2>/dev/null; curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest1/\?pretty -d'{"mappings":{"properties":{"num":{"type":"integer" }}} }'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest1/_doc/1 -d'{"num":1}'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest1/_doc/2 -d'{"num":[1,2]}'
curl -XPOST -H "Content-Type: application/json" localhost:9200/mytest1/_search\?pretty -d'{"aggs":{"foo":{"terms":{"field":"num" }}} }'
{code}
Result:
{noformat}
{
...
"aggregations" : {
"foo" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 2
},
{
"key" : 2,
"doc_count" : 1
}
]
}
}
}
{noformat}
So document 2 was counted twice.
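To make the difference concrete, here is a minimal Python sketch of the two counting strategies, using the same two documents as the curl commands above. It is not Hibernate Search or Lucene code. The function names are hypothetical, `pick=min` is only an assumption standing in for whichever single value the Lucene backend ends up selecting, and I am assuming a value repeated inside one document counts once.

```python
from collections import Counter

# Same two documents as in the curl commands: doc 1 has num=1, doc 2 has num=[1, 2].
docs = {1: [1], 2: [1, 2]}


def terms_counts_all_values(docs):
    """Count each document once per distinct value it holds (Elasticsearch-like)."""
    counts = Counter()
    for values in docs.values():
        # ASSUMPTION: a value repeated within one document is counted once.
        counts.update(set(values))
    return dict(counts)


def terms_counts_single_value(docs, pick=min):
    """Count each document exactly once, under a single picked value.

    ASSUMPTION: `pick` is a stand-in for whatever value the backend selects.
    """
    counts = Counter(pick(values) for values in docs.values())
    return dict(counts)


print(terms_counts_all_values(docs))    # {1: 2, 2: 1}
print(terms_counts_single_value(docs))  # {1: 2}
```

With per-value counting the bucket for `2` exists and the counts add up to 3 for 2 documents. With single-value picking the bucket for `2` is lost entirely.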
Get Jira notifications on your phone! Download the Jira Cloud app for Android ( https://play.google.com/store/apps/details?id=com.atlassian.android.jira.... ) or iOS ( https://itunes.apple.com/app/apple-store/id1006972087?pt=696495&ct=EmailN... ) This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100121- sha1:b4d24b6 )
4 years, 10 months
[JIRA] (HSEARCH-3856) Aggregations on multi-valued numeric fields for Lucene
by Yoann Rodière (JIRA)
Yoann Rodière ( https://hibernate.atlassian.net/secure/ViewProfile.jspa?accountId=557058%... ) *commented* on HSEARCH-3856 ( https://hibernate.atlassian.net/browse/HSEARCH-3856?atlOrigin=eyJpIjoiZjU... )
Re: Aggregations on multi-valued numeric fields for Lucene ( https://hibernate.atlassian.net/browse/HSEARCH-3856?atlOrigin=eyJpIjoiZjU... )
> Now there are actually four aggregation options for nested documents and
> five options for flat documents. But you can add "none", or if you don't
> set it possible, all fields can be agitated without performing linking
> functions on them.
Yes, that's the plan. By default, I don't think we should apply "per-document aggregations" (sum, avg, lowest, etc.) in aggregations, so as to behave consistently:
* Between string aggregations and numeric aggregations: we can't sum/avg/... for strings, and lowest/highest don't make much sense for terms found in text.
* Between Lucene numeric aggregations and Elasticsearch numeric aggregations: Elasticsearch takes into account all values by default, not the sum/avg/lowest/etc.
Also, I don't think we can request per-document sum/avg/lowest/etc. for numeric terms/range aggregations in Elasticsearch, so we can't expose the feature in generic APIs that both Elasticsearch and Lucene must implement. We could move it to Lucene-specific APIs, I suppose, but there isn't really a use case, is there? You just implemented this so that aggregations would somehow work on multi-valued fields?
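As an illustration of what "per-document aggregations" would change, here is a rough Python sketch of a range aggregation with and without such a mode. Everything in it (the `MODES` table, `range_counts`, the sample data) is hypothetical and only models the semantics discussed here. It does not reflect the actual Lucene or Elasticsearch implementation.

```python
from statistics import mean

# Hypothetical per-document modes that reduce a multi-valued field to one number.
MODES = {"sum": sum, "avg": mean, "lowest": min, "highest": max}


def range_counts(docs, ranges, mode=None):
    """Count documents per [low, high) range.

    mode=None: a document counts in every range containing at least one of its values.
    mode="sum"/"avg"/...: the document is first reduced to one value, then counted once at most.
    """
    counts = {r: 0 for r in ranges}
    for values in docs.values():
        candidates = values if mode is None else [MODES[mode](values)]
        for low, high in ranges:
            if any(low <= v < high for v in candidates):
                counts[(low, high)] += 1
    return counts


docs = {1: [1], 2: [1, 2], 3: [2, 9]}
ranges = [(0, 2), (2, 5), (5, 10)]

print(range_counts(docs, ranges))         # {(0, 2): 2, (2, 5): 2, (5, 10): 1}
print(range_counts(docs, ranges, "avg"))  # {(0, 2): 2, (2, 5): 0, (5, 10): 1}
```

Without a mode, the three documents produce five bucket hits. With `avg`, each document lands in at most one bucket, and the `(2, 5)` bucket becomes empty even though two documents hold the value `2`.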
> You can practically set the sorting option to none. But it would have to
> return as many document repetitions as the nested or duplicate values in
> the flat model field.
Yes, some documents would be counted multiple times. That's what Elasticsearch does by default, and I think it's a decent default.
> Especially if paging is used.
Paging is not relevant for aggregations, which are applied on the whole index.
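A small Python sketch of that point, under the assumption that every document matches the query: the `search` helper below is hypothetical (not a Hibernate Search or Elasticsearch API) and simply shows that the page of hits depends on offset and limit while the aggregation does not.

```python
from collections import Counter


def search(docs, offset, limit):
    """Return one page of hit IDs plus term counts computed over all matching documents.

    Hypothetical helper for illustration only; here every document matches.
    """
    ids = sorted(docs)
    page = ids[offset:offset + limit]
    counts = Counter()
    for doc_id in ids:  # every match, not just the current page
        counts.update(set(docs[doc_id]))
    return page, dict(counts)


docs = {1: [1], 2: [1, 2], 3: [2]}
page1, aggs1 = search(docs, offset=0, limit=2)
page2, aggs2 = search(docs, offset=2, limit=2)

print(page1, page2)    # [1, 2] [3]
print(aggs1 == aggs2)  # True
```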
I don't think performance is an issue here, if that's what you're suggesting. The problem is more that we have to move away from our "legacy" implementation of aggregations that relied on Lucene's faceting.
Anyway, this is all something I'm suggesting to do as a second step. After your work, sorts on multi-valued fields work correctly, and aggregations on multi-valued fields work correctly as long as there is effectively only one value per document (which will probably be the case once you add filtering anyway).
4 years, 10 months