Issue Type:	Improvement
Affects Versions:	6.0.0.Beta3
Assignee:	Yoann Rodière
Components:	backend-lucene
Created:	18/Dec/2019 09:14 AM
Fix Versions:	6.0.0.Beta-backlog-high-priority
Priority:	Major
Reporter:	Yoann Rodière

Currently, we implement projections by adding collectors next to the TopDocsCollector.

The problem with this strategy is that collectors will then be applied to all documents in the index.

And this is not limited to the competitive documents (those whose score is higher than that of the lowest-scoring document in the priority queue when they are visited): as we can see in MultiCollector, joining multiple collectors together disables the score-based optimizations that would otherwise allow skipping some documents along the way.

As a result, the distance collector, for example, needs to store in memory as many results as there are documents in the index, which is ridiculous.

We should switch to a two-phase approach:

First phase: a search.search() call that inspects all documents to collect the top docs and their scores (TopDocsCollector) and, if necessary, the aggregations (FacetsCollector).
Second phase: an explicit collection that inspects only the top docs to extract data from docvalues (DistanceCollector) or from storage (reader.document(...) using the StoredFieldVisitor; even if the javadoc advises against it, it would be fine in this case). Maybe we can use collectors for this, but a different abstraction would be fine too, since we do not need to perform a search, only to inspect a pre-determined set of documents.
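
To make the intent concrete, here is a minimal sketch of the two phases in plain Java. It deliberately does not use the Lucene API: the score array stands in for the search, and CountingDocValues is a hypothetical stand-in for docvalues that simply counts reads. All class and method names below are assumptions made for illustration, not existing Hibernate Search or Lucene code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TwoPhaseProjectionSketch {

    /** Hypothetical stand-in for per-document docvalues; counts how often it is read. */
    static final class CountingDocValues {
        private final double[] values;
        int reads = 0;

        CountingDocValues(double[] values) {
            this.values = values;
        }

        double get(int docId) {
            reads++;
            return values[docId];
        }
    }

    /** Phase 1: visit every document, but retain only the ids of the k best-scoring ones. */
    static List<Integer> collectTopDocs(float[] scores, int k) {
        // Min-heap on score: the head is the weakest hit currently retained.
        PriorityQueue<Integer> queue =
                new PriorityQueue<>(Comparator.comparingDouble((Integer doc) -> scores[doc]));
        for (int doc = 0; doc < scores.length; doc++) {
            queue.add(doc);
            if (queue.size() > k) {
                queue.poll(); // evict the lowest-scoring document
            }
        }
        List<Integer> top = new ArrayList<>(queue);
        // Best score first, as a top-docs list would be ordered.
        top.sort(Comparator.comparingDouble((Integer doc) -> scores[doc]).reversed());
        return top;
    }

    /** Phase 2: read docvalues for the pre-determined top docs only. */
    static double[] projectTopDocs(List<Integer> topDocs, CountingDocValues docValues) {
        double[] projected = new double[topDocs.size()];
        for (int i = 0; i < projected.length; i++) {
            projected[i] = docValues.get(topDocs.get(i));
        }
        return projected;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.1f, 0.7f, 0.4f, 0.8f};
        CountingDocValues distances =
                new CountingDocValues(new double[] {12.0, 3.5, 40.0, 7.25, 18.0, 1.0});

        List<Integer> topDocs = collectTopDocs(scores, 2);
        double[] projected = projectTopDocs(topDocs, distances);

        System.out.println("top docs: " + topDocs);
        System.out.println("projected values: " + projected.length);
        System.out.println("docvalue reads: " + distances.reads);
    }
}
```

In this sketch the second phase performs one docvalue read per top doc, and the projected array has k entries, instead of one read and one entry per document in the index.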

Note that solving this ticket should fix HSEARCH-3786 (In Progress).
