
Change By:	Yoann Rodière

Currently, we implement projections by adding collectors next to the {{TopDocsCollector}}.

The problem with this strategy is that these collectors are then applied to every matching document in the index, not just to the top docs.

It's not even limited to the competitive documents (those whose score is higher than that of the lowest document in the priority queue when they are visited): as we can see in {{MultiCollector}}, joining multiple collectors together disables the score-based optimizations that would allow skipping some documents along the way.

As a result, the distance collector, for example, needs to store in memory as many results as the total number of matching documents in the index, regardless of the limit passed to {{fetch(limit)}}. Which is ridiculous.
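
To make the cost concrete, here is a minimal sketch of the single-pass strategy. It is a simplified model in plain Java, not Lucene code: {{SinglePassSketch}}, {{SideCollector}} and {{storedValues}} are hypothetical names, and the arrays stand in for the matching documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Simplified model of the single-pass strategy; these are NOT Lucene classes.
final class SinglePassSketch {

    /** Hypothetical stand-in for a collector placed next to the top-docs collector. */
    static final class SideCollector {
        final List<Double> values = new ArrayList<>();

        void collect(double value) {
            values.add(value);
        }
    }

    /**
     * Runs one pass over all matching documents, where scores[i] and distances[i]
     * describe matching document i. Returns how many values the side collector holds.
     */
    static int storedValues(float[] scores, double[] distances, int limit) {
        // Min-heap holding at most `limit` scores, as a top-docs collector would.
        PriorityQueue<Float> topScores = new PriorityQueue<>();
        SideCollector side = new SideCollector();
        for (int doc = 0; doc < scores.length; doc++) {
            topScores.add(scores[doc]);
            if (topScores.size() > limit) {
                topScores.poll(); // evict the lowest score
            }
            // The side collector cannot know yet whether this document will end up
            // in the top docs, so it keeps a value for every matching document.
            side.collect(distances[doc]);
        }
        return side.values.size();
    }
}
```

In this model the top-docs heap stays bounded by {{limit}}, while the side collector grows with the number of matching documents.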

Worse, starting with recent changes, the default projection that only retrieves document IDs will always add a collector next to the {{TopDocsCollector}}... and this collector will always build a list as large as the total number of matching documents in the index.

For. Each. Single. Search.

We should switch to a two-phase approach:

# First phase: {{search.search()}} call that inspects all documents to collect the top docs and their score (TopDocsCollector), and if necessary the aggregations (FacetsCollector).
# Second phase: explicit collection that inspects only the top docs to extract data from docvalues (DistanceCollector) or from storage ({{reader.document(...)}} using the StoredFieldVisitor; the javadoc of Collector advises against this, but in this case it would be fine). Maybe we can use collectors for this phase, but a different abstraction would be fine too, since we do not need to perform a search, only to inspect a pre-determined set of documents.
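
The two phases above can be sketched as follows. This is only an illustration under assumptions, written in plain Java without any Lucene API: {{TwoPhaseSketch}}, {{topDocs}} and {{extract}} are hypothetical names, the score array stands in for the matching documents, and {{valueSource}} stands in for whatever reads docvalues or stored fields.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.function.IntToDoubleFunction;

// Simplified model of the proposed two-phase approach; these are NOT Lucene classes.
final class TwoPhaseSketch {

    /** Phase 1: inspect every matching doc, keep only the ids of the `limit` best-scoring ones. */
    static int[] topDocs(float[] scores, int limit) {
        // Min-heap on score: the head is the weakest document currently retained.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Integer doc) -> scores[doc]));
        for (int doc = 0; doc < scores.length; doc++) {
            heap.add(doc);
            if (heap.size() > limit) {
                heap.poll(); // drop the weakest document
            }
        }
        // Drain the heap from weakest to best, filling the result from the end,
        // so that the returned ids are ordered best score first.
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();
        }
        return result;
    }

    /** Phase 2: load values only for the documents selected in phase 1. */
    static double[] extract(int[] topDocs, IntToDoubleFunction valueSource) {
        double[] values = new double[topDocs.length];
        for (int i = 0; i < topDocs.length; i++) {
            values[i] = valueSource.applyAsDouble(topDocs[i]);
        }
        return values;
    }
}
```

With this split, the number of values loaded in the second phase is bounded by the limit rather than by the number of matching documents.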

Note that solving this ticket should fix HSEARCH-3786.
