Currently, we implement projections by adding collectors next to the {{TopDocsCollector}}.
The problem with this strategy is that collectors will then be applied to all documents in the index.
It's not even just the competitive documents (those that have a score higher than the lowest-scoring document in the priority queue when they are visited): as we can see in {{MultiCollector}}, joining multiple collectors together will disable score-based optimizations that would allow skipping some documents along the way.
As a result, the distance collector, for example, will need to store in memory as many results as the total number of matching documents in the index, regardless of the limit passed to {{fetch(limit)}}. Which is ridiculous.
Worse, starting with recent changes, the default projection that only retrieves document IDs will always add a collector next to the {{TopDocsCollector}}... and this collector will always build a list as large as the total number of matching documents in the index.
For. Each. Single. Search.
We should switch to a two-phase approach:
# First phase: {{search.search()}} call that inspects all documents to collect the top docs and their score ({{TopDocsCollector}}), and if necessary the aggregations ({{FacetsCollector}}).
# Second phase: explicit collection that inspects only top docs to extract data from docvalues ({{DistanceCollector}}) or from storage ({{reader.document(...)}} using the {{StoredFieldVisitor}}: even if the javadoc of {{Collector}} advises against it, in this case it would be fine). Maybe we can use collectors, but a different abstraction would be fine, since we do not need to perform a search, but rather to inspect a pre-determined set of documents.
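
To make the difference in memory usage concrete, here is a minimal sketch of the two phases as a plain-Java model. It deliberately does not use the Lucene API: {{collectTopDocs}}, {{extractProjections}} and the arrays standing in for scores and docvalues are hypothetical names introduced only for illustration, and a real implementation would likely look different.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntFunction;

// Plain-Java model of the two-phase approach; NOT the Lucene API.
// All names below are hypothetical and chosen for illustration only.
final class TwoPhaseProjectionSketch {

    // Phase 1: inspect every matching document, but retain only `limit` doc IDs.
    // scoresByDocId[i] stands in for the score of matching document i.
    static List<Integer> collectTopDocs(double[] scoresByDocId, int limit) {
        Comparator<Integer> byScore = Comparator.comparingDouble(docId -> scoresByDocId[docId]);
        // Min-heap: the head is the least competitive hit retained so far.
        PriorityQueue<Integer> queue = new PriorityQueue<>(byScore);
        for (int docId = 0; docId < scoresByDocId.length; docId++) {
            queue.add(docId);
            if (queue.size() > limit) {
                queue.poll(); // evict the lowest score so memory stays O(limit)
            }
        }
        List<Integer> topDocIds = new ArrayList<>(queue);
        topDocIds.sort(byScore.reversed()); // best score first
        return topDocIds;
    }

    // Phase 2: no search involved, just visit the pre-determined top docs
    // and load one projected value for each of them.
    static <T> List<T> extractProjections(List<Integer> topDocIds, IntFunction<T> valueLoader) {
        List<T> values = new ArrayList<>(topDocIds.size());
        for (int docId : topDocIds) {
            values.add(valueLoader.apply(docId));
        }
        return values;
    }

    public static void main(String[] args) {
        double[] scores = {0.3, 0.9, 0.1, 0.7, 0.5, 0.2};       // six matching documents
        double[] distances = {12.0, 3.5, 40.0, 8.25, 1.0, 99.0}; // stand-in for docvalues
        int[] loads = {0};                                       // counts value lookups

        List<Integer> top = collectTopDocs(scores, 2);
        List<Double> projected = extractProjections(top, docId -> {
            loads[0]++;
            return distances[docId];
        });

        // prints: [1, 3] [3.5, 8.25] loads=2
        System.out.println(top + " " + projected + " loads=" + loads[0]);
    }
}
```

In this model the projected values are loaded twice for six matching documents, that is, once per requested hit rather than once per match. That is the property we want from the second phase, whatever abstraction ends up implementing it.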
Note that solving this ticket should fix HSEARCH-3786.