I dug a bit into the Elasticsearch implementation, and it seems they indeed use SortedNumericDocValuesField when indexing:
@Override
public List<Field> createFields(String name, Number value,
                                boolean indexed, boolean docValued, boolean stored) {
    List<Field> fields = new ArrayList<>();
    if (indexed) {
        // points index: used for range queries
        fields.add(new LongPoint(name, value.longValue()));
    }
    if (docValued) {
        // doc values: used for sorting and aggregations; SortedNumeric
        // supports multiple values per document
        fields.add(new SortedNumericDocValuesField(name, value.longValue()));
    }
    if (stored) {
        fields.add(new StoredField(name, value.longValue()));
    }
    return fields;
}
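Nothing in that method is Elasticsearch-specific at indexing time: plain Lucene can index a multi-valued numeric field the same way, by adding several instances of the field to one document. A minimal sketch (the helper, field name and values are made up for illustration):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexWriter;

// Hypothetical helper: index a multi-valued numeric field the way the
// Elasticsearch code above does, one LongPoint + one
// SortedNumericDocValuesField instance per value.
static void addMultiValuedDoc(IndexWriter writer, String field, long... values)
        throws IOException {
    Document doc = new Document();
    for (long v : values) {
        doc.add(new LongPoint(field, v));                   // for range queries
        doc.add(new SortedNumericDocValuesField(field, v)); // for sorting/aggregations
    }
    writer.addDocument(doc);
}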
On the query side (range aggregations), the implementation seems to be entirely Elasticsearch-specific (they don't reuse any Lucene feature):
@Override
public LeafBucketCollector getLeafCollector(LeafReaderContext ctx,
        final LeafBucketCollector sub) throws IOException {
    final SortedNumericDoubleValues values = valuesSource.doubleValues(ctx);
    return new LeafBucketCollectorBase(sub, values) {
        @Override
        public void collect(int doc, long bucket) throws IOException {
            values.setDocument(doc);
            final int valuesCount = values.count();
            for (int i = 0, lo = 0; i < valuesCount; ++i) {
                final double value = values.valueAt(i);
                lo = collect(doc, value, bucket, lo);
            }
        }

        private int collect(int doc, double value, long owningBucketOrdinal, int lowBound) throws IOException {
            // binary search over the sorted ranges for some matching range
            int lo = lowBound, hi = ranges.length - 1;
            int mid = (lo + hi) >>> 1;
            while (lo <= hi) {
                if (value < ranges[mid].from) {
                    hi = mid - 1;
                } else if (value >= maxTo[mid]) {
                    lo = mid + 1;
                } else {
                    break;
                }
                mid = (lo + hi) >>> 1;
            }
            if (lo > hi) return lo; // no matching range

            // binary search for the first range that might match
            int startLo = lo, startHi = mid;
            while (startLo <= startHi) {
                final int startMid = (startLo + startHi) >>> 1;
                if (value >= maxTo[startMid]) {
                    startLo = startMid + 1;
                } else {
                    startHi = startMid - 1;
                }
            }

            // binary search for the last range that might match
            int endLo = mid, endHi = hi;
            while (endLo <= endHi) {
                final int endMid = (endLo + endHi) >>> 1;
                if (value < ranges[endMid].from) {
                    endHi = endMid - 1;
                } else {
                    endLo = endMid + 1;
                }
            }

            assert startLo == lowBound || value >= maxTo[startLo - 1];
            assert endHi == ranges.length - 1 || value < ranges[endHi + 1].from;

            // collect the value into every range it falls into
            for (int i = startLo; i <= endHi; ++i) {
                if (ranges[i].matches(value)) {
                    collectBucket(sub, doc, subBucketOrdinal(owningBucketOrdinal, i));
                }
            }
            return endHi + 1;
        }
    };
}
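To make that concrete: with plain Lucene you have to iterate the doc values yourself and place each value into the right range by hand. A rough sketch of what the bare-bones version looks like, using the same Lucene 6-era doc-values API as the snippet above (newer Lucene versions replaced it with an iterator-style API); field name and the helper itself are made up, and it does a linear scan over the ranges instead of the binary search Elasticsearch implements:

import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedNumericDocValues;

// Hypothetical sketch: count how many values of the multi-valued field
// "prices" fall into each half-open range [from[r], to[r]), for one segment.
static void countRanges(LeafReaderContext ctx, long[] from, long[] to,
                        long[] counts) throws IOException {
    SortedNumericDocValues values = DocValues.getSortedNumeric(ctx.reader(), "prices");
    int maxDoc = ctx.reader().maxDoc();
    for (int doc = 0; doc < maxDoc; doc++) {
        values.setDocument(doc);
        int count = values.count();
        for (int i = 0; i < count; i++) {
            long v = values.valueAt(i);
            for (int r = 0; r < from.length; r++) { // linear scan; ES uses binary search
                if (v >= from[r] && v < to[r]) {
                    counts[r]++;
                }
            }
        }
    }
}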
Solr, on the other hand, seems to use SortedSetDocValues for numeric multi-valued fields, storing the numeric values as BytesRefs. I only checked the faceting query part, which looks like this:
private void doCount() throws IOException {
    // single-valued numeric fields get a dedicated path; multi-valued
    // numeric fields fall through to the string (SortedSet) path
    if (schemaField.getType().getNumericType() != null && !schemaField.multiValued()) {
        getCountNumeric();
    } else {
        getCountString();
    }
}

private void getCountString() throws IOException {
    Filter filter = docs.getTopFilter();
    List<LeafReaderContext> leaves = searcher.getTopReaderContext().leaves();
    for (int subIndex = 0; subIndex < leaves.size(); subIndex++) {
        LeafReaderContext leaf = leaves.get(subIndex);
        DocIdSet dis = filter.getDocIdSet(leaf, null);
        if (dis == null) {
            continue;
        }
        DocIdSetIterator disi = dis.iterator();
        if (disi != null) {
            if (schemaField.multiValued()) {
                SortedSetDocValues sub = leaf.reader().getSortedSetDocValues(schemaField.getName());
                if (sub == null) {
                    continue;
                }
                final SortedDocValues singleton = DocValues.unwrapSingleton(sub);
                if (singleton != null) {
                    // multi-valued in the schema, but this segment happens
                    // to hold only single values
                    accumIntervalsSingle(singleton, disi, dis.bits());
                } else {
                    accumIntervalsMulti(sub, disi, dis.bits());
                }
            } else {
                SortedDocValues sub = leaf.reader().getSortedDocValues(schemaField.getName());
                if (sub == null) {
                    continue;
                }
                accumIntervalsSingle(sub, disi, dis.bits());
            }
        }
    }
}
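Note how multi-valued fields are handled here: getCountString() grabs the segment's SortedSetDocValues and hands them to accumIntervalsMulti(). Consuming such a field means walking per-document ordinals and looking each one up, roughly like this sketch (Lucene 6-era API; decodeNumeric() is hypothetical, since turning the BytesRef back into a number depends entirely on how the field encoded it):

import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// Hypothetical sketch: visit every value of a multi-valued SortedSet field
// for one document.
static void visitValues(LeafReader reader, String field, int doc) throws IOException {
    SortedSetDocValues dv = DocValues.getSortedSet(reader, field);
    dv.setDocument(doc);
    for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
        BytesRef bytes = dv.lookupOrd(ord);    // ordinal -> term bytes
        // long value = decodeNumeric(bytes);  // hypothetical, encoding-specific
    }
}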
The full code is in IntervalFacets.java; it's a bit dense, but it does seem designed to work with numeric values as well. All of this tends to prove that Lucene by itself is not capable of doing what we want without some additional work. Judging by the amount of attention the Lucene ticket (https://issues.apache.org/jira/browse/LUCENE-7044) has received, I'd say we'll have to do it ourselves.