I dug a bit into the Elasticsearch implementation, and it seems they indeed use SortedNumericDocValuesField when indexing:
@Override
public List<Field> createFields(String name, Number value,
                                boolean indexed, boolean docValued, boolean stored) {
    List<Field> fields = new ArrayList<>();
    if (indexed) {
        // points index: used for range queries
        fields.add(new LongPoint(name, value.longValue()));
    }
    if (docValued) {
        // doc values: used for sorting and aggregations; SortedNumeric
        // supports multiple values per document
        fields.add(new SortedNumericDocValuesField(name, value.longValue()));
    }
    if (stored) {
        fields.add(new StoredField(name, value.longValue()));
    }
    return fields;
}
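Nothing in that method is Elasticsearch-specific at indexing time: plain Lucene can index a multi-valued numeric field the same way, by adding several instances of the field to one document. A minimal sketch (the helper, field name and values are made up for illustration):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexWriter;

// Hypothetical helper: index a multi-valued numeric field the way the
// Elasticsearch code above does, one LongPoint + one
// SortedNumericDocValuesField instance per value.
static void addMultiValuedDoc(IndexWriter writer, String field, long... values)
        throws IOException {
    Document doc = new Document();
    for (long v : values) {
        doc.add(new LongPoint(field, v));                   // for range queries
        doc.add(new SortedNumericDocValuesField(field, v)); // for sorting/aggregations
    }
    writer.addDocument(doc);
}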
On the query side (range aggregations), the implementation seems to be entirely Elasticsearch-specific (they don't reuse any Lucene feature):
@Override
public LeafBucketCollector getLeafCollector(LeafReaderContext ctx,
        final LeafBucketCollector sub) throws IOException {
    final SortedNumericDoubleValues values = valuesSource.doubleValues(ctx);
    return new LeafBucketCollectorBase(sub, values) {
        @Override
        public void collect(int doc, long bucket) throws IOException {
            values.setDocument(doc);
            final int valuesCount = values.count();
            for (int i = 0, lo = 0; i < valuesCount; ++i) {
                final double value = values.valueAt(i);
                lo = collect(doc, value, bucket, lo);
            }
        }

        private int collect(int doc, double value, long owningBucketOrdinal, int lowBound) throws IOException {
            // binary search over the sorted ranges for some matching range
            int lo = lowBound, hi = ranges.length - 1;
            int mid = (lo + hi) >>> 1;
            while (lo <= hi) {
                if (value < ranges[mid].from) {
                    hi = mid - 1;
                } else if (value >= maxTo[mid]) {
                    lo = mid + 1;
                } else {
                    break;
                }
                mid = (lo + hi) >>> 1;
            }
            if (lo > hi) return lo; // no matching range

            // binary search for the first range that might match
            int startLo = lo, startHi = mid;
            while (startLo <= startHi) {
                final int startMid = (startLo + startHi) >>> 1;
                if (value >= maxTo[startMid]) {
                    startLo = startMid + 1;
                } else {
                    startHi = startMid - 1;
                }
            }

            // binary search for the last range that might match
            int endLo = mid, endHi = hi;
            while (endLo <= endHi) {
                final int endMid = (endLo + endHi) >>> 1;
                if (value < ranges[endMid].from) {
                    endHi = endMid - 1;
                } else {
                    endLo = endMid + 1;
                }
            }

            assert startLo == lowBound || value >= maxTo[startLo - 1];
            assert endHi == ranges.length - 1 || value < ranges[endHi + 1].from;

            // collect the value into every range it falls into
            for (int i = startLo; i <= endHi; ++i) {
                if (ranges[i].matches(value)) {
                    collectBucket(sub, doc, subBucketOrdinal(owningBucketOrdinal, i));
                }
            }
            return endHi + 1;
        }
    };
}
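To make that concrete: with plain Lucene you have to iterate the doc values yourself and place each value into the right range by hand. A rough sketch of what the bare-bones version looks like, using the same Lucene 6-era doc-values API as the snippet above (newer Lucene versions replaced it with an iterator-style API); field name and the helper itself are made up, and it does a linear scan over the ranges instead of the binary search Elasticsearch implements:

import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedNumericDocValues;

// Hypothetical sketch: count how many values of the multi-valued field
// "prices" fall into each half-open range [from[r], to[r]), for one segment.
static void countRanges(LeafReaderContext ctx, long[] from, long[] to,
                        long[] counts) throws IOException {
    SortedNumericDocValues values = DocValues.getSortedNumeric(ctx.reader(), "prices");
    int maxDoc = ctx.reader().maxDoc();
    for (int doc = 0; doc < maxDoc; doc++) {
        values.setDocument(doc);
        int count = values.count();
        for (int i = 0; i < count; i++) {
            long v = values.valueAt(i);
            for (int r = 0; r < from.length; r++) { // linear scan; ES uses binary search
                if (v >= from[r] && v < to[r]) {
                    counts[r]++;
                }
            }
        }
    }
}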
Solr, on the other hand, seems to use SortedSetDocValues for numeric multi-valued fields, storing the numeric values as BytesRefs. I only checked the faceting query part, which looks like this:
private void doCount() throws IOException {
    // single-valued numeric fields get a dedicated path; multi-valued
    // numeric fields fall through to the string (SortedSet) path
    if (schemaField.getType().getNumericType() != null && !schemaField.multiValued()) {
        getCountNumeric();
    } else {
        getCountString();
    }
}

private void getCountString() throws IOException {
    Filter filter = docs.getTopFilter();
    List<LeafReaderContext> leaves = searcher.getTopReaderContext().leaves();
    for (int subIndex = 0; subIndex < leaves.size(); subIndex++) {
        LeafReaderContext leaf = leaves.get(subIndex);
        DocIdSet dis = filter.getDocIdSet(leaf, null);
        if (dis == null) {
            continue;
        }
        DocIdSetIterator disi = dis.iterator();
        if (disi != null) {
            if (schemaField.multiValued()) {
                SortedSetDocValues sub = leaf.reader().getSortedSetDocValues(schemaField.getName());
                if (sub == null) {
                    continue;
                }
                final SortedDocValues singleton = DocValues.unwrapSingleton(sub);
                if (singleton != null) {
                    // multi-valued in the schema, but this segment happens
                    // to hold only single values
                    accumIntervalsSingle(singleton, disi, dis.bits());
                } else {
                    accumIntervalsMulti(sub, disi, dis.bits());
                }
            } else {
                SortedDocValues sub = leaf.reader().getSortedDocValues(schemaField.getName());
                if (sub == null) {
                    continue;
                }
                accumIntervalsSingle(sub, disi, dis.bits());
            }
        }
    }
}
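Note how multi-valued fields are handled here: getCountString() grabs the segment's SortedSetDocValues and hands them to accumIntervalsMulti(). Consuming such a field means walking per-document ordinals and looking each one up, roughly like this sketch (Lucene 6-era API; decodeNumeric() is hypothetical, since turning the BytesRef back into a number depends entirely on how the field encoded it):

import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// Hypothetical sketch: visit every value of a multi-valued SortedSet field
// for one document.
static void visitValues(LeafReader reader, String field, int doc) throws IOException {
    SortedSetDocValues dv = DocValues.getSortedSet(reader, field);
    dv.setDocument(doc);
    for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
        BytesRef bytes = dv.lookupOrd(ord);    // ordinal -> term bytes
        // long value = decodeNumeric(bytes);  // hypothetical, encoding-specific
    }
}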
The full code is in IntervalFacets.java; it's a bit dense, but it does seem designed to work with numeric values as well. All of this tends to prove that Lucene by itself is not capable of doing what we want without some additional work. Judging by the amount of attention the Lucene ticket (https://issues.apache.org/jira/browse/LUCENE-7044) has received, I'd say we'll have to do it ourselves.