[infinispan-dev] Lucene 5 is coming: pitfalls to consider

Tue Jul 28 08:43:30 EDT 2015

On 28 July 2015 at 15:09, Tristan Tarrant <ttarrant at redhat.com> wrote:
> On 28/07/2015 13:13, Sanne Grinovero wrote:
>> # Sorting
>> To sort on a field will require an UninvertingReader to wrap the
>> cached IndexReaders, and the uninverting process is very inefficient.
>> On top of that, the result of the uninverting process is not
>> cacheable, so that will need to be repeated on each index, for each
>> query which is executed.
>> In short, I expect performance of sorted queries to be quite degraded
>> in our first milestone using Lucene 5, and we'll have to discuss how
>> to fix this.
>> Needless to say, fixing this is a blocking requirement before we can
>> consider the migration complete.
>>
>> Sorting will not need an UninvertingReader if the target field has
>> been indexed as DocValues, but that implies:
>>   - we'll need an explicit, upfront (indexing time) flag to be set
>>   - we'll need to detect if the matching indexing options are
>> compatible with the runtime query to skip the uninverting process
>>
>> This is mostly a job for Hibernate Search, but in terms of user
>> experience it means you have to mark fields for "sortability"
>> explicitly; will we need to extend the protobuf schema?
>>
>> Please make sure we'll just have to hook in existing metadata, we
>> can't fix this after API freeze.
>
> This is very important to get right. As a user, I'd honestly expect an
> indexed field to also be used for sorting, so probably having some
> per-index flag (off by default) which implicitly enables DocValues for
> all indexed fields.

Even in past Lucene versions it has never been that simple: for a field
to be sorted *correctly*, the tokenizer had to output a single token.
This implies that the user had to explicitly design some fields for the
"purpose of sorting".
DocValues express this as a stricter requirement: you can't encode a
multi-token value as a DocValue.

The problem is that we can't guess for which analyzers it would be
safe to automatically enable DocValue(s).
 - The Analyzer is an open set
 - The number of tokens depends on the input

What we could do is store the value as a DocValue if and only if the
output of a specific analyzer chain happens to be a single token, but to
me that sounds a bit ugly, and I'm not sure what kind of issues we'd be
getting into.
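
To make the "single token" idea concrete, here is a minimal sketch in
plain Java. It does not use any Lucene or Hibernate Search API: the two
"analyzers" are hypothetical stand-ins (a lower-casing whitespace splitter
and a keyword-style chain), and docValueCandidate is a name made up for
illustration. It shows why the approach feels ugly: with the same analyzer,
some values would get a DocValue and others would not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch only: none of these names exist in Lucene or Hibernate Search.
final class SingleTokenDocValueSketch {

    // Stand-in for an analyzer chain: lower-cases and splits on whitespace.
    static List<String> whitespaceLowercase(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Stand-in for a keyword-style chain: the whole input is one token.
    static List<String> keyword(String input) {
        return Collections.singletonList(input);
    }

    // Returns the token that could be stored as a DocValue, or empty when
    // the chain produced zero or several tokens for this particular value.
    static Optional<String> docValueCandidate(
            Function<String, List<String>> analyzer, String input) {
        List<String> tokens = analyzer.apply(input);
        return tokens.size() == 1 ? Optional.of(tokens.get(0)) : Optional.empty();
    }

    public static void main(String[] args) {
        Function<String, List<String>> analyzer =
                SingleTokenDocValueSketch::whitespaceLowercase;
        // Same field, same analyzer, different outcome depending on the input.
        System.out.println(docValueCandidate(analyzer, "Infinispan"));
        System.out.println(docValueCandidate(analyzer, "Hibernate Search"));
    }
}
```

In this sketch the decision is taken per value, so a single index could end
up with documents that are sortable next to documents that are not.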

> Does uninverting offer any performance advantage
> over the sorting we already do ? Sorting wouldn't help anyway in the
> clustered query scenario, where you'd have to merge the results from
> multiple nodes anyway (I guess there is some win in pre-sorting on each
> node and then splicing the sorted sub-resultsets).
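
For reference, the "pre-sort on each node, then splice" idea mentioned
above can be sketched as a k-way merge. This is plain Java with
hypothetical names (SortedResultMergeSketch, merge), not what Infinispan
currently does for clustered queries; it assumes every node returns its
own results already sorted by the same comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: splices per-node sorted result lists into one sorted list.
final class SortedResultMergeSketch {

    // Cursor over one node's already-sorted results.
    private static final class Cursor<T> {
        final List<T> items;
        int pos;

        Cursor(List<T> items) {
            this.items = items;
        }

        T current() {
            return items.get(pos);
        }
    }

    // Merges the per-node lists and returns at most 'limit' results in order.
    static <T> List<T> merge(List<List<T>> perNode, Comparator<? super T> order, int limit) {
        // The heap is ordered by the element each cursor currently points at.
        Comparator<Cursor<T>> byHead = (a, b) -> order.compare(a.current(), b.current());
        PriorityQueue<Cursor<T>> heap = new PriorityQueue<>(byHead);
        for (List<T> results : perNode) {
            if (!results.isEmpty()) {
                heap.add(new Cursor<>(results));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < limit) {
            Cursor<T> next = heap.poll();
            merged.add(next.current());
            next.pos++;
            if (next.pos < next.items.size()) {
                heap.add(next); // re-insert with its new head element
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> perNode = Arrays.asList(
                Arrays.asList(1, 4, 7),
                Arrays.asList(2, 3, 9));
        // prints [1, 2, 3, 4]
        System.out.println(merge(perNode, Comparator.<Integer>naturalOrder(), 4));
    }
}
```

The merge only touches as many elements as the requested limit, which is
where the win from pre-sorting on each node would come from.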

If I skip "uninverting", you will only be able to sort on fields which
have DocValues stored in the index.
Uninverting offers no performance benefit at all: it's much slower and
takes quite some memory. We've put the uninverting process in place
just as a fallback to allow sorting to keep working on current models.

We'll need to:
a) define convenient metadata to mark fields used for sorting (for
embedded mode we'll discuss this on hibernate-dev, for Hot Rod queries
it's up to you)
b) encourage users to explicitly use this:
 - should we log a warning when we fall back to the slow (uninverting) strategy?
 - should we disable the fallback and have people stare at an explicit
stacktrace?
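
The two options in b) could be sketched roughly as follows. This is plain
Java, and every name in it (SortFallbackSketch, Policy, Strategy, choose)
is hypothetical; it assumes we already know, from whatever metadata we end
up defining in a), which fields were indexed with DocValues.

```java
import java.util.Collections;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch: picks a sorting strategy for a field, given the set of
// fields known to have DocValues. Not an existing Hibernate Search API.
final class SortFallbackSketch {

    enum Policy { WARN_AND_UNINVERT, FAIL }

    enum Strategy { DOC_VALUES, UNINVERTING }

    private static final Logger LOG =
            Logger.getLogger(SortFallbackSketch.class.getName());

    static Strategy choose(String field, Set<String> docValueFields, Policy policy) {
        if (docValueFields.contains(field)) {
            return Strategy.DOC_VALUES; // fast path, no uninverting needed
        }
        if (policy == Policy.FAIL) {
            // Option 2: no fallback, the user gets an explicit stacktrace.
            throw new IllegalStateException(
                    "Field '" + field + "' is not marked as sortable");
        }
        // Option 1: keep working, but tell the user it will be slow.
        LOG.warning("Sorting on '" + field
                + "' falls back to uninverting; mark the field as sortable");
        return Strategy.UNINVERTING;
    }

    public static void main(String[] args) {
        Set<String> sortable = Collections.singleton("name");
        System.out.println(choose("name", sortable, Policy.FAIL));
        System.out.println(choose("age", sortable, Policy.WARN_AND_UNINVERT));
    }
}
```

Making the policy a setting would let us ship with the warning first and
switch the default to failing later, if that is what we decide.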

>> # Index encoding
>> As usual the index encoding evolves and the easy solution is to
>> rebuild it. Lucene 5 no longer ships with backwards compatible
>> de-coders, but these are available as separate dependencies. If you
>> feel the need to be able to read existing indexes, we should include
>> these.
>
> Possibly make them optional.

+1

> I guess our recommendation is to
> mass-reindex anyway.

Always ;)

> Tristan
>
> --
> Tristan Tarrant
> Infinispan Lead
> JBoss, a division of Red Hat