On 28 July 2015 at 15:09, Tristan Tarrant <ttarrant(a)redhat.com> wrote:
On 28/07/2015 13:13, Sanne Grinovero wrote:
> # Sorting
> Sorting on a field will require an UninvertingReader to wrap the
> cached IndexReaders, and the uninverting process is very inefficient.
> On top of that, the result of the uninverting process is not
> cacheable, so that will need to be repeated on each index, for each
> query which is executed.
> In short, I expect performance of sorted queries to be quite degraded
> in our first milestone using Lucene 5, and we'll have to discuss how
> to fix this.
> Needless to say, fixing this is a blocking requirement before we can
> consider the migration complete.
>
> Sorting will not need an UninvertingReader if the target field has
> been indexed as DocValues, but that implies:
> - we'll need an explicit, upfront (indexing time) flag to be set
> - we'll need to detect if the matching indexing options are
> compatible with the runtime query to skip the uninverting process
>
> This is mostly a job for Hibernate Search, but in terms of user
> experience it means you have to mark fields for "sortability"
> explicitly; will we need to extend the protobuf schema?
>
> Please make sure we'll just have to hook in existing metadata, we
> can't fix this after API freeze.
This is very important to get right. As a user, I'd honestly expect an
indexed field to also be usable for sorting, so we should probably have
a per-index flag (off by default) which implicitly enables DocValues
for all indexed fields.
Even in past Lucene versions it has never been that simple: for a field
to be sortable *correctly*, the tokenizer had to output a single token,
which implies that the user had to explicitly design some fields for
the "purpose of sorting".
DocValues express this as a stricter requirement: you can't encode a
multi-token value as a DocValue.
The problem is that we can't guess for which analyzers it would be
safe to automatically enable DocValue(s).
- The Analyzer is an open set
- The number of tokens depends on the input
What we could do is store the value as a DocValue if and only if the
output of a specific analyzer chain happens to be a single token, but to
me that sounds a bit ugly, and I'm not sure what kind of issues we'd be
getting into.
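
To make the "single token" condition concrete, here is a minimal sketch
in plain Java. It does not use the Lucene Analyzer API: TokenChain and
the two sample chains are hypothetical stand-ins, introduced only to
show that whether a value qualifies depends on both the chain and the
input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

public class SingleTokenSortValue {

    // Hypothetical stand-in for an analyzer chain: raw input in, tokens out.
    // This is NOT the Lucene Analyzer API.
    public interface TokenChain extends Function<String, List<String>> {}

    // Keyword-style chain: the whole input, lower-cased, as one token.
    public static final TokenChain KEYWORD_LOWERCASE = input -> {
        List<String> tokens = new ArrayList<>();
        tokens.add(input.toLowerCase(Locale.ROOT));
        return tokens;
    };

    // Whitespace chain: one token per whitespace-separated word.
    public static final TokenChain WHITESPACE = input -> {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    };

    // Returns the value that could be stored for sorting, or empty when the
    // chain produced zero or several tokens for this particular input.
    public static Optional<String> sortValue(TokenChain chain, String input) {
        List<String> tokens = chain.apply(input);
        return tokens.size() == 1 ? Optional.of(tokens.get(0)) : Optional.empty();
    }
}
```

With these assumed chains, sortValue(KEYWORD_LOWERCASE, "Red Hat")
yields "red hat", while sortValue(WHITESPACE, "Red Hat") yields an empty
Optional and sortValue(WHITESPACE, "Infinispan") yields "Infinispan".
The same chain is sortable for some documents and not for others, which
is the ugly part.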
Does uninverting offer any performance advantage over the sorting we
already do? Sorting wouldn't help in the clustered query scenario
anyway, where you'd have to merge the results from multiple nodes (I
guess there is some win in pre-sorting on each node and then splicing
the sorted sub-resultsets).
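
The "pre-sort on each node, then splice" idea can be sketched as a k-way
merge over already-sorted per-node lists. This is plain Java and only an
illustration: SortedResultMerger and its merge method are hypothetical
names, not part of Infinispan or Hibernate Search, and it assumes every
node returns its results sorted by the same comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedResultMerger {

    // One cursor per node: the current head plus the rest of that node's results.
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(Iterator<T> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Merges per-node result lists (each already sorted by 'order') and
    // returns at most 'limit' elements in global sorted order.
    public static <T> List<T> merge(List<List<T>> perNode,
                                    Comparator<? super T> order,
                                    int limit) {
        Comparator<Cursor<T>> byHead = (a, b) -> order.compare(a.head, b.head);
        PriorityQueue<Cursor<T>> heap = new PriorityQueue<>(byHead);
        for (List<T> results : perNode) {
            if (!results.isEmpty()) {
                heap.add(new Cursor<>(results.iterator()));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < limit) {
            Cursor<T> smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance this node's cursor and put it back in the heap.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

Each step only compares the current head of every node, so the
coordinator never re-sorts the full result set, and it can stop as soon
as the requested page is filled.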
If I skip "uninverting" you will only be able to sort on fields which
have DocValues stored in the index.
Uninverting offers no performance benefit at all: it's much slower and
takes quite some memory. We've put the uninverting process in place
just as a fallback to allow sorting to happen on current models.
We'll need to:
a) define convenient metadata to mark fields used for sorting (for
embedded mode we'll discuss this on hibernate-dev, for Hot Rod queries
it's up to you)
b) encourage users to explicitly use this:
- should we log a warning when we fall back to the slow (uninverting)
strategy?
- should we disable the fallback and have people stare at an explicit
stacktrace?
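
The two options under b) could look roughly like the sketch below. It is
plain Java with hypothetical names throughout (SortStrategyChooser,
Strategy, FallbackPolicy); none of it is existing Hibernate Search or
Infinispan API. The set of DocValues-backed fields is assumed to come
from whatever metadata we settle on in a).

```java
import java.util.Set;
import java.util.logging.Logger;

public class SortStrategyChooser {

    public enum Strategy { DOC_VALUES, UNINVERTING }

    // WARN: fall back to uninverting and log; FAIL: refuse the query.
    public enum FallbackPolicy { WARN, FAIL }

    private static final Logger LOG =
            Logger.getLogger(SortStrategyChooser.class.getName());

    // 'docValueFields' is the (assumed) set of fields flagged as sortable
    // at indexing time, i.e. the ones that have DocValues in the index.
    public static Strategy choose(String field,
                                  Set<String> docValueFields,
                                  FallbackPolicy policy) {
        if (docValueFields.contains(field)) {
            return Strategy.DOC_VALUES;
        }
        if (policy == FallbackPolicy.FAIL) {
            throw new IllegalStateException(
                    "Field '" + field + "' is not marked as sortable");
        }
        LOG.warning("Sorting on '" + field
                + "' falls back to the slow uninverting strategy;"
                + " mark the field as sortable to avoid this");
        return Strategy.UNINVERTING;
    }
}
```

Making the policy a parameter would let us ship WARN first and switch
the default to FAIL later, once users have had a release to mark their
fields.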
> # Index encoding
> As usual the index encoding evolves and the easy solution is to
> rebuild it. Lucene 5 no longer ships with backwards compatible
> de-coders, but these are available as separate dependencies. If you
> feel the need to be able to read existing indexes, we should include
> these.
Possibly make them optional.
+1
I guess our recommendation is to
mass-reindex anyway.
Always ;)
Tristan
--
Tristan Tarrant
Infinispan Lead
JBoss, a division of Red Hat