[infinispan-dev] Lucene 5 is coming: pitfalls to consider

Tue Jul 28 07:13:14 EDT 2015

Hi all,
the Hibernate Search branch upgrading to Apache Lucene 5.2.x is almost
ready, but there are some drawbacks on top of the many nice efficiency
improvements.

# API changes
The API changes are not too bad, and definitely an improvement. I'll
provide a detailed list as usual in the Hibernate Search migration
guide - for now let it suffice to know that it's an easy upgrade for
end users, as long as they were just creating Query instances and not
using the more powerful and complex stuff.

# Sorting
To sort on a field will require an UninvertingReader to wrap the
cached IndexReaders, and the uninverting process is very inefficient.
On top of that, the result of the uninverting process is not
cacheable, so that will need to be repeated on each index, for each
query which is executed.
In short, I expect performance of sorted queries to be quite degraded
in our first milestone using Lucene 5, and we'll have to discuss how
to fix this.
Needless to say, fixing this is a blocking requirement before we can
consider the migration complete.

Sorting will not need an UninvertingReader if the target field has
been indexed as DocValues, but that implies:
 - we'll need an explicit, upfront (indexing time) flag to be set
 - we'll need to detect if the matching indexing options are
compatible with the runtime query to skip the uninverting process

This is mostly a job for Hibernate Search, but in terms of user
experience it means you have to mark fields for "sortability"
explicitly; will we need to extend the protobuf schema?

Please make sure we'll just have to hook in existing metadata, we
can't fix this after API freeze.

# Filters
We did some clever bitset level optimisations to merge multiple Filter
instances and save memory to cache multiple filter instances, I had to
drop that code as we don't deal with in-heap structures more but the
design is about iterating off heap chunks of data, and resort on the
more traditional Lucene stack for filtering.
I couldn't measure the performance impact yet; it's a significantly
different approach and while it sounds promising on paper, we'll need
some help testing this. The Lucene team can generally be trusted to go
in the better direction, but we'll have to verify if we're using it in
the best way.

# Analyzers
It is no longer possible to override the field->analyzer mapping at
runtime. We did expose this feature as a public API and I found a way
to still do it, but it comes with a performance price tag.
We'll soon deprecate this feature; if you can, start making sure
there's no need for this in Infinispan as at some time in the near
future we'll have to drop this, with no replacement.

# Index encoding
As usual the index encoding evolves and the easy solution is to
rebuild it. Lucene 5 no longer ships with backwards compatible
de-coders, but these are available as separate dependencies. If you
feel the need to be able to read existing indexes, we should include
these.
(I'm including these as private dependencies in the Hibernate Search modules).

Thanks,
Sanne