[hibernate-dev] [Hibernate Search] DocValues and Sorting API -> new mapping annotations ?

Wed Jul 29 07:26:03 EDT 2015

You might remember that sorting a full-text Query on a field always
required some specific care:
since the beginning of Hibernate Search the user had to make sure the
field was not tokenized, or was tokenized but generated a single token.

This was a "soft requirement": if you didn't know, you'd get
inconsistent results but no error would be shown - after all, a Lucene
index was typically schema-less.

With Lucene 5, if you don't map your field specifically for
sorting, you'll get a runtime exception at query time.
"Specifically for sorting" means that the *single token*
needs to be stored as a DocValue.

DocValues are useful for other purposes too; for example they are a
more efficient strategy to store our "id" field and I hope we'll soon
use that transparently. They are also a better replacement for all those
use cases which previously relied on a FieldCache (which by the
way was used by the sorting code under the hood). For example, we
recently migrated the Spatial indexing to use DocValues, and we now
support serializing and writing of DocValues.

What we don't have is a way for end users to explicitly single out
which fields they want to be "DocValue encoded", and this really needs
to be added now, as the workaround that currently lets us run Sort
operations without it is killing our performance.

What should such annotations look like?

I don't like to expose such low-level details when for most people
it's just about sorting. Still, since DocValues are useful for other
reasons too, an annotation named "@Sortable" (or similar) would be limiting.
DocValues themselves - as a concept - are fine, but even in Lucene's
history the exact name changed several times; I'm wondering if we should
stick to the (current) technical term or abstract a bit from it.

I'm not sure if this should extend the @Field annotation, as
there are special restrictions implied in terms of analysis: are we
going to enforce a specific type of tokenizer, or simply take the
analysis option away?

Any nice suggestions of what this could look like? This would become a
highly used public API.

The good news is that we'll be able to validate sort definitions.
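
To make the validation idea concrete, here is a minimal sketch in plain
Java, with no Hibernate Search or Lucene dependency. Everything in it is
an assumption made for illustration: the marker annotation
@DocValueEncoded is a placeholder name rather than a proposal, and Book
and validateSort are hypothetical helpers. It only shows that once the
mapping states explicitly which fields are DocValue encoded, a sort on
any other field can be rejected up front instead of failing at query time.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class SortValidationSketch {

    // Hypothetical marker annotation: the name is a placeholder only.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface DocValueEncoded {
    }

    // Hypothetical entity: only "title" is declared as DocValue encoded.
    public static class Book {
        @DocValueEncoded
        String title;

        String summary;
    }

    // Rejects a sort definition on a field that is unknown or that was
    // not explicitly mapped as DocValue encoded.
    public static void validateSort(Class<?> entityType, String fieldName) {
        Field field;
        try {
            field = entityType.getDeclaredField(fieldName);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(
                    "Unknown field '" + fieldName + "' on "
                            + entityType.getSimpleName(), e);
        }
        if (!field.isAnnotationPresent(DocValueEncoded.class)) {
            throw new IllegalArgumentException(
                    "Cannot sort on '" + fieldName
                            + "': field is not DocValue encoded");
        }
    }

    public static void main(String[] args) {
        // Accepted: "title" carries the marker annotation.
        validateSort(Book.class, "title");

        // Rejected: "summary" was never mapped for sorting.
        try {
            validateSort(Book.class, "summary");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The sketch uses a separate marker rather than an attribute on @Field, so
it sidesteps the analysis question; whether that is the right shape for
the public API is exactly what is open for discussion.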

Thanks,
Sanne