[hibernate-dev] [Hibernate Search] DocValues and Sorting API -> new mapping annotations ?

Tue Aug 4 12:00:14 EDT 2015

Hi Guillaume,
thanks! great input. Some comments inline:

On 4 August 2015 at 15:11, Guillaume Smet <guillaume.smet at gmail.com> wrote:
> Hi Sanne,
>
> On Wed, Jul 29, 2015 at 1:26 PM, Sanne Grinovero <sanne at hibernate.org>
> wrote:
>>
>> I'm not sure if this should be extending the @Field annotation as
>> there are special restrictions implied in terms of analysis: are we
>> going to enforce a specific type of tokenizer, or simply take the
>> analysis option away?
>
>
> You can't take the analysis option away: it's often used to normalize
> sorting on strings (lowercase, remove accents, remove special characters and
> so on).

Right, we discussed this same example in a recent meeting on the subject.
So that's what makes it tricky: we want to allow analysis, but Lucene
needs a strong guarantee that each document yields a single value for
the sort field, and we can't really verify that unless we take away the
liberty to use any analyzer.
An alternative would be to wrap the Analyzer to monitor and verify that
it is "well-behaved", but I'm not sure if that's doable, or if the
performance impact would be negligible. I guess we'll just put it into
the user's hands to make a sensible choice... not that we've done better
so far in this respect.
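
To make the kind of normalization Guillaume mentions concrete (lowercase, strip accents, strip special characters), here is a minimal plain-Java sketch. It is only an illustration: `SortKeys`, `normalize` and `sortByKey` are hypothetical names, not Hibernate Search or Lucene API, and a real mapping would do this through an analyzer. The point is that the whole string stays one single sort key and is never split into several tokens.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Hibernate Search or Lucene.
final class SortKeys {

    // Turns a string into one normalized sort key (never several tokens).
    static String normalize(String value) {
        // Decompose accented characters into base letter + combining marks.
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        // Drop the combining marks, i.e. the accents.
        String noAccents = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase with a fixed locale so results don't depend on the JVM default.
        String lower = noAccents.toLowerCase(Locale.ROOT);
        // Remove everything that is not a letter, a digit or whitespace.
        String cleaned = lower.replaceAll("[^\\p{L}\\p{N}\\s]+", "");
        // Trim and collapse runs of whitespace into a single space.
        return cleaned.trim().replaceAll("\\s+", " ");
    }

    // Returns a sorted copy, ordered by the normalized key rather than the raw value.
    static List<String> sortByKey(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.comparing(SortKeys::normalize));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00C9" is an uppercase E with an acute accent.
        List<String> names = Arrays.asList("zebra", "\u00C9mile", "apple", "\u00C9clair");
        System.out.println(sortByKey(names));
    }
}
```

Sorting on the raw values would put the accented names after "zebra"; sorting on the normalized key places them between "apple" and "zebra".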

> FWIW, we use specific fields for sorting each time we need to sort on a
> string as we don't want to tokenize the string (but not for numerics and
> dates). Maybe @SortFields/@SortField annotations would be in order (I don't
> like Sortable as I don't think it's a good idea to use these fields for
> search).

I like that name proposal, and +1 to not encouraging people to try to
reuse the same field for sorting and indexing.
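
Just to visualize what such a mapping could look like, here is a rough, self-contained sketch. Everything in it is an assumption: the `@SortField`/`@SortFields` annotations are declared locally with made-up attributes (`name`, `analyzer`), the analyzer names are placeholders, and `Book` and `SortFieldScanner` are hypothetical. None of this is existing Hibernate Search API; it only shows dedicated sort fields declared separately from the searchable ones.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical annotation sketching the proposal; NOT an existing Hibernate Search API.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SortField {
    // Name of the dedicated sort field in the index.
    String name();
    // Optional normalizing analyzer; the empty string means "no analysis".
    String analyzer() default "";
}

// Hypothetical container for declaring several sort fields on one property.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SortFields {
    SortField[] value();
}

// Hypothetical entity: sort fields are declared apart from anything used for search.
final class Book {
    @SortField(name = "title_sort", analyzer = "lowercase_no_accents")
    String title;

    @SortFields({
        @SortField(name = "author_sort"),
        @SortField(name = "author_sort_normalized", analyzer = "lowercase_no_accents")
    })
    String author;

    // No sort field declared: this property would be searchable only.
    String summary;
}

// Hypothetical scanner that collects the declared sort field names via reflection.
final class SortFieldScanner {
    static Set<String> sortFieldNames(Class<?> type) {
        // TreeSet gives a stable order; getDeclaredFields() guarantees none.
        Set<String> names = new TreeSet<>();
        for (Field field : type.getDeclaredFields()) {
            SortField single = field.getAnnotation(SortField.class);
            if (single != null) {
                names.add(single.name());
            }
            SortFields many = field.getAnnotation(SortFields.class);
            if (many != null) {
                for (SortField sortField : many.value()) {
                    names.add(sortField.name());
                }
            }
        }
        return names;
    }
}
```

Calling `SortFieldScanner.sortFieldNames(Book.class)` collects the three sort field names and nothing for `summary`, which is the separation between sorting and searching discussed above.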

The next action for us is to verify the performance impact of the
current approach, which is based on the UninvertingReader from
lucene-misc. Gunnar pointed out that uninverting and loading into a
FieldCache is not very different from what Lucene has been doing so
far, so that might be a good strategy to allow migrating to Lucene 5
incrementally, and provide an incremental improvement in this area
rather than requiring the new mapping.

I'll soon merge this approach. As usual I'm short on real-world
applications to benchmark, so if you're interested in helping with
that, it would be awesome; we just need to know that the new code won't
be significantly slower than the Lucene 4 based strategies for sorted
queries.

Thanks,
Sanne