| Taking on the work of Emmanuel, I'm now working on this on my own branch (https://github.com/yrodiere/hibernate-search/tree/HSEARCH-1872). I just created
HSEARCH-2320 Open for everything that will be addressed later on (see the ticket for details). We can always create sub-tasks when the time comes to address those. Summing up, here is the remaining work:
- Making things work in -engine (the tests are failing right now)
- Ensuring things work in -elasticsearch
- Using proper exceptions
- Taking some decisions (see below)
Below are the subjects on which we should take decisions. Where is this API going? Is this API going to be, on the long term:
- an independent way of building org.apache.lucene.search.Sort objects, that will later be provided when executing a query?
- an independent way of building HibernateSearch-defined sort objects (say, HSSort, like the HSQuery), that will later be provided when executing a HSQuery?
Solution 1 seems workable, since the ElasticSearch module currently converts Sort objects to its internal representation, but this discussion makes me doubt that reliance on Lucene types is a long-term objective. Solution 2 also seems workable, but would be part of a larger work on HibernateSearch APIs, which I'm not aware of (yet). Is this API going to support native sorts? See the discussion here: https://github.com/hibernate/hibernate-search/pull/1132#discussion_r72258266 In my opinion, supporting native sorts in this API would be a bit annoying, because the API is currently returning org.apache.lucene.search.Sort, which is not at all a good fit for putting ElasticSearch JSON sort definitions in it. We could either work it around (having a specific SortField with a JSON string in it, and a dummy field name), which seems a bit messy, or convert the JSON to a SortField, which would defeat the purpose of allowing to provide native JSON in the first place, since it would prevent users from using ElasticSearch features not yet supported in HibernateSearch. Also, the API will IMO be less elegant, since we'll have a method that accept a parameter whose content (be it a plain String or an object) is not clearly constrained. That would normally be a good candidate for a type parameter (TNativeSort), but given the way we obtain the sort builder, we cannot have an implementation-dependent type parameter. So whatever we do, we won't be able to get much better than exposing a byNative(Object) method that tells close to nothing to the user. A shame for a domain-specific language. Maybe we should take the problem the other way around: somewhere in the ElasticSearch module, allow for concatenation of DSL-built sorts and plain JSON strings? Or, maybe, add setSort(Sort ... fields) methods to queries, and add something Should we keep andByIndexOrder/andByScore Those should remain, IMO, because:
- Implementation is trivial (as long as Lucene supports that kind of sorts, which seems likely)
- They might be useful, particularly andByScore. I've provided a use case in the tests: sorting first by a field for which many documents have the same value (a category), and only then by relevance. So you'll have blocks of results, with the elements in each block being sorted by relevance. Granted, it's exotic, but business users are definitely able to come up with such a use case. And when business users want that kind of things in their whole application, it's frustrating being prevented from using the DSL for such a minor detail.
- Removing them would go against the principle of least surprise. If everything we provide as by* method is also provided as andBy* method, why not those two?
Should we keep in(Unit) I'd say we should not keep this in the DSL, since it's not supported out-of-the-box for vanilla Lucene queries. All methods in the DSL should be implemented for every backend. Maybe we can create another ticket for adding that kind of feature to locale Lucene indexes. Anyone find this useful? Is the byDistanceFromSpatialQuery(Query) method making sense The point of this method would be to take a query distance (distance to these coordinates must be below 1km, for instance), extract the point from it, and create a sort by distance to this point. On the plus side, this relieves the user from passing the reference coordinates twice. On the minus side, the method could break at runtime: we ask for a Query, which may or may not have a distance filter, and as such introduces new ways for users to make things explode. I wonder if it's such a great pain to provide the coordinates twice. Isn't it how it's done almost everywhere? I mean, when you write an ElasticSearch query by hand, or a SolR query, you also have to provide these coordinates twice. Besides, we're only changing the thing to provide twice, here: instead of providing the coordinates to HS twice, we provide the query to HS twice... I don't see any way to do it better than this, so I'd say we drop it. It's not implemented yet, anyway. Do we want to support sorting on unmapped fields When querying multiple indices at once, it may happen that some field we're sorting on is absent from one of the indices. In that case, ElasticSearch requires that type information be provided. So in that case, we may have to retrieve type information from the metamodel. Problem: the QueryBuilder, from which we start building the sort, is only aware of one entity we're working on. This might be something to fix, because queries built this way may (and will) still be ran against multiple indices, but that's another matter. So, let's say the user is querying two indices: A(fields: z, a1, a2), B (fields: z, b1, b2), and he wants to sort on z, then a1, then b1. The user won't be able to build his sort with the QueryBuilder, since whichever entity he chooses to bind his QueryBuilder to, there would be some missing type information at some point (for either the a1 field or the b1 field). Now, we can decide either to drop support for that kind of sort in the DSL (and that should be documented), or to provide support for {{QueryBuilder}}s spanning multiple types (say, they would use the first type definition they find on any of the entities they would be bound to). Or both. Any thoughts would be greatly appreciated! |