Ivan Krumov created an issue

Issue Type:	New Feature
Assignee:	Unassigned
Components:	elasticsearch
Created:	08/Mar/2017 06:17 AM
Environment:	Hibernate ORM 5.1.3.Final, Hibernate Search 5.6.1.Final, Elasticsearch 2.3.2, Java EE 7, DB engine: Aurora 5.6.10a
Priority:	Major
Reporter:	Ivan Krumov

Consider adding support for the '_routing' parameter when doing CRUD operations against an Elasticsearch cluster. In my opinion, this can be a very effective way to improve the performance of search and update operations, and it gives more control over isolating different domains of data in the index. See this link for the documentation of this parameter.

By default, without custom _routing, the document ID is used as the routing value that determines which ES shard a document is indexed in. ES selects the shard with a formula that hashes the routing value and takes the result modulo the configured primary shard count, which spreads the data (reasonably) evenly over the available shards. However, a user might want to isolate a set of documents in a single shard (determined by a discriminating property, for example). Knowing which shard holds them, the user can then search for documents in this set by explicitly querying that shard and no other. This can be done by using custom _routing. When searching, multiple comma-separated values can be given for this parameter to query more than one shard, for example.
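The shard selection described above can be sketched as follows. This is only an illustration of the "hash the routing value, then take it modulo the primary shard count" idea: the class name RoutingSketch and its simple string hash are assumptions made for the example, and Elasticsearch uses its own internal hash function, so the shard numbers computed here will not match a real cluster.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only. The hash below is a hypothetical stand-in;
// Elasticsearch uses its own hash function internally.
final class RoutingSketch {

    // Simple deterministic hash over the UTF-8 bytes of the routing value.
    static int hash(String routing) {
        int h = 0;
        for (byte b : routing.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        return h;
    }

    // shard = hash(routing) mod number_of_primary_shards
    // floorMod keeps the result in [0, primaryShards) even for negative hashes.
    static int shardFor(String routing, int primaryShards) {
        return Math.floorMod(hash(routing), primaryShards);
    }

    public static void main(String[] args) {
        int primaryShards = 3;

        // Default behaviour: the document ID is the routing value,
        // so documents of one organisation may land on different shards.
        System.out.println("doc-1 -> shard " + shardFor("doc-1", primaryShards));
        System.out.println("doc-2 -> shard " + shardFor("doc-2", primaryShards));

        // Custom routing: every document routed by the same organisation key
        // resolves to the same shard, whatever its document ID is.
        System.out.println("org-42 -> shard " + shardFor("org-42", primaryShards));
    }
}
```

The point of the sketch is that the shard depends only on the routing value and the primary shard count, so reusing one routing value (an organisation key, say) for many documents keeps them together.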

Why do I need this? My use case is:
I am building an interface where users can segment a big set of data using custom-built filtering queries (using ES). Moreover, users can do full-text search and apply filters to it as they choose. Each user belongs to an organisation, and only has access to data in that organisation. I have millions of documents to index, with a couple of entity types. I want to isolate the data for a given organisation, and direct searches to the indices and shards that store that data. I do not want to search all shards because it is inefficient to search such a big data set. So I split the data into multiple indices, each further split into shards.

Most of these documents are old and not very relevant for search. I put all of that data into an ARCHIVE index. The newer data is split into two LIVE indices, each containing data for half of the organisations. Each index is split into 3 primary shards, each replicated once, so 6 shard copies per index. I want to put all the data of a single organisation in a single primary shard (and its replica). Then, when searching that data, I want to use custom routing to select that shard only.
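For illustration, the requests I have in mind pass the organisation key as the routing query parameter on both the index and the search call. The sketch below only builds the request paths. The class name RoutingUrls, the index name "live-1", the type name "document" and the key "org-42" are hypothetical placeholders, and this is not a proposal for how Hibernate Search should expose the feature.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative sketch: builds Elasticsearch REST paths that carry a
// custom routing value. All index, type and routing names are hypothetical.
final class RoutingUrls {

    // Form-encodes a query parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Path for indexing one document (PUT) with a custom routing value.
    static String indexPath(String index, String type, String id, String routing) {
        return "/" + index + "/" + type + "/" + id + "?routing=" + encode(routing);
    }

    // Path for a search (GET or POST) restricted to the shard
    // that the routing value resolves to.
    static String searchPath(String index, String routing) {
        return "/" + index + "/_search?routing=" + encode(routing);
    }

    public static void main(String[] args) {
        // All documents of organisation "org-42" are indexed with the same
        // routing value, so they end up in the same primary shard.
        System.out.println("PUT " + indexPath("live-1", "document", "17", "org-42"));

        // Searching with the same routing value queries only that shard.
        System.out.println("GET " + searchPath("live-1", "org-42"));
    }
}
```

The same routing value has to be supplied when updating or deleting a routed document, which is why support would be needed across all CRUD operations and not only for search.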

I currently use custom routing successfully with my own manual integration with ES. But I want to use Hibernate Search to sync data between my database and ES, because this is a task best suited to an ORM.

This message was sent by Atlassian JIRA