Re: [hibernate-dev] HSEARCH: Coexisting of Lucene and Elasticsearch backends vs polymorphism & co

Thursday, 21 April 2016

Hey,

+1 for the concept of index families in the long run. As you say, I don't
think there is an immediate need for action as of 5.6.

When you say

...
 know that the Search instance is fully using ES exclusively, or
 Lucene exclusively 
do you mean "Hibernate Search" instance, or just a specific query? It's
possible to use Lucene and ES for different entities already if you know
the limitations. E.g. no queries crossing the family border, no sharding
crossing the family border (which seems questionable anyways, so I think we
could disallow that to begin with).

How about this: Let's bring support for multiple ES clusters into 5.7,
which should allow us to lay the grounds for "index families" as we'll
learn what needs to be done for different settings per cluster etc.

--Gunnar

2016-04-20 22:21 GMT+02:00 Sanne Grinovero <sanne(a)hibernate.org>:

...
 In the context of implementing Elasticsearch support for Hibernate
 Search, there's a recurring need to transform the domain model to the
 "Document" representation using a strategy which depends on the
 storage choice, i.e. Lucene vs Elasticsearch.

 For example Guillaume working on HSEARCH-2067 needs to associate the
 entities document builder with a FieldBridge choice which needs to
 know if the output document will be indexed in ES, rather than Lucene.

 The choice of FieldBridge implementation affects the DocumentBuilder
 bound to each type; this implies that we're "tainting" the
 DocumentBuilder for all instances of a type.

 The abstraction of "IndexManager" is meant to initialize and manage an
 *index* - but remember that there's no guarantee that a single type is
 bound to a single index (and so to a single IndexManager).

  - We have the case of a single type being spread out over multiple
 indexes, using Sharding.
  - We also have the opposite, of multiple different types sharing an index
  - Subtypes of indexed types can opt to be indexed in a different index
  - All of the above can be mixed freely, as there's a clear
 distinction between type (identified by a Class) and index (identified
 by a String)

 [I'm not stating that the above facts are necessarily all required,
 just that they are currently supported.. so we could in theory discuss
 taking away some of this flexibility now, but implementing such
 restrictions would need to wait for version 6.0.]

 When a Query is run on a type A, we're transparently running the query
 on all indexes or shards containing A, and also its indexed subtypes
 on different indexes. We're also filtering out incompatible types
 transparently, if any of these sub-indexes are shared with other
 types.
 We also allow running a FullTextQuery on multiple, unrelated types and
 the same rules apply.
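
 The resolution rules above could be sketched roughly as follows. This is
 only an illustration in plain Java: QueryTargetSketch, bind, indexesFor
 and typesToFilterOut are hypothetical names, not Hibernate Search API, and
 the nested Animal, Dog and Invoice classes are made-up example types.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in Hibernate Search.
final class QueryTargetSketch {

    // Made-up example types: Dog is a subtype of Animal, Invoice is unrelated.
    static class Animal { }
    static class Dog extends Animal { }
    static class Invoice { }

    // type (identified by a Class) -> index names (identified by a String)
    private final Map<Class<?>, Set<String>> indexesByType = new LinkedHashMap<>();

    // Several names for one type model sharding; one name reused for
    // several types models a shared index.
    void bind(Class<?> type, String... indexNames) {
        Set<String> names = indexesByType.computeIfAbsent(type, k -> new LinkedHashSet<>());
        Collections.addAll(names, indexNames);
    }

    // All indexes holding the queried type or one of its indexed subtypes.
    Set<String> indexesFor(Class<?> queried) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<Class<?>, Set<String>> e : indexesByType.entrySet()) {
            if (queried.isAssignableFrom(e.getKey())) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    // Incompatible types living in the selected indexes: these are the
    // ones a query on 'queried' has to filter out.
    Set<Class<?>> typesToFilterOut(Class<?> queried) {
        Set<String> selected = indexesFor(queried);
        Set<Class<?>> result = new LinkedHashSet<>();
        for (Map.Entry<Class<?>, Set<String>> e : indexesByType.entrySet()) {
            boolean compatible = queried.isAssignableFrom(e.getKey());
            if (!compatible && !Collections.disjoint(selected, e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        QueryTargetSketch s = new QueryTargetSketch();
        s.bind(Animal.class, "animals_0", "animals_1"); // sharded type
        s.bind(Dog.class, "dogs");                      // subtype in its own index
        s.bind(Invoice.class, "animals_0");             // unrelated type sharing an index
        System.out.println(s.indexesFor(Animal.class));
        System.out.println(s.typesToFilterOut(Animal.class));
    }
}
```

 A query on Animal here would target both shards plus the Dog index, and
 would have to filter out Invoice because it shares one of those shards.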

 To perform such a Query on multiple indexes, the trick currently used
 with Lucene based backends is the usage of MultiReaders: we wrap
 multiple indexes and present them as one index reader to the query
 engine, it's a "unified view" on which the query is performed.

 For obvious reasons we can not wrap a MultiReader across both Lucene
 indexes and Elasticsearch's query capabilities (or maybe we could
 eventually, but that's a whole lot of R&D to be done for questionable
 usefulness).

 So, we need to introduce a new concept: something like "index
 families" to properly abstract the boundaries, as clearly some indexes
 can work together better within the same kind than with indexes of
 another kind.
 Stuff indexed in Lucene embedded would belong to a family A, stuff in
 the Elasticsearch cluster would be family B, and I guess one might
 have a secondary independent Elasticsearch cluster which would need to
 be in a different family C, or eventually a Solr cluster in yet
 another separated family.

 Such an "index family" would give us:
  - a place where the connection settings and connection pools are handled
 for Elasticsearch
  - clear boundaries about which types can be queried "as one": only
 the types in the same family, and subtypes might be allowed a
 different index but it must live in the same family. Same for
 Sharding.
  - a reasonable place to query for which "kind of storage" is being
 used for a specific type
  - An Analyzer might exist only within a family (Defined on one ES
 cluster, not on the other)
  - We have a long standing issue with Similarity: you can only have
 one in a group of indexes, but the group concept is undefined (and
 only loosely validatable)
  - An "index family" could have a type, therefore define what kind of
 FieldBridge(s) need to be generated
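
 To make the idea a bit more concrete, here is a minimal sketch of what
 such a family could expose to internal components. Everything in it is an
 assumption made for illustration: IndexFamilySketch, IndexFamily,
 StorageKind, bind, storageKindOf and commonFamily are hypothetical names,
 not existing Hibernate Search API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Hibernate Search.
final class IndexFamilySketch {

    enum StorageKind { LUCENE_EMBEDDED, ELASTICSEARCH }

    // One instance per family. Two independent Elasticsearch clusters would
    // be two instances sharing the same StorageKind.
    static final class IndexFamily {
        final String name;
        final StorageKind kind;

        IndexFamily(String name, StorageKind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    private final Map<Class<?>, IndexFamily> familyByType = new HashMap<>();

    void bind(Class<?> type, IndexFamily family) {
        familyByType.put(type, family);
    }

    IndexFamily familyOf(Class<?> type) {
        IndexFamily family = familyByType.get(type);
        if (family == null) {
            throw new IllegalArgumentException("Type is not indexed: " + type.getName());
        }
        return family;
    }

    // The "kind of storage" lookup, e.g. to decide which FieldBridge to generate.
    StorageKind storageKindOf(Class<?> type) {
        return familyOf(type).kind;
    }

    // Types can be queried "as one" only if they all live in the same family.
    IndexFamily commonFamily(Class<?>... types) {
        if (types.length == 0) {
            throw new IllegalArgumentException("At least one type is required");
        }
        IndexFamily first = familyOf(types[0]);
        for (Class<?> type : types) {
            IndexFamily other = familyOf(type);
            if (other != first) {
                throw new IllegalStateException("Query crosses the family border: "
                        + first.name + " vs " + other.name);
            }
        }
        return first;
    }
}
```

 Families are compared by identity rather than by kind, so that a second,
 independent Elasticsearch cluster counts as a different family.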

 I'm not saying that this is all blocking for 5.6. My proposal is to
 see if we agree on such a design as a longer term objective (set some
 foundation in 5.7, finalize for 6).

 For 5.6 I'd be happy enough to essentially document that there's only
 one family allowed, which allows us to cut some corners like:
  - single set of Analyzers to validate
  - know that the Search instance is fully using ES exclusively, or
 Lucene exclusively
  - know that all IndexManagers are connected to the same set of ES
 nodes (if using ES)

 So not much changing.. just hope this helps in shaping our internals
 with an eye on the next step, and make sure that the listed
 limitations which we've been accepting already can be clearly
 documented.

 It would be great to already have the basics for index families in
 place, for example to define the proper API to read metadata for a
 type (like Guillaume is needing), and to clean up some things, such as
 making the Similarity definition clearly associated with such a thing.

 Naming: index family ? index groups?
 Not sure if there's need to add anything to the configuration
 properties; for now it could simply reflect our interpretation of the
 existing configuration, yet expose useful and clean metadata to the
 internal components which need this.

 Thanks for any comments!

 Sanne
 _______________________________________________
 hibernate-dev mailing list
 hibernate-dev(a)lists.jboss.org
 https://lists.jboss.org/mailman/listinfo/hibernate-dev
