HSEARCH: Coexistence of Lucene and Elasticsearch backends vs polymorphism & co

Wednesday, 20 April 2016

In the context of implementing Elasticsearch support for Hibernate
Search, there's a recurring need to transform the domain model to the
"Document" representation using a strategy which depends on the
storage choice, i.e. Lucene vs Elasticsearch.

For example, Guillaume, working on HSEARCH-2067, needs to associate
each entity's document builder with a FieldBridge choice, and that
choice depends on whether the output document will be indexed in ES
rather than in Lucene.

The choice of FieldBridge implementation affects the DocumentBuilder
bound to each type; this implies that we're "tainting" the
DocumentBuilder for all instances of a type.

The abstraction of "IndexManager" is meant to initialize and manage an
*index* - but remember that there's no guarantee that a single type is
bound to a single index (and so to a single IndexManager).

 - We have the case of a single type being spread out on multiple
indexes, using Sharding.
 - We also have the opposite, of multiple different types sharing an index
 - Subtypes of indexed types can opt to be indexed in a different index
 - All of the above can be mixed freely, as there's a clear
distinction between type (identified by a Class) and index (identified
by a String)

[I'm not stating that the above facts are necessarily all required,
just that they are currently supported.. so we could in theory discuss
taking away some of this flexibility now, but implementing such
restrictions would need to wait for version 6.0.]

When a Query is run on a type A, we're transparently running the query
on all indexes or shards containing A, and also on the indexes of its
indexed subtypes. We're also filtering out incompatible types
transparently, if any of these indexes are shared with other
types.
We also allow running a FullTextQuery on multiple, unrelated types and
the same rules apply.
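
As a rough illustration of the rules above, here is a minimal sketch in
plain Java. All names in it (IndexBindings, Animal, Dog, Invoice, the
index names) are hypothetical and not part of Hibernate Search; it only
models "type identified by a Class, index identified by a String".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical entities, for illustration only.
class Animal {}
class Dog extends Animal {}
class Invoice {}

// Hypothetical helper: NOT part of the Hibernate Search API.
final class IndexBindings {
    // A type is identified by a Class, an index by a String.
    private final Map<Class<?>, Set<String>> indexesByType = new LinkedHashMap<>();

    // A type may be bound to several indexes (sharding),
    // and several types may be bound to the same index.
    void bind(Class<?> type, String... indexNames) {
        indexesByType.computeIfAbsent(type, k -> new TreeSet<>())
                .addAll(Arrays.asList(indexNames));
    }

    // All indexes holding the queried type or one of its indexed subtypes.
    Set<String> indexesToQuery(Class<?> queriedType) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<Class<?>, Set<String>> e : indexesByType.entrySet()) {
            if (queriedType.isAssignableFrom(e.getKey())) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    // Unrelated types sharing one of those indexes:
    // their documents have to be filtered out of the results.
    Set<Class<?>> typesToFilterOut(Class<?> queriedType) {
        Set<String> targets = indexesToQuery(queriedType);
        Set<Class<?>> result = new LinkedHashSet<>();
        for (Map.Entry<Class<?>, Set<String>> e : indexesByType.entrySet()) {
            boolean sharesIndex = !Collections.disjoint(targets, e.getValue());
            if (sharesIndex && !queriedType.isAssignableFrom(e.getKey())) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With Animal sharded over two indexes, and Dog and Invoice sharing a
third one, a query on Animal would target all three indexes and would
need to filter out Invoice documents.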

To perform such a Query on multiple indexes, the trick currently used
with Lucene based backends is to use MultiReaders: we wrap
multiple indexes and present them as one index reader to the query
engine. It's a "unified view" on which the query is performed.
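
To make the "unified view" idea concrete, here is a simplified sketch
of the general technique: several readers are concatenated and global
document ids are mapped to a sub-reader by offset. This is not the
Lucene API; SimpleReader, ListReader and UnifiedReader are hypothetical
names, and documents are reduced to plain strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an index reader: NOT the Lucene API.
interface SimpleReader {
    int maxDoc();
    String document(int docId);
}

// A reader over a fixed list of documents, one per "index".
final class ListReader implements SimpleReader {
    private final List<String> docs;

    ListReader(String... docs) {
        this.docs = Arrays.asList(docs.clone());
    }

    public int maxDoc() {
        return docs.size();
    }

    public String document(int docId) {
        return docs.get(docId);
    }
}

// Presents several readers as a single one by concatenating their id spaces.
final class UnifiedReader implements SimpleReader {
    private final SimpleReader[] subReaders;
    // starts[i] is the first global doc id served by subReaders[i];
    // the last entry is the total number of documents.
    private final int[] starts;

    UnifiedReader(SimpleReader... subReaders) {
        this.subReaders = subReaders.clone();
        this.starts = new int[subReaders.length + 1];
        for (int i = 0; i < subReaders.length; i++) {
            starts[i + 1] = starts[i] + subReaders[i].maxDoc();
        }
    }

    public int maxDoc() {
        return starts[subReaders.length];
    }

    public String document(int docId) {
        if (docId < 0 || docId >= maxDoc()) {
            throw new IndexOutOfBoundsException("No such document: " + docId);
        }
        // Find the sub-reader whose id range contains docId.
        int i = 0;
        while (docId >= starts[i + 1]) {
            i++;
        }
        return subReaders[i].document(docId - starts[i]);
    }
}
```

The query engine only ever sees the single UnifiedReader, which is why
this approach only works when all the wrapped readers are of the same
kind.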

For obvious reasons we cannot wrap a MultiReader across both Lucene
indexes and Elasticsearch's query capabilities (or maybe we could
eventually, but that's a whole lot of R&D to be done for questionable
usefulness).

So, we need to introduce a new concept: something like "index
families" to properly abstract the boundaries, as clearly indexes
of the same kind can work together better than indexes of
different kinds.
Stuff indexed in Lucene embedded would belong to a family A, stuff in
the Elasticsearch cluster would be family B, and I guess one might
have a secondary independent Elasticsearch cluster which would need to
be in a different family C, or eventually a Solr cluster in yet
another separated family.

Such an "index family" would give us:
 - a place where the connection settings and connection pools are
handled for Elasticsearch
 - clear boundaries about which types can be queried "as one": only
the types in the same family, and subtypes might be allowed a
different index but it must live in the same family. Same for
Sharding.
 - a reasonable place to query for which "kind of storage" is being
used for a specific type
 - An Analyzer might exist only within a family (Defined on one ES
cluster, not on the other)
 - We have a long standing issue with Similarity: you can only have
one in a group of indexes, but the group concept is undefined (and
only loosely validatable)
 - An "index family" could have a type, therefore define what kind of
FieldBridge(s) need to be generated
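
To give the discussion something concrete to shoot at, the points
above could be sketched roughly as follows. Every name here
(StorageKind, IndexFamily, IndexFamilyRegistry, and the sample
entities) is hypothetical; nothing like this exists in the codebase
today, and the real contract would need to cover Analyzers, Similarity
and connection handling as well.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical entities, for illustration only.
class Book {}
class Author {}
class LogEntry {}

// Hypothetical: the "type" of a family, which would drive FieldBridge selection.
enum StorageKind { LUCENE, ELASTICSEARCH }

// Hypothetical: one family per embedded Lucene setup or per remote cluster.
final class IndexFamily {
    private final String name;
    private final StorageKind kind;

    IndexFamily(String name, StorageKind kind) {
        this.name = name;
        this.kind = kind;
    }

    String name() { return name; }
    StorageKind kind() { return kind; }
}

// Hypothetical: the place to ask "which kind of storage is used for this type?".
final class IndexFamilyRegistry {
    private final Map<Class<?>, IndexFamily> familyByType = new HashMap<>();

    // A type belongs to exactly one family, whatever its indexes or shards.
    void bind(Class<?> type, IndexFamily family) {
        IndexFamily previous = familyByType.putIfAbsent(type, family);
        if (previous != null && previous != family) {
            throw new IllegalStateException(type.getName()
                    + " is already bound to family " + previous.name());
        }
    }

    IndexFamily familyOf(Class<?> type) {
        IndexFamily family = familyByType.get(type);
        if (family == null) {
            throw new IllegalArgumentException("Type is not indexed: " + type.getName());
        }
        return family;
    }

    // Types can be queried "as one" only if they all live in the same family.
    boolean canQueryTogether(Class<?>... types) {
        IndexFamily first = null;
        for (Class<?> type : types) {
            IndexFamily family = familyOf(type);
            if (first == null) {
                first = family;
            } else if (first != family) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch a component such as the document builder could call
familyOf(type).kind() to decide which FieldBridge flavour to use, and
the query layer could reject a FullTextQuery spanning two families
up front.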

I'm not saying that this is all blocking for 5.6. My proposal is to
see if we agree on such a design as a longer term objective (set some
foundation in 5.7, finalize for 6).

For 5.6 I'd be happy enough to essentially document that there's only
one family allowed, which allows us to cut some corners like:
 - single set of Analyzers to validate
 - know that the Search instance is fully using ES exclusively, or
Lucene exclusively
 - know that all IndexManagers are connected to the same set of ES
nodes (if using ES)

So not much changing.. just hope this helps in shaping our internals
with an eye on the next step, and make sure that the listed
limitations which we've been accepting already can be clearly
documented.

It would be great to already have the basics for index families in
place, for example to define the proper API to read metadata for a
type (like Guillaume is needing), and to clean up some things, such as
making the Similarity definition clearly associated to such a thing.

Naming: index family? index groups?
Not sure if there's need to add anything to the configuration
properties; for now it could simply reflect our interpretation of the
existing configuration, yet expose useful and clean metadata to the
internal components which need this.

Thanks for any comments!

Sanne
