[hibernate-dev] HSEARCH: Coexisting of Lucene and Elasticsearch backends vs polymorphism & co

Gunnar Morling gunnar at hibernate.org
Thu Apr 21 03:09:19 EDT 2016


Hey,

+1 for the concept of index families in the long run. As you say, I don't
think there is an immediate need for action as of 5.6.

When you say

> know that the Search instance is fully using ES exclusively, or
> Lucene exclusively

do you mean "Hibernate Search" instance, or just a specific query? It's
possible to use Lucene and ES for different entities already if you know
the limitations. E.g. no queries crossing the family border, no sharding
crossing the family border (which seems questionable anyways, so I think we
could disallow that to begin with).
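
To make the "no queries crossing the family border" limitation concrete, here is a minimal sketch of such a check. Everything in it is hypothetical (the class name, the method, the family names as plain strings); it is not Hibernate Search API and only assumes that each indexed type maps to exactly one family.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Hibernate Search API.
public class FamilyBorderCheck {

    // Returns the single family shared by all target types,
    // or throws if the query would cross a family border.
    public static String commonFamily(Map<Class<?>, String> familyByType, Class<?>... targets) {
        if (targets.length == 0) {
            throw new IllegalArgumentException("At least one target type is required");
        }
        Set<String> families = new HashSet<>();
        for (Class<?> target : targets) {
            String family = familyByType.get(target);
            if (family == null) {
                throw new IllegalArgumentException("Not an indexed type: " + target.getName());
            }
            families.add(family);
        }
        if (families.size() != 1) {
            throw new IllegalArgumentException("Query crosses the family border: " + families);
        }
        return families.iterator().next();
    }
}
```

The same kind of check could be applied to sharding, by collecting the family of every shard of a type instead of the family of every target type.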

How about this: Let's bring support for multiple ES clusters into 5.7,
which should allow us to lay the grounds for "index families" as we'll
learn what needs to be done for different settings per cluster etc.

--Gunnar


2016-04-20 22:21 GMT+02:00 Sanne Grinovero <sanne at hibernate.org>:

> In the context of implementing Elasticsearch support for Hibernate
> Search, there's a recurring need to transform the domain model to the
> "Document" representation using a strategy which depends on the
> storage choice, i.e. Lucene vs Elasticsearch.
>
> For example Guillaume working on HSEARCH-2067 needs to associate the
> entities document builder with a FieldBridge choice which needs to
> know if the output document will be indexed in ES, rather than Lucene.
>
> The choice of FieldBridge implementation affects the DocumentBuilder
> bound to each type; this implies that we're "tainting" the
> DocumentBuilder for all instances of a type.
>
> The abstraction of "IndexManager" is meant to initialize and manage an
> *index* - but remember that there's no guarantee that a single type is
> bound to a single index (and so to a single IndexManager).
>
>  - We have the case of a single type being spread out on multiple
> indexes, using Sharding.
>  - We also have the opposite, of multiple different types sharing an index
>  - Subtypes of indexed types can opt to be indexed in a different index
>  - All of the above can be mixed freely, as there's a clear
> distinction between type (identified by a Class) and index (identified
> by a String)
>
> [I'm not stating that the above facts are necessarily all required,
> just that they are currently supported.. so we could in theory discuss
> taking away some of this flexibility now, but implementing such
> restrictions would need to wait for version 6.0.]
>
> When a Query is run on a type A, we're transparently running the query
> on all indexes or shards containing A, and also its indexed subtypes
> on different indexes. We're also filtering out incompatible types
> transparently, if any of these sub-indexes are shared with other
> types.
> We also allow running a FullTextQuery on multiple, unrelated types and
> the same rules apply.
>
> To perform such a Query on multiple indexes, the trick currently used
> with Lucene based backends is the usage of MultiReaders: we wrap
> multiple indexes and present them as one index reader to the query
> engine, it's a "unified view" on which the query is performed.
>
> For obvious reasons we can not wrap a MultiReader across both Lucene
> indexes and Elasticsearch's query capabilities (or maybe we could
> eventually, but that's a whole lot of R&D to be done for questionable
> usefulness).
>
> So, we need to introduce a new concept: something like "index
> families" to properly abstract the boundaries as clearly some indexes
> can work together better within the same kind than with indexes of
> other kind.
> Stuff indexed in Lucene embedded would belong to a family A, stuff in
> the Elasticsearch cluster would be family B, and I guess one might
> have a secondary independent Elasticsearch cluster which would need to
> be in a different family C, or eventually a Solr cluster in yet
> another separated family.
>
> Such an "index family" would give us:
>  - a place where the connection settings, connection pools are handled
> for Elasticsearch
>  - clear boundaries about which types can be queried "as one": only
> the types in the same family, and subtypes might be allowed a
> different index but it must live in the same family. Same for
> Sharding.
>  - a reasonable place to query for which "kind of storage" is being
> used for a specific type
>  - An Analyzer might exist only within a family (Defined on one ES
> cluster, not on the other)
>  - We have a long standing issue with Similarity: you can only have
> one in a group of indexes, but the group concept is undefined (and
> only loosely validatable)
>  - An "index family" could have a type, therefore define what kind of
> FieldBridge(s) need to be generated
>
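
The list above could be sketched roughly as follows. This is only an illustration under stated assumptions: `IndexFamily`, `Kind` and `storageKindOf` are made-up names, none of them exist in Hibernate Search, and analyzers, Similarity and connection settings are left out to keep it short.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "index family" metadata; not Hibernate Search API.
public class IndexFamilySketch {

    // The kind of storage behind a family; it would drive the FieldBridge choice.
    public enum Kind { LUCENE_EMBEDDED, ELASTICSEARCH }

    public static final class IndexFamily {
        private final String name;
        private final Kind kind;
        private final Set<String> indexNames;

        public IndexFamily(String name, Kind kind, Set<String> indexNames) {
            this.name = name;
            this.kind = kind;
            // Defensive copy: the set of indexes in a family is fixed once built.
            this.indexNames = Collections.unmodifiableSet(new HashSet<>(indexNames));
        }

        public String name() { return name; }

        public Kind kind() { return kind; }

        // Indexes (identified by a String) can only be queried "as one"
        // when they live in the same family.
        public boolean contains(String indexName) { return indexNames.contains(indexName); }
    }

    // Answers "which kind of storage is being used for this type?".
    public static Kind storageKindOf(Class<?> type, Map<Class<?>, IndexFamily> familyByType) {
        IndexFamily family = familyByType.get(type);
        if (family == null) {
            throw new IllegalArgumentException("Not an indexed type: " + type.getName());
        }
        return family.kind();
    }
}
```

A second, independent Elasticsearch cluster would then simply be another `IndexFamily` instance with the same `Kind` but its own name and set of indexes.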
> I'm not saying that this is all blocking for 5.6. My proposal is to
> see if we agree on such a design as a longer term objective (set some
> foundation in 5.7, finalize for 6).
>
> For 5.6 I'd be happy enough to essentially document that there's only
> one family allowed, which allows us to cut some corners like:
>  - single set of Analyzers to validate
>  - know that the Search instance is fully using ES exclusively, or
> Lucene exclusively
>  - know that all IndexManagers are connected to the same set of ES
> nodes (if using ES)
>
> So not much changing.. just hope this helps in shaping our internals
> with an eye on the next step, and make sure that the listed
> limitations which we've been accepting already can be clearly
> documented.
>
> It would be great to already have the basics for index families in
> place, for example to define the proper API to read metadata for a
> type (like Guillaume is needing), and to clean up some things, such as
> make the Similarity definition clearly associated to such a thing.
>
> Naming: index family ? index groups?
> Not sure if there's need to add anything to the configuration
> properties; for now it could simply reflect our interpretation of the
> existing configuration, yet expose useful and clean metadata to the
> internal components which need this.
>
> Thanks for any comments!
>
> Sanne

