[hibernate-dev] HSEARCH: Coexisting of Lucene and Elasticsearch backends vs polymorphism & co

Sanne Grinovero sanne at hibernate.org
Thu Apr 21 07:04:24 EDT 2016


On 21 April 2016 at 09:09, Gunnar Morling <gunnar at hibernate.org> wrote:
> Hey,
>
> +1 for the concept of index families in the long run. As you say, I don't
> think there is an immediate need for action as of 5.6.

Cool. Right, no immediate changes but it should allow us to better
understand which kind of limitations we aim at "by design", and to
clearly express them in docs.

I do expect it to drive some of the new SPIs though, e.g.

 - https://github.com/hibernate/hibernate-search/pull/1068#issuecomment-212591419

>
> When you say
>
>> know that the Search instance is fully using ES exclusively, or
>> Lucene exclusively
>
> do you mean "Hibernate Search" instance, or just a specific query? It's
> possible to use Lucene and ES for different entities already if you know the
> limitations. E.g. no queries crossing the family border, no sharding
> crossing the family border (which seems questionable anyways, so I think we
> could disallow that to begin with).

I meant the whole instance. We took some shortcuts, like the
DateBridge being affected (even though Guillaume solved it), but also
e.g. for Analyzer validation.

These limitations will get relaxed while we work, but will never be
able to "cross families": in terms of defining an MVP for 5.6 I think
we can aim at one type.
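
To make the "one type for 5.6" rule concrete, here is a minimal sketch
of a bootstrap-time check. All names in it (SingleFamilyCheck,
BackendKind, requireSingleKind) are made up for illustration and are not
existing Hibernate Search API; it only assumes we can list the backend
kind of every configured IndexManager.

```java
import java.util.Collection;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch: none of these names exist in Hibernate Search. */
final class SingleFamilyCheck {

    /** Assumed classification of the storage behind an IndexManager. */
    enum BackendKind { LUCENE, ELASTICSEARCH }

    private SingleFamilyCheck() {
    }

    /**
     * Returns the one backend kind used by all index managers, or fails
     * if the configuration mixes kinds (not supported in this sketch).
     */
    static BackendKind requireSingleKind(Collection<BackendKind> kindPerIndexManager) {
        if (kindPerIndexManager.isEmpty()) {
            throw new IllegalArgumentException("No index managers configured");
        }
        // Collapse duplicates: we only care about the distinct kinds in use.
        Set<BackendKind> distinct = EnumSet.copyOf(kindPerIndexManager);
        if (distinct.size() > 1) {
            throw new IllegalStateException(
                    "Mixed backends are not supported by this sketch: " + distinct);
        }
        return distinct.iterator().next();
    }
}
```

With such a check in place, things like Analyzer validation could rely
on the returned kind for the whole instance.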

>
> How about this: Let's bring support for multiple ES clusters into 5.7, which
> should allow us to lay the grounds for "index families" as we'll learn what
> needs to be done for different settings per cluster etc.

Exactly.
But also I'd move the option of Lucene embedded + ES to 5.7. Not to
say that we shouldn't explore these, just not considering them
blockers.

Thanks,
Sanne

>
> --Gunnar
>
>
> 2016-04-20 22:21 GMT+02:00 Sanne Grinovero <sanne at hibernate.org>:
>>
>> In the context of implementing Elasticsearch support for Hibernate
>> Search, there's a recurring need to transform the domain model to the
>> "Document" representation using a strategy which depends on the
>> storage choice, i.e. Lucene vs Elasticsearch.
>>
>> For example Guillaume, working on HSEARCH-2067, needs to associate the
>> entity's document builder with a FieldBridge choice which needs to
>> know if the output document will be indexed in ES, rather than Lucene.
>>
>> The choice of FieldBridge implementation affects the DocumentBuilder
>> bound to each type; this implies that we're "tainting" the
>> DocumentBuilder for all instances of a type.
>>
>> The abstraction of "IndexManager" is meant to initialize and manage an
>> *index* - but remember that there's no guarantee that a single type is
>> bound to a single index (and so to a single IndexManager).
>>
>>  - We have the case of a single type being spread out on multiple
>> indexes, using Sharding.
>>  - We also have the opposite, of multiple different types sharing an
>> index
>>  - Subtypes of indexed types can opt to be indexed in a different index
>>  - All of the above can be mixed freely, as there's a clear
>> distinction between type (identified by a Class) and index (identified
>> by a String)
>>
>> [I'm not stating that the above facts are necessarily all required,
>> just that they are currently supported.. so we could in theory discuss
>> taking away some of this flexibility now, but implementing such
>> restrictions would need to wait for version 6.0.]
>>
>> When a Query is run on a type A, we're transparently running the query
>> on all indexes or shards containing A, and also its indexed subtypes
>> on different indexes. We're also filtering out incompatible types
>> transparently, if any of these sub-indexes are shared with other
>> types.
>> We also allow running a FullTextQuery on multiple, unrelated types and
>> the same rules apply.
>>
>> To perform such a Query on multiple indexes, the trick currently used
>> with Lucene based backends is the usage of MultiReaders: we wrap
>> multiple indexes and present them as one index reader to the query
>> engine, it's a "unified view" on which the query is performed.
>>
>> For obvious reasons we can not wrap a MultiReader across both Lucene
>> indexes and Elasticsearch's query capabilities (or maybe we could
>> eventually, but that's a whole lot of R&D to be done for questionable
>> usefulness).
>>
>> So, we need to introduce a new concept: something like "index
>> families" to properly abstract the boundaries, as clearly some indexes
>> can work together better within the same kind than with indexes of
>> another kind.
>> Stuff indexed in Lucene embedded would belong to a family A, stuff in
>> the Elasticsearch cluster would be family B, and I guess one might
>> have a secondary independent Elasticsearch cluster which would need to
>> be in a different family C, or eventually a Solr cluster in yet
>> another separate family.
>>
>> Such an "index family" would give us:
>>  - a place where the connection settings and connection pools are
>> handled for Elasticsearch
>>  - clear boundaries about which types can be queried "as one": only
>> the types in the same family, and subtypes might be allowed a
>> different index but it must live in the same family. Same for
>> Sharding.
>>  - a reasonable place to query for which "kind of storage" is being
>> used for a specific type
>>  - An Analyzer might exist only within a family (Defined on one ES
>> cluster, not on the other)
>>  - We have a long standing issue with Similarity: you can only have
>> one in a group of indexes, but the group concept is undefined (and
>> only loosely validatable)
>>  - An "index family" could have a type, and therefore define what kind
>> of FieldBridge(s) need to be generated
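
The points in the list above could be modelled roughly as follows. This
is only a sketch under assumptions: IndexFamily, FamilyType,
FamilyRegistry and their methods are hypothetical names invented here,
not existing Hibernate Search types, and connection settings, Analyzers
and Similarity are left out to keep it short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: none of these names exist in Hibernate Search. */
final class IndexFamilySketch {

    /** The "kind of storage" behind a family. */
    enum FamilyType { LUCENE_EMBEDDED, ELASTICSEARCH }

    /** One family: a name plus the storage kind shared by all its indexes. */
    static final class IndexFamily {
        final String name;
        final FamilyType type;

        IndexFamily(String name, FamilyType type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Binds each indexed type to the single family its indexes live in. */
    static final class FamilyRegistry {
        private final Map<Class<?>, IndexFamily> familyByType = new HashMap<>();

        void bind(Class<?> type, IndexFamily family) {
            IndexFamily previous = familyByType.putIfAbsent(type, family);
            if (previous != null && previous != family) {
                // A type (with its shards) must not span two families.
                throw new IllegalStateException(type.getName()
                        + " is already bound to family " + previous.name);
            }
        }

        /** Answers "which kind of storage is used for this type?". */
        FamilyType storageKindOf(Class<?> type) {
            IndexFamily family = familyByType.get(type);
            if (family == null) {
                throw new IllegalArgumentException("Not an indexed type: " + type.getName());
            }
            return family.type;
        }

        /** True only if all given types are bound to the same family. */
        boolean canQueryAsOne(Class<?>... types) {
            IndexFamily first = null;
            for (Class<?> type : types) {
                IndexFamily family = familyByType.get(type);
                if (family == null) {
                    return false;
                }
                if (first == null) {
                    first = family;
                } else if (first != family) {
                    return false;
                }
            }
            return first != null;
        }
    }
}
```

A query over several types would call canQueryAsOne(...) up front, and a
component choosing a FieldBridge could consult storageKindOf(...).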
>>
>> I'm not saying that this is all blocking for 5.6. My proposal is to
>> see if we agree on such a design as a longer term objective (set some
>> foundation in 5.7, finalize for 6).
>>
>> For 5.6 I'd be happy enough to essentially document that there's only
>> one family allowed, which allows us to cut some corners like:
>>  - single set of Analyzers to validate
>>  - know that the Search instance is fully using ES exclusively, or
>> Lucene exclusively
>>  - know that all IndexManagers are connected to the same set of ES
>> nodes (if using ES)
>>
>> So not much changing.. just hope this helps in shaping our internals
>> with an eye on the next step, and make sure that the listed
>> limitations which we've been accepting already can be clearly
>> documented.
>>
>> It would be great to already have the basics for index families in
>> place, for example to define the proper API to read metadata for a
>> type (like Guillaume is needing), and to clean up some things, such as
>> making the Similarity definition clearly associated with a family.
>>
>> Naming: index family? index groups?
>> Not sure if there's need to add anything to the configuration
>> properties; for now it could simply reflect our interpretation of the
>> existing configuration, yet expose useful and clean metadata to the
>> internal components which need this.
>>
>> Thanks for any comments!
>>
>> Sanne
>> _______________________________________________
>> hibernate-dev mailing list
>> hibernate-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>
>

