Shaping the requirements for the new DocumentBuilder to come in Hibernate Search 6

Thursday, 17 November 2016

Hi all,
among the various plans for Hibernate Search 6, one of the reasons we
did the Elasticsearch integration early, as an experimental feature,
was to get a clearer picture of what internal cleanup is going to be
needed.

Our DocumentBuilder is ancient, and several new features have been
added since it was a well-designed, simple piece of code.

So, while we have discussed several wishes already, I have now started a
document to try to get all our thoughts to converge.

For convenience, pasting the current content below.
 - https://docs.google.com/document/d/1JwKanIRHVTw1LvCdLGyY6EKuyvvQn6gvlkmPG...

I'm not giving comment permissions to the world; anyone who's
interested please answer here or drop me a note, happy to give
permissions to comment to well-intentioned people.

N.B. The document will very likely evolve beyond this email; as it is
now it's an initial brain dump. For example, I haven't thought about
the ES capability of nesting structures yet.

Thanks,
Sanne

==== Pasting from document =====

DocumentBuilder and FieldBridge requirements for Hibernate Search 6.0

* Never import Lucene types; ideally make Lucene a dependency of the
Lucene backend only.
   * In a modular world, don’t expect end user code to be able to load
Lucene class definitions.
* Efficient lookup from “field name” to the field mapping and its
indexing options; not least:
   * Cardinality {always one, optional one, one-many, zero-many}
      * Needed for validation of queries, e.g. query for null can use
an “exists” query only in some of these cases, vs needing a null
token.
   * projectable alone vs part of multiple fields relating to a single
property (allow projection of Two-Way bridges using multiple fields)
   * Might need “group name”.”sub field name” for groups and index time joins
* IndexedEmbedded
   * “depth” and navigational graph to be pre-computed: tree of valid
fields and options to be known in advance.
   * Navigating into a relation must deal with possibly navigating
into subclasses of the relation type:
http://stackoverflow.com/questions/39516355/indexing-a-interface-in-hiber...
* Immutable, threadsafe, easy to inspect/walk mapping tree
   * Built and validated at bootstrap of the IndexManager
      * can’t be updated after that
   * Field names and custom FieldType not to be allocated at runtime
   * Efficient to validate Queries
   * Allow efficient production of an Entity instance into:
      * Elasticsearch “document”
      * Lucene “document”
      * An efficient-to-serialize “document”
         * If it gets easy enough, make our own simple serialization?
      * Extensible to other backends e.g. Apache Solr in the future (a
Walkable SPI)
      * Pretty printed text to dump the “schema” we’re using from a
given domain model
   * Validations and comparisons
      * Allow to validate compatibility with an Elasticsearch schema
      * Allow to validate compatibility with a Lucene schema
   * Walking tree to map to ORM loading strategies
      * allow to predict which paths we’ll need to initialize
(database load) for efficient batch loading (graph initialization)
      * Allow for accurate Dirty-checking to skip indexing operations
      * Allow generation of better MassIndexer queries (fetch join
some of the relations?)
* ID handling: specific care
   * ad-hoc encoders for ID
   * stricter validation (e.g. cardinality, DocValues, Two-Way fieldbridges)
   * Support multi-term IDs (composite keys, @IdClass)
   * Have different “index id strategies” to have them apply different
logic, i.e. “delete by term” and “update by term” only apply on
single-term IDs.
   * ID handling strategy might need to take into account if the index
is shared among types.
* Decoupling from Java “Class” as entity-type identifiers
* Sharding:
   * Allow reuse of the same schema for indexes using the same
      * Allow reuse of some elements for indexes sharing such elements
* Properties / Field relations
   * Handle one property -> multiple Fields as a bidirectional relation.
   * Disallow one index field being target of different properties
and/or bridges?
* Representation of “Join points” and Groups:
   * allow future production of Lucene documents with index-time join
(write in groups)
   * allow efficient Query validation for both index-time and
query-time join options
* Composable
   * @ClassBridge, @Field annotations to both contribute to field definitions
   * a @ClassBridge of an @IndexedEmbedded to both contribute to the
embedded field definitions
   * Include type-bound user custom Bridges (see BridgeProvider) in
the compositions
   * Both @ClassBridge and custom Bridges need to trigger on
polymorphic relations as well
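
To make a few of the points above more concrete (immutable mapping
tree, lookup by field name, cardinality-driven query validation, and
dumping the “schema” as text), here is a minimal sketch. Every name in
it (MappingSketch, FieldNode, NullQueryStrategy, index, describe) is
hypothetical and not part of any Hibernate Search API. Which
cardinalities can use an “exists” query rather than a null token is
also an assumption made for illustration; the list above only says it
depends on the cardinality.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Hibernate Search.
final class MappingSketch {

    /** The four cardinalities named in the requirements list. */
    enum Cardinality { ALWAYS_ONE, OPTIONAL_ONE, ONE_MANY, ZERO_MANY }

    /** How a "is null" query could be executed for a field. */
    enum NullQueryStrategy { NOT_APPLICABLE, EXISTS_QUERY, NULL_TOKEN }

    /** Immutable node of the mapping tree: a field and its embedded fields. */
    static final class FieldNode {
        final String name;
        final Cardinality cardinality;
        final Map<String, FieldNode> children;

        FieldNode(String name, Cardinality cardinality, FieldNode... children) {
            this.name = name;
            this.cardinality = cardinality;
            Map<String, FieldNode> copy = new LinkedHashMap<>();
            for (FieldNode child : children) {
                copy.put(child.name, child);
            }
            // Unmodifiable view over a private copy: fixed after bootstrap.
            this.children = Collections.unmodifiableMap(copy);
        }
    }

    /** Flattens the tree once into "dotted.path" -> node for cheap lookups. */
    static Map<String, FieldNode> index(FieldNode root) {
        Map<String, FieldNode> flat = new LinkedHashMap<>();
        collect("", root, flat);
        return Collections.unmodifiableMap(flat);
    }

    private static void collect(String prefix, FieldNode node, Map<String, FieldNode> out) {
        for (FieldNode child : node.children.values()) {
            String path = prefix.isEmpty() ? child.name : prefix + "." + child.name;
            out.put(path, child);
            collect(path, child, out);
        }
    }

    /** ASSUMPTION: this particular split is illustrative, not from the document. */
    static NullQueryStrategy nullQueryStrategy(Cardinality cardinality) {
        switch (cardinality) {
            case ALWAYS_ONE:
                return NullQueryStrategy.NOT_APPLICABLE; // value is never missing
            case OPTIONAL_ONE:
                return NullQueryStrategy.EXISTS_QUERY;   // missing field means null
            default:
                return NullQueryStrategy.NULL_TOKEN;     // multi-valued: mark nulls explicitly
        }
    }

    /** Pretty-prints the tree, one field per line, indented by depth. */
    static String describe(FieldNode node, String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append(node.name)
          .append(" [").append(node.cardinality).append("]\n");
        for (FieldNode child : node.children.values()) {
            sb.append(describe(child, indent + "  "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FieldNode book = new FieldNode("Book", Cardinality.ALWAYS_ONE,
                new FieldNode("title", Cardinality.ALWAYS_ONE),
                new FieldNode("subtitle", Cardinality.OPTIONAL_ONE),
                new FieldNode("authors", Cardinality.ZERO_MANY,
                        new FieldNode("name", Cardinality.ALWAYS_ONE)));
        System.out.print(describe(book, ""));
        Map<String, FieldNode> byPath = index(book);
        System.out.println(nullQueryStrategy(byPath.get("subtitle").cardinality));
    }
}
```

Because the flat index is computed once and the nodes never change,
query validation can be a plain map lookup with no locking; a backend
(Lucene, Elasticsearch, or a text dump as in describe) would walk the
same tree without needing any Lucene types.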
