Re: [hibernate-dev] DocumentBuilder refactoring in Hibernate Search: how to deal (internally) with metadata

Thursday, 30 May 2013

Gee, that's an email ;-)
Before getting too much into it, I think it would be useful to talk about what I am
actually doing.
I am trying to expose a metadata API for Search which allows users to determine which
entities are indexed and which fields are available for each entity. I am taking an
approach similar to Bean Validation, where all metadata is exposed via descriptors.
The entry point into the API is the SearchFactory. I am basically thinking about
something like this (feedback welcome):

/**
 * Top-level descriptor of the metadata API, giving access to the indexing information for
 * a single entity.
 *
 * @author Hardy Ferentschik
 */
public interface IndexedEntityDescriptor {
	/**
	 * @return Returns {@code true} if the entity for this descriptor is indexed,
	 *         {@code false} otherwise
	 */
	boolean isIndexed();

	/**
	 * @return Returns the class boost value, 1 being the default.
	 */
	float getClassBoost();

	/**
	 * @return Returns the names of the indexes into which instances of the entity are indexed.
	 *         Generally this will be just one index; however, when sharding is applied,
	 *         multiple indexes per entity can be used.
	 */
	Set<String> getIndexNames();

	/**
	 * @return Returns a set of {@code FieldDescriptor}s for the indexed fields of the entity.
	 */
	// TODO does this include the id field descriptor or should that be a separate descriptor?
	// TODO should OBJECT_CLASS be considered?
	Set<FieldDescriptor> getIndexedFields();
}

/**
 * Metadata related to a single indexed field.
 *
 * @author Hardy Ferentschik
 */
public interface FieldDescriptor {
	/**
	 * Returns the Lucene {@code Document} field name for this indexed property.
	 *
	 * @return Returns the field name for this index property
	 */
	String getFieldName();

	/**
	 * @return Returns an {@code Analyze} enum instance defining the type of analyzing
	 *         applied to this field.
	 */
	Analyze getAnalyzeType();

	/**
	 * @return Returns a {@code Store} enum instance defining whether the index value is
	 *         stored in the index itself.
	 */
	Store getStoreType();

	/**
	 * @return Returns a {@code TermVector} enum instance defining whether and how term
	 *         vectors are stored for this field
	 */
	TermVector getTermVectorType();

	/**
	 * @return Returns a {@code Norms} enum instance defining whether and how norms are
	 *         stored for this field
	 */
	Norms getNormType();

	/**
	 * @return Returns the boost value for this field. 1 being the default value.
	 */
	float getBoost();

	/**
	 * @return Returns the string used to index {@code null} values. {@code null} in case
	 *         null values are not indexed.
	 */
	String nullIndexedAs();

	/**
	 * @return Returns the field bridge instance used to convert the property value into a
	 *         string based field value
	 */
	FieldBridge getFieldBridge();

	/**
	 * @return Returns the analyzer used for this field, {@code null} if the field is not
	 *         analyzed
	 */
	Analyzer getAnalyzer();
}

On top of this I am planning to add (addressing HSEARCH-903):

public interface FieldNameReportingBridge {
	Iterable<String> getGeneratedFieldNames(String baseFieldName);
}

I need the latter to allow custom bridges to report which fields they add.
Most of the information I need to implement all this is in
AbstractDocumentBuilder.PropertiesMetadata. The plan so far was to extract the
information from there and, while working on this, to make PropertiesMetadata a proper
object (instead of the parallel arrays thingy), maybe with some other minor refactorings
along the way. I was not going to touch the processing of annotations for now. As
discussed, there we would need yet another level of abstraction (similar to EntitySource
in ORM or BeanConfiguration in HV), something which can be populated either by annotation
processing (be it Jandex or reflection) or by the programmatic API.
Different story though.
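
To make the "parallel arrays" remark concrete, here is a minimal sketch of the kind of refactoring meant. All names in it (ParallelArraysMetadata, FieldMetadata, MetadataConverter and their members) are hypothetical and chosen for illustration only; they are not the actual members of PropertiesMetadata.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical "before": one list per attribute, correlated only by position.
class ParallelArraysMetadata {
	final List<String> fieldNames = new ArrayList<String>();
	final List<Float> fieldBoosts = new ArrayList<Float>();
	final List<String> fieldNullTokens = new ArrayList<String>();
}

// Hypothetical "after": one immutable object per indexed field.
final class FieldMetadata {
	private final String name;
	private final float boost;
	private final String nullToken;

	FieldMetadata(String name, float boost, String nullToken) {
		this.name = name;
		this.boost = boost;
		this.nullToken = nullToken;
	}

	String getName() {
		return name;
	}

	float getBoost() {
		return boost;
	}

	// null means that null values are not indexed for this field
	String getNullToken() {
		return nullToken;
	}
}

final class MetadataConverter {
	// Builds a read-only list of per-field objects from the index-correlated lists.
	static List<FieldMetadata> toFieldMetadata(ParallelArraysMetadata source) {
		List<FieldMetadata> result = new ArrayList<FieldMetadata>( source.fieldNames.size() );
		for ( int i = 0; i < source.fieldNames.size(); i++ ) {
			result.add(
					new FieldMetadata(
							source.fieldNames.get( i ),
							source.fieldBoosts.get( i ),
							source.fieldNullTokens.get( i )
					)
			);
		}
		return Collections.unmodifiableList( result );
	}
}
```

With one object per field, a descriptor such as FieldDescriptor could presumably be a thin read-only view over it, rather than something that has to look up several lists by index.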

From what I can tell, I don't need a Visitor pattern for what I have planned to do so
far. If you think I am on the wrong track, let me know and let me see the light.

One thing I was wondering about after your email, however, was whether the API needs to
provide information about which field/getter/class is responsible for creating a given
Lucene Document Field. Do we have a use case for that?

On 29 Jan 2013, at 6:39 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...
 We're starting a series of refactorings in Hibernate Search to improve
 how we handle the entity mapping to the index; to summarize goals:

 1# Expose the Metadata as API

 We need to expose it because:
 a - OGM needs to be able to read this metadata to produce appropriate queries 
@gunnar, does the API above address your needs?

...
 Personally I think we end up needing this just as an SPI: that might
 be good for cases {a,b}, and I have an alternative proposal for {c}
 described below.
-1. Why SPI? I think this is a very general-purpose API, useful for any user.
For example, you could imagine building an auto-suggesting query field which
makes suggestions on which fields you can search on (a little like the Jira queries).
In this case you could get the available fields via this API. Just to mention one use
case.
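
As a rough illustration of that use case, the sketch below suggests field names for a typed prefix. The two interfaces are trimmed-down stand-ins for the descriptors proposed above, reduced to the one method each that the sketch needs. FieldSuggester and its suggest method are hypothetical names, and how a descriptor would be obtained from the SearchFactory is deliberately left out, since that part of the API is not defined yet.

```java
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Trimmed-down stand-in for the proposed FieldDescriptor.
interface FieldDescriptor {
	String getFieldName();
}

// Trimmed-down stand-in for the proposed IndexedEntityDescriptor.
interface IndexedEntityDescriptor {
	Set<FieldDescriptor> getIndexedFields();
}

// Hypothetical helper: not part of the proposal, only a possible consumer of it.
final class FieldSuggester {
	// Returns, in alphabetical order, all indexed field names starting with the given prefix.
	static SortedSet<String> suggest(IndexedEntityDescriptor descriptor, String prefix) {
		SortedSet<String> suggestions = new TreeSet<String>();
		for ( FieldDescriptor field : descriptor.getIndexedFields() ) {
			if ( field.getFieldName().startsWith( prefix ) ) {
				suggestions.add( field.getFieldName() );
			}
		}
		return suggestions;
	}
}
```

A query UI could call such a helper on every keystroke. Nothing in it needs more than read access to the metadata, which is an argument for treating the descriptors as a public API.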

...
  However we expose it, I think we agree this should be a read-only
 structure built as a second phase after the model is consumed from
 (annotations / programmatic API / jandex / auto-generated by OGM). 
+1

...
 It would also be good to keep it "minimal" in terms of memory cost, so to
 either:
 - drop references to the source structure
 - not holding on it at all, building the Metadata on demand (!)
 (Assuming we can build it from a more obscure internal representation
 I'll describe next). 
Given that I am going to build it from the required runtime information, it could for
sure be lazily loaded. However, right now I think I will just go for the straightforward
approach.
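
For reference, lazy loading here could be as simple as the following sketch: the metadata is built on first access and cached afterwards. LazyMetadata is a hypothetical helper, not an existing Search class, and the sketch assumes the builder never returns null.

```java
import java.util.function.Supplier;

// Hypothetical holder which builds its value on first access and caches it afterwards.
final class LazyMetadata<T> {
	private final Supplier<T> builder;
	private T value; // guarded by "this"; null means "not built yet"

	LazyMetadata(Supplier<T> builder) {
		this.builder = builder;
	}

	// Synchronized so that concurrent callers trigger at most one build.
	synchronized T get() {
		if ( value == null ) {
			value = builder.get();
		}
		return value;
	}
}
```

The trade-off is the usual one: no memory is spent on descriptors nobody asks for, at the price of a synchronized access path and a first call that pays for the build.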

...
 3# MutableSearchFactory

 Let's not forget we also have a MutableSearchFactory to maintain: new
 entities could be added at any time so if we drop the original
 metadata we need to be able to build a new (read-only) one from the
 current state. 
Good point

...
 Things we wanted but were too hard to do so far:
 - Separate annotation reading from Document building. Separate
 validity checks too. 
+1. See above. I want to address this in another issue. We will need another intermediate
model for that. With this in place we can remove commons-annotations and easily
consume a Jandex index as well.

...
 - It checks for JPA @Id using reflection as it might not be available
 -> pluggable?
Not sure what you mean here. That's just a very specific JPA/ORM based use case.

...
 - LuceneOptionsImpl are built at runtime each time we need one ->
 reuse them, coupling them to their field 
+1

...
  - We need a reliable way to track which field names are created, and
 from which bridge they are originating (including custom bridges:
 HSEARCH-904)
See above and the FieldNameReportingBridge I am suggesting
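
To illustrate how a custom bridge might report its fields, here is a minimal sketch. The interface is repeated from the proposal above so that the example compiles on its own. DateSplitBridge and the ".year"/".month"/".day" suffixes are hypothetical. A real bridge would also implement FieldBridge and add exactly these fields to the Document, which is omitted here to keep the sketch free of dependencies.

```java
import java.util.Arrays;

// Repeated from the proposal above.
interface FieldNameReportingBridge {
	Iterable<String> getGeneratedFieldNames(String baseFieldName);
}

// Hypothetical custom bridge which would index a date as three separate sub-fields.
final class DateSplitBridge implements FieldNameReportingBridge {
	@Override
	public Iterable<String> getGeneratedFieldNames(String baseFieldName) {
		// Must list the same names the bridge uses when it writes to the Document.
		return Arrays.asList(
				baseFieldName + ".year",
				baseFieldName + ".month",
				baseFieldName + ".day"
		);
	}
}
```

The metadata layer could then ask every bridge implementing this interface for its field names and register one FieldDescriptor per reported name.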

...
 == Solution ? ==

 Now let's assume that we can build this as a recursive structure which
 accepts a generic visitor. … 
That's where you lose me. I think I am a little like Emmanuel here. Where does a
Visitor pattern help here?

--Hardy
