[hibernate-dev] DocumentBuilder refactoring in Hibernate Search: how to deal (internally) with metadata

Thu May 30 08:32:02 EDT 2013

Gee, that's an email ;-)
Before getting too much into it, I think it would be useful to explain what I am actually doing.
I am trying to expose a metadata API for Search which allows users to determine which entities are
indexed and which fields are available for each entity. I am taking an approach similar to
Bean Validation, where all metadata is exposed via descriptors. The entry point into the API is the
SearchFactory. I am basically thinking about something like this (feedback welcome):

/**
 * Top-level descriptor of the metadata API, giving access to the indexing information for a single entity.
 *
 * @author Hardy Ferentschik
 */
public interface IndexedEntityDescriptor {
	/**
	 * @return Returns {@code true} if the entity for this descriptor is indexed, {@code false} otherwise
	 */
	boolean isIndexed();

	/**
	 * @return Returns the class boost value, 1 being the default.
	 */
	float getClassBoost();

	/**
	 * @return Returns the names of the indexes into which instances of the entity are indexed. Generally this
	 *         will be just one index; however, when sharding is applied, multiple indexes per entity can be used.
	 */
	Set<String> getIndexNames();

	/**
	 * @return Returns a set of {@code FieldDescriptor}s for the indexed fields of the entity.
	 */
	// TODO does this include the id field descriptor or should that be a separate descriptor?
	// TODO should OBJECT_CLASS be considered?
	Set<FieldDescriptor> getIndexedFields();
}

/**
 * Metadata related to a single indexed field.
 *
 * @author Hardy Ferentschik
 */
public interface FieldDescriptor {
	/**
	 * Returns the Lucene {@code Document} field name for this indexed property.
	 *
	 * @return Returns the field name for this index property
	 */
	String getFieldName();

	/**
	 * @return Returns an {@code Analyze} enum instance defining the type of analyzing applied to
	 *         this field.
	 */
	Analyze getAnalyzeType();

	/**
	 * @return Returns a {@code Store} enum instance defining whether the index value is stored in the index itself.
	 */
	Store getStoreType();

	/**
	 * @return Returns a {@code TermVector} enum instance defining whether and how term vectors are stored for this
	 *         field
	 */
	TermVector getTermVectorType();

	/**
	 * @return Returns a {@code Norms} enum instance defining whether and how norms are stored for this
	 *         field
	 */
	Norms getNormType();

	/**
	 * @return Returns the boost value for this field. 1 being the default value.
	 */
	float getBoost();

	/**
	 * @return Returns the string used to index {@code null} values. {@code null} in case null values are not indexed.
	 */
	String nullIndexedAs();

	/**
	 * @return Returns the field bridge instance used to convert the property value into a string based field value
	 */
	FieldBridge getFieldBridge();

	/**
	 * @return Returns the analyzer used for this field, {@code null} if the field is not analyzed
	 */
	Analyzer getAnalyzer();
}

On top of this I am planning to add (addressing HSEARCH-903):

public interface FieldNameReportingBridge {
	Iterable<String> getGeneratedFieldNames(String baseFieldName);
}

The latter I need to allow custom bridges to report which fields they add. 
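
To make the idea concrete, here is a minimal sketch of how a custom bridge might implement the proposed interface. The DateSplitBridge class and its "year"/"month"/"day" sub-fields are purely hypothetical, and the actual FieldBridge part (writing values into the Lucene Document) is left out so that the sketch has no Lucene dependency:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Interface as proposed in this mail.
interface FieldNameReportingBridge {
	Iterable<String> getGeneratedFieldNames(String baseFieldName);
}

// Hypothetical custom bridge which would index a date as three separate fields.
// Only the reporting side is shown; the indexing logic itself is omitted.
class DateSplitBridge implements FieldNameReportingBridge {
	private static final List<String> SUFFIXES = Arrays.asList( "year", "month", "day" );

	@Override
	public Iterable<String> getGeneratedFieldNames(String baseFieldName) {
		// Report one field name per sub-field, derived from the base field name.
		List<String> names = new ArrayList<String>();
		for ( String suffix : SUFFIXES ) {
			names.add( baseFieldName + "." + suffix );
		}
		return names;
	}
}
```

With something like this, the metadata API could ask each bridge for its field names instead of guessing what a custom bridge adds to the Document.
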
Most of the information I need to implement all this is in AbstractDocumentBuilder.PropertiesMetadata. The plan so far
was to extract the information from there and, while working on this, to make PropertiesMetadata a proper object (instead of the
parallel arrays thingy), with maybe some other minor refactorings along the way. I was not going to touch the processing of annotations
for now. As discussed, for that we would need yet another level of abstraction (similar to EntitySource in ORM or BeanConfiguration
in HV): something which can be populated either by annotation processing (be it Jandex or reflection) or by the programmatic API.
Different story though.
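
As a rough illustration of the "proper object instead of parallel arrays" refactoring, the sketch below contrasts the two shapes. All class and member names here (ParallelArraysMetadata, FieldMetadata, MetadataConverter, and the three attributes) are made up for the example and do not mirror the real PropertiesMetadata internals:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical "before" shape: one list per attribute, correlated only by position.
class ParallelArraysMetadata {
	final List<String> fieldNames = new ArrayList<String>();
	final List<Float> fieldBoosts = new ArrayList<Float>();
	final List<Boolean> fieldStored = new ArrayList<Boolean>();
}

// Hypothetical "after" shape: everything known about one field lives in one immutable object.
final class FieldMetadata {
	private final String name;
	private final float boost;
	private final boolean stored;

	FieldMetadata(String name, float boost, boolean stored) {
		this.name = name;
		this.boost = boost;
		this.stored = stored;
	}

	String getName() {
		return name;
	}

	float getBoost() {
		return boost;
	}

	boolean isStored() {
		return stored;
	}
}

final class MetadataConverter {
	// Collapses the position-correlated lists into a read-only list of per-field objects.
	static List<FieldMetadata> toFieldMetadata(ParallelArraysMetadata source) {
		List<FieldMetadata> result = new ArrayList<FieldMetadata>();
		for ( int i = 0; i < source.fieldNames.size(); i++ ) {
			result.add(
					new FieldMetadata(
							source.fieldNames.get( i ),
							source.fieldBoosts.get( i ),
							source.fieldStored.get( i )
					)
			);
		}
		return Collections.unmodifiableList( result );
	}
}
```

The per-field object is also a natural place to hang a FieldDescriptor implementation, since each descriptor only needs the data of a single field.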

From what I can tell, I don't need a Visitor pattern for what I have planned so far. If you think I am on the wrong track, let me know
and let me see the light.

One thing I was wondering about after your email, however, was whether the API needs to provide information about which field/getter/class
is responsible for creating a given Lucene Document Field. Do we have a use case for that?

On 29 Jan 2013, at 6:39 PM, Sanne Grinovero <sanne at hibernate.org> wrote:

> We're starting a series of refactorings in Hibernate Search to improve
> how we handle the entity mapping to the index; to summarize goals:
> 
> 1# Expose the Metadata as API
> 
> We need to expose it because:
> a - OGM needs to be able to read this metadata to produce appropriate queries

@gunnar, does the API above address your needs?

>  Personally I think we end up needing this just as an SPI: that might
> be good for cases {a,b}, and I have an alternative proposal for {c}
> described below.

-1. Why an SPI? I think this is a very general-purpose API, useful for any user.
For example, you could imagine building an auto-suggesting query field which
suggests the fields you can search on (a little like the Jira queries).
In this case you could get the available fields via this API. Just to mention one use case.
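
A minimal sketch of that use case, assuming the field descriptors have already been obtained from the metadata API. The FieldDescriptor below is a reduced stand-in carrying only getFieldName() from the proposal, and FieldSuggester is a hypothetical helper, not part of any existing API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Reduced stand-in for the proposed FieldDescriptor; only the field name is needed here.
interface FieldDescriptor {
	String getFieldName();
}

// Hypothetical helper backing an auto-suggesting query field.
final class FieldSuggester {
	// Returns the names of all fields starting with the given prefix (case-insensitive), sorted.
	static List<String> suggest(Collection<? extends FieldDescriptor> fields, String prefix) {
		String normalizedPrefix = prefix.toLowerCase( Locale.ROOT );
		List<String> matches = new ArrayList<String>();
		for ( FieldDescriptor field : fields ) {
			String name = field.getFieldName();
			if ( name.toLowerCase( Locale.ROOT ).startsWith( normalizedPrefix ) ) {
				matches.add( name );
			}
		}
		Collections.sort( matches );
		return matches;
	}
}
```

A caller would pass in the result of getIndexedFields() for the entity being queried, together with whatever the user has typed so far.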

>  However we expose it, I think we agree this should be a read-only
> structure built as a second phase after the model is consumed from
> (annotations / programmatic API / jandex / auto-generated by OGM).

+1

> It
> would also be good to keep it "minimal" in terms of memory cost, so to
> either:
> - drop references to the source structure
> - not holding on it at all, building the Metadata on demand (!)
> (Assuming we can build it from a more obscure internal representation
> I'll describe next).

Given that I am going to build it from the required runtime information, it could certainly
be lazily loaded. However, right now I think I will just go for the straightforward approach.
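
For reference, the lazy variant could be as small as the sketch below. LazyDescriptor is a hypothetical holder, not an existing class; it just defers the (assumed expensive) construction of a descriptor until the first call and caches the result:

```java
import java.util.concurrent.Callable;

// Hypothetical holder which builds its value on first access and caches it afterwards.
final class LazyDescriptor<T> {
	private final Callable<T> builder;
	private T value;
	private boolean built;

	LazyDescriptor(Callable<T> builder) {
		this.builder = builder;
	}

	// Synchronized so that concurrent callers trigger the build at most once.
	synchronized T get() {
		if ( !built ) {
			try {
				value = builder.call();
			}
			catch ( Exception e ) {
				throw new IllegalStateException( "Unable to build metadata", e );
			}
			built = true;
		}
		return value;
	}
}
```

The eager approach avoids this extra indirection, at the price of building descriptors that nobody may ever ask for.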

> 3# MutableSearchFactory
> 
> Let's not forget we also have a MutableSearchFactory to maintain: new
> entities could be added at any time so if we drop the original
> metadata we need to be able to build a new (read-only) one from the
> current state.

Good point

> Things we wanted but were too hard to do so far:
> - Separate annotation reading from Document building. Separate
> validity checks too.

+1 See above. I want to address this in another issue. We will need another intermediate
model for that. With this in place we can remove commons-annotations and easily
consume a Jandex index as well.

> - It checks for JPA @Id using reflection as it might not be available
> -> pluggable?

Not sure what you mean here. That's just a very specific JPA/ORM based use case.

> - LuceneOptionsImpl are built at runtime each time we need one ->
> reuse them, coupling them to their field

+1

>  - We need a reliable way to track which field names are created, and
> from which bridge they are originating (including custom bridges:
> HSEARCH-904)

See above and the FieldNameReportingBridge I am suggesting.

> == Solution ? ==
> 
> Now let's assume that we can build this as a recursive structure which
> accepts a generic visitor. …

That's where you lose me. I think I am a little like Emmanuel here. How does a
Visitor pattern help?

--Hardy