[hibernate-dev] DocumentBuilder refactoring in Hibernate Search: how to deal (internally) with metadata

Fri May 31 04:51:15 EDT 2013

Hi Hardy,

Great proposal for the metadata API. I've added some comments inline.

--Gunnar

2013/5/30 Hardy Ferentschik <hardy at hibernate.org>

> Gee, that's an email ;-)
> Before getting too much into it I think it would be useful to talk about
> what I am actually doing.
> I am trying to expose a meta data API for Search which allows users to
> determine which entities are
> indexed and which fields are available for each entity. I am trying to take
> an approach similar to
> Bean Validation where all metadata is exposed via descriptors. The entry
> point into the API is the
> SearchFactory. I am basically thinking about something like this (feedback
> welcome):
>
> /**
>  * Top level descriptor of the metadata API. Giving access to the indexing
>  * information for a single entity.
>  *
>  * @author Hardy Ferentschik
>  */
> public interface IndexedEntityDescriptor {
>

I find the name "IndexedEntityDescriptor" in conjunction with isIndexed()
potentially returning "false" a bit confusing. Maybe just
EntityDescriptor? Or SearchableEntityDescriptor?

>         /**
>          * @return Returns {@code true} if the entity for this descriptor
>          *         is indexed, {@code false} otherwise
>          */
>         boolean isIndexed();
>

Maybe return an enum if this can potentially be more than a simple yes/no?
I don't know how likely that is, but an enum would leave room for evolution.

>         /**
>          * @return Returns the class boost value, 1 being the default.
>          */
>         float getClassBoost();
>
>         /**
>          * @return Returns the names of the indexes instances of the
>          *         entity are indexed into. Generally this will
>          *         be just one index, however, when sharding is applied
>          *         multiple indexes per entity can be used.
>          */
>         Set<String> getIndexNames();
>

Would something like Set<IndexDescriptor> getIndexes() make sense?

>         /**
>          * @return Returns a set of {@code FieldDescriptor}s for the
>          *         indexed fields of the entity.
>          */
>         // TODO does this include the id field descriptor or should that
>         // be a separate descriptor?
>

At least for my case, I think it would be easier if this contained all field
descriptors so I can handle them uniformly. Maybe FieldDescriptor#isId() or,
if there are more id-specific things, something like this could be added:

    if ( fieldDescriptor.getType() == DescriptorType.ID ) {
        fieldDescriptor.as( IdDescriptor.class ).somethingIdSpecific();
    }
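
To make the idea a bit more concrete, here is a minimal, self-contained
sketch of how uniform handling plus narrowing could look. All names in it
(DescriptorType, as(), IdDescriptor#isGenerated(), the Simple* classes) are
assumptions made up for illustration, not existing Hibernate Search API:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical names throughout; nothing here is existing Hibernate Search API.
enum DescriptorType { FIELD, ID }

interface FieldDescriptor {
    String getName();
    DescriptorType getType();
    // Narrows the descriptor to a more specific sub-type
    <T extends FieldDescriptor> T as(Class<T> type);
}

interface IdDescriptor extends FieldDescriptor {
    // Placeholder for "something id specific"
    boolean isGenerated();
}

class SimpleFieldDescriptor implements FieldDescriptor {
    private final String name;
    SimpleFieldDescriptor(String name) { this.name = name; }
    public String getName() { return name; }
    public DescriptorType getType() { return DescriptorType.FIELD; }
    public <T extends FieldDescriptor> T as(Class<T> type) {
        // Class#cast throws ClassCastException if the narrowing is not possible
        return type.cast(this);
    }
}

class SimpleIdDescriptor extends SimpleFieldDescriptor implements IdDescriptor {
    private final boolean generated;
    SimpleIdDescriptor(String name, boolean generated) {
        super(name);
        this.generated = generated;
    }
    @Override
    public DescriptorType getType() { return DescriptorType.ID; }
    public boolean isGenerated() { return generated; }
}

class DescriptorDemo {
    // One loop over all descriptors; id-specific data only where applicable
    static String describe(Set<FieldDescriptor> fields) {
        StringBuilder sb = new StringBuilder();
        for (FieldDescriptor fd : fields) {
            sb.append(fd.getName());
            if (fd.getType() == DescriptorType.ID) {
                boolean generated = fd.as(IdDescriptor.class).isGenerated();
                sb.append("[id,generated=").append(generated).append("]");
            }
            sb.append(';');
        }
        return sb.toString();
    }

    static Set<FieldDescriptor> sample() {
        Set<FieldDescriptor> fields = new LinkedHashSet<FieldDescriptor>();
        fields.add(new SimpleIdDescriptor("id", true));
        fields.add(new SimpleFieldDescriptor("title"));
        return fields;
    }
}
```

The point is only that a client can iterate a single set and narrow where
needed; whether a type enum or a plain isId() is nicer is open.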

>         // TODO should OBJECT_CLASS be considered?
>         Set<FieldDescriptor> getIndexedFields();
>

Could you also add FieldDescriptor getIndexedField(String fieldName)?
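
A rough sketch of what I have in mind, assuming the descriptors are kept in
a map keyed by field name internally. NamedField and EntityFields are
made-up stand-ins for FieldDescriptor and the entity descriptor, and
returning null for an unknown name is just one possible contract:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for FieldDescriptor, reduced to its name
class NamedField {
    private final String name;
    NamedField(String name) { this.name = name; }
    String getName() { return name; }
}

// Hypothetical stand-in for the entity descriptor
class EntityFields {
    private final Map<String, NamedField> byName = new LinkedHashMap<String, NamedField>();

    EntityFields(NamedField... fields) {
        for (NamedField field : fields) {
            byName.put(field.getName(), field);
        }
    }

    // Read-only view of all field descriptors
    Set<NamedField> getIndexedFields() {
        return Collections.unmodifiableSet(new LinkedHashSet<NamedField>(byName.values()));
    }

    // Direct lookup by name; returns null if no such field is indexed (assumed contract)
    NamedField getIndexedField(String fieldName) {
        return byName.get(fieldName);
    }
}
```

That would save callers from iterating the whole set just to find one field.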

> }
>
> /**
>  * Metadata related to a single indexed field.
>  *
>  * @author Hardy Ferentschik
>  */
> public interface FieldDescriptor {
>         /**
>          * Returns the Lucene {@code Document} field name for this indexed
>          * property.
>          *
>          * @return Returns the field name for this index property
>          */
>         String getFieldName();
>

I'd call it just "getName()", not repeating the type's name.

>
>         /**
>          * @return Returns an {@code Analyze} enum instance defining the
>          *         type of analyzing applied to this field.
>          */
>         Analyze getAnalyzeType();
>
>         /**
>          * @return Returns a {@code Store} enum instance defining whether
>          *         the index value is stored in the index itself.
>          */
>         Store getStoreType();
>
>         /**
>          * @return Returns a {@code TermVector} enum instance defining
>          *         whether and how term vectors are stored for this field
>          */
>         TermVector getTermVectorType();
>
>
>         /**
>          * @return Returns a {@code Norms} enum instance defining whether
>          *         and how norms are stored for this field
>          */
>         Norms getNormType();
>
>         /**
>          * @return Returns the boost value for this field. 1 being the
>          *         default value.
>          */
>         float getBoost();
>
>         /**
>          * @return Returns the string used to index {@code null} values.
>          *         {@code null} in case null values are not indexed.
>          */
>         String nullIndexedAs();
>
>         /**
>          * @return Returns the field bridge instance used to convert the
>          *         property value into a string based field value
>          */
>         FieldBridge getFieldBridge();
>
>         /**
>          * @return Returns the analyzer used for this field, {@code null}
>          *         if the field is not analyzed
>          */
>         Analyzer getAnalyzer();
> }
>
> On top of this I am planning to add (addressing HSEARCH-903):
>
> public interface FieldNameReportingBridge {
>         Iterable<String> getGeneratedFieldNames(String baseFieldName);
> }
>

Wouldn't a Set be better? Returning Iterable makes it harder for users (e.g. no
contains()) and also hides set vs. list semantics.
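
For illustration, a minimal sketch of the Set-based variant. The interface
below is my suggested variation of your proposal, and CoordinatesBridge with
its "_lat"/"_lng" suffixes is purely a made-up example of a custom bridge:

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Suggested variation of the proposed interface: Set instead of Iterable
interface FieldNameReportingBridge {
    Set<String> getGeneratedFieldNames(String baseFieldName);
}

// Hypothetical custom bridge which adds two fields per bridged property
class CoordinatesBridge implements FieldNameReportingBridge {
    public Set<String> getGeneratedFieldNames(String baseFieldName) {
        Set<String> names = new LinkedHashSet<String>();
        names.add(baseFieldName + "_lat");
        names.add(baseFieldName + "_lng");
        // Read-only, in line with the metadata being a read-only structure
        return Collections.unmodifiableSet(names);
    }
}
```

A caller could then simply write
bridge.getGeneratedFieldNames( "location" ).contains( "location_lat" )
instead of looping over an Iterable.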

> The latter I need to allow custom bridges to report which fields they add.
> Most of the information I need to implement all this is in
> AbstractDocumentBuilder.PropertiesMetadata. The plan so far
> was to extract the information from there and, while working on this, make
> PropertiesMetadata a proper object (instead of the
> parallel arrays thingy).

+1

> Maybe some other minor refactorings along the way. I was not going to
> touch the processing of annotations
> for now. As discussed, we would need yet another level of
> abstraction there (similar to EntitySource in ORM or BeanConfiguration
> in HV). Something which can be populated by either annotation processing
> (be it Jandex or reflection) or by the programmatic API.
> Different story though.
>
> From what I can tell I don't need a Visitor pattern for what I have planned
> to do so far. If you think I am on the wrong track let me know
> and let me see the light.
>
> One thing I was wondering about after your email, however, was whether the
> API needs to provide information which field/getter/class
> is responsible for creating a given Lucene Document Field. Do we have a
> use case for that?
>
>
>
> On 29 Jan 2013, at 6:39 PM, Sanne Grinovero <sanne at hibernate.org> wrote:
>
> > We're starting a series of refactorings in Hibernate Search to improve
> > how we handle the entity mapping to the index; to summarize goals:
> >
> > 1# Expose the Metadata as API
> >
> > We need to expose it because:
> > a - OGM needs to be able to read this metadata to produce appropriate
> queries
>
> @gunnar, does the API above address your needs?
>

Yes, from what I'm aware of at the moment, I think so.

>
> >  Personally I think we end up needing this just as an SPI: that might
> > be good for cases {a,b}, and I have an alternative proposal for {c}
> > described below.
>
> -1, why an SPI? I think this is a very general-purpose API useful for any
> user.
> For example, you could imagine building an auto-suggesting query field
> which
> makes suggestions on which fields you can search on (a little like the
> Jira queries).
> In this case you could get the available fields via this API. Just to
> mention one use case.
>
> >  However we expose it, I think we agree this should be a read-only
> > structure built as a second phase after the model is consumed from
> > (annotations / programmatic API / jandex / auto-generated by OGM).
>
> +1
>
> > It
> > would also be good to keep it "minimal" in terms of memory cost, so to
> > either:
> > - drop references to the source structure
> > - not holding on it at all, building the Metadata on demand (!)
> > (Assuming we can build it from a more obscure internal representation
> > I'll describe next).
>
> Given that I am going to build it from required runtime information it
> could for sure
> be lazily loaded. However, right now I think I will just go for the
> straightforward approach.
>
> > 3# MutableSearchFactory
> >
> > Let's not forget we also have a MutableSearchFactory to maintain: new
> > entities could be added at any time so if we drop the original
> > metadata we need to be able to build a new (read-only) one from the
> > current state.
>
> Good point
>
> > Things we wanted but were too hard to do so far:
> > - Separate annotation reading from Document building. Separate
> > validity checks too.
>
> +1 See above. I want to address this in another issue. We will need
> another intermediate
> model for that. With this in place we can remove commons-annotations and
> easily
> consume a Jandex index as well.
>
> > - It checks for JPA @Id using reflection as it might not be available
> > -> pluggable?
>
> Not sure what you mean here. That's just a very specific JPA/ORM based use
> case.
>
> > - LuceneOptionsImpl are built at runtime each time we need one ->
> > reuse them, coupling them to their field
>
> +1
>
> >  - We need a reliable way to track which field names are created, and
> > from which bridge they are originating (including custom bridges:
> > HSEARCH-904)
>
> See above and the FieldNameReportingBridge I am suggesting
>
> > == Solution ? ==
> >
> > Now let's assume that we can build this as a recursive structure which
> > accepts a generic visitor. …
>
> that's where you lose me. I think I am a little like Emmanuel here. Where
> does a
> Visitor pattern help here?
>
> --Hardy