[hibernate-dev] DocumentBuilder refactoring in Hibernate Search: how to deal (internally) with metadata

Fri Jun 21 06:10:49 EDT 2013

> I'm sorry I now realize I made it hard to understand as I didn't
> actually explain how I mean this to be recursive.
> 
> First, note that another frequently requested feature is to be able to
> add some fields to a Document *in addition* to what we would normally
> do.
> Today this flexibility is granted by defining a class level bridge,
> but by doing so this disables processing of @Field annotations, so
> people can't decorate but have to hardcode the full transformation in
> their bridge.

How does a class bridge "disable processing of @Field annotations"?

> To solve that - and the design mentioned above - I was thinking of
> compositing bridges recursively to match the metadata.

"compositing bridges recursively to match the metadata"? Even after
replacing "compositing" with "composing", I am not quite sure what you are after.

> To express that in pseudo-functional code, an entity:
> 
> @ClassBridge(impl=CustomAnimals.class)
> @Indexed
> class Animal {
>  @Id long id;
>  @IndexedEmbedded Color skinColor;
>  @Field String name;
> }
> 
> would generate a reusable transformation function which gets
> associated to the Animal.class
> 
> class ObjectToDocument {
>  Document transform(Entity e);
> }
> [not exactly like that, bear with me a moment]

And how would the user hook in there?
IIRC some users just wanted the ability to get hold of the Lucene document before it gets indexed.
I think we can make this happen without changing anything around the bridges etc.
We just need to hook somewhere into DocumentBuilderIndexedEntity#getDocument.
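
To make the idea concrete, here is a minimal sketch of such a hook. Nothing in it exists
in the code base: DocumentPostProcessor, SketchDocumentBuilder and SimpleDocument are
made-up names, and SimpleDocument merely stands in for the Lucene Document so that the
example compiles on its own. It only shows where the callback would sit, namely after
the regular @Field/bridge processing and before the document is handed to the indexer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the Lucene Document, so the sketch has no external dependencies.
final class SimpleDocument {
    final Map<String, String> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.put(name, value);
    }
}

// Hypothetical callback: receives the finished document just before it is indexed.
interface DocumentPostProcessor {
    void postProcess(Object entity, SimpleDocument document);
}

// Simplified stand-in for the document builder; only the hook placement matters here.
final class SketchDocumentBuilder {
    private final List<DocumentPostProcessor> postProcessors = new ArrayList<>();

    void register(DocumentPostProcessor postProcessor) {
        postProcessors.add(postProcessor);
    }

    SimpleDocument getDocument(Object entity) {
        SimpleDocument document = new SimpleDocument();
        // Placeholder for the existing annotation and bridge processing.
        document.add("name", String.valueOf(entity));
        // The hook: user code may add fields on top of what was built above.
        for (DocumentPostProcessor postProcessor : postProcessors) {
            postProcessor.postProcess(entity, document);
        }
        return document;
    }
}
```

With something along these lines, a user could decorate the document without having to
reimplement the whole entity-to-document transformation in a class bridge.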

> I guess by now you start seeing the problem of defining the exact
> signature of such composite transformation blocks:
> I mentioned Visitor in my first email as I think it could help, but it
> doesn't have to be strictly a Visitor.

Sorry, I still don't fully understand. That's not to say that I am against a new way of getting from entity
to document, but I don't see which of the things you mentioned is not possible today.

> The problem is that we will want to navigate the internal metadata for
> different purposes, as I had outlined in the next paragraph; generally
> a Visitor allows to decouple the metadata graph from the purpose,
> while also preserving a good level of typesafety and performance:
> let's not forget this is one of the hottest areas of the Search
> codebase (CPU wise), and at the same time the place where we trigger
> the more important optimisations, like opportunities to skip network
> operations or disk IO.

Fair enough. However, imo the visitor pattern also adds quite some complexity
and only becomes useful where you have an object structure with many different
types. In our case the metadata for a single indexed type is quite "simple".

>>>  - We need a reliable way to track which field names are created, and
>>> from which bridge they are originating (including custom bridges:
>>> HSEARCH-904)

I am working on this. As mentioned before, my idea for this was to add another
interface a bridge can implement. This interface reports the required meta information
(the fields which are getting added plus maybe their Lucene index settings). For the
built-in bridges we implement this interface. Implementors of custom bridges will need to
implement this interface.
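
A rough sketch of what such an opt-in interface could look like. All names below
(FieldNameReporting, FullNameBridge, FieldNameCollector) are hypothetical and only
illustrate the idea. The sketch reports field names only and leaves out the Lucene
index settings.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical opt-in contract: a bridge declares which field names it may create.
interface FieldNameReporting {
    Set<String> createdFieldNames(String baseFieldName);
}

// Example custom bridge that would split one property into two index fields.
final class FullNameBridge implements FieldNameReporting {
    @Override
    public Set<String> createdFieldNames(String baseFieldName) {
        return new LinkedHashSet<>(
                Arrays.asList(baseFieldName + ".first", baseFieldName + ".last"));
    }
}

// Collects the declared names at bootstrap time.
final class FieldNameCollector {
    static Set<String> collect(String baseFieldName, List<?> bridges) {
        Set<String> names = new LinkedHashSet<>();
        for (Object bridge : bridges) {
            // Bridges that do not implement the interface contribute no known names.
            if (bridge instanceof FieldNameReporting) {
                names.addAll(((FieldNameReporting) bridge).createdFieldNames(baseFieldName));
            }
        }
        return names;
    }
}
```

The instanceof check keeps existing custom bridges working unchanged; they simply
would not show up in the collected metadata until they implement the interface.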

>>>  - If we could know in advance which properties of the entities need
>>> to be initialized for a complete Document to be created we could
>>> generate more efficient queries at entity initialization time, or at
>>> MassIndexing select time. I think users really would expect such a
>>> clever integration with ORM (HSEARCH-1235)

But this is all a question of which metadata we collect, not of how we process it.

> Seems in the list above I forgot my favorite one: dump the metadata as
> simple text on bootup; this should greatly simplify query writing, and
> doesn't need to open the index with Luke to figure out the field names
> / options.

+1 I like this idea. Maybe instead of dumping the internal metadata we should dump
the public metadata information once I am finished with HSEARCH-436.
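
Just to illustrate the kind of output, a minimal sketch of such a dump. MetadataDump
and the plain map of field name to option string are assumptions made for the example;
they are not the metadata API from HSEARCH-436.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that renders collected field metadata as plain text for the boot log.
final class MetadataDump {
    static String render(String entityName, Map<String, String> fieldOptions) {
        StringBuilder out = new StringBuilder("Indexed entity ").append(entityName).append('\n');
        // Copy into a TreeMap so the fields are listed in a stable, alphabetical order.
        for (Map.Entry<String, String> field : new TreeMap<>(fieldOptions).entrySet()) {
            out.append("  ").append(field.getKey())
               .append(" [").append(field.getValue()).append("]\n");
        }
        return out.toString();
    }
}
```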

--Hardy