[hibernate-dev] DocumentBuilder refactoring in Hibernate Search: how to deal (internally) with metadata

Tue Jun 18 13:06:07 EDT 2013

I got side tracked by a laptop failure when I was about to answer this:
sorry it replies just to the first email and doesn't take in
consideration most of the subsequent excellent feedback.

Update: I'm about to merge Hardy's huge cleanup as it's very relevant,
but these thoughts are still relevant for the next phases.

On 30 May 2013 09:53, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> On Wed 2013-05-29 17:39, Sanne Grinovero wrote:
>> We're starting a series of refactorings in Hibernate Search to improve
>> how we handle the entity mapping to the index; to summarize goals:
>>
>> 1# Expose the Metadata as API
>>
>> We need to expose it because:
>>  a - OGM needs to be able to read this metadata to produce appropriate queries
>>  b - Advanced users have expressed the need to for things like listing
>> all indexed entities to integrate external tools, code generation,
>> etc..
>>  c - All users (advanced and not) have interest in -at least- logging
>> the field structure to help creating Queries; today people need a
>> debugger or Luke.
>>
>>   Personally I think we end up needing this just as an SPI: that might
>> be good for cases {a,b}, and I have an alternative proposal for {c}
>> described below.
>>   However we expose it, I think we agree this should be a read-only
>> structure built as a second phase after the model is consumed from
>> (annotations / programmatic API / jandex / auto-generated by OGM). It
>> would also be good to keep it "minimal" in terms of memory cost, so to
>> either:
>>  - drop references to the source structure
>>  - not holding on it at all, building the Metadata on demand (!)
>> (Assuming we can build it from a more obscure internal representation
>> I'll describe next).
>
> To me since we have an internal representation for DocumentBuilder to do
> its runtime job, the user visible model could end up being a flyweight
> object in front of it exposing thigns the way we want. Whether you
> create this flyweight structure each time is another question.

+1

>> Whatever the final implementation will actually do to store this
>> metadata, for now the priority is to define the contract for the sake
>> of OGM so I'm not too concerned on the two phase buildup and how
>> references are handled internally - but let's discuss the options
>> already.
>>
>>  2# Better fit Lucene 4 / High performance
>>
>> There are some small performance oriented optimizations that we could
>> already do with Lucene 3, but where unlikely to be worth the effort;
>> for example reusing Field instances and pre-intern all field names.
>>   These considerations however are practically mandatory with Lucene 4, as:
>>   - the cost of *not* doing as Lucene wants is higher (runtime field
>> creation is more expensive now)
>>   - the performance benefit of following the Lucene expectations are
>> significantly higher (takes advantage of several new features)
>>   - code is much more complex if we don't do it
>
> I am not sure how that requirement fits witht he solution you describe
> later.

See the comments on LuceneOptionsImpl below

>> 3# MutableSearchFactory
>>
>>  Let's not forget we also have a MutableSearchFactory to maintain: new
>> entities could be added at any time so if we drop the original
>> metadata we need to be able to build a new (read-only) one from the
>> current state.
>>
>> 4# Finally some cleanups in AbstractDocumentBuilder
>>
>> This class served us well, but has grown too much over time.
>>
>> Things we wanted but where too hard to do so far:
>>  - Separate annotation reading from Document building. Separate
>> validity checks too.
>>  - It checks for JPA @Id using reflection as it might not be available
>> -> pluggable?
>
> We know only one use case for this pluggable mechanism, do we really
> need it?

No we don't. It just is yet another reason to do such a cleanup.
BTW we're also planning for @EmbeddableId, and we could take advantage
of more, like @Cacheable, multitenancy mapping options, etc.. so this
need might become more relevant.

>>  - LuceneOptionsImpl are built at runtime each time we need one ->
>> reuse them, coupling them to their field
>>
>
> Do you think it would yield performance improvement? It does not look
> like an expensive object to create compared to keeping a reference
> around. What's your reasoning?

This actually is one of the pain points of a proper Lucene4 implementation.
Today we create these instances very frequently; in an ideal world we
might not want to do that, but I agree that would be very unlikely to
be worth the effort.
With Lucene4 however you want to reuse Fieldable dinifitions (reuse
the instance); they became more expensive to create to benefit faster
operation at runtime - as long as you reuse them. Reusing such a
Fieldable is normally trivial, but very complex with out current code.

We have two options here:
 - ignore the problem and move to Lucene4 with a non optimal performance
 - review the design first, move the Lucene4 when it's easier

Personally I like the iterative approach more so I would normally
choose to move to Lucene4 first; however the move is quite complex to
do in one go, so after some attempts I've backed off and my position
now is actually to try making our life easier before attempting the
jump. Consider the jump will need some weeks of coding of one
developer alone, who will be coding most of this time without being
able to run the tests as there are lots of compilation issues to
solve.

I'm not having a strong position on this, I just suspect refactoring
this class first makes it easier. I tried a migration: this is where
it gets seriously messy, but of course I could try again; I backed off
not because it's impossible but to explore our options.

>> DocumentBuilderIndexedEntity specific:
>>   - A ConversionContext tracks progress on each field by push/pop a
>> navigation stack to eventually thrown an exception with the correct
>> description. If instead we used a recursive function, there would be
>> no need to track anything.
>
> I'm not entirely following how the recursive method could help you to
> track the context. Upon failure, You would catch inner call exceptions and add your
> context information each time and rethrow?

I'm sorry I now realize I made it hard to understand as I didn't
actually explain how I mean this to be recursive.

First, note that another frequently requested feature is to be able to
add some fields to a Document *in addition* to what we would normally
do.
Today this flexibility is granted by defining a class level bridge,
but by doing so this disables processing of @Field annotations, so
people can't decorate but have to hardcode the full transformation in
their bridge.
To solve that - and the design mentioned above - I was thinking of
compositing bridges recursively to match the metadata.

To express that in pseudo-functional code, an entity:

@ClassBridge(impl=CustomAnimals.class)
@Indexed
class Animal {
  @Id long id;
  @IndexedEmbedded Color skinColor;
  @Field name;
}

would generate a reusable transformation function which gets
associated to the Animal.class

class ObjectToDocument {
  Document transform(Entity e);
}
[not exactly like that, bear with me a moment]

But the implementation of this could be a composite of such transformers:
it would have an internal reference to an "id" ObjectToDocument
instance implementation which knows about long (and includes a long
Two-Way fieldbridge),
and also includes a constant reference to the Fieldable instance it
would use to populate a Document instance.

[you see: the signature above isn't valid anymore as we're not
returning a Document for the subtypes]

the adaptor needed to transform the "skinColor" field - having an
@IndexedEmbedded -
would not need to be hardcoded on a Color class but could need to do a
lookup from a transformer registry
depending on the runtime type (this would be a new feature needed for
less ORM oriented use cases).

Also this transformer instance could be used to eagerly initialize the
needed fields: we could define a graph of needed relations which need
to be loaded.
Today Search walks each property as needed, but that triggers multiple
roundtrips to the database where we could use fetch profiles.

I guess by now you start seeing the problem of defining the exact
signature of such composite transformation blocks:
I mentioned Visitor in my first email as I think it could help, but it
doesn't have to be strictly a Visitor.

The problem is that we will want to navigate the internal metadata for
different purposes, as I had outlined in the next paragraph; generally
a Visitor allows to decouple the metadata graph from the purpose,
while also preserving a good level of typesafety and performance:
let's not forget this is one of the hottest areas of the Search
codebase (CPU wise), and at the same time the place where we trigger
the more important optimisations, like opportunities to skip network
operations or disk IO.

>>   - We had issues with "forgetting" to initialize a collection before
>> trying to index it (HSEARCH-1245, HSEARCH-1240, ..)
>>   - We need a reliable way to track which field names are created, and
>> from which bridge they are originating (including custom bridges:
>> HSEARCH-904)
>>   - If we could know in advance which properties of the entities need
>> to be initialized for a complete Document to be created we could
>> generate more efficient queries at entity initialization time, or at
>> MassIndexing select time. I think users really would expect such a
>> clever integration with ORM (HSEARCH-1235)

Seems in the list above I forgot my favorite one: dump the metadata as
simple text on bootup; this should greatly simplify query writing, and
doesn't need to open the index with Luke to figure out the field names
/ options.

>> == Solution ? ==
>>
>> Now let's assume that we can build this as a recursive structure which
>> accepts a generic visitor.
>> One could "visit" the structure with a static collector to:
>>  - discover which fields are written - and at the same time collect
>> information about specific options used on them
>>      -> query validation
>>      -> logging the mapping
>>      -> connect to some tooling
>>  - split the needed properties graph into optimised loading SQL or
>> auto-generated fetch profiles; ideally taking into account 2nd level
>> cache options from ORM (which means this visitor resides in the
>> hibernate-search-orm module, not engine! so note the dependency
>> inversion).
>> - visit it with a non-static collector to initialize all needed
>> properties of an input Entity
>> - visit it to build a Document of an initialized input Entity
>
> Does that mean the entity graph (data) is really traversed twice, once to init
> the boundaries and a second time to build the document?

The structure by itself wouldn't enforce that, but yes I think we'd do
that as it looks like the best strategy, performance wise.

>
>> - visit it to build something which gets feeded into a non-Lucene
>> output !! (ElasticSearch or Solr client value objects: HSEARCH-1188)
>
> This one is a really interesting idea.

Yet another interesting one: encoding in easy-serializable format.
And decoding too: you would traverse the same composite metadata with
a different visitor. JBoss Marshalling uses the same recursive
pattern, so each joint of the composition could invoke a specific
Marshaller.

>
>>  .. define the Analyzer mapping,  generate the dynamic boosting
>> values, etc.. each one could be a separate, clean, concern.
>>
>> This would also make it easier to implement a whole crop of feature
>> requests we have about improving the @IndexedEmbedded(includePaths)
>> feature, and the ones I like most:
>>
>>  # easy tool integration for inspection
>>  # better testability of how we create this metadata
>>  # could make a "visualizing" visitor to actually show how a test
>> entity is transformed and make it easier to understand why it's
>> matching a query (or not).
>>
>> Quite related, what does everybody think of this :
>> https://hibernate.atlassian.net/browse/HSEARCH-438 Support runtime
>> polymorphism on associations (instead of defining the indexed
>> properties based on the returned type)
>> ?
>> Personally I think the we should support that, but it's a significant
>> change. I'm bringing that up again as I suspect it would affect the
>> design of the changes proposed above.
>>
>>
>> This might sound a big change; in fact I agree it's a significant
>> style change but it is rewriting what is defined today in just 3
>> classes; no doubt we'll get more than a dozen ouf of it, but I think
>> it would be better to handle in the long run, more flexible and
>> potentially more efficient too.
>
> What's your argument for efficiency?

Mostly object reuse in areas we can't today, and considering some of
these objects are expensive, and we have evidence from profiler runs
that garbage creation is our weakest point.
But it's not limited to "less objects" created at runtime: consider
for example today we artificially track the navigation stack to build
meaningful error messages; you wouldn't need to track the stack at
all.
Also some objects are expensive, like the Field instances in Lucene
require interning of the field names which is an inherent
synchronization point: today we have to traverse these synch points
for each field for each write.

>
>> Do we all agree on this? In practical terms we'd also need to define
>> how far Hardy wants to go with this, if he wants to deal only with the
>> Metadata API/SPI aspect and then I could apply the rest, or if he
>> wants to try doing it all in one go. I don't think we can start
>> working in parallel on this ;-)
>
> I have always been a skeptic of the visitor pattern. I know you have
> been drinking it over and over for the parser but to me it's a pattern
> that:
>
> - makes the big picture harder to grasp

Can't deny it, but some practice using it helps.

> - extremely difficult to debug if some generic behavior is different
>   that what you expect

A plus point is that each visitor can be kept very simple and
essential; it's generally less error prone: we have some cases of
visitors in Search already and they had never had any problem: very
robust code, and strong to maintain because an error in the pattern
usage most of the times results in a compile error.

> - tend to create visitors with state (machines) because contextual
>   information ends up being needed anyways

Right, especially if the purpose of each is different: for example the
one outputting the metadata tree in text format would need a
StringBuilder, but that feels very natural to it doesn't it?
It's not particularly different then the Builder that Hardy has been
using in the pull of today.

>
> In our case, there might be correlation between booting and the specific
> structure, or analyzer and bridges (making that up). AFAIU from the
> visitor pattern, you need to merge those correlated works in a single
> visitor or store intermediate state somewhere in your "output"
> structure.
>
> I like the idea of a visitor to make non Lucene backend easier but the
> fact that I don't grok code using the visitor pattern is of concern to
> me. I can be convinced I imagine but I'll need arguments.

It's not going to be applied tonight, and it doesn't need to be a
Visitor. I just wished to finally share all my thoughts on what we
need from this metadata internals. I have a Visitor in mind, but we
can see what fits better; for sure we need to traverse this structure
in different ways.

Sanne