On Wed 2013-05-29 17:39, Sanne Grinovero wrote:
> We're starting a series of refactorings in Hibernate Search to improve
> how we handle the entity mapping to the index; to summarize goals:
>
> 1# Expose the Metadata as API
>
> We need to expose it because:
> a - OGM needs to be able to read this metadata to produce appropriate
> queries
> b - Advanced users have expressed the need for things like listing
> all indexed entities to integrate external tools, code generation,
> etc.
> c - All users (advanced and not) have interest in -at least- logging
> the field structure to help creating Queries; today people need a
> debugger or Luke.
>
> Personally I think we end up needing this just as an SPI: that might
> be good for cases {a,b}, and I have an alternative proposal for {c}
> described below.
> However we expose it, I think we agree this should be a read-only
> structure built as a second phase after the model is consumed from
> (annotations / programmatic API / jandex / auto-generated by OGM). It
> would also be good to keep it "minimal" in terms of memory cost, so as
> to either:
> - drop references to the source structure
> - not hold on to it at all, building the Metadata on demand (!)
> (Assuming we can build it from a more obscure internal representation
> I'll describe next).
To me since we have an internal representation for DocumentBuilder to do
its runtime job, the user visible model could end up being a flyweight
object in front of it exposing things the way we want. Whether you
create this flyweight structure each time is another question.
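
To make the flyweight idea concrete, here is a minimal sketch. All type
names (InternalFieldModel, FieldView, EntityView) are hypothetical and
not existing Hibernate Search classes; the point is only that the
user-visible view keeps a single reference to the internal model and
copies nothing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical internal representation, as a DocumentBuilder might hold it.
final class InternalFieldModel {
    final String name;
    final boolean stored;
    final boolean analyzed;

    InternalFieldModel(String name, boolean stored, boolean analyzed) {
        this.name = name;
        this.stored = stored;
        this.analyzed = analyzed;
    }
}

// Hypothetical read-only view: one reference to the internal model, no copy.
final class FieldView {
    private final InternalFieldModel delegate;

    FieldView(InternalFieldModel delegate) {
        this.delegate = delegate;
    }

    String getName() { return delegate.name; }
    boolean isStored() { return delegate.stored; }
    boolean isAnalyzed() { return delegate.analyzed; }
}

// Hypothetical entry point exposed to users.
final class EntityView {
    private final List<InternalFieldModel> fields;

    EntityView(List<InternalFieldModel> fields) {
        this.fields = fields;
    }

    // Views are created on each call; nothing is cached here.
    List<FieldView> getFields() {
        List<FieldView> views = new ArrayList<>(fields.size());
        for (InternalFieldModel field : fields) {
            views.add(new FieldView(field));
        }
        return Collections.unmodifiableList(views);
    }
}
```

Caching the views instead of rebuilding them on each call would be a
local change inside EntityView, which is why I consider it a separate
question.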
>
> Whatever the final implementation will actually do to store this
> metadata, for now the priority is to define the contract for the sake
> of OGM so I'm not too concerned on the two phase buildup and how
> references are handled internally - but let's discuss the options
> already.
>
> 2# Better fit Lucene 4 / High performance
>
> There are some small performance oriented optimizations that we could
> already do with Lucene 3, but were unlikely to be worth the effort;
> for example reusing Field instances and pre-intern all field names.
> These considerations however are practically mandatory with Lucene 4,
> as:
> - the cost of *not* doing as Lucene wants is higher (runtime field
> creation is more expensive now)
> - the performance benefit of following the Lucene expectations is
> significantly higher (takes advantage of several new features)
> - code is much more complex if we don't do it
I am not sure how that requirement fits with the solution you describe
later.
>
> 3# MutableSearchFactory
>
> Let's not forget we also have a MutableSearchFactory to maintain: new
> entities could be added at any time so if we drop the original
> metadata we need to be able to build a new (read-only) one from the
> current state.
>
> 4# Finally some cleanups in AbstractDocumentBuilder
>
> This class served us well, but has grown too much over time.
>
> Things we wanted but were too hard to do so far:
> - Separate annotation reading from Document building. Separate
> validity checks too.
> - It checks for JPA @Id using reflection as it might not be available
> -> pluggable?
We know only one use case for this pluggable mechanism, do we really
need it?
> - LuceneOptionsImpl are built at runtime each time we need one ->
> reuse them, coupling them to their field
>
Do you think it would yield performance improvement? It does not look
like an expensive object to create compared to keeping a reference
around. What's your reasoning?
> DocumentBuilderIndexedEntity specific:
> - A ConversionContext tracks progress on each field by pushing/popping
> a navigation stack to eventually throw an exception with the correct
> description. If instead we used a recursive function, there would be
> no need to track anything.
I'm not entirely following how the recursive method could help you to
track the context. Upon failure, you would catch inner call exceptions,
add your context information each time and rethrow?
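
If that is the idea, I picture something like the sketch below. It is
only my reading of the proposal: MappingNode, ConversionException and
RecursiveConverter are made-up names, and the "failing" flag merely
simulates a bridge that throws. Each recursion level catches the failure
coming from below, prepends its own name and rethrows, so the path is
assembled while the stack unwinds and no explicit push/pop is needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node of the mapping tree.
final class MappingNode {
    final String name;
    final boolean failing; // simulates a bridge that throws
    final List<MappingNode> children = new ArrayList<>();

    MappingNode(String name, boolean failing) {
        this.name = name;
        this.failing = failing;
    }

    MappingNode add(MappingNode child) {
        children.add(child);
        return this;
    }
}

// Hypothetical exception carrying the path built during stack unwinding.
final class ConversionException extends RuntimeException {
    final String path;

    ConversionException(String path, Throwable cause) {
        super("Conversion failed at '" + path + "'", cause);
        this.path = path;
    }
}

final class RecursiveConverter {
    static void convert(MappingNode node) {
        try {
            if (node.failing) {
                throw new IllegalStateException("bridge failure");
            }
            for (MappingNode child : node.children) {
                convert(child);
            }
        } catch (ConversionException e) {
            // An inner call already failed: prepend this level and rethrow,
            // keeping the original cause.
            throw new ConversionException(node.name + "." + e.path, e.getCause());
        } catch (RuntimeException e) {
            // The failure originated at this level.
            throw new ConversionException(node.name, e);
        }
    }
}
```

The cost is one exception instance per level on the failure path only;
the success path tracks nothing.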
> - We had issues with "forgetting" to initialize a collection before
> trying to index it (HSEARCH-1245, HSEARCH-1240, ..)
> - We need a reliable way to track which field names are created, and
> from which bridge they are originating (including custom bridges:
> HSEARCH-904)
> - If we could know in advance which properties of the entities need
> to be initialized for a complete Document to be created we could
> generate more efficient queries at entity initialization time, or at
> MassIndexing select time. I think users really would expect such a
> clever integration with ORM (HSEARCH-1235)
>
>
> == Solution ? ==
>
> Now let's assume that we can build this as a recursive structure which
> accepts a generic visitor.
> One could "visit" the structure with a static collector to:
> - discover which fields are written - and at the same time collect
> information about specific options used on them
> -> query validation
> -> logging the mapping
> -> connect to some tooling
> - split the needed properties graph into optimised loading SQL or
> auto-generated fetch profiles; ideally taking into account 2nd level
> cache options from ORM (which means this visitor resides in the
> hibernate-search-orm module, not engine! so note the dependency
> inversion).
> - visit it with a non-static collector to initialize all needed
> properties of an input Entity
> - visit it to build a Document of an initialized input Entity
Does that mean the entity graph (data) is really traversed twice, once to
init the boundaries and a second time to build the document?
> - visit it to build something which gets fed into a non-Lucene
> output !! (ElasticSearch or Solr client value objects: HSEARCH-1188)
This one is a really interesting idea.
> .. define the Analyzer mapping, generate the dynamic boosting
> values, etc.. each one could be a separate, clean, concern.
>
> This would also make it easier to implement a whole crop of feature
> requests we have about improving the @IndexedEmbedded(includePaths)
> feature, and the ones I like most:
>
> # easy tool integration for inspection
> # better testability of how we create this metadata
> # could make a "visualizing" visitor to actually show how a test
> entity is transformed and make it easier to understand why it's
> matching a query (or not).
>
> Quite related, what does everybody think of this :
>
> https://hibernate.atlassian.net/browse/HSEARCH-438 Support runtime
> polymorphism on associations (instead of defining the indexed
> properties based on the returned type)
> ?
> Personally I think we should support that, but it's a significant
> change. I'm bringing that up again as I suspect it would affect the
> design of the changes proposed above.
>
>
> This might sound a big change; in fact I agree it's a significant
> style change but it is rewriting what is defined today in just 3
> classes; no doubt we'll get more than a dozen out of it, but I think
> it would be better to handle in the long run, more flexible and
> potentially more efficient too.
What's your argument for efficiency?
> Do we all agree on this? In practical terms we'd also need to define
> how far Hardy wants to go with this, if he wants to deal only with the
> Metadata API/SPI aspect and then I could apply the rest, or if he
> wants to try doing it all in one go. I don't think we can start
> working in parallel on this ;-)
I have always been a skeptic of the visitor pattern. I know you have
been drinking it over and over for the parser but to me it's a pattern
that:
- makes the big picture harder to grasp
- is extremely difficult to debug if some generic behavior is different
than what you expect
- tends to create visitors with state (machines) because contextual
information ends up being needed anyway
In our case, there might be correlation between boosting and the specific
structure, or analyzer and bridges (making that up). AFAIU from the
visitor pattern, you need to merge those correlated works in a single
visitor or store intermediate state somewhere in your "output"
structure.
I like the idea of a visitor to make non-Lucene backends easier but the
fact that I don't grok code using the visitor pattern is of concern to
me. I can be convinced I imagine but I'll need arguments.
I'm wondering how many node types there will be and how deep the hierarchy
actually is. If it is basically only something like BeanDescriptor and
PropertyDescriptor, is it really worth a visitor pattern? In which
situation would one get a collection of mixed descriptor types where one
needs to dispatch type-specific logic?
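
To make the question concrete, here is a rough sketch of what a visitor
over just two node types could look like. This is an assumption on my
side, not the proposed design: BeanDescriptor and PropertyDescriptor are
simplified stand-ins, and DescriptorVisitor and FieldNameCollector are
invented for illustration. Note that even this small collector has to
keep state (a prefix stack) to produce names such as "author.name".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical visitor contract over the two node types.
interface DescriptorVisitor {
    void enterBean(BeanDescriptor bean);
    void visitProperty(PropertyDescriptor property);
    void leaveBean(BeanDescriptor bean);
}

// Hypothetical leaf node: one indexed property.
final class PropertyDescriptor {
    final String name;

    PropertyDescriptor(String name) {
        this.name = name;
    }
}

// Hypothetical composite node: a bean with properties and embedded beans.
final class BeanDescriptor {
    final String prefix; // "" for the root, e.g. "author." when embedded
    final List<PropertyDescriptor> properties = new ArrayList<>();
    final List<BeanDescriptor> embedded = new ArrayList<>();

    BeanDescriptor(String prefix) {
        this.prefix = prefix;
    }

    void accept(DescriptorVisitor visitor) {
        visitor.enterBean(this);
        for (PropertyDescriptor property : properties) {
            visitor.visitProperty(property);
        }
        for (BeanDescriptor bean : embedded) {
            bean.accept(visitor);
        }
        visitor.leaveBean(this);
    }
}

// Collects fully qualified field names; the prefix stack is visitor state.
final class FieldNameCollector implements DescriptorVisitor {
    private final Deque<String> prefixes = new ArrayDeque<>();
    final List<String> fieldNames = new ArrayList<>();

    @Override
    public void enterBean(BeanDescriptor bean) {
        String parent = prefixes.isEmpty() ? "" : prefixes.peek();
        prefixes.push(parent + bean.prefix);
    }

    @Override
    public void visitProperty(PropertyDescriptor property) {
        fieldNames.add(prefixes.peek() + property.name);
    }

    @Override
    public void leaveBean(BeanDescriptor bean) {
        prefixes.pop();
    }
}
```

With only these two types, a plain recursive method taking the prefix as
a parameter would do the same job in fewer lines, which is what makes me
wonder whether the pattern pays off here.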
On a more general note, would it make sense to have a joint meta model for
the different projects such as OGM and Search?
This might help avoid writing and maintaining redundant code (e.g. merging
meta data from different sources) and also might be beneficial for the sake
of memory. Such a common model might have generic node types, to which the
different clients could attach specific payload.
Note that I'm not sure whether that's a sound idea, just throwing it out :)
--Gunnar
_______________________________________________
hibernate-dev mailing list
hibernate-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev