[hibernate-dev] Query handling : Antlr 3 versus Antlr 4
Gunnar Morling
gunnar at hibernate.org
Tue Jun 9 07:11:26 EDT 2015
2015-06-09 12:49 GMT+02:00 Sanne Grinovero <sanne at hibernate.org>:
> On 9 June 2015 at 09:56, Gunnar Morling <gunnar at hibernate.org> wrote:
> > Hi Steve,
> >
> > Did you ever have a chance to apply the "decorated parse tree" approach to
> > your Antlr4 PoC?
> >
> > What I like about the Antlr4 approach is the fact that you don't need a set
> > of several quite similar grammars as you'd do with the tree transformation
> > approach. Also using the current version of Antlr instead of 3 appears
> > attractive to me wrt. bugfixes and future development of the tool.
> >
> > Based on what I understand from your discussions on the Antlr mailing list,
> > I'd assume the parse tree and the external state it references to look
> > roughly like so (---> indicates a reference to state built up during
> > subsequent walks, maybe in some external "table", maybe stored within
> > the (typed) tree nodes themselves):
> >
> > [QUERY]
> >   [SELECT]
> >     [ATTRIBUTE_REF] ---> AttributeReference("<gen:1>", "code")
> >       [DOT]
> >         [DOT]
> >           [DOT]
> >             [IDENT, "c"]
> >             [IDENT, "headquarters"]
> >           [IDENT, "state"]
> >         [IDENT, "code"]
> >   [FROM]
> >     [SPACE]
> >       [SPACE_ROOT] ---> InnerJoin(
> >                           InnerJoin(
> >                             PersisterRef( "c", "com.acme.Customer" ),
> >                             TableRef( "<gen:0>", "headquarters" ) ),
> >                           TableRef( "<gen:1>", "state" ) )
> >         [IDENT, "Customer"]
> >         [IDENT, "c"]
> >
> > I.e. instead of transforming the tree itself, the state required for output
> > generation would be added as "decorators" to nodes of the original parse
> > tree itself. That's just the basic idea as I understand it; surely the
> > specific types of the decorator elements (AttributeReference, InnerJoin etc.)
> > may look different. During "query rendering" we'd have to inspect the
> > decorator state of the parse tree nodes and interpret it accordingly.
> >
> > So I believe the issue of alias resolution and implicit join conversion
> > could be handled without tree transformations (at least conceptually, I
> > could not code an actual implementation off the top of my head right away).
> > But maybe there are other cases where tree transformations are more strictly
> > needed?
>
> Do you mean that you would be ok to "navigate" all the [DOT] nodes to
> get to the decorated attachments?
> In that case while you might be fine to translate each fragment into a
> different fragment, it's not straightforward to transform it into a
> different structure, say with sub-trees in different orders or nodes
> which don't have a 1:1 match.
> It's of course doable if you are filling in your own builder while
> navigating these (like we do with the Lucene DSL output), but it
> doesn't help you with multiple phases which is what Steve is pointing
> out.
>
No, what I mean is to add attachments to nodes up in the tree, based on
information either a) from sub-nodes of that tree or b) nodes somewhere
else in the tree. E.g. a) is the case for the attribute reference, which is
represented by an attachment at the ATTRIBUTE_REF node (it has been created
by prior visits to the DOT sub-nodes) and b) is the case for the implicit
join syntax: It is declared by sub-nodes of the SELECT clause, but the
attachments representing the join are added beneath the FROM clause.
The query generation would work based on these "semantic" attachments; it
would not visit the individual DOT nodes, for instance.
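
To make cases a) and b) a bit more concrete, here is a minimal sketch in
Python (chosen only for brevity; this is not the Antlr 4 API). The Node
class, the "ref" and "joins" attachment keys and the tuple-shaped decorators
are all made up for illustration, and the entity name is left unqualified
because resolving it to com.acme.Customer is out of scope here. Duplicate
join paths are not merged either.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """Parse-tree node. `attachments` is a hypothetical slot for decorator state."""
    kind: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    attachments: Dict[str, Any] = field(default_factory=dict)


def flatten_dots(node: Node) -> List[str]:
    """Return the identifiers of a left-nested DOT chain in source order."""
    if node.kind == "IDENT":
        return [node.text]
    left, right = node.children
    return flatten_dots(left) + flatten_dots(right)


def decorate(query: Node) -> None:
    """Attach semantic state to nodes; the tree structure is never changed."""
    select, from_clause = query.children
    space_root = from_clause.children[0].children[0]
    entity, alias = (child.text for child in space_root.children)
    joins = ("PersisterRef", alias, entity)
    generated = count()  # source of <gen:N> aliases (assumed naming scheme)
    for attr_ref in select.children:
        path = flatten_dots(attr_ref.children[0])
        source = path[0]
        # Case b): each intermediate path element is an implicit inner join.
        # It is declared under SELECT but recorded beneath FROM.
        for attribute in path[1:-1]:
            source = f"<gen:{next(generated)}>"
            joins = ("InnerJoin", joins, ("TableRef", source, attribute))
        # Case a): summary of the DOT sub-nodes, attached to ATTRIBUTE_REF.
        attr_ref.attachments["ref"] = ("AttributeReference", source, path[-1])
    space_root.attachments["joins"] = joins


def render_select(query: Node) -> str:
    """Render the select list from attachments only, ignoring the DOT nodes."""
    refs = (n.attachments["ref"] for n in query.children[0].children)
    return ", ".join(f"{source}.{attribute}" for _, source, attribute in refs)


def ident(text: str) -> Node:
    return Node("IDENT", text)


def dot(left: Node, right: Node) -> Node:
    return Node("DOT", children=[left, right])


# Parse tree for: select c.headquarters.state.code from Customer c
tree = Node("QUERY", children=[
    Node("SELECT", children=[
        Node("ATTRIBUTE_REF", children=[
            dot(dot(dot(ident("c"), ident("headquarters")), ident("state")),
                ident("code")),
        ]),
    ]),
    Node("FROM", children=[
        Node("SPACE", children=[
            Node("SPACE_ROOT", children=[ident("Customer"), ident("c")]),
        ]),
    ]),
])

decorate(tree)
print(render_select(tree))  # prints <gen:1>.code
```

The DOT nodes are still in the tree after decorate(), but render_select()
never looks at them. In Antlr 4 itself the attachments would probably live
in a side table such as ParseTreeProperty rather than in a field on the
node, but I have not tried that.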
> I would highly prefer to feed the semantic representation of the tree
> to our query generating backends, especially so if we could all share
> the same initial smart phases to do some basic validations and
> optimisations DRY. But then the consuming backends will likely have
> some additional validations and optimisations which need to be
> backend-specific (dialect-specific or technology specific in case of
> OGM).
>
Yes, of course that's my preference as well. But collecting semantic
attachments on higher-level nodes (using one or several visits on the
parse tree) should not be in the way of that.
The difference to incrementally altering the structure of the tree is that
this approach attaches the required state to nodes of the original tree
itself. E.g. in a first pass you could register all alias definitions (e.g.
"c" = PersisterRef(Customer)) in some look-up table. Then in a second pass
you could resolve alias uses against these definitions and attach that
resolved information to the node representing the original reference. So
"semantic representation" would be in node attachments (again, likely in
aggregated forms on super-nodes or nodes somewhere else in the tree)
instead of nodes themselves.
At least that's how I understand things to work in Antlr4 based on their
docs and the user group discussions initiated by Steve.
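
The two passes could look roughly like the following Python sketch (again
not Antlr 4 code, just the idea). The Node class, the "resolved" attachment
key, UnknownAliasError and the example query "select c.name from Customer c"
are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Parse-tree node. `attachments` is a hypothetical slot for decorator state."""
    kind: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    attachments: Dict[str, Any] = field(default_factory=dict)


class UnknownAliasError(Exception):
    """Raised when an alias is used but was never defined (hypothetical)."""


def walk(node: Node) -> Iterator[Node]:
    """Depth-first, pre-order traversal of the parse tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def register_aliases(root: Node) -> Dict[str, Tuple[str, str]]:
    """Pass 1: collect alias definitions into a look-up table."""
    aliases = {}
    for node in walk(root):
        if node.kind == "SPACE_ROOT":
            entity, alias = (child.text for child in node.children)
            aliases[alias] = ("PersisterRef", entity)
    return aliases


def resolve_alias_uses(root: Node, aliases: Dict[str, Tuple[str, str]]) -> None:
    """Pass 2: attach the resolved definition to the node that uses the alias."""
    for node in walk(root):
        if node.kind == "DOT" and node.children[0].kind == "IDENT":
            use = node.children[0]
            if use.text not in aliases:
                raise UnknownAliasError(use.text)
            use.attachments["resolved"] = aliases[use.text]


# Parse tree for: select c.name from Customer c
alias_use = Node("IDENT", "c")
tree = Node("QUERY", children=[
    Node("SELECT", children=[
        Node("DOT", children=[alias_use, Node("IDENT", "name")]),
    ]),
    Node("FROM", children=[
        Node("SPACE", children=[
            Node("SPACE_ROOT", children=[
                Node("IDENT", "Customer"),
                Node("IDENT", "c"),
            ]),
        ]),
    ]),
])

aliases = register_aliases(tree)
resolve_alias_uses(tree, aliases)
```

Running the definition pass over the whole tree first means it does not
matter that the alias is used in SELECT before it is declared in FROM, so
no "hoisting" of the from clause is needed for this particular step.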
>
> Steve, you mentioned that ANTLR4 handles transformations but only when
> it's the last step. What prevents us from chaining multiple such
> transformations, applying the "last step" approach multiple times?
> I didn't look at it at all, so take this just as a high-level,
> conceptual question. I guess one would need to clearly define all
> intermediate data types rather than have ANTLR generate them like it
> does with tokens, but that could be the lesser trouble?
>
> Thanks,
> Sanne
>
> >
> > --Gunnar
> >
> > 2014-11-13 19:42 GMT+01:00 Steve Ebersole <steve at hibernate.org>:
> >
> >> As most of you know already, we are planning to redesign the current
> >> Antlr-based HQL/JPQL parser in ORM for a variety of reasons.
> >>
> >> The current approach in the translator (Antlr 2 based, although Antlr 3
> >> supports the same model) is that we actually define multiple
> >> grammars/parsers which progressively re-write the tree adding more and more
> >> semantic information; think of this as multiple passes or phases. The
> >> current code has 3 phases:
> >> 1) parsing - we simply parse the HQL/JPQL query into an AST, although we
> >> do perform one interesting (and uber-important!) re-write here where we
> >> "hoist" the from clause in front of all other clauses.
> >> 2) rough semantic analysis - the current code, to be honest, sucks here.
> >> The end result of this phase is a tree that mixes normalized semantic
> >> information with lots of SQL fragments. It is extremely fugly.
> >> 3) rendering to SQL
> >>
> >> The idea of phases is still the best way to attack this translation imo. I
> >> just think we did not implement the phases very well before; we were just
> >> learning Antlr at the time. So part of the redesign here is to leverage
> >> our better understanding of Antlr and design some better trees. The other
> >> big reason is to centralize the generation of SQL into one place rather
> >> than the 3 different places we do it today (not to mention the many, many
> >> places we render SQL fragments).
> >>
> >> Part of the process here is to decide which parser to use. Antlr 2 is
> >> ancient :) I used Antlr 3 in the initial prototyping of this redesign
> >> because it was the most recent release at that time. In the interim Antlr
> >> 4 has been released.
> >>
> >> I have been evaluating whether Antlr 4 is appropriate for our needs there.
> >> Antlr 4 is a pretty big conceptual deviation from Antlr 2/3 in quite a few
> >> ways. Generally speaking, Antlr 4 is geared more towards interpreting
> >> rather than translating/transforming. It can handle "transformation" if
> >> the transformation is the final step in the process. Transformation is
> >> where tree re-writing comes in handy.
> >>
> >> First let's step back and look at the "conceptual model" of Antlr 4. The
> >> grammar is used to produce:
> >> 1) the parser - takes the input and builds a "parse tree" based on the
> >> rules of the lexer and grammar.
> >> 2) listener/visitor for parse-tree traversal - can optionally generate
> >> listeners or visitors (or both) for traversing the parse tree (output from
> >> parser).
> >>
> >> There are 2 highly-related changes that negatively impact us:
> >> 1) no tree grammars/parsers
> >> 2) no tree re-writing
> >>
> >> Our existing translator is fundamentally built on the concepts of tree
> >> parsers and tree re-writing. Even the initial prototypes for the redesign
> >> (and the current state of hql-parser which Sanne and Gunnar picked up from
> >> there) are built on those concepts. So moving to Antlr 4 in that regard
> >> does represent a risk. How big of a risk, and whether that risk is worth
> >> it, is what we need to determine.
> >>
> >> What does all this mean in simple, practical terms? Let's look at a simple
> >> query: "select c.headquarters.state.code from Customer c". Simple syntactic
> >> analysis will produce a tree something like:
> >>
> >> [QUERY]
> >>   [SELECT]
> >>     [DOT]
> >>       [DOT]
> >>         [DOT]
> >>           [IDENT, "c"]
> >>           [IDENT, "headquarters"]
> >>         [IDENT, "state"]
> >>       [IDENT, "code"]
> >>   [FROM]
> >>     [SPACE]
> >>       [SPACE_ROOT]
> >>         [IDENT, "Customer"]
> >>         [IDENT, "c"]
> >>
> >> There is not a lot of semantic (meaning) information here. A more semantic
> >> representation of the query would look something like:
> >>
> >> [QUERY]
> >>   [SELECT]
> >>     [ATTRIBUTE_REF]
> >>       [ALIAS_REF, "<gen:1>"]
> >>       [IDENT, "code"]
> >>   [FROM]
> >>     [SPACE]
> >>       [PERSISTER_REF]
> >>         [ENTITY_NAME, "com.acme.Customer"]
> >>         [ALIAS, "c"]
> >>         [JOIN]
> >>           [INNER]
> >>           [ATTRIBUTE_JOIN]
> >>             [IDENT, "headquarters"]
> >>             [ALIAS, "<gen:0>"]
> >>             [JOIN]
> >>               [INNER]
> >>               [ATTRIBUTE_JOIN]
> >>                 [IDENT, "state"]
> >>                 [ALIAS, "<gen:1>"]
> >>
> >>
> >> Notice especially the difference in the tree rules. This is tree
> >> re-writing, and is the major difference affecting us. Consider a specific
> >> thing like the "c.headquarters.state.code" DOT-IDENT sequence. Essentially
> >> Antlr 4 would make us deal with that as a DOT-IDENT sequence through all
> >> the phases - even SQL generation. Quite fugly. The intent of Antlr 4 in
> >> cases like this is to build up an external state table (external to the
> >> tree itself) or what Antlr folks typically refer to as "iterative tree
> >> decoration"[1]. So with Antlr 4, in generating the SQL, we would still be
> >> handling calls in terms of "c.headquarters.state.code" in the SELECT clause
> >> and resolving that through the external symbol tables. Again, with Antlr 4
> >> we would always be walking that initial (non-semantic) tree. Unless I am
> >> missing something. I would be happy to be corrected, if anyone knows Antlr
> >> 4 better. I have also asked as part of the antlr-discussion group[2].
> >>
> >> In my opinion though, if it comes down to us needing to walk the tree in
> >> that first form across all phases I just do not see the benefit to moving
> >> to Antlr 4.
> >>
> >> P.S. When I say SQL above I really just mean the target query language for
> >> the back-end data store whether that be SQL targeting an RDBMS for ORM or a
> >> NoSQL store for OGM.
> >>
> >> [1] I still have not fully grokked this paradigm, so I may be missing
> >> something, but... AFAICT even in this paradigm the listener/visitor rules
> >> are defined in terms of the initial parse tree rules rather than more
> >> [2] https://groups.google.com/forum/#!topic/antlr-discussion/hzF_YrzfDKo
> >> _______________________________________________
> >> hibernate-dev mailing list
> >> hibernate-dev at lists.jboss.org
> >> https://lists.jboss.org/mailman/listinfo/hibernate-dev
> >>