Re: [hibernate-dev] Query handling : Antlr 3 versus Antlr 4

Tuesday, 9 June 2015

Hi Steve,

Did you ever have a chance to apply the "decorated parse tree" approach to
your Antlr4 PoC?

What I like about the Antlr4 approach is the fact that you don't need a set
of several quite similar grammars as you'd do with the tree transformation
approach. Also, using the current version of Antlr instead of 3 appears
attractive to me wrt. bugfixes and future development of the tool.

Based on what I understand from your discussions on the Antlr mailing list,
I'd assume the parse tree and the external state it references to look
roughly like so (---> indicates a reference to state built up during
subsequent walks, maybe in some external "table", maybe stored within
the (typed) tree nodes themselves):

[QUERY]
  [SELECT]
    [ATTRIBUTE_REF] ---> AttributeReference("<gen:1>", "code")
      [DOT]
        [DOT]
          [DOT]
            [IDENT, "c"]
            [IDENT, "headquarters"]
          [IDENT, "state"]
        [IDENT, "code"]
  [FROM]
    [SPACE]
      [SPACE_ROOT] ---> InnerJoin(
                          InnerJoin(
                            PersisterRef("c", "com.acme.Customer"),
                            TableRef("<gen:0>", "headquarters")),
                          TableRef("<gen:1>", "state"))
        [IDENT, "Customer"]
        [IDENT, "c"]

I.e. instead of transforming the tree itself, the state required for output
generation would be added as "decorators" to nodes of the original parse
tree itself. That's just the basic idea as I understand it; the specific
types of the decorator elements (AttributeReference, InnerJoin etc.) may
well look different. During "query rendering" we'd have to inspect the
decorator state of the parse tree nodes and interpret it accordingly.
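
To make the idea a bit more concrete, here is a minimal sketch in plain Java.
It does not use Antlr at all: Node is a hypothetical stand-in for a parse tree
node, AttributeReference is only my guess at what a decorator type could look
like, and the IdentityHashMap plays the role of the external "table" (the
Antlr 4 runtime offers ParseTreeProperty for this kind of node annotation).
The rendering walk prefers a node's decorator over its DOT/IDENT subtree:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class DecoratedTreeSketch {

    /** Hypothetical stand-in for a parse tree node; never modified after parsing. */
    static final class Node {
        final String type;
        final String text;          // null for non-leaf nodes
        final List<Node> children;

        Node(String type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Assumed decorator type: an attribute read from a (possibly generated) alias. */
    static final class AttributeReference {
        final String alias;
        final String attribute;

        AttributeReference(String alias, String attribute) {
            this.alias = alias;
            this.attribute = attribute;
        }
    }

    /** Rendering walk: a decorated node is rendered from its decorator, not its subtree. */
    static String render(Node node, Map<Node, Object> decorations) {
        Object decoration = decorations.get(node);
        if (decoration instanceof AttributeReference) {
            AttributeReference ref = (AttributeReference) decoration;
            return ref.alias + "." + ref.attribute;
        }
        if (node.text != null) {
            return node.text;
        }
        List<String> parts = new ArrayList<>();
        for (Node child : node.children) {
            parts.add(render(child, decorations));
        }
        return String.join(" ", parts);
    }

    static Node ident(String text) {
        return new Node("IDENT", text);
    }

    static Node dot(Node left, Node right) {
        return new Node("DOT", null, left, right);
    }

    public static void main(String[] args) {
        // The external "table", keyed by node identity so the tree stays untouched.
        Map<Node, Object> decorations = new IdentityHashMap<>();

        Node path = dot(dot(dot(ident("c"), ident("headquarters")), ident("state")), ident("code"));
        Node attributeRef = new Node("ATTRIBUTE_REF", null, path);

        // An earlier semantic walk would have produced this entry.
        decorations.put(attributeRef, new AttributeReference("<gen:1>", "code"));

        System.out.println(render(attributeRef, decorations));
    }
}
```

The tree is never rewritten here; only the table grows from walk to walk, and
the renderer decides per node whether a decorator replaces the raw subtree.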

So I believe the issue of alias resolution and implicit join conversion
could be handled without tree transformations (at least conceptually; I
could not write an actual implementation off the top of my head right
away). But maybe there are other cases where tree transformations are more
strictly needed?
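
Purely as an illustration of the implicit join part, here is a rough sketch of
how a dotted path could be turned into joins with generated aliases, with the
result kept next to the tree rather than in it. All names (ImplicitJoinSketch,
Resolution, the "<gen:n>" alias format, the join strings) are made up for this
mail, and the sketch simply assumes that every segment between the root alias
and the last one is an association; a real implementation would have to consult
the mapping metadata to decide that:

```java
import java.util.ArrayList;
import java.util.List;

class ImplicitJoinSketch {

    /** Hypothetical result of resolving one dotted path against its root alias. */
    static final class Resolution {
        final List<String> joins = new ArrayList<>();  // one entry per implicit join
        String attributeReference;                     // what the SELECT item resolves to
    }

    /**
     * Every segment between the root alias and the last segment becomes an
     * inner join with a generated alias; the last segment is then read from
     * the alias of the final join.
     */
    static Resolution resolve(String dottedPath) {
        String[] segments = dottedPath.split("\\.");
        if (segments.length < 2) {
            throw new IllegalArgumentException("expected <alias>.<attribute>: " + dottedPath);
        }
        Resolution result = new Resolution();
        String currentAlias = segments[0];
        for (int i = 1; i < segments.length - 1; i++) {
            String generatedAlias = "<gen:" + (i - 1) + ">";
            result.joins.add("inner join " + currentAlias + "." + segments[i] + " as " + generatedAlias);
            currentAlias = generatedAlias;
        }
        result.attributeReference = currentAlias + "." + segments[segments.length - 1];
        return result;
    }

    public static void main(String[] args) {
        Resolution resolution = resolve("c.headquarters.state.code");
        for (String join : resolution.joins) {
            System.out.println(join);
        }
        System.out.println(resolution.attributeReference);
    }
}
```

The joins would end up as decorator state on the FROM side and the attribute
reference as decorator state on the SELECT item, so both walks can share the
same generated aliases.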

--Gunnar

2014-11-13 19:42 GMT+01:00 Steve Ebersole <steve(a)hibernate.org>:

...
 As most of you know already, we are planning to redesign the current
 Antlr-based HQL/JPQL parser in ORM for a variety of reasons.

 The current approach in the translator (Antlr 2 based, although Antlr 3
 supports the same model) is that we actually define multiple
 grammars/parsers which progressively re-write the tree adding more and more
 semantic information; think of this as multiple passes or phases.  The
 current code has 3 phases:
 1) parsing - we simply parse the HQL/JPQL query into an AST, although we do
 do one interesting (and uber-important!) re-write here where we "hoist" the
 from clause in front of all other clauses.
 2) rough semantic analysis - the current code, to be honest, sucks here.
 The end result of this phase is a tree that mixes normalized semantic
 information with lots of SQL fragments.  It is extremely fugly.
 3) rendering to SQL

 The idea of phases is still the best way to attack this translation imo.  I
 just think we did not implement the phases very well before; we were just
 learning Antlr at the time.  So part of the redesign here is to leverage
 our better understanding of Antlr and design some better trees.  The other
 big reason is to centralize the generation of SQL into one place rather
 than the 3 different places we do it today (not to mention the many, many
 places we render SQL fragments).

 Part of the process here is to decide which parser to use.  Antlr 2 is
 ancient :)  I used Antlr 3 in the initial prototyping of this redesign
 because it was the most recent release at that time.  In the interim Antlr
 4 has been released.

 I have been evaluating whether Antlr 4 is appropriate for our needs there.
 Antlr 4 is a pretty big conceptual deviation from Antlr 2/3 in quite a few
 ways.  Generally speaking, Antlr 4 is geared more towards interpreting
 rather than translating/transforming.  It can handle "transformation" if
 the transformation is the final step in the process.  Transformation is
 where tree re-writing comes in handy.

 First, let's step back and look at the "conceptual model" of Antlr 4.  The
 grammar is used to produce:
 1) the parser - takes the input and builds a "parse tree" based on the
 rules of the lexer and grammar.
 2) listener/visitor for parse-tree traversal - can optionally generate
 listeners or visitors (or both) for traversing the parse tree (output from
 parser).

 There are 2 highly-related changes that negatively impact us:
 1) no tree grammars/parsers
 2) no tree re-writing

 Our existing translator is fundamentally built on the concepts of tree
 parsers and tree re-writing.  Even the initial prototypes for the redesign
 (and the current state of hql-parser which Sanne and Gunnar picked up from
 there) are built on those concepts.  So moving to Antlr 4 in that regard
 does represent a risk.  How big of a risk, and whether that risk is worth
 it, is what we need to determine.

 What does all this mean in simple, practical terms?  Let's look at a simple
 query: "select c.headquarters.state.code from Customer c".  Simple syntactic
 analysis will produce a tree something like:

 [QUERY]
   [SELECT]
     [DOT]
       [DOT]
         [DOT]
           [IDENT, "c"]
           [IDENT, "headquarters"]
         [IDENT, "state"]
       [IDENT, "code"]
   [FROM]
     [SPACE]
       [SPACE_ROOT]
         [IDENT, "Customer"]
         [IDENT, "c"]

 There is not a lot of semantic (meaning) information here.  A more semantic
 representation of the query would look something like:

 [QUERY]
   [SELECT]
     [ATTRIBUTE_REF]
       [ALIAS_REF, "<gen:1>"]
       [IDENT, "code"]
   [FROM]
     [SPACE]
       [PERSISTER_REF]
         [ENTITY_NAME, "com.acme.Customer"]
         [ALIAS, "c"]
         [JOIN]
           [INNER]
           [ATTRIBUTE_JOIN]
             [IDENT, "headquarters"]
             [ALIAS, "<gen:0>"]
               [JOIN]
                 [INNER]
                 [ATTRIBUTE_JOIN]
                   [IDENT, "state"]
                   [ALIAS, "<gen:1>"]

 Notice especially the difference in the tree rules.  This is tree
 re-writing, and is the major difference affecting us.  Consider a specific
 thing like the "c.headquarters.state.code" DOT-IDENT sequence.  Essentially
 Antlr 4 would make us deal with that as a DOT-IDENT sequence through all
 the phases - even SQL generation.  Quite fugly.  The intent of Antlr 4 in
 cases like this is to build up an external state table (external to the
 tree itself) or what Antlr folks typically refer to as "iterative tree
 decoration"[1].  So with Antlr 4, in generating the SQL, we would still be
 handling calls in terms of "c.headquarters.state.code" in the SELECT clause
 and resolving that through the external symbol tables.  Again, with Antlr 4
 we would always be walking that initial (non-semantic) tree.  Unless I am
 missing something.  I would be happy to be corrected, if anyone knows Antlr
 4 better.  I have also asked as part of the antlr-discussion group[2].

 In my opinion though, if it comes down to us needing to walk the tree in
 that first form across all phases I just do not see the benefit to moving
 to Antlr 4.

 P.S. When I say SQL above I really just mean the target query language for
 the back-end data store whether that be SQL targeting a RDBMS for ORM or a
 NoSQL store for OGM.

 [1]  I still have not fully grokked this paradigm, so I may be missing
 something, but... AFAICT even in this paradigm the listener/visitor rules
 are defined in terms of the initial parse tree rules rather than more
 [2] https://groups.google.com/forum/#!topic/antlr-discussion/hzF_YrzfDKo
