[hibernate-dev] Query handling : Antlr 3 versus Antlr 4

Tue Jun 9 11:14:50 EDT 2015

2015-06-09 16:02 GMT+02:00 Steve Ebersole <steve at hibernate.org>:

> On Tue, Jun 9, 2015 at 3:57 AM Gunnar Morling <gunnar at hibernate.org>
> wrote:
>
>> What I like about the Antlr4 approach is the fact that you don't need a
>> set of several quite similar grammars as you'd do with the tree
>> transformation approach. Also using the current version of Antlr instead of
>> 3 appears attractive to me wrt. bugfixes and future development of the
>> tool.
>>
>
> Understand that we would all "like" to use Antlr 4 for many reasons,
> myself included.  But it has to work for our needs.  There are just so many
> open questions (for me) as to whether that is the case.
>

Sure, that's what we need to find out.

>> Based on what I understand from your discussions on the Antlr mailing
>> list, I'd assume the parse tree and the external state it references to
>> look roughly like so (---> indicates a reference to state built up during
>> subsequent walks, maybe in some external "table", maybe stored within
>> the (typed) tree nodes themselves):
>>
>> [QUERY]
>>   [SELECT]
>>     [ATTRIBUTE_REF] ---> AttributeReference("<gen:1>", "code")
>>       [DOT]
>>         [DOT]
>>           [DOT]
>>             [IDENT, "c"]
>>             [IDENT, "headquarters"]
>>           [IDENT, "state"]
>>         [IDENT, "code"]
>>   [FROM]
>>     [SPACE]
>>       [SPACE_ROOT] ---> InnerJoin( InnerJoin ( PersisterRef( "c",
>> "com.acme.Customer" ), TableRef ( "<gen:0>", "headquarters" ) ), TableRef (
>> "<gen:1>", "state" ) )
>>         [IDENT, "Customer"]
>>         [IDENT, "c"]
>>
>> I.e. instead of transforming the tree itself, the state required for
>> output generation would be added as "decorators" to nodes of the original
>> parse tree itself. That's just the basic idea as I understand it, surely
>> the specific types of the decorator elements (AttributeReference,
>> InnerJoin etc.) may look different. During "query rendering" we'd have
>> to inspect the decorator state of the parse tree nodes and interpret it
>> accordingly.
>>
>
> Well, see you do something "tricky" here that is actually one of my
> concerns with Antlr 4 :)  You mix a parse tree and a semantic tree.
> Specifically this part of your tree:
>
>  [ATTRIBUTE_REF] ---> AttributeReference("<gen:1>", "code")
>       [DOT]
>         [DOT]
>           [DOT]
>             [IDENT, "c"]
>             [IDENT, "headquarters"]
>           [IDENT, "state"]
>         [IDENT, "code"]
>
>  The idea of "ATTRIBUTE_REF" is a semantic concept.  The DOT-IDENT struct
> is your parse tree.  Antlr 4 does allow mixing these based on left
> factoring of the rules, *but* there is an assumption there... that the
> branches in such a left-factored rule can be resolved unambiguously.  I
> am not so sure we can do that.
>

Yes, indeed I cheated here a bit. Probably it should be the following
instead:

      [DOT] ---> AttributeReference("<gen:1>", "code")
        [DOT]
          [DOT]
            [IDENT, "c"]
            [IDENT, "headquarters"]
          [IDENT, "state"]
        [IDENT, "code"]

Or maybe something like:

    [SELECTION_PARTICLE] ---> AttributeReference("<gen:1>", "code")
      [DOT]
        [DOT]
          [DOT]
            [IDENT, "c"]
            [IDENT, "headquarters"]
          [IDENT, "state"]
        [IDENT, "code"]

Where SELECTION_PARTICLE would be an abstract representation of anything
that can be selected (attribute ref, Java literal ref etc.) and the
decorator element added in a later pass would specify its actual semantics
based on the alias definitions etc. discovered before.

The bottom line is that decorators providing semantics are attached to the
nodes of the parse tree based on information gathered in previous passes.
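
To make the decorator idea a bit more concrete, here is a minimal sketch in
plain Java. It is not actual Antlr API and not a proposal for the real types:
Node, AttributeReference, Decorations and DecoratorDemo are all hypothetical
names made up for illustration. The point is only that the semantic state
lives in a side table keyed by node identity, so the parse tree itself stays
untouched. (Antlr 4 ships org.antlr.v4.runtime.tree.ParseTreeProperty, an
identity-keyed map from tree nodes to values, which could play the role of
Decorations here.)

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified parse tree node; stands in for an Antlr 4 parse tree node. */
final class Node {
    final String type;                          // e.g. "DOT" or "IDENT"
    final String text;                          // identifier text, null for inner nodes
    final List<Node> children = new ArrayList<>();

    Node(String type, String text, Node... kids) {
        this.type = type;
        this.text = text;
        for (Node k : kids) {
            children.add(k);
        }
    }
}

/** Hypothetical decorator: the semantics a later pass attaches to a node. */
final class AttributeReference {
    final String sourceAlias;
    final String attributeName;

    AttributeReference(String sourceAlias, String attributeName) {
        this.sourceAlias = sourceAlias;
        this.attributeName = attributeName;
    }

    @Override
    public String toString() {
        return "AttributeReference(\"" + sourceAlias + "\", \"" + attributeName + "\")";
    }
}

/** Side table keyed by node identity; the tree is never modified. */
final class Decorations<V> {
    private final Map<Node, V> byNode = new IdentityHashMap<>();

    void put(Node node, V value) { byNode.put(node, value); }

    V get(Node node) { return byNode.get(node); }
}

final class DecoratorDemo {
    static Node ident(String text) { return new Node("IDENT", text); }

    static Node dot(Node left, Node right) { return new Node("DOT", null, left, right); }

    /** Builds the DOT/IDENT structure for c.headquarters.state.code. */
    static Node samplePath() {
        return dot(dot(dot(ident("c"), ident("headquarters")), ident("state")), ident("code"));
    }

    public static void main(String[] args) {
        Node path = samplePath();
        Decorations<AttributeReference> semantics = new Decorations<>();
        // A later pass attaches the resolved meaning to the outermost DOT node.
        semantics.put(path, new AttributeReference("<gen:1>", "code"));
        System.out.println("[" + path.type + "] ---> " + semantics.get(path));
    }
}
```

A rendering pass would then look up semantics.get(node) for each node it
visits and only fall back to the raw DOT/IDENT structure where no decorator
is present.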

> In simpler terms... Antlr 4 needs you to be able to apply those semantic
> resolutions (attributeRef versus javaLiteralRef versus
> oraclePackagedProcedure versus ...) up front.
>
> So take the input that produces that tree: select c.headquarters.state.code
>
> Syntactically that dot-ident structure could represent any number of
> things.  And semantically we just simply do not have enough information.
> We *could* eliminate it being a javaLiteralRef if we
> made javaLiteralRef the highest precedence branch in the left-factored rule
> that produces this, but that has serious drawbacks:
> 1) we are checking each and every dot-ident path as a possible
> javaLiteralRef first, which means reflection (perf)
> 2) it is not a fool-proof approach.  The problem is that javaLiteralRef
> should really have very low precedence.  There are conceivably cases where
> the expression could resolve to either a javaLiteralRef or an attributeRef,
> and in those cases the resolution should be routed through attributeRef not
> javaLiteralRef
>
> The ultimate problem there is that we cannot possibly know much of the
> information we need to know for proper semantic analysis until after we
> have seen the FROM clause.  We got around that with older Antlr versions
> specifically via tree-rewriting: we re-write the tree to "hoist" FROM
> before the other clauses.
>
>
>> So I believe the issue of alias resolution and implicit join conversion
>> could be handled without tree transformations (at least conceptually, I
>> could not code an actual implementation off the top of my head right
>> away). But maybe there are other cases where tree transformations are more
>> strictly needed?
>>
>
> Well I just illustrated above how that is actually a problem that does
> need either tree transformations or at least delayed processing of the
> sub-tree.
>
> Also get out of your head this idea that we can encode the semantic
> resolution of dot-ident paths into the tree.  We simply will not be able to
> (I believe).
>

Not into the tree itself, but we can encode that semantic resolution into
decorators (node attachments).
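
A rough sketch of the "delayed processing" variant, again in plain Java and
with made-up names (PathResolver, Kind, registerFromElement and the
isJavaConstant predicate are all assumptions, not existing Hibernate or Antlr
API). The first pass only visits the FROM clause and records aliases; the
dot-ident paths of the other clauses are classified in a second pass, with
attributeRef tried first and javaLiteralRef only as a fallback, so no
reflection is needed for paths that start with a known alias.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class PathResolver {
    enum Kind { ATTRIBUTE_REF, JAVA_LITERAL_REF, UNRESOLVED }

    // alias -> entity name, filled while visiting the FROM clause
    private final Map<String, String> aliasToEntity = new HashMap<>();

    // Hypothetical hook for the expensive (reflective) constant lookup.
    private final Predicate<String> isJavaConstant;

    PathResolver(Predicate<String> isJavaConstant) {
        this.isJavaConstant = isJavaConstant;
    }

    /** Pass 1: called for each element of the FROM clause. */
    void registerFromElement(String entityName, String alias) {
        aliasToEntity.put(alias, entityName);
    }

    /** Pass 2: classify a dot-ident path once all aliases are known. */
    Kind resolve(List<String> path) {
        // attributeRef has the higher precedence: a known alias wins.
        if (!path.isEmpty() && aliasToEntity.containsKey(path.get(0))) {
            return Kind.ATTRIBUTE_REF;
        }
        // Only now pay for the constant lookup.
        if (isJavaConstant.test(String.join(".", path))) {
            return Kind.JAVA_LITERAL_REF;
        }
        return Kind.UNRESOLVED;
    }
}
```

The Kind returned by resolve() is what would be attached to the outermost
DOT node as its decorator. For "select c.headquarters.state.code from
Customer c" the path resolves to ATTRIBUTE_REF even if a Java constant of the
same name happened to exist, because the alias check runs first.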

> And I think that starts to show my reservations about Antlr 4.  Basically
> every pass over this tree we will need to deal with [[DOT][IDENT]] as
> opposed to [ATTRIBUTE_REFERENCE]
>

Yes, they would deal with [[DOT][IDENT]] nodes but would benefit from
semantic decorators attached previously. During rendering I would expect
mainly those attachments to be of importance for the query creation.

Admittedly, that's all quite "high level", but so far it seems doable to me
in principle. Of course it doesn't address actual tree transformations such
as (x + 0) -> x. I am not sure whether we have cases like this.
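
For comparison, this is the kind of rewrite I mean by an actual tree
transformation: the result has a different shape than the input, which a
decorator on the original node cannot express directly. Just a toy sketch in
plain Java; the Expr class is hypothetical, and the symmetric (0 + x) case is
only added for illustration.

```java
final class Expr {
    final String op;     // "+" for an addition node, null for a leaf
    final String leaf;   // identifier or literal text, null for an operator node
    final Expr left;
    final Expr right;

    private Expr(String op, String leaf, Expr left, Expr right) {
        this.op = op;
        this.leaf = leaf;
        this.left = left;
        this.right = right;
    }

    static Expr leaf(String text) { return new Expr(null, text, null, null); }

    static Expr plus(Expr left, Expr right) { return new Expr("+", null, left, right); }

    boolean isZero() { return "0".equals(leaf); }

    /** Returns a simplified tree; the input tree is left unchanged. */
    Expr simplify() {
        if (op == null) {
            return this;                // leaves are already as simple as possible
        }
        Expr l = left.simplify();
        Expr r = right.simplify();
        if (r.isZero()) {
            return l;                   // (x + 0) -> x
        }
        if (l.isZero()) {
            return r;                   // (0 + x) -> x
        }
        return plus(l, r);
    }

    @Override
    public String toString() {
        return op == null ? leaf : "(" + left + " " + op + " " + right + ")";
    }
}
```

With decorators only, the closest equivalent would be to mark the "+" node
as "render my left child instead", which every later pass would then have to
know about.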