[hibernate-dev] Why does implicit join translate to inner join?

Fri Feb 23 04:39:58 EST 2018

2018-02-22 21:39 GMT+01:00 Steve Ebersole <steve at hibernate.org>:

> On Thu, Feb 22, 2018 at 9:11 AM Lukas Eder <lukas.eder at gmail.com> wrote:
>
>> Hi Steve,
>>
>> Thanks for your message. Of course, being explicit always has the
>> advantage of ... being explicit. But my question here is really about the
>> implicit join feature and how it works.
>>
>
> Sure, but that is also part of the answer.  A big part
>

I take you kinda regret the feature? :-) But I'm certain that many users
find it extremely useful.

> Not to mention, adjusting the type of SQL join used for implicit jois
>>> means I can no longer just look at the query and know what is happening in
>>> terms of SQL joins - which is bad.
>>>
>>
>> Why not? There's just an additional keyword between the generated tables:
>> LEFT (or if you will, LEFT OUTER).
>>
>
> I think you misunderstand...  I was saying that I can no longer look at
> the HQL/JPQL and tell what kind of SQL joins will be used for that if it is
> dependent on the mapped association.
>

OK, I see. However, this could be said of a few other features as well (at
least in SQL). For instance, Oracle's FETCH FIRST n ROWS ONLY is translated
to nesting queries and filtering on ROW_NUMBER() window functions, which is
indeed the same thing. FETCH FIRST n ROWS WITH TIES is translated to
nesting them and filtering on RANK(). FETCH FIRST n PERCENT ROWS can be
implemented with PERCENT_RANK() (although, I think that's not what's being
done). PIVOT could be seen as syntax sugar for GROUP BY and a generated set
of projections, UNPIVOT and GROUPING SETS for UNION ALL. That's the nature
of syntax sugar. It hides verbosity in favour of simplicity and development
ease, and in some cases, offers an option for a specific optimisation that
is only possible when the syntax sugar is used, not the "expanded" version.
For example, few databases are as mad as Oracle to translate FETCH FIRST to
window function filtering - they have more optimal LIMIT implementations.
That's just a historic Oracle issue.

The price is, well, the verbose version (which is more explicit and thus
more "clear") is hidden. But is that a high price to pay? I personally
think not.

>   This approach was mentioned  earlier in the thread.  But you clarified
> that you mean that implicit joins ought to just always be interpreted as an
> outer join.
>

I may have misstated that. Personally, I prefer them to be implemented in
the most optimal way because sadly, not all databases are able to transform
outer joins to inner joins when this is appropriate. From the databases I
checked (DB2 LUW, MySQL, Oracle, PostgreSQL, SQL Server), only DB2 does it.
I can imagine some edge cases where an inner join produces better
cardinality estimates in complex queries than a left join.

On the other hand, not all databases can eliminate inner joins. E.g.
PostgreSQL can only eliminate outer joins. This shouldn't apply in this
discussion, but just to show, there are pros and cons to both join types
from a performance perspective.

But for most trivial cases (and I'm assuming that most JPQL/HQL generated
queries, which don't support derived tables nor unions, will still match my
definition of trivial), this doesn't really matter much anyway, so I
understand that the pragmatic, preferred argument here is to always do it
in the same way.

> You could plugin in your own query handling and interpret implicit joins
> however you want.  That is current not the most trivial task though.
>

Sure, that's always an option! Perhaps a nice blog post for Vlad? ;-)

> Not to mention, IMO interpreting these as inner joins is more OO-ish.
>>>
>>
>> I'm curious about that, would you mind elaborating?
>>
>
> Well in normal Java - Idiomatic Java ;) - calling `p.address.city` leads
> to a NPE if `p.address` happens to be null, right?  To me, using inner
> joins naturally follows from that - you are saying no intermediate path is
> null.
>

I see. Although the query language, even if based on Java-annotated
entities, is a relational-ish one where NULL has a different semantics.
E.g. I don't suppose that substring(NULL, 1, 2) throws a NPE in JPQL, it'll
rather return NULL, just like NULL + 1 returns NULL. Likewise, I'm
expecting a NULL = NULL comparison to yield NULL, not TRUE.

I guess that's a matter of perspective. As a SQL person, I'd expect a
language (JPQL) that so obviously translates to SQL to follow SQL idioms
much more than Java idioms.

> And what's your opinion on the Stream analogy, where the current behaviour
>> (implicit joins from the context of a SELECT clause) corresponds to
>> Stream.map() potentially applying filters?
>>
>
> Well 2 things:
>
> 1) I personally think its not good form to put null in the output of a
> Stream map operation, so to me that in fact does imply a filtering
>

You should try suggesting that to the Stream EG :-) I'm pretty sure
everyone will agree that while your perception of good form is reasonable,
your implication is not. Besides, my example isn't equivalent to returning
null from a Stream.map() operation. It is equivalent to returning a tuple /
object where one of the attributes is null, which is probably a bit less
bad form.

> 2) You think its odd that the SELECT clause "applies a filter", but your
> ok with the fact that it can expand the query domain (from clause)?  I
> think that is inconsistent.
>

Technically, the canonical approach to implementing "implicit joins" is not
to implement joins at all, but correlated subqueries. My original query
example:

  SELECT c.firstName, c.lastName, c.address.city.country.code
  FROM customer c

Could easily be translated to:

  SELECT
    c.first_name,
    c.last_name,
    (
      SELECT co.code
      FROM country co
      JOIN city ci USING (country_id)
      JOIN address a USING (city_id)
      WHERE a.address_id = c.address_id
    )
  FROM customer c

While quickly being more verbose than a join-based solution, this looks
very natural. In addition to that, we get the desired behaviour of having
either the country code if all of address, city, country, and code are
present, or NULL if any of these things are absent. But never an implicit
filter.

This approach would work well logically, regardless where the implicit join
is located (including WHERE, GROUP BY, ORDER BY). However, it is quickly
prohibitive, performance wise, in pretty much every database. Even DB2 has
trouble factoring out similar join graphs to avoid doing them several times.

Yet in principle, correlated subqueries and left joins can be transformed
into each other in many cases, so left join is a pragmatic improvement
here, both from a readability and from a performance perspective. Inner
joins are not, because they are semantically very different.

I hope this clarifies why I don't think that modifying the "query domain",
as you call it, is inconsistent with my mental model of SQL-style
relational algebra. At least as long as the behaviour of SELECT * is
maintained correctly (i.e. the implicitly joined columns will not be
projected).

Lukas