Steven Hawkins edited comment on TEIID-4997 at 7/20/17 11:33 AM:
-----------------------------------------------------------------
Collocating Teiid embedded with the worker will take a lot of specialization.
A wildfly-swarm uber jar without any translators is a convenient starting point, as it
can be loaded through the spark shell. However, our base jar size is > 122 MB, with a
significant runtime footprint as well given the amount of heap Teiid reserves. This
could of course be slimmed further, as we need neither the REST nor the remote JDBC
layers. We'd need to customize the engine startup to be triggered through the driver,
customize resource consumption so that the buffermanager reserves only a small amount
of memory, prevent any materialization loading (for now), and establish a convention
for how vdbs are managed.
In the interest of a POC level of effort, this leads me to believe that we should focus
instead on remote access back to Teiid and, as an optimization, direct creation of
pushdown source queries to JDBC sources by tapping into the translator layer.
Teiid on/with Spark
-------------------
Key: TEIID-4997
URL:
https://issues.jboss.org/browse/TEIID-4997
Project: Teiid
Issue Type: Feature Request
Components: Build/Kits, Query Engine
Reporter: Steven Hawkins
Assignee: Steven Hawkins
With the availability of Spark on OpenShift, we should provide a cooperative
planning/execution mode for Teiid that utilizes the Spark engine.
Roughly this would look like a Teiid master running embedded with the Spark master
serving the typical JDBC/ODBC/OData endpoints. On an incoming query the optimizer would
choose to process against Spark or to process with Teiid - if processing with Teiid that
may still require submitting the job to a worker to avoid burdening the master.
Alternatively the Teiid master could run in a separate pod with the additional
serialization costs, however initially the remote Spark [JDBC/ODBC
layer|https://spark.apache.org/docs/latest/sql-programming-guide.html#dis...]
will not be available in the OpenShift effort.
If execution against Spark is chosen, then instead of a typical Teiid processor plan a
Spark job will be created. Initially this could be limited to relational plans,
but could be expanded to include procedure language support translated to python, scala,
etc. The spark job would represent each source access as a [temporary
view|https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc...]
accessing the relevant pushdown query. Ideally this would be executed against a Teiid
Embedded instance running in the worker node. If remote this would incur an extra hop and
have security considerations. This can be thought of as using Teiid for its
virtualization and access layer features. The rest of the processing about the access
layers could then be represented as Spark SQL.
For example a Teiid user query of "select * from hdfs.tbl h, oracle.tbl o where h.id
= o.id order by h.col" would become the Spark SQL job:
CREATE TEMPORARY VIEW h
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:teiid:vdb",
  dbtable "(select col ... from hdfs.tbl)",
  fetchSize '1024',
  ...
)

CREATE TEMPORARY VIEW o
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:teiid:vdb",
  dbtable "(select col ... from oracle.tbl)",
  fetchSize '1024',
  ...
)

SELECT * FROM h INNER JOIN o ON h.id = o.id ORDER BY h.col
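The DDL above follows a mechanical pattern (one temporary view per pushdown query), so the translation step could be sketched as a small generator. The function name and defaults here are illustrative, not part of any existing Teiid API; the option names (url, dbtable, fetchSize) are the ones Spark's built-in JDBC data source understands:

```python
def temp_view_ddl(view_name, pushdown_query, url="jdbc:teiid:vdb", fetch_size=1024):
    """Emit Spark SQL DDL registering one Teiid pushdown query as a JDBC view.

    The pushdown query is wrapped in parentheses so Spark treats it as a
    derived table rather than a physical table name.
    """
    return (
        f"CREATE TEMPORARY VIEW {view_name}\n"
        "USING org.apache.spark.sql.jdbc\n"
        "OPTIONS (\n"
        f'  url "{url}",\n'
        f'  dbtable "({pushdown_query})",\n'
        f"  fetchSize '{fetch_size}'\n"
        ")"
    )

ddl = temp_view_ddl("h", "select col ... from hdfs.tbl")
```

A planner integration would call this once per source access node, then append the rewritten Spark SQL query over the registered views.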
The challenges/considerations of this are:
* Utilizing embedded with coordinated VDB management. There's the associated issue
of driver management as well.
* Translating Teiid SQL to Spark SQL. All Teiid functions, udfs, aggregate functions
would need to be made known to Spark. Table function constructs, such as XMLTABLE,
TEXTTABLE, etc. could initially just be treated as access layer concerns. Type issues
would exist as xml/clob/json would map to string.
* no xa support
* we'd need to provide reasonable values for fetch size, partition information, etc.
in the access layer queries.
* We'd have to determine the extent to which federated join optimizations need to be
conveyed (dependent join and pushdown) as that would go beyond simply translating to Spark
SQL.
* there's a potential to use [global temporary
views|http://www.gatorsmile.io/globaltempview/], which are a more convenient way of
adding virtualization to Spark.
* Large internal materialization should be re-targeted to Spark or JDG
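On the fetch size/partitioning point above, Spark's JDBC source already exposes the relevant read options (fetchsize, partitionColumn, lowerBound, upperBound, numPartitions), so the access layer mainly needs to compute sensible values for them. A hedged sketch of assembling those options for one pushdown query; the partition bounds are placeholders that in this design would come from Teiid's cost/statistics metadata rather than being hard-coded:

```python
def jdbc_read_options(pushdown_query, partition_column=None,
                      lower=None, upper=None, num_partitions=None,
                      url="jdbc:teiid:vdb", fetch_size=1024):
    """Build the option map for a Spark JDBC read against a Teiid pushdown query.

    Only the option names are taken from Spark's JDBC data source; the
    bound values here are illustrative, not derived from real statistics.
    """
    opts = {
        "url": url,
        # Alias the derived table so Spark can reference it in generated SQL.
        "dbtable": f"({pushdown_query}) src",
        "fetchsize": str(fetch_size),
    }
    if partition_column is not None:
        # All four partitioning options must be supplied together.
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(num_partitions),
        })
    return opts

opts = jdbc_read_options("select col, id from oracle.tbl",
                         partition_column="id", lower=0, upper=1_000_000,
                         num_partitions=8)
```

The resulting map would be passed to `spark.read.format("jdbc").options(...)` (or its SQL `OPTIONS (...)` equivalent) when registering each source view.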