Steven Hawkins commented on TEIID-4594:
---------------------------------------
We should probably look at using the Apache project
https://parquet.apache.org/ - the same
dependency supports both Avro and Parquet.
There is also Arrow
https://arrow.apache.org/docs/index.html - which supports reading
parquet and provides an in-memory layer over which to process results.
The translator will have several responsibilities:
- metadata: Using the metadata facilities provided by the parquet library it should be
possible to automatically create a table representation for a set of parquet files. There
appear to be both an external metadata specification (generally used by Spark) and
metadata embedded in each parquet file. For reference, the Hive metastore shows the common
representations that are possible. Presumably a single file source can define a number of
parquet tables. If automating the import seems too complex, then we should
focus on the set of options needed to define the foreign table: create foreign table
my_parquet_table (...) OPTIONS (PATH '...' PARTITION_COLUMN '...' ...) -
with the expectation that the user will manually add these statements to their ddl. I am
not familiar enough with all of the possible options and partitioning schemes to fully
specify what is needed.
- pushdown: controlled by the ExecutionFactory capabilities methods, we need to determine
what is possible to push down. Since a set of files will generally define the same
logical table, the most fundamental pushdown is on a partitioning column, such
that at least an equality predicate can be used to select a specific partition/file.
Since parquet is a columnar format, the next most beneficial pushdown is projection
minimization, so that only the referenced columns are read.
- processing: utilizing one of the libraries above, the Teiid source query will need a
ResultSetExecution providing the implementation. There are quite a few existing translators
to draw from that convert our SQL language objects to either another query or an api.
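As a sketch of the manual-DDL fallback described under metadata, a foreign table definition might look like the following. The option names are illustrative only, not a settled spec; the columns and values are made up:

```sql
-- Hedged sketch: option names and values are illustrative, not final.
CREATE FOREIGN TABLE my_parquet_table (
    id integer,
    amount decimal(10,2),
    sale_date date                      -- assumed partition column
) OPTIONS (
    PATH '/data/sales',                 -- directory containing the parquet files
    PARTITION_COLUMN 'sale_date'        -- column encoded in the directory layout
);
```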
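To make the partition pushdown concrete, here is a minimal, self-contained sketch (plain Python, no parquet library; the Hive-style dir=value layout and the function name are assumptions for illustration) of how an equality predicate on the partition column can prune the set of files to scan:

```python
# Minimal sketch of Hive-style partition pruning: map an equality
# predicate on the partition column to the subset of files to scan.
# The col=value directory layout is an assumption for illustration.

def prune_partitions(files, partition_column, predicate_value):
    """Keep only files whose path encodes partition_column=predicate_value."""
    token = "%s=%s" % (partition_column, predicate_value)
    return [f for f in files if token in f.split("/")]

files = [
    "/data/sales/sale_date=2016-01-01/part-0.parquet",
    "/data/sales/sale_date=2016-01-02/part-0.parquet",
    "/data/sales/sale_date=2016-01-02/part-1.parquet",
]

# A predicate like WHERE sale_date = '2016-01-02' only needs the last two files.
selected = prune_partitions(files, "sale_date", "2016-01-02")
```

A real translator would map the equality Comparison from the Teiid language objects to this path filter instead of reading every file.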
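The processing responsibility can likewise be sketched independently of any library: a ResultSetExecution essentially iterates rows, and with a columnar source the projection pushdown means only the requested columns are materialized. Plain Python, all names hypothetical:

```python
# Sketch of row iteration over columnar data with projection pushdown:
# only the projected columns are touched, mirroring the benefit of
# column pruning in a parquet reader. Names are hypothetical.

class ColumnarResultSet:
    """Yields rows one at a time from column vectors, the way a
    ResultSetExecution's next() method would."""

    def __init__(self, columns, projected):
        # Keep only the projected columns (projection minimization).
        self.names = [n for n in projected if n in columns]
        self.vectors = [columns[n] for n in self.names]
        self.row = 0
        self.count = len(self.vectors[0]) if self.vectors else 0

    def next(self):
        if self.row >= self.count:
            return None  # end of results
        values = [v[self.row] for v in self.vectors]
        self.row += 1
        return values

columns = {
    "id": [1, 2, 3],
    "amount": [9.5, 3.0, 7.25],
    "sale_date": ["2016-01-01", "2016-01-02", "2016-01-02"],
}
rs = ColumnarResultSet(columns, projected=["id", "amount"])
```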
Add ability to read Parquet Files
---------------------------------
Key: TEIID-4594
URL:
https://issues.redhat.com/browse/TEIID-4594
Project: Teiid
Issue Type: Feature Request
Components: Misc. Connectors
Affects Versions: 9.2
Reporter: Van Halbert
Priority: Major
Fix For: 15.0
Integration with Parquet files on Gluster is an important requirement. RADAnalytics will
be accessing data from Parquet, which is a common file format for Spark.
--
This message was sent by Atlassian Jira
(v7.13.8#713008)