[teiid-issues] [JBoss JIRA] (TEIID-4594) Add ability to read Parquet Files

Fri Jun 26 09:35:01 EDT 2020

    [ https://issues.redhat.com/browse/TEIID-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178080#comment-14178080 ] 

Steven Hawkins commented on TEIID-4594:
---------------------------------------

We should probably look to use the Apache Parquet project https://parquet.apache.org/ - the same dependency supports both Avro and Parquet.

There is also Apache Arrow https://arrow.apache.org/docs/index.html - it supports reading Parquet and provides an in-memory layer over which to process the data.

The translator will have several responsibilities:

- metadata: Using the metadata facilities provided by the Parquet library, it should be possible to automatically create a table representation for a set of Parquet files.  There appears to be both an external metadata specification (generally used by Spark) and metadata in each Parquet file.  For reference, see the common representations possible in the Hive metastore.  Presumably a single file source can define a number of Parquet tables.  If it seems too complex for us to automate the import, then we should focus on the set of options needed to define the foreign table:  create foreign table my_parquet_table (...) OPTIONS (PATH '...', PARTITION_COLUMN '...', ...) - with the expectation that the user will manually add these statements to their DDL.  I am not familiar enough with all of the possible options and partitioning schemes to fully specify what is needed.

- pushdown: Pushdown is controlled by the ExecutionFactory capabilities methods, so we need to determine what is possible.  Since a set of files will generally define the same logical table, the most fundamental pushdown would be on a partitioning column, such that at least an equality predicate can be used to select a specific partition/file.  Since it's a columnar format, the most benefit from other pushdowns would come from allowing further projection minimization.

- processing: Using one of the libraries above, the Teiid source query will need a ResultSetExecution that provides the implementation.  There are quite a few examples to draw from that translate our SQL language objects to either another query or an API.
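
As a rough illustration of the partition pushdown idea, here is a minimal Java sketch of selecting files by an equality predicate on a partitioning column. It assumes a Hive-style directory layout, where the partition value is encoded in the path as a column=value segment; that layout, the class name PartitionPruning, and the methods partitionValue and prune are all hypothetical and not part of Teiid or the Parquet library. It uses only the JDK and does not read any Parquet data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Teiid or Parquet API.
public class PartitionPruning {

    /**
     * Returns the value of partitionColumn encoded in a Hive-style path
     * (assumed layout: .../column=value/file.parquet), or null when the
     * path has no segment for that column.
     */
    public static String partitionValue(String path, String partitionColumn) {
        String prefix = partitionColumn + "=";
        for (String segment : path.split("/")) {
            if (segment.startsWith(prefix)) {
                return segment.substring(prefix.length());
            }
        }
        return null;
    }

    /** Keeps only the files whose partition value equals the requested value. */
    public static List<String> prune(List<String> files, String partitionColumn, String value) {
        List<String> selected = new ArrayList<>();
        for (String file : files) {
            if (value.equals(partitionValue(file, partitionColumn))) {
                selected.add(file);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "sales/year=2019/part-0.parquet",
            "sales/year=2020/part-0.parquet",
            "sales/year=2020/part-1.parquet");
        // Roughly what a pushed-down predicate "year = '2020'" would select.
        System.out.println(prune(files, "year", "2020"));
    }
}
```

In a real translator the predicate would come from the SQL language objects passed to the execution, and only the selected files would then be opened with the chosen Parquet library.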

> Add ability to read Parquet Files
> ---------------------------------
>
>                 Key: TEIID-4594
>                 URL: https://issues.redhat.com/browse/TEIID-4594
>             Project: Teiid
>          Issue Type: Feature Request
>          Components: Misc. Connectors
>    Affects Versions: 9.2
>            Reporter: Van Halbert
>            Priority: Major
>             Fix For: 15.0
>
>
> Integration with Parquet files on Gluster is an important requirement. RADAnalytics will be accessing data from Parquet which is a common file format for Spark. 
