Steven Hawkins commented on TEIID-4594:
---------------------------------------
We should probably look at using the Apache project
https://parquet.apache.org/ - the same
dependency supports both Avro and Parquet.
There is also Arrow
https://arrow.apache.org/docs/index.html - which supports reading
parquet and provides an in-memory layer over which to process results.
The translator will have several responsibilities:
- metadata: Using the metadata facilities provided by the parquet library it should be
possible to automatically create a table representation for a set of parquet files. There
appear to be both an external metadata specification (generally used by Spark) and
metadata embedded in each parquet file. For reference, the Hive metastore shows the common
representations that are possible. Presumably a single file source can define a number of
parquet tables. If automating the import seems too complex, then we should
focus on the set of options needed to define the foreign table: create foreign table
my_parquet_table (...) OPTIONS (PATH '...' PARTITION_COLUMN '...' ...) -
with the expectation that the user will manually add these statements to their ddl. I am
not familiar enough with all of the possible options and partitioning schemes to fully
specify what is needed.
- pushdown: controlled by the ExecutionFactory capabilities methods, we need to determine
what is possible to push down. Since a set of files will generally define the same
logical table, the most fundamental pushdown is on a partitioning column, such
that at least an equality predicate can be used to select a specific partition/file.
Since parquet is a columnar format, the next most beneficial pushdown is projection
minimization, so that only the referenced columns are read.
- processing: utilizing one of the libraries above, the Teiid source query will need a
ResultSetExecution providing the implementation. There are quite a few existing translators
to draw from that convert our SQL language objects to either another query or an api.
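As a sketch of the manual-DDL fallback described under metadata, a foreign table definition might look like the following. The option names are illustrative only, not a settled spec; the columns and values are made up:

```sql
-- Hedged sketch: option names and values are illustrative, not final.
CREATE FOREIGN TABLE my_parquet_table (
    id integer,
    amount decimal(10,2),
    sale_date date                      -- assumed partition column
) OPTIONS (
    PATH '/data/sales',                 -- directory containing the parquet files
    PARTITION_COLUMN 'sale_date'        -- column encoded in the directory layout
);
```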
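To make the partition pushdown concrete, here is a minimal, self-contained sketch (plain Python, no parquet library; the Hive-style dir=value layout and the function name are assumptions for illustration) of how an equality predicate on the partition column can prune the set of files to scan:

```python
# Minimal sketch of Hive-style partition pruning: map an equality
# predicate on the partition column to the subset of files to scan.
# The col=value directory layout is an assumption for illustration.

def prune_partitions(files, partition_column, predicate_value):
    """Keep only files whose path encodes partition_column=predicate_value."""
    token = "%s=%s" % (partition_column, predicate_value)
    return [f for f in files if token in f.split("/")]

files = [
    "/data/sales/sale_date=2016-01-01/part-0.parquet",
    "/data/sales/sale_date=2016-01-02/part-0.parquet",
    "/data/sales/sale_date=2016-01-02/part-1.parquet",
]

# A predicate like WHERE sale_date = '2016-01-02' only needs the last two files.
selected = prune_partitions(files, "sale_date", "2016-01-02")
```

A real translator would map the equality Comparison from the Teiid language objects to this path filter instead of reading every file.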
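The processing responsibility can likewise be sketched independently of any library: a ResultSetExecution essentially iterates rows, and with a columnar source the projection pushdown means only the requested columns are materialized. Plain Python, all names hypothetical:

```python
# Sketch of row iteration over columnar data with projection pushdown:
# only the projected columns are touched, mirroring the benefit of
# column pruning in a parquet reader. Names are hypothetical.

class ColumnarResultSet:
    """Yields rows one at a time from column vectors, the way a
    ResultSetExecution's next() method would."""

    def __init__(self, columns, projected):
        # Keep only the projected columns (projection minimization).
        self.names = [n for n in projected if n in columns]
        self.vectors = [columns[n] for n in self.names]
        self.row = 0
        self.count = len(self.vectors[0]) if self.vectors else 0

    def next(self):
        if self.row >= self.count:
            return None  # end of results
        values = [v[self.row] for v in self.vectors]
        self.row += 1
        return values

columns = {
    "id": [1, 2, 3],
    "amount": [9.5, 3.0, 7.25],
    "sale_date": ["2016-01-01", "2016-01-02", "2016-01-02"],
}
rs = ColumnarResultSet(columns, projected=["id", "amount"])
```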
Add ability to read Parquet Files
---------------------------------
Key: TEIID-4594
URL:
https://issues.redhat.com/browse/TEIID-4594
Project: Teiid
Issue Type: Feature Request
Components: Misc. Connectors
Affects Versions: 9.2
Reporter: Van Halbert
Priority: Major
Fix For: 15.0
Integration with Parquet files on Gluster is an important requirement. RADAnalytics will
be accessing data from Parquet, which is a common file format for Spark.
--
This message was sent by Atlassian Jira
(v7.13.8#713008)