From 2392f06fe1d7e65c9eae42052f54d97a38b463d2 Mon Sep 17 00:00:00 2001
From: Adam Lippai
Date: Sat, 10 Jun 2023 22:30:56 -0400
Subject: [PATCH] Detailed parquet and parquet integration support status

---
 docs/source/status.rst | 104 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/docs/source/status.rst b/docs/source/status.rst
index a73de815b779..5007d135e3a4 100644
--- a/docs/source/status.rst
+++ b/docs/source/status.rst
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadata                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash-based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel RowGroup processing              |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel Page processing                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Storage-aware defaults (1)                |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive concurrency (2)                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive IO when pruning used (3)         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Arrow schema metadata (4)                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RLE / REE support (5)                     |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+
+
+Notes:
+
+* *R* = Read supported
+
+* *W* = Write supported
+
+* \(1) In-memory or memory-mapped files, SSD direct IO, HDD, NFS, and local or remote S3 each need different concurrency and buffer size settings.
+
+* \(2) Depending on the encoding, compression, and row group size, different task sizes might be ideal.
+
+* \(3) Automatic balancing between prefetched / block reads and Page pruning.
+
+* \(4) By default, the Arrow schema is serialized and stored in the Parquet file metadata (in the ``ARROW:schema`` key). When reading the file, if this key is available, it will be used to more faithfully recreate the original Arrow data.
+
+* \(5) Parquet supports RLE encoding of dictionary *data*. Reading and writing a similar structure (e.g. Arrow REE) without allocating the expanded values might be supported, depending on the implementation.