GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status #36027

Closed
alippai wants to merge 1 commit into apache:main from alippai:parquet-advanced-details

alippai commented Jun 11, 2023

Contributor

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

alippai changed the title from "Detailed parquet and parquet integration support status" to "GH-36028: [Documentation] Detailed parquet format support and parquet integration status" Jun 11, 2023
@github-actions

⚠️ GitHub issue #36028 has been automatically assigned in GitHub to PR creator.

alippai commented Jun 11, 2023

Contributor Author

I'm sure this is too detailed in some places, and there is a good chance that it misses many useful features.

My approach was to go through the great blog post, the parquet-format changelog, the Thrift file, and the parquet-mr, arrow and arrow-rs issue queues.

I've intentionally tried to avoid the 2.4-2.10 parquet format version info, as it would imply that the 2.9 features include the 2.6 features, which might not reflect reality. Instead, I've tried to focus on the end-user public API and to provide a flat list of features. I'm open to different approaches as well.

I feel particularly uncertain about the statistics and indices; I'm sure you can do that part better.

alippai commented Jun 11, 2023

Contributor Author

@tustvold @mapleFU @westonpace @wgtmac What do you think? Would this be useful?

tustvold left a comment

Contributor

Left some comments. I would personally restrict this table to features of the actual file readers and not query engine functionality like partitioning and concurrency; IMO these are not features of a parquet implementation, but rather of a query system. A parquet implementation should not be unilaterally making concurrency decisions, but rather exposing APIs that allow query engines to distribute the work how they deem fit. Similarly, partitions are a catalog detail.

I would also suggest having separate tables for supported types, encodings, compression and feature support.

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| LZ4_RAW | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Hive-style partitioning | | | | | |

Contributor

I'm not sure I'd consider this a feature of the parquet implementation; it is more a detail of the query engine, imo.

Contributor Author

While arrow-rs needs datafusion for this functionality, arrow handles it without Acero. I don't have a strong opinion, though.

Member

I agree with @tustvold, partitioning is more like a high-level use case on top of the file format.
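
To make the point of this thread concrete, here is a minimal sketch of Hive-style partition pruning that works purely on file paths and never opens a Parquet file. The helper names and the dataset layout are hypothetical and do not come from any Arrow or Parquet API; the sketch only assumes the usual `key=value` directory convention.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    partitions = {}
    # Skip the last segment: it is the file name, not a partition directory.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def prune_by_partition(paths: list[str], column: str, wanted: str) -> list[str]:
    """Keep only files whose partition value for `column` equals `wanted`."""
    return [p for p in paths if parse_hive_partitions(p).get(column) == wanted]


# Hypothetical dataset layout, for illustration only.
files = [
    "sales/year=2022/month=12/part-0.parquet",
    "sales/year=2023/month=01/part-0.parquet",
    "sales/year=2023/month=02/part-0.parquet",
]
selected = prune_by_partition(files, "year", "2023")
```

Everything here happens before any file is read, which is consistent with the view that partitioning belongs to the dataset or catalog layer rather than to the file reader.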

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| ColumnIndex statistics | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page statistics | | | | | |

Contributor

What is this referring to?

Contributor Author

Like I said, there is a good chance I made a mistake here. I saw this in the Thrift spec: ColumnChunk -> ColumnMetadata -> Statistics

Member

Could we organize these items in a layered fashion? Maybe this is a good starting point: https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features
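
As a rough illustration of the layering discussed in this thread, the sketch below models a row group holding column chunks that each carry optional min/max statistics, and uses them to decide whether a row group can be skipped. It is a simplification under stated assumptions: the classes are hypothetical stand-ins, the intermediate ColumnMetaData level is flattened into `ColumnChunk`, and statistics are plain integers, whereas the real format stores them as encoded bytes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Statistics:
    min_value: Optional[int] = None
    max_value: Optional[int] = None
    null_count: Optional[int] = None


@dataclass
class ColumnChunk:
    path_in_schema: str
    # In the Thrift spec the statistics hang off the column metadata;
    # that level is flattened here to keep the sketch short.
    statistics: Optional[Statistics] = None


@dataclass
class RowGroup:
    num_rows: int
    columns: list[ColumnChunk] = field(default_factory=list)


def may_contain(row_group: RowGroup, column: str, value: int) -> bool:
    """Return False only when statistics prove `value` is absent."""
    for chunk in row_group.columns:
        if chunk.path_in_schema != column:
            continue
        stats = chunk.statistics
        if stats is None or stats.min_value is None or stats.max_value is None:
            return True  # No usable statistics: the row group must be read.
        return stats.min_value <= value <= stats.max_value
    return True  # Unknown column: stay conservative.


# Hypothetical row groups, for illustration only.
rg_low = RowGroup(100, [ColumnChunk("id", Statistics(0, 99, 0))])
rg_high = RowGroup(100, [ColumnChunk("id", Statistics(100, 199, 0))])
rg_nostats = RowGroup(100, [ColumnChunk("id")])
```

Statistics can only rule a row group out; a missing or incomplete entry always means the data has to be read.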

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page CRC32 checksum | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Parallel partition processing | | | | | |

Contributor

IMO this is a query engine detail, not a detail of the file format?

Contributor Author

It's part of the Arrow API in Python.

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| xxHash based bloom filter | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length | | | | | |

Contributor

What is this?

Contributor Author

Contributor

OMG, they finally added it - amazing, will get that incorporated into the rust writer/reader

Member

> OMG, they finally added it - amazing, will get that incorporated into the rust writer/reader

I just added it recently :) Please note that the latest format is not released yet, so parquet-mr does not know about bloom_filter_length yet.

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| BYTE_STREAM_SPLIT | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |

Contributor

Again this is a detail of the query engine not the parquet implementation imo

Contributor Author

Same, it's part of the current API, but I agree it's not consistent across implementations.

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup append / delete | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page append / delete | | | | | |

tustvold commented Jun 11, 2023

Contributor

I don't think any implementation supports page appending; the semantics would be peculiar for things like dictionary pages. The Rust implementation does support appending column chunks, though.

Contributor Author

Yes, likely some / most of the Page references should be ColumnChunk. I'll read about this more.

Member

Isn't Parquet itself a write-once format that can't be appended to? I'm not sure what these are supposed to indicate. The inability to append/delete without re-writing a Parquet file is why table formats like Iceberg and Delta have proliferated.

Comment thread docs/source/status.rst
Comment on lines +428 to +432
| Storage-aware defaults (1) | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Adaptive concurrency (2) | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Adaptive IO when pruning used (3) | | | | | |

tustvold commented Jun 11, 2023

Contributor

I'm not sure which parquet reader these features are based on, but my 2 cents is that they indicate a problematic IO abstraction that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which the Rust and proprietary Databricks implementations do).

Contributor Author

I wanted to capture the IO pushdown section https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/#io-pushdown but also added more. Likely out of scope, as none of the implementations goes into details or provides an API.

Contributor

Perhaps just "Vectorized IO Pushdown". I believe there are efforts to add such an API to parquet-mr.
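
One way to picture "pushing vectored IO down" is that the reader hands the IO layer a list of byte ranges and lets that layer decide how to fetch them, for example by merging nearby ranges into fewer requests. The sketch below shows only that merging step. The function name and the `max_gap` parameter are hypothetical, and this is not a description of how the Rust, Databricks, or parquet-mr implementations actually work.

```python
def coalesce_ranges(ranges: list[tuple[int, int]], max_gap: int) -> list[tuple[int, int]]:
    """Merge (offset, length) byte ranges separated by at most `max_gap` bytes."""
    merged: list[list[int]] = []  # Each entry is [start, end), end exclusive.
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical column chunk locations as (offset, length) pairs.
requests = coalesce_ranges([(0, 100), (120, 50), (1000, 10)], max_gap=32)
```

With this shape of API the storage backend, which knows whether it is talking to a local disk or an object store, picks the gap threshold, and the Parquet reader does not need prefetching heuristics of its own.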

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup pruning using bloom filter | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using projection pushdown | | | | | |

Contributor

Suggested change
- | Page pruning using projection pushdown | | | | | |
+ | Column Pruning using projection pushdown | | | | | |

Member

Isn't this also a detail of the engine choosing what columns to read or not? Or is the intent here to indicate that rows/values can be pruned based on projection directly in the parquet lib?

Comment thread docs/source/status.rst
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using statistics | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using bloom filter | | | | | |

Contributor

I don't think this is supported by the format; bloom filters are per column chunk.
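
Because there is one filter per column chunk, the finest pruning a bloom filter can drive is skipping a whole row group for a given column value. The sketch below shows that idea with a toy filter. It is an assumption-laden illustration: it uses `hashlib.blake2b` as a stand-in hash and a plain bit array, not the xxHash-based split-block filter that the Parquet format specifies, and all names are hypothetical.

```python
import hashlib
from typing import Iterator, Optional


class ToyBloomFilter:
    """Plain bit-array bloom filter; NOT the Parquet split-block layout."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits  # Must be a multiple of 8.
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: bytes) -> Iterator[int]:
        # Derive independent bit positions by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                value, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def row_groups_to_read(filters: list[Optional[ToyBloomFilter]], value: bytes) -> list[int]:
    """Indices of row groups whose column chunk filter cannot rule out `value`."""
    return [i for i, f in enumerate(filters) if f is None or f.might_contain(value)]


# One filter per row group for a single column; None means no filter was written.
with_value = ToyBloomFilter()
with_value.add(b"alice")
filters = [with_value, ToyBloomFilter(), None]
```

A filter can report false positives but never false negatives, so a row group is skipped only when its filter says the value is definitely absent.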

Comment thread docs/source/status.rst
| Format | C++ | Python | Java | Go | Rust |
| | | | | | |
+===========================================+=======+========+========+=======+=======+
| Basic compression | | | | | |

Contributor

I wonder if we could have separate tables for supported physical types, encodings and compression

Member

+1 for this.

github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Jun 11, 2023
kou changed the title from "GH-36028: [Documentation] Detailed parquet format support and parquet integration status" to "GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status" Jun 11, 2023

alippai commented Jun 12, 2023

Contributor Author

Thanks @tustvold. I'll address the Page vs ColumnChunk issues and the other improvement ideas. It's also a good insight that the separation between the parquet, arrow, dataset and query engine level APIs differs across languages.

Comment thread docs/source/status.rst
Comment on lines +367 to +373
+-------------------------------------------+-------+--------+--------+-------+-------+
| File metadata | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup metadata | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Column metadata | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+

Member

Are these intended to track the completeness of the fields defined in the metadata? If yes, they probably deserve a separate table that indicates the state of each field. But that sounds too complicated.

Comment thread docs/source/status.rst
=================================

+-------------------------------------------+-------+--------+--------+-------+-------+
| Format | C++ | Python | Java | Go | Rust |

Member

The Java column could be misleading here. In the arrow repo, there is a Java dataset reader that supports reading from a parquet dataset. If this column is for parquet-mr, then it can easily get out of sync.

@westonpace

Member

I'll repeat what the rest said about engine/format differences and maybe offer some clarification.

In C++ the picture is pretty clear, as the APIs tend to be focused on implementation:

- There is a C++ parquet module which is purely a parquet reader.
- There is a C++ datasets library which, using Acero, offers a lot of features on top of this.

In pyarrow the picture is pretty muddled, as the APIs are more focused on user experience:

- There is a pyarrow.parquet module; however, many of its features are powered by C++ datasets. For example, the pyarrow.parquet module can read from S3 even though the C++ parquet module has no concept of S3 (it just has an abstraction for input streams).

So I agree with the others that we should probably not base the features on the Python API.

@westonpace

Member

Although...to play devil's advocate...it might be odd when a feature is available in the parquet reader, but not yet exposed in the query component. For example, there is some row skipping and bloom filters in the C++ parquet reader, but we haven't integrated those into the datasets layer yet.

@westonpace

Member

Also, do we think this table might belong at https://parquet.apache.org/docs/ (and we could link to it from Arrow's docs)? For example, the parquet-mr (java) implementation and the parquet.net (C#) implementation are not involved with the arrow project but are still standalone parquet readers.

pitrou commented Jun 15, 2023

Member

Agreed with @westonpace.
I created https://issues.apache.org/jira/browse/PARQUET-2310 to propose adding this in the Parquet docs.

alippai commented Jun 15, 2023

Contributor Author

Thanks, I can do another round over the weekend on the correct website, with the suggestions included.

alippai commented Jun 20, 2023

Contributor Author

Moved it to the parquet-site repo: apache/parquet-site#34

Linked issue: [Docs][Parquet] Document Parquet implementation status