Document Parquet Features by Version #186
| | [Variant shredding] | [2.12.0] | [2.11.0..2.12.0] | [Approved 2025-08-24] | | ||
| | [GEOMETRY] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] | | ||
| | [GEOGRAPHY] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] | | ||
| | [LIST] | [1.0.0] | [1.0.0][tree-1.0.0] | | |
There was a problem hiding this comment.
My notes say LIST and MAP were not formally codified with the three-level structure until 2.3.1 (apache/parquet-format@0e2e0a4)
There was a problem hiding this comment.
From what I can tell, LIST and MAP are actually defined in the 1.0.0 spec (using the old, deprecated "ConvertedType" annotations)
LIST: https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift#L57-L59
MAP: https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift#L51-L52
It seems like apache/parquet-format@0e2e0a4 added the types to the LogicalType.md type structure
🤔
There was a problem hiding this comment.
Yes, the original LIST was underspecified, and implementers diverged in how they represented it in the schema. To enable nullable lists with nullable elements, the current 3-level structure in the schema was introduced. We still have to parse older, now non-compliant lists (see for example apache/arrow-rs#8496).
I'm just pointing this out for historical purposes 🤓
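To make the two- vs. three-level distinction concrete, here is a minimal sketch of how a reader might pick the element node of a LIST-annotated group. It is a simplified paraphrase of the backward-compatibility rules in LogicalTypes.md as I understand them, not the full rule set. `Node` and `element_of` are hypothetical names for illustration, not part of any Parquet library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a Parquet schema node."""
    name: str
    repetition: str                          # "required", "optional", or "repeated"
    children: Optional[List["Node"]] = None  # None means a primitive leaf


def element_of(list_group: Node) -> Node:
    """Return the node a reader should treat as the list element.

    Simplified sketch: the LIST-annotated group is assumed to have exactly
    one child, which is the repeated field.
    """
    repeated = list_group.children[0]
    if repeated.children is None:
        return repeated  # legacy: a repeated primitive is itself the element
    if len(repeated.children) > 1:
        return repeated  # legacy: a multi-field repeated group is the element
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated  # legacy naming conventions from older writers
    return repeated.children[0]  # current three-level structure
```

With the three-level form, `element_of` returns the innermost `element` node, which can carry its own `optional` repetition. That extra level is what allows nullable elements inside a nullable list.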
| readers may not understand the new information. | ||
|
|
||
| | Feature | Released in | Source | Notes | | ||
| | ------------------------------------------- | ----------------------------- | --- | ------------------------- | |
There was a problem hiding this comment.
Also from my own notes:
- Addition of the LogicalType union in favor of ConvertedType enum was in 2.4.0.
- Formal deprecation of ConvertedType in 2.9.0
- Addition of NANOS to TimeUnit 2.6.0
There was a problem hiding this comment.
Good call -- added rows for these two:
- Addition of the LogicalType union in favor of ConvertedType enum was in 2.4.0.
- Addition of NANOS to TimeUnit 2.6.0
I added this as a note instead of a new row, as it doesn't change what is actually written into files. Instead, I think it signals a future incompatible change (no longer writing the converted type).
- Formal deprecation of ConvertedType in 2.9.0
| * **V1**: the original Parquet format (1.0). | ||
| * **V2**: format version 2.0. | ||
|
|
||
| | Feature | V1 | V2 | Released in | Source | Notes | |
There was a problem hiding this comment.
could we somehow integrate this as YAML data + rendering like we do for feature compatibility?
There was a problem hiding this comment.
I am happy to convert this to be data driven rendering rather than an explicit table (rather I would get Claude to do it).
But here it may make less sense as I expect this page to be the only consumer/producer of this data, and the features don't seem to need the same type of cross referencing / forced deduplication of the status page
There was a problem hiding this comment.
I think it allows us to have richer hyperlinks on the other page. For the compatibility chart, we already had to populate some of this data:
There was a problem hiding this comment.
This also means one more page necessary to update when we add something new?
There was a problem hiding this comment.
Ok, I will update to be data driven
There was a problem hiding this comment.
@alamb I'm fine with this as a follow-up or we can always reconcile separately.
There was a problem hiding this comment.
Thanks -- I did some research, and I think it would be relatively straightforward to update this page to be data driven.
Upon reflection I would like to do it as a follow on PR to minimize the diff in this PR (I think unifying the data would result in changes to more files and would be harder to track)
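For reference, a minimal sketch of what "data driven" could look like for this page. The thread suggests YAML; a Python literal is used here only to keep the example self-contained. The feature names and versions are taken from the table rows quoted in this review, while `FEATURES` and `render_table` are hypothetical names, not the site's actual build code.

```python
# Feature data, as it might be loaded from a YAML file in the real site.
FEATURES = [
    {"feature": "LIST", "released_in": "1.0.0"},
    {"feature": "LogicalType union", "released_in": "2.4.0"},
    {"feature": "NANOS in TimeUnit", "released_in": "2.6.0"},
    {"feature": "GEOMETRY", "released_in": "2.11.0"},
    {"feature": "GEOGRAPHY", "released_in": "2.11.0"},
    {"feature": "Variant shredding", "released_in": "2.12.0"},
]


def render_table(rows):
    """Render the feature rows as a Markdown table."""
    lines = ["| Feature | Released in |", "| --- | --- |"]
    for row in rows:
        lines.append(f"| {row['feature']} | {row['released_in']} |")
    return "\n".join(lines)
```

Keeping the rows as data would let the compatibility chart link to the same entries instead of repeating them.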
Fokko
left a comment
There was a problem hiding this comment.
Comments are mostly bikeshedding, but I think this would be something great to add. Thanks for taking the time to put this together 🙌
Co-authored-by: Fokko Driesprong <fokko@apache.org>
|
|
||
| ## `FileMetadata` version field | ||
|
|
||
| Each Parquet file has a `version` field in the [`thrift FileMetadata`] that |
There was a problem hiding this comment.
I've said this elsewhere, but I don't think we can rely on a version in the metadata, as it cannot convey changes to the metadata itself.
I think we should just say this field conveys no meaningful information any longer.
There was a problem hiding this comment.
If we are going to make changes to the metadata, I think new magic bytes in the footer would be an alternative.
There was a problem hiding this comment.
That's one option, but I think we'll always want 'PAR', so that limits how many changes we can make with the remaining byte ('1' and 'E' are taken, so I guess that leaves 254 more, fewer if we restrict ourselves to ASCII 🤣). I proposed in the M/L augmenting the header, which would be a forwards compatible change that leaves us leeway to use anything we want to convey versioning info.
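A small sketch of the magic-byte arithmetic above, assuming only the two trailing magic values mentioned in this thread (`PAR1` and `PARE`). `KNOWN_MAGIC`, `classify_magic`, and `unassigned_last_bytes` are hypothetical names for illustration.

```python
# Trailing 4-byte magic values mentioned in this thread.
KNOWN_MAGIC = {
    b"PAR1": "plaintext footer",
    b"PARE": "encrypted footer",
}


def classify_magic(tail: bytes) -> str:
    """Classify a file by the last four bytes of its content."""
    magic = tail[-4:]
    if len(magic) != 4 or not magic.startswith(b"PAR"):
        return "not parquet"
    return KNOWN_MAGIC.get(magic, "unassigned PAR? magic")


def unassigned_last_bytes() -> int:
    """Count the values still free for the fourth byte if 'PAR' is kept."""
    taken = {magic[3] for magic in KNOWN_MAGIC}
    return 256 - len(taken)
```

`unassigned_last_bytes()` returns 254, matching the count in the comment. A hypothetical `PAR2` would fall into the "unassigned" branch for every existing reader.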
There was a problem hiding this comment.
PAR2 for maximum confusion!
There was a problem hiding this comment.
You owe me a new keyboard! I just sprayed coffee all over it 🤣
There was a problem hiding this comment.
See also some other ideas of encoding the parquet-format version in the metadata itself
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Well, the point about releases not actually following semantic versioning is probably more than a nit 😆 I am not sure what to do about this though
Add an asterisk...
🤔 Maybe we could roll it into a real "v3" 🤔 Will wait a while for more comments on this PR before I open that can of worms on the mailing list
|
I think I have addressed most feedback and regenerated the preview: https://alamb.github.io/parquet-site/docs/file-format/versions/
etseidl
left a comment
There was a problem hiding this comment.
Looks good (mod SemVer). Thanks for doing this!
| ## `FileMetadata` version field | ||
|
|
||
| Each Parquet file has a `version` field in the [`thrift FileMetadata`] that | ||
| declares which features the file may use, and thus what a reader **must** support |
There was a problem hiding this comment.
I think my issue is the linear nature of versioning that this imposes on readers. In a perfect world, every reader would implement everything it needed as it is released. But this means a reader needs to move in lock-step with the major version of the header. For example, let's say we release the following features:
- Backward incompatible feature that isn't strictly better for everyone (V3)
- Awesome new encoding that a lot more people care about (V4)
By this spec, any writer would need to write V4. This gives readers two choices:
- Cheat and try to read the data anyway (this makes the version less useful in general, and I think it is one of the reasons some writers always wrote "1": readers were capable of reading new encodings, and of pretty cleanly detecting when they couldn't, so people ignored the guidance).
- Implement both V3 and V4 before they get the benefit of V4 (this might have much longer delays given a lot of parquet implementations are volunteer driven).
Is there a way to reconcile this?
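The tradeoff can be sketched in a few lines. Everything here is hypothetical: the feature names, the version numbers 3 and 4, and the helper functions only mirror the example in this comment and are not part of the spec.

```python
# Hypothetical features, each introduced by one hypothetical format version.
FEATURE_VERSION = {"incompatible_feature": 3, "new_encoding": 4}


def file_version(features_used):
    """Linear scheme: a file is stamped with the highest version it needs."""
    return max((FEATURE_VERSION[f] for f in features_used), default=2)


def reader_version(features_implemented):
    """Linear scheme: a reader may claim version N only if it implements
    every feature introduced up to and including N."""
    version = 2
    for feature, introduced in sorted(FEATURE_VERSION.items(), key=lambda kv: kv[1]):
        if feature not in features_implemented:
            break
        version = introduced
    return version


def can_read_linear(features_implemented, features_used):
    """Strict reading of a linear version number."""
    return reader_version(features_implemented) >= file_version(features_used)


def can_read_by_feature(features_implemented, features_used):
    """The 'cheat' option: try the file and check the features it actually uses."""
    return set(features_used) <= set(features_implemented)
```

A reader that implements only `new_encoding` is rejected by `can_read_linear` for a file that uses only `new_encoding`, even though `can_read_by_feature` shows it could decode the file.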
There was a problem hiding this comment.
This is a really nice description of the core tradeoffs with using Versions (and I think one of the major points we need to resolve to arrive at agreement about how to move versioning forward)
I am not sure there are simple ways to resolve this tradeoff and I believe there are a variety of different opinions
Thus, what I plan to do is update this PR to REMOVE the statement on semantic versioning and keep it focused on the current state of the spec / features. We can then continue to have the fun (FUN!) discussion about how to version / signal new features in the corresponding mailing list thread.
There was a problem hiding this comment.
Clarified in d8c63bd that Parquet does NOT follow semantic versioning
There was a problem hiding this comment.
Actually, here is a proposal for how to encode the parquet-format version directly in the metadata
| [semantic versioning]: | ||
|
|
||
| 1. The major version corresponds to the [`thrift FileMetadata`] `version` field. | ||
| 2. Minor releases (e.g. `2.10.0` to `2.11.0`) may add compatible |
There was a problem hiding this comment.
Again, I'm not sure SemVer is really the right model here. Partially for reasons as outlined in the comment above.
But also because its philosophy is that major version bumps should be rare. If we really do add a handful of new encodings and a new footer, it would get bumped pretty quickly. This might naturally lead to delays similar to what we saw with v2. From my perspective, as long as we agree we are OK bumping for every new feature, I'm generally OK with this. A second alternative is to specify a very short maximum cadence over which features will collect (e.g. 1 month).
There was a problem hiding this comment.
Clarified in d8c63bd that Parquet does NOT follow semantic versioning
| | [Geospatial statistics] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] | | ||
| | [Binary protocol extensions] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2024-09-06] | | ||
| | [IEEE 754 total order and NaN counts] | not yet released | [#514] | [Approved 2026-05-26] | | ||
| | [LogicalType union] | [2.4.0] | [2.3.1..2.4.0] | Supersedes `ConvertedType` enum<br/>deprecated in [2.9.0] | |
There was a problem hiding this comment.
nit: I think the deprecation of converted type is not forward-compatible: you get different values if you don't understand unsigned vs. signed integer types. This could actually lead to incorrect system results.
There was a problem hiding this comment.
The deprecation just means no new values are added to the ConvertedType enum. Writers are still required to populate both ConvertedType and LogicalType (although the wording changes depending on which paragraph you read...one place says "should", two paragraphs later it says "must").
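To illustrate the signed vs. unsigned concern: the same four stored bytes decode to different values depending on whether the reader honors an unsigned annotation. A minimal sketch using only the standard library; the helper names are hypothetical.

```python
import struct


def as_signed_int32(raw: bytes) -> int:
    """Decode 4 little-endian bytes as a reader that ignores the annotation would."""
    return struct.unpack("<i", raw)[0]


def as_unsigned_int32(raw: bytes) -> int:
    """Decode the same 4 bytes as a reader that honors an unsigned annotation would."""
    return struct.unpack("<I", raw)[0]


# The value a writer intended to store as an unsigned 32-bit integer.
raw = struct.pack("<I", 4_000_000_000)
```

Here `as_unsigned_int32(raw)` gives 4000000000 while `as_signed_int32(raw)` gives -294967296, so a reader that sees neither annotation silently returns a different number.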
| Each Parquet file has a `version` field in the [`thrift FileMetadata`] that | ||
| declares which features the file may use, and thus what a reader **must** support | ||
| to read it. | ||
|
|
||
| **Note**: Many writers set the version field to `1` even for files that use | ||
| format 2.0 features, which has caused [confusion and interoperability | ||
| issues][closing-out-2.0]. |
There was a problem hiding this comment.
The first paragraph giving meaning to the "version" metadata field seems a bit confusing/misleading, together with the note, and moreover with the fact that the thrift itself specifically says this should be hardcoded to "1":
...
struct FileMetaData {
/** Version of this file
*
* As of December 2025, there is no agreed upon consensus of what constitutes
* version 2 of the file. For maximum compatibility with readers, writers should
* always populate "1" for version. For maximum compatibility with writers,
* readers should accept "1" and "2" interchangeably. All other versions are
* reserved for potential future use-cases.
*/
1: required i32 version
...
Or when keeping the text here, the note in the thrift file should be updated to match better?
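Read literally, the thrift comment above amounts to roughly the following. This is a sketch with hypothetical names. Rejecting any other value is my assumption, since the thrift text only says other versions are reserved.

```python
# Per the thrift comment: writers should always populate "1".
WRITER_VERSION = 1

# Per the thrift comment: readers should accept "1" and "2" interchangeably.
ACCEPTED_VERSIONS = {1, 2}


def check_version(version: int) -> None:
    """Raise if the FileMetaData version is outside the accepted set.

    Assumption: reserved values are treated as an error rather than ignored.
    """
    if version not in ACCEPTED_VERSIONS:
        raise ValueError(f"unsupported FileMetaData version: {version}")
```

Note that nothing in this check tells the reader which features the file uses, which is the mismatch with the page text raised in this comment.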
There was a problem hiding this comment.
I think you are right -- I will back this description off to match what is in the thrift
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
|
Thank you @etseidl. I also removed the V1/V2 columns in 1126b53 per @pitrou's suggestion on the mailing list and also regenerated the preview and screenshots.
emkornfield
left a comment
There was a problem hiding this comment.
Looking forward to capturing this in a structured way but changes look reasonable.
I will make a PR right after I merge this one to convert to using a structured format
Co-authored-by: emkornfield <emkornfield@gmail.com>
|
@alamb anything we are waiting for on merging?
Not that I know of -- I will plan to merge tomorrow or Friday
| > encryption to read the file; [plaintext footer] files use `PAR1` so legacy | ||
| > readers can still read their unencrypted columns. | ||
|
|
||
| ## Forward compatible additions |
There was a problem hiding this comment.
Do we want to include the new IEEE754TotalOrder column order just released in 2.13.0? I think it is a compatible change if readers just regard the new column order as an unknown order and ignore its min/max stats.
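A sketch of the reader behavior described here, where an unrecognized column order simply disables min/max statistics. `TYPE_ORDER` is the existing `ColumnOrder` field in the thrift definition. The string `"IEEE754TotalOrder"` and the helper name are placeholders for illustration.

```python
# Column orders this hypothetical older reader understands.
KNOWN_COLUMN_ORDERS = {"TYPE_ORDER"}


def usable_min_max(column_order, min_value, max_value):
    """Return (min, max) only when the column order is understood.

    An unknown order is treated as undefined, so the statistics are ignored
    and the reader falls back to scanning the data.
    """
    if column_order not in KNOWN_COLUMN_ORDERS:
        return None
    return (min_value, max_value)
```

Under this behavior an older reader loses pruning on such columns but still returns correct results, which is what would make the addition forward compatible.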
Preview
Rendered preview: https://alamb.github.io/parquet-site/docs/file-format/versions/
Rationale
Adding new backwards-incompatible changes to the Parquet specification requires a way to communicate which systems support which features. Parquet already has a v1/v2 versioning scheme, but it is poorly documented and has caused significant confusion across the ecosystem. Clearly explaining the current (imperfect) scheme is valuable on its own and might also help a potential future V3 rollout.
Changes
Add a versions page to parquet.apache.org explaining:
Screenshots