Document Parquet Features by Version by alamb · Pull Request #186 · apache/parquet-site

alamb · 2026-06-04T20:52:22Z

Preview

Rendered preview: https://alamb.github.io/parquet-site/docs/file-format/versions/

Rationale

Part of Define core features / compliance level parquet-format#384
See the related mailing list thread.

Adding new backwards-incompatible changes to the Parquet specification requires a way to communicate which systems support which features. Parquet already has a v1/v2 versioning scheme, but it is poorly documented and has caused significant confusion across the ecosystem. Clearly explaining the current (imperfect) scheme is valuable on its own and might also help a potential future V3 rollout.

Changes

Add a versions page to parquet.apache.org explaining:

The current versioning scheme and how Parquet versions relate to each other
The version/date each feature was added to the spec
Links to relevant releases and mailing list discussions

Screenshots

etseidl

Flushing some comments. Thanks for taking this on @alamb, I think this is an important addition.

etseidl · 2026-06-05T15:15:08Z

+| [Variant shredding] | [2.12.0] | [2.11.0..2.12.0] | [Approved 2025-08-24] |
+| [GEOMETRY] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] |
+| [GEOGRAPHY] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] |
+| [LIST] | [1.0.0] | [1.0.0][tree-1.0.0] |  |


My notes say LIST and MAP were not formally codified with the three-level structure until 2.3.1 (apache/parquet-format@0e2e0a4)

From what I can tell, LIST and MAP are actually defined in the 1.0.0 spec (using the old, deprecated "ConvertedType" annotations)

LIST: https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift#L57-L59

MAP: https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift#L51-L52

It seems like apache/parquet-format@0e2e0a4 added the types to the LogicalType.md type structure

🤔

Yes, the original LIST was underspecified, and implementers diverged in how lists were represented in the schema. To enable nullable lists with nullable elements, the current 3-level structure in the schema was introduced. We still suffer from having to parse older, now non-compliant lists (see for example apache/arrow-rs#8496).

I'm just pointing this out for historical purposes 🤓

etseidl · 2026-06-05T15:19:25Z

+readers may not understand the new information.
+
+| Feature | Released in | Source | Notes |
+| ------------------------------------------- | ----------------------------- | --- | ------------------------- |


Also from my own notes:

Addition of the LogicalType union in favor of ConvertedType enum was in 2.4.0.

Formal deprecation of ConvertedType in 2.9.0

Addition of NANOS to TimeUnit 2.6.0

Good call -- added rows for these two:

Addition of the LogicalType union in favor of ConvertedType enum was in 2.4.0.

Addition of NANOS to TimeUnit 2.6.0

I added this as a note instead of a new row, as it doesn't change what is actually written into files; instead, I think it signals a future incompatible change (no longer writing ConvertedType)

Formal deprecation of ConvertedType in 2.9.0

emkornfield · 2026-06-05T15:50:52Z

+* **V1**: the original Parquet format (1.0).
+* **V2**: format version 2.0.
+
+| Feature | V1 | V2 | Released in | Source | Notes |


could we somehow integrate this as YAML data + rendering like we do for feature compatibility?

I am happy to convert this to be data driven rendering rather than an explicit table (rather I would get Claude to do it).

But here it may make less sense as I expect this page to be the only consumer/producer of this data, and the features don't seem to need the same type of cross referencing / forced deduplication of the status page

I think it allows us to have richer hyperlinks on the other page. For the compatibility chart, we already had to populate some of this data:

https://github.com/apache/parquet-site/blob/production/data/implementations/features/encodings.yaml#L33

This also means one more page necessary to update when we add something new?

Ok, I will update to be data driven

@alamb I'm fine with this as a follow-up or we can always reconcile separately.

Thanks -- I did some research, and I think it would be relatively straightforward to update this page to be data driven.

Upon reflection I would like to do it as a follow-on PR to minimize the diff in this PR (I think unifying the data would result in changes to more files and would be harder to track)

Fokko

Comments are mostly bikeshedding, but I think this would be something great to add. Thanks for taking the time to put this together 🙌

Co-authored-by: Fokko Driesprong <fokko@apache.org>

etseidl

A few more nits

etseidl · 2026-06-05T17:50:37Z

+
+## `FileMetadata` version field
+
+Each Parquet file has a `version` field in the [`thrift FileMetadata`] that


I've said this elsewhere, but I don't think we can rely on a version in the metadata, as it cannot convey changes to the metadata itself.

I think we should just say this field conveys no meaningful information any longer.

If we are going to make changes to the metadata, I think new magic bytes in the footer would be an alternative.

That's one option, but I think we'll always want 'PAR', so that limits how many changes we can make with the remaining byte ('1' and 'E' are taken, so I guess that leaves 254 more, fewer if we restrict ourselves to ASCII 🤣). I proposed in the M/L augmenting the header, which would be a forwards compatible change that leaves us leeway to use anything we want to convey versioning info.

PAR2 for maximum confusion!

You owe me a new keyboard! I just sprayed coffee all over it 🤣

See also some other ideas of encoding the parquet-format version in the metadata itself

RFC: Encode parquet-format minor_version in thrift metadata parquet-format#581

RFC: Add format_major_version and format_minor_version to thrift metadata parquet-format#582
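For anyone following the magic-bytes part of this thread, here is a minimal sketch of how a reader might classify a file by its trailing bytes. It assumes the usual layout where the last 4 bytes are the magic (`PAR1`, or `PARE` for an encrypted footer) and the 4 bytes before that are a little-endian length. The function name `describe_footer` and the fake byte string are hypothetical, made up for illustration only.

```python
import struct

# Magic values mentioned in this thread; anything else is treated as unknown.
KNOWN_MAGIC = {
    b"PAR1": "plaintext footer",
    b"PARE": "encrypted footer",
}


def describe_footer(data: bytes) -> str:
    """Classify a Parquet-like byte string by its trailing magic bytes."""
    if len(data) < 8:
        raise ValueError("too short to contain a footer length and magic")
    magic = data[-4:]
    # The 4 bytes before the magic hold a little-endian unsigned length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    kind = KNOWN_MAGIC.get(magic)
    if kind is None:
        return f"unknown magic {magic!r}"
    return f"{kind}, footer length {footer_len}"


# A fake "file": leading magic, 16 placeholder bytes, length, trailing magic.
fake = b"PAR1" + b"\x00" * 16 + struct.pack("<I", 16) + b"PAR1"
print(describe_footer(fake))
```

A reader written this way rejects any fourth byte it does not know, which is why changing the magic is a hard break for existing readers.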

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

alamb · 2026-06-05T18:22:05Z

A few more nits

Well, the point about releases not actually following semantic versioning is probably more than a nit 😆 I am not sure what to do about this though

etseidl · 2026-06-05T18:24:27Z

Add an asterisk...

* Aspirational. Real-world versioning may differ.

alamb · 2026-06-05T18:33:52Z

🤔 Maybe we could roll it into a real "v3" 🤔

Will wait a while for more comments on this PR before I open that can of worms on the mailing list

alamb · 2026-06-05T18:34:21Z

I think I have addressed most feedback and regenerated the preview: https://alamb.github.io/parquet-site/docs/file-format/versions/

etseidl

Looks good (mod SemVer). Thanks for doing this!

emkornfield · 2026-06-06T00:01:13Z

+## `FileMetadata` version field
+
+Each Parquet file has a `version` field in the [`thrift FileMetadata`] that
+declares which features the file may use, and thus what a reader **must** support


I think my issue is the linear nature of versioning that this imposes on readers. In a perfect world every reader would implement everything it needed as it is released. But this means a reader needs to move in lock-step with the major version of the header. For example, let's say we release the following features:

Backward incompatible feature that isn't strictly better for everyone (V3)

Awesome new encoding that a lot more people care about (V4)

By this spec, any writer would need to write V4. This gives readers two choices:

Cheat and try to read the data anyway. This makes the version less useful in general, and I think it is one of the reasons some writers always wrote "1": readers were capable of reading new encodings (and pretty cleanly detecting when they couldn't), so people ignored the guidance.

Implement both V3 and V4 before they get the benefit of V4 (this might have much longer delays given a lot of parquet implementations are volunteer driven).

Is there a way to reconcile this?
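To make the tradeoff concrete, here is a rough sketch contrasting a linear version check with a per-feature check. All names are hypothetical (`can_read_linear`, `can_read_by_features`, and the feature labels are invented for this example and are not part of any spec or implementation).

```python
def can_read_linear(reader_max_version: int, file_version: int) -> bool:
    """Linear model: a reader must implement everything up to the file's version."""
    return file_version <= reader_max_version


def can_read_by_features(reader_features: set, file_features: set) -> bool:
    """Feature model: a reader only needs the features the file actually uses."""
    return file_features <= reader_features


# Hypothetical feature labels for the two releases described above.
V3_FEATURE = "v3_incompatible_feature"
V4_ENCODING = "v4_new_encoding"

# A reader that implemented the popular V4 encoding but skipped the V3 feature.
reader_features = {"baseline", V4_ENCODING}
# A file that uses only the V4 encoding, so a linear scheme labels it version 4.
file_features = {"baseline", V4_ENCODING}

# Under the linear model the reader can only honestly claim version 2.
print(can_read_linear(reader_max_version=2, file_version=4))   # False
print(can_read_by_features(reader_features, file_features))    # True
```

The same reader and file give opposite answers under the two models, which is the gap between "cheat" and "implement both V3 and V4" described above.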

This is a really nice description of the core tradeoffs with using Versions (and I think one of the major points we need to resolve to arrive at agreement about how to move versioning forward)

I am not sure there are simple ways to resolve this tradeoff and I believe there are a variety of different opinions

Thus, what I plan to do is update this PR to REMOVE the statement on semantic versioning, and keep it focused on what the current state of the spec / features is. We can then continue to have the fun (FUN!) discussion about how to version / signal new features in the corresponding mailing list thread.

Clarified in d8c63bd that Parquet does NOT follow semantic versioning

Actually, here is a proposal for how to encode the parquet-format version directly in the metadata

RFC: Encode parquet-format minor_version in thrift metadata parquet-format#581

RFC: Add format_major_version and format_minor_version to thrift metadata parquet-format#582

emkornfield · 2026-06-06T00:08:14Z

+[semantic versioning]:
+
+1. The major version corresponds to the [`thrift FileMetadata`] `version` field.
+2. Minor releases (e.g. `2.10.0` to `2.11.0`) may add compatible


Again, I'm not sure SemVer is really the right model here. Partially for reasons as outlined in the comment above.

But also because its philosophy is that major version bumps should be rare. If we really do add a handful of new encodings and a new footer, the major version would get bumped pretty quickly. This might naturally lead to delays similar to what we saw with v2. From my perspective, as long as we agree we are OK bumping for every new feature, I'm generally OK with this. A second alternative is to specify a very short maximum window over which features are collected before a bump (e.g. 1 month).

Clarified in d8c63bd that Parquet does NOT follow semantic versioning

emkornfield · 2026-06-06T00:13:15Z

+| [Geospatial statistics] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09]                                     |
+| [Binary protocol extensions] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2024-09-06]                                     |
+| [IEEE 754 total order and NaN counts] | not yet released | [#514] | [Approved 2026-05-26]                                     |
+| [LogicalType union] | [2.4.0] | [2.3.1..2.4.0] | Supersedes `ConvertedType` enum<br/>deprecated in [2.9.0] |


nit: I think the deprecation of ConvertedType is not forward-compatible: a reader that doesn't understand unsigned vs. signed integer types gets different values. This could actually lead to incorrect results.

The deprecation just means no new values are added to the ConvertedType enum. Writers are still required to populate both ConvertedType and LogicalType (although the wording changes depending on which paragraph you read...one place says "should", two paragraphs later it says "must").
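As a small illustration of the signed/unsigned concern, the sketch below shows the same 4 stored bytes decoded two ways. It assumes a 32-bit unsigned value stored in a 32-bit physical integer; the variable names are made up for this example, and it uses only Python's `struct` module rather than any Parquet library.

```python
import struct

# A value a writer intends as an unsigned 32-bit integer.
intended = 4_000_000_000
raw = struct.pack("<I", intended)  # the 4 bytes that end up stored

# A reader that understands the unsigned annotation.
as_unsigned = struct.unpack("<I", raw)[0]
# A reader that ignores the annotation and treats the bytes as signed.
as_signed = struct.unpack("<i", raw)[0]

print(as_unsigned)  # 4000000000
print(as_signed)    # -294967296
```

No error is raised in the second case; the reader simply returns a different number, which is what makes this kind of mismatch hard to notice.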

jorisvandenbossche · 2026-06-09T12:33:04Z

+Each Parquet file has a `version` field in the [`thrift FileMetadata`] that
+declares which features the file may use, and thus what a reader **must** support
+to read it.
+
+**Note**: Many writers set the version field to `1` even for files that use
+format 2.0 features, which has caused [confusion and interoperability
+issues][closing-out-2.0].


The first paragraph giving meaning to the "version" metadata field seems a bit confusing/misleading, together with the note, and moreover with the fact that the thrift itself specifically says this should be hardcoded to "1":

https://github.com/apache/parquet-format/blob/74001e41f5c5a1856b29be115f9c992cab16a4bf/src/main/thrift/parquet.thrift#L1365-L1374

    ...
    struct FileMetaData {
      /** Version of this file
       *
       * As of December 2025, there is no agreed upon consensus of what constitutes
       * version 2 of the file. For maximum compatibility with readers, writers should
       * always populate "1" for version. For maximum compatibility with writers,
       * readers should accept "1" and "2" interchangeably. All other versions are
       * reserved for potential future use-cases.
       */
      1: required i32 version
      ...

Or when keeping the text here, the note in the thrift file should be updated to match better?

I think you are right -- I will back this description off to match what is in the thrift

In 8915933
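For reference, the guidance in the thrift comment quoted above could be sketched roughly as follows. The names `WRITER_VERSION` and `reader_accepts` are hypothetical and not taken from any implementation.

```python
# Per the thrift comment: writers should always populate 1.
WRITER_VERSION = 1

# Per the thrift comment: readers should accept 1 and 2 interchangeably;
# all other values are reserved for potential future use.
ACCEPTED_VERSIONS = (1, 2)


def reader_accepts(version: int) -> bool:
    """Return True when a reader following the guidance should accept the file."""
    return version in ACCEPTED_VERSIONS


for v in (1, 2, 3):
    print(v, reader_accepts(v))
```

Because 1 and 2 are treated the same, the field does not tell a reader which features a file uses.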

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

alamb · 2026-06-10T19:53:29Z

Thank you @etseidl. I also removed the V1/V2 columns in 1126b53 per @pitrou's suggestion on the mailing list and also regenerated the preview and screenshots.

emkornfield

Looking forward to capturing this in a structured way but changes look reasonable.

alamb · 2026-06-16T21:41:03Z

I will make a PR right after I merge this one to convert to using a structured format

Co-authored-by: emkornfield <emkornfield@gmail.com>

pitrou

This is neat, thanks a lot @alamb

emkornfield · 2026-06-17T20:49:21Z

@alamb anything we are waiting for on merging?

alamb · 2026-06-17T23:17:47Z

Not that I know of -- I will plan to merge tomorrow or Friday

wgtmac · 2026-06-18T14:12:10Z

+> encryption to read the file; [plaintext footer] files use `PAR1` so legacy
+> readers can still read their unencrypted columns.
+
+## Forward compatible additions


Do we want to include the new IEEE754TotalOrder column order just released in 2.13.0? I think it is a compatible change if readers just regard the new column order as an unknown order and ignore its min/max stats.
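A rough sketch of the reader behaviour described here, where an unrecognized column order means the min/max statistics are ignored rather than the file being rejected. The function `usable_min_max` and the string labels are hypothetical stand-ins chosen for illustration, not the actual thrift representation.

```python
# Column orders this hypothetical reader understands. The labels are
# illustrative strings, not real thrift values.
KNOWN_COLUMN_ORDERS = {"TYPE_ORDER"}


def usable_min_max(column_order: str, stats: dict):
    """Return (min, max) only when the column order is understood, else None."""
    if column_order not in KNOWN_COLUMN_ORDERS:
        # Unknown order: the data is still readable, but the statistics
        # cannot be trusted for pruning, so they are ignored.
        return None
    return stats["min"], stats["max"]


stats = {"min": -1.5, "max": 2.0}
print(usable_min_max("TYPE_ORDER", stats))            # (-1.5, 2.0)
print(usable_min_max("IEEE_754_TOTAL_ORDER", stats))  # None
```

An older reader loses only the ability to prune on those statistics, which is why this can be considered a compatible change.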

alamb force-pushed the alamb/update_v2 branch from 1bf1353 to 3b7c137 Compare June 5, 2026 11:16

alamb closed this Jun 5, 2026

alamb deleted the alamb/update_v2 branch June 5, 2026 11:44

alamb reopened this Jun 5, 2026

alamb force-pushed the alamb/update_v2 branch from 92f16ef to a3a8523 Compare June 5, 2026 13:08

alamb mentioned this pull request Jun 5, 2026

Example: Communicating Parquet V3 alamb/parquet-site#1

Draft

alamb changed the title ~~Document Features by Version~~ Document Parquet Versions Jun 5, 2026

Document Parquet Versions

21fb211

alamb force-pushed the alamb/update_v2 branch from 522ab74 to 21fb211 Compare June 5, 2026 14:00

alamb changed the title ~~Document Parquet Versions~~ Document Parquet Features by Version Jun 5, 2026

alamb marked this pull request as ready for review June 5, 2026 14:13

alamb mentioned this pull request Jun 5, 2026

Make path_in_schema optional apache/parquet-format#563

Open

etseidl reviewed Jun 5, 2026

emkornfield reviewed Jun 5, 2026

Fokko approved these changes Jun 5, 2026

alamb and others added 2 commits June 5, 2026 13:02

Update content/en/docs/File Format/versions.md

430ffce

Co-authored-by: Fokko Driesprong <fokko@apache.org>

Use forward compatibile terminology

7f37c8e

etseidl reviewed Jun 5, 2026

alamb and others added 4 commits June 5, 2026 14:06

document LogicalType union, and Nano timestamp

3af9dfb

Apply suggestion from @etseidl

390bad3

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

Note PARE footer

c438950

Merge branch 'alamb/update_v2' of github.com:alamb/parquet-site into …

9c73309

…alamb/update_v2

Consistently list physical types with logical types

45b2b94

etseidl approved these changes Jun 5, 2026

emkornfield reviewed Jun 6, 2026

alamb added 3 commits June 9, 2026 06:52

Merge remote-tracking branch 'origin/production' into alamb/update_v2

9e31493

Clarfiy Semantic Versioning

d8c63bd

tweak

df8c1ca

jorisvandenbossche reviewed Jun 9, 2026

This was referenced Jun 9, 2026

RFC: Encode parquet-format minor_version in thrift metadata apache/parquet-format#581

Draft

RFC: Add format_major_version and format_minor_version to thrift metadata apache/parquet-format#582

Draft

alamb added 2 commits June 9, 2026 09:30

Avoid ascribing semantic meaning to the version field

8915933

Remove other mention of V1 and V2 to reduce confusion

0b3a17f

etseidl reviewed Jun 10, 2026

Comment thread content/en/docs/File Format/versions.md Outdated

alamb and others added 2 commits June 10, 2026 15:44

Follow wording from @stseidl to avoid implying semantic meaning to v1/v2

8cb0942

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

remove V1/V2 columns to avoid confusion

1126b53

etseidl mentioned this pull request Jun 11, 2026

Add more granular Geometry / Geography support to the implementation status page #158

Open

emkornfield reviewed Jun 16, 2026

Comment thread content/en/docs/File Format/versions.md Outdated

emkornfield approved these changes Jun 16, 2026

emkornfield reviewed Jun 16, 2026

Comment thread package.json Outdated

alamb and others added 4 commits June 16, 2026 17:42

Merge remote-tracking branch 'origin/production' into alamb/update_v2

86e6937

revert package json

829d6cf

Update content/en/docs/File Format/versions.md

ae621fe

Co-authored-by: emkornfield <emkornfield@gmail.com>

fix link

9faf941

pitrou approved these changes Jun 17, 2026

wgtmac approved these changes Jun 18, 2026
