Document Parquet Features by Version #186

Open
alamb wants to merge 19 commits into apache:production from alamb:alamb/update_v2

@alamb alamb commented Jun 4, 2026

Collaborator

Preview

Rendered preview: https://alamb.github.io/parquet-site/docs/file-format/versions/

Rationale

Adding new backwards-incompatible changes to the Parquet specification requires a way to communicate which systems support which features. Parquet already has a v1/v2 versioning scheme, but it is poorly documented and has caused significant confusion across the ecosystem. Clearly explaining the current (imperfect) scheme is valuable on its own and might also help a potential future V3 rollout.

Changes

Add a versions page to parquet.apache.org explaining:

  1. The current versioning scheme and how Parquet versions relate to each other
  2. The version/date each feature was added to the spec
  3. Links to relevant releases and mailing list discussions

Screenshots

[Three screenshots of the rendered versions page, taken 2026-06-10]

@alamb alamb force-pushed the alamb/update_v2 branch from 1bf1353 to 3b7c137 Compare June 5, 2026 11:16
@alamb alamb closed this Jun 5, 2026
@alamb alamb deleted the alamb/update_v2 branch June 5, 2026 11:44
@alamb alamb reopened this Jun 5, 2026
@alamb alamb force-pushed the alamb/update_v2 branch from 92f16ef to a3a8523 Compare June 5, 2026 13:08
@alamb alamb changed the title Document Features by Version Document Parquet Versions Jun 5, 2026
@alamb alamb force-pushed the alamb/update_v2 branch from 522ab74 to 21fb211 Compare June 5, 2026 14:00
@alamb alamb changed the title Document Parquet Versions Document Parquet Features by Version Jun 5, 2026
@alamb alamb marked this pull request as ready for review June 5, 2026 14:13

@etseidl etseidl left a comment

Contributor

Flushing some comments. Thanks for taking this on @alamb, I think this is an important addition.

Comment thread content/en/docs/File Format/versions.md Outdated
Comment thread content/en/docs/File Format/versions.md Outdated
Comment thread content/en/docs/File Format/versions.md Outdated
| [Variant shredding] | [2.12.0] | [2.11.0..2.12.0] | [Approved 2025-08-24] |
| [GEOMETRY] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] |
| [GEOGRAPHY] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] |
| [LIST] | [1.0.0] | [1.0.0][tree-1.0.0] | |

Contributor

My notes say LIST and MAP were not formally codified with the three-level structure until 2.3.1 (apache/parquet-format@0e2e0a4)

@alamb alamb Jun 5, 2026

Collaborator Author

From what I can tell, LIST and MAP are actually defined in the 1.0.0 spec (using the old, deprecated "ConvertedType" annotations)

It seems like apache/parquet-format@0e2e0a4 added the types to the LogicalType.md type structure

🤔

Contributor

Yes, the original LIST was underspecified, and implementers diverged in how lists were represented in the schema. To enable nullable lists with nullable elements, the current 3-level structure in the schema was introduced. We still suffer from having to parse older, now non-compliant lists (see for example apache/arrow-rs#8496).

I'm just pointing this out for historical purposes 🤓
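
To make the point about nullable lists with nullable elements concrete, here is a minimal sketch of how the three-level structure separates a null list, an empty list, and a null element by definition level. The function name and the column name `my_list` are hypothetical, and the levels assume the schema quoted in the docstring.

```python
from typing import List, Optional, Tuple

Row = Optional[List[Optional[int]]]
# (repetition level, definition level, value)
Triple = Tuple[int, int, Optional[int]]


def shred_list_column(rows: List[Row]) -> List[Triple]:
    """Compute levels for a nullable list of nullable ints.

    Assumed schema (three-level LIST structure):
        optional group my_list (LIST) {
          repeated group list {
            optional int32 element;
          }
        }
    """
    triples: List[Triple] = []
    for row in rows:
        if row is None:
            triples.append((0, 0, None))      # the list itself is null
        elif not row:
            triples.append((0, 1, None))      # list present but empty
        else:
            for i, element in enumerate(row):
                rep = 0 if i == 0 else 1      # 0 starts a new row, 1 continues it
                if element is None:
                    triples.append((rep, 2, None))     # slot present, element null
                else:
                    triples.append((rep, 3, element))  # element present
    return triples
```

With only two levels there is no definition level left to tell some of these cases apart, which is one way to see why older writers diverged.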

Comment thread content/en/docs/File Format/versions.md Outdated
readers may not understand the new information.

| Feature | Released in | Source | Notes |
| ------------------------------------------- | ----------------------------- | --- | ------------------------- |

Contributor

Also from my own notes:

  • Addition of the LogicalType union in favor of ConvertedType enum was in 2.4.0.
  • Formal deprecation of ConvertedType in 2.9.0
  • Addition of NANOS to TimeUnit 2.6.0

Collaborator Author

> Also from my own notes:
>
>   • Addition of the LogicalType union in favor of ConvertedType enum was in 2.4.0.
>   • Formal deprecation of ConvertedType in 2.9.0
>   • Addition of NANOS to TimeUnit 2.6.0

Good call -- added rows for these two:

>   • Addition of the LogicalType union in favor of ConvertedType enum was in 2.4.0.
>   • Addition of NANOS to TimeUnit 2.6.0

I added this one as a note instead of a new row, as it doesn't change what is actually written into files; instead, I think it signals a future incompatible change (no longer writing ConvertedType):

>   • Formal deprecation of ConvertedType in 2.9.0

Comment thread content/en/docs/File Format/versions.md Outdated
* **V1**: the original Parquet format (1.0).
* **V2**: format version 2.0.

| Feature | V1 | V2 | Released in | Source | Notes |

Contributor

could we somehow integrate this as YAML data + rendering like we do for feature compatibility?

Collaborator Author

I am happy to convert this to be data driven rendering rather than an explicit table (rather I would get Claude to do it).

But here it may make less sense as I expect this page to be the only consumer/producer of this data, and the features don't seem to need the same type of cross referencing / forced deduplication of the status page

Contributor

I think it allows us to have richer hyperlinks on the other page. For the compatibility chart, we already had to populate some of this data:

https://github.com/apache/parquet-site/blob/production/data/implementations/features/encodings.yaml#L33

Contributor

This also means one more page necessary to update when we add something new?

Collaborator Author

Ok, I will update to be data driven

Contributor

@alamb I'm fine with this as a follow-up or we can always reconcile separately.

Collaborator Author

Thanks -- I did some research, and I think it would be relatively straightforward to update this page to be data driven.

Upon reflection I would like to do it as a follow on PR to minimize the diff in this PR (I think unifying the data would result in changes to more files and would be harder to track)
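
As a rough illustration of what "data driven" could mean for this table, the sketch below renders the same Markdown rows from structured records. It is not the site's actual rendering mechanism: the record layout and the function name are assumptions, and the two sample rows are copied from the table excerpt quoted earlier in this review.

```python
from typing import Dict, List

COLUMNS = ["Feature", "Released in", "Source", "Notes"]

# Sample records; on the site these would presumably live in a YAML data file.
FEATURES: List[Dict[str, str]] = [
    {
        "Feature": "[Variant shredding]",
        "Released in": "[2.12.0]",
        "Source": "[2.11.0..2.12.0]",
        "Notes": "[Approved 2025-08-24]",
    },
    {
        "Feature": "[GEOMETRY]",
        "Released in": "[2.11.0]",
        "Source": "[2.10.0..2.11.0]",
        "Notes": "[Approved 2025-02-09]",
    },
]


def render_markdown_table(rows: List[Dict[str, str]]) -> str:
    """Render records as a Markdown table, one line per record."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "| " + " | ".join("---" for _ in COLUMNS) + " |"
    body = [
        "| " + " | ".join(row.get(col, "") for col in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


print(render_markdown_table(FEATURES))
```

Keeping the records in one place is what would let both this page and the compatibility chart link to the same release data.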

@Fokko Fokko left a comment

Contributor

Comments are mostly bikeshedding, but I think this would be something great to add. Thanks for taking the time to put this together 🙌

Comment thread content/en/docs/File Format/versions.md
Comment thread content/en/docs/File Format/versions.md Outdated
Comment thread content/en/docs/File Format/versions.md Outdated
Comment thread content/en/docs/File Format/versions.md Outdated
Comment thread content/en/docs/File Format/versions.md Outdated
alamb and others added 2 commits June 5, 2026 13:02

@etseidl etseidl left a comment

Contributor

A few more nits

Comment thread content/en/docs/File Format/versions.md Outdated
Comment thread content/en/docs/File Format/versions.md Outdated

## `FileMetadata` version field

Each Parquet file has a `version` field in the [`thrift FileMetadata`] that

Contributor

I've said this elsewhere, but I don't think we can rely on a version in the metadata, as it cannot convey changes to the metadata itself.

I think we should just say this field conveys no meaningful information any longer.

Collaborator Author

If we are going to make changes to the metadata, I think new magic bytes in the footer would be an alternative.

Contributor

That's one option, but I think we'll always want 'PAR', so that limits how many changes we can make with the remaining byte ('1' and 'E' are taken, so I guess that leaves 254 more, fewer if we restrict ourselves to ASCII 🤣). I proposed in the M/L augmenting the header, which would be a forwards compatible change that leaves us leeway to use anything we want to convey versioning info.
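
For readers following the magic-byte arithmetic, here is a small sketch of classifying a file by its trailing four bytes. It assumes the usual layout in which the magic is the last four bytes of the file, with `PAR1` for plaintext footers and `PARE` for encrypted footers; the function names and return strings are hypothetical.

```python
KNOWN_MAGIC = {
    b"PAR1": "plaintext footer",
    b"PARE": "encrypted footer",
}


def classify_footer_magic(file_bytes: bytes) -> str:
    """Classify a file by the four magic bytes at its end."""
    if len(file_bytes) < 4:
        raise ValueError("too short to contain a 4-byte magic")
    magic = file_bytes[-4:]
    if magic in KNOWN_MAGIC:
        return KNOWN_MAGIC[magic]
    if magic[:3] == b"PAR":
        return "unknown PAR variant"
    return "not a Parquet file"


def remaining_magic_values() -> int:
    """Count fourth-byte values still free if the 'PAR' prefix is kept."""
    taken = {magic[3] for magic in KNOWN_MAGIC}
    return 256 - len(taken)
```

`remaining_magic_values()` reproduces the count of 254 mentioned above: one byte has 256 possible values and two of them are taken.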

Collaborator Author

PAR2 for maximum confusion!

Contributor

You owe me a new keyboard! I just sprayed coffee all over it 🤣

Comment thread content/en/docs/File Format/versions.md
@alamb

alamb commented Jun 5, 2026

Collaborator Author

> A few more nits

Well, the point about releases not actually following semantic versioning is probably more than a nit 😆 I am not sure what to do about this though

@etseidl

etseidl commented Jun 5, 2026

Contributor

> > A few more nits
>
> Well, the point about releases not actually following semantic versioning is probably more than a nit 😆 I am not sure what to do about this though

Add an asterisk...

\* Aspirational. Real-world versioning may differ.

@alamb

alamb commented Jun 5, 2026

Collaborator Author

> > A few more nits
>
> Well, the point about releases not actually following semantic versioning is probably more than a nit 😆 I am not sure what to do about this though

🤔 Maybe we could roll it into a real "v3" 🤔

Will wait a while for more comments on this PR before I open that can of worms on the mailing list

@alamb

alamb commented Jun 5, 2026

Collaborator Author

I think I have addressed most feedback and regenerated the preview: https://alamb.github.io/parquet-site/docs/file-format/versions/

@etseidl etseidl left a comment

Contributor

Looks good (mod SemVer). Thanks for doing this!

Comment thread content/en/docs/File Format/versions.md Outdated
## `FileMetadata` version field

Each Parquet file has a `version` field in the [`thrift FileMetadata`] that
declares which features the file may use, and thus what a reader **must** support

Contributor

I think my issue is the linear nature of versioning that this imposes on readers. In a perfect world every reader would implement every feature as it is released. But this means a reader needs to move in lock-step with the major version of the header. For example, let's say we release the following features:

  1. Backward incompatible feature that isn't strictly better for everyone (V3)
  2. Awesome new encoding that a lot more people care about (V4)

By this spec, any writer using the new encoding would need to write V4. This gives readers two choices:

  1. Cheat and try to read the data anyway. This makes the version less useful in general, and I think it is one of the reasons some writers always wrote "1": readers were capable of reading new encodings (and pretty cleanly detecting when they couldn't), so people ignored the guidance.
  2. Implement both V3 and V4 before they get the benefit of V4 (this might have much longer delays given a lot of parquet implementations are volunteer driven).

Is there a way to reconcile this?
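
The tradeoff described above can be sketched in a few lines. Everything here is hypothetical: the feature names, the version-to-feature mapping, and both functions exist only to model the V3/V4 example, not any agreed-upon scheme.

```python
from typing import Dict, Set

# Hypothetical mapping for the example above: V3 adds an incompatible
# feature, V4 adds a popular new encoding.
FEATURES_BY_VERSION: Dict[int, Set[str]] = {
    3: {"incompatible_feature"},
    4: {"new_encoding"},
}


def features_allowed_up_to(version: int) -> Set[str]:
    """All features a file declaring `version` is allowed to use."""
    allowed: Set[str] = set()
    for introduced_in, features in FEATURES_BY_VERSION.items():
        if introduced_in <= version:
            allowed |= features
    return allowed


def can_read_strict(declared_version: int, reader_features: Set[str]) -> bool:
    """Linear model: the reader must support everything the version permits."""
    return features_allowed_up_to(declared_version) <= reader_features


def can_read_by_inspection(features_used: Set[str], reader_features: Set[str]) -> bool:
    """'Cheat' model: ignore the version, check only what the file uses."""
    return features_used <= reader_features


# A reader that implemented only the popular encoding.
reader = {"new_encoding"}
print(can_read_strict(4, reader))                         # False
print(can_read_by_inspection({"new_encoding"}, reader))   # True
```

The two results differ for the same file, which is exactly the pressure that pushes readers toward ignoring the declared version.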

Collaborator Author

This is a really nice description of the core tradeoffs with using Versions (and I think one of the major points we need to resolve to arrive at agreement about how to move versioning forward)

I am not sure there are simple ways to resolve this tradeoff and I believe there are a variety of different opinions

Thus, what I plan to do is update this PR to REMOVE the statement on semantic versioning, and keep it focused on what the current state of the spec / features is. We can then continue to have the fun (FUN!) discussion about how to version / signal new features in the corresponding mailing list thread.

Collaborator Author

Clarified in d8c63bd that Parquet does NOT follow semantic versioning

Comment thread content/en/docs/File Format/versions.md Outdated
[semantic versioning]:

1. The major version corresponds to the [`thrift FileMetadata`] `version` field.
2. Minor releases (e.g. `2.10.0` to `2.11.0`) may add compatible

Contributor

Again, I'm not sure SemVer is really the right model here. Partially for reasons as outlined in the comment above.

But also, because its philosophy is that major version bumps should be rare. If we really do add a handful of new encodings and a new footer, it would get bumped pretty quickly. This might naturally lead to delays similar to what we saw with v2. From my perspective, as long as we agree we are OK bumping for every new feature, I'm generally OK with this. A second alternative is to specify a very short maximum cadence that features will collect for (e.g. 1 month).

Collaborator Author

Clarified in d8c63bd that Parquet does NOT follow semantic versioning

| [Geospatial statistics] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2025-02-09] |
| [Binary protocol extensions] | [2.11.0] | [2.10.0..2.11.0] | [Approved 2024-09-06] |
| [IEEE 754 total order and NaN counts] | not yet released | [#514] | [Approved 2026-05-26] |
| [LogicalType union] | [2.4.0] | [2.3.1..2.4.0] | Supersedes `ConvertedType` enum<br/>deprecated in [2.9.0] |

Contributor

nit: I think the deprecation of converted type is not forward-compatible: you get different values if you don't understand unsigned vs. signed integer types. This could actually lead to incorrect system results.
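
A quick sketch of the failure mode mentioned here: the same four bytes in an INT32 column mean different values depending on whether the reader honors an unsigned annotation. The function name is hypothetical; the byte reinterpretation itself is standard little-endian two's complement.

```python
import struct


def read_uint32_as_int32(value: int) -> int:
    """Return what a reader sees if it ignores the unsigned annotation.

    The writer stores an unsigned 32-bit value; the reader interprets the
    same little-endian bytes as a signed 32-bit integer.
    """
    raw = struct.pack("<I", value)        # bytes as written for an unsigned value
    return struct.unpack("<i", raw)[0]    # same bytes read as signed


print(read_uint32_as_int32(7))            # small values are unaffected
print(read_uint32_as_int32(4294967295))   # large values silently change sign
```

Nothing fails at read time in this scenario, so the wrong value can propagate into query results unnoticed.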

Contributor

The deprecation just means no new values are added to the ConvertedType enum. Writers are still required to populate both ConvertedType and LogicalType (although the wording changes depending on which paragraph you read...one place says "should", two paragraphs later it says "must").

Comment thread content/en/docs/File Format/versions.md Outdated
Comment on lines +44 to +50
Each Parquet file has a `version` field in the [`thrift FileMetadata`] that
declares which features the file may use, and thus what a reader **must** support
to read it.

**Note**: Many writers set the version field to `1` even for files that use
format 2.0 features, which has caused [confusion and interoperability
issues][closing-out-2.0].

Member

The first paragraph giving meaning to the "version" metadata field seems a bit confusing/misleading, together with the note, and moreover with the fact that the thrift itself specifically says this should be hardcoded to "1":

https://github.com/apache/parquet-format/blob/74001e41f5c5a1856b29be115f9c992cab16a4bf/src/main/thrift/parquet.thrift#L1365-L1374

...
struct FileMetaData {
  /** Version of this file
    *
    * As of December 2025, there is no agreed upon consensus of what constitutes
    * version 2 of the file. For maximum compatibility with readers, writers should
    * always populate "1" for version. For maximum compatibility with writers,
    * readers should accept "1" and "2" interchangeably.  All other versions are
    * reserved for potential future use-cases.
    */
  1: required i32 version
...

Or when keeping the text here, the note in the thrift file should be updated to match better?
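
The guidance in the thrift comment quoted above boils down to a very small policy, sketched here. The constant and function names are hypothetical; only the values 1 and 2 come from the comment.

```python
# Writers: always populate 1, for maximum compatibility with readers.
WRITER_VERSION = 1

# Readers: accept 1 and 2 interchangeably; everything else is reserved.
ACCEPTED_READER_VERSIONS = frozenset({1, 2})


def reader_accepts_version(version: int) -> bool:
    """Return True if a reader following the thrift comment accepts `version`."""
    return version in ACCEPTED_READER_VERSIONS
```

Under this policy the field does not tell a reader which features a file uses, which is the mismatch with the page text being pointed out.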

Collaborator Author

I think you are right -- I will back this description off to match what is in the thrift

Collaborator Author

In 8915933

Comment thread content/en/docs/File Format/versions.md Outdated
@alamb

alamb commented Jun 10, 2026

Collaborator Author

Thank you @etseidl. I also removed the V1/V2 columns in 1126b53 per @pitrou's suggestion on the mailing list, and also regenerated the preview and screenshots.

Comment thread content/en/docs/File Format/versions.md Outdated

@emkornfield emkornfield left a comment

Contributor

Looking forward to capturing this in a structured way but changes look reasonable.

Comment thread package.json Outdated
@alamb

alamb commented Jun 16, 2026

Collaborator Author

> Looking forward to capturing this in a structured way but changes look reasonable.

I will make a PR right after I merge this one to convert to using a structured format

@pitrou pitrou left a comment

Member

This is neat, thanks a lot @alamb

@emkornfield

Contributor

@alamb anything we are waiting for on merging?

@alamb

alamb commented Jun 17, 2026

Collaborator Author

> @alamb anything we are waiting for on merging?

Not that I know of -- I will plan to merge tomorrow or Friday

> encryption to read the file; [plaintext footer] files use `PAR1` so legacy
> readers can still read their unencrypted columns.

## Forward compatible additions

Member

Do we want to include the new IEEE754TotalOrder column order just released in 2.13.0? I think it is a compatible change if readers just regard the new column order as an unknown order and ignore its min/max stats.
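
The compatibility argument here can be sketched as follows: a reader that does not recognize a column order simply declines to use min/max statistics. The string labels and the function name are assumptions made for illustration; `TYPE_ORDER` stands for the existing type-defined order and `IEEE754TotalOrder` is the name used in the comment above.

```python
from typing import Optional, Tuple

# Column orders this hypothetical (older) reader understands.
KNOWN_COLUMN_ORDERS = frozenset({"TYPE_ORDER"})


def usable_min_max(
    column_order: Optional[str],
    min_max: Tuple[float, float],
) -> Optional[Tuple[float, float]]:
    """Return min/max stats only if the reader knows how they were ordered."""
    if column_order not in KNOWN_COLUMN_ORDERS:
        return None  # unknown or missing order: skip stats, still read data
    return min_max
```

Ignoring the statistics costs pruning opportunities but not correctness, which is why the change reads as forward compatible.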

7 participants