Skip to content

Core, API, Spec: Extend partition statistics support to unpartitioned tables#15896

Open
ebyhr wants to merge 2 commits intoapache:mainfrom
ebyhr:ebi/unpartition-stats
Open

Core, API, Spec: Extend partition statistics support to unpartitioned tables#15896
ebyhr wants to merge 2 commits intoapache:mainfrom
ebyhr:ebi/unpartition-stats

Conversation

@ebyhr
Copy link
Copy Markdown
Member

@ebyhr ebyhr commented Apr 6, 2026

Proposed Change

Currently, the partition statistics file is only supported for partitioned tables.

Preconditions.checkArgument(Partitioning.isPartitioned(table), "Table must be partitioned");

While this design makes sense since it’s conceptually a “partition” statistics file, the information it contains is also valuable for unpartitioned tables.

For example, the Trino Iceberg connector currently needs to read manifest files (TableStatisticsReader) to build internal table statistics. If Iceberg provided a statistics file for unpartitioned tables as well, Trino (and other query engines) could leverage it directly, improving planning performance by avoiding expensive manifest reads.

@ebyhr ebyhr force-pushed the ebi/unpartition-stats branch 2 times, most recently from 9bca81e to 0dd88d5 Compare April 6, 2026 07:06
@ebyhr ebyhr force-pushed the ebi/unpartition-stats branch from 0dd88d5 to da57f42 Compare April 6, 2026 07:10
@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Apr 6, 2026
@ebyhr ebyhr force-pushed the ebi/unpartition-stats branch from da57f42 to 1cda0e7 Compare April 6, 2026 07:12
@ebyhr ebyhr changed the title Core: Extend partition statistics support to unpartitioned tables Core, API: Extend partition statistics support to unpartitioned tables Apr 6, 2026
format/spec.md Outdated
| v1 | v2 | v3 | Field id, name | Type | Description |
|----|----|----|----------------|------|-------------|
| _required_ | _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
| _required_ | _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table, empty for unpartitioned tables |
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to post this change to the mailing list if needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. any spec change has to go for vote.

And whats the use case? You want to know data files, delete files, file size for unpartitioned table? We can use metadata tables for that Or maybe the new v4 manifests can solve it (with less I/O)?

Also, since the name itself is partitions stats. It makes odd to generate the file for unpartitioned table.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajantha-bhat Thanks, I'll post after exploring another approach, since Parquet writer can't write an empty struct.

We want to obtain the record count more quickly and efficiently for CBO. As mentioned in the PR description, Trino currently needs to read manifest files to estimate table size.

Even if v4 manifests address this, migration would take a long time. We need a solution that works for tables with format versions < v4 as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still a behavior change for existing users.
Partition stats was computed only for partition tables. Now you are enabling for non-partition tables with NULL partition. Plus like I mentioned, it is still odd for non partition table to write this file as it is via compute_partition_stats method. Why does user has to call compute_partition_stats on non-partition table?

I can understand that you need a quick table level stats for CBO in Trino without doing I/O of all the manifests.

a) Can we introduce a new table level stats (along with NDV puffin file) in compute_table_stats? People can still refer it for both partition table and non-partition table if they need whole table level info?
b) Or Can we check if snapshot summary already has this table level stats you are looking for? (we don't have to do multiple IO of the files in that case). If not, can we enhance snapshot summary to include it?

@ebyhr ebyhr force-pushed the ebi/unpartition-stats branch from 1cda0e7 to 1a0c198 Compare April 6, 2026 07:35
@github-actions github-actions bot added the ORC label Apr 6, 2026
@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Apr 6, 2026

This PR ran into a known issue in the Parquet writer, which cannot handle writing an empty struct. https://issues.apache.org/jira/browse/SPARK-20593

I'm going to change the value to NULL.

@ebyhr ebyhr force-pushed the ebi/unpartition-stats branch from e6578a4 to 638ab9f Compare April 6, 2026 10:09
@ebyhr ebyhr changed the title Core, API: Extend partition statistics support to unpartitioned tables Core, API, Spec: Extend partition statistics support to unpartitioned tables Apr 6, 2026
Copy link
Copy Markdown
Member

@ajantha-bhat ajantha-bhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should explore other options as It is still a behavior change for existing users.

Partition stats was computed only for partition tables. Now you are enabling for non-partition tables with NULL partition. Plus like I mentioned, it is still odd for non partition table to write this file as it is via 'compute_partition_stats' method. Why does user has to call compute_partition_stats on non-partition table?

I can understand that you need a quick table level stats for CBO in Trino without doing I/O of all the manifests.

a) Can we introduce a new table level stats (along with NDV puffin file) in compute_table_stats? People can still refer it for both partition table and non-partition table if they need whole table level info?
b) Or Can we check if snapshot summary already has this table level stats you are looking for? (we don't have to do multiple IO of the files in that case). If not, can we enhance snapshot summary to include it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API core ORC Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extend partition statistics support to unpartitioned tables

3 participants