Skip to content

[Parquet] Allow setting page size per column#9353

Open
XiangpengHao wants to merge 2 commits intoapache:mainfrom
XiangpengHao:parquet-per-column-page-size
Open

[Parquet] Allow setting page size per column#9353
XiangpengHao wants to merge 2 commits intoapache:mainfrom
XiangpengHao:parquet-per-column-page-size

Conversation

@XiangpengHao
Copy link
Contributor

@XiangpengHao XiangpengHao commented Feb 4, 2026

Page sizes were per-file configuration, this PR allows each column to have their own page size settings.

This enables a few optimizations, for example, configure smaller pages size for columns that need better random access.

Related t0

cc @alamb @coracuity @JigaoLuo

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 4, 2026
self.page_metrics.num_buffered_rows as usize >= self.props.data_page_row_count_limit()
|| self.encoder.estimated_data_page_size() >= self.props.data_page_size_limit()
|| self.encoder.estimated_data_page_size()
>= self.props.column_data_page_size_limit(self.descr.path())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is the actual change

let other_values = write_and_collect_page_values(ColumnPath::from("other"), props, data);

assert_eq!(col_values, vec![3, 3, 3, 1]);
assert_eq!(other_values, vec![10]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the observed end-to-end behavior, only the specified column changed, other columns unaffected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested without this code change and the tests fails with

  ---- column::writer::tests::test_column_writer_column_data_page_size_limit stdout ----

  thread 'column::writer::tests::test_column_writer_column_data_page_size_limit' (23590832) panicked at parquet/src/column/writer/
  mod.rs:2522:9:
  assertion `left == right` failed
    left: [10]
   right: [3, 3, 3, 1]
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XiangpengHao

I have one suggestion about the testing, but otherwise this looks great to me

cc @etseidl @sdf-jkl and @jhorstmann as you are potentially interested in this as well

///
/// Note: this is a best effort limit based on value of
/// [`set_write_batch_size`](Self::set_write_batch_size).
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the nice comments

let other_values = write_and_collect_page_values(ColumnPath::from("other"), props, data);

assert_eq!(col_values, vec![3, 3, 3, 1]);
assert_eq!(other_values, vec![10]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested without this code change and the tests fails with

  ---- column::writer::tests::test_column_writer_column_data_page_size_limit stdout ----

  thread 'column::writer::tests::test_column_writer_column_data_page_size_limit' (23590832) panicked at parquet/src/column/writer/
  mod.rs:2522:9:
  assertion `left == right` failed
    left: [10]
   right: [3, 3, 3, 1]
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @XiangpengHao. I'm in favor of more per-column settings, but I think we should keep a consistent model for the settings.

/// Note: this is a best effort limit based on the write batch size
///
/// For more details see [`WriterPropertiesBuilder::set_data_page_size_limit`]
/// and [`WriterPropertiesBuilder::set_column_data_page_size_limit`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it confusing that this setting will break from all other per-column settings in that it exists on both the WriterProperties and ColumnProperties. I think it would be better to remove data_page_size_limit from the WriterProperties and rely on the value set on the ColumnProperties (either the specific column or the default).

/// [`set_write_batch_size`](Self::set_write_batch_size).
///
/// This value acts as the default for all columns and can be overridden with
/// [`Self::set_column_data_page_size_limit`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I think this should move down to the "Setters for any column (global)" section below (line 735 or so) and use the same pattern as other per-column settings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants