[Parquet] Allow setting page size per column#9353
[Parquet] Allow setting page size per column#9353XiangpengHao wants to merge 2 commits intoapache:mainfrom
Conversation
| self.page_metrics.num_buffered_rows as usize >= self.props.data_page_row_count_limit() | ||
| || self.encoder.estimated_data_page_size() >= self.props.data_page_size_limit() | ||
| || self.encoder.estimated_data_page_size() | ||
| >= self.props.column_data_page_size_limit(self.descr.path()) |
There was a problem hiding this comment.
This line is the actual change
| let other_values = write_and_collect_page_values(ColumnPath::from("other"), props, data); | ||
|
|
||
| assert_eq!(col_values, vec![3, 3, 3, 1]); | ||
| assert_eq!(other_values, vec![10]); |
There was a problem hiding this comment.
This is the observed end-to-end behavior, only the specified column changed, other columns unaffected.
There was a problem hiding this comment.
I tested without this code change and the tests fails with
---- column::writer::tests::test_column_writer_column_data_page_size_limit stdout ----
thread 'column::writer::tests::test_column_writer_column_data_page_size_limit' (23590832) panicked at parquet/src/column/writer/
mod.rs:2522:9:
assertion `left == right` failed
left: [10]
right: [3, 3, 3, 1]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
alamb
left a comment
There was a problem hiding this comment.
Thank you @XiangpengHao
I have one suggestion about the testing, but otherwise this looks great to me
cc @etseidl @sdf-jkl and @jhorstmann as you are potentially interested in this as well
| /// | ||
| /// Note: this is a best effort limit based on value of | ||
| /// [`set_write_batch_size`](Self::set_write_batch_size). | ||
| /// |
There was a problem hiding this comment.
thank you for the nice comments
| let other_values = write_and_collect_page_values(ColumnPath::from("other"), props, data); | ||
|
|
||
| assert_eq!(col_values, vec![3, 3, 3, 1]); | ||
| assert_eq!(other_values, vec![10]); |
There was a problem hiding this comment.
I tested without this code change and the tests fails with
---- column::writer::tests::test_column_writer_column_data_page_size_limit stdout ----
thread 'column::writer::tests::test_column_writer_column_data_page_size_limit' (23590832) panicked at parquet/src/column/writer/
mod.rs:2522:9:
assertion `left == right` failed
left: [10]
right: [3, 3, 3, 1]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
etseidl
left a comment
There was a problem hiding this comment.
Thanks @XiangpengHao. I'm in favor of more per-column settings, but I think we should keep a consistent model for the settings.
| /// Note: this is a best effort limit based on the write batch size | ||
| /// | ||
| /// For more details see [`WriterPropertiesBuilder::set_data_page_size_limit`] | ||
| /// and [`WriterPropertiesBuilder::set_column_data_page_size_limit`]. |
There was a problem hiding this comment.
I find it confusing that this setting will break from all other per-column settings in that it exists on both the WriterProperties and ColumnProperties. I think it would be better to remove data_page_size_limit from the WriterProperties and rely on the value set on the ColumnProperties (either the specific column or the default).
| /// [`set_write_batch_size`](Self::set_write_batch_size). | ||
| /// | ||
| /// This value acts as the default for all columns and can be overridden with | ||
| /// [`Self::set_column_data_page_size_limit`]. |
There was a problem hiding this comment.
Similarly, I think this should move down to the "Setters for any column (global)" section below (line 735 or so) and use the same pattern as other per-column settings.
Page sizes were per-file configuration, this PR allows each column to have their own page size settings.
This enables a few optimizations, for example, configure smaller pages size for columns that need better random access.
Related t0
WriterProperties#9242cc @alamb @coracuity @JigaoLuo