
Optimize data page statistics conversion (up to 4x)#9303

Merged
Dandandan merged 18 commits into apache:main from Dandandan:speedup_statistics
Jan 31, 2026

Conversation

@Dandandan
Contributor

@Dandandan Dandandan commented Jan 29, 2026

Which issue does this PR close?

Rationale for this change

Loading statistics is notably inefficient. This makes the conversion from the structure to Arrow arrays a bit faster by avoiding allocations, until we get a more efficient encoding directly (#9296).

Details
Extract data page statistics for Int64/extract_statistics/Int64
                        time:   [5.2223 µs 5.2589 µs 5.3230 µs]
                        change: [−39.253% −38.205% −37.016%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

Extract data page statistics for UInt64/extract_statistics/UInt64
                        time:   [5.1035 µs 5.2173 µs 5.3576 µs]
                        change: [−32.745% −31.758% −30.535%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  8 (8.00%) high mild
  6 (6.00%) high severe

Extract data page statistics for F64/extract_statistics/F64
                        time:   [6.1922 µs 6.2021 µs 6.2130 µs]
                        change: [−30.749% −29.405% −28.469%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Extract data page statistics for String/extract_statistics/String
                        time:   [10.924 µs 10.965 µs 11.008 µs]
                        change: [−64.471% −64.330% −64.206%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  11 (11.00%) high mild
  3 (3.00%) high severe

Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...
                        time:   [10.885 µs 10.905 µs 10.928 µs]
                        change: [−64.444% −64.362% −64.285%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

Converts the inefficient iterator-based code (which doesn't really fit the iterator pattern well) to just traverse the values and use the builders. (I think it's just converting a bunch of ugly code to another bunch of ugly code).
Additionally tries to preallocate where possible.
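
The change described above can be illustrated with a minimal sketch that uses only the Rust standard library. It is not the code from this PR and does not use the arrow or parquet crates: `PageChunk`, `collect_mins_iter`, and `collect_mins_prealloc` are hypothetical names, and a plain `Vec<Option<i64>>` stands in for an Arrow builder. The idea is to sum the chunk lengths first, allocate once, and then append values or nulls in a plain loop.

```rust
/// Hypothetical stand-in for one column chunk's page index: `len` pages,
/// with per-page minimum values when the index has the expected type.
/// Assumption: when `mins` is `Some`, it holds exactly `len` entries.
pub struct PageChunk {
    pub len: usize,
    pub mins: Option<Vec<Option<i64>>>,
}

/// Iterator-style version: every chunk produces a temporary `Vec`,
/// and `collect` grows the output as it goes.
pub fn collect_mins_iter(chunks: &[PageChunk]) -> Vec<Option<i64>> {
    chunks
        .iter()
        .flat_map(|c| match &c.mins {
            Some(mins) => mins.clone(),  // temporary allocation per chunk
            None => vec![None; c.len],   // temporary allocation per chunk
        })
        .collect()
}

/// Loop-style version: one up-front allocation, then plain appends.
pub fn collect_mins_prealloc(chunks: &[PageChunk]) -> Vec<Option<i64>> {
    // Total number of pages across all chunks, used as the capacity.
    let total: usize = chunks.iter().map(|c| c.len).sum();
    let mut out = Vec::with_capacity(total);
    for c in chunks {
        match &c.mins {
            // Matching index type: copy the values straight into the output.
            Some(mins) => out.extend_from_slice(mins),
            // Mismatched or missing index: append `len` nulls.
            None => out.extend(std::iter::repeat(None).take(c.len)),
        }
    }
    out
}
```

Both functions return the same result; the second one avoids the per-chunk temporaries and the repeated growth of the output buffer.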

Are these changes tested?

Existing tests

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 29, 2026
@alamb
Contributor

alamb commented Jan 29, 2026

run benchmark arrow_statistics

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_statistics (f96e14c) to bd76edd diff
BENCH_NAME=arrow_statistics
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_statistics
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_statistics
Results will be posted here when complete

let mut b = UInt8Builder::with_capacity(capacity);
for (len, index) in chunks {
    match index {
        ColumnIndexMetaData::INT32(index) => {
Contributor

👍 this is great

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                                      main                                   speedup_statistics
-----                                                                                                      ----                                   ------------------
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    3.75     73.6±0.41µs        ? ?/sec    1.00     19.6±0.82µs        ? ?/sec
Extract data page statistics for F64/extract_statistics/F64                                                1.91     15.2±0.65µs        ? ?/sec    1.00      8.0±0.06µs        ? ?/sec
Extract data page statistics for Int64/extract_statistics/Int64                                            2.13     17.3±0.39µs        ? ?/sec    1.00      8.1±0.06µs        ? ?/sec
Extract data page statistics for String/extract_statistics/String                                          3.87     73.0±2.27µs        ? ?/sec    1.00     18.9±0.47µs        ? ?/sec
Extract data page statistics for UInt64/extract_statistics/UInt64                                          1.84     15.6±0.20µs        ? ?/sec    1.00      8.5±0.02µs        ? ?/sec
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    1.06  1190.7±11.52ns        ? ?/sec    1.00  1120.5±11.64ns        ? ?/sec
Extract row group statistics for F64/extract_statistics/F64                                                1.02    987.3±8.67ns        ? ?/sec    1.00    972.2±4.74ns        ? ?/sec
Extract row group statistics for Int64/extract_statistics/Int64                                            1.01   974.5±27.17ns        ? ?/sec    1.00   965.4±39.82ns        ? ?/sec
Extract row group statistics for String/extract_statistics/String                                          1.00  1158.6±20.74ns        ? ?/sec    1.00  1153.2±18.61ns        ? ?/sec
Extract row group statistics for UInt64/extract_statistics/UInt64                                          1.03   983.2±38.64ns        ? ?/sec    1.00   958.3±17.53ns        ? ?/sec

@Dandandan Dandandan changed the title Optimize data page statistics conversion Optimize data page statistics conversion (up to 3x) Jan 29, 2026
@Dandandan
Contributor Author

Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    3.75     73.6±0.41µs        ? ?/sec    1.00     19.6±0.82µs        ? ?/sec
Extract data page statistics for F64/extract_statistics/F64                                                1.91     15.2±0.65µs        ? ?/sec    1.00      8.0±0.06µs        ? ?/sec
Extract data page statistics for Int64/extract_statistics/Int64                                            2.13     17.3±0.39µs        ? ?/sec    1.00      8.1±0.06µs        ? ?/sec
Extract data page statistics for String/extract_statistics/String                                          3.87     73.0±2.27µs        ? ?/sec    1.00     18.9±0.47µs        ? ?/sec
Extract data page statistics for UInt64/extract_statistics/UInt64                                          1.84     15.6±0.20µs        ? ?/sec    1.00      8.5±0.02µs        ? ?/sec

Almost 4x even!
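
For reference, the ratio column in the bot's table is the `main` time divided by the branch time, while Criterion's `change:` line in the PR description is the relative change in time. A small sketch of the arithmetic follows; the helper names `speedup` and `speedup_from_change` are hypothetical, and the inputs are the rounded figures quoted in this thread, so the last digit can differ from what the bot prints.

```rust
/// Speedup factor: baseline time divided by the new time (same units).
pub fn speedup(baseline: f64, new: f64) -> f64 {
    baseline / new
}

/// Convert a relative change in time, given in percent (negative means
/// faster), into a speedup factor. For example, -50% means 2x faster.
pub fn speedup_from_change(change_pct: f64) -> f64 {
    1.0 / (1.0 + change_pct / 100.0)
}
```

With the String row above, `speedup(73.0, 18.9)` is about 3.86, in line with the 3.87 the bot reports. The local String result of about −64.33% corresponds to `speedup_from_change(-64.33)`, roughly 2.8x.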

@alamb
Contributor

alamb commented Jan 29, 2026

run benchmark arrow_statistics

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_statistics (f900620) to 2c0eba4 diff
BENCH_NAME=arrow_statistics
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_statistics
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_statistics
Results will be posted here when complete

Contributor

@alamb alamb left a comment

Makes sense to me -- thank you @Dandandan

        _ => b.append_nulls(len),
    }
}
Ok(Arc::new(b.finish()))
Contributor

Seeing all these calls to finish with a builder that is not re-used looks wasteful to me -- I also tried to code up a PR to make this faster too (both finish as well as add a new build):

Contributor Author

Cool, in this case it probably does not help that much (most of the time is spent during allocations / capacity checks / copies... while building the values).

Contributor

Yeah, I agree the win is likely pretty small, but every little allocation helps

FWIW we could probably make it even better by not decoding into the ParquetMetadata structures at all (and directly decode Thrift to arrow arrays 🤔 )

Contributor Author

@Dandandan Dandandan Jan 31, 2026

Yes the latter would be best (and preferably also do validation directly on read/convert them to the right type, so it could just pass them on)

Contributor

For anyone else following along, this is tracked here

Ok(UInt64Array::from_iter(iter))
let chunks: Vec<_> = iterator.collect();
let total_capacity: usize = chunks.iter().map(|(len, _)| *len).sum();
let mut builder = UInt64Builder::with_capacity(total_capacity);
Contributor

this seems like it saves an allocation too

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                                      main                                   speedup_statistics
-----                                                                                                      ----                                   ------------------
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    3.77     70.9±0.46µs        ? ?/sec    1.00     18.8±0.25µs        ? ?/sec
Extract data page statistics for F64/extract_statistics/F64                                                1.49     11.5±0.07µs        ? ?/sec    1.00      7.7±0.11µs        ? ?/sec
Extract data page statistics for Int64/extract_statistics/Int64                                            1.70     13.1±0.36µs        ? ?/sec    1.00      7.7±0.05µs        ? ?/sec
Extract data page statistics for String/extract_statistics/String                                          3.79     70.5±0.40µs        ? ?/sec    1.00     18.6±0.10µs        ? ?/sec
Extract data page statistics for UInt64/extract_statistics/UInt64                                          1.42     11.6±0.07µs        ? ?/sec    1.00      8.1±0.07µs        ? ?/sec
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    1.00   1037.3±7.75ns        ? ?/sec    1.02  1058.3±27.41ns        ? ?/sec
Extract row group statistics for F64/extract_statistics/F64                                                1.00    536.1±4.09ns        ? ?/sec    1.03   552.6±10.54ns        ? ?/sec
Extract row group statistics for Int64/extract_statistics/Int64                                            1.00    541.7±3.24ns        ? ?/sec    1.01    549.2±5.01ns        ? ?/sec
Extract row group statistics for String/extract_statistics/String                                          1.00  1018.6±20.96ns        ? ?/sec    1.02  1037.7±10.67ns        ? ?/sec
Extract row group statistics for UInt64/extract_statistics/UInt64                                          1.00    536.4±2.80ns        ? ?/sec    1.03    551.5±6.30ns        ? ?/sec

@Dandandan
Contributor Author

run benchmark arrow_statistics

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_statistics (7156686) to 2c0eba4 diff
BENCH_NAME=arrow_statistics
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_statistics
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_statistics
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                                      main                                   speedup_statistics
-----                                                                                                      ----                                   ------------------
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    4.20     71.6±1.21µs        ? ?/sec    1.00     17.1±0.15µs        ? ?/sec
Extract data page statistics for F64/extract_statistics/F64                                                2.11     12.3±0.07µs        ? ?/sec    1.00      5.9±0.08µs        ? ?/sec
Extract data page statistics for Int64/extract_statistics/Int64                                            2.17     13.6±0.17µs        ? ?/sec    1.00      6.3±0.06µs        ? ?/sec
Extract data page statistics for String/extract_statistics/String                                          4.05     71.3±0.60µs        ? ?/sec    1.00     17.6±0.14µs        ? ?/sec
Extract data page statistics for UInt64/extract_statistics/UInt64                                          1.93     12.3±0.06µs        ? ?/sec    1.00      6.3±0.04µs        ? ?/sec
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    1.03  1031.2±14.21ns        ? ?/sec    1.00    997.3±6.45ns        ? ?/sec
Extract row group statistics for F64/extract_statistics/F64                                                1.00   535.8±13.21ns        ? ?/sec    1.01   542.1±10.64ns        ? ?/sec
Extract row group statistics for Int64/extract_statistics/Int64                                            1.00    539.5±7.27ns        ? ?/sec    1.02    548.6±2.15ns        ? ?/sec
Extract row group statistics for String/extract_statistics/String                                          1.01  1037.7±27.03ns        ? ?/sec    1.00  1023.5±17.51ns        ? ?/sec
Extract row group statistics for UInt64/extract_statistics/UInt64                                          1.00    539.9±6.06ns        ? ?/sec    1.01    547.8±3.55ns        ? ?/sec

@Dandandan Dandandan changed the title Optimize data page statistics conversion (up to 3x) Optimize data page statistics conversion (up to 4x) Jan 30, 2026
@Dandandan
Contributor Author

run benchmark arrow_statistics

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_statistics (f57c82e) to 2c0eba4 diff
BENCH_NAME=arrow_statistics
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_statistics
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_statistics
Results will be posted here when complete

@alamb-ghbot

Benchmark script failed with exit code 101.

Last 10 lines of output:

warning: unused variable: `field`
   --> parquet/src/arrow/schema/extension.rs:185:44
    |
185 | pub(crate) fn logical_type_for_binary_view(field: &Field) -> Option<LogicalType> {
    |                                            ^^^^^ help: if this is intentional, prefix it with an underscore: `_field`

For more information about this error, try `rustc --explain E0599`.
warning: `parquet` (lib) generated 2 warnings
error: could not compile `parquet` (lib) due to 48 previous errors; 2 warnings emitted
warning: build failed, waiting for other jobs to finish...

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 30, 2026
@Dandandan
Contributor Author

run benchmark arrow_statistics

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_statistics (c0c46ca) to 2c0eba4 diff
BENCH_NAME=arrow_statistics
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_statistics
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_statistics
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                                      main                                   speedup_statistics
-----                                                                                                      ----                                   ------------------
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    4.09     71.3±0.98µs        ? ?/sec    1.00     17.4±0.11µs        ? ?/sec
Extract data page statistics for F64/extract_statistics/F64                                                4.28     12.2±0.05µs        ? ?/sec    1.00      2.8±0.06µs        ? ?/sec
Extract data page statistics for Int64/extract_statistics/Int64                                            4.80     13.5±0.14µs        ? ?/sec    1.00      2.8±0.06µs        ? ?/sec
Extract data page statistics for String/extract_statistics/String                                          4.02     71.0±0.85µs        ? ?/sec    1.00     17.7±0.09µs        ? ?/sec
Extract data page statistics for UInt64/extract_statistics/UInt64                                          2.56     12.2±0.09µs        ? ?/sec    1.00      4.8±0.10µs        ? ?/sec
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    1.05  1022.8±14.75ns        ? ?/sec    1.00   970.5±13.44ns        ? ?/sec
Extract row group statistics for F64/extract_statistics/F64                                                1.00   533.1±17.82ns        ? ?/sec    1.00    532.9±5.67ns        ? ?/sec
Extract row group statistics for Int64/extract_statistics/Int64                                            1.00    533.5±1.90ns        ? ?/sec    1.01    539.3±2.26ns        ? ?/sec
Extract row group statistics for String/extract_statistics/String                                          1.05   1019.9±7.93ns        ? ?/sec    1.00   968.4±23.31ns        ? ?/sec
Extract row group statistics for UInt64/extract_statistics/UInt64                                          1.00    535.0±2.57ns        ? ?/sec    1.00    534.9±1.64ns        ? ?/sec

@Dandandan
Contributor Author

run benchmark arrow_statistics

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_statistics (571b169) to 2c0eba4 diff
BENCH_NAME=arrow_statistics
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_statistics
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_statistics
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                                      main                                   speedup_statistics
-----                                                                                                      ----                                   ------------------
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    4.53     71.5±1.64µs        ? ?/sec    1.00     15.8±0.73µs        ? ?/sec
Extract data page statistics for F64/extract_statistics/F64                                                4.30     12.3±0.05µs        ? ?/sec    1.00      2.9±0.05µs        ? ?/sec
Extract data page statistics for Int64/extract_statistics/Int64                                            4.71     13.5±0.08µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
Extract data page statistics for String/extract_statistics/String                                          4.52     71.0±0.40µs        ? ?/sec    1.00     15.7±0.08µs        ? ?/sec
Extract data page statistics for UInt64/extract_statistics/UInt64                                          2.72     12.3±0.13µs        ? ?/sec    1.00      4.5±0.10µs        ? ?/sec
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    1.09  1058.9±27.37ns        ? ?/sec    1.00    975.5±7.95ns        ? ?/sec
Extract row group statistics for F64/extract_statistics/F64                                                1.02   540.7±16.52ns        ? ?/sec    1.00   529.3±14.42ns        ? ?/sec
Extract row group statistics for Int64/extract_statistics/Int64                                            1.02   550.1±10.76ns        ? ?/sec    1.00    538.4±8.02ns        ? ?/sec
Extract row group statistics for String/extract_statistics/String                                          1.07   1047.9±6.73ns        ? ?/sec    1.00   978.6±25.35ns        ? ?/sec
Extract row group statistics for UInt64/extract_statistics/UInt64                                          1.02    540.1±6.71ns        ? ?/sec    1.00    531.9±3.35ns        ? ?/sec

Contributor

@alamb alamb left a comment

Very impressive @Dandandan -- thank you

I have one small comment (found by codex and I verified) about a potential change in FixedSizeBinary stats handling, but otherwise this looks great

}
/// Returns the null pages.
///
/// Values may be `None` when [`ColumnIndex::is_null_page()`] is `true`.
Contributor

I don't understand this comment -- this API returns a vector of what pages are nulls. What is None? Or is it trying to say calling Self::min_value and Self::max_value will return None when is_null_page is true?

Contributor Author

Oh I should have removed that one (I think the comment was autogenerated).
I thought I could pass the values slice and null_pages slice to the builder (it is another 30% or so faster) instead of the iterator code, but the null_pages unfortunately has the valid/invalid reversed compared to "validity".
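
To make the "reversed" point concrete, here is a minimal sketch using only the standard library. It is not the parquet or arrow API: `validity_from_null_pages` and `apply_null_pages` are hypothetical helpers, and `Vec<bool>` stands in for a packed boolean buffer. In `null_pages`, `true` means the page is entirely null, whereas a validity mask uses `true` for a present value, so the flags cannot be reused without inverting them.

```rust
/// Convert `null_pages` flags (`true` = page is all null) into
/// validity flags (`true` = value is present). Allocates a new `Vec`.
pub fn validity_from_null_pages(null_pages: &[bool]) -> Vec<bool> {
    null_pages.iter().map(|&is_null| !is_null).collect()
}

/// Pair each per-page value with its null flag, producing `None`
/// for pages that are marked as all null.
/// Assumption: both slices have the same length (one entry per page).
pub fn apply_null_pages(values: &[i64], null_pages: &[bool]) -> Vec<Option<i64>> {
    assert_eq!(values.len(), null_pages.len(), "one flag per page expected");
    values
        .iter()
        .zip(null_pages)
        .map(|(&v, &is_null)| if is_null { None } else { Some(v) })
        .collect()
}
```

The inversion is cheap per element, but it does mean a copy (or an in-place mutation of the original) is needed before the flags can serve as validity.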

Contributor

@alamb alamb Jan 30, 2026

but the null_pages unfortunately has the valid/invalid reversed compared to "validity".

Maybe it is time (another PR) to actually implement something like unary_mut for boolean buffers (so we could invert the bits without a new allocation)
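
A rough sketch of what inverting the bits without a new allocation could look like, using only the standard library. This is not an existing arrow-rs function: `invert_bits_in_place` is a hypothetical helper, and it assumes the bitmap is packed LSB-first into `u64` words with `len` logical bits.

```rust
/// Invert the first `len` bits of a packed, LSB-first bitmap in place.
/// Padding bits in the last partially used word are cleared to zero;
/// words entirely beyond `len` are left untouched.
pub fn invert_bits_in_place(words: &mut [u64], len: usize) {
    assert!(len <= words.len() * 64, "len exceeds the buffer size");
    let full_words = len / 64;
    let rem = len % 64;
    // Flip every fully used word with a single NOT each.
    for w in &mut words[..full_words] {
        *w = !*w;
    }
    // Flip the used bits of the trailing partial word and mask off the rest.
    if rem != 0 {
        let mask = (1u64 << rem) - 1;
        words[full_words] = !words[full_words] & mask;
    }
}
```

This mutates the buffer it is given, so it only applies when the caller owns the buffer exclusively and no longer needs the original flags.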

Contributor Author

Perhaps - although I think here we ideally leave the original in place and copy it (or avoid the copy altogether by returning a vec/iterator, preferably...)

@alamb
Contributor

alamb commented Jan 30, 2026

FWIW we could probably make it even better by not decoding into the ParquetMetadata structures at all (and directly decode Thrift to arrow arrays 🤔 )

And now I see you have already filed

👍

Dandandan and others added 3 commits January 30, 2026 16:21
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Dandandan Dandandan merged commit e2b264f into apache:main Jan 31, 2026
28 checks passed
@alamb
Contributor

alamb commented Feb 2, 2026

🚀
