Introduce PrimitiveArrayBuilder::build(), avoid use of ArrayData#9305

Draft
alamb wants to merge 4 commits into apache:main from alamb:alamb/less_array_data_primtive

Conversation

@alamb
Contributor

@alamb alamb commented Jan 29, 2026

Which issue does this PR close?

Note: if this looks good, I will file a ticket about adding build to the other array builders.

Rationale for this change

While reviewing #9303 from @Dandandan, I noticed that using the primitive builders to create arrays was non-ideal for two reasons:

  1. It drops and recreates a DataType (which, while not super expensive, is pure overhead)
  2. It uses ArrayData (which allocates a Vec unnecessarily)

If this approach is accepted, I will make a ticket to track adding build to the other builders

What changes are included in this PR?

  1. Update finish and finish_cloned to avoid using ArrayData
  2. Introduce build which consumes the builder

This is similar to the build methods added to the other builders here
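
The difference between a borrowing finish and a consuming build can be sketched with std-only stand-ins. This is a minimal illustration, not the arrow-rs implementation: `ToyBuilder`, `ToyArray`, and their fields are hypothetical names, and the real builder uses its own buffer types rather than plain `Vec`s.

```rust
// Hypothetical stand-in types for illustration only; not the arrow-rs API.
#[derive(Debug, PartialEq)]
struct ToyArray {
    values: Vec<i32>,
    // `None` when no null was ever appended.
    validity: Option<Vec<bool>>,
}

#[derive(Default)]
struct ToyBuilder {
    values: Vec<i32>,
    validity: Vec<bool>,
    has_nulls: bool,
}

impl ToyBuilder {
    fn append_option(&mut self, v: Option<i32>) {
        self.values.push(v.unwrap_or_default());
        self.validity.push(v.is_some());
        self.has_nulls |= v.is_none();
    }

    /// Borrowing variant: moves the buffers out and resets the builder
    /// so that it can be reused afterwards.
    fn finish(&mut self) -> ToyArray {
        let values = std::mem::take(&mut self.values);
        let validity = std::mem::take(&mut self.validity);
        let has_nulls = std::mem::replace(&mut self.has_nulls, false);
        ToyArray { values, validity: has_nulls.then_some(validity) }
    }

    /// Consuming variant: takes `self` by value, so nothing has to be
    /// reset and the buffers move straight into the array.
    fn build(self) -> ToyArray {
        ToyArray {
            values: self.values,
            validity: self.has_nulls.then_some(self.validity),
        }
    }
}

fn build_example() -> ToyArray {
    let mut builder = ToyBuilder::default();
    builder.append_option(Some(1));
    builder.append_option(None);
    builder.append_option(Some(3));
    builder.build()
}

fn main() {
    let array = build_example();
    println!("{:?}", array);
}
```

In this sketch neither path goes through an intermediate, type-erased representation; both hand the buffers directly to the array constructor, which is the spirit of avoiding ArrayData.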

Are these changes tested?

Yes, by CI and new doc tests

I will also run benchmarks

Are there any user-facing changes?

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Jan 29, 2026
@alamb alamb force-pushed the alamb/less_array_data_primtive branch from b3bbc1c to 12525b3 Compare January 29, 2026 20:00
@github-actions github-actions bot removed the parquet Changes to the parquet crate label Jan 29, 2026
@alamb alamb changed the title Alamb/less array data primtive Introduce PrimitiveArrayBuilder::build(), avoid use of ArrayData Jan 29, 2026
@github-actions github-actions bot added parquet Changes to the parquet crate arrow-flight Changes to the arrow-flight crate parquet-variant parquet-variant* crates labels Jan 29, 2026
@alamb
Contributor Author

alamb commented Jan 29, 2026

run benchmark builder

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_array_data_primtive (a9b6504) to 2c0eba4 diff
BENCH_NAME=builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench builder
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_array_data_primtive
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                          alamb_less_array_data_primtive         main
-----                                          ------------------------------         ----
bench_bool/bench_bool                          1.45  1502.7±10.24µs   332.7 MB/sec    1.00  1035.3±33.48µs   483.0 MB/sec
bench_decimal128_builder                       1.21    102.3±1.14µs        ? ?/sec    1.00     84.9±1.61µs        ? ?/sec
bench_decimal256_builder                       1.22    105.9±1.39µs        ? ?/sec    1.00     87.0±3.74µs        ? ?/sec
bench_decimal32_builder                        1.00     51.2±0.23µs        ? ?/sec    1.11     57.1±2.58µs        ? ?/sec
bench_decimal64_builder                        1.26     58.2±0.32µs        ? ?/sec    1.00     46.3±0.35µs        ? ?/sec
bench_primitive/bench_primitive                1.00    176.4±4.36µs    22.1 GB/sec    1.00    175.7±4.53µs    22.2 GB/sec
bench_primitive/bench_string                   1.16      9.7±0.25ms   672.8 MB/sec    1.00      8.3±0.29ms   779.1 MB/sec
bench_primitive_nulls/bench_primitive_nulls    1.02  1230.6±20.87µs        ? ?/sec    1.00   1212.3±6.74µs        ? ?/sec

@alamb
Contributor Author

alamb commented Jan 29, 2026

run benchmark builder

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_array_data_primtive (fcebc95) to 2c0eba4 diff
BENCH_NAME=builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench builder
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_array_data_primtive
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                          alamb_less_array_data_primtive         main
-----                                          ------------------------------         ----
bench_bool/bench_bool                          1.46  1508.5±39.85µs   331.5 MB/sec    1.00  1032.6±10.07µs   484.2 MB/sec
bench_decimal128_builder                       1.24    101.5±1.01µs        ? ?/sec    1.00     82.1±5.90µs        ? ?/sec
bench_decimal256_builder                       1.27    105.7±2.87µs        ? ?/sec    1.00     83.2±3.76µs        ? ?/sec
bench_decimal32_builder                        1.00     51.2±0.22µs        ? ?/sec    1.09     56.0±0.49µs        ? ?/sec
bench_decimal64_builder                        1.26     57.9±0.20µs        ? ?/sec    1.00     45.9±0.27µs        ? ?/sec
bench_primitive/bench_primitive                1.00    174.2±4.24µs    22.4 GB/sec    1.01    176.5±3.66µs    22.1 GB/sec
bench_primitive/bench_string                   1.00      8.8±0.27ms   738.3 MB/sec    1.05      9.3±0.25ms   702.3 MB/sec
bench_primitive_nulls/bench_primitive_nulls    1.01   1226.1±6.26µs        ? ?/sec    1.00  1214.1±13.71µs        ? ?/sec

@alamb
Contributor Author

alamb commented Jan 29, 2026

🤔 seems to get slower. Will investigate

@alamb
Contributor Author

alamb commented Jan 29, 2026

Seems the benchmark is not super useful:

[Image: Screenshot 2026-01-29 at 4 51 48 PM]

@Dandandan
Contributor

run benchmark builder

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_array_data_primtive (fcebc95) to 2c0eba4 diff
BENCH_NAME=builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench builder
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_array_data_primtive
Results will be posted here when complete

    values: ScalarBuffer<T::Native>,
    nulls: Option<NullBuffer>,
) -> Self {
    Self::assert_compatible(&data_type);
Contributor

This does an extra check that wasn't there before when using build_unchecked?

Contributor

Probably not relevant if we removed all build

Contributor Author

yeah, I think I have a way to remove this

@alamb-ghbot

🤖: Benchmark completed

Details

group                                          alamb_less_array_data_primtive         main
-----                                          ------------------------------         ----
bench_bool/bench_bool                          1.00   1058.2±9.69µs   472.5 MB/sec    1.42  1505.1±11.03µs   332.2 MB/sec
bench_decimal128_builder                       1.23    102.2±3.71µs        ? ?/sec    1.00     83.0±0.51µs        ? ?/sec
bench_decimal256_builder                       1.26    105.2±1.78µs        ? ?/sec    1.00     83.3±1.68µs        ? ?/sec
bench_decimal32_builder                        1.06     51.1±0.23µs        ? ?/sec    1.00     48.1±0.75µs        ? ?/sec
bench_decimal64_builder                        1.25     57.3±1.19µs        ? ?/sec    1.00     46.0±0.34µs        ? ?/sec
bench_primitive/bench_primitive                1.19    187.0±4.90µs    20.9 GB/sec    1.00    156.6±5.34µs    25.0 GB/sec
bench_primitive/bench_string                   1.02     10.2±0.11ms   635.9 MB/sec    1.00     10.0±0.08ms   647.4 MB/sec
bench_primitive_nulls/bench_primitive_nulls    1.00  1242.9±16.73µs        ? ?/sec    1.04  1297.5±16.00µs        ? ?/sec

@Dandandan
Contributor

bench_bool/bench_bool 1.00 1058.2±9.69µs 472.5 MB/sec 1.42 1505.1±11.03µs 332.2 MB/sec

And it's the other way around (so noise confirmed)
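
The arithmetic behind the "noise" conclusion can be sketched as follows. The timings are the bench_bool means from the second and third bot runs above, which both compare fcebc95 against 2c0eba4; the `ratio` helper is a hypothetical name introduced only for this illustration.

```rust
/// Ratio of the branch's mean time to main's mean time.
/// A value above 1.0 means the branch measured slower.
fn ratio(branch_us: f64, main_us: f64) -> f64 {
    branch_us / main_us
}

fn main() {
    // bench_bool/bench_bool mean times in microseconds, copied from the
    // two bot runs that compared the same pair of commits.
    let second_run = ratio(1508.5, 1032.6);
    let third_run = ratio(1058.2, 1505.1);

    println!("second run: branch/main = {:.2}", second_run); // 1.46
    println!("third run:  branch/main = {:.2}", third_run); // 0.70

    // The same code measured ~46% slower in one run and ~30% faster in
    // the next, so the difference cannot be attributed to the change.
    let flipped = (second_run > 1.0) != (third_run > 1.0);
    println!("direction flipped between runs: {}", flipped);
}
```

Because the reported ± spread within each run is small relative to the gap between runs, the variation appears to come from run-to-run conditions on the benchmark machine rather than from sampling error inside a single run.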

@alamb
Contributor Author

alamb commented Jan 30, 2026

> bench_bool/bench_bool 1.00 1058.2±9.69µs 472.5 MB/sec 1.42 1505.1±11.03µs 332.2 MB/sec
>
> And it's the other way around (so noise confirmed)

Yeah. I thought of some way to make this faster though, so working on that now

@alamb alamb force-pushed the alamb/less_array_data_primtive branch from fcebc95 to 11d7c80 Compare January 30, 2026 14:06
@alamb alamb force-pushed the alamb/less_array_data_primtive branch from 6b8125c to 6895106 Compare January 30, 2026 14:14
    values_builder: Vec<T::Native>,
    null_buffer_builder: NullBufferBuilder,
    data_type: DataType,
    /// Optional data type override (e.g. to add timezone or precision/scale)
Contributor Author

This time I tried switching to use Option so we only pay the extra type check when there is actually a datatype override
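
The idea of paying for the compatibility check only when an override is present can be sketched with std-only stand-ins. Everything here is an assumption made for illustration: `ToyType`, `ToyBuilder`, `is_compatible`, and the `data_type_override` field are hypothetical names, not the code in this PR.

```rust
/// Hypothetical stand-in for a logical data type; not arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int64,
    Timestamp { tz: Option<String> },
    Utf8,
}

/// Toy rule: an i64-backed builder may be labelled Int64 or Timestamp.
fn is_compatible(t: &ToyType) -> bool {
    matches!(t, ToyType::Int64 | ToyType::Timestamp { .. })
}

struct ToyBuilder {
    values: Vec<i64>,
    /// `None` means "use the default type for i64".
    data_type_override: Option<ToyType>,
}

impl ToyBuilder {
    fn new() -> Self {
        Self { values: Vec::new(), data_type_override: None }
    }

    fn append(&mut self, v: i64) {
        self.values.push(v);
    }

    /// Records the override without checking it; the check is deferred
    /// to `build`.
    fn with_data_type(mut self, t: ToyType) -> Self {
        self.data_type_override = Some(t);
        self
    }

    fn build(self) -> (ToyType, Vec<i64>) {
        let data_type = match self.data_type_override {
            // Common path: the default type is known to be valid,
            // so no check is needed.
            None => ToyType::Int64,
            // Override path: validate once, at build time.
            Some(t) => {
                assert!(is_compatible(&t), "incompatible override: {:?}", t);
                t
            }
        };
        (data_type, self.values)
    }
}

fn main() {
    let mut plain = ToyBuilder::new();
    plain.append(1);
    println!("{:?}", plain.build());

    let mut ts = ToyBuilder::new()
        .with_data_type(ToyType::Timestamp { tz: Some("UTC".to_string()) });
    ts.append(42);
    println!("{:?}", ts.build());

    println!("Utf8 compatible: {}", is_compatible(&ToyType::Utf8));
}
```

A consequence of this layout is that an invalid override is reported when the array is built rather than when the override is set.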

    /// not [PrimitiveArray::is_compatible] with the builder's primitive type
    /// `T`.
    pub fn with_data_type(self, data_type: DataType) -> Self {
        assert!(
Contributor Author

this assert is done as part of the PrimitiveArray::with_type later on

I also updated the tests to show that

@alamb
Contributor Author

alamb commented Jan 30, 2026

run benchmark builder

@apache apache deleted a comment from alamb-ghbot Jan 30, 2026
@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_array_data_primtive (6895106) to b3ad9a8 diff
BENCH_NAME=builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench builder
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_array_data_primtive
Results will be posted here when complete

@apache apache deleted a comment from alamb-ghbot Jan 30, 2026
@alamb-ghbot

🤖: Benchmark completed

Details

group                                          alamb_less_array_data_primtive         main
-----                                          ------------------------------         ----
bench_bool/bench_bool                          1.00  1367.1±19.01µs   365.7 MB/sec    1.10  1505.9±33.52µs   332.0 MB/sec
bench_decimal128_builder                       1.29    106.9±2.04µs        ? ?/sec    1.00     83.1±1.96µs        ? ?/sec
bench_decimal256_builder                       1.33    110.3±0.47µs        ? ?/sec    1.00     82.7±0.62µs        ? ?/sec
bench_decimal32_builder                        1.13     53.9±0.33µs        ? ?/sec    1.00     47.9±0.14µs        ? ?/sec
bench_decimal64_builder                        1.30     59.9±1.28µs        ? ?/sec    1.00     46.0±0.54µs        ? ?/sec
bench_primitive/bench_primitive                1.14    177.2±5.11µs    22.0 GB/sec    1.00    155.9±6.20µs    25.1 GB/sec
bench_primitive/bench_string                   1.00      9.2±0.33ms   703.4 MB/sec    1.00      9.2±0.23ms   706.6 MB/sec
bench_primitive_nulls/bench_primitive_nulls    1.09  1398.6±15.07µs        ? ?/sec    1.00  1280.6±13.34µs        ? ?/sec

@alamb
Contributor Author

alamb commented Jan 30, 2026

run benchmark builder

@github-actions github-actions bot removed parquet Changes to the parquet crate arrow-flight Changes to the arrow-flight crate parquet-variant parquet-variant* crates labels Jan 30, 2026
@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_array_data_primtive (63fc146) to b3ad9a8 diff
BENCH_NAME=builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench builder
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_array_data_primtive
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jan 30, 2026

(testing without changes to the benchmark)

@alamb-ghbot

🤖: Benchmark completed

Details

group                                          alamb_less_array_data_primtive         main
-----                                          ------------------------------         ----
bench_bool/bench_bool                          1.00   1060.9±9.21µs   471.3 MB/sec    1.42   1501.3±9.63µs   333.0 MB/sec
bench_decimal128_builder                       1.49    124.0±4.67µs        ? ?/sec    1.00     83.3±2.52µs        ? ?/sec
bench_decimal256_builder                       1.58    130.9±4.41µs        ? ?/sec    1.00     82.7±0.64µs        ? ?/sec
bench_decimal32_builder                        1.37    65.7±18.60µs        ? ?/sec    1.00     47.9±0.33µs        ? ?/sec
bench_decimal64_builder                        1.38     63.0±6.86µs        ? ?/sec    1.00     45.8±0.11µs        ? ?/sec
bench_primitive/bench_primitive                1.00    172.0±5.30µs    22.7 GB/sec    1.01    174.0±6.24µs    22.5 GB/sec
bench_primitive/bench_string                   1.04      9.3±0.37ms   695.6 MB/sec    1.00      9.0±0.24ms   720.0 MB/sec
bench_primitive_nulls/bench_primitive_nulls    1.22  1562.6±14.67µs        ? ?/sec    1.00   1279.0±6.08µs        ? ?/sec

Labels

arrow Changes to the arrow crate

3 participants