Skip to content

Replace ArrayData with direct Array construction#9338

Open
liamzwbao wants to merge 5 commits intoapache:mainfrom
liamzwbao:issue-9298-replace-arraydata
Open

Replace ArrayData with direct Array construction#9338
liamzwbao wants to merge 5 commits intoapache:mainfrom
liamzwbao:issue-9298-replace-arraydata

Conversation

@liamzwbao
Copy link
Contributor

@liamzwbao liamzwbao commented Feb 3, 2026

Which issue does this PR close?

Rationale for this change

Reduce unsafe ArrayData::new_unchecked usage by switching to concrete array constructors

What changes are included in this PR?

  • Use concrete array constructors in arrow-array and arrow-string (BooleanArray, regexp, substring).
  • Preserve FixedSizeBinaryArray length when size == 0 and clarify the expectations of the function in substring.

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 3, 2026
@liamzwbao liamzwbao marked this pull request as ready for review February 3, 2026 02:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice to be @liamzwbao -- I think we should maintain the sliced() call here but otherwise 👍

where
&'a S: StringArrayType<'a>,
{
let null_bit_buffer = array.nulls().map(|x| x.inner().sliced());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sliced call seems to have been lost 🤔

But all the tests pass

I am not sure how important it is to "materalize" the buffer like this , but I recommend we keep it the same

So something like

let nulls = array.nulls().sliced().filter(|n| n.null_count() > 0);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sliced was needed because when using ArrayData we must pass in a Buffer for nulls; if we don't do sliced then the potential offset which is stored at the BooleanBuffer level gets lost in translation. If we are moving away from ArrayData here we can simply pass the nulls as is as the slice becomes unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have .sliced() unfortunately, but we could rebuild the NullBuffer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is we should be able to pass the nullbuffer as is, without needing to explicitly slice it, I believe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are moving away from ArrayData here we can simply pass the nulls as is as the slice becomes unnecessary

This makes sense, will wait for further discussion before making changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be safest to leave the sliced() call in (avoid potential regressions) and make a follow on PR to try and remove it separately

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sliced should not be needed now, but its also safe to leave it in. I doubt we could show any performance impact in a benchmark since regex matching would be significantly slower than bit slicing.

array
.nulls()
.cloned()
.or_else(|| Some(NullBuffer::new_valid(num_of_elements)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does it need a Null buffer if the input didn't have a null buffer ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's mainly because FixedSizeBinaryArray::new treats size == 0 specially: if both values and nulls are None, it infers len = 0.

let len = if s == 0 {
if !values.is_empty() {
return Err(ArrowError::InvalidArgumentError(
"Buffer cannot have non-zero length if the item size is zero".to_owned(),
));
}
// If the item size is zero, try to determine the length from the null buffer
nulls.as_ref().map(|n| n.len()).unwrap_or(0)

In substring::tests::without_nulls_fixed_size_binary, some cases produce size‑0 substrings, and without this change the result collapses to an empty array and fail the tests. Providing a nulls buffer here simply preserves the expected length to produce correct results (e.g., ["", "", ""]).

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify further by removing need of preserving sliced() which seems to be a relic of using ArrayData API

where
&'a S: StringArrayType<'a>,
{
let null_bit_buffer = array.nulls().map(|x| x.inner().sliced());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sliced was needed because when using ArrayData we must pass in a Buffer for nulls; if we don't do sliced then the potential offset which is stored at the BooleanBuffer level gets lost in translation. If we are moving away from ArrayData here we can simply pass the nulls as is as the slice becomes unnecessary.

0,
data_len,
)))
.filter(|n| n.null_count() > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if there is a way to filter in advance, before the into() 🤔

Might be a minor optimization but worth thinking about perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! We can track number of non_nulls to see if we need to create a NullBuffer

let data = val_builder.as_slice_mut();

let null_slice = null_builder.as_slice_mut();
let mut non_nulls = 0usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more thinking along the lines of if we can leverage information from null_builder (or perhaps explore using NullBufferBuilder), rather than have a separate variable to track this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null_builder is just a MutableBuffer, so it doesn’t carry validity metadata we can get directly.

In terms of NullBufferBuilder, it’s not obvious it would outperform this approach. We could evaluate it with a benchmark in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very performance critical code. I'll run some benchmarks to make sure this doesn't mess something up by accident

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it seems like this code previously didn't bother to check the null counts, so we could always leave the existing behavior as is 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullBuffer::new already counts the set bits, counting the nulls separately here could allow the usage of NullBuffer::new_unchecked, but I'm not sure that will really improve performance. I agree that the performance optimizations should be left to a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked that ArrayData::new_unchecked also filters out null buffers with all nulls, so the current code in the PR does not change any observable behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the discussion, I think it's better to limit the scope of this PR to simply replace the unsafe calls. I will revert the last commit and we could open a separate PR to benchmark the performance

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Thank you @liamzwbao and @Jefffrey

I am just worried about the potential performance implications. I'll fire off some benchmarks to see

where
&'a S: StringArrayType<'a>,
{
let null_bit_buffer = array.nulls().map(|x| x.inner().sliced());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be safest to leave the sliced() call in (avoid potential regressions) and make a follow on PR to try and remove it separately

Comment on lines +206 to +207
.map(|b| NullBuffer::new(BooleanBuffer::new(b, 0, array.len())))
.filter(|n| n.null_count() > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern of creating a BooleanBuffer from with a length and then filtering for nulls count is quite common (and becoming more so)

I wonder if we should (as a follow on PR) make a function to do it, maybe like

let Option<NullBuffer> = NullBuffer::try_from_unsliced(b, array.len())

Then this code would look like

let nulls = array
        .nulls()
        .map(|n| n.inner().sliced())
        .map(|b| NullBuffer::try_from_unsliced(b, array.len()

🤔

let data = val_builder.as_slice_mut();

let null_slice = null_builder.as_slice_mut();
let mut non_nulls = 0usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very performance critical code. I'll run some benchmarks to make sure this doesn't mess something up by accident

let data = val_builder.as_slice_mut();

let null_slice = null_builder.as_slice_mut();
let mut non_nulls = 0usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it seems like this code previously didn't bother to check the null counts, so we could always leave the existing behavior as is 🤔

@alamb
Copy link
Contributor

alamb commented Feb 4, 2026

run benchmark array_from

@alamb-ghbot

This comment was marked as outdated.

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue-9298-replace-arraydata (6047a3c) to dc99a52 diff
BENCH_NAME=array_from
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench array_from
BENCH_FILTER=
BENCH_BRANCH_NAME=issue-9298-replace-arraydata
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                  issue-9298-replace-arraydata           main
-----                                  ----------------------------           ----
BooleanArray::from_iter                1.00     18.8±0.14µs        ? ?/sec    1.19     22.3±0.32µs        ? ?/sec
BooleanArray::from_trusted_len_iter    1.00     17.7±0.45µs        ? ?/sec    1.28     22.6±0.52µs        ? ?/sec
Int64Array::from_iter                  1.01     52.6±1.61µs        ? ?/sec    1.00     52.4±0.11µs        ? ?/sec
Int64Array::from_trusted_len_iter      1.00     18.9±0.08µs        ? ?/sec    1.00     18.9±0.13µs        ? ?/sec
array_from_vec 128                     1.01     67.2±0.30ns        ? ?/sec    1.00     66.5±0.35ns        ? ?/sec
array_from_vec 256                     1.00     75.1±0.65ns        ? ?/sec    1.00     75.1±0.80ns        ? ?/sec
array_from_vec 512                     1.00    125.1±0.48ns        ? ?/sec    1.02    127.2±1.21ns        ? ?/sec
array_string_from_vec 128              1.00  1092.5±12.34ns        ? ?/sec    1.10  1198.6±115.69ns        ? ?/sec
array_string_from_vec 256              1.00  1886.0±16.63ns        ? ?/sec    1.02  1926.5±18.22ns        ? ?/sec
array_string_from_vec 512              1.04      3.4±0.03µs        ? ?/sec    1.00      3.2±0.03µs        ? ?/sec
decimal128_array_from_vec 32768        1.00     54.3±3.91µs        ? ?/sec    1.14     62.2±0.70µs        ? ?/sec
decimal256_array_from_vec 32768        1.00      2.7±0.05µs        ? ?/sec    1.01      2.7±0.01µs        ? ?/sec
decimal32_array_from_vec 32768         1.06     44.0±0.59µs        ? ?/sec    1.00     41.5±0.66µs        ? ?/sec
decimal64_array_from_vec 32768         1.19     53.0±0.34µs        ? ?/sec    1.00     44.5±1.14µs        ? ?/sec
struct_array_from_vec 1024             1.00      9.4±0.18µs        ? ?/sec    1.02      9.6±1.07µs        ? ?/sec
struct_array_from_vec 128              1.00   1889.4±9.53ns        ? ?/sec    1.02  1929.1±59.68ns        ? ?/sec
struct_array_from_vec 256              1.00      3.0±0.04µs        ? ?/sec    1.02      3.0±0.03µs        ? ?/sec
struct_array_from_vec 512              1.00      5.1±0.21µs        ? ?/sec    1.01      5.1±0.08µs        ? ?/sec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants