Replace ArrayData with direct Array construction#9338
Replace ArrayData with direct Array construction#9338liamzwbao wants to merge 5 commits intoapache:mainfrom
ArrayData with direct Array construction#9338Conversation
alamb
left a comment
There was a problem hiding this comment.
Looks nice to be @liamzwbao -- I think we should maintain the sliced() call here but otherwise 👍
| where | ||
| &'a S: StringArrayType<'a>, | ||
| { | ||
| let null_bit_buffer = array.nulls().map(|x| x.inner().sliced()); |
There was a problem hiding this comment.
The sliced call seems to have been lost 🤔
But all the tests pass
I am not sure how important it is to "materalize" the buffer like this , but I recommend we keep it the same
So something like
let nulls = array.nulls().sliced().filter(|n| n.null_count() > 0);There was a problem hiding this comment.
I think sliced was needed because when using ArrayData we must pass in a Buffer for nulls; if we don't do sliced then the potential offset which is stored at the BooleanBuffer level gets lost in translation. If we are moving away from ArrayData here we can simply pass the nulls as is as the slice becomes unnecessary.
There was a problem hiding this comment.
We don't have .sliced() unfortunately, but we could rebuild the NullBuffer
There was a problem hiding this comment.
My point is we should be able to pass the nullbuffer as is, without needing to explicitly slice it, I believe?
There was a problem hiding this comment.
If we are moving away from ArrayData here we can simply pass the nulls as is as the slice becomes unnecessary
This makes sense, will wait for further discussion before making changes
There was a problem hiding this comment.
I think it would be safest to leave the sliced() call in (avoid potential regressions) and make a follow on PR to try and remove it separately
There was a problem hiding this comment.
sliced should not be needed now, but its also safe to leave it in. I doubt we could show any performance impact in a benchmark since regex matching would be significantly slower than bit slicing.
arrow-string/src/substring.rs
Outdated
| array | ||
| .nulls() | ||
| .cloned() | ||
| .or_else(|| Some(NullBuffer::new_valid(num_of_elements))) |
There was a problem hiding this comment.
why does it need a Null buffer if the input didn't have a null buffer ?
There was a problem hiding this comment.
That's mainly because FixedSizeBinaryArray::new treats size == 0 specially: if both values and nulls are None, it infers len = 0.
arrow-rs/arrow-array/src/array/fixed_size_binary_array.rs
Lines 97 to 105 in 7291147
In substring::tests::without_nulls_fixed_size_binary, some cases produce size‑0 substrings, and without this change the result collapses to an empty array and fail the tests. Providing a nulls buffer here simply preserves the expected length to produce correct results (e.g., ["", "", ""]).
Jefffrey
left a comment
There was a problem hiding this comment.
I think we can simplify further by removing need of preserving sliced() which seems to be a relic of using ArrayData API
| where | ||
| &'a S: StringArrayType<'a>, | ||
| { | ||
| let null_bit_buffer = array.nulls().map(|x| x.inner().sliced()); |
There was a problem hiding this comment.
I think sliced was needed because when using ArrayData we must pass in a Buffer for nulls; if we don't do sliced then the potential offset which is stored at the BooleanBuffer level gets lost in translation. If we are moving away from ArrayData here we can simply pass the nulls as is as the slice becomes unnecessary.
| 0, | ||
| data_len, | ||
| ))) | ||
| .filter(|n| n.null_count() > 0); |
There was a problem hiding this comment.
I do wonder if there is a way to filter in advance, before the into() 🤔
Might be a minor optimization but worth thinking about perhaps?
There was a problem hiding this comment.
Good point! We can track number of non_nulls to see if we need to create a NullBuffer
| let data = val_builder.as_slice_mut(); | ||
|
|
||
| let null_slice = null_builder.as_slice_mut(); | ||
| let mut non_nulls = 0usize; |
There was a problem hiding this comment.
I was more thinking along the lines of if we can leverage information from null_builder (or perhaps explore using NullBufferBuilder), rather than have a separate variable to track this
There was a problem hiding this comment.
null_builder is just a MutableBuffer, so it doesn’t carry validity metadata we can get directly.
In terms of NullBufferBuilder, it’s not obvious it would outperform this approach. We could evaluate it with a benchmark in a separate PR.
There was a problem hiding this comment.
This is very performance critical code. I'll run some benchmarks to make sure this doesn't mess something up by accident
There was a problem hiding this comment.
Also it seems like this code previously didn't bother to check the null counts, so we could always leave the existing behavior as is 🤔
There was a problem hiding this comment.
NullBuffer::new already counts the set bits, counting the nulls separately here could allow the usage of NullBuffer::new_unchecked, but I'm not sure that will really improve performance. I agree that the performance optimizations should be left to a separate PR.
There was a problem hiding this comment.
Just checked that ArrayData::new_unchecked also filters out null buffers with all nulls, so the current code in the PR does not change any observable behavior.
There was a problem hiding this comment.
Given the discussion, I think it's better to limit the scope of this PR to simply replace the unsafe calls. I will revert the last commit and we could open a separate PR to benchmark the performance
alamb
left a comment
There was a problem hiding this comment.
Makes sense to me. Thank you @liamzwbao and @Jefffrey
I am just worried about the potential performance implications. I'll fire off some benchmarks to see
| where | ||
| &'a S: StringArrayType<'a>, | ||
| { | ||
| let null_bit_buffer = array.nulls().map(|x| x.inner().sliced()); |
There was a problem hiding this comment.
I think it would be safest to leave the sliced() call in (avoid potential regressions) and make a follow on PR to try and remove it separately
| .map(|b| NullBuffer::new(BooleanBuffer::new(b, 0, array.len()))) | ||
| .filter(|n| n.null_count() > 0); |
There was a problem hiding this comment.
This pattern of creating a BooleanBuffer from with a length and then filtering for nulls count is quite common (and becoming more so)
I wonder if we should (as a follow on PR) make a function to do it, maybe like
let Option<NullBuffer> = NullBuffer::try_from_unsliced(b, array.len())Then this code would look like
let nulls = array
.nulls()
.map(|n| n.inner().sliced())
.map(|b| NullBuffer::try_from_unsliced(b, array.len()🤔
| let data = val_builder.as_slice_mut(); | ||
|
|
||
| let null_slice = null_builder.as_slice_mut(); | ||
| let mut non_nulls = 0usize; |
There was a problem hiding this comment.
This is very performance critical code. I'll run some benchmarks to make sure this doesn't mess something up by accident
| let data = val_builder.as_slice_mut(); | ||
|
|
||
| let null_slice = null_builder.as_slice_mut(); | ||
| let mut non_nulls = 0usize; |
There was a problem hiding this comment.
Also it seems like this code previously didn't bother to check the null counts, so we could always leave the existing behavior as is 🤔
|
run benchmark array_from |
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
ArrayDatawith direct Array construction, when possible #9298.Rationale for this change
Reduce unsafe
ArrayData::new_uncheckedusage by switching to concrete array constructorsWhat changes are included in this PR?
BooleanArray,regexp,substring).FixedSizeBinaryArraylength whensize == 0and clarify the expectations of the function insubstring.Are these changes tested?
Covered by existing tests
Are there any user-facing changes?
No