perf: skip bound checks in take native for better performance#9277
perf: skip bound checks in take native for better performance#9277rluvaton wants to merge 4 commits intoapache:mainfrom
Conversation
|
run benchmark take |
|
🤖 Hi @rluvaton, thanks for the request (#9277 (comment)).
Please choose one or more of these with |
|
run benchmark take_kernels |
|
The failing tests are out of bounds access tests that is now undefined behavior rather than a panic |
arrow-select/src/take.rs
Outdated
| let index = index.as_usize(); | ||
| // Safety: we either checked already bounds (passed check_bounds = true) or the user | ||
| // guarantees the value to be in range. | ||
| // Avoiding bound checks allows the compiler to vectorize it and do better loop unrolling |
There was a problem hiding this comment.
We can't just do this, as we get the indices from the user / other safe ckdd (and we don't check bounds by default).
There was a problem hiding this comment.
This is why I want first to merge:
There was a problem hiding this comment.
maybe we can take the conversation about that there?
We can't just make it unsafe as it will make safe code UB. See also #8879 |
|
show benchmark queue |
|
🤖 Hi @rluvaton, you asked to view the benchmark queue (#9277 (comment)).
|
I will add validation before and see what is the performance impact |
|
Show benchmark queue |
|
🤖 Hi @rluvaton, you asked to view the benchmark queue (#9277 (comment)).
|
|
Show benchmark queue |
|
🤖 Hi @rluvaton, you asked to view the benchmark queue (#9277 (comment)).
|
|
I just unstuck the bot |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark take_kernels |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
I don't know how much I can trust the results as it says: almost 2 times faster for a branch of code that I did not touch. (it can be possible that this improved due to instruction locality but I don't know |
|
run benchmark take_kernels |
|
🤖 |
yeah I agree all of these numbers need to be carefully reviewed. I also found that for benchmarks such as this which execute in µs, the allocation pattern / allocator can cause substantial changes in the performance if other benchmarks do a different allocation pattern, for example |
|
🤖: Benchmark completed Details
|
|
run benchmark take_kernels |
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
None
Rationale for this change
Making take kernel faster for building native.
You can see that the compiler vectorize and unroll the loop in GodBolt when using unchecked
What changes are included in this PR?
use
get_uncheckedintake_nativeAre these changes tested?
Existing tests
Are there any user-facing changes?
Undefined behavior if out of range
See: