perf: improve GenericByteBuilder::append_array to use SIMD for extending the offsets #8388
Conversation
Changing from:
```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);
for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```
to:
```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];
for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```
improves the performance of concatenating byte arrays by roughly 8% to 50% on my local machine (note: the Criterion comparison below appears to have been run in the opposite direction, which is why it reports each change as a regression):
```bash
concat str 1024 time: [7.2598 µs 7.2772 µs 7.2957 µs]
change: [+12.545% +13.070% +13.571%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
concat str nulls 1024 time: [4.6791 µs 4.6895 µs 4.7010 µs]
change: [+23.206% +23.792% +24.425%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
concat 1024 arrays str 4
time: [45.018 µs 45.213 µs 45.442 µs]
change: [+6.4195% +8.7377% +11.279%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) high mild
7 (7.00%) high severe
concat str 8192 over 100 arrays
time: [3.7561 ms 3.7814 ms 3.8086 ms]
change: [+25.394% +26.833% +28.370%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
concat str nulls 8192 over 100 arrays
time: [2.3144 ms 2.3269 ms 2.3403 ms]
change: [+51.533% +52.826% +54.109%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
```
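These numbers come from the existing Criterion concat benchmarks. Assuming the bench target is named `concatenate_kernel` (the target name is an assumption, check `arrow/benches/`), something like this should reproduce them locally:
```bash
# Run only the concat benchmarks; the --bench target name is an assumption.
cargo bench --bench concatenate_kernel -- concat
```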
When looking at the assembly
> Generated with rustc 1.89.0 and the compiler flags `-C opt-level=2 -C target-feature=+avx2 -C codegen-units=1` on [Godbolt](https://godbolt.org/)

you can see that for the old code:
```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);
for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```
the assembly for the loop is:
```asm
.LBB3_22:
mov rbx, qword ptr [r13 + 8*rbp + 8]
add rbx, r15
cmp rbp, qword ptr [rsp]
jne .LBB3_25
mov rdi, rsp
lea rsi, [rip + .Lanon.da681cffc384a5add117668a344b291b.6]
call qword ptr [rip + alloc::raw_vec::RawVec<T,A>::grow_one::ha1b398ade64b0727@GOTPCREL]
mov r14, qword ptr [rsp + 8]
jmp .LBB3_25
.LBB3_25:
mov qword ptr [r14 + 8*rbp], rbx
inc rbp
mov qword ptr [rsp + 16], rbp
add r12, -8
je .LBB3_9
```
and for the new code:
```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];
for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```
the assembly for the loop is:
```asm
.LBB2_21:
vpaddq ymm1, ymm0, ymmword ptr [r12 + 8*rdx + 8]
vpaddq ymm2, ymm0, ymmword ptr [r12 + 8*rdx + 40]
vpaddq ymm3, ymm0, ymmword ptr [r12 + 8*rdx + 72]
vpaddq ymm4, ymm0, ymmword ptr [r12 + 8*rdx + 104]
vmovdqu ymmword ptr [rbx + 8*rdx], ymm1
vmovdqu ymmword ptr [rbx + 8*rdx + 32], ymm2
vmovdqu ymmword ptr [rbx + 8*rdx + 64], ymm3
vmovdqu ymmword ptr [rbx + 8*rdx + 96], ymm4
add rdx, 16
cmp rax, rdx
jne .LBB2_21
```
which uses SIMD (AVX2 `vpaddq`/`vmovdqu`) instructions. The difference is that the `push` loop has to check capacity and potentially call `grow_one` on every iteration, which prevents auto-vectorization, while the pre-sized version is a plain indexed loop that LLVM can unroll and vectorize.
The code that I compiled on Godbolt:
For the old code:
```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);
    let shift: T = next_offset + offsets[0];
    let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
            mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for &offset in &offsets[1..] {
        intermediate.push(offset + shift)
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
            mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
    extend_offsets(output, offsets, next_offset);
}
```
And for the new code:
```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);
    let shift: T = next_offset + offsets[0];
    let mut intermediate = vec![T::default(); offsets.len() - 1];

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
            mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for (index, &offset) in offsets[1..].iter().enumerate() {
        intermediate[index] = offset + shift
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
            mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
    extend_offsets(output, offsets, next_offset);
}
```
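To inspect the same assembly locally instead of on Godbolt, something along these lines should work (the file name `extend_offsets.rs` is just an assumption for this example):
```bash
# Emit Intel-syntax assembly for the snippet above with the same flags used on Godbolt.
rustc --crate-type=lib --emit=asm \
      -C opt-level=2 -C target-feature=+avx2 -C codegen-units=1 \
      -C llvm-args=-x86-asm-syntax=intel \
      extend_offsets.rs
```
Then search the generated `extend_offsets.s` for the `VECTORIZED_START` marker.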
Thanks @rluvaton -- I have scheduled a benchmark run.
🤖: Benchmark completed Details
Those are some pretty impressive results 👍 thank you @rluvaton
Is it possible that the benchmark is not running with target cpu native? |
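(For reference, one way to make sure the benchmark build picks up the host's full SIMD feature set is to pass `target-cpu=native` through `RUSTFLAGS`; the bench target name below is the same assumption as above:)
```bash
# Build and run the benches with the host CPU's native features (AVX2 etc.) enabled.
RUSTFLAGS="-C target-cpu=native" cargo bench --bench concatenate_kernel -- concat
```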
I've updated the code to no longer have intermediate buffer AND use SIMD
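For context, a minimal sketch of what appending shifted offsets without an intermediate buffer can look like, using a simplified `Vec`-based signature (the actual builder code in this PR uses Arrow's internal buffer types, so this is illustrative only):
```rust
/// Sketch only: append the shifted offsets straight into `output`.
/// `Vec::extend` with an exact-size (`TrustedLen`) iterator typically reserves
/// the full length once up front, so the inner loop has no per-element
/// capacity check and LLVM can still auto-vectorize it.
fn append_shifted_offsets<T>(output: &mut Vec<T>, offsets: &[T], shift: T)
where
    T: std::ops::Add<Output = T> + Copy,
{
    // Like the snippets above, this assumes `offsets` is non-empty.
    output.extend(offsets[1..].iter().map(|&offset| offset + shift));
}
```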
🤖: Benchmark completed Details
this is so good. Thank you @rluvaton
Which issue does this PR close?
N/A
Rationale for this change
Just making things faster
What changes are included in this PR?
Explained below
Are these changes tested?
Existing tests
Are there any user-facing changes?
Nope