perf: improve GenericByteBuilder::append_array to use SIMD for extending the offsets #8388
Conversation
Changing from:
```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);
for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```
to:
```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];
for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```
improves the performance of concatenating byte arrays by roughly 8% to 50% on my local machine (note: the Criterion comparison below appears to have been run in the opposite direction, which is why it reports each change as a regression):
```bash
concat str 1024 time: [7.2598 µs 7.2772 µs 7.2957 µs]
change: [+12.545% +13.070% +13.571%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
concat str nulls 1024 time: [4.6791 µs 4.6895 µs 4.7010 µs]
change: [+23.206% +23.792% +24.425%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
concat 1024 arrays str 4
time: [45.018 µs 45.213 µs 45.442 µs]
change: [+6.4195% +8.7377% +11.279%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) high mild
7 (7.00%) high severe
concat str 8192 over 100 arrays
time: [3.7561 ms 3.7814 ms 3.8086 ms]
change: [+25.394% +26.833% +28.370%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
concat str nulls 8192 over 100 arrays
time: [2.3144 ms 2.3269 ms 2.3403 ms]
change: [+51.533% +52.826% +54.109%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
```
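These numbers come from the existing Criterion concat benchmarks. Assuming the bench target is named `concatenate_kernel` (the target name is an assumption, check `arrow/benches/`), something like this should reproduce them locally:
```bash
# Run only the concat benchmarks; the --bench target name is an assumption.
cargo bench --bench concatenate_kernel -- concat
```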
When looking at the assembly
> Generated with rustc 1.89.0 and the compiler flags `-C opt-level=2 -C target-feature=+avx2 -C codegen-units=1` on [Godbolt](https://godbolt.org/)

you can see that for the old code:
```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);
for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```
the assembly for the loop is:
```asm
.LBB3_22:
mov rbx, qword ptr [r13 + 8*rbp + 8]
add rbx, r15
cmp rbp, qword ptr [rsp]
jne .LBB3_25
mov rdi, rsp
lea rsi, [rip + .Lanon.da681cffc384a5add117668a344b291b.6]
call qword ptr [rip + alloc::raw_vec::RawVec<T,A>::grow_one::ha1b398ade64b0727@GOTPCREL]
mov r14, qword ptr [rsp + 8]
jmp .LBB3_25
.LBB3_25:
mov qword ptr [r14 + 8*rbp], rbx
inc rbp
mov qword ptr [rsp + 16], rbp
add r12, -8
je .LBB3_9
```
and for the new code:
```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];
for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```
the assembly for the loop is:
```asm
.LBB2_21:
vpaddq ymm1, ymm0, ymmword ptr [r12 + 8*rdx + 8]
vpaddq ymm2, ymm0, ymmword ptr [r12 + 8*rdx + 40]
vpaddq ymm3, ymm0, ymmword ptr [r12 + 8*rdx + 72]
vpaddq ymm4, ymm0, ymmword ptr [r12 + 8*rdx + 104]
vmovdqu ymmword ptr [rbx + 8*rdx], ymm1
vmovdqu ymmword ptr [rbx + 8*rdx + 32], ymm2
vmovdqu ymmword ptr [rbx + 8*rdx + 64], ymm3
vmovdqu ymmword ptr [rbx + 8*rdx + 96], ymm4
add rdx, 16
cmp rax, rdx
jne .LBB2_21
```
which uses SIMD (AVX2 `vpaddq`/`vmovdqu`) instructions. The difference is that the `push` loop has to check capacity and potentially call `grow_one` on every iteration, which prevents auto-vectorization, while the pre-sized version is a plain indexed loop that LLVM can unroll and vectorize.
The code that I compiled on Godbolt:
For the old code:
```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);
    let shift: T = next_offset + offsets[0];
    let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
            mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for &offset in &offsets[1..] {
        intermediate.push(offset + shift)
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
            mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
    extend_offsets(output, offsets, next_offset);
}
```
And for the new code:
```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);
    let shift: T = next_offset + offsets[0];
    let mut intermediate = vec![T::default(); offsets.len() - 1];

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
            mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for (index, &offset) in offsets[1..].iter().enumerate() {
        intermediate[index] = offset + shift
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
            mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
    extend_offsets(output, offsets, next_offset);
}
```
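To inspect the same assembly locally instead of on Godbolt, something along these lines should work (the file name `extend_offsets.rs` is just an assumption for this example):
```bash
# Emit Intel-syntax assembly for the snippet above with the same flags used on Godbolt.
rustc --crate-type=lib --emit=asm \
      -C opt-level=2 -C target-feature=+avx2 -C codegen-units=1 \
      -C llvm-args=-x86-asm-syntax=intel \
      extend_offsets.rs
```
Then search the generated `extend_offsets.s` for the `VECTORIZED_START` marker.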
Thanks @rluvaton -- I have scheduled a benchmark run.
🤖: Benchmark completed Details
Those are some pretty impressive results 👍 thank you @rluvaton
Is it possible that the benchmark is not running with target cpu native? |
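(For reference, one way to make sure the benchmark build picks up the host's full SIMD feature set is to pass `target-cpu=native` through `RUSTFLAGS`; the bench target name below is the same assumption as above:)
```bash
# Build and run the benches with the host CPU's native features (AVX2 etc.) enabled.
RUSTFLAGS="-C target-cpu=native" cargo bench --bench concatenate_kernel -- concat
```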
I've updated the code to no longer have intermediate buffer AND use SIMD
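For context, a minimal sketch of what appending shifted offsets without an intermediate buffer can look like, using a simplified `Vec`-based signature (the actual builder code in this PR uses Arrow's internal buffer types, so this is illustrative only):
```rust
/// Sketch only: append the shifted offsets straight into `output`.
/// `Vec::extend` with an exact-size (`TrustedLen`) iterator typically reserves
/// the full length once up front, so the inner loop has no per-element
/// capacity check and LLVM can still auto-vectorize it.
fn append_shifted_offsets<T>(output: &mut Vec<T>, offsets: &[T], shift: T)
where
    T: std::ops::Add<Output = T> + Copy,
{
    // Like the snippets above, this assumes `offsets` is non-empty.
    output.extend(offsets[1..].iter().map(|&offset| offset + shift));
}
```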
🤖: Benchmark completed Details
this is so good. Thank you @rluvaton
Which issue does this PR close?
N/A
Rationale for this change
Just making things faster
What changes are included in this PR?
Explained below
Are these changes tested?
Existing tests
Are there any user-facing changes?
Nope