Orasort based sort kernels by vertexclique · Pull Request #9300 · apache/arrow-rs

vertexclique · 2026-01-29T15:23:43Z

Which issue does this PR close?

Implements the Orasort sorting algorithm for the sort kernels.

Rationale for this change

Orasort splices the keys by prefix and runs radix sort within the spliced chunks.
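
As a rough illustration of the idea described above (partition keys by a prefix, then sort within each partition), here is a minimal Rust sketch. It is not Orasort's implementation: `prefix_bucket_sort` is a hypothetical name, the partition uses only the first byte, and each bucket is finished with the standard library's comparison sort instead of further radix passes.

```rust
/// Illustrative only: one radix pass on the first byte, then a comparison
/// sort inside each bucket. The function name is hypothetical.
fn prefix_bucket_sort<'a>(keys: &[&'a [u8]]) -> Vec<&'a [u8]> {
    // Bucket 0 holds empty keys; bucket b + 1 holds keys whose first byte is b.
    let mut buckets: Vec<Vec<&'a [u8]>> = vec![Vec::new(); 257];
    for &key in keys {
        let slot = match key.first() {
            Some(&b) => b as usize + 1,
            None => 0,
        };
        buckets[slot].push(key);
    }

    // Buckets are already ordered by first byte, so concatenating the
    // individually sorted buckets yields a fully sorted result.
    let mut out = Vec::with_capacity(keys.len());
    for mut bucket in buckets {
        bucket.sort_unstable();
        out.extend(bucket);
    }
    out
}
```

Calling `prefix_bucket_sort` on a slice of byte-string keys returns them in lexicographic order. A real implementation would likely recurse on deeper prefixes and avoid allocating one `Vec` per bucket.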

What changes are included in this PR?

Adds Orasort sorting to the sort kernels and adapts the array buffers to prefix splices.

Are these changes tested?

Yes, the existing tests already cover it.
In addition, extra benchmarks are added to demonstrate the gain.

Are there any user-facing changes?

No.

Bench Results (main vs this branch)

     Running benches/sort_kernel.rs (target/release/deps/sort_kernel-64fa6c88d50ded54)
Benchmarking sort string[10] nulls 2^19: Collecting 100 samples in estimated 5.14
sort string[10] nulls 2^19
                        time:   [12.913 ms 12.964 ms 13.022 ms]
                        change: [−16.013% −15.570% −15.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking sort string[100] nulls 2^19: Collecting 100 samples in estimated 5.6
sort string[100] nulls 2^19
                        time:   [11.255 ms 11.299 ms 11.351 ms]
                        change: [−16.676% −16.115% −15.551%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Benchmarking sort string[1000] nulls 2^19: Collecting 100 samples in estimated 6.
sort string[1000] nulls 2^19
                        time:   [15.421 ms 15.493 ms 15.573 ms]
                        change: [−21.110% −20.594% −20.048%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

Orasort core implementation: https://github.com/psila-ai/orasort
Performance defaults of Orasort: https://github.com/psila-ai/orasort?tab=readme-ov-file#performance

Dandandan · 2026-01-30T07:11:38Z

Looks interesting - I am wondering if we can't do the same in arrow-rs without relying on a new dependency?
I.e. inline small values to avoid memory access for string values as well (as we do already for string view)

Dandandan · 2026-01-30T07:13:21Z

arrow-ord/src/sort.rs

+
+impl<'a> KeyAccessor for FixedSizeBinaryAccessor<'a> {
+    #[inline(always)]
+    fn get_key(&self, index: usize) -> &[u8] {


Should be marked unsafe
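
The body of `get_key` is not shown in the excerpt above, so the following is only a sketch of the general pattern such a review comment usually points at: a method that skips bounds checks is declared `unsafe fn` with a documented safety contract, and a checked wrapper is offered alongside it. `FixedWidthValues` and its methods are hypothetical names, not the PR's types.

```rust
/// Hypothetical accessor over fixed-width binary values stored back to back.
struct FixedWidthValues<'a> {
    data: &'a [u8],
    width: usize,
}

impl<'a> FixedWidthValues<'a> {
    /// Number of complete values in the buffer.
    fn len(&self) -> usize {
        if self.width == 0 { 0 } else { self.data.len() / self.width }
    }

    /// Returns the value at `index` without bounds checking.
    ///
    /// # Safety
    /// `index` must be less than `self.len()`.
    #[inline(always)]
    unsafe fn get_key_unchecked(&self, index: usize) -> &'a [u8] {
        let start = index * self.width;
        // SAFETY: the caller guarantees index < len(), so the range
        // start..start + width lies inside `data`.
        unsafe { self.data.get_unchecked(start..start + self.width) }
    }

    /// Checked wrapper that is safe to call with any index.
    fn get_key(&self, index: usize) -> Option<&'a [u8]> {
        if index < self.len() {
            // SAFETY: bounds were checked on the line above.
            Some(unsafe { self.get_key_unchecked(index) })
        } else {
            None
        }
    }
}
```

With this split, hot loops that have already validated their indices can call the unchecked method inside an `unsafe` block, while other callers keep the checked one.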

Dandandan · 2026-01-30T07:14:47Z

arrow-ord/src/sort.rs

-    let mut valids: Vec<(u32, u32, u64)> = value_indices
-        .into_iter()
-        .map(|idx| unsafe {
+    // Build (index, 8-byte prefix) tuples for prefix-accelerated comparison sort


What is the improvement without orasort on this?
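
For context, the "(index, 8-byte prefix)" technique named in the diff comment can be sketched independently of Orasort. The code below is an assumption-laden illustration, not the PR's code: `prefix_u64` and `sort_to_indices` are hypothetical helpers, and nulls are ignored. The prefix is packed big-endian and zero-padded so that comparing two `u64` values agrees with comparing the underlying bytes whenever the prefixes differ; only ties fall back to the full values.

```rust
use std::cmp::Ordering;

/// Packs the first (up to) 8 bytes of `key` into a big-endian u64,
/// zero-padded on the right. Hypothetical helper for illustration.
fn prefix_u64(key: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    let n = key.len().min(8);
    buf[..n].copy_from_slice(&key[..n]);
    u64::from_be_bytes(buf)
}

/// Returns the indices that would sort `values` lexicographically.
fn sort_to_indices(values: &[&[u8]]) -> Vec<u32> {
    // Build (index, prefix) tuples so most comparisons touch only the tuple.
    let mut keyed: Vec<(u32, u64)> = values
        .iter()
        .enumerate()
        .map(|(i, v)| (i as u32, prefix_u64(v)))
        .collect();

    keyed.sort_unstable_by(|a, b| match a.1.cmp(&b.1) {
        // Equal prefixes can hide differences in later bytes or in the
        // zero padding, so compare the full values in that case.
        Ordering::Equal => values[a.0 as usize].cmp(values[b.0 as usize]),
        other => other,
    });

    keyed.into_iter().map(|(i, _)| i).collect()
}
```

The fallback dereferences the original value, so inputs with long shared prefixes gain little; inputs that differ within the first 8 bytes avoid that memory access entirely.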

alamb · 2026-02-02T16:48:45Z

run benchmark sort_kernels

alamb-ghbot · 2026-02-02T16:49:57Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing vclq/orasort-sorting (17d678b) to a49af1d diff
BENCH_NAME=sort_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench sort_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=vclq_orasort-sorting
Results will be posted here when complete

alamb-ghbot · 2026-02-02T16:49:59Z

Benchmark script failed with exit code 101.

Last 10 lines of output:

 Downloading crates ...
  Downloaded cuneiform v0.1.1
  Downloaded slab v0.4.12
  Downloaded zmij v1.0.19
  Downloaded orasort v0.1.2
  Downloaded insta v1.46.3
  Downloaded hyper-util v0.1.20
error: no bench target named `sort_kernels` in default-run packages

help: a target with a similar name exists: `sort_kernel`

alamb · 2026-02-02T18:59:27Z

run benchmark sort_kernel

alamb-ghbot · 2026-02-02T18:59:49Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing vclq/orasort-sorting (17d678b) to a49af1d diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=vclq_orasort-sorting
Results will be posted here when complete

alamb-ghbot · 2026-02-02T19:19:38Z

🤖: Benchmark completed

Details

group                                                   main                                   vclq_orasort-sorting
-----                                                   ----                                   --------------------
lexsort (bool, bool) 2^12                               1.00    115.0±1.46µs        ? ?/sec    1.02    117.2±1.78µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    151.7±1.88µs        ? ?/sec    1.10    166.8±0.93µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.01     45.2±0.36µs        ? ?/sec    1.00     45.0±0.59µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.01    213.6±4.24µs        ? ?/sec    1.00    212.1±5.74µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.9±0.32µs        ? ?/sec    1.00     39.0±0.60µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.02     42.1±0.54µs        ? ?/sec    1.00     41.2±0.27µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.3±0.82µs        ? ?/sec    1.00     79.0±2.08µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.01    214.3±1.32µs        ? ?/sec    1.00    211.4±1.31µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     55.7±0.18µs        ? ?/sec    1.00     55.1±0.50µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.02    261.4±4.26µs        ? ?/sec    1.00    256.6±2.49µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.02     89.8±0.72µs        ? ?/sec    1.00     87.8±0.49µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.03     91.1±1.47µs        ? ?/sec    1.00     88.8±0.45µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.03    102.7±0.58µs        ? ?/sec    1.00     99.7±3.40µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.02    261.7±4.03µs        ? ?/sec    1.00    256.9±3.47µs        ? ?/sec
rank f32 2^12                                           1.06     73.1±0.66µs        ? ?/sec    1.00     68.8±1.53µs        ? ?/sec
rank f32 nulls 2^12                                     1.05     37.5±0.37µs        ? ?/sec    1.00     35.5±0.17µs        ? ?/sec
rank string[10] 2^12                                    1.02    256.1±3.72µs        ? ?/sec    1.00    250.6±3.26µs        ? ?/sec
rank string[10] nulls 2^12                              1.01    122.1±1.02µs        ? ?/sec    1.00    121.0±1.32µs        ? ?/sec
sort f32 2^12                                           1.00     70.0±2.07µs        ? ?/sec    1.00     69.7±0.91µs        ? ?/sec
sort f32 nulls 2^12                                     1.02     30.0±0.44µs        ? ?/sec    1.00     29.5±0.76µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     37.7±0.35µs        ? ?/sec    1.00     37.6±0.40µs        ? ?/sec
sort f32 to indices 2^12                                1.00     71.3±0.49µs        ? ?/sec    1.00     71.0±0.74µs        ? ?/sec
sort i32 2^10                                           1.10      8.6±0.05µs        ? ?/sec    1.00      7.8±0.06µs        ? ?/sec
sort i32 2^12                                           1.11     42.3±0.27µs        ? ?/sec    1.00     38.2±0.34µs        ? ?/sec
sort i32 nulls 2^10                                     1.06      5.0±0.06µs        ? ?/sec    1.00      4.7±0.02µs        ? ?/sec
sort i32 nulls 2^12                                     1.06     21.0±0.26µs        ? ?/sec    1.00     19.9±0.15µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.09      7.5±0.14µs        ? ?/sec    1.00      6.9±0.03µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.13     32.2±0.77µs        ? ?/sec    1.00     28.6±0.21µs        ? ?/sec
sort i32 to indices 2^10                                1.15     12.8±0.55µs        ? ?/sec    1.00     11.1±0.10µs        ? ?/sec
sort i32 to indices 2^12                                1.17     60.9±0.83µs        ? ?/sec    1.00     52.0±0.62µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.0±0.07µs        ? ?/sec    1.23      7.4±0.12µs        ? ?/sec
sort primitive run to indices 2^12                      1.10      8.3±0.13µs        ? ?/sec    1.00      7.5±0.08µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00     43.0±0.44µs        ? ?/sec    1.08     46.4±0.70µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00     90.5±0.41µs        ? ?/sec    1.09     98.4±1.32µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00     50.8±0.41µs        ? ?/sec    1.18     60.2±0.76µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    129.0±0.69µs        ? ?/sec    1.09    140.0±3.60µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00     42.9±0.21µs        ? ?/sec    1.10     47.1±0.64µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00     92.0±0.80µs        ? ?/sec    1.07     98.0±1.56µs        ? ?/sec
sort string[1000] nulls 2^19                                                                   1.00     14.4±0.30ms        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00     44.0±1.66µs        ? ?/sec    1.09     47.9±0.20µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00     89.0±1.81µs        ? ?/sec    1.08     96.1±1.76µs        ? ?/sec
sort string[100] nulls 2^19                                                                    1.00     11.7±0.20ms        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00     42.3±0.34µs        ? ?/sec    1.14     48.2±7.55µs        ? ?/sec
sort string[100] to indices 2^12                        1.00     88.5±1.67µs        ? ?/sec    1.08     96.0±0.51µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.01    152.2±1.63µs        ? ?/sec    1.00    150.3±0.47µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.02    318.9±5.12µs        ? ?/sec    1.00    314.0±2.88µs        ? ?/sec
sort string[10] nulls 2^19                                                                     1.00      9.6±0.29ms        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00     43.8±1.87µs        ? ?/sec    1.08     47.5±0.48µs        ? ?/sec
sort string[10] to indices 2^12                         1.00     88.1±1.05µs        ? ?/sec    1.08     95.0±0.93µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.18     55.7±1.25µs        ? ?/sec    1.00     47.2±0.25µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.24    120.4±5.89µs        ? ?/sec    1.00     97.2±1.13µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     44.9±1.02µs        ? ?/sec    1.01     45.5±0.53µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    103.7±0.51µs        ? ?/sec    1.03    106.5±0.60µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     42.9±0.69µs        ? ?/sec    1.01     43.5±0.78µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.01     94.2±1.06µs        ? ?/sec    1.00     93.4±2.25µs        ? ?/sec
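
A note on reading the table: the figures are consistent with the leading number in each column being that side's time divided by the faster of the two, so 1.00 marks the faster side and a larger value on a side means that side is slower. The sketch below reproduces that arithmetic under this assumption; `relative_factors` is a hypothetical helper, not part of the benchmark tooling.

```rust
/// Returns (main_factor, branch_factor), where each factor is that side's
/// time divided by the faster of the two times. Hypothetical helper.
fn relative_factors(main_time: f64, branch_time: f64) -> (f64, f64) {
    let fastest = main_time.min(branch_time);
    (main_time / fastest, branch_time / fastest)
}
```

For the row `sort primitive run 2^12` (6.0µs on main, 7.4µs on the branch) this gives roughly (1.00, 1.23), meaning the branch is about 23% slower there. For `sort i32 2^12` (42.3µs against 38.2µs) it gives roughly (1.11, 1.00), meaning the branch is faster.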

alamb · 2026-02-02T22:07:18Z

run benchmark sort_kernel

alamb-ghbot · 2026-02-02T22:07:24Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing vclq/orasort-sorting (17d678b) to a49af1d diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=vclq_orasort-sorting
Results will be posted here when complete

alamb-ghbot · 2026-02-02T22:26:38Z

🤖: Benchmark completed

Details

group                                                   main                                   vclq_orasort-sorting
-----                                                   ----                                   --------------------
lexsort (bool, bool) 2^12                               1.00    115.1±1.11µs        ? ?/sec    1.02    117.5±1.61µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    152.1±1.30µs        ? ?/sec    1.09    166.5±1.24µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.02     45.6±2.80µs        ? ?/sec    1.00     44.8±0.70µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    213.2±1.93µs        ? ?/sec    1.00    212.2±3.07µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.7±0.14µs        ? ?/sec    1.00     38.6±0.49µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.3±0.43µs        ? ?/sec    1.00     41.0±0.13µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.2±1.01µs        ? ?/sec    1.00     79.0±0.94µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.01    213.2±1.75µs        ? ?/sec    1.00    211.8±1.10µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     55.6±0.52µs        ? ?/sec    1.00     54.9±0.99µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.02    261.1±2.07µs        ? ?/sec    1.00    256.3±1.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.01     89.4±1.35µs        ? ?/sec    1.00     88.4±2.47µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.02     90.6±1.72µs        ? ?/sec    1.00     88.9±0.54µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.03    102.1±0.92µs        ? ?/sec    1.00     99.2±1.37µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.02    261.1±1.58µs        ? ?/sec    1.00    256.5±1.48µs        ? ?/sec
rank f32 2^12                                           1.06     73.1±1.03µs        ? ?/sec    1.00     69.0±1.05µs        ? ?/sec
rank f32 nulls 2^12                                     1.05     37.4±0.20µs        ? ?/sec    1.00     35.7±0.32µs        ? ?/sec
rank string[10] 2^12                                    1.02    256.4±1.91µs        ? ?/sec    1.00    250.8±1.90µs        ? ?/sec
rank string[10] nulls 2^12                              1.01    122.6±1.68µs        ? ?/sec    1.00    121.4±2.82µs        ? ?/sec
sort f32 2^12                                           1.00     70.0±0.66µs        ? ?/sec    1.00     69.8±0.50µs        ? ?/sec
sort f32 nulls 2^12                                     1.02     30.0±0.27µs        ? ?/sec    1.00     29.4±0.34µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     37.7±0.41µs        ? ?/sec    1.00     37.7±0.57µs        ? ?/sec
sort f32 to indices 2^12                                1.01     71.2±0.24µs        ? ?/sec    1.00     70.7±0.47µs        ? ?/sec
sort i32 2^10                                           1.10      8.6±0.03µs        ? ?/sec    1.00      7.8±0.06µs        ? ?/sec
sort i32 2^12                                           1.13     42.8±2.35µs        ? ?/sec    1.00     37.9±0.31µs        ? ?/sec
sort i32 nulls 2^10                                     1.05      5.0±0.02µs        ? ?/sec    1.00      4.8±0.08µs        ? ?/sec
sort i32 nulls 2^12                                     1.13     22.5±4.68µs        ? ?/sec    1.00     19.9±0.12µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.09      7.5±0.04µs        ? ?/sec    1.00      6.9±0.04µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.12     32.1±0.23µs        ? ?/sec    1.00     28.6±0.20µs        ? ?/sec
sort i32 to indices 2^10                                1.14     12.7±0.07µs        ? ?/sec    1.00     11.1±0.06µs        ? ?/sec
sort i32 to indices 2^12                                1.16     60.6±0.40µs        ? ?/sec    1.00     52.2±1.17µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.0±0.03µs        ? ?/sec    1.22      7.3±0.08µs        ? ?/sec
sort primitive run to indices 2^12                      1.10      8.3±0.05µs        ? ?/sec    1.00      7.5±0.08µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00     42.8±0.53µs        ? ?/sec    1.09     46.6±0.39µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00     90.6±0.55µs        ? ?/sec    1.09     98.6±0.61µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00     51.2±2.37µs        ? ?/sec    1.18     60.2±0.49µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    129.2±1.18µs        ? ?/sec    1.08    139.7±2.87µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00     42.9±0.25µs        ? ?/sec    1.09     46.8±0.36µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00     92.6±0.79µs        ? ?/sec    1.06     97.9±2.17µs        ? ?/sec
sort string[1000] nulls 2^19                                                                   1.00     14.8±0.27ms        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00     43.7±0.56µs        ? ?/sec    1.10     47.9±1.09µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00     89.0±1.50µs        ? ?/sec    1.08     96.5±2.46µs        ? ?/sec
sort string[100] nulls 2^19                                                                    1.00     11.6±0.15ms        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00     42.4±0.48µs        ? ?/sec    1.09     46.4±0.33µs        ? ?/sec
sort string[100] to indices 2^12                        1.00     88.3±0.87µs        ? ?/sec    1.09     96.3±0.90µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    152.4±1.22µs        ? ?/sec    1.00    152.0±1.43µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.01    316.5±2.47µs        ? ?/sec    1.00    313.6±3.50µs        ? ?/sec
sort string[10] nulls 2^19                                                                     1.00      9.6±0.10ms        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00     43.3±0.24µs        ? ?/sec    1.09     47.2±1.13µs        ? ?/sec
sort string[10] to indices 2^12                         1.00     87.8±0.40µs        ? ?/sec    1.09     95.3±3.09µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.17     55.4±0.28µs        ? ?/sec    1.00     47.2±0.84µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.23    119.5±0.52µs        ? ?/sec    1.00     96.9±0.53µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     45.4±1.18µs        ? ?/sec    1.00     45.4±0.33µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.7±2.45µs        ? ?/sec    1.02    106.9±1.93µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     42.8±0.87µs        ? ?/sec    1.01     43.1±0.31µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.01     93.9±1.16µs        ? ?/sec    1.00     93.3±1.50µs        ? ?/sec

Dandandan · 2026-02-03T19:46:33Z

Seems not to reproduce on the VM 🤔 perhaps machine-dependent

vertexclique · 2026-02-04T00:13:34Z

I don't understand the benchmarks, can someone explain it to me?
I see 1.23 on main and 1 on this branch. In both runs, main seems to show values above 1.
What am I missing?

vertexclique added 7 commits January 28, 2026 15:38

Orasort in arrow-ord kernels

e242544

sort key splicing

1802f0c

Hybrid dispatch

12c6fd7

remove indice based accessors

159b3da

Splice cutting

c00c929

Use latest orasort

f451076

use constants for the prefix sifts

404eb5c

github-actions bot added the arrow label (Changes to the arrow crate) Jan 29, 2026

Splice range for bit shakes

17d678b

Dandandan reviewed Jan 30, 2026


4 participants
