Orasort based sort kernels #9300

Open
vertexclique wants to merge 8 commits into apache:main from vertexclique:vclq/orasort-sorting

@vertexclique
Contributor

vertexclique commented Jan 29, 2026

Which issue does this PR close?

  • Implements the Orasort sorting algorithm for the sort kernels.

Rationale for this change

Orasort splices the input by key prefix and runs a radix sort within each spliced chunk.

What changes are included in this PR?

Adds Orasort-based sorting and adapts it to prefix splices over array buffers.

Are these changes tested?

Yes, the existing tests already cover it.
In addition, extra benchmarks are added to demonstrate the gain.

Are there any user-facing changes?

No.

Bench Results (main vs this branch)

     Running benches/sort_kernel.rs (target/release/deps/sort_kernel-64fa6c88d50ded54)
sort string[10] nulls 2^19
                        time:   [12.913 ms 12.964 ms 13.022 ms]
                        change: [−16.013% −15.570% −15.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

sort string[100] nulls 2^19
                        time:   [11.255 ms 11.299 ms 11.351 ms]
                        change: [−16.676% −16.115% −15.551%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

sort string[1000] nulls 2^19
                        time:   [15.421 ms 15.493 ms 15.573 ms]
                        change: [−21.110% −20.594% −20.048%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

Orasort core implementation: https://github.com/psila-ai/orasort
Orasort performance defaults: https://github.com/psila-ai/orasort?tab=readme-ov-file#performance

github-actions bot added the "arrow" label (Changes to the arrow crate) on Jan 29, 2026
@Dandandan
Contributor

Looks interesting. I wonder whether we could do the same in arrow-rs without relying on a new dependency,
i.e. inline small values to avoid memory accesses for string values as well (as we already do for string view).
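
A minimal sketch of what such dependency-free inlining could look like: each index is paired with an 8-byte big-endian prefix of its string, so most comparisons are decided on the inlined `u64` and only ties fall back to the full values. This is an illustration under stated assumptions, not the code in this PR or in arrow-rs; `prefix8` and `sort_to_indices` are hypothetical names, and a plain `&[&str]` stands in for an Arrow string array.

```rust
/// Pack the first (up to) 8 bytes of `s` into a big-endian u64, zero-padded.
/// Big-endian packing makes integer order agree with byte-wise order.
fn prefix8(s: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    let n = s.len().min(8);
    buf[..n].copy_from_slice(&s[..n]);
    u64::from_be_bytes(buf)
}

/// Return the indices that sort `values` in byte-wise lexicographic order.
/// (Hypothetical helper; not an arrow-rs API.)
fn sort_to_indices(values: &[&str]) -> Vec<u32> {
    // Inline the prefix next to the index so the common case
    // never dereferences the string data.
    let mut keyed: Vec<(u64, u32)> = values
        .iter()
        .enumerate()
        .map(|(i, v)| (prefix8(v.as_bytes()), i as u32))
        .collect();

    keyed.sort_unstable_by(|a, b| {
        // Compare prefixes first; only on a tie look at the full values.
        a.0.cmp(&b.0).then_with(|| {
            values[a.1 as usize]
                .as_bytes()
                .cmp(values[b.1 as usize].as_bytes())
        })
    });

    keyed.into_iter().map(|(_, i)| i).collect()
}

fn main() {
    let values = ["pear", "apple", "applesauce!", "fig"];
    // prints [1, 2, 3, 0]
    println!("{:?}", sort_to_indices(&values));
}
```

The fallback comparison is what keeps the result correct when two values share the same first 8 bytes or differ only by trailing zero bytes.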


impl<'a> KeyAccessor for FixedSizeBinaryAccessor<'a> {
    #[inline(always)]
    fn get_key(&self, index: usize) -> &[u8] {

Should be marked unsafe
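
One possible reading of this note, assuming the accessor skips bounds checks (the thread does not say so explicitly): an unchecked lookup can be exposed as an `unsafe fn` with a documented contract, with a safe wrapper on top. The trait and the `FixedWidthKeys` type below are hypothetical stand-ins for illustration, not the definitions used in this PR.

```rust
/// Hypothetical accessor trait, for illustration only.
trait KeyAccessor {
    /// Number of keys available.
    fn len(&self) -> usize;

    /// Returns the key bytes at `index` without bounds checking.
    ///
    /// # Safety
    /// `index` must be less than `self.len()`.
    unsafe fn get_key_unchecked(&self, index: usize) -> &[u8];

    /// Safe wrapper that checks the index first.
    fn get_key(&self, index: usize) -> Option<&[u8]> {
        if index < self.len() {
            // SAFETY: `index` was checked against `len()` just above.
            Some(unsafe { self.get_key_unchecked(index) })
        } else {
            None
        }
    }
}

/// Fixed-width keys stored back to back in one buffer (hypothetical type).
struct FixedWidthKeys<'a> {
    data: &'a [u8],
    width: usize,
}

impl<'a> KeyAccessor for FixedWidthKeys<'a> {
    fn len(&self) -> usize {
        if self.width == 0 { 0 } else { self.data.len() / self.width }
    }

    unsafe fn get_key_unchecked(&self, index: usize) -> &[u8] {
        let start = index * self.width;
        // SAFETY: the caller guarantees `index < len()`, so the range
        // `start..start + width` lies within `data`.
        unsafe { self.data.get_unchecked(start..start + self.width) }
    }
}

fn main() {
    let keys = FixedWidthKeys { data: b"aabbcc", width: 2 };
    assert_eq!(keys.get_key(1), Some(&b"bb"[..]));
    assert_eq!(keys.get_key(3), None);
}
```

Marking the unchecked method `unsafe` moves the obligation to the call site, where the sort loop already knows its indices are in range.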

let mut valids: Vec<(u32, u32, u64)> = value_indices
    .into_iter()
    .map(|idx| unsafe {
        // Build (index, 8-byte prefix) tuples for prefix-accelerated comparison sort

What is the improvement without orasort on this?

@alamb
Contributor

alamb commented Feb 2, 2026

run benchmark sort_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing vclq/orasort-sorting (17d678b) to a49af1d diff
BENCH_NAME=sort_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench sort_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=vclq_orasort-sorting
Results will be posted here when complete

@alamb-ghbot

Benchmark script failed with exit code 101.

Last 10 lines of output:

 Downloading crates ...
  Downloaded cuneiform v0.1.1
  Downloaded slab v0.4.12
  Downloaded zmij v1.0.19
  Downloaded orasort v0.1.2
  Downloaded insta v1.46.3
  Downloaded hyper-util v0.1.20
error: no bench target named `sort_kernels` in default-run packages

help: a target with a similar name exists: `sort_kernel`

@alamb
Contributor

alamb commented Feb 2, 2026

run benchmark sort_kernel

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing vclq/orasort-sorting (17d678b) to a49af1d diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=vclq_orasort-sorting
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

group                                                   main                                   vclq_orasort-sorting
-----                                                   ----                                   --------------------
lexsort (bool, bool) 2^12                               1.00    115.0±1.46µs        ? ?/sec    1.02    117.2±1.78µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    151.7±1.88µs        ? ?/sec    1.10    166.8±0.93µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.01     45.2±0.36µs        ? ?/sec    1.00     45.0±0.59µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.01    213.6±4.24µs        ? ?/sec    1.00    212.1±5.74µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.9±0.32µs        ? ?/sec    1.00     39.0±0.60µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.02     42.1±0.54µs        ? ?/sec    1.00     41.2±0.27µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.3±0.82µs        ? ?/sec    1.00     79.0±2.08µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.01    214.3±1.32µs        ? ?/sec    1.00    211.4±1.31µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     55.7±0.18µs        ? ?/sec    1.00     55.1±0.50µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.02    261.4±4.26µs        ? ?/sec    1.00    256.6±2.49µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.02     89.8±0.72µs        ? ?/sec    1.00     87.8±0.49µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.03     91.1±1.47µs        ? ?/sec    1.00     88.8±0.45µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.03    102.7±0.58µs        ? ?/sec    1.00     99.7±3.40µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.02    261.7±4.03µs        ? ?/sec    1.00    256.9±3.47µs        ? ?/sec
rank f32 2^12                                           1.06     73.1±0.66µs        ? ?/sec    1.00     68.8±1.53µs        ? ?/sec
rank f32 nulls 2^12                                     1.05     37.5±0.37µs        ? ?/sec    1.00     35.5±0.17µs        ? ?/sec
rank string[10] 2^12                                    1.02    256.1±3.72µs        ? ?/sec    1.00    250.6±3.26µs        ? ?/sec
rank string[10] nulls 2^12                              1.01    122.1±1.02µs        ? ?/sec    1.00    121.0±1.32µs        ? ?/sec
sort f32 2^12                                           1.00     70.0±2.07µs        ? ?/sec    1.00     69.7±0.91µs        ? ?/sec
sort f32 nulls 2^12                                     1.02     30.0±0.44µs        ? ?/sec    1.00     29.5±0.76µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     37.7±0.35µs        ? ?/sec    1.00     37.6±0.40µs        ? ?/sec
sort f32 to indices 2^12                                1.00     71.3±0.49µs        ? ?/sec    1.00     71.0±0.74µs        ? ?/sec
sort i32 2^10                                           1.10      8.6±0.05µs        ? ?/sec    1.00      7.8±0.06µs        ? ?/sec
sort i32 2^12                                           1.11     42.3±0.27µs        ? ?/sec    1.00     38.2±0.34µs        ? ?/sec
sort i32 nulls 2^10                                     1.06      5.0±0.06µs        ? ?/sec    1.00      4.7±0.02µs        ? ?/sec
sort i32 nulls 2^12                                     1.06     21.0±0.26µs        ? ?/sec    1.00     19.9±0.15µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.09      7.5±0.14µs        ? ?/sec    1.00      6.9±0.03µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.13     32.2±0.77µs        ? ?/sec    1.00     28.6±0.21µs        ? ?/sec
sort i32 to indices 2^10                                1.15     12.8±0.55µs        ? ?/sec    1.00     11.1±0.10µs        ? ?/sec
sort i32 to indices 2^12                                1.17     60.9±0.83µs        ? ?/sec    1.00     52.0±0.62µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.0±0.07µs        ? ?/sec    1.23      7.4±0.12µs        ? ?/sec
sort primitive run to indices 2^12                      1.10      8.3±0.13µs        ? ?/sec    1.00      7.5±0.08µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00     43.0±0.44µs        ? ?/sec    1.08     46.4±0.70µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00     90.5±0.41µs        ? ?/sec    1.09     98.4±1.32µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00     50.8±0.41µs        ? ?/sec    1.18     60.2±0.76µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    129.0±0.69µs        ? ?/sec    1.09    140.0±3.60µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00     42.9±0.21µs        ? ?/sec    1.10     47.1±0.64µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00     92.0±0.80µs        ? ?/sec    1.07     98.0±1.56µs        ? ?/sec
sort string[1000] nulls 2^19                                                                   1.00     14.4±0.30ms        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00     44.0±1.66µs        ? ?/sec    1.09     47.9±0.20µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00     89.0±1.81µs        ? ?/sec    1.08     96.1±1.76µs        ? ?/sec
sort string[100] nulls 2^19                                                                    1.00     11.7±0.20ms        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00     42.3±0.34µs        ? ?/sec    1.14     48.2±7.55µs        ? ?/sec
sort string[100] to indices 2^12                        1.00     88.5±1.67µs        ? ?/sec    1.08     96.0±0.51µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.01    152.2±1.63µs        ? ?/sec    1.00    150.3±0.47µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.02    318.9±5.12µs        ? ?/sec    1.00    314.0±2.88µs        ? ?/sec
sort string[10] nulls 2^19                                                                     1.00      9.6±0.29ms        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00     43.8±1.87µs        ? ?/sec    1.08     47.5±0.48µs        ? ?/sec
sort string[10] to indices 2^12                         1.00     88.1±1.05µs        ? ?/sec    1.08     95.0±0.93µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.18     55.7±1.25µs        ? ?/sec    1.00     47.2±0.25µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.24    120.4±5.89µs        ? ?/sec    1.00     97.2±1.13µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     44.9±1.02µs        ? ?/sec    1.01     45.5±0.53µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    103.7±0.51µs        ? ?/sec    1.03    106.5±0.60µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     42.9±0.69µs        ? ?/sec    1.01     43.5±0.78µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.01     94.2±1.06µs        ? ?/sec    1.00     93.4±2.25µs        ? ?/sec

@alamb
Contributor

alamb commented Feb 2, 2026

run benchmark sort_kernel

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing vclq/orasort-sorting (17d678b) to a49af1d diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=vclq_orasort-sorting
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

group                                                   main                                   vclq_orasort-sorting
-----                                                   ----                                   --------------------
lexsort (bool, bool) 2^12                               1.00    115.1±1.11µs        ? ?/sec    1.02    117.5±1.61µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    152.1±1.30µs        ? ?/sec    1.09    166.5±1.24µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.02     45.6±2.80µs        ? ?/sec    1.00     44.8±0.70µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    213.2±1.93µs        ? ?/sec    1.00    212.2±3.07µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.7±0.14µs        ? ?/sec    1.00     38.6±0.49µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.3±0.43µs        ? ?/sec    1.00     41.0±0.13µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.2±1.01µs        ? ?/sec    1.00     79.0±0.94µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.01    213.2±1.75µs        ? ?/sec    1.00    211.8±1.10µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     55.6±0.52µs        ? ?/sec    1.00     54.9±0.99µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.02    261.1±2.07µs        ? ?/sec    1.00    256.3±1.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.01     89.4±1.35µs        ? ?/sec    1.00     88.4±2.47µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.02     90.6±1.72µs        ? ?/sec    1.00     88.9±0.54µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.03    102.1±0.92µs        ? ?/sec    1.00     99.2±1.37µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.02    261.1±1.58µs        ? ?/sec    1.00    256.5±1.48µs        ? ?/sec
rank f32 2^12                                           1.06     73.1±1.03µs        ? ?/sec    1.00     69.0±1.05µs        ? ?/sec
rank f32 nulls 2^12                                     1.05     37.4±0.20µs        ? ?/sec    1.00     35.7±0.32µs        ? ?/sec
rank string[10] 2^12                                    1.02    256.4±1.91µs        ? ?/sec    1.00    250.8±1.90µs        ? ?/sec
rank string[10] nulls 2^12                              1.01    122.6±1.68µs        ? ?/sec    1.00    121.4±2.82µs        ? ?/sec
sort f32 2^12                                           1.00     70.0±0.66µs        ? ?/sec    1.00     69.8±0.50µs        ? ?/sec
sort f32 nulls 2^12                                     1.02     30.0±0.27µs        ? ?/sec    1.00     29.4±0.34µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     37.7±0.41µs        ? ?/sec    1.00     37.7±0.57µs        ? ?/sec
sort f32 to indices 2^12                                1.01     71.2±0.24µs        ? ?/sec    1.00     70.7±0.47µs        ? ?/sec
sort i32 2^10                                           1.10      8.6±0.03µs        ? ?/sec    1.00      7.8±0.06µs        ? ?/sec
sort i32 2^12                                           1.13     42.8±2.35µs        ? ?/sec    1.00     37.9±0.31µs        ? ?/sec
sort i32 nulls 2^10                                     1.05      5.0±0.02µs        ? ?/sec    1.00      4.8±0.08µs        ? ?/sec
sort i32 nulls 2^12                                     1.13     22.5±4.68µs        ? ?/sec    1.00     19.9±0.12µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.09      7.5±0.04µs        ? ?/sec    1.00      6.9±0.04µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.12     32.1±0.23µs        ? ?/sec    1.00     28.6±0.20µs        ? ?/sec
sort i32 to indices 2^10                                1.14     12.7±0.07µs        ? ?/sec    1.00     11.1±0.06µs        ? ?/sec
sort i32 to indices 2^12                                1.16     60.6±0.40µs        ? ?/sec    1.00     52.2±1.17µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.0±0.03µs        ? ?/sec    1.22      7.3±0.08µs        ? ?/sec
sort primitive run to indices 2^12                      1.10      8.3±0.05µs        ? ?/sec    1.00      7.5±0.08µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00     42.8±0.53µs        ? ?/sec    1.09     46.6±0.39µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00     90.6±0.55µs        ? ?/sec    1.09     98.6±0.61µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00     51.2±2.37µs        ? ?/sec    1.18     60.2±0.49µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    129.2±1.18µs        ? ?/sec    1.08    139.7±2.87µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00     42.9±0.25µs        ? ?/sec    1.09     46.8±0.36µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00     92.6±0.79µs        ? ?/sec    1.06     97.9±2.17µs        ? ?/sec
sort string[1000] nulls 2^19                                                                   1.00     14.8±0.27ms        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00     43.7±0.56µs        ? ?/sec    1.10     47.9±1.09µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00     89.0±1.50µs        ? ?/sec    1.08     96.5±2.46µs        ? ?/sec
sort string[100] nulls 2^19                                                                    1.00     11.6±0.15ms        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00     42.4±0.48µs        ? ?/sec    1.09     46.4±0.33µs        ? ?/sec
sort string[100] to indices 2^12                        1.00     88.3±0.87µs        ? ?/sec    1.09     96.3±0.90µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    152.4±1.22µs        ? ?/sec    1.00    152.0±1.43µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.01    316.5±2.47µs        ? ?/sec    1.00    313.6±3.50µs        ? ?/sec
sort string[10] nulls 2^19                                                                     1.00      9.6±0.10ms        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00     43.3±0.24µs        ? ?/sec    1.09     47.2±1.13µs        ? ?/sec
sort string[10] to indices 2^12                         1.00     87.8±0.40µs        ? ?/sec    1.09     95.3±3.09µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.17     55.4±0.28µs        ? ?/sec    1.00     47.2±0.84µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.23    119.5±0.52µs        ? ?/sec    1.00     96.9±0.53µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     45.4±1.18µs        ? ?/sec    1.00     45.4±0.33µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.7±2.45µs        ? ?/sec    1.02    106.9±1.93µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     42.8±0.87µs        ? ?/sec    1.01     43.1±0.31µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.01     93.9±1.16µs        ? ?/sec    1.00     93.3±1.50µs        ? ?/sec

@Dandandan
Contributor

This doesn't seem to reproduce on the VM 🤔, perhaps it is machine-dependent.

@vertexclique
Contributor Author

I don't understand the benchmarks; can someone explain them to me?
I see 1.23 on main and 1 on this branch. In both runs, main seems to have values greater than 1.
What am I missing?
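
Assuming the tables above follow the usual critcmp layout, the first number in each column is a slowdown factor relative to the fastest entry in that row: 1.00 marks the faster side and, for example, 1.17 means about 17% slower. The sketch below recomputes two rows from the posted timings; `relative_factors` is a hypothetical helper written for this illustration, not part of any benchmark tooling.

```rust
/// For one benchmark row, divide every timing by the fastest timing.
/// The fastest entry gets 1.00; larger values are proportionally slower.
fn relative_factors(times: &[f64]) -> Vec<f64> {
    let fastest = times.iter().copied().fold(f64::INFINITY, f64::min);
    times.iter().map(|t| t / fastest).collect()
}

fn main() {
    // "sort i32 to indices 2^12": main 60.9 µs, branch 52.0 µs
    let row = relative_factors(&[60.9, 52.0]);
    // prints: main 1.17  branch 1.00  (the branch is faster)
    println!("main {:.2}  branch {:.2}", row[0], row[1]);

    // "sort primitive run 2^12": main 6.0 µs, branch 7.4 µs
    let row = relative_factors(&[6.0, 7.4]);
    // prints: main 1.00  branch 1.23  (main is faster)
    println!("main {:.2}  branch {:.2}", row[0], row[1]);
}
```

Read this way, a value above 1 in the main column means the branch won that row, and a value above 1 in the branch column (as in the `sort string[...] to indices 2^12` rows) means the branch was slower there.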
