Intermediate result blocked approach to aggregation memory management by Rachelint · Pull Request #15591 · apache/datafusion

Rachelint · 2025-04-05T07:47:59Z

Which issue does this PR close?

Part of Improve aggregate performance with adaptive sizing in accumulators / avoiding reallocations in accumulators #7065

Rationale for this change

As mentioned in #7065, we use a single Vec to manage aggregation intermediate results in both GroupAccumulator and GroupValues.

This is simple but not efficient enough for high-cardinality aggregation, because when the Vec is not large enough, we need to allocate a new Vec and copy all data from the old one.

Copying a large amount of data (due to high cardinality) is obviously expensive
And it is also not friendly to the CPU (it pollutes the cache and TLB)

So this PR introduces a blocked approach to manage the aggregation intermediate results. In this approach we never resize the Vec; instead we split the data into blocks, and when the capacity is not enough, we just allocate a new block. See #7065 for details.
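
To make the idea concrete, here is a minimal sketch of block-based storage. `BlockedVec` and its methods are hypothetical names used only for illustration, not the types added by this PR; the sketch assumes a fixed block size chosen up front.

```rust
/// Hypothetical block-based container, for illustration only.
struct BlockedVec<T> {
    block_size: usize,
    blocks: Vec<Vec<T>>,
}

impl<T> BlockedVec<T> {
    fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block size must be non-zero");
        Self { block_size, blocks: Vec::new() }
    }

    /// Appends a value. When the last block is full, a new block is
    /// allocated and the existing blocks keep their element buffers.
    /// The outer `Vec` may still grow, but that only moves the small
    /// per-block headers, not the stored values.
    fn push(&mut self, value: T) {
        let last_is_full = self
            .blocks
            .last()
            .map_or(true, |block| block.len() == self.block_size);
        if last_is_full {
            self.blocks.push(Vec::with_capacity(self.block_size));
        }
        self.blocks
            .last_mut()
            .expect("a block was just ensured")
            .push(value);
    }

    /// Looks up a value by its flat index.
    fn get(&self, index: usize) -> Option<&T> {
        self.blocks
            .get(index / self.block_size)?
            .get(index % self.block_size)
    }

    fn len(&self) -> usize {
        self.blocks.iter().map(Vec::len).sum()
    }

    fn num_blocks(&self) -> usize {
        self.blocks.len()
    }
}
```

Because each inner block is created with its full capacity and never pushed past it, values that were already stored stay where they are while the container grows.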

What changes are included in this PR?

Implement the sketch for the blocked approach
Implement blocked-groups support for PrimitiveGroupsAccumulator and GroupValuesPrimitive as examples

Are these changes tested?

Tested by existing tests, plus new unit tests and new fuzz tests.

Are there any user-facing changes?

Two functions are added to the GroupValues and GroupAccumulator traits.

But as you can see, they have default implementations, so users can choose to support the blocked approach only when they want better performance for their UDAFs.

    /// Returns `true` if this accumulator supports blocked groups.
    fn supports_blocked_groups(&self) -> bool {
        false
    }

    /// Alter the block size in the accumulator
    ///
    /// If the target block size is `None`, it will use a single big
    /// block (think of it as a `Vec`) to manage the state.
    ///
    /// If the target block size is `Some(blk_size)`, it will try to
    /// set the block size to `blk_size`, and the attempt will only succeed
    /// when the accumulator supports blocked mode.
    ///
    /// NOTICE: After altering the block size, all previous data will be cleared.
    ///
    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<()> {
        if block_size.is_some() {
            return Err(DataFusionError::NotImplemented(
                "this accumulator doesn't support blocked mode yet".to_string(),
            ));
        }

        Ok(())
    }
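
As a rough illustration of how an implementer might opt in, the sketch below mirrors the two methods on a local stand-in trait. `BlockedGroups`, `FlatSum`, `BlockedSum`, and `configure` are hypothetical names, and a plain `String` replaces `DataFusionError` so the example stays self-contained; it is not the actual DataFusion trait.

```rust
/// Local stand-in for the two methods quoted above (hypothetical trait).
trait BlockedGroups {
    fn supports_blocked_groups(&self) -> bool {
        false
    }

    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<(), String> {
        if block_size.is_some() {
            return Err("this accumulator doesn't support blocked mode yet".to_string());
        }
        Ok(())
    }
}

/// Hypothetical accumulator that keeps the defaults (no blocked mode).
struct FlatSum;

impl BlockedGroups for FlatSum {}

/// Hypothetical accumulator that opts in to blocked mode.
struct BlockedSum {
    block_size: Option<usize>,
    blocks: Vec<Vec<i64>>,
}

impl BlockedGroups for BlockedSum {
    fn supports_blocked_groups(&self) -> bool {
        true
    }

    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<(), String> {
        if block_size == Some(0) {
            return Err("block size must be non-zero".to_string());
        }
        // As the doc comment notes, altering the block size clears previous state.
        self.blocks.clear();
        self.block_size = block_size;
        Ok(())
    }
}

/// Caller side: request blocked mode only when the accumulator supports it.
/// Returns whether blocked mode was enabled.
fn configure(acc: &mut dyn BlockedGroups, wanted: usize) -> Result<bool, String> {
    if acc.supports_blocked_groups() {
        acc.alter_block_size(Some(wanted))?;
        Ok(true)
    } else {
        acc.alter_block_size(None)?;
        Ok(false)
    }
}
```

An accumulator that does nothing keeps working in single-block mode, which is why the defaults make the change opt-in for implementers.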

Dandandan · 2025-04-08T07:36:20Z

Hi @Rachelint, I think I have an alternative proposal that seems relatively easy to implement.
I'll share it with you once I have some time to validate the design (probably this evening).

Rachelint · 2025-04-08T07:54:02Z

Hi @Rachelint, I think I have an alternative proposal that seems relatively easy to implement. I'll share it with you once I have some time to validate the design (probably this evening).

Thanks a lot. The design in this PR does indeed still introduce quite a few code changes...

At first I tried not to modify anything in GroupAccumulator:

Only implement the blocked logic in GroupValues
Then reorder the input batch according to the block indices obtained from GroupValues
Apply the input batch to the related GroupAccumulator using slices
And when a new block is needed, create a new GroupAccumulator (one GroupAccumulator per block)

But I found this way introduces too much extra cost...
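
The reordering step in that attempt can be sketched roughly as follows. This is only an illustration under assumptions: `reorder_rows_by_block` is a hypothetical helper, group indices are plain `usize` values, and a counting sort stands in for whatever reordering was actually tried.

```rust
use std::ops::Range;

/// Groups row positions by the block their group index falls into.
/// Returns the reordered row positions plus, for each block, the range
/// of that vector holding the block's rows (hypothetical helper).
fn reorder_rows_by_block(
    group_indices: &[usize],
    block_size: usize,
) -> (Vec<usize>, Vec<Range<usize>>) {
    assert!(block_size > 0, "block size must be non-zero");
    let num_blocks = group_indices
        .iter()
        .map(|g| g / block_size + 1)
        .max()
        .unwrap_or(0);

    // First pass: count how many rows land in each block.
    let mut counts = vec![0usize; num_blocks];
    for &g in group_indices {
        counts[g / block_size] += 1;
    }

    // Prefix sums turn the counts into one contiguous range per block.
    let mut ranges = Vec::with_capacity(num_blocks);
    let mut start = 0;
    for &count in &counts {
        ranges.push(start..start + count);
        start += count;
    }

    // Second pass: scatter each row position into its block's range.
    let mut next: Vec<usize> = ranges.iter().map(|r| r.start).collect();
    let mut rows = vec![0usize; group_indices.len()];
    for (row, &g) in group_indices.iter().enumerate() {
        let block = g / block_size;
        rows[next[block]] = row;
        next[block] += 1;
    }
    (rows, ranges)
}
```

Even in this simplified form, every input batch needs two extra passes plus a gather of the value columns before each per-block slice can be handed to its accumulator, which is plausibly the kind of overhead described above.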

Maybe we could place the block indices into values in merge/update_batch as an Array?

Rachelint · 2025-04-17T12:03:17Z

Finished developing (and testing) all the needed common structs!
Four things remain for this one:

Support blocked related logic in GroupedHashAggregateStream (we can copy it from Sketch for aggregation intermediate results blocked management #11943)
Logic for deciding when to enable this optimization
Example blocked versions of GroupAccumulator and GroupValues
Unit tests for blocked GroupValuesPrimitive, which is a bit complex
Fuzz tests
Chore: fix docs, fix clippy, add more comments...

Rachelint · 2025-04-21T13:58:12Z

It is very close; I just need to add more tests!

- Promote `push_block`/`pop_block` to `BlockStore` trait methods so any block store can be drained generically. `FlatBlockStore` implements them as direct replace + `mem::take`; `BlockedBlockStore` introduces `EmitContext` to lazily move accumulation blocks out on first pop and drain via cursor.
- Replace the `VecBlockStore<T>` extension trait with a `VecBlockStore<T, S>` struct that wraps any `S: BlockStore<Vec<T>>` and implements `emit` purely via `push_block`/`pop_block`, removing the per-store `emit` impls.
- Update `PrimitiveGroupsState` and `GroupValuesPrimitiveState` to bound the inner store with `BlockStore<Vec<V>> + Send`, hold the wrapper as `VecBlockStore<V, VB>`, and add `V: Send` where the closure passed to `NullState::accumulate` requires it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
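
For readers who have not opened the diff, the shape described in this commit message might look roughly like the sketch below. The type names come from the message, but every signature here is an assumption: the thread does not show the real definitions, `emit` taking a row count is a guess, and only the single-block store is modeled (no `BlockedBlockStore` or `EmitContext`).

```rust
use std::marker::PhantomData;
use std::mem;

/// Assumed shape of the trait: a store that blocks can be pushed into
/// and popped out of.
trait BlockStore<B> {
    fn push_block(&mut self, block: B);
    fn pop_block(&mut self) -> Option<B>;
}

/// Single-block store: push replaces the block, pop takes it out.
struct FlatBlockStore<B> {
    block: Option<B>,
}

impl<B> FlatBlockStore<B> {
    fn new() -> Self {
        Self { block: None }
    }
}

impl<B> BlockStore<B> for FlatBlockStore<B> {
    fn push_block(&mut self, block: B) {
        // Direct replace of the single block.
        self.block = Some(block);
    }

    fn pop_block(&mut self) -> Option<B> {
        // `mem::take` leaves `None` behind and returns the old block.
        mem::take(&mut self.block)
    }
}

/// Wrapper that implements `emit` only in terms of the trait methods,
/// so the store itself needs no `emit` of its own.
struct VecBlockStore<T, S: BlockStore<Vec<T>>> {
    inner: S,
    _values: PhantomData<T>,
}

impl<T, S: BlockStore<Vec<T>>> VecBlockStore<T, S> {
    fn new(inner: S) -> Self {
        Self { inner, _values: PhantomData }
    }

    /// Emits up to `n` values from the popped block and pushes any
    /// remainder back into the store.
    fn emit(&mut self, n: usize) -> Vec<T> {
        match self.inner.pop_block() {
            None => Vec::new(),
            Some(mut block) => {
                if n < block.len() {
                    let rest = block.split_off(n);
                    self.inner.push_block(rest);
                }
                block
            }
        }
    }
}
```

The point of the wrapper is that draining logic lives in one place and works with any store that implements the two trait methods.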

…anBlock>

- Replace the `SeenValueStore` extension trait (and the per-store `emit` impls on `FlatBlockStore<BooleanBlock>` / `BlockedBlockStore<BooleanBlock>`) with a `SeenValueStore<S>` struct that wraps any `S: BlockStore<BooleanBlock>` and implements `emit` purely via `pop_block` + `BooleanBlock::finish` + `push_block`.
- Update `SeenValues` and `NullState` to bound `S` with `BlockStore<BooleanBlock>` and hold the wrapper as `SeenValueStore<S>`; `NullState::new` wraps the empty builder internally so callers stay unchanged.
- Update `PrimitiveGroupsState` bound from `SeenValueStore + Send` to `BlockStore<BooleanBlock> + Send`.
- Keep only the inherent methods that have call sites (`set_bit`, `size`, `resize`, `num_blocks`, `emit`) plus `Index`/`IndexMut`; drop the unused `push_block`/`pop_block`/`allocate_block`/`is_empty`/`clear` delegators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alamb · 2026-06-16T20:55:53Z

Marking as a draft as I don't think this one is ready to merge quite yet and I am trying to clean up the review / merge queue

Rachelint · 2026-06-18T06:57:12Z

Marking as a draft as I don't think this one is ready to merge quite yet and I am trying to clean up the review / merge queue

Yes, and I think the whole feature will be suitable to push forward once the aggregation refactoring is stable.

However, there are actually two parts included in this:

One part is about refactoring GroupValues and GroupAccumulator
The other part is about applying the blocked logic in aggregation

How about we split this PR into two or more? And push forward part one (GroupValues and GroupAccumulator) in parallel with the aggregation refactoring?
@alamb @2010YOUY01 @ariel-miculas

Rachelint · 2026-06-18T06:59:41Z

And before splitting, I will continue working in this one to show that the refactoring of GroupValues and GroupAccumulator does not lead to a regression.

ariel-miculas · 2026-06-18T08:16:54Z

How about we split this PR into two or more? And push forward part one (GroupValues and GroupAccumulator) in parallel with the aggregation refactoring?

I think it's a good idea, this is important work and it would be easier to review if split into smaller PRs.

2010YOUY01 · 2026-06-18T13:21:46Z

Marking as a draft as I don't think this one is ready to merge quite yet and I am trying to clean up the review / merge queue

Yes, and I think the whole feature will be suitable to push forward once the aggregation refactoring is stable.

However, there are actually two parts included in this:

One part is about refactoring GroupValues and GroupAccumulator

The other part is about applying the blocked logic in aggregation

How about we split this PR into two or more? And push forward part one (GroupValues and GroupAccumulator) in parallel with the aggregation refactoring? @alamb @2010YOUY01 @ariel-miculas

I think the steps are

Complete [EPIC] Split Aggregation Logic into Dedicated Streams #22710
Initial PR for blocked states: The major issue is to agree on the API changes for GroupValues and GroupsAccumulator, and how to organize future work.
Update all GroupValues and GroupsAccumulator (There are around 20 of them IIRC)

The performance seems to be a nearly solved issue, the PoC already showed high cardinality cases are faster (with several micro optimizations left on the table), low cardinality is slightly slower but @alamb's suggestion in #22712 (comment) is doable I think, to bring back the performance.

I suggest not trying to parallelize steps 1 and 2, as they will likely conflict with each other. Step 3 should be highly parallelizable.

As for the refactoring progress, I'd estimate it's about 50% complete. I haven't seen any major technical blockers so far—just need some time to better structure the implementation.

Rachelint · 2026-06-18T23:26:04Z

I suggest not trying to parallelize steps 1 and 2, as they will likely conflict with each other. Step 3 should be highly parallelizable...

Makes sense.

The performance seems to be a nearly solved issue...

Yes, and actually, based on my earlier experiments, I think it makes little difference to performance: some steps improve (removing the slicing of record batches, removing Vec resizing) and some regress (we need to perform two index operations), so the net effect is close to no difference, and it is mainly a better memory management approach.
#15591 (comment)
And I am working to make sure it does not lead to a regression when the feature is disabled (the regression happened because we need to perform two index operations even when it is disabled).
#15591 (comment)
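
To illustrate the two index operations mentioned here, the sketch below compares a flat lookup with a blocked one. The function names are hypothetical, and the power-of-two block size (so the split can use a shift and a mask) is an assumption rather than something stated in the thread.

```rust
/// Flat layout: a single index operation.
fn get_flat(values: &[i64], group_index: usize) -> i64 {
    values[group_index]
}

/// Blocked layout: the flat group index is split into a block id and an
/// offset inside that block, so the lookup needs two index operations.
/// Assumes the block size is `1 << block_size_log2`.
fn get_blocked(blocks: &[Vec<i64>], block_size_log2: u32, group_index: usize) -> i64 {
    let block_id = group_index >> block_size_log2;
    let offset = group_index & ((1usize << block_size_log2) - 1);
    blocks[block_id][offset]
}
```

With a single block, the outer index is always zero, yet the extra lookup and bounds check are still executed on every access unless the single-block case gets its own code path.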

ariel-miculas · 2026-06-19T10:38:42Z

Yes, and actually, based on my earlier experiments, I think it makes little difference to performance: some steps improve (removing the slicing of record batches, removing Vec resizing) and some regress (we need to perform two index operations), so the net effect is close to no difference, and it is mainly a better memory management approach.

I disagree, since the memory management is directly tied to performance via the spilling mechanism when running with memory limits configured. See #22526 (comment)
The memory overaccounting issues caused by the current design of hash aggregation have a real performance impact in the downstream operators which are either:

forced to spill prematurely
outright fail because they don't have enough memory, see Accurately reserve memory in the build side of hash joins #22861

So I believe the new "blocked" approach will have significant performance improvements in production-like workloads.

2010YOUY01 · 2026-06-20T06:55:52Z

Yes, and actually, based on my earlier experiments, I think it makes little difference to performance: some steps improve (removing the slicing of record batches, removing Vec resizing) and some regress (we need to perform two index operations), so the net effect is close to no difference, and it is mainly a better memory management approach.

I disagree, since the memory management is directly tied to performance via the spilling mechanism when running with memory limits configured. See #22526 (comment) The memory overaccounting issues caused by the current design of hash aggregation have a real performance impact in the downstream operators which are either:

forced to spill prematurely

outright fail because they don't have enough memory, see Accurately reserve memory in the build side of hash joins #22861

So I believe the new "blocked" approach will have significant performance improvements in production-like workloads.

I agree we could proceed first without worrying too much about the benchmark numbers.

This is like a tradeoff between micro-optimizations and algorithmic improvements to memory efficiency.

I think completely giving up 10%-ish performance for an architectural win is already a good idea. But realistically, I also believe it should be possible to avoid the regressions entirely with some low-level optimizations; we'd better discuss those opportunities later.

Rachelint · 2026-06-20T09:34:10Z

I disagree, since the memory management is directly tied to performance via the spilling mechanism when running with memory limits configured.

Good point; "no difference to performance" may hold only for the benchmarks.

adriangb · 2026-06-20T11:15:41Z

We can run benchmarks with memory limits to force spilling if that helps

github-actions · 2026-06-22T01:37:32Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 106.692s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.037s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [ 103.270s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.038s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.884s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 212.512s] datafusion
    Building datafusion-common v54.0.0 (current)
       Built [  32.136s] (current)
     Parsing datafusion-common v54.0.0 (current)
      Parsed [   0.067s] (current)
    Building datafusion-common v54.0.0 (baseline)
       Built [  32.790s] (baseline)
     Parsing datafusion-common v54.0.0 (baseline)
      Parsed [   0.065s] (baseline)
    Checking datafusion-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.039s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ExecutionOptions.enable_aggregation_blocked_groups in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:723

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  67.915s] datafusion-common
    Building datafusion-expr-common v54.0.0 (current)
       Built [  18.659s] (current)
     Parsing datafusion-expr-common v54.0.0 (current)
      Parsed [   0.019s] (current)
    Building datafusion-expr-common v54.0.0 (baseline)
       Built [  18.871s] (baseline)
     Parsing datafusion-expr-common v54.0.0 (baseline)
      Parsed [   0.020s] (baseline)
    Checking datafusion-expr-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.302s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_variant_added.ron

Failed in:
  variant EmitTo:NextBlock in /home/runner/work/datafusion/datafusion/datafusion/expr-common/src/groups_accumulator.rs:39

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  38.677s] datafusion-expr-common
    Building datafusion-ffi v54.0.0 (current)
       Built [  60.353s] (current)
     Parsing datafusion-ffi v54.0.0 (current)
      Parsed [   0.067s] (current)
    Building datafusion-ffi v54.0.0 (baseline)
       Built [  59.553s] (baseline)
     Parsing datafusion-ffi v54.0.0 (baseline)
      Parsed [   0.068s] (baseline)
    Checking datafusion-ffi v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.392s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 122.872s] datafusion-ffi
    Building datafusion-functions-aggregate v54.0.0 (current)
       Built [  30.409s] (current)
     Parsing datafusion-functions-aggregate v54.0.0 (current)
      Parsed [   0.048s] (current)
    Building datafusion-functions-aggregate v54.0.0 (baseline)
       Built [  29.847s] (baseline)
     Parsing datafusion-functions-aggregate v54.0.0 (baseline)
      Parsed [   0.049s] (baseline)
    Checking datafusion-functions-aggregate v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.278s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  62.108s] datafusion-functions-aggregate
    Building datafusion-functions-aggregate-common v54.0.0 (current)
       Built [  20.139s] (current)
     Parsing datafusion-functions-aggregate-common v54.0.0 (current)
      Parsed [   0.024s] (current)
    Building datafusion-functions-aggregate-common v54.0.0 (baseline)
       Built [  20.215s] (baseline)
     Parsing datafusion-functions-aggregate-common v54.0.0 (baseline)
      Parsed [   0.021s] (baseline)
    Checking datafusion-functions-aggregate-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.188s] 223 checks: 219 pass, 4 fail, 0 warn, 30 skip

--- failure enum_struct_variant_field_added: pub enum struct variant field added ---

Description:
An enum's exhaustive struct variant has a new field, which has to be included when constructing or matching on this variant.
        ref: https://doc.rust-lang.org/reference/attributes/type_system.html#the-non_exhaustive-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_struct_variant_field_added.ron

Failed in:
  field pending_builder of variant SeenValues::All in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:113
  field builder of variant SeenValues::Some in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:116

--- failure enum_struct_variant_field_missing: pub enum struct variant's field removed or renamed ---

Description:
A publicly-visible enum has a struct variant whose field is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_struct_variant_field_missing.ron

Failed in:
  field values of variant SeenValues::Some, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/75d3c87db2328833532578616ff1c9c11e735e05/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:46

--- failure trait_requires_more_generic_type_params: trait now requires more generic type parameters ---

Description:
A trait now requires more generic type parameters than it used to. Uses of this trait that supplied the previously-required number of generic types will be broken. To fix this, consider supplying default values for newly-added generic types.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-new-parameter-no-default
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/trait_requires_more_generic_type_params.ron

Failed in:
  trait NullState (0 -> 2 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:305
  trait SeenValues (0 -> 1 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:99

--- failure type_requires_more_generic_type_params: type now requires more generic type parameters ---

Description:
A type now requires more generic type parameters than it used to. Uses of this type that supplied the previously-required number of generic types will be broken. To fix this, consider supplying default values for newly-added generic types.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-new-parameter-no-default
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/type_requires_more_generic_type_params.ron

Failed in:
  Struct NullState (0 -> 2 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:305
  Enum SeenValues (0 -> 1 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:99

     Summary semver requires new major version: 4 major and 0 minor checks failed
    Finished [  41.530s] datafusion-functions-aggregate-common
    Building datafusion-physical-expr v54.0.0 (current)
       Built [  28.523s] (current)
     Parsing datafusion-physical-expr v54.0.0 (current)
      Parsed [   0.051s] (current)
    Building datafusion-physical-expr v54.0.0 (baseline)
       Built [  28.843s] (baseline)
     Parsing datafusion-physical-expr v54.0.0 (baseline)
      Parsed [   0.054s] (baseline)
    Checking datafusion-physical-expr v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.496s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  58.823s] datafusion-physical-expr
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  36.865s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.142s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  37.869s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.143s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.857s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  78.086s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 193.327s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 182.351s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.124s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 380.044s] datafusion-sqllogictest
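
For context on the `enum_variant_added` failure above, here is a small sketch of why a new variant on an exhaustive enum is a breaking change. The enum is a local stand-in shaped like `EmitTo`, and the meaning given to `NextBlock` (emit at most one block of rows) is an assumption made only for this example.

```rust
/// Local stand-in for an exhaustive public enum shaped like `EmitTo`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EmitTo {
    All,
    First(usize),
    /// The variant the check flags as newly added.
    NextBlock,
}

/// Downstream code written before `NextBlock` existed would match only
/// `All` and `First`. Without a wildcard arm, that match stops compiling
/// once the variant is added, which is why the check asks for a major
/// version bump. The `NextBlock` arm below shows the required update.
fn rows_to_emit(emit_to: EmitTo, total_rows: usize, block_size: usize) -> usize {
    match emit_to {
        EmitTo::All => total_rows,
        EmitTo::First(n) => n.min(total_rows),
        // Assumed semantics: emit at most one block worth of rows.
        EmitTo::NextBlock => block_size.min(total_rows),
    }
}
```

Marking an enum `#[non_exhaustive]` forces downstream matches to carry a wildcard arm, which is the usual way to make later variant additions non-breaking.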

Rachelint · 2026-06-23T11:20:20Z

We can run benchmarks with memory limits to force spilling if that helps

@adriangb hello, would it be possible to authorize me to trigger benchmarks through the bot?
I have been working on improving aggregation performance recently.

alamb · 2026-06-23T16:01:49Z

We can run benchmarks with memory limits to force spilling if that helps

@adriangb hello, would it be possible to authorize me to trigger benchmarks through the bot? I have been working on improving aggregation performance recently.

I also sent @adriangb a direct message

adriangb · 2026-06-23T17:18:08Z

done in adriangb/datafusion-benchmarking#16

Rachelint · 2026-06-24T01:29:14Z

@adriangb @alamb Thanks!

Rachelint · 2026-06-24T01:51:19Z

run benchmarks clickbench_partitioned

adriangbot · 2026-06-24T01:54:13Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4785188060-647-xvzht 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing intermeidate-result-blocked-approach (5869167) to a27f030 (merge-base) diff using: clickbench_partitioned
Results will be posted here when complete

adriangbot · 2026-06-24T02:14:50Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and intermeidate-result-blocked-approach
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃   intermeidate-result-blocked-approach ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.23 / 3.91 ±5.30 / 14.51 ms │           1.22 / 3.90 ±5.29 / 14.48 ms │     no change │
│ QQuery 1  │        12.52 / 12.80 ±0.16 / 13.00 ms │         12.77 / 12.87 ±0.05 / 12.93 ms │     no change │
│ QQuery 2  │        35.72 / 35.98 ±0.25 / 36.35 ms │         35.93 / 36.41 ±0.60 / 37.43 ms │     no change │
│ QQuery 3  │        30.19 / 30.83 ±0.71 / 32.21 ms │         30.03 / 30.36 ±0.27 / 30.80 ms │     no change │
│ QQuery 4  │     220.44 / 222.86 ±1.27 / 223.76 ms │      232.67 / 236.08 ±2.39 / 239.41 ms │  1.06x slower │
│ QQuery 5  │     275.25 / 278.35 ±2.41 / 281.95 ms │      265.43 / 268.94 ±3.19 / 273.97 ms │     no change │
│ QQuery 6  │           1.26 / 1.41 ±0.23 / 1.86 ms │            1.25 / 1.40 ±0.22 / 1.84 ms │     no change │
│ QQuery 7  │        14.28 / 14.44 ±0.09 / 14.54 ms │         13.64 / 13.81 ±0.15 / 14.01 ms │     no change │
│ QQuery 8  │     330.48 / 335.49 ±3.02 / 339.34 ms │      315.81 / 318.80 ±2.80 / 323.81 ms │     no change │
│ QQuery 9  │     473.44 / 483.37 ±5.48 / 490.16 ms │      454.22 / 464.83 ±7.98 / 477.60 ms │     no change │
│ QQuery 10 │        70.99 / 73.29 ±3.74 / 80.69 ms │         68.19 / 72.03 ±6.11 / 84.20 ms │     no change │
│ QQuery 11 │        81.42 / 83.29 ±1.46 / 84.98 ms │         79.85 / 81.96 ±1.08 / 82.78 ms │     no change │
│ QQuery 12 │     277.56 / 283.69 ±5.17 / 289.56 ms │      262.77 / 267.14 ±3.56 / 271.51 ms │ +1.06x faster │
│ QQuery 13 │    380.32 / 394.52 ±11.41 / 413.87 ms │      363.62 / 380.65 ±9.00 / 389.67 ms │     no change │
│ QQuery 14 │     288.98 / 295.10 ±4.33 / 301.87 ms │      277.79 / 281.39 ±2.72 / 285.07 ms │     no change │
│ QQuery 15 │     283.82 / 292.30 ±5.70 / 300.28 ms │      281.75 / 293.14 ±7.76 / 302.75 ms │     no change │
│ QQuery 16 │    624.39 / 638.83 ±10.83 / 657.01 ms │      605.26 / 613.60 ±4.85 / 619.95 ms │     no change │
│ QQuery 17 │     630.65 / 646.02 ±8.95 / 656.39 ms │      617.25 / 623.47 ±6.64 / 632.77 ms │     no change │
│ QQuery 18 │ 1291.95 / 1322.58 ±24.46 / 1363.70 ms │  1240.33 / 1261.01 ±16.16 / 1285.74 ms │     no change │
│ QQuery 19 │        27.94 / 28.41 ±0.26 / 28.70 ms │         27.41 / 29.59 ±4.28 / 38.15 ms │     no change │
│ QQuery 20 │    515.48 / 529.33 ±11.54 / 544.72 ms │      517.30 / 521.34 ±3.04 / 526.61 ms │     no change │
│ QQuery 21 │     518.96 / 526.74 ±5.53 / 535.97 ms │      509.46 / 519.34 ±5.92 / 526.44 ms │     no change │
│ QQuery 22 │ 1003.60 / 1030.83 ±23.45 / 1060.15 ms │      982.29 / 992.46 ±5.89 / 999.88 ms │     no change │
│ QQuery 23 │ 3163.56 / 3208.28 ±48.54 / 3295.42 ms │  3151.81 / 3211.18 ±40.42 / 3266.77 ms │     no change │
│ QQuery 24 │        41.49 / 51.52 ±5.77 / 58.79 ms │         41.80 / 42.67 ±0.98 / 44.35 ms │ +1.21x faster │
│ QQuery 25 │     111.27 / 111.96 ±0.89 / 113.61 ms │      111.04 / 113.57 ±4.15 / 121.80 ms │     no change │
│ QQuery 26 │        41.50 / 42.35 ±0.50 / 42.87 ms │         41.51 / 42.27 ±0.70 / 43.54 ms │     no change │
│ QQuery 27 │     664.58 / 675.43 ±5.79 / 680.45 ms │     674.35 / 685.78 ±14.19 / 710.56 ms │     no change │
│ QQuery 28 │  3030.00 / 3034.51 ±4.91 / 3042.90 ms │   3020.15 / 3034.08 ±7.84 / 3043.74 ms │     no change │
│ QQuery 29 │        40.73 / 45.01 ±6.31 / 57.27 ms │         40.36 / 40.66 ±0.50 / 41.65 ms │ +1.11x faster │
│ QQuery 30 │    296.81 / 309.65 ±12.07 / 331.50 ms │     297.35 / 308.84 ±12.39 / 328.12 ms │     no change │
│ QQuery 31 │     278.98 / 288.15 ±7.22 / 299.13 ms │      276.96 / 284.72 ±5.08 / 289.46 ms │     no change │
│ QQuery 32 │   916.77 / 959.08 ±35.57 / 1013.44 ms │    925.97 / 977.05 ±44.21 / 1052.50 ms │     no change │
│ QQuery 33 │ 1426.83 / 1454.86 ±18.33 / 1478.22 ms │  1437.69 / 1473.51 ±29.58 / 1512.16 ms │     no change │
│ QQuery 34 │ 1486.59 / 1541.22 ±54.65 / 1639.59 ms │ 1487.08 / 1563.69 ±109.94 / 1781.94 ms │     no change │
│ QQuery 35 │    274.38 / 311.25 ±31.15 / 354.24 ms │     288.85 / 331.06 ±72.24 / 475.19 ms │  1.06x slower │
│ QQuery 36 │        63.27 / 68.48 ±3.65 / 73.27 ms │         65.90 / 70.54 ±4.44 / 78.58 ms │     no change │
│ QQuery 37 │       36.26 / 43.59 ±12.86 / 69.24 ms │         35.18 / 40.41 ±4.38 / 44.95 ms │ +1.08x faster │
│ QQuery 38 │        42.87 / 46.52 ±3.37 / 50.52 ms │         40.11 / 42.59 ±1.46 / 44.01 ms │ +1.09x faster │
│ QQuery 39 │     143.45 / 148.86 ±6.48 / 160.77 ms │      146.04 / 151.46 ±5.63 / 160.01 ms │     no change │
│ QQuery 40 │        13.90 / 21.07 ±9.56 / 39.77 ms │         13.71 / 14.23 ±0.68 / 15.52 ms │ +1.48x faster │
│ QQuery 41 │        13.54 / 15.77 ±2.88 / 21.42 ms │         13.26 / 13.56 ±0.19 / 13.81 ms │ +1.16x faster │
│ QQuery 42 │        12.83 / 13.14 ±0.17 / 13.37 ms │         12.93 / 13.12 ±0.10 / 13.20 ms │     no change │
└───────────┴───────────────────────────────────────┴────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                   ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                   │ 19955.07ms │
│ Total Time (intermeidate-result-blocked-approach)   │ 19779.53ms │
│ Average Time (HEAD)                                 │   464.07ms │
│ Average Time (intermeidate-result-blocked-approach) │   459.99ms │
│ Queries Faster                                      │          7 │
│ Queries Slower                                      │          2 │
│ Queries with No Change                              │         34 │
│ Queries with Failure                                │          0 │
└─────────────────────────────────────────────────────┴────────────┘
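The Change column in the table above is consistent with comparing the mean times (the middle value of `min / mean ±stddev / max`) and treating a ratio within about 5% of 1.0 as noise. The benchmark output does not state the rule, so the following is only a minimal sketch: the 5% threshold, the function name `classify_change`, and its signature are assumptions chosen to reproduce the rows shown here, not the runner's actual code.

```rust
/// Hypothetical helper: classify one benchmark row from its mean times (ms).
/// `noise` is the assumed relative band treated as "no change" (e.g. 0.05).
fn classify_change(base_mean_ms: f64, branch_mean_ms: f64, noise: f64) -> String {
    // Ratio below 1.0 means the branch ran faster than the base.
    let change = branch_mean_ms / base_mean_ms;
    if change >= 1.0 - noise && change <= 1.0 + noise {
        "no change".to_string()
    } else if change < 1.0 {
        // Report the speedup as base / branch.
        format!("+{:.2}x faster", 1.0 / change)
    } else {
        format!("{:.2}x slower", change)
    }
}
```

Under these assumptions, Query 8 (335.49 ms vs 318.80 ms) stays inside the band and is reported as no change, while Query 12 (283.69 ms vs 267.14 ms) falls just outside it and is reported as +1.06x faster.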

Resource Usage

clickbench_partitioned — base (merge-base)

Metric	Value
Wall time	105.0s
Peak memory	11.0 GiB
Avg memory	4.3 GiB
CPU user	1013.6s
CPU sys	70.7s
Peak spill	0 B

clickbench_partitioned — branch

Metric	Value
Wall time	100.0s
Peak memory	11.4 GiB
Avg memory	4.3 GiB
CPU user	1003.5s
CPU sys	71.2s
Peak spill	0 B

Rachelint changed the title ~~Impl Intermeidate result blocked approach framework~~ Impl intermeidate result blocked approach framework Apr 5, 2025

Rachelint changed the title ~~Impl intermeidate result blocked approach framework~~ Impl intermeidate result blocked approach sketch Apr 5, 2025

Rachelint mentioned this pull request Apr 5, 2025

Improve aggregate performance with adaptive sizing in accumulators / avoiding reallocations in accumulators #7065

Open

2 tasks

github-actions Bot added the logical-expr (Logical plan and expressions) label Apr 5, 2025

Rachelint mentioned this pull request Apr 8, 2025

Implement PoC block allocation for count accumulator #15642

Closed

Rachelint force-pushed the intermeidate-result-blocked-approach branch 2 times, most recently from cc37eba to f690940 April 9, 2025 14:37

Rachelint mentioned this pull request Apr 9, 2025

Sketch for aggregation intermediate results blocked management #11943

Closed

github-actions Bot added the functions (Changes to functions implementation) label Apr 10, 2025

Rachelint force-pushed the intermeidate-result-blocked-approach branch from 95c6a36 to a4c6f42 April 10, 2025 11:10

github-actions Bot added the physical-expr (Changes to the physical-expr crates) label Apr 10, 2025

Rachelint force-pushed the intermeidate-result-blocked-approach branch 6 times, most recently from 2100a5b to 0ee951c April 17, 2025 11:56

Rachelint force-pushed the intermeidate-result-blocked-approach branch 2 times, most recently from c51d409 to 2863809 April 20, 2025 14:46

github-actions Bot added the execution (Related to the execution crate), common (Related to common crate), and sqllogictest (SQL Logic Tests (.slt)) labels Apr 21, 2025

Rachelint force-pushed the intermeidate-result-blocked-approach branch 3 times, most recently from 31d660d to 2b8dd1e April 22, 2025 18:52

define the needed methods in GroupAccumulator and GroupValues.

4353748

Rachelint and others added 2 commits June 8, 2026 02:10

refactor: add block store abstractions

3c51e0c

Rachelint added 5 commits June 21, 2026 01:02

Simplify null state construction

e49260f

Derive null state block size from block store

2462069

Remove BlockStore is_empty method

be352fd

Import Deref traits in accumulate

8cd0092

Merge branch 'main' into intermeidate-result-blocked-approach

d355609

Rachelint added 3 commits June 22, 2026 13:41

Fix primitive group values emit first test

eef162c

Simplify blocked block store emission state

0b3179a

Optimize primitive groups accumulator size

5869167

adriangb mentioned this pull request Jun 23, 2026

Authorize Rachelint to trigger benchmarks adriangb/datafusion-benchmarking#16

Merged

Conversation

Rachelint commented Apr 5, 2025 • edited

Dandandan commented Apr 8, 2025

Rachelint commented Apr 8, 2025 • edited

Rachelint commented Apr 17, 2025 • edited

Rachelint commented Apr 21, 2025

alamb commented Jun 16, 2026

Rachelint commented Jun 18, 2026 • edited

Rachelint commented Jun 18, 2026

ariel-miculas commented Jun 18, 2026

2010YOUY01 commented Jun 18, 2026

Rachelint commented Jun 18, 2026 • edited

ariel-miculas commented Jun 19, 2026

2010YOUY01 commented Jun 20, 2026

Rachelint commented Jun 20, 2026 • edited

adriangb commented Jun 20, 2026

github-actions Bot commented Jun 22, 2026 • edited

Rachelint commented Jun 23, 2026

alamb commented Jun 23, 2026

adriangb commented Jun 23, 2026

Rachelint commented Jun 24, 2026

Rachelint commented Jun 24, 2026

adriangbot commented Jun 24, 2026

adriangbot commented Jun 24, 2026

10 participants
