Intermediate result blocked approach to aggregation memory management#15591

Draft
Rachelint wants to merge 102 commits into apache:main from Rachelint:intermeidate-result-blocked-approach

@Rachelint Rachelint commented Apr 5, 2025

Contributor

Which issue does this PR close?

Rationale for this change

As mentioned in #7065, we use a single Vec to manage aggregation intermediate results in both GroupAccumulator and GroupValues.

This is simple, but not efficient enough for high-cardinality aggregation: when the Vec is not large enough, we need to allocate a new Vec and copy all the data over from the old one.

  • Copying a large amount of data (due to high cardinality) is obviously expensive
  • It is also unfriendly to the CPU (it thrashes the cache and the TLB)

So this PR introduces a blocked approach to managing the aggregation intermediate results. In this approach we never resize the Vec. Instead, we split the data into blocks, and when the capacity is not enough we just allocate a new block. See #7065 for details.
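
To make the idea concrete, here is a minimal sketch of blocked storage. It is not the code in this PR: the `BlockedVec` type and its methods are hypothetical names made up for illustration, and the real implementation may differ in layout and API.

```rust
/// Hypothetical illustration only: stores values in fixed-size blocks
/// instead of one growing `Vec`.
struct BlockedVec<T> {
    block_size: usize,
    blocks: Vec<Vec<T>>,
    len: usize,
}

impl<T> BlockedVec<T> {
    fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block size must be non-zero");
        Self { block_size, blocks: Vec::new(), len: 0 }
    }

    /// Appends a value. When the last block is full, a new block is
    /// allocated; the values already stored are never copied.
    /// (Only the small outer `Vec` of block handles may grow.)
    fn push(&mut self, value: T) {
        if self.len % self.block_size == 0 {
            self.blocks.push(Vec::with_capacity(self.block_size));
        }
        self.blocks.last_mut().unwrap().push(value);
        self.len += 1;
    }

    /// Looks up a value with two index steps: block id, then offset.
    fn get(&self, index: usize) -> Option<&T> {
        self.blocks
            .get(index / self.block_size)?
            .get(index % self.block_size)
    }

    fn num_blocks(&self) -> usize {
        self.blocks.len()
    }

    fn len(&self) -> usize {
        self.len
    }
}
```

With a block size of 4, pushing 10 values allocates 3 blocks, and none of the first 8 values moves when the third block is created.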

What changes are included in this PR?

  • Implement the sketch for blocked approach
  • Implement blocked groups supporting PrimitiveGroupsAccumulator and GroupValuesPrimitive as the example

Are these changes tested?

Tested by existing tests, plus new unit tests and new fuzz tests.

Are there any user-facing changes?

Two functions are added to the GroupValues and GroupAccumulator traits.

But as you can see, there are default implementations for them, so users can choose to support the blocked approach only when they want better performance for their UDAFs.

    /// Returns `true` if this accumulator supports blocked groups.
    fn supports_blocked_groups(&self) -> bool {
        false
    }

    /// Alter the block size in the accumulator
    ///
    /// If the target block size is `None`, it will use a single big
    /// block (think of it as a `Vec`) to manage the state.
    ///
    /// If the target block size is `Some(blk_size)`, it will try to
    /// set the block size to `blk_size`; this only succeeds when the
    /// accumulator supports blocked mode.
    ///
    /// NOTICE: After altering the block size, all previous data will be cleared.
    ///
    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<()> {
        if block_size.is_some() {
            return Err(DataFusionError::NotImplemented(
                "this accumulator doesn't support blocked mode yet".to_string(),
            ));
        }

        Ok(())
    }
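
As a rough sketch of how a caller might use these two methods, the example below mirrors them on a stand-in trait. Everything here is an assumption for illustration: the `BlockedGroups` trait, the `FlatOnly` and `Blocked` types, the `try_enable_blocked` helper, and the "enable only if every accumulator supports it" policy are not taken from the PR, and `String` replaces `DataFusionError` so the sketch compiles on its own.

```rust
/// Stand-in for the two methods quoted above (hypothetical trait name).
trait BlockedGroups {
    fn supports_blocked_groups(&self) -> bool {
        false
    }

    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<(), String> {
        if block_size.is_some() {
            return Err("this accumulator doesn't support blocked mode yet".to_string());
        }
        Ok(())
    }
}

/// Hypothetical accumulator that keeps the defaults (flat mode only).
struct FlatOnly;
impl BlockedGroups for FlatOnly {}

/// Hypothetical accumulator that opts in to blocked mode.
#[derive(Default)]
struct Blocked {
    block_size: Option<usize>,
    state: Vec<u64>,
}

impl BlockedGroups for Blocked {
    fn supports_blocked_groups(&self) -> bool {
        true
    }

    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<(), String> {
        // Follows the NOTICE in the doc comment: previous data is cleared.
        self.state.clear();
        self.block_size = block_size;
        Ok(())
    }
}

/// Assumed policy: switch to blocked mode only if every accumulator
/// supports it; otherwise leave all of them in flat mode.
fn try_enable_blocked(
    accs: &mut [Box<dyn BlockedGroups>],
    block_size: usize,
) -> Result<bool, String> {
    if !accs.iter().all(|a| a.supports_blocked_groups()) {
        return Ok(false);
    }
    for acc in accs.iter_mut() {
        acc.alter_block_size(Some(block_size))?;
    }
    Ok(true)
}
```

Checking `supports_blocked_groups` first means the default `NotImplemented`-style error is never hit in normal operation; it only guards against a caller that skips the check.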

@Rachelint Rachelint changed the title from "Impl Intermeidate result blocked approach framework" to "Impl intermeidate result blocked approach framework" Apr 5, 2025
@Rachelint Rachelint changed the title from "Impl intermeidate result blocked approach framework" to "Impl intermeidate result blocked approach sketch" Apr 5, 2025
@github-actions github-actions Bot added the logical-expr Logical plan and expressions label Apr 5, 2025
@Dandandan

Contributor

Hi @Rachelint I think I have an alternative proposal that seems relatively easy to implement.
I'll share it with you once I have some time to validate the design (probably this evening).

@Rachelint

Rachelint commented Apr 8, 2025

Contributor Author

Hi @Rachelint I think I have an alternative proposal that seems relatively easy to implement. I'll share it with you once I have some time to validate the design (probably this evening).

Thanks a lot. The design in this PR does indeed still introduce quite a few code changes...

At first I tried not to modify anything in GroupAccumulator:

  • Only implement the blocked logic in GroupValues
  • Then reorder the input batch according to the block indices obtained from GroupValues
  • Apply the input batch to the related GroupAccumulator using slices
  • And when we find that a new block is needed, create a new GroupAccumulator (one GroupAccumulator per block)

But I found that this way introduces too much extra cost...

Maybe we could pass the block indices into values in merge/update_batch as an Array?

@Rachelint Rachelint force-pushed the intermeidate-result-blocked-approach branch 2 times, most recently from cc37eba to f690940 Compare April 9, 2025 14:37
@github-actions github-actions Bot added the functions Changes to functions implementation label Apr 10, 2025
@Rachelint Rachelint force-pushed the intermeidate-result-blocked-approach branch from 95c6a36 to a4c6f42 Compare April 10, 2025 11:10
@github-actions github-actions Bot added the physical-expr Changes to the physical-expr crates label Apr 10, 2025
@Rachelint Rachelint force-pushed the intermeidate-result-blocked-approach branch 6 times, most recently from 2100a5b to 0ee951c Compare April 17, 2025 11:56
@Rachelint

Rachelint commented Apr 17, 2025

Contributor Author

Development (and testing) of all the needed common structs is finished!
Remaining things for this one:

  • Support blocked related logic in GroupedHashAggregateStream (we can copy it from Sketch for aggregation intermediate results blocked management #11943)
  • Logic about deciding when we should enable this optimization
  • Example blocked version for GroupAccumulator and GroupValues
  • Unit test for blocked GroupValuesPrimitive, it is a bit complex
  • Fuzzy tests
  • Chore: fix docs, fix clippy, add more comments...

@Rachelint Rachelint force-pushed the intermeidate-result-blocked-approach branch 2 times, most recently from c51d409 to 2863809 Compare April 20, 2025 14:46
@github-actions github-actions Bot added execution Related to the execution crate common Related to common crate sqllogictest SQL Logic Tests (.slt) labels Apr 21, 2025
@Rachelint

Contributor Author

It is very close, just need to add more tests!

@Rachelint Rachelint force-pushed the intermeidate-result-blocked-approach branch 3 times, most recently from 31d660d to 2b8dd1e Compare April 22, 2025 18:52
Rachelint and others added 2 commits June 8, 2026 02:10
- Promote `push_block`/`pop_block` to `BlockStore` trait methods so any
  block store can be drained generically. `FlatBlockStore` implements
  them as direct replace + `mem::take`; `BlockedBlockStore` introduces
  `EmitContext` to lazily move accumulation blocks out on first pop and
  drain via cursor.
- Replace the `VecBlockStore<T>` extension trait with a
  `VecBlockStore<T, S>` struct that wraps any `S: BlockStore<Vec<T>>`
  and implements `emit` purely via `push_block`/`pop_block`, removing
  the per-store `emit` impls.
- Update `PrimitiveGroupsState` and `GroupValuesPrimitiveState` to bound
  the inner store with `BlockStore<Vec<V>> + Send`, hold the wrapper as
  `VecBlockStore<V, VB>`, and add `V: Send` where the closure passed to
  `NullState::accumulate` requires it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anBlock>

- Replace the `SeenValueStore` extension trait (and the per-store `emit`
  impls on `FlatBlockStore<BooleanBlock>` / `BlockedBlockStore<BooleanBlock>`)
  with a `SeenValueStore<S>` struct that wraps any
  `S: BlockStore<BooleanBlock>` and implements `emit` purely via
  `pop_block` + `BooleanBlock::finish` + `push_block`.
- Update `SeenValues` and `NullState` to bound `S` with
  `BlockStore<BooleanBlock>` and hold the wrapper as `SeenValueStore<S>`;
  `NullState::new` wraps the empty builder internally so callers stay
  unchanged.
- Update `PrimitiveGroupsState` bound from `SeenValueStore + Send` to
  `BlockStore<BooleanBlock> + Send`.
- Keep only the inherent methods that have call sites (`set_bit`, `size`,
  `resize`, `num_blocks`, `emit`) plus `Index`/`IndexMut`; drop the
  unused `push_block`/`pop_block`/`allocate_block`/`is_empty`/`clear`
  delegators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alamb

alamb commented Jun 16, 2026

Contributor

Marking as a draft as I don't think this one is ready to merge quite yet and I am trying to clean up the review / merge queue

@Rachelint

Rachelint commented Jun 18, 2026

Contributor Author

Marking as a draft as I don't think this one is ready to merge quite yet and I am trying to clean up the review / merge queue

Yes, and I think the whole feature will be suitable to push forward once the aggregation refactoring is stable.

However, there are actually two parts in this PR:

  • One part is about refactoring GroupValues and GroupAccumulator
  • The other part is about applying the blocked logic in aggregation

How about we split this PR into two or more? And push part one (GroupValues and GroupAccumulator) forward in parallel with the aggregation refactoring?
@alamb @2010YOUY01 @ariel-miculas

@Rachelint

Contributor Author

And before splitting, I will keep working in this one to make sure, and prove, that the refactoring of GroupValues and GroupAccumulator will not lead to a regression.

@ariel-miculas

Contributor

How about we split this PR into two or more? And push part one (GroupValues and GroupAccumulator) forward in parallel with the aggregation refactoring?

I think it's a good idea, this is important work and it would be easier to review if split into smaller PRs.

@2010YOUY01

Contributor

Marking as a draft as I don't think this one is ready to merge quite yet and I am trying to clean up the review / merge queue

Yes, and I think the whole feature will be suitable to push forward once the aggregation refactoring is stable.

However, there are actually two parts in this PR:

  • One part is about refactoring GroupValues and GroupAccumulator
  • The other part is about applying the blocked logic in aggregation

How about we split this PR into two or more? And push part one (GroupValues and GroupAccumulator) forward in parallel with the aggregation refactoring? @alamb @2010YOUY01 @ariel-miculas

I think the steps are

  1. Complete [EPIC] Split Aggregation Logic into Dedicated Streams #22710
  2. Initial PR for blocked states: The major issue is to agree on the API changes for GroupValues and GroupsAccumulator, and how to organize future work.
  3. Update all GroupValues and GroupsAccumulator (There are around 20 of them IIRC)

The performance seems to be a nearly solved issue, the PoC already showed high cardinality cases are faster (with several micro optimizations left on the table), low cardinality is slightly slower but @alamb's suggestion in #22712 (comment) is doable I think, to bring back the performance.

I suggest not trying to parallelize steps 1 and 2, as they will likely conflict with each other. Step 3 should be highly parallelizable.

As for the refactoring progress, I'd estimate it's about 50% complete. I haven't seen any major technical blockers so far—just need some time to better structure the implementation.

@Rachelint

Rachelint commented Jun 18, 2026

Contributor Author

I suggest not trying to parallelize steps 1 and 2, as they will likely conflict with each other. Step 3 should be highly parallelizable...

Makes sense.

The performance seems to be a nearly solved issue...

Yes, and based on my earlier experiments I actually think it makes little difference to performance: some steps improve (removing the slicing of record batches, removing Vec resizing), some steps regress (we need to perform two index operations), and in the end there is close to no difference. It is just a better memory management approach.
#15591 (comment)
And I am working to make sure it causes no regression when the feature is disabled (the regression happened because we needed to perform two index operations even when it was disabled).
#15591 (comment)
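
The "two index operations" cost can be pictured with a small sketch. This is an assumption for illustration, not the PR's code: the packing scheme (block id in the high 32 bits, offset in the low 32 bits) and the names `pack`, `unpack`, `get_flat`, and `get_blocked` are all hypothetical.

```rust
/// Hypothetical packing: high 32 bits hold the block id,
/// low 32 bits hold the offset inside that block.
fn pack(block_id: u32, offset: u32) -> u64 {
    ((block_id as u64) << 32) | (offset as u64)
}

/// Splits a packed index back into (block id, offset).
fn unpack(packed: u64) -> (usize, usize) {
    ((packed >> 32) as usize, (packed & 0xFFFF_FFFF) as usize)
}

/// Flat mode: a single index operation.
fn get_flat(values: &[i64], group_index: usize) -> i64 {
    values[group_index]
}

/// Blocked mode: decode the index, then two index operations
/// (first the block, then the slot inside the block).
fn get_blocked(blocks: &[Vec<i64>], packed: u64) -> i64 {
    let (block_id, offset) = unpack(packed);
    blocks[block_id][offset]
}
```

The extra decode and the second bounds-checked lookup are the per-row overhead discussed above; whether it is measurable depends on the workload.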

@ariel-miculas

Contributor

Yes, and based on my earlier experiments I actually think it makes little difference to performance: some steps improve (removing the slicing of record batches, removing Vec resizing), some steps regress (we need to perform two index operations), and in the end there is close to no difference. It is just a better memory management approach.

I disagree, since the memory management is directly tied to performance via the spilling mechanism when running with memory limits configured. See #22526 (comment)
The memory overaccounting issues caused by the current design of hash aggregation have a real performance impact in the downstream operators which are either:

So I believe the new "blocked" approach will have significant performance improvements in production-like workloads.

@2010YOUY01

Contributor

Yes, and based on my earlier experiments I actually think it makes little difference to performance: some steps improve (removing the slicing of record batches, removing Vec resizing), some steps regress (we need to perform two index operations), and in the end there is close to no difference. It is just a better memory management approach.

I disagree, since the memory management is directly tied to performance via the spilling mechanism when running with memory limits configured. See #22526 (comment) The memory overaccounting issues caused by the current design of hash aggregation have a real performance impact in the downstream operators which are either:

So I believe the new "blocked" approach will have significant performance improvements in production-like workloads.

I agree we could proceed first without worrying too much about the benchmark numbers.

This is like a tradeoff between micro-optimizations and algorithmic improvements to memory efficiency.

I think completely giving up 10%-ish performance for an architectural win is already a good idea. But realistically, I also believe it should be possible to avoid the regressions entirely with some low-level optimizations, but we'd better discuss those opportunities later.

@Rachelint

Rachelint commented Jun 20, 2026

Contributor Author

I disagree, since the memory management is directly tied to performance via the spilling mechanism when running with memory limits configured.

Good point, "no difference to performance" may only hold for the benchmarks.

@adriangb

Contributor

We can run benchmarks with memory limits to force spilling if that helps

@github-actions

github-actions Bot commented Jun 22, 2026


Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 106.692s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.037s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [ 103.270s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.038s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.884s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 212.512s] datafusion
    Building datafusion-common v54.0.0 (current)
       Built [  32.136s] (current)
     Parsing datafusion-common v54.0.0 (current)
      Parsed [   0.067s] (current)
    Building datafusion-common v54.0.0 (baseline)
       Built [  32.790s] (baseline)
     Parsing datafusion-common v54.0.0 (baseline)
      Parsed [   0.065s] (baseline)
    Checking datafusion-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.039s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ExecutionOptions.enable_aggregation_blocked_groups in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:723

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  67.915s] datafusion-common
    Building datafusion-expr-common v54.0.0 (current)
       Built [  18.659s] (current)
     Parsing datafusion-expr-common v54.0.0 (current)
      Parsed [   0.019s] (current)
    Building datafusion-expr-common v54.0.0 (baseline)
       Built [  18.871s] (baseline)
     Parsing datafusion-expr-common v54.0.0 (baseline)
      Parsed [   0.020s] (baseline)
    Checking datafusion-expr-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.302s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_variant_added.ron

Failed in:
  variant EmitTo:NextBlock in /home/runner/work/datafusion/datafusion/datafusion/expr-common/src/groups_accumulator.rs:39

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  38.677s] datafusion-expr-common
    Building datafusion-ffi v54.0.0 (current)
       Built [  60.353s] (current)
     Parsing datafusion-ffi v54.0.0 (current)
      Parsed [   0.067s] (current)
    Building datafusion-ffi v54.0.0 (baseline)
       Built [  59.553s] (baseline)
     Parsing datafusion-ffi v54.0.0 (baseline)
      Parsed [   0.068s] (baseline)
    Checking datafusion-ffi v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.392s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 122.872s] datafusion-ffi
    Building datafusion-functions-aggregate v54.0.0 (current)
       Built [  30.409s] (current)
     Parsing datafusion-functions-aggregate v54.0.0 (current)
      Parsed [   0.048s] (current)
    Building datafusion-functions-aggregate v54.0.0 (baseline)
       Built [  29.847s] (baseline)
     Parsing datafusion-functions-aggregate v54.0.0 (baseline)
      Parsed [   0.049s] (baseline)
    Checking datafusion-functions-aggregate v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.278s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  62.108s] datafusion-functions-aggregate
    Building datafusion-functions-aggregate-common v54.0.0 (current)
       Built [  20.139s] (current)
     Parsing datafusion-functions-aggregate-common v54.0.0 (current)
      Parsed [   0.024s] (current)
    Building datafusion-functions-aggregate-common v54.0.0 (baseline)
       Built [  20.215s] (baseline)
     Parsing datafusion-functions-aggregate-common v54.0.0 (baseline)
      Parsed [   0.021s] (baseline)
    Checking datafusion-functions-aggregate-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.188s] 223 checks: 219 pass, 4 fail, 0 warn, 30 skip

--- failure enum_struct_variant_field_added: pub enum struct variant field added ---

Description:
An enum's exhaustive struct variant has a new field, which has to be included when constructing or matching on this variant.
        ref: https://doc.rust-lang.org/reference/attributes/type_system.html#the-non_exhaustive-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_struct_variant_field_added.ron

Failed in:
  field pending_builder of variant SeenValues::All in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:113
  field builder of variant SeenValues::Some in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:116

--- failure enum_struct_variant_field_missing: pub enum struct variant's field removed or renamed ---

Description:
A publicly-visible enum has a struct variant whose field is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_struct_variant_field_missing.ron

Failed in:
  field values of variant SeenValues::Some, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/75d3c87db2328833532578616ff1c9c11e735e05/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:46

--- failure trait_requires_more_generic_type_params: trait now requires more generic type parameters ---

Description:
A trait now requires more generic type parameters than it used to. Uses of this trait that supplied the previously-required number of generic types will be broken. To fix this, consider supplying default values for newly-added generic types.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-new-parameter-no-default
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/trait_requires_more_generic_type_params.ron

Failed in:
  trait NullState (0 -> 2 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:305
  trait SeenValues (0 -> 1 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:99

--- failure type_requires_more_generic_type_params: type now requires more generic type parameters ---

Description:
A type now requires more generic type parameters than it used to. Uses of this type that supplied the previously-required number of generic types will be broken. To fix this, consider supplying default values for newly-added generic types.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-new-parameter-no-default
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/type_requires_more_generic_type_params.ron

Failed in:
  Struct NullState (0 -> 2 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:305
  Enum SeenValues (0 -> 1 required generic types) in /home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:99

     Summary semver requires new major version: 4 major and 0 minor checks failed
    Finished [  41.530s] datafusion-functions-aggregate-common
    Building datafusion-physical-expr v54.0.0 (current)
       Built [  28.523s] (current)
     Parsing datafusion-physical-expr v54.0.0 (current)
      Parsed [   0.051s] (current)
    Building datafusion-physical-expr v54.0.0 (baseline)
       Built [  28.843s] (baseline)
     Parsing datafusion-physical-expr v54.0.0 (baseline)
      Parsed [   0.054s] (baseline)
    Checking datafusion-physical-expr v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.496s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  58.823s] datafusion-physical-expr
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  36.865s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.142s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  37.869s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.143s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.857s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  78.086s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 193.327s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 182.351s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.124s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 380.044s] datafusion-sqllogictest
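
For context on the `enum_variant_added` failure above, the sketch below shows the usual mitigation described in the Cargo SemVer reference: marking a public enum `#[non_exhaustive]` so that adding a variant later is not a breaking change. `EmitToSketch` is a hypothetical stand-in, not the real `EmitTo` definition, and whether the project wants this attribute on `EmitTo` is not something this thread decides.

```rust
/// Hypothetical stand-in for an enum like `EmitTo`.
/// `#[non_exhaustive]` forces downstream crates to include a wildcard
/// arm, so a variant added later does not break their `match`.
#[non_exhaustive]
#[derive(Clone, Copy)]
enum EmitToSketch {
    All,
    First(usize),
    NextBlock,
}

/// Written the way a downstream crate would have to write it:
/// known variants are handled, everything else goes to `_`.
fn describe(emit_to: EmitToSketch) -> String {
    match emit_to {
        EmitToSketch::All => "all".to_string(),
        EmitToSketch::First(n) => format!("first {}", n),
        // Covers `NextBlock` and any variant added in the future.
        _ => "other".to_string(),
    }
}
```

Note that the attribute only restricts code in other crates; inside the defining crate an exhaustive `match` is still allowed.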

@Rachelint

Contributor Author

We can run benchmarks with memory limits to force spilling if that helps

@adriangb hello, is it possible to authorize me to trigger benchmarks through the bot?
I have been working on improving aggregation performance recently.

@alamb

alamb commented Jun 23, 2026

Contributor

We can run benchmarks with memory limits to force spilling if that helps

@adriangb hello, is it possible to authorize me to trigger benchmarks through the bot? I have been working on improving aggregation performance recently.

I also sent @adriangb a direct message

@adriangb

Contributor

done in adriangb/datafusion-benchmarking#16

@Rachelint

Contributor Author

@adriangb @alamb Thanks!

@Rachelint

Contributor Author

run benchmarks clickbench_partitioned

@adriangbot


🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4785188060-647-xvzht 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing intermeidate-result-blocked-approach (5869167) to a27f030 (merge-base) diff using: clickbench_partitioned
Results will be posted here when complete



@adriangbot


🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and intermeidate-result-blocked-approach
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃   intermeidate-result-blocked-approach ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.23 / 3.91 ±5.30 / 14.51 ms │           1.22 / 3.90 ±5.29 / 14.48 ms │     no change │
│ QQuery 1  │        12.52 / 12.80 ±0.16 / 13.00 ms │         12.77 / 12.87 ±0.05 / 12.93 ms │     no change │
│ QQuery 2  │        35.72 / 35.98 ±0.25 / 36.35 ms │         35.93 / 36.41 ±0.60 / 37.43 ms │     no change │
│ QQuery 3  │        30.19 / 30.83 ±0.71 / 32.21 ms │         30.03 / 30.36 ±0.27 / 30.80 ms │     no change │
│ QQuery 4  │     220.44 / 222.86 ±1.27 / 223.76 ms │      232.67 / 236.08 ±2.39 / 239.41 ms │  1.06x slower │
│ QQuery 5  │     275.25 / 278.35 ±2.41 / 281.95 ms │      265.43 / 268.94 ±3.19 / 273.97 ms │     no change │
│ QQuery 6  │           1.26 / 1.41 ±0.23 / 1.86 ms │            1.25 / 1.40 ±0.22 / 1.84 ms │     no change │
│ QQuery 7  │        14.28 / 14.44 ±0.09 / 14.54 ms │         13.64 / 13.81 ±0.15 / 14.01 ms │     no change │
│ QQuery 8  │     330.48 / 335.49 ±3.02 / 339.34 ms │      315.81 / 318.80 ±2.80 / 323.81 ms │     no change │
│ QQuery 9  │     473.44 / 483.37 ±5.48 / 490.16 ms │      454.22 / 464.83 ±7.98 / 477.60 ms │     no change │
│ QQuery 10 │        70.99 / 73.29 ±3.74 / 80.69 ms │         68.19 / 72.03 ±6.11 / 84.20 ms │     no change │
│ QQuery 11 │        81.42 / 83.29 ±1.46 / 84.98 ms │         79.85 / 81.96 ±1.08 / 82.78 ms │     no change │
│ QQuery 12 │     277.56 / 283.69 ±5.17 / 289.56 ms │      262.77 / 267.14 ±3.56 / 271.51 ms │ +1.06x faster │
│ QQuery 13 │    380.32 / 394.52 ±11.41 / 413.87 ms │      363.62 / 380.65 ±9.00 / 389.67 ms │     no change │
│ QQuery 14 │     288.98 / 295.10 ±4.33 / 301.87 ms │      277.79 / 281.39 ±2.72 / 285.07 ms │     no change │
│ QQuery 15 │     283.82 / 292.30 ±5.70 / 300.28 ms │      281.75 / 293.14 ±7.76 / 302.75 ms │     no change │
│ QQuery 16 │    624.39 / 638.83 ±10.83 / 657.01 ms │      605.26 / 613.60 ±4.85 / 619.95 ms │     no change │
│ QQuery 17 │     630.65 / 646.02 ±8.95 / 656.39 ms │      617.25 / 623.47 ±6.64 / 632.77 ms │     no change │
│ QQuery 18 │ 1291.95 / 1322.58 ±24.46 / 1363.70 ms │  1240.33 / 1261.01 ±16.16 / 1285.74 ms │     no change │
│ QQuery 19 │        27.94 / 28.41 ±0.26 / 28.70 ms │         27.41 / 29.59 ±4.28 / 38.15 ms │     no change │
│ QQuery 20 │    515.48 / 529.33 ±11.54 / 544.72 ms │      517.30 / 521.34 ±3.04 / 526.61 ms │     no change │
│ QQuery 21 │     518.96 / 526.74 ±5.53 / 535.97 ms │      509.46 / 519.34 ±5.92 / 526.44 ms │     no change │
│ QQuery 22 │ 1003.60 / 1030.83 ±23.45 / 1060.15 ms │      982.29 / 992.46 ±5.89 / 999.88 ms │     no change │
│ QQuery 23 │ 3163.56 / 3208.28 ±48.54 / 3295.42 ms │  3151.81 / 3211.18 ±40.42 / 3266.77 ms │     no change │
│ QQuery 24 │        41.49 / 51.52 ±5.77 / 58.79 ms │         41.80 / 42.67 ±0.98 / 44.35 ms │ +1.21x faster │
│ QQuery 25 │     111.27 / 111.96 ±0.89 / 113.61 ms │      111.04 / 113.57 ±4.15 / 121.80 ms │     no change │
│ QQuery 26 │        41.50 / 42.35 ±0.50 / 42.87 ms │         41.51 / 42.27 ±0.70 / 43.54 ms │     no change │
│ QQuery 27 │     664.58 / 675.43 ±5.79 / 680.45 ms │     674.35 / 685.78 ±14.19 / 710.56 ms │     no change │
│ QQuery 28 │  3030.00 / 3034.51 ±4.91 / 3042.90 ms │   3020.15 / 3034.08 ±7.84 / 3043.74 ms │     no change │
│ QQuery 29 │        40.73 / 45.01 ±6.31 / 57.27 ms │         40.36 / 40.66 ±0.50 / 41.65 ms │ +1.11x faster │
│ QQuery 30 │    296.81 / 309.65 ±12.07 / 331.50 ms │     297.35 / 308.84 ±12.39 / 328.12 ms │     no change │
│ QQuery 31 │     278.98 / 288.15 ±7.22 / 299.13 ms │      276.96 / 284.72 ±5.08 / 289.46 ms │     no change │
│ QQuery 32 │   916.77 / 959.08 ±35.57 / 1013.44 ms │    925.97 / 977.05 ±44.21 / 1052.50 ms │     no change │
│ QQuery 33 │ 1426.83 / 1454.86 ±18.33 / 1478.22 ms │  1437.69 / 1473.51 ±29.58 / 1512.16 ms │     no change │
│ QQuery 34 │ 1486.59 / 1541.22 ±54.65 / 1639.59 ms │ 1487.08 / 1563.69 ±109.94 / 1781.94 ms │     no change │
│ QQuery 35 │    274.38 / 311.25 ±31.15 / 354.24 ms │     288.85 / 331.06 ±72.24 / 475.19 ms │  1.06x slower │
│ QQuery 36 │        63.27 / 68.48 ±3.65 / 73.27 ms │         65.90 / 70.54 ±4.44 / 78.58 ms │     no change │
│ QQuery 37 │       36.26 / 43.59 ±12.86 / 69.24 ms │         35.18 / 40.41 ±4.38 / 44.95 ms │ +1.08x faster │
│ QQuery 38 │        42.87 / 46.52 ±3.37 / 50.52 ms │         40.11 / 42.59 ±1.46 / 44.01 ms │ +1.09x faster │
│ QQuery 39 │     143.45 / 148.86 ±6.48 / 160.77 ms │      146.04 / 151.46 ±5.63 / 160.01 ms │     no change │
│ QQuery 40 │        13.90 / 21.07 ±9.56 / 39.77 ms │         13.71 / 14.23 ±0.68 / 15.52 ms │ +1.48x faster │
│ QQuery 41 │        13.54 / 15.77 ±2.88 / 21.42 ms │         13.26 / 13.56 ±0.19 / 13.81 ms │ +1.16x faster │
│ QQuery 42 │        12.83 / 13.14 ±0.17 / 13.37 ms │         12.93 / 13.12 ±0.10 / 13.20 ms │     no change │
└───────────┴───────────────────────────────────────┴────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                   ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                   │ 19955.07ms │
│ Total Time (intermeidate-result-blocked-approach)   │ 19779.53ms │
│ Average Time (HEAD)                                 │   464.07ms │
│ Average Time (intermeidate-result-blocked-approach) │   459.99ms │
│ Queries Faster                                      │          7 │
│ Queries Slower                                      │          2 │
│ Queries with No Change                              │         34 │
│ Queries with Failure                                │          0 │
└─────────────────────────────────────────────────────┴────────────┘
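The Change column above looks consistent with comparing the mean time of each query (assuming the middle figure in each `min / mean ±stddev / max` cell is the mean) and treating relative differences of at most 5% as noise. That rule is inferred from the rows of this table, not taken from the benchmark runner's source. A minimal Rust sketch under that assumption, with `Change`, `classify`, and `label` as hypothetical names:

```rust
/// Outcome of comparing one query's mean time on the branch against the baseline.
#[derive(Debug, PartialEq)]
enum Change {
    NoChange,
    /// Ratio baseline_mean / branch_mean (greater than 1.0).
    Faster(f64),
    /// Ratio branch_mean / baseline_mean (greater than 1.0).
    Slower(f64),
}

/// Classify two mean times given in milliseconds.
/// `threshold` is the relative difference treated as noise; 0.05 is an
/// assumption inferred from the table, not a documented setting.
fn classify(baseline_ms: f64, branch_ms: f64, threshold: f64) -> Change {
    let delta = (branch_ms - baseline_ms) / baseline_ms;
    if delta.abs() <= threshold {
        Change::NoChange
    } else if delta < 0.0 {
        Change::Faster(baseline_ms / branch_ms)
    } else {
        Change::Slower(branch_ms / baseline_ms)
    }
}

/// Render a classification in the same style as the Change column.
fn label(change: &Change) -> String {
    match change {
        Change::NoChange => "no change".to_string(),
        Change::Faster(ratio) => format!("+{:.2}x faster", ratio),
        Change::Slower(ratio) => format!("{:.2}x slower", ratio),
    }
}
```

For example, `label(&classify(222.86, 236.08, 0.05))` reproduces the query 4 entry, while query 8 (335.49 ms vs 318.80 ms) has a relative difference just under 5% and so falls into "no change" even though its ratio rounds to 1.05.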

Resource Usage

clickbench_partitioned — base (merge-base)

| Metric | Value |
| --- | --- |
| Wall time | 105.0s |
| Peak memory | 11.0 GiB |
| Avg memory | 4.3 GiB |
| CPU user | 1013.6s |
| CPU sys | 70.7s |
| Peak spill | 0 B |

clickbench_partitioned — branch

| Metric | Value |
| --- | --- |
| Wall time | 100.0s |
| Peak memory | 11.4 GiB |
| Avg memory | 4.3 GiB |
| CPU user | 1003.5s |
| CPU sys | 71.2s |
| Peak spill | 0 B |



Labels

auto detected api change, common, core, documentation, ffi, functions, logical-expr, physical-expr, physical-plan, sqllogictest

10 participants