feat(vectorized): add VectorizedGroupByOperator with MIN/MAX support and extended tests #59

Merged
poyrazK merged 7 commits into main from feature/vectorized-groupby on Apr 23, 2026
@poyrazK
Owner

@poyrazK poyrazK commented Apr 21, 2026

Summary

  • Add VectorizedGroupByOperator to the vectorized execution path with hash-based grouping
  • Add MIN/MAX aggregate support to the GROUP BY operator
  • Add comprehensive tests: multiple-column GROUP BY, MIN/MAX, NULL keys, key value verification

Changes

  • include/executor/vectorized_operator.hpp: Added MIN/MAX handling in update_accumulators() and produce_output_batch()
  • tests/vectorized_operator_tests.cpp: Added 4 new tests (MultipleColumnGroupBy, MinMaxAggregates, NullGroupKeys, VerifyGroupKeyValues)

Test plan

  • All 21 vectorized_operator_tests pass
  • CI passes

Related

Closes #56 (partially - completes vectorized GROUP BY feature)

Summary by CodeRabbit

  • New Features
    • Added GROUP BY support to the vectorized execution engine with hash-based grouped aggregation
    • Supports COUNT(*), SUM, MIN, and MAX aggregate functions for INT64 and FLOAT64 columns
    • Enables grouping by multiple columns with proper NULL key handling
    • Handles streaming input efficiently across multiple batches

poyrazK added 4 commits April 21, 2026 13:51

Add VectorizedGroupByOperator for batch-at-a-time grouped aggregation.
- Hash-based grouping using unordered_map
- Supports COUNT and SUM aggregates
- Two-phase processing: Input (populate hash table) then Output (serve groups)

Add VectorizedGroupByOperator for batch-at-a-time grouped aggregation:
- Hash-based grouping using unordered_map
- Supports COUNT and SUM aggregates
- Two-phase processing: Input (populate hash table) then Output
- Store original Value types in group keys

Add 4 tests:
- SingleGroup: GROUP BY with 1 group
- MultipleGroups: GROUP BY with 3 groups
- EmptyInput: GROUP BY on empty table
- MultiBatchGroups: 2500 rows with 10 groups across batches

- Added VectorizedGroupByOperator to vectorized operators list
- Added new section 5 documenting Vectorized GROUP BY feature
- Mentioned supported aggregates (COUNT, SUM)

- Add MIN/MAX aggregate handling in VectorizedGroupByOperator
- Update update_accumulators to track min/max per group
- Update produce_output_batch to output MIN/MAX values
- Add MultipleColumnGroupBy test (2-column GROUP BY)
- Add MinMaxAggregates test
- Add NullGroupKeys test
- Add VerifyGroupKeyValues test

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f6857203-70c7-48cf-9c89-e04bd9aa98bc

📥 Commits

Reviewing files that changed from the base of the PR and between 7706a7d and 2f5bb4a.

📒 Files selected for processing (2)
  • docs/phases/PHASE_8_ANALYTICS.md
  • include/executor/vectorized_operator.hpp
📝 Walkthrough

Walkthrough

A new VectorizedGroupByOperator has been implemented to support GROUP BY operations in the vectorized execution engine. The operator uses hash-based grouped aggregation with a two-phase control flow: first consuming all input batches to populate a hash table, then emitting grouped results in subsequent calls. The operator supports COUNT(*), SUM, MIN, and MAX aggregates.
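
The two-phase flow described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Row`, `Batch`, `GroupState`, `GroupResult`, and `GroupBySketch` are hypothetical stand-ins, the key is a single nullable string, and only INT64 values are aggregated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the project's batch and per-group state types.
struct Row {
    std::optional<std::string> key;  // std::nullopt models a NULL group key
    std::int64_t value;
};
using Batch = std::vector<Row>;

struct GroupState {
    std::int64_t count = 0;
    std::int64_t sum = 0;
    std::int64_t min = std::numeric_limits<std::int64_t>::max();
    std::int64_t max = std::numeric_limits<std::int64_t>::min();
};

struct GroupResult {
    std::optional<std::string> key;
    GroupState agg;
};

class GroupBySketch {
public:
    // Phase 1 (Input): fold one batch into the hash table.
    void consume(const Batch& batch) {
        for (const Row& row : batch) {
            // Tag the key so NULL never collides with a real string value.
            const std::string encoded = row.key ? "V" + *row.key : "N";
            auto [it, inserted] = index_.try_emplace(encoded, groups_.size());
            if (inserted) groups_.push_back(GroupResult{row.key, GroupState{}});
            GroupState& g = groups_[it->second].agg;
            g.count += 1;
            g.sum += row.value;
            g.min = std::min(g.min, row.value);
            g.max = std::max(g.max, row.value);
        }
    }

    // Phase 2 (Output): emit up to batch_size groups per call.
    // An empty result signals that all groups have been served.
    std::vector<GroupResult> next_output(std::size_t batch_size) {
        std::vector<GroupResult> out;
        while (cursor_ < groups_.size() && out.size() < batch_size) {
            out.push_back(groups_[cursor_++]);
        }
        return out;
    }

private:
    std::unordered_map<std::string, std::size_t> index_;  // encoded key -> slot
    std::vector<GroupResult> groups_;                     // first-seen order
    std::size_t cursor_ = 0;
};
```

Keeping the groups in a vector and storing only slot indices in the map is a choice made for this sketch; it gives a stable first-seen output order across batches, which the operator in this PR may or may not guarantee.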

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: docs/phases/PHASE_8_ANALYTICS.md | Added Phase 8 documentation describing the new VectorizedGroupByOperator, its hash-based implementation, two-phase flow, and supported aggregates (COUNT(*), SUM, MIN, MAX). |
| Vectorized Operator Implementation: include/executor/vectorized_operator.hpp | Added new VectorizedGroupState struct to hold per-group accumulator data and new VectorizedGroupByOperator class implementing two-phase grouped aggregation with hash table management, NULL handling, and support for multiple aggregate functions. |
| Operator Tests: tests/vectorized_operator_tests.cpp | Added comprehensive test suite for VectorizedGroupByOperator covering single/multi-group aggregation, empty inputs, multi-batch streaming, multiple key columns, MIN/MAX correctness, NULL key handling, and output validation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Groups gather 'round the hash table's call,
Where vectors dance and aggregates enthrall,
Two phases waltz through batches stream by stream,
COUNT and SUM and MIN fulfill the dream,
🥕 Grouped harmony blooms—no data left behind!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ⚠️ Warning | The PR objectives state the changes implement a new VectorizedGroupByOperator feature, but the linked issue #56 requires fixing a bug in VectorizedFilterOperator. No changes to VectorizedFilterOperator are documented. | Address the bug in VectorizedFilterOperator by removing the early return and ensuring all child batches are consumed until EOF, as specified in issue #56. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Out of Scope Changes check | ❓ Inconclusive | The PR implements VectorizedGroupByOperator with MIN/MAX support, which is in scope for completing vectorized GROUP BY. However, the linked issue #56 requires fixing VectorizedFilterOperator, which is not addressed in this PR. | Clarify whether this PR is intended to address both the new VectorizedGroupByOperator feature and the VectorizedFilterOperator bug fix, or if they should be separate PRs. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding VectorizedGroupByOperator with MIN/MAX support and tests. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/executor/vectorized_operator.hpp (1)

9-16: ⚠️ Potential issue | 🟡 Minor

Add explicit #include <unordered_map> header.

std::unordered_map is used on line 318 but is not explicitly included in this file. Currently it compiles due to transitive inclusion through storage/columnar_table.hpp → storage/storage_manager.hpp, but this dependency is fragile and could break if the transitive chain changes. Include it explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/executor/vectorized_operator.hpp` around lines 9 - 16, The header is
missing a direct include for std::unordered_map used later; add an explicit
`#include` <unordered_map> at the top of include/executor/vectorized_operator.hpp
(near the other standard headers) so usages in this file (e.g., any references
to std::unordered_map in the VectorizedOperator-related declarations) no longer
rely on fragile transitive includes such as those from
storage/columnar_table.hpp or storage_manager.hpp.
🧹 Nitpick comments (1)
include/executor/vectorized_operator.hpp (1)

366-400: Cache group-by column indices once in the constructor.

schema.find_column(group_by_[i]->to_string()) is invoked for every input row (and twice when the key is new). For a 2500-row, multi-key workload this is N×G string-keyed lookups plus expression to_string() rebuilds per row. Resolve the expressions to column indices once in the constructor and reuse the cached vector in the hot loop. This also makes it trivial to fail fast if a group-by expression does not resolve against the child schema (see related comment on NULL/missing-column conflation).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/executor/vectorized_operator.hpp` around lines 366 - 400, The hot
loop in process_input_batch repeatedly calls
child_->output_schema().find_column(group_by_[i]->to_string()), rebuilding
strings and doing N×G lookups; fix this by resolving group_by_ once in the
constructor: add a member vector<size_t> (e.g. group_by_col_indices_) populated
in the class ctor by iterating group_by_ and calling find_column exactly once
per expression, and if any expression fails either record a sentinel
(static_cast<size_t>(-1)) or fail fast with an error; then update
process_input_batch to use group_by_col_indices_ for both key-building and
key_vals population instead of calling find_column/to_string() repeatedly,
leaving groups_, group_keys_, and group_values_ logic unchanged.
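
A rough sketch of the caching idea in the nitpick above, under stated assumptions: `SchemaSketch`, `resolve_group_by`, and `extract_key` are hypothetical names, columns are identified by plain strings rather than expression objects, and an unresolved column throws `std::invalid_argument` instead of going through the operator's `set_error()` or a `static_cast<size_t>(-1)` sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical schema: column names only, looked up linearly by name.
struct SchemaSketch {
    std::vector<std::string> names;

    std::optional<std::size_t> find_column(const std::string& name) const {
        for (std::size_t i = 0; i < names.size(); ++i) {
            if (names[i] == name) return i;
        }
        return std::nullopt;
    }
};

// Done once, e.g. in the operator constructor: one lookup per GROUP BY column.
// Throws on the first unresolved column so a bad expression fails fast.
std::vector<std::size_t> resolve_group_by(const SchemaSketch& schema,
                                          const std::vector<std::string>& group_by) {
    std::vector<std::size_t> indices;
    indices.reserve(group_by.size());
    for (const std::string& name : group_by) {
        const std::optional<std::size_t> idx = schema.find_column(name);
        if (!idx) throw std::invalid_argument("unresolved GROUP BY column: " + name);
        indices.push_back(*idx);
    }
    return indices;
}

// Hot-loop body: integer indexing only, no string building or name lookups.
std::vector<std::int64_t> extract_key(const std::vector<std::int64_t>& row,
                                      const std::vector<std::size_t>& indices) {
    std::vector<std::int64_t> key;
    key.reserve(indices.size());
    for (std::size_t idx : indices) key.push_back(row[idx]);
    return key;
}
```

With this shape the per-row cost depends only on the number of key columns; the string work happens once per operator instance.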
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/phases/PHASE_8_ANALYTICS.md`:
- Around line 30-35: Update the docs section describing
VectorizedGroupByOperator to reflect that MIN and MAX aggregates are now
supported (in addition to COUNT(*) and SUM); mention that MIN/MAX operate on
INT64 and FLOAT64 columns, describe that they are handled in the same two-phase
hash-based grouping flow, and cite the MinMaxAggregates test as the
implementation reference for behavior and type support.

In `@include/executor/vectorized_operator.hpp`:
- Around line 468-487: produce_output_batch currently always emits Sum as
Value::make_float64(state.sums[i]); change it to mirror
VectorizedAggregateOperator by preserving separate int64 and float64
accumulators (e.g., state.sums_int64 and state.sums_float64 or a tagged sum in
the aggregation state) and, inside the Sum case of produce_output_batch, branch
on the output column type/schema to append either Value::make_int64(...) when
the column is INT64 or Value::make_float64(...) when it is FLOAT64 (or
null/convert as appropriate); update the aggregate state and accumulation logic
where aggregates_[i].type == AggregateType::Sum to maintain both accumulators
(or a typed accumulator) so large INT64 sums do not lose precision when emitted.
- Around line 366-406: process_input_batch builds ambiguous string keys that can
collide and conflates missing columns with NULLs; change key construction to a
collision-safe encoding (e.g., length-prefixed or type-tagged fields and a
dedicated NULL marker) instead of simple "|" concatenation in the loop that
inspects group_by_; on find_column(...) == static_cast<size_t>(-1) call
set_error(...) (do not silently emit "NULL|") so unresolved group-by expressions
fail fast; update places that insert into groups_ (emplace(...)), group_keys_,
and group_values_ to use the new encoded key and distinct NULL handling, and
leave update_accumulators(...) unchanged except it should consume the new safe
encoding where needed.
- Around line 325-329: The derived class declares a member ProcessState state_
that shadows the base VectorizedOperator::state_ (ExecState); rename the derived
enum/member (e.g., enum ProcessState -> ProcessPhase and state_ ->
process_phase_ or process_state_) and update all references within the class
(constructors, methods, comparisons) to use the new name so the base ExecState
member (VectorizedOperator::state_) is not hidden and callers that use base
state or set_error() remain consistent.
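
To illustrate the key-collision finding above, here is a small sketch contrasting "|"-joined keys with a length-prefixed encoding that has a dedicated NULL marker. `KeyField`, `naive_key`, and `encode_key` are hypothetical names, and the exact byte layout (`V<len>:<bytes>` for values, `N;` for NULL) is an assumption of this sketch, not necessarily the format used in the follow-up commit.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

using KeyField = std::optional<std::string>;  // std::nullopt models SQL NULL

// Ambiguous: a '|' inside a value, or the literal string "NULL",
// produces the same key as a different tuple.
std::string naive_key(const std::vector<KeyField>& fields) {
    std::string key;
    for (const KeyField& f : fields) {
        key += f ? *f : std::string("NULL");
        key += '|';
    }
    return key;
}

// Collision-safe: each field is either "N;" (NULL) or "V<len>:<bytes>".
// The length prefix makes field boundaries independent of the content.
std::string encode_key(const std::vector<KeyField>& fields) {
    std::string key;
    for (const KeyField& f : fields) {
        if (!f) {
            key += "N;";
            continue;
        }
        key += 'V';
        key += std::to_string(f->size());
        key += ':';
        key += *f;
    }
    return key;
}
```

For example, the tuples ("a|b", "c") and ("a", "b|c") both map to `a|b|c|` under the naive scheme but to `V3:a|bV1:c` and `V1:aV3:b|c` under the length-prefixed one.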

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 069c0922-caf7-49a6-aadc-c8e4f2805ded

📥 Commits

Reviewing files that changed from the base of the PR and between de67738 and 7706a7d.

📒 Files selected for processing (3)
  • docs/phases/PHASE_8_ANALYTICS.md
  • include/executor/vectorized_operator.hpp
  • tests/vectorized_operator_tests.cpp

poyrazK and others added 2 commits April 22, 2026 23:24
- Add explicit #include <unordered_map>
- Rename ProcessState/state_ to ProcessPhase/process_phase_ to avoid shadowing base class
- Use separate sums_int64/sums_float64 accumulators with has_float_value_ tracking
- Branch in produce_output_batch to emit based on output column type
- Add collision-safe key encoding with length-prefixed values and dedicated NULL marker
- Pre-resolve group_by column indices in constructor (group_by_col_indices_)
- Fail fast on unresolved group-by expressions instead of silently emitting "NULL|"
- Update PHASE_8_ANALYTICS.md to document MIN/MAX support and implementation details
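
The motivation for the separate int64/float64 SUM accumulators in this commit can be shown with a small sketch. The names `ColumnType`, `SumValue`, `SumAccumulator`, and `sum_as_double` are hypothetical, and the sketch omits the `has_float_value_` tracking mentioned above; it only demonstrates why routing INT64 sums through a double loses precision above 2^53.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>
#include <vector>

enum class ColumnType { Int64, Float64 };
using SumValue = std::variant<std::int64_t, double>;

// One accumulator per numeric type; the output column type picks which to emit.
struct SumAccumulator {
    std::int64_t sum_int64 = 0;
    double sum_float64 = 0.0;

    void add_int64(std::int64_t v) { sum_int64 += v; }
    void add_float64(double v) { sum_float64 += v; }

    SumValue result(ColumnType out_type) const {
        if (out_type == ColumnType::Int64) return SumValue(sum_int64);
        return SumValue(sum_float64);
    }
};

// The lossy alternative: every integer is widened to double before summing.
double sum_as_double(const std::vector<std::int64_t>& values) {
    double total = 0.0;
    for (std::int64_t v : values) total += static_cast<double>(v);
    return total;
}
```

With the inputs 9007199254740992 (2^53) and 1, the integer accumulator yields 9007199254740993, while the double-based sum stays at 9007199254740992 because 2^53 + 1 is not representable as an IEEE 754 double.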
Owner Author

@poyrazK poyrazK left a comment

It's okay to merge

@poyrazK poyrazK merged commit bafac13 into main Apr 23, 2026
14 of 17 checks passed