feat(vectorized): add VectorizedGroupByOperator with MIN/MAX support and extended tests by poyrazK · Pull Request #59 · poyrazK/cloudSQL

poyrazK · 2026-04-21T11:16:09Z

Summary

- Add VectorizedGroupByOperator to the vectorized execution path with hash-based grouping
- Add MIN/MAX aggregate support to the GROUP BY operator
- Add comprehensive tests: multiple-column GROUP BY, MIN/MAX, NULL keys, key value verification

Changes

- include/executor/vectorized_operator.hpp: Added MIN/MAX handling in update_accumulators() and produce_output_batch()
- tests/vectorized_operator_tests.cpp: Added 4 new tests (MultipleColumnGroupBy, MinMaxAggregates, NullGroupKeys, VerifyGroupKeyValues)

Test plan

- All 21 vectorized_operator_tests pass
- CI passes

Summary by CodeRabbit

New Features
- Added GROUP BY support to the vectorized execution engine with hash-based grouped aggregation
- Supports COUNT(*), SUM, MIN, and MAX aggregate functions for INT64 and FLOAT64 columns
- Enables grouping by multiple columns with proper NULL key handling
- Handles streaming input efficiently across multiple batches

Add VectorizedGroupByOperator for batch-at-a-time grouped aggregation.

- Hash-based grouping using unordered_map
- Supports COUNT and SUM aggregates
- Two-phase processing: Input (populate hash table) then Output (serve groups)

Add VectorizedGroupByOperator for batch-at-a-time grouped aggregation:

- Hash-based grouping using unordered_map
- Supports COUNT and SUM aggregates
- Two-phase processing: Input (populate hash table) then Output
- Store original Value types in group keys

Add 4 tests:

- SingleGroup: GROUP BY with 1 group
- MultipleGroups: GROUP BY with 3 groups
- EmptyInput: GROUP BY on empty table
- MultiBatchGroups: 2500 rows with 10 groups across batches

- Added VectorizedGroupByOperator to vectorized operators list
- Added new section 5 documenting Vectorized GROUP BY feature
- Mentioned supported aggregates (COUNT, SUM)

- Add MIN/MAX aggregate handling in VectorizedGroupByOperator
- Update update_accumulators to track min/max per group
- Update produce_output_batch to output MIN/MAX values
- Add MultipleColumnGroupBy test (2-column GROUP BY)
- Add MinMaxAggregates test
- Add NullGroupKeys test
- Add VerifyGroupKeyValues test

coderabbitai · 2026-04-21T11:16:16Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f6857203-70c7-48cf-9c89-e04bd9aa98bc

📥 Commits

Reviewing files that changed from the base of the PR and between 7706a7d and 2f5bb4a.

📒 Files selected for processing (2)

docs/phases/PHASE_8_ANALYTICS.md
include/executor/vectorized_operator.hpp

📝 Walkthrough

A new VectorizedGroupByOperator has been implemented to support GROUP BY operations in the vectorized execution engine. The operator uses hash-based grouped aggregation with a two-phase control flow: first consuming all input batches to populate a hash table, then emitting grouped results in subsequent calls. The operator supports COUNT(*), SUM, MIN, and MAX aggregates.
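The two-phase flow described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `TwoPhaseGroupBy`, `GroupAcc`, `Batch`, and `OutputRow` are hypothetical names, the key and value columns are assumed to be single INT64 columns, and NULL handling is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>

// Hypothetical per-group accumulator (not the PR's VectorizedGroupState).
struct GroupAcc {
    std::int64_t count = 0;
    std::int64_t sum = 0;
    std::int64_t min = std::numeric_limits<std::int64_t>::max();
    std::int64_t max = std::numeric_limits<std::int64_t>::min();
};

// Hypothetical input batch: one INT64 key column and one INT64 value column.
struct Batch {
    std::vector<std::int64_t> keys;
    std::vector<std::int64_t> values;
};

struct OutputRow {
    std::int64_t key;
    GroupAcc acc;
};

class TwoPhaseGroupBy {
public:
    // Phase 1 (Input): fold one batch into the hash table.
    void consume(const Batch& batch) {
        assert(batch.keys.size() == batch.values.size());
        for (std::size_t i = 0; i < batch.keys.size(); ++i) {
            auto res = groups_.emplace(batch.keys[i], GroupAcc{});
            if (res.second) order_.push_back(batch.keys[i]);  // new group
            GroupAcc& acc = res.first->second;
            ++acc.count;
            acc.sum += batch.values[i];
            acc.min = std::min(acc.min, batch.values[i]);
            acc.max = std::max(acc.max, batch.values[i]);
        }
    }

    // Phase 2 (Output): emit up to max_rows groups per call.
    // An empty result signals that all groups have been served.
    std::vector<OutputRow> next_output(std::size_t max_rows) {
        std::vector<OutputRow> out;
        while (cursor_ < order_.size() && out.size() < max_rows) {
            const std::int64_t key = order_[cursor_++];
            out.push_back({key, groups_.at(key)});
        }
        return out;
    }

private:
    std::unordered_map<std::int64_t, GroupAcc> groups_;
    std::vector<std::int64_t> order_;  // first-seen order, for stable output
    std::size_t cursor_ = 0;
};
```

A caller would invoke `consume` once per child batch until the child reports EOF, then call `next_output` repeatedly until it returns an empty vector. Keeping a separate first-seen order is a choice made for this sketch so that output is deterministic; the PR may emit groups in a different order.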

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation `docs/phases/PHASE_8_ANALYTICS.md` | Added Phase 8 documentation describing the new `VectorizedGroupByOperator`, its hash-based implementation, two-phase flow, and supported aggregates (COUNT(*), SUM, MIN, MAX). |
| Vectorized Operator Implementation `include/executor/vectorized_operator.hpp` | Added new `VectorizedGroupState` struct to hold per-group accumulator data and new `VectorizedGroupByOperator` class implementing two-phase grouped aggregation with hash table management, NULL handling, and support for multiple aggregate functions. |
| Operator Tests `tests/vectorized_operator_tests.cpp` | Added comprehensive test suite for `VectorizedGroupByOperator` covering single/multi-group aggregation, empty inputs, multi-batch streaming, multiple key columns, MIN/MAX correctness, NULL key handling, and output validation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Groups gather 'round the hash table's call,
Where vectors dance and aggregates enthrall,
Two phases waltz through batches stream by stream,
COUNT and SUM and MIN fulfill the dream,
🥕 Grouped harmony blooms—no data left behind!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ⚠️ Warning | The PR objectives state the changes implement a new VectorizedGroupByOperator feature, but the linked issue `#56` requires fixing a bug in VectorizedFilterOperator. No changes to VectorizedFilterOperator are documented. | Address the bug in VectorizedFilterOperator by removing the early return and ensuring all child batches are consumed until EOF, as specified in issue `#56`. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Out of Scope Changes check | ❓ Inconclusive | The PR implements VectorizedGroupByOperator with MIN/MAX support, which is in scope for completing vectorized GROUP BY. However, the linked issue `#56` requires fixing VectorizedFilterOperator, which is not addressed in this PR. | Clarify whether this PR is intended to address both the new VectorizedGroupByOperator feature and the VectorizedFilterOperator bug fix, or if they should be separate PRs. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding VectorizedGroupByOperator with MIN/MAX support and tests. |

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

include/executor/vectorized_operator.hpp (1)
9-16: ⚠️ Potential issue | 🟡 Minor

Add explicit #include <unordered_map> header.

std::unordered_map is used on line 318 but is not explicitly included in this file. Currently it compiles due to transitive inclusion through storage/columnar_table.hpp → storage/storage_manager.hpp, but this dependency is fragile and could break if the transitive chain changes. Include it explicitly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/executor/vectorized_operator.hpp` around lines 9 - 16, The header is
missing a direct include for std::unordered_map used later; add an explicit
`#include` <unordered_map> at the top of include/executor/vectorized_operator.hpp
(near the other standard headers) so usages in this file (e.g., any references
to std::unordered_map in the VectorizedOperator-related declarations) no longer
rely on fragile transitive includes such as those from
storage/columnar_table.hpp or storage_manager.hpp.

🧹 Nitpick comments (1)

include/executor/vectorized_operator.hpp (1)
366-400: Cache group-by column indices once in the constructor.

schema.find_column(group_by_[i]->to_string()) is invoked for every input row (and twice when the key is new). For a 2500-row, multi-key workload this is N×G string-keyed lookups plus expression to_string() rebuilds per row. Resolve the expressions to column indices once in the constructor and reuse the cached vector in the hot loop. This also makes it trivial to fail fast if a group-by expression does not resolve against the child schema (see related comment on NULL/missing-column conflation).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/executor/vectorized_operator.hpp` around lines 366 - 400, The hot
loop in process_input_batch repeatedly calls
child_->output_schema().find_column(group_by_[i]->to_string()), rebuilding
strings and doing N×G lookups; fix this by resolving group_by_ once in the
constructor: add a member vector<size_t> (e.g. group_by_col_indices_) populated
in the class ctor by iterating group_by_ and calling find_column exactly once
per expression, and if any expression fails either record a sentinel
(static_cast<size_t>(-1)) or fail fast with an error; then update
process_input_batch to use group_by_col_indices_ for both key-building and
key_vals population instead of calling find_column/to_string() repeatedly,
leaving groups_, group_keys_, and group_values_ logic unchanged.
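The caching suggested in this nitpick could look roughly like the sketch below. It assumes a simplified stand-in `Schema` whose `find_column` returns `static_cast<size_t>(-1)` for a missing column, as the review text implies; `GroupByIndexCache`, `key_of`, and the use of an exception for the fail-fast path are hypothetical choices for illustration (the review itself mentions `set_error(...)`).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Sentinel for "column not found", matching the value named in the review.
constexpr std::size_t kNoColumn = static_cast<std::size_t>(-1);

// Hypothetical stand-in for the child schema; only the lookup matters here.
struct Schema {
    std::vector<std::string> names;

    std::size_t find_column(const std::string& name) const {
        for (std::size_t i = 0; i < names.size(); ++i) {
            if (names[i] == name) return i;
        }
        return kNoColumn;
    }
};

class GroupByIndexCache {
public:
    // Resolve every group-by name exactly once; fail fast if one is missing.
    GroupByIndexCache(const Schema& schema, const std::vector<std::string>& group_by) {
        group_by_col_indices_.reserve(group_by.size());
        for (const std::string& name : group_by) {
            const std::size_t idx = schema.find_column(name);
            if (idx == kNoColumn) {
                throw std::invalid_argument("unresolved group-by column: " + name);
            }
            group_by_col_indices_.push_back(idx);
        }
    }

    const std::vector<std::size_t>& indices() const { return group_by_col_indices_; }

    // Hot-loop helper: gather one row's key values using the cached indices,
    // with no string building or name lookup per row.
    std::vector<long long> key_of(const std::vector<long long>& row) const {
        std::vector<long long> key;
        key.reserve(group_by_col_indices_.size());
        for (std::size_t idx : group_by_col_indices_) {
            assert(idx < row.size());
            key.push_back(row[idx]);
        }
        return key;
    }

private:
    std::vector<std::size_t> group_by_col_indices_;
};
```

With the indices resolved up front, the per-row work reduces to indexed reads, and an unresolved expression surfaces at construction time rather than being confused with a NULL key later.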

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/phases/PHASE_8_ANALYTICS.md`:
- Around line 30-35: Update the docs section describing
VectorizedGroupByOperator to reflect that MIN and MAX aggregates are now
supported (in addition to COUNT(*) and SUM); mention that MIN/MAX operate on
INT64 and FLOAT64 columns, describe that they are handled in the same two-phase
hash-based grouping flow, and cite the MinMaxAggregates test as the
implementation reference for behavior and type support.

In `@include/executor/vectorized_operator.hpp`:
- Around line 468-487: produce_output_batch currently always emits Sum as
Value::make_float64(state.sums[i]); change it to mirror
VectorizedAggregateOperator by preserving separate int64 and float64
accumulators (e.g., state.sums_int64 and state.sums_float64 or a tagged sum in
the aggregation state) and, inside the Sum case of produce_output_batch, branch
on the output column type/schema to append either Value::make_int64(...) when
the column is INT64 or Value::make_float64(...) when it is FLOAT64 (or
null/convert as appropriate); update the aggregate state and accumulation logic
where aggregates_[i].type == AggregateType::Sum to maintain both accumulators
(or a typed accumulator) so large INT64 sums do not lose precision when emitted.
- Around line 366-406: process_input_batch builds ambiguous string keys that can
collide and conflates missing columns with NULLs; change key construction to a
collision-safe encoding (e.g., length-prefixed or type-tagged fields and a
dedicated NULL marker) instead of simple "|" concatenation in the loop that
inspects group_by_; on find_column(...) == static_cast<size_t>(-1) call
set_error(...) (do not silently emit "NULL|") so unresolved group-by expressions
fail fast; update places that insert into groups_ (emplace(...)), group_keys_,
and group_values_ to use the new encoded key and distinct NULL handling, and
leave update_accumulators(...) unchanged except it should consume the new safe
encoding where needed.
- Around line 325-329: The derived class declares a member ProcessState state_
that shadows the base VectorizedOperator::state_ (ExecState); rename the derived
enum/member (e.g., enum ProcessState -> ProcessPhase and state_ ->
process_phase_ or process_state_) and update all references within the class
(constructors, methods, comparisons) to use the new name so the base ExecState
member (VectorizedOperator::state_) is not hidden and callers that use base
state or set_error() remain consistent.
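To illustrate the key-collision finding above, here is a minimal sketch of a length-prefixed encoding with a dedicated NULL marker, next to the ambiguous "|"-joined form. `KeyField`, `encode_group_key`, and `naive_group_key` are hypothetical names, and the exact byte layout (`N;` and `V<length>:<bytes>`) is an assumption made for this example, not the format the PR adopted.

```cpp
#include <string>
#include <vector>

// Hypothetical key field: a NULL flag plus the value rendered as text.
struct KeyField {
    bool is_null;
    std::string text;
};

// Encode each field as "N;" for NULL or "V<length>:<bytes>" otherwise, so
// field boundaries never depend on the contents of the values themselves.
std::string encode_group_key(const std::vector<KeyField>& fields) {
    std::string key;
    for (const KeyField& f : fields) {
        if (f.is_null) {
            key += "N;";
        } else {
            key += 'V';
            key += std::to_string(f.text.size());
            key += ':';
            key += f.text;
        }
    }
    return key;
}

// The ambiguous scheme the review warns about, kept only for comparison:
// values containing '|' or the literal text "NULL" can collide.
std::string naive_group_key(const std::vector<KeyField>& fields) {
    std::string key;
    for (const KeyField& f : fields) {
        key += f.is_null ? std::string("NULL") : f.text;
        key += '|';
    }
    return key;
}
```

Under the naive scheme the key pairs ("a|b", "c") and ("a", "b|c") both produce `a|b|c|`, and a NULL key is indistinguishable from the string "NULL". The length-prefixed form keeps them apart because a reader always knows how many bytes belong to each field.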

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 069c0922-caf7-49a6-aadc-c8e4f2805ded

📥 Commits

Reviewing files that changed from the base of the PR and between de67738 and 7706a7d.

📒 Files selected for processing (3)

docs/phases/PHASE_8_ANALYTICS.md
include/executor/vectorized_operator.hpp
tests/vectorized_operator_tests.cpp

- Add explicit #include <unordered_map>
- Rename ProcessState/state_ to ProcessPhase/process_phase_ to avoid shadowing base class
- Use separate sums_int64/sums_float64 accumulators with has_float_value_ tracking
- Branch in produce_output_batch to emit based on output column type
- Add collision-safe key encoding with length-prefixed values and dedicated NULL marker
- Pre-resolve group_by column indices in constructor (group_by_col_indices_)
- Fail fast on unresolved group-by expressions instead of silently emitting "NULL|"
- Update PHASE_8_ANALYTICS.md to document MIN/MAX support and implementation details
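The reason for separate integer and floating-point SUM accumulators can be shown with a small sketch. A double has a 53-bit significand, so not every INT64 value above 2^53 is representable; summing integers in a double can therefore lose low-order bits. `TypedSum` and its member names below are hypothetical and only loosely mirror the `sums_int64`/`sums_float64` naming in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical typed SUM state for one aggregate of one group.
struct TypedSum {
    std::int64_t sum_int64 = 0;
    double sum_float64 = 0.0;
    bool has_float_value = false;

    // Integer inputs accumulate exactly (overflow checks omitted in this sketch).
    void add_int64(std::int64_t v) { sum_int64 += v; }

    // Floating-point inputs go to a separate accumulator.
    void add_float64(double v) {
        sum_float64 += v;
        has_float_value = true;
    }

    // Exact result for an INT64 output column; valid only for integer inputs.
    std::int64_t as_int64() const {
        assert(!has_float_value);
        return sum_int64;
    }

    // Combined result for a FLOAT64 output column.
    double as_float64() const {
        return static_cast<double>(sum_int64) + sum_float64;
    }
};
```

An output routine would then pick `as_int64()` or `as_float64()` based on the declared type of the output column, which is the branching the commit message describes for produce_output_batch.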

poyrazK

It's okay to merge

poyrazK added 4 commits April 21, 2026 13:51

feat(vectorized): add VectorizedGroupByOperator class

af6b369

Add VectorizedGroupByOperator for batch-at-a-time grouped aggregation.

- Hash-based grouping using unordered_map
- Supports COUNT and SUM aggregates
- Two-phase processing: Input (populate hash table) then Output (serve groups)

docs: update PHASE_8_ANALYTICS.md with VectorizedGroupByOperator

c1d3f91

- Added VectorizedGroupByOperator to vectorized operators list
- Added new section 5 documenting Vectorized GROUP BY feature
- Mentioned supported aggregates (COUNT, SUM)

style: automated clang-format fixes

7706a7d

coderabbitai Bot reviewed Apr 21, 2026

poyrazK and others added 2 commits April 22, 2026 23:24

style: automated clang-format fixes

2f5bb4a

poyrazK commented Apr 23, 2026

poyrazK merged commit bafac13 into main Apr 23, 2026
14 of 17 checks passed

coderabbitai Bot mentioned this pull request May 2, 2026

fix(vectorized): correct RIGHT and FULL outer join emission logic #72

Open

2 tasks

poyrazK merged 7 commits into main from feature/vectorized-groupby
