Add VectorizedGroupByOperator for batch-at-a-time grouped aggregation.
- Hash-based grouping using unordered_map
- Supports COUNT and SUM aggregates
- Two-phase processing: Input (populate hash table) then Output (serve groups)

Add VectorizedGroupByOperator for batch-at-a-time grouped aggregation:
- Hash-based grouping using unordered_map
- Supports COUNT and SUM aggregates
- Two-phase processing: Input (populate hash table) then Output
- Store original Value types in group keys
Add 4 tests:
- SingleGroup: GROUP BY with 1 group
- MultipleGroups: GROUP BY with 3 groups
- EmptyInput: GROUP BY on empty table
- MultiBatchGroups: 2500 rows with 10 groups across batches

- Added VectorizedGroupByOperator to vectorized operators list
- Added new section 5 documenting Vectorized GROUP BY feature
- Mentioned supported aggregates (COUNT, SUM)

- Add MIN/MAX aggregate handling in VectorizedGroupByOperator
- Update update_accumulators to track min/max per group
- Update produce_output_batch to output MIN/MAX values
- Add MultipleColumnGroupBy test (2-column GROUP BY)
- Add MinMaxAggregates test
- Add NullGroupKeys test
- Add VerifyGroupKeyValues test
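The two-phase flow described in these commits (populate a hash table from input batches, then serve the groups) can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: `GroupState`, `Batch`, and `GroupBySketch` are hypothetical names, keys are plain strings, and only a single INT64 value column is aggregated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-group state holding COUNT, SUM, MIN and MAX.
struct GroupState {
    std::int64_t count = 0;
    std::int64_t sum = 0;
    std::int64_t min = std::numeric_limits<std::int64_t>::max();
    std::int64_t max = std::numeric_limits<std::int64_t>::min();
};

// Hypothetical input batch: parallel key and value columns.
struct Batch {
    std::vector<std::string> keys;
    std::vector<std::int64_t> values;
};

class GroupBySketch {
public:
    // Input phase: fold one batch into the hash table.
    void consume(const Batch& batch) {
        assert(batch.keys.size() == batch.values.size());
        for (std::size_t row = 0; row < batch.keys.size(); ++row) {
            auto [it, inserted] = groups_.try_emplace(batch.keys[row]);
            if (inserted) order_.push_back(batch.keys[row]);  // remember first-seen order
            GroupState& g = it->second;
            const std::int64_t v = batch.values[row];
            ++g.count;
            g.sum += v;
            g.min = std::min(g.min, v);
            g.max = std::max(g.max, v);
        }
    }

    // Output phase: emit one entry per group, in first-seen order.
    std::vector<std::pair<std::string, GroupState>> finish() const {
        std::vector<std::pair<std::string, GroupState>> out;
        out.reserve(order_.size());
        for (const auto& key : order_) out.emplace_back(key, groups_.at(key));
        return out;
    }

private:
    std::unordered_map<std::string, GroupState> groups_;
    std::vector<std::string> order_;
};
```

Calling `consume` once per batch and `finish` once at the end mirrors the Input/Output split; groups that span several batches accumulate into the same `GroupState`.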
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/executor/vectorized_operator.hpp (1)
9-16: ⚠️ Potential issue | 🟡 Minor
Add explicit #include <unordered_map> header.
std::unordered_map is used on line 318 but is not explicitly included in this file. Currently it compiles due to transitive inclusion through storage/columnar_table.hpp → storage/storage_manager.hpp, but this dependency is fragile and could break if the transitive chain changes. Include it explicitly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/executor/vectorized_operator.hpp` around lines 9 - 16, The header is missing a direct include for std::unordered_map used later; add an explicit `#include` <unordered_map> at the top of include/executor/vectorized_operator.hpp (near the other standard headers) so usages in this file (e.g., any references to std::unordered_map in the VectorizedOperator-related declarations) no longer rely on fragile transitive includes such as those from storage/columnar_table.hpp or storage_manager.hpp.
🧹 Nitpick comments (1)
include/executor/vectorized_operator.hpp (1)
366-400: Cache group-by column indices once in the constructor.
schema.find_column(group_by_[i]->to_string()) is invoked for every input row (and twice when the key is new). For a 2500-row, multi-key workload this is N×G string-keyed lookups plus expression to_string() rebuilds per row. Resolve the expressions to column indices once in the constructor and reuse the cached vector in the hot loop. This also makes it trivial to fail fast if a group-by expression does not resolve against the child schema (see related comment on NULL/missing-column conflation).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/executor/vectorized_operator.hpp` around lines 366 - 400, The hot loop in process_input_batch repeatedly calls child_->output_schema().find_column(group_by_[i]->to_string()), rebuilding strings and doing N×G lookups; fix this by resolving group_by_ once in the constructor: add a member vector<size_t> (e.g. group_by_col_indices_) populated in the class ctor by iterating group_by_ and calling find_column exactly once per expression, and if any expression fails either record a sentinel (static_cast<size_t>(-1)) or fail fast with an error; then update process_input_batch to use group_by_col_indices_ for both key-building and key_vals population instead of calling find_column/to_string() repeatedly, leaving groups_, group_keys_, and group_values_ logic unchanged.
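A rough sketch of the suggested caching, under stated assumptions: `Schema` and `GroupByIndexCache` are hypothetical stand-ins, group-by expressions are reduced to column-name strings, and a thrown exception takes the place of the operator's own error path (the review mentions set_error). Only the `static_cast<size_t>(-1)` "not found" convention and the `group_by_col_indices_` name come from the review text.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the child schema.
struct Schema {
    std::vector<std::string> names;

    // Returns the column index, or size_t(-1) when the name is unknown.
    std::size_t find_column(const std::string& name) const {
        for (std::size_t i = 0; i < names.size(); ++i)
            if (names[i] == name) return i;
        return static_cast<std::size_t>(-1);
    }
};

class GroupByIndexCache {
public:
    // Resolve every group-by column exactly once; fail fast if one is missing.
    GroupByIndexCache(const Schema& schema, const std::vector<std::string>& group_by) {
        group_by_col_indices_.reserve(group_by.size());
        for (const auto& name : group_by) {
            const std::size_t idx = schema.find_column(name);
            if (idx == static_cast<std::size_t>(-1))
                throw std::invalid_argument("unresolved GROUP BY column: " + name);
            group_by_col_indices_.push_back(idx);
        }
    }

    const std::vector<std::size_t>& indices() const { return group_by_col_indices_; }

    // Hot-loop helper: picks the key fields of one row by cached index,
    // with no string lookups per row.
    std::vector<std::string> key_fields(const std::vector<std::string>& row) const {
        std::vector<std::string> out;
        out.reserve(group_by_col_indices_.size());
        for (const std::size_t idx : group_by_col_indices_) {
            assert(idx < row.size());
            out.push_back(row[idx]);
        }
        return out;
    }

private:
    std::vector<std::size_t> group_by_col_indices_;
};
```

With the indices resolved up front, the per-row cost drops from G name lookups to G vector accesses, and an unresolved column is reported before any batch is processed.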
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/phases/PHASE_8_ANALYTICS.md`:
- Around line 30-35: Update the docs section describing
VectorizedGroupByOperator to reflect that MIN and MAX aggregates are now
supported (in addition to COUNT(*) and SUM); mention that MIN/MAX operate on
INT64 and FLOAT64 columns, describe that they are handled in the same two-phase
hash-based grouping flow, and cite the MinMaxAggregates test as the
implementation reference for behavior and type support.
In `@include/executor/vectorized_operator.hpp`:
- Around line 468-487: produce_output_batch currently always emits Sum as
Value::make_float64(state.sums[i]); change it to mirror
VectorizedAggregateOperator by preserving separate int64 and float64
accumulators (e.g., state.sums_int64 and state.sums_float64 or a tagged sum in
the aggregation state) and, inside the Sum case of produce_output_batch, branch
on the output column type/schema to append either Value::make_int64(...) when
the column is INT64 or Value::make_float64(...) when it is FLOAT64 (or
null/convert as appropriate); update the aggregate state and accumulation logic
where aggregates_[i].type == AggregateType::Sum to maintain both accumulators
(or a typed accumulator) so large INT64 sums do not lose precision when emitted.
- Around line 366-406: process_input_batch builds ambiguous string keys that can
collide and conflates missing columns with NULLs; change key construction to a
collision-safe encoding (e.g., length-prefixed or type-tagged fields and a
dedicated NULL marker) instead of simple "|" concatenation in the loop that
inspects group_by_; on find_column(...) == static_cast<size_t>(-1) call
set_error(...) (do not silently emit "NULL|") so unresolved group-by expressions
fail fast; update places that insert into groups_ (emplace(...)), group_keys_,
and group_values_ to use the new encoded key and distinct NULL handling, and
leave update_accumulators(...) unchanged except it should consume the new safe
encoding where needed.
- Around line 325-329: The derived class declares a member ProcessState state_
that shadows the base VectorizedOperator::state_ (ExecState); rename the derived
enum/member (e.g., enum ProcessState -> ProcessPhase and state_ ->
process_phase_ or process_state_) and update all references within the class
(constructors, methods, comparisons) to use the new name so the base ExecState
member (VectorizedOperator::state_) is not hidden and callers that use base
state or set_error() remain consistent.
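To illustrate the key-collision comment above, here is a small sketch contrasting a "|"-joined key with a length-prefixed, tagged encoding. The function names, the `KeyFields` alias, and the exact tag bytes (`N` for NULL, `V` for a value) are assumptions made for this example; the operator's real encoding may differ.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical representation of one row's group-by fields; nullopt means NULL.
using KeyFields = std::vector<std::optional<std::string>>;

// Naive encoding: join with '|' and spell NULL as the text "NULL".
// Ambiguous when a value contains '|' or equals the string "NULL".
std::string naive_key(const KeyFields& fields) {
    std::string key;
    for (const auto& f : fields) {
        key += f ? *f : std::string("NULL");
        key += '|';
    }
    return key;
}

// Collision-safe encoding: each field is either the single byte 'N' (NULL)
// or 'V' + decimal length + ':' + raw bytes. A decoder always knows how many
// bytes belong to a field, so distinct field lists yield distinct keys.
std::string encoded_key(const KeyFields& fields) {
    std::string key;
    for (const auto& f : fields) {
        if (!f) {
            key += 'N';
            continue;
        }
        key += 'V';
        key += std::to_string(f->size());
        key += ':';
        key += *f;
    }
    return key;
}
```

For example, the field lists ("a|b", "c") and ("a", "b|c") produce the same naive key but different encoded keys, and a NULL field no longer collides with the literal string "NULL".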
---
Outside diff comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 9-16: The header is missing a direct include for
std::unordered_map used later; add an explicit `#include` <unordered_map> at the
top of include/executor/vectorized_operator.hpp (near the other standard
headers) so usages in this file (e.g., any references to std::unordered_map in
the VectorizedOperator-related declarations) no longer rely on fragile
transitive includes such as those from storage/columnar_table.hpp or
storage_manager.hpp.
---
Nitpick comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 366-400: The hot loop in process_input_batch repeatedly calls
child_->output_schema().find_column(group_by_[i]->to_string()), rebuilding
strings and doing N×G lookups; fix this by resolving group_by_ once in the
constructor: add a member vector<size_t> (e.g. group_by_col_indices_) populated
in the class ctor by iterating group_by_ and calling find_column exactly once
per expression, and if any expression fails either record a sentinel
(static_cast<size_t>(-1)) or fail fast with an error; then update
process_input_batch to use group_by_col_indices_ for both key-building and
key_vals population instead of calling find_column/to_string() repeatedly,
leaving groups_, group_keys_, and group_values_ logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 069c0922-caf7-49a6-aadc-c8e4f2805ded
📒 Files selected for processing (3)
docs/phases/PHASE_8_ANALYTICS.md
include/executor/vectorized_operator.hpp
tests/vectorized_operator_tests.cpp
- Add explicit #include <unordered_map>
- Rename ProcessState/state_ to ProcessPhase/process_phase_ to avoid shadowing base class
- Use separate sums_int64/sums_float64 accumulators with has_float_value_ tracking
- Branch in produce_output_batch to emit based on output column type
- Add collision-safe key encoding with length-prefixed values and dedicated NULL marker
- Pre-resolve group_by column indices in constructor (group_by_col_indices_)
- Fail fast on unresolved group-by expressions instead of silently emitting "NULL|"
- Update PHASE_8_ANALYTICS.md to document MIN/MAX support and implementation details
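The motivation for separate sums_int64/sums_float64 accumulators can be sketched as below. `SumAccumulator` and its `emit` method are hypothetical; only the two accumulator names and the idea of branching on the output column type come from the commit message. The has_float_value_ flag is omitted here because its exact role is not described.

```cpp
#include <cstdint>
#include <variant>

// Hypothetical typed SUM state: integers and floats accumulate separately.
struct SumAccumulator {
    std::int64_t sums_int64 = 0;
    double sums_float64 = 0.0;

    void add(std::int64_t v) { sums_int64 += v; }
    void add(double v) { sums_float64 += v; }

    // Branch on the output column type, as the commit describes:
    // an INT64 column gets the exact integer sum, a FLOAT64 column
    // gets the combined floating-point sum.
    std::variant<std::int64_t, double> emit(bool column_is_int64) const {
        if (column_is_int64) return sums_int64;
        return sums_float64 + static_cast<double>(sums_int64);
    }
};
```

A double has a 53-bit significand, so 2^53 + 1 (9007199254740993) is not representable: summing 9007199254740992 and 1 in a double yields 9007199254740992, while the INT64 accumulator keeps the exact result.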
Summary
- Adds VectorizedGroupByOperator to the vectorized execution path with hash-based grouping
Changes
- include/executor/vectorized_operator.hpp: Added MIN/MAX handling in update_accumulators() and produce_output_batch()
- tests/vectorized_operator_tests.cpp: Added 4 new tests (MultipleColumnGroupBy, MinMaxAggregates, NullGroupKeys, VerifyGroupKeyValues)
Test plan
Related
Closes #56 (partially - completes vectorized GROUP BY feature)