test(vectorized): add operator unit tests by poyrazK · Pull Request #57 · poyrazK/cloudSQL

poyrazK · 2026-04-21T09:39:04Z

Summary

Add comprehensive unit tests for vectorized operators:

VectorizedSeqScanOperator: Empty table, single batch, non-aligned boundaries, sequential calls until EOF
VectorizedProjectOperator: Empty input, multiple expressions, computed expressions
VectorizedAggregateOperator: Empty input, COUNT-only, SUM with FLOAT64
VectorizedFilterOperator: No matches, all match, pipelined batch processing

Test plan

vectorized_operator_tests passes
analytics_tests still passes
CI verification

Summary by CodeRabbit

New Features
- Vectorized filter operator now processes multiple batches sequentially for enhanced query execution capabilities.
Tests
- Comprehensive test suite added for vectorized operators, covering empty inputs, batch boundaries, expression evaluation, aggregation, and filtering scenarios.

Add comprehensive tests for VectorizedSeqScanOperator, VectorizedProjectOperator, VectorizedAggregateOperator, and VectorizedFilterOperator covering: - Empty table handling - Single and multi-batch scenarios - Non-aligned boundaries - Pipelined batch processing

coderabbitai · 2026-04-21T09:39:12Z

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 54 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 10 minutes and 54 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d270fd3a-6961-422a-87d3-6579caf2d3a1

📥 Commits

Reviewing files that changed from the base of the PR and between c77cdb2 and c41abe0.

📒 Files selected for processing (2)

include/executor/vectorized_operator.hpp
tests/vectorized_operator_tests.cpp

📝 Walkthrough

Walkthrough

The changes add a new test executable to the CMake configuration, modify VectorizedFilterOperator::next_batch to iterate through multiple child batches accumulating matches before returning, and introduce a comprehensive test suite covering vectorized operator types with various input scenarios and edge cases.

Changes

Cohort / File(s)	Summary
Build Configuration `CMakeLists.txt`	Added `vectorized_operator_tests` GoogleTest executable target for testing vectorized operators.
Operator Implementation `include/executor/vectorized_operator.hpp`	Modified `VectorizedFilterOperator::next_batch` to iterate through child batches in a loop, accumulating matching rows into output before returning, instead of processing a single batch per call.
Test Suite `tests/vectorized_operator_tests.cpp`	New comprehensive test file with fixtures and test cases for `VectorizedSeqScanOperator`, `VectorizedProjectOperator`, `VectorizedAggregateOperator`, and `VectorizedFilterOperator`, covering empty inputs, multiple batches, boundary conditions, and repeated scanning.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

fix(vectorized): correct filter operator to exhaust all child batches #56: Directly modifies VectorizedFilterOperator::next_batch with the same batching accumulation approach, replacing single-batch early returns with multi-batch iteration.

Poem

🐰 Hopping through batches in vectorized flight,
Filters loop onwards from batch to batch,
Selection masks twirl and conditions ignite,
Accumulating rows in a perfect match,
Tests carrot-approved—operators take flight! 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding comprehensive unit tests for vectorized operators, which aligns with the primary changeset additions.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/vectorized-operator-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

include/executor/vectorized_operator.hpp (1)

104-146: ⚠️ Potential issue | 🟠 Major

Fragile output-batch contract: missing clear() + early return can silently stall the pipeline.

Unlike VectorizedProjectOperator::next_batch (line 168) and VectorizedAggregateOperator::next_batch (line 264), this method never calls out_batch.clear(). Combined with:

the early return at lines 106-108 (if (out_batch.row_count() > 0) return true;), and
the accumulation-style write at line 138 (set_row_count(out_batch.row_count() + selection.size())),

the operator's correctness now depends on the caller clearing out_batch between every next_batch call. If a caller forgets (or reuses the batch as Project/Aggregate patterns suggest they can), the filter will return true immediately without advancing the child scan, effectively looping on stale results. Conversely, if the caller always clears, the "accumulated rows from a previous call" branch is dead code and the comment is misleading.

The PipelinedBatches test happens to clear after each call, so the bug is not caught by the new suite.

Recommend aligning with the sibling operators: clear at entry, drop the early return, and write a fresh row count.

🔒 Proposed fix

     bool next_batch(VectorBatch& out_batch) override {
-        // If we already have accumulated rows from a previous call, return them
-        if (out_batch.row_count() > 0) {
-            return true;
-        }
-
-        // Ensure output batch is structured for current schema
-        if (out_batch.column_count() == 0) {
+        out_batch.clear();
+        if (out_batch.column_count() == 0) {
             out_batch.init_from_schema(output_schema_);
         }

         // Process child batches until we find matches or exhaust input
         while (child_->next_batch(*input_batch_)) {
             selection_mask_->clear();
             condition_->evaluate_vectorized(*input_batch_, child_->output_schema(),
                                             *selection_mask_);

             std::vector<size_t> selection;
             for (size_t r = 0; r < input_batch_->row_count(); ++r) {
                 common::Value val = selection_mask_->get(r);
                 if (!val.is_null() && val.as_bool()) {
                     selection.push_back(r);
                 }
             }

             if (!selection.empty()) {
                 for (size_t c = 0; c < input_batch_->column_count(); ++c) {
                     auto& src_col = input_batch_->get_column(c);
                     auto& dest_col = out_batch.get_column(c);
                     for (size_t r : selection) {
                         dest_col.append(src_col.get(r));
                     }
                 }
-                out_batch.set_row_count(out_batch.row_count() + selection.size());
+                out_batch.set_row_count(selection.size());
                 input_batch_->clear();
                 return true;
             }
             input_batch_->clear();
         }

         return false;
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@include/executor/vectorized_operator.hpp` around lines 104 - 146, In
VectorizedFilterOperator::next_batch(VectorBatch& out_batch) the method must
clear the output batch at entry (call out_batch.clear()) and remove the
early-return that checks out_batch.row_count() so the operator doesn't silently
reuse stale rows; when writing matches, set the output row count to the number
of matches (out_batch.set_row_count(selection.size())) instead of adding to the
existing count, and otherwise keep the existing selection/build logic and
input_batch_->clear() behavior to match the approach used by
VectorizedProjectOperator::next_batch and
VectorizedAggregateOperator::next_batch.

🧹 Nitpick comments (1)

tests/vectorized_operator_tests.cpp (1)
25-25: Hardcoded relative paths risk cross-run / cross-test pollution.

Each fixture (VectorizedSeqScanTests, VectorizedProjectTests, VectorizedAggregateTests, VectorizedFilterTests) points StorageManager at a fixed path like "./test_scan_ops". TearDown only resets the unique_ptr — it does not remove the directory. Side effects:

Artifacts persist between runs; a schema/layout change can make stale files poison subsequent runs.

Parallel CTest (ctest -j) could interleave tests sharing a path if any identifier collides.

Running from a different CWD changes where files land.

Consider using std::filesystem::temp_directory_path() + a unique per-test subdirectory in SetUp, and std::filesystem::remove_all in TearDown. Applies to lines 25, 137, 250, 346.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vectorized_operator_tests.cpp` at line 25, The test fixtures currently
hardcode "./test_scan_ops" in SetUp when constructing StorageManager (e.g., in
SetUp of VectorizedSeqScanTests, VectorizedProjectTests,
VectorizedAggregateTests, VectorizedFilterTests), which can cause cross-run
pollution and parallel-test collisions; change each SetUp to create a unique
temp directory (use std::filesystem::temp_directory_path() combined with a
per-test unique suffix such as std::filesystem::unique_path() or a timestamp +
test name) and pass that path to the StorageManager constructor, then update
TearDown to call std::filesystem::remove_all on that directory before resetting
the storage_ unique_ptr so artifacts are cleaned up between runs and tests.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/vectorized_operator_tests.cpp`:
- Around line 256-280: The test is silent about a SQL-spec bug: COUNT(*) over an
empty table should return 0 but current VectorizedAggregateOperator leaves
has_value_ false so the result is NULL; fix the operator by initializing
has_value_[i] = true and the corresponding accumulator to 0 for aggregates where
aggs[i].type == AggregateType::Count (in the VectorizedAggregateOperator
constructor/initialization code that processes VectorizedAggregateInfo), then
update the CountOnlyEmpty test (VectorizedAggregateTests::CountOnlyEmpty) to
assert result->get_column(0).get(0).as_int64() == 0 (and keep row_count() == 1).

---

Outside diff comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 104-146: In VectorizedFilterOperator::next_batch(VectorBatch&
out_batch) the method must clear the output batch at entry (call
out_batch.clear()) and remove the early-return that checks out_batch.row_count()
so the operator doesn't silently reuse stale rows; when writing matches, set the
output row count to the number of matches
(out_batch.set_row_count(selection.size())) instead of adding to the existing
count, and otherwise keep the existing selection/build logic and
input_batch_->clear() behavior to match the approach used by
VectorizedProjectOperator::next_batch and
VectorizedAggregateOperator::next_batch.

---

Nitpick comments:
In `@tests/vectorized_operator_tests.cpp`:
- Line 25: The test fixtures currently hardcode "./test_scan_ops" in SetUp when
constructing StorageManager (e.g., in SetUp of VectorizedSeqScanTests,
VectorizedProjectTests, VectorizedAggregateTests, VectorizedFilterTests), which
can cause cross-run pollution and parallel-test collisions; change each SetUp to
create a unique temp directory (use std::filesystem::temp_directory_path()
combined with a per-test unique suffix such as std::filesystem::unique_path() or
a timestamp + test name) and pass that path to the StorageManager constructor,
then update TearDown to call std::filesystem::remove_all on that directory
before resetting the storage_ unique_ptr so artifacts are cleaned up between
runs and tests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 020e5854-b4f7-4201-a91b-9d23a98f6dec

📥 Commits

Reviewing files that changed from the base of the PR and between adfa4c5 and c77cdb2.

📒 Files selected for processing (3)

CMakeLists.txt
include/executor/vectorized_operator.hpp
tests/vectorized_operator_tests.cpp

coderabbitai · 2026-04-21T09:43:08Z

+TEST_F(VectorizedAggregateTests, CountOnlyEmpty) {
+    // Note: COUNT(*) on empty table currently returns NULL due to has_value_ semantics
+    // This test verifies behavior as-is rather than expecting specific results
+    Schema schema;
+    schema.add_column("val", common::ValueType::TYPE_INT64);
+
+    ColumnarTable table("count_empty_agg", *storage_, schema);
+    ASSERT_TRUE(table.create());
+    ASSERT_TRUE(table.open());
+
+    auto table_ptr = std::make_shared<ColumnarTable>(table);
+    auto scan = std::make_unique<VectorizedSeqScanOperator>("count_empty_agg", table_ptr);
+
+    Schema out_schema;
+    out_schema.add_column("cnt", common::ValueType::TYPE_INT64);
+
+    std::vector<VectorizedAggregateInfo> aggs = {{AggregateType::Count, -1}};
+
+    VectorizedAggregateOperator agg(std::move(scan), std::move(out_schema), aggs);
+
+    auto result = VectorBatch::create(agg.output_schema());
+    ASSERT_TRUE(agg.next_batch(*result));
+    EXPECT_EQ(result->row_count(), 1);
+    // Empty table aggregate produces a result batch (done_ = true)
+}


⚠️ Potential issue | 🟡 Minor

CountOnlyEmpty locks in a SQL-spec violation rather than flagging it.

Per SQL standard, COUNT(*) on an empty relation must return 0, not NULL. The test comment acknowledges the aggregate operator returns NULL here because has_value_[i] remains false for COUNT over zero batches, but the test only asserts row_count() == 1 — it never checks the actual count value, so the bug will pass CI silently and future regressions in either direction go undetected.

Recommend either:

Fix VectorizedAggregateOperator so COUNT is initialized with has_value_[i] = true (COUNT always has a value), and assert result->get_column(0).get(0).as_int64() == 0, or

Explicitly assert the (buggy) NULL behavior via EXPECT_TRUE(result->get_column(0).get(0).is_null()) and open a tracking issue.

Leaving the assertion absent makes this test effectively a no-op for its stated concern.

Want me to draft the aggregate fix (initializing has_value_ for Count in the constructor) and update the test accordingly?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/vectorized_operator_tests.cpp` around lines 256 - 280, The test is silent about a SQL-spec bug: COUNT(*) over an empty table should return 0 but current VectorizedAggregateOperator leaves has_value_ false so the result is NULL; fix the operator by initializing has_value_[i] = true and the corresponding accumulator to 0 for aggregates where aggs[i].type == AggregateType::Count (in the VectorizedAggregateOperator constructor/initialization code that processes VectorizedAggregateInfo), then update the CountOnlyEmpty test (VectorizedAggregateTests::CountOnlyEmpty) to assert result->get_column(0).get(0).as_int64() == 0 (and keep row_count() == 1).

…table VectorizedFilterOperator::next_batch(): - Clear output batch at entry for clean state - Remove early-return optimization that checked row_count > 0 - Set row count to selection.size() instead of accumulating VectorizedAggregateOperator: - Initialize has_value_[i] = true for COUNT aggregates in constructor - This makes COUNT(*) return 0 (not NULL) for empty tables per SQL spec Test update: - CountOnlyEmpty now asserts COUNT(*) = 0 for empty table

poyrazK

It's okay to merge

style: automated clang-format fixes

c77cdb2

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

poyrazK commented Apr 21, 2026

View reviewed changes

poyrazK merged commit de67738 into main Apr 21, 2026
9 checks passed

poyrazK deleted the feature/vectorized-operator-tests branch April 21, 2026 10:30

coderabbitai Bot mentioned this pull request Apr 28, 2026

refactor(vectorized): streaming hash join with bounded memory #69

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(vectorized): add operator unit tests#57

test(vectorized): add operator unit tests#57
poyrazK merged 3 commits intomainfrom
feature/vectorized-operator-tests

poyrazK commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Uh oh!

poyrazK left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

poyrazK commented Apr 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

poyrazK left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poyrazK commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading