Skip to content

test(vectorized): add operator unit tests#57

Merged
poyrazK merged 3 commits intomainfrom
feature/vectorized-operator-tests
Apr 21, 2026
Merged

test(vectorized): add operator unit tests#57
poyrazK merged 3 commits intomainfrom
feature/vectorized-operator-tests

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented Apr 21, 2026

Summary

Add comprehensive unit tests for vectorized operators:

  • VectorizedSeqScanOperator: Empty table, single batch, non-aligned boundaries, sequential calls until EOF
  • VectorizedProjectOperator: Empty input, multiple expressions, computed expressions
  • VectorizedAggregateOperator: Empty input, COUNT-only, SUM with FLOAT64
  • VectorizedFilterOperator: No matches, all match, pipelined batch processing

Test plan

  • vectorized_operator_tests passes
  • analytics_tests still passes
  • CI verification

Summary by CodeRabbit

  • New Features

    • Vectorized filter operator now processes multiple batches sequentially for enhanced query execution capabilities.
  • Tests

    • Comprehensive test suite added for vectorized operators, covering empty inputs, batch boundaries, expression evaluation, aggregation, and filtering scenarios.

Add comprehensive tests for VectorizedSeqScanOperator, VectorizedProjectOperator,
VectorizedAggregateOperator, and VectorizedFilterOperator covering:
- Empty table handling
- Single and multi-batch scenarios
- Non-aligned boundaries
- Pipelined batch processing
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 54 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 10 minutes and 54 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d270fd3a-6961-422a-87d3-6579caf2d3a1

📥 Commits

Reviewing files that changed from the base of the PR and between c77cdb2 and c41abe0.

📒 Files selected for processing (2)
  • include/executor/vectorized_operator.hpp
  • tests/vectorized_operator_tests.cpp
📝 Walkthrough

Walkthrough

The changes add a new test executable to the CMake configuration, modify VectorizedFilterOperator::next_batch to iterate through multiple child batches accumulating matches before returning, and introduce a comprehensive test suite covering vectorized operator types with various input scenarios and edge cases.

Changes

Cohort / File(s) Summary
Build Configuration
CMakeLists.txt
Added vectorized_operator_tests GoogleTest executable target for testing vectorized operators.
Operator Implementation
include/executor/vectorized_operator.hpp
Modified VectorizedFilterOperator::next_batch to iterate through child batches in a loop, accumulating matching rows into output before returning, instead of processing a single batch per call.
Test Suite
tests/vectorized_operator_tests.cpp
New comprehensive test file with fixtures and test cases for VectorizedSeqScanOperator, VectorizedProjectOperator, VectorizedAggregateOperator, and VectorizedFilterOperator, covering empty inputs, multiple batches, boundary conditions, and repeated scanning.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 Hopping through batches in vectorized flight,
Filters loop onwards from batch to batch,
Selection masks twirl and conditions ignite,
Accumulating rows in a perfect match,
Tests carrot-approved—operators take flight! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding comprehensive unit tests for vectorized operators, which aligns with the primary changeset additions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/vectorized-operator-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/executor/vectorized_operator.hpp (1)

104-146: ⚠️ Potential issue | 🟠 Major

Fragile output-batch contract: missing clear() + early return can silently stall the pipeline.

Unlike VectorizedProjectOperator::next_batch (line 168) and VectorizedAggregateOperator::next_batch (line 264), this method never calls out_batch.clear(). Combined with:

  • the early return at lines 106-108 (if (out_batch.row_count() > 0) return true;), and
  • the accumulation-style write at line 138 (set_row_count(out_batch.row_count() + selection.size())),

the operator's correctness now depends on the caller clearing out_batch between every next_batch call. If a caller forgets (or reuses the batch as Project/Aggregate patterns suggest they can), the filter will return true immediately without advancing the child scan, effectively looping on stale results. Conversely, if the caller always clears, the "accumulated rows from a previous call" branch is dead code and the comment is misleading.

The PipelinedBatches test happens to clear after each call, so the bug is not caught by the new suite.

Recommend aligning with the sibling operators: clear at entry, drop the early return, and write a fresh row count.

🔒 Proposed fix
     bool next_batch(VectorBatch& out_batch) override {
-        // If we already have accumulated rows from a previous call, return them
-        if (out_batch.row_count() > 0) {
-            return true;
-        }
-
-        // Ensure output batch is structured for current schema
-        if (out_batch.column_count() == 0) {
+        out_batch.clear();
+        if (out_batch.column_count() == 0) {
             out_batch.init_from_schema(output_schema_);
         }

         // Process child batches until we find matches or exhaust input
         while (child_->next_batch(*input_batch_)) {
             selection_mask_->clear();
             condition_->evaluate_vectorized(*input_batch_, child_->output_schema(),
                                             *selection_mask_);

             std::vector<size_t> selection;
             for (size_t r = 0; r < input_batch_->row_count(); ++r) {
                 common::Value val = selection_mask_->get(r);
                 if (!val.is_null() && val.as_bool()) {
                     selection.push_back(r);
                 }
             }

             if (!selection.empty()) {
                 for (size_t c = 0; c < input_batch_->column_count(); ++c) {
                     auto& src_col = input_batch_->get_column(c);
                     auto& dest_col = out_batch.get_column(c);
                     for (size_t r : selection) {
                         dest_col.append(src_col.get(r));
                     }
                 }
-                out_batch.set_row_count(out_batch.row_count() + selection.size());
+                out_batch.set_row_count(selection.size());
                 input_batch_->clear();
                 return true;
             }
             input_batch_->clear();
         }

         return false;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/executor/vectorized_operator.hpp` around lines 104 - 146, In
VectorizedFilterOperator::next_batch(VectorBatch& out_batch) the method must
clear the output batch at entry (call out_batch.clear()) and remove the
early-return that checks out_batch.row_count() so the operator doesn't silently
reuse stale rows; when writing matches, set the output row count to the number
of matches (out_batch.set_row_count(selection.size())) instead of adding to the
existing count, and otherwise keep the existing selection/build logic and
input_batch_->clear() behavior to match the approach used by
VectorizedProjectOperator::next_batch and
VectorizedAggregateOperator::next_batch.
🧹 Nitpick comments (1)
tests/vectorized_operator_tests.cpp (1)

25-25: Hardcoded relative paths risk cross-run / cross-test pollution.

Each fixture (VectorizedSeqScanTests, VectorizedProjectTests, VectorizedAggregateTests, VectorizedFilterTests) points StorageManager at a fixed path like "./test_scan_ops". TearDown only resets the unique_ptr — it does not remove the directory. Side effects:

  • Artifacts persist between runs; a schema/layout change can make stale files poison subsequent runs.
  • Parallel CTest (ctest -j) could interleave tests sharing a path if any identifier collides.
  • Running from a different CWD changes where files land.

Consider using std::filesystem::temp_directory_path() + a unique per-test subdirectory in SetUp, and std::filesystem::remove_all in TearDown. Applies to lines 25, 137, 250, 346.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vectorized_operator_tests.cpp` at line 25, The test fixtures currently
hardcode "./test_scan_ops" in SetUp when constructing StorageManager (e.g., in
SetUp of VectorizedSeqScanTests, VectorizedProjectTests,
VectorizedAggregateTests, VectorizedFilterTests), which can cause cross-run
pollution and parallel-test collisions; change each SetUp to create a unique
temp directory (use std::filesystem::temp_directory_path() combined with a
per-test unique suffix such as std::filesystem::unique_path() or a timestamp +
test name) and pass that path to the StorageManager constructor, then update
TearDown to call std::filesystem::remove_all on that directory before resetting
the storage_ unique_ptr so artifacts are cleaned up between runs and tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/vectorized_operator_tests.cpp`:
- Around line 256-280: The test is silent about a SQL-spec bug: COUNT(*) over an
empty table should return 0 but current VectorizedAggregateOperator leaves
has_value_ false so the result is NULL; fix the operator by initializing
has_value_[i] = true and the corresponding accumulator to 0 for aggregates where
aggs[i].type == AggregateType::Count (in the VectorizedAggregateOperator
constructor/initialization code that processes VectorizedAggregateInfo), then
update the CountOnlyEmpty test (VectorizedAggregateTests::CountOnlyEmpty) to
assert result->get_column(0).get(0).as_int64() == 0 (and keep row_count() == 1).

---

Outside diff comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 104-146: In VectorizedFilterOperator::next_batch(VectorBatch&
out_batch) the method must clear the output batch at entry (call
out_batch.clear()) and remove the early-return that checks out_batch.row_count()
so the operator doesn't silently reuse stale rows; when writing matches, set the
output row count to the number of matches
(out_batch.set_row_count(selection.size())) instead of adding to the existing
count, and otherwise keep the existing selection/build logic and
input_batch_->clear() behavior to match the approach used by
VectorizedProjectOperator::next_batch and
VectorizedAggregateOperator::next_batch.

---

Nitpick comments:
In `@tests/vectorized_operator_tests.cpp`:
- Line 25: The test fixtures currently hardcode "./test_scan_ops" in SetUp when
constructing StorageManager (e.g., in SetUp of VectorizedSeqScanTests,
VectorizedProjectTests, VectorizedAggregateTests, VectorizedFilterTests), which
can cause cross-run pollution and parallel-test collisions; change each SetUp to
create a unique temp directory (use std::filesystem::temp_directory_path()
combined with a per-test unique suffix such as std::filesystem::unique_path() or
a timestamp + test name) and pass that path to the StorageManager constructor,
then update TearDown to call std::filesystem::remove_all on that directory
before resetting the storage_ unique_ptr so artifacts are cleaned up between
runs and tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 020e5854-b4f7-4201-a91b-9d23a98f6dec

📥 Commits

Reviewing files that changed from the base of the PR and between adfa4c5 and c77cdb2.

📒 Files selected for processing (3)
  • CMakeLists.txt
  • include/executor/vectorized_operator.hpp
  • tests/vectorized_operator_tests.cpp

Comment on lines +256 to +280
TEST_F(VectorizedAggregateTests, CountOnlyEmpty) {
// Note: COUNT(*) on empty table currently returns NULL due to has_value_ semantics
// This test verifies behavior as-is rather than expecting specific results
Schema schema;
schema.add_column("val", common::ValueType::TYPE_INT64);

ColumnarTable table("count_empty_agg", *storage_, schema);
ASSERT_TRUE(table.create());
ASSERT_TRUE(table.open());

auto table_ptr = std::make_shared<ColumnarTable>(table);
auto scan = std::make_unique<VectorizedSeqScanOperator>("count_empty_agg", table_ptr);

Schema out_schema;
out_schema.add_column("cnt", common::ValueType::TYPE_INT64);

std::vector<VectorizedAggregateInfo> aggs = {{AggregateType::Count, -1}};

VectorizedAggregateOperator agg(std::move(scan), std::move(out_schema), aggs);

auto result = VectorBatch::create(agg.output_schema());
ASSERT_TRUE(agg.next_batch(*result));
EXPECT_EQ(result->row_count(), 1);
// Empty table aggregate produces a result batch (done_ = true)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

CountOnlyEmpty locks in a SQL-spec violation rather than flagging it.

Per SQL standard, COUNT(*) on an empty relation must return 0, not NULL. The test comment acknowledges the aggregate operator returns NULL here because has_value_[i] remains false for COUNT over zero batches, but the test only asserts row_count() == 1 — it never checks the actual count value, so the bug will pass CI silently and future regressions in either direction go undetected.

Recommend either:

  1. Fix VectorizedAggregateOperator so COUNT is initialized with has_value_[i] = true (COUNT always has a value), and assert result->get_column(0).get(0).as_int64() == 0, or
  2. Explicitly assert the (buggy) NULL behavior via EXPECT_TRUE(result->get_column(0).get(0).is_null()) and open a tracking issue.

Leaving the assertion absent makes this test effectively a no-op for its stated concern.

Want me to draft the aggregate fix (initializing has_value_ for Count in the constructor) and update the test accordingly?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vectorized_operator_tests.cpp` around lines 256 - 280, The test is
silent about a SQL-spec bug: COUNT(*) over an empty table should return 0 but
current VectorizedAggregateOperator leaves has_value_ false so the result is
NULL; fix the operator by initializing has_value_[i] = true and the
corresponding accumulator to 0 for aggregates where aggs[i].type ==
AggregateType::Count (in the VectorizedAggregateOperator
constructor/initialization code that processes VectorizedAggregateInfo), then
update the CountOnlyEmpty test (VectorizedAggregateTests::CountOnlyEmpty) to
assert result->get_column(0).get(0).as_int64() == 0 (and keep row_count() == 1).

…table

VectorizedFilterOperator::next_batch():
- Clear output batch at entry for clean state
- Remove early-return optimization that checked row_count > 0
- Set row count to selection.size() instead of accumulating

VectorizedAggregateOperator:
- Initialize has_value_[i] = true for COUNT aggregates in constructor
- This makes COUNT(*) return 0 (not NULL) for empty tables per SQL spec

Test update:
- CountOnlyEmpty now asserts COUNT(*) = 0 for empty table
Copy link
Copy Markdown
Owner Author

@poyrazK poyrazK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to merge

@poyrazK poyrazK merged commit de67738 into main Apr 21, 2026
9 checks passed
@poyrazK poyrazK deleted the feature/vectorized-operator-tests branch April 21, 2026 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant