test(vectorized): add operator unit tests#57
Conversation
Add comprehensive tests for VectorizedSeqScanOperator, VectorizedProjectOperator, VectorizedAggregateOperator, and VectorizedFilterOperator covering: - Empty table handling - Single and multi-batch scenarios - Non-aligned boundaries - Pipelined batch processing
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 10 minutes and 54 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe changes add a new test executable to the CMake configuration, modify Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/executor/vectorized_operator.hpp (1)
104-146:⚠️ Potential issue | 🟠 MajorFragile output-batch contract: missing
clear()+ early return can silently stall the pipeline.Unlike
VectorizedProjectOperator::next_batch(line 168) andVectorizedAggregateOperator::next_batch(line 264), this method never callsout_batch.clear(). Combined with:
- the early return at lines 106-108 (
if (out_batch.row_count() > 0) return true;), and- the accumulation-style write at line 138 (
set_row_count(out_batch.row_count() + selection.size())),the operator's correctness now depends on the caller clearing
out_batchbetween everynext_batchcall. If a caller forgets (or reuses the batch as Project/Aggregate patterns suggest they can), the filter will returntrueimmediately without advancing the child scan, effectively looping on stale results. Conversely, if the caller always clears, the "accumulated rows from a previous call" branch is dead code and the comment is misleading.The
PipelinedBatchestest happens to clear after each call, so the bug is not caught by the new suite.Recommend aligning with the sibling operators: clear at entry, drop the early return, and write a fresh row count.
🔒 Proposed fix
bool next_batch(VectorBatch& out_batch) override { - // If we already have accumulated rows from a previous call, return them - if (out_batch.row_count() > 0) { - return true; - } - - // Ensure output batch is structured for current schema - if (out_batch.column_count() == 0) { + out_batch.clear(); + if (out_batch.column_count() == 0) { out_batch.init_from_schema(output_schema_); } // Process child batches until we find matches or exhaust input while (child_->next_batch(*input_batch_)) { selection_mask_->clear(); condition_->evaluate_vectorized(*input_batch_, child_->output_schema(), *selection_mask_); std::vector<size_t> selection; for (size_t r = 0; r < input_batch_->row_count(); ++r) { common::Value val = selection_mask_->get(r); if (!val.is_null() && val.as_bool()) { selection.push_back(r); } } if (!selection.empty()) { for (size_t c = 0; c < input_batch_->column_count(); ++c) { auto& src_col = input_batch_->get_column(c); auto& dest_col = out_batch.get_column(c); for (size_t r : selection) { dest_col.append(src_col.get(r)); } } - out_batch.set_row_count(out_batch.row_count() + selection.size()); + out_batch.set_row_count(selection.size()); input_batch_->clear(); return true; } input_batch_->clear(); } return false; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/executor/vectorized_operator.hpp` around lines 104 - 146, In VectorizedFilterOperator::next_batch(VectorBatch& out_batch) the method must clear the output batch at entry (call out_batch.clear()) and remove the early-return that checks out_batch.row_count() so the operator doesn't silently reuse stale rows; when writing matches, set the output row count to the number of matches (out_batch.set_row_count(selection.size())) instead of adding to the existing count, and otherwise keep the existing selection/build logic and input_batch_->clear() behavior to match the approach used by VectorizedProjectOperator::next_batch and VectorizedAggregateOperator::next_batch.
🧹 Nitpick comments (1)
tests/vectorized_operator_tests.cpp (1)
25-25: Hardcoded relative paths risk cross-run / cross-test pollution.Each fixture (
VectorizedSeqScanTests,VectorizedProjectTests,VectorizedAggregateTests,VectorizedFilterTests) pointsStorageManagerat a fixed path like"./test_scan_ops".TearDownonly resets theunique_ptr— it does not remove the directory. Side effects:
- Artifacts persist between runs; a schema/layout change can make stale files poison subsequent runs.
- Parallel CTest (
ctest -j) could interleave tests sharing a path if any identifier collides.- Running from a different CWD changes where files land.
Consider using
std::filesystem::temp_directory_path()+ a unique per-test subdirectory inSetUp, andstd::filesystem::remove_allinTearDown. Applies to lines 25, 137, 250, 346.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/vectorized_operator_tests.cpp` at line 25, The test fixtures currently hardcode "./test_scan_ops" in SetUp when constructing StorageManager (e.g., in SetUp of VectorizedSeqScanTests, VectorizedProjectTests, VectorizedAggregateTests, VectorizedFilterTests), which can cause cross-run pollution and parallel-test collisions; change each SetUp to create a unique temp directory (use std::filesystem::temp_directory_path() combined with a per-test unique suffix such as std::filesystem::unique_path() or a timestamp + test name) and pass that path to the StorageManager constructor, then update TearDown to call std::filesystem::remove_all on that directory before resetting the storage_ unique_ptr so artifacts are cleaned up between runs and tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/vectorized_operator_tests.cpp`:
- Around line 256-280: The test is silent about a SQL-spec bug: COUNT(*) over an
empty table should return 0 but current VectorizedAggregateOperator leaves
has_value_ false so the result is NULL; fix the operator by initializing
has_value_[i] = true and the corresponding accumulator to 0 for aggregates where
aggs[i].type == AggregateType::Count (in the VectorizedAggregateOperator
constructor/initialization code that processes VectorizedAggregateInfo), then
update the CountOnlyEmpty test (VectorizedAggregateTests::CountOnlyEmpty) to
assert result->get_column(0).get(0).as_int64() == 0 (and keep row_count() == 1).
---
Outside diff comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 104-146: In VectorizedFilterOperator::next_batch(VectorBatch&
out_batch) the method must clear the output batch at entry (call
out_batch.clear()) and remove the early-return that checks out_batch.row_count()
so the operator doesn't silently reuse stale rows; when writing matches, set the
output row count to the number of matches
(out_batch.set_row_count(selection.size())) instead of adding to the existing
count, and otherwise keep the existing selection/build logic and
input_batch_->clear() behavior to match the approach used by
VectorizedProjectOperator::next_batch and
VectorizedAggregateOperator::next_batch.
---
Nitpick comments:
In `@tests/vectorized_operator_tests.cpp`:
- Line 25: The test fixtures currently hardcode "./test_scan_ops" in SetUp when
constructing StorageManager (e.g., in SetUp of VectorizedSeqScanTests,
VectorizedProjectTests, VectorizedAggregateTests, VectorizedFilterTests), which
can cause cross-run pollution and parallel-test collisions; change each SetUp to
create a unique temp directory (use std::filesystem::temp_directory_path()
combined with a per-test unique suffix such as std::filesystem::unique_path() or
a timestamp + test name) and pass that path to the StorageManager constructor,
then update TearDown to call std::filesystem::remove_all on that directory
before resetting the storage_ unique_ptr so artifacts are cleaned up between
runs and tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 020e5854-b4f7-4201-a91b-9d23a98f6dec
📒 Files selected for processing (3)
CMakeLists.txtinclude/executor/vectorized_operator.hpptests/vectorized_operator_tests.cpp
| TEST_F(VectorizedAggregateTests, CountOnlyEmpty) { | ||
| // Note: COUNT(*) on empty table currently returns NULL due to has_value_ semantics | ||
| // This test verifies behavior as-is rather than expecting specific results | ||
| Schema schema; | ||
| schema.add_column("val", common::ValueType::TYPE_INT64); | ||
|
|
||
| ColumnarTable table("count_empty_agg", *storage_, schema); | ||
| ASSERT_TRUE(table.create()); | ||
| ASSERT_TRUE(table.open()); | ||
|
|
||
| auto table_ptr = std::make_shared<ColumnarTable>(table); | ||
| auto scan = std::make_unique<VectorizedSeqScanOperator>("count_empty_agg", table_ptr); | ||
|
|
||
| Schema out_schema; | ||
| out_schema.add_column("cnt", common::ValueType::TYPE_INT64); | ||
|
|
||
| std::vector<VectorizedAggregateInfo> aggs = {{AggregateType::Count, -1}}; | ||
|
|
||
| VectorizedAggregateOperator agg(std::move(scan), std::move(out_schema), aggs); | ||
|
|
||
| auto result = VectorBatch::create(agg.output_schema()); | ||
| ASSERT_TRUE(agg.next_batch(*result)); | ||
| EXPECT_EQ(result->row_count(), 1); | ||
| // Empty table aggregate produces a result batch (done_ = true) | ||
| } |
There was a problem hiding this comment.
CountOnlyEmpty locks in a SQL-spec violation rather than flagging it.
Per SQL standard, COUNT(*) on an empty relation must return 0, not NULL. The test comment acknowledges the aggregate operator returns NULL here because has_value_[i] remains false for COUNT over zero batches, but the test only asserts row_count() == 1 — it never checks the actual count value, so the bug will pass CI silently and future regressions in either direction go undetected.
Recommend either:
- Fix
VectorizedAggregateOperatorsoCOUNTis initialized withhas_value_[i] = true(COUNT always has a value), and assertresult->get_column(0).get(0).as_int64() == 0, or - Explicitly assert the (buggy) NULL behavior via
EXPECT_TRUE(result->get_column(0).get(0).is_null())and open a tracking issue.
Leaving the assertion absent makes this test effectively a no-op for its stated concern.
Want me to draft the aggregate fix (initializing has_value_ for Count in the constructor) and update the test accordingly?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vectorized_operator_tests.cpp` around lines 256 - 280, The test is
silent about a SQL-spec bug: COUNT(*) over an empty table should return 0 but
current VectorizedAggregateOperator leaves has_value_ false so the result is
NULL; fix the operator by initializing has_value_[i] = true and the
corresponding accumulator to 0 for aggregates where aggs[i].type ==
AggregateType::Count (in the VectorizedAggregateOperator
constructor/initialization code that processes VectorizedAggregateInfo), then
update the CountOnlyEmpty test (VectorizedAggregateTests::CountOnlyEmpty) to
assert result->get_column(0).get(0).as_int64() == 0 (and keep row_count() == 1).
…table VectorizedFilterOperator::next_batch(): - Clear output batch at entry for clean state - Remove early-return optimization that checked row_count > 0 - Set row count to selection.size() instead of accumulating VectorizedAggregateOperator: - Initialize has_value_[i] = true for COUNT aggregates in constructor - This makes COUNT(*) return 0 (not NULL) for empty tables per SQL spec Test update: - CountOnlyEmpty now asserts COUNT(*) = 0 for empty table
Summary
Add comprehensive unit tests for vectorized operators:
VectorizedSeqScanOperator: Empty table, single batch, non-aligned boundaries, sequential calls until EOFVectorizedProjectOperator: Empty input, multiple expressions, computed expressionsVectorizedAggregateOperator: Empty input, COUNT-only, SUM with FLOAT64VectorizedFilterOperator: No matches, all match, pipelined batch processingTest plan
Summary by CodeRabbit
New Features
Tests