fix(vectorized): correct filter operator to exhaust all child batches #56
Conversation
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/executor/vectorized_operator.hpp (1)
104-138: ⚠️ Potential issue | 🟠 Major

Filter now drains the entire child pipeline in a single call, breaking vectorized/pipelined execution and risking unbounded memory.
With the early return removed, `next_batch` no longer returns once a non-empty match set is produced; instead it loops until the child reports EOF, appending every matching row across every child batch into a single `out_batch`. Consequences:

- Unbounded output batch. For a low-selectivity predicate over a large table, `out_batch` accumulates potentially millions of rows in one `VectorBatch`, defeating the batch-at-a-time design (compare with `VectorizedSeqScanOperator`, which yields roughly `batch_size_` rows).
- Breaks pipelining. Any downstream operator that benefits from incremental delivery (e.g., a future `LIMIT`, top-N, or a consumer that interleaves work) is now forced to wait for the full scan. This turns `Filter` into a de-facto blocking operator, similar to `VectorizedAggregateOperator`, but `Filter` is not terminal.
- Diagnosis of the original bug may be wrong. Standard vectorized semantics is exactly "return `true` when this call produced rows; the caller re-invokes to pull more." Since `child_` preserves its own position across calls, the previous early return was correct as long as the caller loops on `next_batch`. If the reported failure was that the caller only called `next_batch` once, the real fix belongs in the caller.

Suggested fix: return as soon as any matches are produced (or cap by a target batch size) and let the next caller invocation resume. Because `child_->next_batch` naturally returns `false` at EOF, no extra state is needed.

♻️ Proposed fix (return per-batch, preserving pipelining)

```diff
 while (child_->next_batch(*input_batch_)) {
     selection_mask_->clear();
     condition_->evaluate_vectorized(*input_batch_, child_->output_schema(), *selection_mask_);
     std::vector<size_t> selection;
     for (size_t r = 0; r < input_batch_->row_count(); ++r) {
         common::Value val = selection_mask_->get(r);
         if (!val.is_null() && val.as_bool()) {
             selection.push_back(r);
         }
     }
     if (!selection.empty()) {
-        // Batch-level append optimization: iterate columns once
         for (size_t c = 0; c < input_batch_->column_count(); ++c) {
             auto& src_col = input_batch_->get_column(c);
             auto& dest_col = out_batch.get_column(c);
             for (size_t r : selection) {
                 dest_col.append(src_col.get(r));
             }
         }
         out_batch.set_row_count(out_batch.row_count() + selection.size());
+        input_batch_->clear();
+        return true;
     }
     input_batch_->clear();
 }
-// Return true if we accumulated any rows, false if no matches found
-return out_batch.row_count() > 0;
+return false;
```

If coalescing small selections across child batches is desired for efficiency, cap the accumulation by a target size (e.g., `batch_size`) and return when that threshold is reached, rather than only at child EOF.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/executor/vectorized_operator.hpp` around lines 104 - 138, The current VectorizedFilter next_batch implementation drains the entire child in one call, causing unbounded out_batch growth and breaking pipelining; fix next_batch (the override) so it returns as soon as it has appended any rows to out_batch (or when out_batch reaches a configured target size like batch_size_), i.e., after processing a child_->next_batch() and appending selection rows to out_batch, immediately return true if out_batch.row_count() > 0 (or if accumulated >= batch_size_), otherwise continue fetching child batches until EOF; update logic around input_batch_, selection_mask_, and out_batch to preserve position across calls but avoid aggregating all child rows in one invocation.
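To make the capped-coalescing variant mentioned above concrete, here is a minimal, self-contained sketch. All names (`ToyBatch`, `ToySource`, `CoalescingFilter`, `target_size`) are invented stand-ins for illustration, not the project's actual `VectorBatch`/operator API: the filter keeps appending matches across child batches, but returns as soon as the output reaches a target row count, so a single call never accumulates an unbounded result.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Invented stand-ins for the real VectorBatch / operator types.
using ToyBatch = std::vector<int>;

struct ToySource {
    std::vector<ToyBatch> batches;
    size_t pos = 0;
    // Pull contract: fill `out`, return true while batches remain.
    bool next_batch(ToyBatch& out) {
        if (pos >= batches.size()) return false;
        out = batches[pos++];
        return true;
    }
};

// Coalesce matches across child batches, but cap the output per call.
struct CoalescingFilter {
    ToySource* child;
    std::function<bool(int)> pred;
    size_t target_size;  // analogous to a batch-size knob

    bool next_batch(ToyBatch& out) {
        out.clear();
        ToyBatch in;
        // Keep pulling child batches until enough rows matched or child EOF.
        while (out.size() < target_size && child->next_batch(in)) {
            for (int v : in)
                if (pred(v)) out.push_back(v);
        }
        // True as long as this call produced rows; false only at child EOF.
        return !out.empty();
    }
};
```

Note that `out` may overshoot `target_size` by up to one input batch's worth of matches; a production version would carry the overflow into the next call rather than re-filtering.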
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ac3ddd17-5450-4732-b33c-6b74f89a50ff
📒 Files selected for processing (1)
include/executor/vectorized_operator.hpp
…h at a time

Previously, `VectorizedFilterOperator::next_batch()` would drain the entire child operator in one call, causing unbounded `out_batch` growth and breaking pipelining. Now each call processes exactly one child batch, enabling pipelined execution downstream.
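Under this one-child-batch-per-call contract, correctness depends on the caller looping until `next_batch()` returns `false`. The sketch below uses invented toy types (`ToyBatch`, `ToyScan`, `ToyFilter`, `drain` are not the project's classes) to show that repeated calls still exhaust every child batch, because the child keeps its own cursor between calls:

```cpp
#include <cstddef>
#include <vector>

// Invented stand-ins for the real VectorBatch / operator types.
using ToyBatch = std::vector<int>;

struct ToyScan {
    std::vector<ToyBatch> batches;
    size_t pos = 0;
    // Pull contract: fill `out`, return true while batches remain.
    bool next_batch(ToyBatch& out) {
        if (pos >= batches.size()) return false;
        out = batches[pos++];
        return true;
    }
};

struct ToyFilter {
    ToyScan* child;
    bool (*pred)(int);
    // Return as soon as one child batch yields matches; resume on next call.
    bool next_batch(ToyBatch& out) {
        ToyBatch in;
        while (child->next_batch(in)) {
            out.clear();
            for (int v : in)
                if (pred(v)) out.push_back(v);
            if (!out.empty()) return true;  // per-batch return keeps pipelining
        }
        return false;  // child EOF
    }
};

// Drain the pipeline the way a correct caller does: loop until false.
std::vector<int> drain(ToyFilter& f) {
    std::vector<int> all;
    ToyBatch b;
    while (f.next_batch(b))
        all.insert(all.end(), b.begin(), b.end());
    return all;
}
```

Batches whose rows all fail the predicate are skipped internally rather than surfaced as empty batches, and no extra filter-side state is needed because `ToyScan::pos` persists across calls.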
Summary
VectorizedFilterOperator::next_batch() was returning true after copying matches from the first batch, instead of continuing to consume all child batches until EOF.
Fix
Test plan
Summary by CodeRabbit