refactor(vectorized): streaming hash join with bounded memory (#69)
Conversation
- Enhance SelectWithGroupBy to verify actual aggregated values
- Add SelectWithGroupByCount test (COUNT with column)
- Add SelectWithGroupByMinMax test (MIN/MAX aggregates)
- Add SelectWithGroupByMultipleColumns test (multi-column GROUP BY)
… test coverage

- Add LEFT join support tracking unmatched left rows and emitting with NULLs
- Add join_type_ member to differentiate INNER/LEFT behavior
- Add unmatched_indices_ and left_matched_in_batch_ for LEFT join tracking
- Add 6 new test cases: EmptyRight, EmptyLeft, MultipleMatches, LeftNullKeys, OutputValues, MultiBatch
- Document known limitation: first-match-only behavior for duplicate right keys
- Add *.bin to .gitignore to ignore test data files
…in hash join
The break statement at line 743 caused each left row to match only ONE right row,
even when multiple right rows had the same key. This was incorrect for
both INNER and LEFT joins: a proper hash join must match ALL right rows
with the same key.
Example: Right has id={1,1,2}, Left has id={1,2}
- Before: 2 matches (first match wins, second ignored)
- After: 3 matches (left_id=1 matches 2 right rows, left_id=2 matches 1)
… resumable scanning

When a single left row matches multiple right rows, the probe loop could emit more than BATCH_SIZE rows before checking capacity. Add probe state cursors (resuming_bucket_scan_, resumed_bucket_idx_, resumed_entry_idx_, resumed_key_val_) that persist across next_batch calls, so bucket scanning can be paused when a batch is full and resumed on the next invocation.

Also add explicit right_ids assertions to the VectorizedHashJoinMultipleMatches test to mirror the left_ids checks.
- Load left rows into buffer once for repeated probing
- Process right table in 1024-row chunks
- Support LEFT join with unmatched row emission
- State machine: LoadLeftBuffer → BuildRightChunk → ProbeChunk → EmitUnmatched → Done
- Fix batch overflow bug with resumable bucket scanning
…tch test

- Fix VectorizedHashBucket.key_values comment to match implementation (stores one Value per row, not a vector of column values)
- Add VectorizedHashJoinLeftMultiBatch test to verify LEFT join correctness when the right table spans multiple 1024-row chunks
- Add section 6 documenting the streaming hash join
- Cover bounded memory design (1024-row chunks)
- Document state machine architecture
- Cover LEFT join support and cross-chunk deduplication
Resolve conflicts by keeping the streaming hash join implementation

- Take the HEAD version (streaming/chunked hash join) over main's non-streaming version
- Re-add the VectorizedHashJoinLeftMultiBatch test that was lost in conflict resolution
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/vectorized_operator_tests.cpp (1)
1384-1396: Strengthen this test to verify key correctness, not only row/null counts.

Current assertions can still pass if some non-null matches are attached to the wrong left keys. Please assert that all non-null rows are (left.id=1, right.id=1) and that the NULL-emitted rows are exactly for left ids 2 and 3.

Proposed assertion hardening:

```diff
 auto result = VectorBatch::create(join->output_schema());
 int64_t total_rows = 0;
 int64_t rows_with_nulls = 0;  // LEFT id=2,3 should emit with NULLs
+int64_t matched_rows = 0;
+int64_t null_for_left2 = 0;
+int64_t null_for_left3 = 0;
 while (join->next_batch(*result)) {
     for (size_t i = 0; i < result->row_count(); ++i) {
+        int64_t left_id = result->get_column(0).get(i).as_int64();
         if (result->get_column(1).get(i).is_null()) {
             rows_with_nulls++;
+            if (left_id == 2) ++null_for_left2;
+            else if (left_id == 3) ++null_for_left3;
+            else ADD_FAILURE() << "Unexpected NULL-emitted left.id=" << left_id;
         }
+        else {
+            int64_t right_id = result->get_column(1).get(i).as_int64();
+            EXPECT_EQ(left_id, 1);
+            EXPECT_EQ(right_id, 1);
+            ++matched_rows;
+        }
     }
     total_rows += result->row_count();
     result->clear();
 }
 // LEFT join: id=1 matches 1500 rows, id=2,3 emit with NULLs
 EXPECT_EQ(total_rows, 1502);  // 1500 matches + 2 unmatched with NULLs
 EXPECT_EQ(rows_with_nulls, 2);  // id=2 and id=3 have no match
+EXPECT_EQ(matched_rows, 1500);
+EXPECT_EQ(null_for_left2, 1);
+EXPECT_EQ(null_for_left3, 1);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/vectorized_operator_tests.cpp` around lines 1384 - 1396, The test only checks total_rows and rows_with_nulls but not that non-null rows correspond to (left.id=1,right.id=1) and the two NULL rows are for left ids 2 and 3; update the loop that consumes join->next_batch(*result) to inspect columns via result->get_column(...) and for each row collect left.id and right.id (treat right.id nulls explicitly), assert every non-null right.id row has left.id==1 and right.id==1, and assert exactly two rows have right.id null and their left ids are {2,3}; keep using result->row_count(), result->clear(), and the same batch iteration logic but add these per-row checks and a final check on the set of null-left ids.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/phases/PHASE_8_ANALYTICS.md`:
- Around line 41-48: The docs mention outdated internals (RIGHT_CHUNK_SIZE,
left_rows_buffer_, four-phase flow) that don't match the current
VectorizedHashJoinOperator implementation; update PHASE_8_ANALYTICS.md to use
the real symbols and behavior: replace RIGHT_CHUNK_SIZE/left_rows_buffer_ with
BATCH_SIZE and the actual in-memory buffering behavior, change the phase list to
BuildRight → ProbeLeft → Done (and note resumable probe state fields used for
incremental probing), and remove/replace references to 4-phase flow, 64 hash
buckets (if not used), and left_row_matched_ if it's not present; ensure
terminology matches class names/methods (VectorizedHashJoinOperator, BuildRight,
ProbeLeft, Done, and any resumable probe state fields) so the documentation
reflects the current code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 238cfea9-b149-41f0-8b64-0384f9688efa
📒 Files selected for processing (2)
- docs/phases/PHASE_8_ANALYTICS.md
- tests/vectorized_operator_tests.cpp
Summary
Commits (6 total)
Test plan
```shell
cmake --build build --target vectorized_operator_tests
./build/vectorized_operator_tests --gtest_filter="*VectorizedHashJoin*"
```

9 tests pass

Summary by CodeRabbit
Documentation
Tests