refactor(vectorized): streaming hash join with bounded memory (#69)
Conversation
- Enhance SelectWithGroupBy to verify actual aggregated values
- Add SelectWithGroupByCount test (COUNT with column)
- Add SelectWithGroupByMinMax test (MIN/MAX aggregates)
- Add SelectWithGroupByMultipleColumns test (multi-column GROUP BY)
… test coverage

- Add LEFT join support tracking unmatched left rows and emitting with NULLs
- Add join_type_ member to differentiate INNER/LEFT behavior
- Add unmatched_indices_ and left_matched_in_batch_ for LEFT join tracking
- Add 6 new test cases: EmptyRight, EmptyLeft, MultipleMatches, LeftNullKeys, OutputValues, MultiBatch
- Document known limitation: first-match-only behavior for duplicate right keys
- Add *.bin to .gitignore to ignore test data files
…in hash join
The break statement at line 743 caused each left row to match only ONE right row,
even when multiple right rows had the same key. This was incorrect for
both INNER and LEFT joins: a proper hash join must match ALL right rows
with the same key.
Example: Right has id={1,1,2}, Left has id={1,2}
- Before: 2 matches (first match wins, second ignored)
- After: 3 matches (left_id=1 matches 2 right rows, left_id=2 matches 1)
… resumable scanning

When a single left row matches multiple right rows, the probe loop could emit more than BATCH_SIZE rows before checking capacity. Add probe state cursors (resuming_bucket_scan_, resumed_bucket_idx_, resumed_entry_idx_, resumed_key_val_) that persist across next_batch calls, so bucket scanning can be paused when a batch is full and resumed on the next invocation.

Also add explicit right_ids assertions to the VectorizedHashJoinMultipleMatches test to mirror the left_ids checks.
- Load left rows into buffer once for repeated probing
- Process right table in 1024-row chunks
- Support LEFT join with unmatched row emission
- State machine: LoadLeftBuffer → BuildRightChunk → ProbeChunk → EmitUnmatched → Done
- Fix batch overflow bug with resumable bucket scanning
…tch test

- Fix VectorizedHashBucket.key_values comment to match implementation (stores one Value per row, not a vector of column values)
- Add VectorizedHashJoinLeftMultiBatch test to verify LEFT join correctness when the right table spans multiple 1024-row chunks
- Add section 6 documenting the streaming hash join
- Cover bounded memory design (1024-row chunks)
- Document state machine architecture
- Cover LEFT join support and cross-chunk deduplication
Resolve conflicts by keeping the streaming hash join implementation

- Take the HEAD version (streaming/chunked hash join) over main's non-streaming version
- Re-add the VectorizedHashJoinLeftMultiBatch test that was lost in conflict resolution
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/vectorized_operator_tests.cpp (1)
1384-1396: Strengthen this test to verify key correctness, not only row/null counts.

Current assertions can still pass if some non-null matches are attached to the wrong left keys. Please assert that all non-null rows are (left.id=1, right.id=1) and that the NULL-emitted rows are exactly for left ids 2 and 3.

Proposed assertion hardening:

```diff
 auto result = VectorBatch::create(join->output_schema());
 int64_t total_rows = 0;
 int64_t rows_with_nulls = 0;  // LEFT id=2,3 should emit with NULLs
+int64_t matched_rows = 0;
+int64_t null_for_left2 = 0;
+int64_t null_for_left3 = 0;
 while (join->next_batch(*result)) {
     for (size_t i = 0; i < result->row_count(); ++i) {
+        int64_t left_id = result->get_column(0).get(i).as_int64();
         if (result->get_column(1).get(i).is_null()) {
             rows_with_nulls++;
+            if (left_id == 2) ++null_for_left2;
+            else if (left_id == 3) ++null_for_left3;
+            else ADD_FAILURE() << "Unexpected NULL-emitted left.id=" << left_id;
         }
+        else {
+            int64_t right_id = result->get_column(1).get(i).as_int64();
+            EXPECT_EQ(left_id, 1);
+            EXPECT_EQ(right_id, 1);
+            ++matched_rows;
+        }
     }
     total_rows += result->row_count();
     result->clear();
 }
 // LEFT join: id=1 matches 1500 rows, id=2,3 emit with NULLs
 EXPECT_EQ(total_rows, 1502);  // 1500 matches + 2 unmatched with NULLs
 EXPECT_EQ(rows_with_nulls, 2);  // id=2 and id=3 have no match
+EXPECT_EQ(matched_rows, 1500);
+EXPECT_EQ(null_for_left2, 1);
+EXPECT_EQ(null_for_left3, 1);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/vectorized_operator_tests.cpp` around lines 1384 - 1396, The test only checks total_rows and rows_with_nulls but not that non-null rows correspond to (left.id=1,right.id=1) and the two NULL rows are for left ids 2 and 3; update the loop that consumes join->next_batch(*result) to inspect columns via result->get_column(...) and for each row collect left.id and right.id (treat right.id nulls explicitly), assert every non-null right.id row has left.id==1 and right.id==1, and assert exactly two rows have right.id null and their left ids are {2,3}; keep using result->row_count(), result->clear(), and the same batch iteration logic but add these per-row checks and a final check on the set of null-left ids.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/phases/PHASE_8_ANALYTICS.md`:
- Around line 41-48: The docs mention outdated internals (RIGHT_CHUNK_SIZE,
left_rows_buffer_, four-phase flow) that don't match the current
VectorizedHashJoinOperator implementation; update PHASE_8_ANALYTICS.md to use
the real symbols and behavior: replace RIGHT_CHUNK_SIZE/left_rows_buffer_ with
BATCH_SIZE and the actual in-memory buffering behavior, change the phase list to
BuildRight → ProbeLeft → Done (and note resumable probe state fields used for
incremental probing), and remove/replace references to 4-phase flow, 64 hash
buckets (if not used), and left_row_matched_ if it's not present; ensure
terminology matches class names/methods (VectorizedHashJoinOperator, BuildRight,
ProbeLeft, Done, and any resumable probe state fields) so the documentation
reflects the current code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 238cfea9-b149-41f0-8b64-0384f9688efa
📒 Files selected for processing (2)
- docs/phases/PHASE_8_ANALYTICS.md
- tests/vectorized_operator_tests.cpp
Summary
Commits (6 total)
Test plan
```shell
cmake --build build --target vectorized_operator_tests
./build/vectorized_operator_tests --gtest_filter="*VectorizedHashJoin*"
```

9 tests pass

Summary by CodeRabbit
Documentation
Tests