
add merge large file correctness tests and fixes #21576

Open
Shailesh-Kumar-Singh wants to merge 1 commit into opensearch-project:main from Shailesh-Kumar-Singh:merge_parquet

Conversation

@Shailesh-Kumar-Singh
Contributor

Description

Tests:
Added large file merge correctness tests.

Fixes:
Bloom filter config not propagating: when bloom_filter_enabled = false, bloom filter metadata was still being written. Fixed by setting bloom filter properties only when the filter is enabled. (We were setting the bloom filter FPP/NDV settings after disabling the bloom filter; setting them implicitly re-enables the bloom filter.)
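The implicit-enable pitfall described above can be sketched with a toy builder (illustrative only; the real code uses the parquet crate's WriterPropertiesBuilder, and the names and internal behavior below are assumptions for the sketch, not the crate's actual API):

```rust
// Toy builder mimicking the pitfall: setting FPP (re)creates bloom filter
// properties, implicitly enabling the feature even after it was disabled.
#[derive(Debug, Default)]
struct BloomProps {
    enabled: bool,
    fpp: f64,
}

#[derive(Default)]
struct Builder {
    bloom: Option<BloomProps>,
}

impl Builder {
    fn set_bloom_filter_enabled(mut self, on: bool) -> Self {
        if on {
            self.bloom.get_or_insert_with(BloomProps::default).enabled = true;
        } else {
            self.bloom = None;
        }
        self
    }
    fn set_bloom_filter_fpp(mut self, fpp: f64) -> Self {
        // Implicitly recreates bloom properties -- this is the trap.
        self.bloom.get_or_insert_with(BloomProps::default).fpp = fpp;
        self
    }
}

// Buggy order: disable first, then set FPP, which re-enables the filter.
fn buggy(b: Builder) -> Builder {
    b.set_bloom_filter_enabled(false).set_bloom_filter_fpp(0.01)
}

// Fixed: only touch bloom filter settings when the feature is enabled.
fn fixed(b: Builder, enabled: bool) -> Builder {
    if enabled {
        b.set_bloom_filter_enabled(true).set_bloom_filter_fpp(0.01)
    } else {
        b
    }
}

fn main() {
    // Buggy path still carries bloom metadata; fixed path does not.
    assert!(buggy(Builder::default()).bloom.is_some());
    assert!(fixed(Builder::default(), false).bloom.is_none());
}
```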

TIER 2 merge edge case: after draining a full batch and loading the next via advance_past_batch(), the cursor wasn't yielding back to the heap and could emit rows out of global sort order. Added a yield check at the batch boundary.
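The batch-boundary yield check can be sketched as a minimal ascending k-way merge over batched cursors (all names and types here are illustrative, not the PR's actual sorted.rs implementation):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A cursor that reads values batch by batch, like a Parquet record batch reader.
struct Cursor {
    batches: Vec<Vec<i64>>,
    batch: usize,
    row: usize,
}

impl Cursor {
    fn current(&self) -> Option<i64> {
        self.batches.get(self.batch)?.get(self.row).copied()
    }
    fn advance(&mut self) {
        self.row += 1;
        if self.batches.get(self.batch).map_or(false, |b| self.row >= b.len()) {
            // Batch boundary: load the next batch (advance_past_batch analogue).
            self.batch += 1;
            self.row = 0;
        }
    }
}

fn merge(mut cursors: Vec<Cursor>) -> Vec<i64> {
    // Min-heap of (value, cursor index) via Reverse.
    let mut heap: BinaryHeap<Reverse<(i64, usize)>> = cursors
        .iter()
        .enumerate()
        .filter_map(|(i, c)| c.current().map(|v| Reverse((v, i))))
        .collect();
    let mut out = Vec::new();
    while let Some(Reverse((val, i))) = heap.pop() {
        out.push(val);
        let cursor = &mut cursors[i];
        cursor.advance();
        // Drain this cursor while it stays <= the heap top. Crucially, the
        // yield check also runs right after a batch boundary -- the edge case
        // this PR fixes.
        loop {
            let Some(v) = cursor.current() else { break };
            match heap.peek() {
                Some(Reverse((top, _))) if v > *top => {
                    heap.push(Reverse((v, i))); // yield back to the heap
                    break;
                }
                _ => {
                    out.push(v);
                    cursor.advance();
                }
            }
        }
    }
    out
}

fn main() {
    // Cursor a crosses a batch boundary (1,2 | 10) mid-drain and must yield
    // so cursor b's 3 and 4 are emitted before 10.
    let a = Cursor { batches: vec![vec![1, 2], vec![10]], batch: 0, row: 0 };
    let b = Cursor { batches: vec![vec![3, 4]], batch: 0, row: 0 };
    assert_eq!(merge(vec![a, b]), vec![1, 2, 3, 4, 10]);
}
```

Without the yield check at the boundary, the first cursor would keep draining and emit 10 before 3 and 4, breaking global sort order.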

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Shailesh-Kumar-Singh Shailesh-Kumar-Singh requested a review from a team as a code owner May 9, 2026 05:32
@github-actions
Contributor

github-actions Bot commented May 9, 2026

PR Reviewer Guide 🔍

(Review updated until commit 2dad307)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Logic Error

After draining a batch at line 194, the code checks if the new batch's first value is greater than heap_top and yields if true. However, this check uses the wrong comparison direction for descending sorts. When reverse_sorts is true, a cursor with a smaller value should yield (since smaller means "later" in descending order), but the code only yields when val > heap_top in the natural ordering. This causes the cursor to continue emitting when it should yield, breaking global sort order in descending merges.

// Check if cursor should yield after loading new batch
let val = cursor.current_sort_values()?;
if cmp_sort_values(&val, heap_top, reverse_sorts) == Ordering::Greater {
    heap.push(HeapItem {
        sort_values: val,
        file_id,
        reverse_sorts: Arc::clone(&reverse_sorts_arc),
    });
    break;
}
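For context, a comparator can fold the sort direction into the comparison itself, so that a single Ordering::Greater check means "yield" for both ascending and descending merges (a hedged sketch; this cmp_sort_values is illustrative and need not match the PR's actual implementation):

```rust
use std::cmp::Ordering;

// Illustrative comparator: per-key reverse flags flip the natural ordering,
// so Ordering::Greater uniformly means "this value sorts after the heap top
// and the cursor should yield", regardless of sort direction.
fn cmp_sort_values(a: &[i64], b: &[i64], reverse: &[bool]) -> Ordering {
    for ((x, y), rev) in a.iter().zip(b).zip(reverse) {
        let ord = x.cmp(y);
        let ord = if *rev { ord.reverse() } else { ord };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

fn main() {
    // Ascending: 5 sorts after 3, so the cursor yields.
    assert_eq!(cmp_sort_values(&[5], &[3], &[false]), Ordering::Greater);
    // Descending: 2 sorts after 3, so the flipped comparison also reports
    // Greater and the cursor yields, preserving global descending order.
    assert_eq!(cmp_sort_values(&[2], &[3], &[true]), Ordering::Greater);
}
```

If the reviewer's concern holds and the reverse flags are not applied inside the comparison, the descending case above would return Less and the cursor would keep emitting out of order.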
Possible Issue

Line 887 removes null entries from the source map using source.values().removeIf(Objects::isNull). This modifies the values collection, which may not behave as expected for all Map implementations. If the map's values collection does not support removal, this throws UnsupportedOperationException. Even if it works for HashMap, the intent is unclear: removing null values means those fields are missing from the document, but the test also uses randomBoolean() ? null : ... to generate nulls, suggesting nulls should be indexed as missing fields. The removal is redundant if that's the intent, or incorrect if nulls should be explicitly indexed.

source.values().removeIf(java.util.Objects::isNull);

@github-actions
Contributor

github-actions Bot commented May 9, 2026

PR Code Suggestions ✨

Latest suggestions up to 2dad307

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Fix null handling in test documents

Removing null values from the source map after construction means documents with
missing fields are indexed differently than intended. The test aims to verify null
handling with nulls_first/nulls_last, but removing nulls prevents testing the actual
missing value behavior. Instead, omit null entries during map construction or use a
separate approach to represent missing fields.

sandbox/plugins/composite-engine/src/internalClusterTest/java/org/opensearch/composite/CompositeMergeIT.java [880-892]

-source.values().removeIf(java.util.Objects::isNull);
+// Build source map without null entries (missing fields)
+var source = new java.util.HashMap<String, Object>();
+Object ageVal = randomBoolean() ? null : randomIntBetween(0, 100);
+Object scoreVal = randomBoolean() ? null : randomLongBetween(0, 10000);
+Object tagVal = randomBoolean() ? null : randomAlphaOfLengthBetween(3, 6);
+if (ageVal != null) source.put("age", ageVal);
+if (scoreVal != null) source.put("score", scoreVal);
+if (tagVal != null) source.put("tag", tagVal);
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that source.values().removeIf(Objects::isNull) at line 887 removes null values after they're added, which defeats the purpose of testing null handling. The improved approach of conditionally adding non-null values is cleaner and more explicit about representing missing fields, which is important for testing nulls_first/nulls_last behavior.

Medium
Verify binary search invariant with assertion

The comment claims the binary search invariant guarantees cursor.current_value >
heap_top, but this is only true if the binary search correctly positions the cursor.
If run_end equals cursor.row_idx (no rows to emit), the cursor advances but may land
on a value equal to or less than heap_top, violating the invariant. Add a defensive
check to verify the invariant holds before unconditionally yielding.

sandbox/plugins/parquet-data-format/src/main/rust/src/merge/sorted.rs [254-262]

-// Binary search invariant guarantees cursor.current_value > heap_top
-// so we always yield here (no need for conditional check)
 let val = cursor.current_sort_values()?;
+// Verify binary search invariant: cursor value should be > heap_top
+debug_assert!(
+    cmp_sort_values(&val, heap_top, reverse_sorts) == Ordering::Greater,
+    "Binary search invariant violated: cursor value should be > heap_top"
+);
 heap.push(HeapItem {
     sort_values: val,
     file_id,
     reverse_sorts: Arc::clone(&reverse_sorts_arc),
 });
 break;
Suggestion importance[1-10]: 5


Why: Adding a debug_assert! to verify the binary search invariant is a reasonable defensive programming practice. However, the comment in the PR explicitly states this invariant is guaranteed by the binary search logic, and the extensive test suite (including test_tier2_yield_after_batch_boundary and related tests) validates this behavior. The assertion adds safety but isn't critical.

Low
Clarify equality handling in yield logic

The TIER 2 yield check after loading a new batch uses the same comparison logic as
TIER 3, but TIER 2 operates on full batches while TIER 3 uses binary search. This
creates a subtle correctness issue: when cursor.current_sort_values() equals
heap_top, TIER 2 continues draining (correct), but the comment suggests this is a
"yield" check. The logic should explicitly handle equality to ensure the cursor
doesn't incorrectly yield when values are equal.

sandbox/plugins/parquet-data-format/src/main/rust/src/merge/sorted.rs [199-208]

 // Check if cursor should yield after loading new batch
 let val = cursor.current_sort_values()?;
+// Only yield if strictly greater than heap_top (not equal)
 if cmp_sort_values(&val, heap_top, reverse_sorts) == Ordering::Greater {
     heap.push(HeapItem {
         sort_values: val,
         file_id,
         reverse_sorts: Arc::clone(&reverse_sorts_arc),
     });
     break;
 }
+// If equal or less, continue draining this cursor
Suggestion importance[1-10]: 3


Why: The suggestion adds a clarifying comment but doesn't change the logic. The existing code already correctly handles equality (continues draining when equal, yields only when Greater). The added comment provides minimal value since the code behavior is already correct and the comparison check is explicit.

Low

Previous suggestions

Suggestions up to commit a89dc54
Category | Suggestion | Impact
Possible issue
Avoid unsafe collection modification

Calling removeIf on values() modifies the underlying map's values collection, which
may throw UnsupportedOperationException depending on the map implementation.
Instead, create a new map with non-null entries or use entrySet().removeIf() to
safely remove null-valued entries.

sandbox/plugins/composite-engine/src/internalClusterTest/java/org/opensearch/composite/CompositeMergeIT.java [887]

-source.values().removeIf(java.util.Objects::isNull);
+source.entrySet().removeIf(entry -> entry.getValue() == null);
Suggestion importance[1-10]: 8


Why: The values().removeIf() call can throw UnsupportedOperationException on some map implementations. Using entrySet().removeIf() is safer and more explicit about modifying the map entries.

Medium
Suggestions up to commit d88b19d
Category | Suggestion | Impact
Possible issue
Fix unsafe map modification pattern

Calling removeIf on the values collection of a map can lead to
ConcurrentModificationException or undefined behavior, as it attempts to modify the
map through its values view. Instead, use entrySet().removeIf() with a predicate on
the value, or iterate and collect non-null entries into a new map.

sandbox/plugins/composite-engine/src/internalClusterTest/java/org/opensearch/composite/CompositeMergeIT.java [887]

-source.values().removeIf(java.util.Objects::isNull);
+source.entrySet().removeIf(entry -> entry.getValue() == null);
Suggestion importance[1-10]: 9


Why: This is a critical bug. Calling removeIf on values() view of a map can cause ConcurrentModificationException or undefined behavior. The suggested fix using entrySet().removeIf() is the correct approach to safely remove null-valued entries from the map.

High
General
Explicitly disable bloom filters when disabled

When bloom filter is disabled, the function returns the builder without explicitly
disabling bloom filters. If the builder has default bloom filter settings enabled,
they will remain active. Explicitly call set_bloom_filter_enabled(false) in the else
branch to ensure bloom filters are disabled when config.get_bloom_filter_enabled()
returns false.

sandbox/plugins/parquet-data-format/src/main/rust/src/writer_properties_builder.rs [136-145]

 fn apply_bloom_filter_settings(
     mut builder: parquet::file::properties::WriterPropertiesBuilder,
     config: &NativeSettings
 ) -> parquet::file::properties::WriterPropertiesBuilder {
     if config.get_bloom_filter_enabled() {
         builder = builder.set_bloom_filter_enabled(true);
         builder = builder.set_bloom_filter_fpp(config.get_bloom_filter_fpp());
         builder = builder.set_bloom_filter_ndv(config.get_bloom_filter_ndv());
+    } else {
+        builder = builder.set_bloom_filter_enabled(false);
     }
     builder
 }
Suggestion importance[1-10]: 7


Why: While the current code may work if the builder defaults to disabled, explicitly setting set_bloom_filter_enabled(false) in the else branch ensures clarity and prevents potential issues if builder defaults change. However, the test at line 242 verifies that bloom_filter_properties is None when disabled, suggesting the current implementation may already work correctly.

Medium

@github-actions
Contributor

github-actions Bot commented May 9, 2026

Persistent review updated to latest commit a89dc54

@github-actions
Contributor

github-actions Bot commented May 9, 2026

✅ Gradle check result for a89dc54: SUCCESS

@codecov

codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.46%. Comparing base (8f72a95) to head (2dad307).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21576      +/-   ##
============================================
- Coverage     73.48%   73.46%   -0.03%     
+ Complexity    74646    74615      -31     
============================================
  Files          5980     5980              
  Lines        338777   338777              
  Branches      48848    48848              
============================================
- Hits         248964   248896      -68     
- Misses        70026    70049      +23     
- Partials      19787    19832      +45     

☔ View full report in Codecov by Sentry.

Signed-off-by: Shailesh-Kumar-Singh <shaileshkumarsingh260@gmail.com>
@github-actions
Contributor

github-actions Bot commented May 9, 2026

Persistent review updated to latest commit 2dad307

@github-actions
Contributor

github-actions Bot commented May 9, 2026

✅ Gradle check result for 2dad307: SUCCESS
