
add merge large file correctness tests and fixes #21576

Open
Shailesh-Kumar-Singh wants to merge 1 commit into opensearch-project:main from Shailesh-Kumar-Singh:merge_parquet

Conversation

@Shailesh-Kumar-Singh
Contributor

Description

Tests:
Added large file merge correctness tests.

Fixes:
Bloom filter config not propagating: when bloom_filter_enabled = false, bloom filter metadata was still being written. Fixed by setting bloom filter properties only when the filter is enabled. (We were setting the bloom filter FPP/NDV settings after disabling the bloom filter; setting them implicitly re-enables the bloom filter.)
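The implicit-enable pitfall described above can be sketched with a toy builder (illustrative only; the real code uses the parquet crate's WriterPropertiesBuilder, and the names and internal behavior below are assumptions for the sketch, not the crate's actual API):

```rust
// Toy builder mimicking the pitfall: setting FPP (re)creates bloom filter
// properties, implicitly enabling the feature even after it was disabled.
#[derive(Debug, Default)]
struct BloomProps {
    enabled: bool,
    fpp: f64,
}

#[derive(Default)]
struct Builder {
    bloom: Option<BloomProps>,
}

impl Builder {
    fn set_bloom_filter_enabled(mut self, on: bool) -> Self {
        if on {
            self.bloom.get_or_insert_with(BloomProps::default).enabled = true;
        } else {
            self.bloom = None;
        }
        self
    }
    fn set_bloom_filter_fpp(mut self, fpp: f64) -> Self {
        // Implicitly recreates bloom properties -- this is the trap.
        self.bloom.get_or_insert_with(BloomProps::default).fpp = fpp;
        self
    }
}

// Buggy order: disable first, then set FPP, which re-enables the filter.
fn buggy(b: Builder) -> Builder {
    b.set_bloom_filter_enabled(false).set_bloom_filter_fpp(0.01)
}

// Fixed: only touch bloom filter settings when the feature is enabled.
fn fixed(b: Builder, enabled: bool) -> Builder {
    if enabled {
        b.set_bloom_filter_enabled(true).set_bloom_filter_fpp(0.01)
    } else {
        b
    }
}

fn main() {
    // Buggy path still carries bloom metadata; fixed path does not.
    assert!(buggy(Builder::default()).bloom.is_some());
    assert!(fixed(Builder::default(), false).bloom.is_none());
}
```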

TIER 2 merge edge case: after draining a full batch and loading the next via advance_past_batch(), the cursor wasn't yielding back to the heap and could emit rows out of global sort order. Added a yield check at the batch boundary.
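The batch-boundary yield check can be sketched as a minimal ascending k-way merge over batched cursors (all names and types here are illustrative, not the PR's actual sorted.rs implementation):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A cursor that reads values batch by batch, like a Parquet record batch reader.
struct Cursor {
    batches: Vec<Vec<i64>>,
    batch: usize,
    row: usize,
}

impl Cursor {
    fn current(&self) -> Option<i64> {
        self.batches.get(self.batch)?.get(self.row).copied()
    }
    fn advance(&mut self) {
        self.row += 1;
        if self.batches.get(self.batch).map_or(false, |b| self.row >= b.len()) {
            // Batch boundary: load the next batch (advance_past_batch analogue).
            self.batch += 1;
            self.row = 0;
        }
    }
}

fn merge(mut cursors: Vec<Cursor>) -> Vec<i64> {
    // Min-heap of (value, cursor index) via Reverse.
    let mut heap: BinaryHeap<Reverse<(i64, usize)>> = cursors
        .iter()
        .enumerate()
        .filter_map(|(i, c)| c.current().map(|v| Reverse((v, i))))
        .collect();
    let mut out = Vec::new();
    while let Some(Reverse((val, i))) = heap.pop() {
        out.push(val);
        let cursor = &mut cursors[i];
        cursor.advance();
        // Drain this cursor while it stays <= the heap top. Crucially, the
        // yield check also runs right after a batch boundary -- the edge case
        // this PR fixes.
        loop {
            let Some(v) = cursor.current() else { break };
            match heap.peek() {
                Some(Reverse((top, _))) if v > *top => {
                    heap.push(Reverse((v, i))); // yield back to the heap
                    break;
                }
                _ => {
                    out.push(v);
                    cursor.advance();
                }
            }
        }
    }
    out
}

fn main() {
    // Cursor a crosses a batch boundary (1,2 | 10) mid-drain and must yield
    // so cursor b's 3 and 4 are emitted before 10.
    let a = Cursor { batches: vec![vec![1, 2], vec![10]], batch: 0, row: 0 };
    let b = Cursor { batches: vec![vec![3, 4]], batch: 0, row: 0 };
    assert_eq!(merge(vec![a, b]), vec![1, 2, 3, 4, 10]);
}
```

Without the yield check at the boundary, the first cursor would keep draining and emit 10 before 3 and 4, breaking global sort order.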

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Shailesh-Kumar-Singh Shailesh-Kumar-Singh requested a review from a team as a code owner May 9, 2026 05:32
@github-actions
Contributor

github-actions Bot commented May 9, 2026

PR Reviewer Guide 🔍

(Review updated until commit 2dad307)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Logic Error

After draining a batch at line 194, the code checks if the new batch's first value is greater than heap_top and yields if true. However, this check uses the wrong comparison direction for descending sorts. When reverse_sorts is true, a cursor with a smaller value should yield (since smaller means "later" in descending order), but the code only yields when val > heap_top in the natural ordering. This causes the cursor to continue emitting when it should yield, breaking global sort order in descending merges.

// Check if cursor should yield after loading new batch
let val = cursor.current_sort_values()?;
if cmp_sort_values(&val, heap_top, reverse_sorts) == Ordering::Greater {
    heap.push(HeapItem {
        sort_values: val,
        file_id,
        reverse_sorts: Arc::clone(&reverse_sorts_arc),
    });
    break;
}
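For context, a comparator can fold the sort direction into the comparison itself, so that a single Ordering::Greater check means "yield" for both ascending and descending merges (a hedged sketch; this cmp_sort_values is illustrative and need not match the PR's actual implementation):

```rust
use std::cmp::Ordering;

// Illustrative comparator: per-key reverse flags flip the natural ordering,
// so Ordering::Greater uniformly means "this value sorts after the heap top
// and the cursor should yield", regardless of sort direction.
fn cmp_sort_values(a: &[i64], b: &[i64], reverse: &[bool]) -> Ordering {
    for ((x, y), rev) in a.iter().zip(b).zip(reverse) {
        let ord = x.cmp(y);
        let ord = if *rev { ord.reverse() } else { ord };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

fn main() {
    // Ascending: 5 sorts after 3, so the cursor yields.
    assert_eq!(cmp_sort_values(&[5], &[3], &[false]), Ordering::Greater);
    // Descending: 2 sorts after 3, so the flipped comparison also reports
    // Greater and the cursor yields, preserving global descending order.
    assert_eq!(cmp_sort_values(&[2], &[3], &[true]), Ordering::Greater);
}
```

If the reviewer's concern holds and the reverse flags are not applied inside the comparison, the descending case above would return Less and the cursor would keep emitting out of order.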
Possible Issue

Line 887 removes null entries from the source map using source.values().removeIf(Objects::isNull). This modifies the values collection, which may not behave as expected for all Map implementations. If the map's values collection does not support removal, this throws UnsupportedOperationException. Even if it works for HashMap, the intent is unclear: removing null values means those fields are missing from the document, but the test also uses randomBoolean() ? null : ... to generate nulls, suggesting nulls should be indexed as missing fields. The removal is redundant if that's the intent, or incorrect if nulls should be explicitly indexed.

source.values().removeIf(java.util.Objects::isNull);

@github-actions
Contributor

github-actions Bot commented May 9, 2026

PR Code Suggestions ✨

Latest suggestions up to 2dad307

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Fix null handling in test documents

Removing null values from the source map after construction means documents with
missing fields are indexed differently than intended. The test aims to verify null
handling with nulls_first/nulls_last, but removing nulls prevents testing the actual
missing value behavior. Instead, omit null entries during map construction or use a
separate approach to represent missing fields.

sandbox/plugins/composite-engine/src/internalClusterTest/java/org/opensearch/composite/CompositeMergeIT.java [880-892]

-source.values().removeIf(java.util.Objects::isNull);
+// Build source map without null entries (missing fields)
+var source = new java.util.HashMap<String, Object>();
+Object ageVal = randomBoolean() ? null : randomIntBetween(0, 100);
+Object scoreVal = randomBoolean() ? null : randomLongBetween(0, 10000);
+Object tagVal = randomBoolean() ? null : randomAlphaOfLengthBetween(3, 6);
+if (ageVal != null) source.put("age", ageVal);
+if (scoreVal != null) source.put("score", scoreVal);
+if (tagVal != null) source.put("tag", tagVal);
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that source.values().removeIf(Objects::isNull) at line 887 removes null values after they're added, which defeats the purpose of testing null handling. The improved approach of conditionally adding non-null values is cleaner and more explicit about representing missing fields, which is important for testing nulls_first/nulls_last behavior.

Medium
Verify binary search invariant with assertion

The comment claims the binary search invariant guarantees cursor.current_value >
heap_top, but this is only true if the binary search correctly positions the cursor.
If run_end equals cursor.row_idx (no rows to emit), the cursor advances but may land
on a value equal to or less than heap_top, violating the invariant. Add a defensive
check to verify the invariant holds before unconditionally yielding.

sandbox/plugins/parquet-data-format/src/main/rust/src/merge/sorted.rs [254-262]

-// Binary search invariant guarantees cursor.current_value > heap_top
-// so we always yield here (no need for conditional check)
 let val = cursor.current_sort_values()?;
+// Verify binary search invariant: cursor value should be > heap_top
+debug_assert!(
+    cmp_sort_values(&val, heap_top, reverse_sorts) == Ordering::Greater,
+    "Binary search invariant violated: cursor value should be > heap_top"
+);
 heap.push(HeapItem {
     sort_values: val,
     file_id,
     reverse_sorts: Arc::clone(&reverse_sorts_arc),
 });
 break;
Suggestion importance[1-10]: 5


Why: Adding a debug_assert! to verify the binary search invariant is a reasonable defensive programming practice. However, the comment in the PR explicitly states this invariant is guaranteed by the binary search logic, and the extensive test suite (including test_tier2_yield_after_batch_boundary and related tests) validates this behavior. The assertion adds safety but isn't critical.

Low
Clarify equality handling in yield logic

The TIER 2 yield check after loading a new batch uses the same comparison logic as
TIER 3, but TIER 2 operates on full batches while TIER 3 uses binary search. This
creates a subtle correctness issue: when cursor.current_sort_values() equals
heap_top, TIER 2 continues draining (correct), but the comment suggests this is a
"yield" check. The logic should explicitly handle equality to ensure the cursor
doesn't incorrectly yield when values are equal.

sandbox/plugins/parquet-data-format/src/main/rust/src/merge/sorted.rs [199-208]

 // Check if cursor should yield after loading new batch
 let val = cursor.current_sort_values()?;
+// Only yield if strictly greater than heap_top (not equal)
 if cmp_sort_values(&val, heap_top, reverse_sorts) == Ordering::Greater {
     heap.push(HeapItem {
         sort_values: val,
         file_id,
         reverse_sorts: Arc::clone(&reverse_sorts_arc),
     });
     break;
 }
+// If equal or less, continue draining this cursor
Suggestion importance[1-10]: 3


Why: The suggestion adds a clarifying comment but doesn't change the logic. The existing code already correctly handles equality (continues draining when equal, yields only when Greater). The added comment provides minimal value since the code behavior is already correct and the comparison check is explicit.

Low

Previous suggestions

Suggestions up to commit a89dc54
Category | Suggestion | Impact
Possible issue
Avoid unsafe collection modification

Calling removeIf on values() modifies the underlying map's values collection, which
may throw UnsupportedOperationException depending on the map implementation.
Instead, create a new map with non-null entries or use entrySet().removeIf() to
safely remove null-valued entries.

sandbox/plugins/composite-engine/src/internalClusterTest/java/org/opensearch/composite/CompositeMergeIT.java [887]

-source.values().removeIf(java.util.Objects::isNull);
+source.entrySet().removeIf(entry -> entry.getValue() == null);
Suggestion importance[1-10]: 8


Why: The values().removeIf() call can throw UnsupportedOperationException on some map implementations. Using entrySet().removeIf() is safer and more explicit about modifying the map entries.

Medium
Suggestions up to commit d88b19d
Category | Suggestion | Impact
Possible issue
Fix unsafe map modification pattern

Calling removeIf on the values collection of a map can lead to
ConcurrentModificationException or undefined behavior, as it attempts to modify the
map through its values view. Instead, use entrySet().removeIf() with a predicate on
the value, or iterate and collect non-null entries into a new map.

sandbox/plugins/composite-engine/src/internalClusterTest/java/org/opensearch/composite/CompositeMergeIT.java [887]

-source.values().removeIf(java.util.Objects::isNull);
+source.entrySet().removeIf(entry -> entry.getValue() == null);
Suggestion importance[1-10]: 9


Why: This is a critical bug. Calling removeIf on values() view of a map can cause ConcurrentModificationException or undefined behavior. The suggested fix using entrySet().removeIf() is the correct approach to safely remove null-valued entries from the map.

High
General
Explicitly disable bloom filters when disabled

When bloom filter is disabled, the function returns the builder without explicitly
disabling bloom filters. If the builder has default bloom filter settings enabled,
they will remain active. Explicitly call set_bloom_filter_enabled(false) in the else
branch to ensure bloom filters are disabled when config.get_bloom_filter_enabled()
returns false.

sandbox/plugins/parquet-data-format/src/main/rust/src/writer_properties_builder.rs [136-145]

 fn apply_bloom_filter_settings(
     mut builder: parquet::file::properties::WriterPropertiesBuilder,
     config: &NativeSettings
 ) -> parquet::file::properties::WriterPropertiesBuilder {
     if config.get_bloom_filter_enabled() {
         builder = builder.set_bloom_filter_enabled(true);
         builder = builder.set_bloom_filter_fpp(config.get_bloom_filter_fpp());
         builder = builder.set_bloom_filter_ndv(config.get_bloom_filter_ndv());
+    } else {
+        builder = builder.set_bloom_filter_enabled(false);
     }
     builder
 }
Suggestion importance[1-10]: 7


Why: While the current code may work if the builder defaults to disabled, explicitly setting set_bloom_filter_enabled(false) in the else branch ensures clarity and prevents potential issues if builder defaults change. However, the test at line 242 verifies that bloom_filter_properties is None when disabled, suggesting the current implementation may already work correctly.

Medium

@github-actions
Contributor

github-actions Bot commented May 9, 2026

Persistent review updated to latest commit a89dc54

@github-actions
Contributor

github-actions Bot commented May 9, 2026

✅ Gradle check result for a89dc54: SUCCESS

@codecov

codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.46%. Comparing base (8f72a95) to head (2dad307).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21576      +/-   ##
============================================
- Coverage     73.48%   73.46%   -0.03%     
+ Complexity    74646    74615      -31     
============================================
  Files          5980     5980              
  Lines        338777   338777              
  Branches      48848    48848              
============================================
- Hits         248964   248896      -68     
- Misses        70026    70049      +23     
- Partials      19787    19832      +45     

☔ View full report in Codecov by Sentry.

Signed-off-by: Shailesh-Kumar-Singh <shaileshkumarsingh260@gmail.com>
@github-actions
Contributor

github-actions Bot commented May 9, 2026

Persistent review updated to latest commit 2dad307

@github-actions
Contributor

github-actions Bot commented May 9, 2026

✅ Gradle check result for 2dad307: SUCCESS
