Skip to content

Changes to handle session context handle close#21519

Merged
bharath-techie merged 2 commits into
opensearch-project:mainfrom
bharath-techie:close-ctx
May 7, 2026
Merged

Changes to handle session context handle close#21519
bharath-techie merged 2 commits into
opensearch-project:mainfrom
bharath-techie:close-ctx

Conversation

@bharath-techie
Copy link
Copy Markdown
Contributor

@bharath-techie bharath-techie commented May 6, 2026

Description

Session context lifecycle is as follows

  1. Create session context during shard scan
  2. Execute query - session context gets converted to rust native object during execute query

This PR handles 2 problems

  1. Any failures between 1 & 2 are handled by calling backendContext close
  2. The session context that gets absorbed in rust heap is marked as consumed in the ConsumedNativeHandle

Related Issues

Resolves TODO in #21479

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@bharath-techie bharath-techie requested a review from a team as a code owner May 6, 2026 18:13
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
📝 TODO sections

🔀 Multiple PR themes

Sub-PR theme: Native handle ownership transfer and consumed-flag mechanism

Relevant files:

  • sandbox/libs/analytics-framework/src/main/java/org/opensearch/analytics/backend/jni/ConsumableNativeHandle.java
  • sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/nativelib/SessionContextHandle.java
  • sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/nativelib/NativeBridge.java
  • sandbox/plugins/analytics-backend-datafusion/rust/src/ffm.rs
  • sandbox/plugins/analytics-backend-datafusion/rust/src/query_executor.rs
  • sandbox/plugins/analytics-backend-datafusion/src/test/java/org/opensearch/be/datafusion/DataFusionNativeBridgeTests.java

Sub-PR theme: BackendExecutionContext lifecycle and error-path cleanup in orchestrators

Relevant files:

  • sandbox/libs/analytics-framework/src/main/java/org/opensearch/analytics/spi/BackendExecutionContext.java
  • sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionSessionState.java
  • sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DatafusionContext.java
  • sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/AnalyticsSearchService.java
  • sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/stage/LocalStageScheduler.java

⚡ Recommended focus areas for review

Exception Swallowing

In executeWithContextAsync, if call.invoke throws an exception, markConsumed() is called in the finally block (correct), but then the outer catch (Throwable throwable) catches it and calls listener.onFailure(...). However, listener.onResponse(result) is placed outside the inner try-finally but still inside the outer try (var call = new NativeCall()) block. If listener.onResponse(result) itself throws, that exception will be caught by the outer catch and reported as a failure — which may be unexpected behavior. Additionally, result may be uninitialized if call.invoke throws, yet the code structure could allow reaching listener.onResponse(result) in some edge cases depending on compiler flow analysis.

try (var call = new NativeCall()) {
    var plan = call.bytes(substraitPlan);
    long planLen = (long) substraitPlan.length;
    long result;
    try {
        result = call.invoke(EXECUTE_WITH_CONTEXT, sessionCtxPtr, plan, planLen);
    } finally {
        // Rust took ownership via Box::from_raw; do not let doClose() double-free.
        sessionContext.markConsumed();
    }
    listener.onResponse(result);
} catch (Throwable throwable) {
    listener.onFailure(throwable instanceof Exception ? (Exception) throwable : new RuntimeException(throwable));
}
Resource Leak

In LocalStageScheduler, when a handler returns a new backendContext, the previous context is closed. However, the final backendContext (the last one produced) is always closed in the finally block. This means if the reduce path is ever extended to hand off backendContext to the sink, the unconditional close in finally would close a context that has already been transferred, potentially causing a double-free or use-after-free. The comment acknowledges this but the design is fragile.

} finally {
    // The reduce path does not currently hand backendContext off to the sink
    // provider — any resources attached by instruction handlers must be released
    // here. Close is idempotent so a future handoff can coexist with this call.
    if (backendContext != null) {
        try {
            backendContext.close();
        } catch (Exception closeFailure) {
            if (primaryFailure != null) {
                primaryFailure.addSuppressed(closeFailure);
            } else {
                primaryFailure = closeFailure;
            }
        }
    }
}
Panic Safety

Box::from_raw is called before get_rt_manager()?. If get_rt_manager() returns an error (via the ? operator), the session_handle is dropped (freeing the SessionContext), which is correct. However, if a panic occurs between Box::from_raw and the drop of session_handle, Rust's panic-unwind behavior may or may not drop session_handle depending on whether the FFI boundary uses catch_unwind. Since this is an extern "C" function, unwinding across FFI is undefined behavior. Consider wrapping the body in std::panic::catch_unwind to ensure safe cleanup.

let session_handle = *Box::from_raw(session_ctx_ptr as *mut crate::session_context::SessionContextHandle);

let mgr = get_rt_manager()?;
let plan_bytes = slice::from_raw_parts(plan_ptr, plan_len as usize);
let cpu_executor = mgr.cpu_executor();
mgr.io_runtime
    .block_on(crate::query_executor::execute_with_context(
        session_handle,
        plan_bytes,
        cpu_executor,
    ))
    .map_err(|e| e.to_string())
Double Close

DataFusionSessionState#close() closes the sessionContextHandle, and DatafusionContext#close() also closes the same sessionContextHandle. Since SessionContextHandle extends ConsumableNativeHandle which has idempotent close behavior, this should be safe — but it relies on the idempotency guarantee being correctly implemented throughout the inheritance chain. This should be explicitly validated in tests.

public void close() {
    if (sessionContextHandle != null) {
        sessionContextHandle.close();
    }
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Prevent double-invocation of listener on response callback failure

The markConsumed() call is in a finally block, which means it runs even when
call.invoke throws an exception. In that case, result is uninitialized and
listener.onResponse(result) is never reached, but listener.onFailure in the outer
catch block will be called — which is correct. However, if call.invoke throws, Rust
has already consumed the pointer via Box::from_raw, so markConsumed() is correct.
The real issue is that listener.onResponse(result) is placed outside the
try/finally but inside the try-with-resources for NativeCall, meaning if
onResponse itself throws, the outer catch (Throwable) will call listener.onFailure
after onResponse was already called. Move listener.onResponse(result) outside the
try-with-resources block to avoid double-calling the listener.

sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/nativelib/NativeBridge.java [765-778]

-try {
-    result = call.invoke(EXECUTE_WITH_CONTEXT, sessionCtxPtr, plan, planLen);
-} finally {
-    // Rust took ownership via Box::from_raw; do not let doClose() double-free.
-    sessionContext.markConsumed();
+long result;
+try (var call = new NativeCall()) {
+    var plan = call.bytes(substraitPlan);
+    long planLen = (long) substraitPlan.length;
+    try {
+        result = call.invoke(EXECUTE_WITH_CONTEXT, sessionCtxPtr, plan, planLen);
+    } finally {
+        // Rust took ownership via Box::from_raw; do not let doClose() double-free.
+        sessionContext.markConsumed();
+    }
+} catch (Throwable throwable) {
+    listener.onFailure(throwable instanceof Exception ? (Exception) throwable : new RuntimeException(throwable));
+    return;
 }
 listener.onResponse(result);
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that listener.onResponse(result) inside the try-with-resources block means if onResponse throws, the outer catch (Throwable) will call listener.onFailure after onResponse was already called, resulting in double-invocation. Moving onResponse outside the try-with-resources is a valid fix for this listener contract violation.

Medium
General
Suppress close exceptions to avoid masking original failures

If previous.close() throws an exception, it will propagate out of the for loop and
be caught by the outer catch (Throwable t), but the primaryFailure will be the close
exception rather than any original handler failure. Additionally, if handler.apply
throws after returning a new context, the new backendContext is never set, so the
previous context won't be closed in the finally block either. The previous.close()
call should be wrapped in a try-catch to suppress close failures and not mask the
original exception.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/stage/LocalStageScheduler.java [68-74]

 BackendExecutionContext previous = backendContext;
 backendContext = handler.apply(node, context, backendContext);
 // A handler that returns a new reference implicitly abandons the previous
 // context — close it now so its resources aren't orphaned.
 if (previous != null && previous != backendContext) {
-    previous.close();
+    try {
+        previous.close();
+    } catch (Exception closeEx) {
+        if (primaryFailure != null) {
+            primaryFailure.addSuppressed(closeEx);
+        } else {
+            primaryFailure = closeEx;
+        }
+    }
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion is valid — if previous.close() throws, it becomes the primaryFailure and masks any original handler exception. Wrapping it in a try-catch with suppression is a reasonable improvement, though the scenario (handler returning new context then close failing) is an edge case.

Low

@bharath-techie bharath-techie added skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. labels May 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

✅ Gradle check result for 5a4a561: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.48%. Comparing base (8d64f1d) to head (e011d3c).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21519      +/-   ##
============================================
+ Coverage     73.40%   73.48%   +0.07%     
- Complexity    74426    74488      +62     
============================================
  Files          5970     5970              
  Lines        338267   338294      +27     
  Branches      48753    48759       +6     
============================================
+ Hits         248316   248605     +289     
+ Misses        70160    69896     -264     
- Partials      19791    19793       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

❌ Gradle check result for e011d3c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

❌ Gradle check result for e011d3c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

✅ Gradle check result for e011d3c: SUCCESS

@bharath-techie bharath-techie merged commit b2fa9f8 into opensearch-project:main May 7, 2026
22 of 26 checks passed
public interface BackendExecutionContext {}
public interface BackendExecutionContext extends AutoCloseable {
@Override
default void close() throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to avoid bigger changes ? Ideally we should not give a default for close to FORCE child classes to implement and then they no-op on their end.

imRishN pushed a commit to imRishN/OpenSearch that referenced this pull request May 8, 2026
)

* changes to handle session context handle close

Signed-off-by: bharath-techie <bharath78910@gmail.com>

* fixing sandbox check

Signed-off-by: bharath-techie <bharath78910@gmail.com>

---------

Signed-off-by: bharath-techie <bharath78910@gmail.com>
vishwasgarg18 pushed a commit to vishwasgarg18/OpenSearch that referenced this pull request May 8, 2026
)

* changes to handle session context handle close

Signed-off-by: bharath-techie <bharath78910@gmail.com>

* fixing sandbox check

Signed-off-by: bharath-techie <bharath78910@gmail.com>

---------

Signed-off-by: bharath-techie <bharath78910@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants