Skip to content

Fix: Prevent hangs on transport crashes by racing tool calls against session task#4904

Closed
JamesDuncanNz wants to merge 2 commits intogoogle:mainfrom
JamesDuncanNz:main
Closed

Fix: Prevent hangs on transport crashes by racing tool calls against session task#4904
JamesDuncanNz wants to merge 2 commits intogoogle:mainfrom
JamesDuncanNz:main

Conversation

@JamesDuncanNz
Copy link
Copy Markdown

Race tool call coroutines against the background session task so that transport crashes (e.g. non-2xx HTTP responses) surface immediately instead of hanging until sse_read_timeout expires.

  • Add run_guarded() method to SessionContext
  • Add is_task_alive property for clean cross-class API
  • Let ConnectionError propagate to retry_on_errors for session recovery
  • Enhance _is_session_disconnected to detect dead background tasks
  • Add 6 new test cases

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

4901
4162

2. Or, if no issue exists, describe the change:

Problem:

When an MCP transport crashes (e.g. the remote server returns a non-2xx HTTP response), the session.call_tool() coroutine hangs indefinitely until sse_read_timeout expires. The background session task has already died with the transport error, but nothing notices until the timeout fires. This creates a poor user experience with long, unexplained hangs.

Solution:

To provide immediate feedback to the user upon transport failures, this change races the tool call coroutine against the background session task using asyncio.wait(return_when=FIRST_COMPLETED).

If the background task dies first (indicating a transport crash), run_guarded() cancels the tool call coroutine and raises a ConnectionError wrapping the original transport exception. This error propagates to retry_on_errors, which triggers session recovery by creating a fresh session. Exceptions originating from the tool call coroutine itself are propagated unwrapped.

Key changes:

  • SessionContext.run_guarded() — races a coroutine against the background task; if the task dies first, cancels the coroutine and raises ConnectionError wrapping the original exception.
  • SessionContext.is_task_alive — property that returns True only when the background task is running, used by _is_session_disconnected to detect dead sessions.
  • MCPSessionManager._is_session_disconnected() — now accepts an optional SessionContext and checks is_task_alive in addition to stream state.
  • MCPSessionManager.get_session_context() — new method to retrieve the SessionContext for a given session, used by MCPTool to call run_guarded().
  • MCPTool._call_tool_async() — wraps the call_tool coroutine with run_guarded() when a SessionContext is available.

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

6 new test cases added:

Test File Description
test_run_guarded_normal_completion test_session_context.py Coroutine completes normally and returns its result
test_run_guarded_background_task_crash test_session_context.py Background task crash raises ConnectionError with original exception as cause
test_run_guarded_task_already_dead test_session_context.py Raises immediately when task is already done before run_guarded is called
test_run_guarded_cancels_coroutine_on_crash test_session_context.py Verifies the hung coroutine is cancelled when the background task crashes
test_run_guarded_coroutine_raises test_session_context.py Coroutine exceptions propagate unwrapped (not wrapped in ConnectionError)
test_run_guarded_no_task test_session_context.py Raises ConnectionError when task has not been started

Additional tests updated in test_mcp_session_manager.py:

  • TestGetSessionContext — 2 tests for the new get_session_context() method
  • TestIsSessionDisconnectedWithContext — 3 tests for the enhanced _is_session_disconnected() with SessionContext
  • Existing tests updated to include SessionContext in session pool tuples
pytest tests/unittests/tools/mcp_tool/ -v

Manual End-to-End (E2E) Tests:

Tested against a local MCP server configured to return HTTP 403 after initial connection. Before this change, call_tool hung for the full sse_read_timeout duration. After this change, the ConnectionError surfaces within milliseconds and retry_on_errors triggers a fresh session.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

The SessionContext is now stored as the 4th element in the session pool tuple (_sessions dict), changing the tuple from 3 elements to 4. All existing tests have been updated to reflect this.

JamesDuncanNz and others added 2 commits March 19, 2026 20:45
Race tool call coroutines against the background session task so that
transport crashes (e.g. non-2xx HTTP responses) surface immediately
instead of hanging until sse_read_timeout expires.

- Add run_guarded() method to SessionContext
- Add is_task_alive property for clean cross-class API
- Let ConnectionError propagate to retry_on_errors for session recovery
- Enhance _is_session_disconnected to detect dead background tasks
- Add 6 new test cases
@adk-bot adk-bot added the mcp [Component] Issues about MCP support label Mar 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mcp [Component] Issues about MCP support

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants