Fix: Prevent hangs on transport crashes by racing tool calls against session task#4904
Closed
JamesDuncanNz wants to merge 2 commits intogoogle:mainfrom
Closed
Fix: Prevent hangs on transport crashes by racing tool calls against session task#4904JamesDuncanNz wants to merge 2 commits intogoogle:mainfrom
JamesDuncanNz wants to merge 2 commits intogoogle:mainfrom
Conversation
Race tool call coroutines against the background session task so that transport crashes (e.g. non-2xx HTTP responses) surface immediately instead of hanging until sse_read_timeout expires. - Add run_guarded() method to SessionContext - Add is_task_alive property for clean cross-class API - Let ConnectionError propagate to retry_on_errors for session recovery - Enhance _is_session_disconnected to detect dead background tasks - Add 6 new test cases
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Race tool call coroutines against the background session task so that transport crashes (e.g. non-2xx HTTP responses) surface immediately instead of hanging until sse_read_timeout expires.
run_guarded()method toSessionContextis_task_aliveproperty for clean cross-class APIConnectionErrorpropagate toretry_on_errorsfor session recovery_is_session_disconnectedto detect dead background tasksPlease ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
4901
4162
2. Or, if no issue exists, describe the change:
Problem:
When an MCP transport crashes (e.g. the remote server returns a non-2xx HTTP response), the
session.call_tool()coroutine hangs indefinitely untilsse_read_timeoutexpires. The background session task has already died with the transport error, but nothing notices until the timeout fires. This creates a poor user experience with long, unexplained hangs.Solution:
To provide immediate feedback to the user upon transport failures, this change races the tool call coroutine against the background session task using
asyncio.wait(return_when=FIRST_COMPLETED).If the background task dies first (indicating a transport crash),
run_guarded()cancels the tool call coroutine and raises aConnectionErrorwrapping the original transport exception. This error propagates toretry_on_errors, which triggers session recovery by creating a fresh session. Exceptions originating from the tool call coroutine itself are propagated unwrapped.Key changes:
SessionContext.run_guarded()— races a coroutine against the background task; if the task dies first, cancels the coroutine and raisesConnectionErrorwrapping the original exception.SessionContext.is_task_alive— property that returnsTrueonly when the background task is running, used by_is_session_disconnectedto detect dead sessions.MCPSessionManager._is_session_disconnected()— now accepts an optionalSessionContextand checksis_task_alivein addition to stream state.MCPSessionManager.get_session_context()— new method to retrieve theSessionContextfor a given session, used byMCPToolto callrun_guarded().MCPTool._call_tool_async()— wraps thecall_toolcoroutine withrun_guarded()when aSessionContextis available.Testing Plan
Unit Tests:
6 new test cases added:
test_run_guarded_normal_completiontest_session_context.pytest_run_guarded_background_task_crashtest_session_context.pyConnectionErrorwith original exception as causetest_run_guarded_task_already_deadtest_session_context.pyrun_guardedis calledtest_run_guarded_cancels_coroutine_on_crashtest_session_context.pytest_run_guarded_coroutine_raisestest_session_context.pyConnectionError)test_run_guarded_no_tasktest_session_context.pyConnectionErrorwhen task has not been startedAdditional tests updated in
test_mcp_session_manager.py:TestGetSessionContext— 2 tests for the newget_session_context()methodTestIsSessionDisconnectedWithContext— 3 tests for the enhanced_is_session_disconnected()withSessionContextSessionContextin session pool tuplesManual End-to-End (E2E) Tests:
Tested against a local MCP server configured to return HTTP 403 after initial connection. Before this change,
call_toolhung for the fullsse_read_timeoutduration. After this change, theConnectionErrorsurfaces within milliseconds andretry_on_errorstriggers a fresh session.Checklist
Additional context
The
SessionContextis now stored as the 4th element in the session pool tuple (_sessionsdict), changing the tuple from 3 elements to 4. All existing tests have been updated to reflect this.