Resolve MCP session deadlock with thread-safe per-transport locking #114

Merged
dhilloulinoracle merged 3 commits into oracle:main from
gotsysdba:112_deadlock
Mar 20, 2026

@gotsysdba
Member

Fixes #112

Fixes a deadlock where get_or_create_session() called from async methods (MCPTool.run_async, MCPToolBox._get_tools_inner_async) blocked on the portal's own event loop, and a TOCTOU race condition where concurrent callers could create duplicate sessions for the same transport.

Changes:

  • tools.py: Offload get_or_create_session() to a worker thread via anyio.to_thread.run_sync in both MCPTool.run_async and MCPToolBox._get_tools_inner_async, so the event loop is never blocked
  • _session_persistence.py: Replace the direct _create_long_lived_session call with a per-transport double-checked locking pattern:
    • Fast-path check before any lock (existing behavior)
    • Global _lock held briefly only to get/create a per-transport threading.Lock
    • Transport lock serializes same-transport session creation (prevents duplicates) while allowing different transports to proceed in parallel
    • Re-check under transport lock before creating

@gotsysdba gotsysdba requested a review from a team March 9, 2026 11:50
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Mar 9, 2026
@dhilloulinoracle
Contributor

Thank you @gotsysdba for the PR!

@paul-cayet will review it

Member

@sonleoracle sonleoracle left a comment

Could you add regression coverage for the behavior this PR is changing?

For instance:

  • Concurrent cold session creation for the same transport/conversation: A good test would call AsyncRuntime.get_or_create_session() from two threads at the same time and assert that only one session is created and both callers get the same session back.

  • Async cold session creation should not block the event loop: Since this PR moves get_or_create_session() behind to_thread.run_sync(...) in MCPTool.run_async() and MCPToolBox.get_tools_async(), it would be good to add tests that schedule a small heartbeat task alongside those calls and assert the heartbeat still runs while session creation is in progress.

You could add them to wayflowcore/tests/mcptools/test_mcp_tools.py near the existing session-persistence tests around line 260.

@gotsysdba
Member Author

@sonleoracle, added tests:

  1. test_concurrent_cold_session_creation_returns_same_session: Submits two concurrent get_or_create_session() calls for the same transport via a ThreadPoolExecutor and asserts that both callers get back the exact same session object (an identity check with is).
  2. test_async_session_creation_does_not_block_event_loop: Runs a heartbeat coroutine alongside MCPToolBox.get_tools_async() and asserts that the heartbeat ticked multiple times, showing that the event loop was not blocked during session creation.
  3. test_async_mcp_tool_run_does_not_block_event_loop: Same heartbeat pattern, but exercises MCPTool.run_async() to validate the to_thread.run_sync() change.
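
For readers unfamiliar with the heartbeat pattern used in tests 2 and 3, here is a minimal sketch. It is not the test code from this PR: blocking_create_session, heartbeat, and create_with_heartbeat are hypothetical names, and the standard library's asyncio.to_thread is used as a stand-in for the anyio.to_thread.run_sync call that the PR itself uses.

```python
import asyncio
import time


def blocking_create_session() -> str:
    """Hypothetical stand-in for a synchronous, slow get_or_create_session()."""
    time.sleep(0.2)
    return "session"


async def heartbeat(ticks: list[float], stop: asyncio.Event) -> None:
    """Record a timestamp every 10 ms for as long as the event loop is free."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def create_with_heartbeat() -> tuple[str, int]:
    ticks: list[float] = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))

    # Offload the blocking call to a worker thread so the loop keeps running.
    # (Stand-in for anyio.to_thread.run_sync, which the PR uses.)
    session = await asyncio.to_thread(blocking_create_session)

    stop.set()
    await beat
    return session, len(ticks)
```

If blocking_create_session() were called directly inside the coroutine instead of through a worker thread, the heartbeat would record at most one tick, because the event loop could not schedule it while the call was sleeping.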

================================================================= 3 passed in 2.32s =================================================================
% LLAMA_API_URL=x OSS_API_URL=x LLAMA70BV33_API_URL=x OLLAMA8BV32_API_URL=x OCI_REASONING_MODEL=x GEMMA_API_URL=x COMPARTMENT_ID=x \
INSTANCE_PRINCIPAL_ENDPOINT_BASE_URL=x E5largev2_EMBEDDING_API_URL=x OLLAMA_EMBEDDING_API_URL=x \
pytest wayflowcore/tests/mcptools/test_mcp_tools.py::test_concurrent_cold_session_creation_returns_same_session \
wayflowcore/tests/mcptools/test_mcp_tools.py::test_async_session_creation_does_not_block_event_loop \
wayflowcore/tests/mcptools/test_mcp_tools.py::test_async_mcp_tool_run_does_not_block_event_loop -v
================================================================ test session starts ================================================================
platform darwin -- Python 3.14.3, pytest-7.4.4, pluggy-1.6.0
cachedir: .pytest_cache
configfile: setup.cfg
plugins: anyio-4.11.0, xdist-3.8.0
collected 3 items                                                                                                                                   

wayflowcore/tests/mcptools/test_mcp_tools.py::test_concurrent_cold_session_creation_returns_same_session PASSED                               [ 33%]
wayflowcore/tests/mcptools/test_mcp_tools.py::test_async_session_creation_does_not_block_event_loop PASSED                                    [ 66%]
wayflowcore/tests/mcptools/test_mcp_tools.py::test_async_mcp_tool_run_does_not_block_event_loop PASSED                                        [100%]

================================================================= 3 passed in 2.37s =================================================================

@dhilloulinoracle
Contributor

Internal regression failed: Build ID #748

@dhilloulinoracle
Contributor

Internal regression succeeded 🍏: Build ID #749

@dhilloulinoracle
Contributor

Merge Gate INPROGRESS

@dhilloulinoracle
Contributor

Internal regression succeeded 🍏: Build ID #1540

@dhilloulinoracle dhilloulinoracle merged commit 961db8b into oracle:main Mar 20, 2026
2 of 3 checks passed
@dhilloulinoracle
Contributor

Merge Gate succeeded 🍏: Build ID #2280

Development

Successfully merging this pull request may close these issues.

[bug]: deadlock on self-hosted MCP servers during first-call