fix(auth): use asyncio.to_thread for credential resolution in signing hook#270
fix(auth): use asyncio.to_thread for credential resolution in signing hook#270nix-tkobayashi wants to merge 4 commits into
Conversation
… hook `_sign_request_hook` is an async httpx event hook, but it calls `session_holder.session.get_credentials()` and `session_holder.refresh_if_needed()` synchronously. For profiles that use assumed IAM roles (chained credentials), `get_credentials()` triggers a blocking STS `AssumeRole` call that stalls the event loop and causes connection timeouts (~60 s). Add `SessionHolder.async_get_credentials()` and `SessionHolder.async_refresh_if_needed()` which delegate to `asyncio.to_thread`, keeping the event loop responsive. Update `_sign_request_hook` to call the async variants. Fixes aws#176 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback: - Clarify that get_credentials() is the primary blocking path; async_refresh_if_needed() is a defensive measure - Replace timeout-based non-blocking test with ticker coroutine that proves the event loop stays responsive during credential resolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Did you manage to reproduce the issue that was reported? |
|
@arnewouters
The key to reproducing is that the credential resolution chain must trigger an actual STS
If credentials are already resolved and cached (e.g., from a prior request in the same session), property access returns immediately and you won't see the blocking behavior. With static IAM user credentials or SSO profiles where tokens are pre-cached, it also returns instantly. The ~60s timeout occurs because the synchronous STS call (typically 1-3s) runs on the asyncio event loop, preventing the MCP handshake from completing within the client's timeout window. Happy to help with any additional details or testing needed. |
|
I'm struggling to understand what the blocked event loop actually starves in practice:
I'd like to understand the actual failure mode and see a reproducible example before merging any fixes. |
|
Thanks for the detailed review — these are fair points and I want to address them precisely. You're right that a single fast STS call alone does not explain a 60s timeout. The narrower claim of this PR is:
Let me walk through the mechanism and then show a deterministic reproduction. The mechanismIn stdio proxy mode, the proxy runs a single asyncio event loop that handles both stdio (client ↔ proxy) and HTTP (proxy ↔ AWS). When the client sends
The client's timeout clock is ticking from step 1 to step 5. The event-loop block at step 4 consumes part of that budget. How much depends entirely on how long For static IAM keys, Deterministic reproductionThis can be reproduced without any AWS dependency by injecting a blocking delay: import asyncio, time, unittest.mock as mock
import httpx
from mcp_proxy_for_aws.sigv4_helper import _sign_request_hook, SessionHolder
async def test_event_loop_blocked():
"""Proves that a slow get_credentials() blocks the entire event loop."""
tick_count = 0
async def ticker():
nonlocal tick_count
while True:
await asyncio.sleep(0.05)
tick_count += 1
# Simulate a 3-second credential resolution
slow_session = mock.MagicMock()
slow_session.get_credentials.side_effect = lambda: time.sleep(3) or mock.MagicMock()
holder = SessionHolder(slow_session)
fake_request = httpx.Request("POST", "https://example.com", content=b'{}')
task = asyncio.create_task(ticker())
await _sign_request_hook("us-east-1", "execute-api", holder, fake_request)
task.cancel()
# Without fix: tick_count == 0 (event loop was frozen for 3s)
# With asyncio.to_thread fix: tick_count >= 50
print(f"Ticks during credential resolution: {tick_count}")
asyncio.run(test_event_loop_blocked())With the current code, If the blocking duration exceeds the client's remaining timeout budget, the connection fails. You can verify this end-to-end by setting a short client timeout (e.g., 2s) and injecting a 3s blocking Why this matters in practiceI encountered this in a Lambda-based setup using cross-account AssumeRole, where I acknowledge I cannot definitively prove that the original issue #176 reporter's 60s timeout was caused solely by this — there may be additional factors in their environment. But the event-loop blocking is a real correctness issue regardless of whether it fully explains their specific case. Happy to add the reproduction script above to the PR as an integration test if that would help. |
krishmakochhar
left a comment
There was a problem hiding this comment.
LGTM.
Two suggestions for a follow-up:
-
Thread safety on refresh: If
async_refresh_if_needed()is ever called concurrently (e.g., shared holder, pipelined requests), two threads could race onself.sessionreplacement. Anasyncio.Lockwith a double-check pattern would close this cheaply. Not urgent given current single-holder-per-connection usage. -
PR description: Consider softening "Fixes #176" to "Partially addresses #176" — the event-loop blocking is a real correctness issue, but as you acknowledged, it may not be the sole cause of the original reporter's 60s timeout.
|
Thank you for the review and the approval! Good call on both suggestions:
|
Summary
Changes
_sign_request_hookis an async httpx event hook, but it callssession_holder.session.get_credentials()andsession_holder.refresh_if_needed()synchronously. For AWS profiles that use assumed IAM roles (chained credentials),get_credentials()triggers a blocking STSAssumeRolecall that stalls the asyncio event loop, adding directly to client-observed latency.This PR:
SessionHolder.async_get_credentials()andSessionHolder.async_refresh_if_needed()which delegate blocking botocore calls toasyncio.to_thread()_sign_request_hookto use the async variantsUser experience
Before: Using an AWS profile backed by an assumed IAM role (chained credentials) blocks the event loop during credential resolution, adding blocking latency that contributes to connection timeouts:
After: The connection succeeds normally. The STS
AssumeRolecall runs in a worker thread, keeping the event loop responsive.Checklist
Is this a breaking change? (Y/N)
Please add details about how this change was tested.
Acknowledgment
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Partially addresses #176