fix(auth): use asyncio.to_thread for credential resolution in signing hook by nix-tkobayashi · Pull Request #270 · aws/mcp-proxy-for-aws

nix-tkobayashi · 2026-05-08T10:29:33Z

Summary

Changes

_sign_request_hook is an async httpx event hook, but it calls session_holder.session.get_credentials() and session_holder.refresh_if_needed() synchronously. For AWS profiles that use assumed IAM roles (chained credentials), get_credentials() triggers a blocking STS AssumeRole call that stalls the asyncio event loop, adding directly to client-observed latency.

This PR:

Adds SessionHolder.async_get_credentials() and SessionHolder.async_refresh_if_needed() which delegate blocking botocore calls to asyncio.to_thread()
Updates _sign_request_hook to use the async variants
Adds unit tests including a ticker-coroutine test that proves the event loop stays responsive during credential resolution

Note: A similar synchronous get_credentials() call exists in client.py:106 (aws_iam_streamablehttp_client). That code path is separate from the proxy signing hook and is not addressed in this PR.

User experience

Before: Using an AWS profile backed by an assumed IAM role (chained credentials) blocks the event loop during credential resolution, adding blocking latency that contributes to connection timeouts:

[error] Error connecting to MCP server: MCP error -32001: Request timed out

After: The connection succeeds normally. The STS AssumeRole call runs in a worker thread, keeping the event loop responsive.

Checklist

I have reviewed the contributing guidelines
I have performed a self-review of this change
Changes have been tested
Changes are documented

Is this a breaking change? (Y/N)

Yes
No

Please add details about how this change was tested.

All 183 unit tests pass with 96% coverage (80% required)
Pre-commit hooks pass
Ruff linting and formatting pass
Pyright passes (no new errors)
Verified in production with Lambda-based MCP client using cross-account AssumeRole
Integration tests not run locally (requires AWS OIDC credentials)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Partially addresses #176

… hook `_sign_request_hook` is an async httpx event hook, but it calls `session_holder.session.get_credentials()` and `session_holder.refresh_if_needed()` synchronously. For profiles that use assumed IAM roles (chained credentials), `get_credentials()` triggers a blocking STS `AssumeRole` call that stalls the event loop and causes connection timeouts (~60 s). Add `SessionHolder.async_get_credentials()` and `SessionHolder.async_refresh_if_needed()` which delegate to `asyncio.to_thread`, keeping the event loop responsive. Update `_sign_request_hook` to call the async variants. Fixes aws#176 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address review feedback: - Clarify that get_credentials() is the primary blocking path; async_refresh_if_needed() is a defensive measure - Replace timeout-based non-blocking test with ticker coroutine that proves the event loop stays responsive during credential resolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

arnewouters · 2026-05-19T06:52:50Z

Did you manage to reproduce the issue that was reported?

nix-tkobayashi · 2026-05-19T11:56:49Z

@arnewouters
Yes, I reproduced it with a cross-account AssumeRole profile. My setup:

MCP client: Lambda-based (custom)
Profile config: ~/.aws/config with role_arn + source_profile (chained credentials)
MFA: Not enabled for this role

The key to reproducing is that the credential resolution chain must trigger an actual STS AssumeRole call. For profiles with role_arn, botocore's AssumeRoleProvider returns a DeferredRefreshableCredentials object — the STS call is deferred until the credentials are first accessed (e.g., during SigV4 signing when .access_key / .secret_key / .token properties are read). This only blocks when:

The profile uses role_arn (chained credentials), and
The credentials haven't been resolved yet or have expired

If credentials are already resolved and cached (e.g., from a prior request in the same session), property access returns immediately and you won't see the blocking behavior. With static IAM user credentials or SSO profiles where tokens are pre-cached, it also returns instantly.

The ~60s timeout occurs because the synchronous STS call (typically 1-3s) runs on the asyncio event loop, preventing the MCP handshake from completing within the client's timeout window.

Happy to help with any additional details or testing needed.

arnewouters · 2026-05-20T05:43:53Z

I'm struggling to understand what the blocked event loop actually starves in practice:

The signing hook runs before the request is sent, so nothing is waiting on a response yet
The MCP client timeout is measured on the client side (separate process), not on the proxy's event loop
A 1-3s STS call shouldn't exhaust a 60s client timeout

I'd like to understand the actual failure mode and see a reproducible example before merging any fixes.

nix-tkobayashi · 2026-05-20T07:51:15Z

Thanks for the detailed review — these are fair points and I want to address them precisely.

You're right that a single fast STS call alone does not explain a 60s timeout. The narrower claim of this PR is:

Calling synchronous credential resolution (get_credentials() → STS AssumeRole) from the proxy's asyncio event loop blocks all coroutines on that loop, and that blocking time adds directly to the client's timeout budget.

Let me walk through the mechanism and then show a deterministic reproduction.

The mechanism

In stdio proxy mode, the proxy runs a single asyncio event loop that handles both stdio (client ↔ proxy) and HTTP (proxy ↔ AWS). When the client sends initialize:

Proxy reads the message from stdin
Proxy prepares the HTTP request to AWS
_sign_request_hook runs — calls get_credentials() synchronously
Event loop is frozen for the duration of the STS call
Request is sent, AWS responds, proxy writes back to stdout

The client's timeout clock is ticking from step 1 to step 5. The event-loop block at step 4 consumes part of that budget. How much depends entirely on how long get_credentials() takes — which varies by credential provider chain, network conditions, and retry configuration.

For static IAM keys, get_credentials() returns instantly (no network call) — no issue. For AssumeRole profiles, it makes a synchronous STS API call whose duration is environment-dependent and unbounded.

Deterministic reproduction

This can be reproduced without any AWS dependency by injecting a blocking delay:

import asyncio, time, unittest.mock as mock
import httpx
from mcp_proxy_for_aws.sigv4_helper import _sign_request_hook, SessionHolder

async def test_event_loop_blocked():
    """Proves that a slow get_credentials() blocks the entire event loop."""
    tick_count = 0

    async def ticker():
        nonlocal tick_count
        while True:
            await asyncio.sleep(0.05)
            tick_count += 1

    # Simulate a 3-second credential resolution
    slow_session = mock.MagicMock()
    slow_session.get_credentials.side_effect = lambda: time.sleep(3) or mock.MagicMock()
    holder = SessionHolder(slow_session)

    fake_request = httpx.Request("POST", "https://example.com", content=b'{}')

    task = asyncio.create_task(ticker())
    await _sign_request_hook("us-east-1", "execute-api", holder, fake_request)
    task.cancel()

    # Without fix: tick_count == 0 (event loop was frozen for 3s)
    # With asyncio.to_thread fix: tick_count >= 50
    print(f"Ticks during credential resolution: {tick_count}")

asyncio.run(test_event_loop_blocked())

With the current code, tick_count is 0 — the event loop was completely frozen. With asyncio.to_thread(), it runs normally. This is essentially what test_sign_request_hook_does_not_block_event_loop in the PR validates.

If the blocking duration exceeds the client's remaining timeout budget, the connection fails. You can verify this end-to-end by setting a short client timeout (e.g., 2s) and injecting a 3s blocking get_credentials().

Why this matters in practice

I encountered this in a Lambda-based setup using cross-account AssumeRole, where get_credentials() involves chained STS calls that can take seconds to tens of seconds depending on provider chain, retries, and network conditions. The asyncio.to_thread() fix resolved the issue without changing behavior on fast credential paths.

I acknowledge I cannot definitively prove that the original issue #176 reporter's 60s timeout was caused solely by this — there may be additional factors in their environment. But the event-loop blocking is a real correctness issue regardless of whether it fully explains their specific case.

Happy to add the reproduction script above to the PR as an integration test if that would help.

krishmakochhar

LGTM.

Two suggestions for a follow-up:

Thread safety on refresh: If async_refresh_if_needed() is ever called concurrently (e.g., shared holder, pipelined requests), two threads could race on self.session replacement. An asyncio.Lock with a double-check pattern would close this cheaply. Not urgent given current single-holder-per-connection usage.
PR description: Consider softening "Fixes #176" to "Partially addresses #176" — the event-loop blocking is a real correctness issue, but as you acknowledged, it may not be the sole cause of the original reporter's 60s timeout.

nix-tkobayashi · 2026-05-20T12:09:46Z

Thank you for the review and the approval!

Good call on both suggestions:

Thread safety: Agreed — an asyncio.Lock with double-check on async_refresh_if_needed() would close the race cheaply. I'll open a follow-up PR for this.
PR description: Updated to "Partially addresses Support for assumed IAM role (chained credentials) in MCP proxy #176" and softened the timeout causation language throughout.

nix-tkobayashi requested a review from a team as a code owner May 8, 2026 10:29

nix-tkobayashi requested review from acmlau and harv-aws May 8, 2026 10:29

nix-tkobayashi mentioned this pull request May 8, 2026

Support for assumed IAM role (chained credentials) in MCP proxy #176

Open

2 tasks

nix-tkobayashi and others added 3 commits May 8, 2026 19:35

Merge branch 'main' into fix/async-credential-resolution

65c2ba3

Merge branch 'main' into fix/async-credential-resolution

a1aa047

krishmakochhar approved these changes May 20, 2026

View reviewed changes

nix-tkobayashi mentioned this pull request May 20, 2026

fix: add asyncio.Lock to prevent concurrent refresh races #286

Open

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(auth): use asyncio.to_thread for credential resolution in signing hook#270

fix(auth): use asyncio.to_thread for credential resolution in signing hook#270
nix-tkobayashi wants to merge 4 commits into
aws:mainfrom
nix-tkobayashi:fix/async-credential-resolution

nix-tkobayashi commented May 8, 2026 •

edited

Loading

Uh oh!

arnewouters commented May 19, 2026

Uh oh!

nix-tkobayashi commented May 19, 2026

Uh oh!

arnewouters commented May 20, 2026

Uh oh!

nix-tkobayashi commented May 20, 2026

Uh oh!

krishmakochhar left a comment

Uh oh!

nix-tkobayashi commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nix-tkobayashi commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

User experience

Checklist

Acknowledgment

Uh oh!

arnewouters commented May 19, 2026

Uh oh!

nix-tkobayashi commented May 19, 2026

Uh oh!

arnewouters commented May 20, 2026

Uh oh!

nix-tkobayashi commented May 20, 2026

The mechanism

Deterministic reproduction

Why this matters in practice

Uh oh!

krishmakochhar left a comment

Choose a reason for hiding this comment

Uh oh!

nix-tkobayashi commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nix-tkobayashi commented May 8, 2026 •

edited

Loading