Skip to content

fix(auth): use asyncio.to_thread for credential resolution in signing hook#270

Open
nix-tkobayashi wants to merge 4 commits into
aws:mainfrom
nix-tkobayashi:fix/async-credential-resolution
Open

fix(auth): use asyncio.to_thread for credential resolution in signing hook#270
nix-tkobayashi wants to merge 4 commits into
aws:mainfrom
nix-tkobayashi:fix/async-credential-resolution

Conversation

@nix-tkobayashi
Copy link
Copy Markdown

@nix-tkobayashi nix-tkobayashi commented May 8, 2026

Summary

Changes

_sign_request_hook is an async httpx event hook, but it calls session_holder.session.get_credentials() and session_holder.refresh_if_needed() synchronously. For AWS profiles that use assumed IAM roles (chained credentials), get_credentials() triggers a blocking STS AssumeRole call that stalls the asyncio event loop, adding directly to client-observed latency.

This PR:

  • Adds SessionHolder.async_get_credentials() and SessionHolder.async_refresh_if_needed() which delegate blocking botocore calls to asyncio.to_thread()
  • Updates _sign_request_hook to use the async variants
  • Adds unit tests including a ticker-coroutine test that proves the event loop stays responsive during credential resolution

Note: A similar synchronous get_credentials() call exists in client.py:106 (aws_iam_streamablehttp_client). That code path is separate from the proxy signing hook and is not addressed in this PR.

User experience

Before: Using an AWS profile backed by an assumed IAM role (chained credentials) blocks the event loop during credential resolution, adding blocking latency that contributes to connection timeouts:

[error] Error connecting to MCP server: MCP error -32001: Request timed out

After: The connection succeeds normally. The STS AssumeRole call runs in a worker thread, keeping the event loop responsive.

Checklist

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Is this a breaking change? (Y/N)

  • Yes
  • No

Please add details about how this change was tested.

  • All 183 unit tests pass with 96% coverage (80% required)
  • Pre-commit hooks pass
  • Ruff linting and formatting pass
  • Pyright passes (no new errors)
  • Verified in production with Lambda-based MCP client using cross-account AssumeRole
  • Integration tests not run locally (requires AWS OIDC credentials)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Partially addresses #176

… hook

`_sign_request_hook` is an async httpx event hook, but it calls
`session_holder.session.get_credentials()` and
`session_holder.refresh_if_needed()` synchronously. For profiles that
use assumed IAM roles (chained credentials), `get_credentials()` triggers
a blocking STS `AssumeRole` call that stalls the event loop and causes
connection timeouts (~60 s).

Add `SessionHolder.async_get_credentials()` and
`SessionHolder.async_refresh_if_needed()` which delegate to
`asyncio.to_thread`, keeping the event loop responsive. Update
`_sign_request_hook` to call the async variants.

Fixes aws#176

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nix-tkobayashi and others added 3 commits May 8, 2026 19:35
Address review feedback:
- Clarify that get_credentials() is the primary blocking path;
  async_refresh_if_needed() is a defensive measure
- Replace timeout-based non-blocking test with ticker coroutine that
  proves the event loop stays responsive during credential resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@arnewouters
Copy link
Copy Markdown
Contributor

Did you manage to reproduce the issue that was reported?

@nix-tkobayashi
Copy link
Copy Markdown
Author

@arnewouters
Yes, I reproduced it with a cross-account AssumeRole profile. My setup:

  • MCP client: Lambda-based (custom)
  • Profile config: ~/.aws/config with role_arn + source_profile (chained credentials)
  • MFA: Not enabled for this role

The key to reproducing is that the credential resolution chain must trigger an actual STS AssumeRole call. For profiles with role_arn, botocore's AssumeRoleProvider returns a DeferredRefreshableCredentials object — the STS call is deferred until the credentials are first accessed (e.g., during SigV4 signing when .access_key / .secret_key / .token properties are read). This only blocks when:

  1. The profile uses role_arn (chained credentials), and
  2. The credentials haven't been resolved yet or have expired

If credentials are already resolved and cached (e.g., from a prior request in the same session), property access returns immediately and you won't see the blocking behavior. With static IAM user credentials or SSO profiles where tokens are pre-cached, it also returns instantly.

The ~60s timeout occurs because the synchronous STS call (typically 1-3s) runs on the asyncio event loop, preventing the MCP handshake from completing within the client's timeout window.

Happy to help with any additional details or testing needed.

@arnewouters
Copy link
Copy Markdown
Contributor

I'm struggling to understand what the blocked event loop actually starves in practice:

  1. The signing hook runs before the request is sent, so nothing is waiting on a response yet
  2. The MCP client timeout is measured on the client side (separate process), not on the proxy's event loop
  3. A 1-3s STS call shouldn't exhaust a 60s client timeout

I'd like to understand the actual failure mode and see a reproducible example before merging any fixes.

@nix-tkobayashi
Copy link
Copy Markdown
Author

Thanks for the detailed review — these are fair points and I want to address them precisely.

You're right that a single fast STS call alone does not explain a 60s timeout. The narrower claim of this PR is:

Calling synchronous credential resolution (get_credentials() → STS AssumeRole) from the proxy's asyncio event loop blocks all coroutines on that loop, and that blocking time adds directly to the client's timeout budget.

Let me walk through the mechanism and then show a deterministic reproduction.

The mechanism

In stdio proxy mode, the proxy runs a single asyncio event loop that handles both stdio (client ↔ proxy) and HTTP (proxy ↔ AWS). When the client sends initialize:

  1. Proxy reads the message from stdin
  2. Proxy prepares the HTTP request to AWS
  3. _sign_request_hook runs — calls get_credentials() synchronously
  4. Event loop is frozen for the duration of the STS call
  5. Request is sent, AWS responds, proxy writes back to stdout

The client's timeout clock is ticking from step 1 to step 5. The event-loop block at step 4 consumes part of that budget. How much depends entirely on how long get_credentials() takes — which varies by credential provider chain, network conditions, and retry configuration.

For static IAM keys, get_credentials() returns instantly (no network call) — no issue. For AssumeRole profiles, it makes a synchronous STS API call whose duration is environment-dependent and unbounded.

Deterministic reproduction

This can be reproduced without any AWS dependency by injecting a blocking delay:

import asyncio, time, unittest.mock as mock
import httpx
from mcp_proxy_for_aws.sigv4_helper import _sign_request_hook, SessionHolder

async def test_event_loop_blocked():
    """Proves that a slow get_credentials() blocks the entire event loop."""
    tick_count = 0

    async def ticker():
        nonlocal tick_count
        while True:
            await asyncio.sleep(0.05)
            tick_count += 1

    # Simulate a 3-second credential resolution
    slow_session = mock.MagicMock()
    slow_session.get_credentials.side_effect = lambda: time.sleep(3) or mock.MagicMock()
    holder = SessionHolder(slow_session)

    fake_request = httpx.Request("POST", "https://example.com", content=b'{}')

    task = asyncio.create_task(ticker())
    await _sign_request_hook("us-east-1", "execute-api", holder, fake_request)
    task.cancel()

    # Without fix: tick_count == 0 (event loop was frozen for 3s)
    # With asyncio.to_thread fix: tick_count >= 50
    print(f"Ticks during credential resolution: {tick_count}")

asyncio.run(test_event_loop_blocked())

With the current code, tick_count is 0 — the event loop was completely frozen. With asyncio.to_thread(), it runs normally. This is essentially what test_sign_request_hook_does_not_block_event_loop in the PR validates.

If the blocking duration exceeds the client's remaining timeout budget, the connection fails. You can verify this end-to-end by setting a short client timeout (e.g., 2s) and injecting a 3s blocking get_credentials().

Why this matters in practice

I encountered this in a Lambda-based setup using cross-account AssumeRole, where get_credentials() involves chained STS calls that can take seconds to tens of seconds depending on provider chain, retries, and network conditions. The asyncio.to_thread() fix resolved the issue without changing behavior on fast credential paths.

I acknowledge I cannot definitively prove that the original issue #176 reporter's 60s timeout was caused solely by this — there may be additional factors in their environment. But the event-loop blocking is a real correctness issue regardless of whether it fully explains their specific case.

Happy to add the reproduction script above to the PR as an integration test if that would help.

Copy link
Copy Markdown

@krishmakochhar krishmakochhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Two suggestions for a follow-up:

  1. Thread safety on refresh: If async_refresh_if_needed() is ever called concurrently (e.g., shared holder, pipelined requests), two threads could race on self.session replacement. An asyncio.Lock with a double-check pattern would close this cheaply. Not urgent given current single-holder-per-connection usage.

  2. PR description: Consider softening "Fixes #176" to "Partially addresses #176" — the event-loop blocking is a real correctness issue, but as you acknowledged, it may not be the sole cause of the original reporter's 60s timeout.

@nix-tkobayashi
Copy link
Copy Markdown
Author

Thank you for the review and the approval!

Good call on both suggestions:

  1. Thread safety: Agreed — an asyncio.Lock with double-check on async_refresh_if_needed() would close the race cheaply. I'll open a follow-up PR for this.
  2. PR description: Updated to "Partially addresses Support for assumed IAM role (chained credentials) in MCP proxy #176" and softened the timeout causation language throughout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants