Skip to content

fix: add asyncio.Lock to prevent concurrent refresh races#286

Open
nix-tkobayashi wants to merge 6 commits into
aws:mainfrom
nix-tkobayashi:fix/async-refresh-lock
Open

fix: add asyncio.Lock to prevent concurrent refresh races#286
nix-tkobayashi wants to merge 6 commits into
aws:mainfrom
nix-tkobayashi:fix/async-refresh-lock

Conversation

@nix-tkobayashi
Copy link
Copy Markdown

@nix-tkobayashi nix-tkobayashi commented May 20, 2026

Summary

Changes

Follow-up to #270 (per review feedback).

SessionHolder.async_refresh_if_needed() can be called concurrently by multiple in-flight requests when the session is marked for refresh. Without synchronization, all callers race to create a new boto3 session simultaneously — wasting STS calls and risking inconsistent state.

This PR:

  • Adds an asyncio.Lock to SessionHolder with a double-check pattern: the outer _needs_refresh check avoids lock overhead on the fast path; the inner check ensures only one caller actually performs the refresh
  • Shields the to_thread call from task cancellation so that a cancelled caller does not release the lock while the worker thread is still running (prevents duplicate refresh on cancellation)
  • Adds two unit tests:
    • Concurrent-callers test using threading.Event barriers to assert create_aws_session is called exactly once
    • Cancellation test that verifies the lock is held until the refresh completes, even when the awaiting task is cancelled

User experience

No user-visible behavior change. This is a correctness hardening for the credential refresh path introduced in #270.

Checklist

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Is this a breaking change? (Y/N)

  • Yes
  • No

Please add details about how this change was tested.

  • All 185 unit tests pass with 96% coverage (80% required)
  • Pre-commit hooks pass
  • Ruff linting and formatting pass
  • Pyright passes (no new errors)
  • Concurrent test uses threading.Event barriers to assert single refresh
  • Cancellation test verified to FAIL without shield and PASS with shield
  • Integration tests not run locally (requires AWS OIDC credentials)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

nix-tkobayashi and others added 5 commits May 8, 2026 19:29
… hook

`_sign_request_hook` is an async httpx event hook, but it calls
`session_holder.session.get_credentials()` and
`session_holder.refresh_if_needed()` synchronously. For profiles that
use assumed IAM roles (chained credentials), `get_credentials()` triggers
a blocking STS `AssumeRole` call that stalls the event loop and causes
connection timeouts (~60 s).

Add `SessionHolder.async_get_credentials()` and
`SessionHolder.async_refresh_if_needed()` which delegate to
`asyncio.to_thread`, keeping the event loop responsive. Update
`_sign_request_hook` to call the async variants.

Fixes aws#176

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback:
- Clarify that get_credentials() is the primary blocking path;
  async_refresh_if_needed() is a defensive measure
- Replace timeout-based non-blocking test with ticker coroutine that
  proves the event loop stays responsive during credential resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevent concurrent refresh races when multiple in-flight requests
trigger credential refresh simultaneously. Uses a double-check
pattern: the outer check avoids lock overhead on the fast path,
the inner check ensures only one caller performs the refresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nix-tkobayashi nix-tkobayashi requested a review from a team as a code owner May 20, 2026 12:30
When the task holding _refresh_lock is cancelled during
asyncio.to_thread(), the lock would be released while the worker
thread is still running.  A second caller could then acquire the
lock, see _needs_refresh == True, and start a duplicate refresh.

Shield the to_thread call so that cancellation waits for the
refresh to complete before releasing the lock and re-raising
CancelledError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant