fix(sts): bound the AssumeRoleWithWebIdentity call with a request timeout by alukach · Pull Request #172 · source-cooperative/data.source.coop

alukach · 2026-06-29T22:10:06Z

Problem

Loading a private product occasionally fails with an unparseable contents-load error, then self-heals after a few reloads. The app-side symptom is the AWS S3 SDK throwing:

Error: char 'e' is not expected.:1:1
Deserialization error: ... '$metadata': [Object]

Root cause

Private products federate to AWS STS on the cold path: the OIDC backend-auth middleware POSTs AssumeRoleWithWebIdentity over the shared reqwest client, which is built with Client::new() — no timeout (src/lib.rs).

If that STS exchange stalls, the whole Worker request hangs until the Cloudflare edge terminates it and returns a plaintext, non-XML body of the form `error code: NNNN`. The caller's AWS S3 SDK tries to parse that body as XML and fails on the first byte (char 'e' is not expected.:1:1). Nothing in the proxy/multistore Rust code emits a body that starts with "e": every proxy response is XML prefixed with <?xml …>, so the plaintext body originates at the edge, not in our code.
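To make the parse failure concrete, here is a minimal sketch of how a caller could tell the two body shapes apart. `BodyKind`, `looks_like_xml`, and `classify_body` are hypothetical names for illustration only; they are not part of this repo or of the AWS SDK, and the `NNNN` placeholder is kept as-is.

```rust
/// Hypothetical classification of a response body, for illustration only.
#[derive(Debug, PartialEq)]
enum BodyKind {
    /// Starts like an XML document, which is what an S3 client expects.
    Xml,
    /// Plaintext of the form "error code: NNNN", as returned by the edge.
    EdgePlaintext,
    /// Anything else (including an empty body).
    Other,
}

/// True when the first non-whitespace character could begin an XML document.
fn looks_like_xml(body: &str) -> bool {
    body.trim_start().starts_with('<')
}

/// Classify a body by its leading bytes only; no real XML parsing happens here.
fn classify_body(body: &str) -> BodyKind {
    if looks_like_xml(body) {
        BodyKind::Xml
    } else if body.starts_with("error code:") {
        BodyKind::EdgePlaintext
    } else {
        BodyKind::Other
    }
}
```

An XML parser that sees `e` where it expects `<` fails at position 1:1, which matches the SDK message quoted above.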

Why it self-heals + why only private products: the OIDC provider caches credentials across requests in a warm isolate, so STS only runs on a cold isolate / cache miss — exactly when it's slowest. Subsequent reloads hit a warm isolate with cached creds and skip STS entirely. Public/unlisted products list anonymously and never take this path.
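The warm/cold behavior can be sketched as a small cache that only calls STS on a miss or after expiry. This is an assumption-laden illustration: `Credentials`, `CredentialCache`, and `get_or_fetch` are hypothetical, and the real OIDC provider's cache may be structured quite differently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for temporary STS credentials.
#[derive(Clone, Debug, PartialEq)]
struct Credentials {
    access_key_id: String,
    expires_at: Instant,
}

/// Hypothetical per-isolate cache; a cold isolate starts with `cached: None`.
struct CredentialCache {
    cached: Option<Credentials>,
    /// Counts how often the (slow) STS exchange was actually performed.
    sts_calls: u32,
}

impl CredentialCache {
    fn new() -> Self {
        Self { cached: None, sts_calls: 0 }
    }

    /// Return cached credentials while they are valid; otherwise run `fetch`
    /// (the STS exchange in the real system) and remember the result.
    fn get_or_fetch<F>(&mut self, now: Instant, fetch: F) -> Credentials
    where
        F: FnOnce() -> Credentials,
    {
        if let Some(creds) = &self.cached {
            if creds.expires_at > now {
                return creds.clone(); // warm path: STS is skipped entirely
            }
        }
        self.sts_calls += 1; // cold path or expired credentials
        let fresh = fetch();
        self.cached = Some(fresh.clone());
        fresh
    }
}
```

Only the first request on a cold isolate pays for `fetch`, which is why a reload usually succeeds once any request has populated the cache.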

Fix

Bound the STS POST with a 10s per-request timeout. reqwest's wasm/fetch backend honors RequestBuilder::timeout() via AbortController, so this works on the Workers runtime (the wasm ClientBuilder has no .timeout(), hence per-request).
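The contract of a bounded call (return an error after a deadline rather than hang) can be illustrated with the standard library alone. This sketch deliberately does not use reqwest or the Workers runtime: it swaps in a thread and a channel as a stand-in, and `call_with_timeout` and `CallError` are hypothetical names. Unlike an AbortController-based abort, the stalled work here is not cancelled; the caller just stops waiting for it.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum CallError {
    /// The operation did not finish within the limit.
    TimedOut,
    /// The worker thread ended without sending a result (e.g. it panicked).
    WorkerGone,
}

/// Run `op` on a helper thread and wait at most `limit` for its result.
fn call_with_timeout<T, F>(limit: Duration, op: F) -> Result<T, CallError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the receiver is gone; ignore that.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(CallError::TimedOut),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(CallError::WorkerGone),
    }
}
```

The point of the sketch is the shape of the result: a stall becomes an ordinary `Err` value the caller can map to a response, instead of an unbounded wait.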

On a stall the call now returns OidcProviderError::HttpError → ProxyError::BackendError → a proper 503 ServiceUnavailable S3 XML error the client can parse and retry, instead of an opaque edge-timeout body. STS normally answers in well under a second, so the bound only trips on genuine stalls.
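The error chain above can be sketched as follows. The enum names `OidcProviderError::HttpError` and `ProxyError::BackendError` come from this PR, but their payloads, the `From` impl, `S3ErrorResponse`, `render`, and `escape_xml` are assumptions made for illustration; the repo's actual types and XML rendering may differ.

```rust
/// Assumed shape: the real variant may carry a different payload.
#[derive(Debug)]
enum OidcProviderError {
    HttpError(String),
}

/// Assumed shape: the real enum has more variants.
#[derive(Debug)]
enum ProxyError {
    BackendError(String),
}

impl From<OidcProviderError> for ProxyError {
    fn from(err: OidcProviderError) -> Self {
        match err {
            OidcProviderError::HttpError(msg) => ProxyError::BackendError(msg),
        }
    }
}

/// Hypothetical response holder: HTTP status plus an S3-style XML error body.
struct S3ErrorResponse {
    status: u16,
    body: String,
}

/// Escape the characters that would break the XML text content.
fn escape_xml(s: &str) -> String {
    s.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

/// Render a backend failure as a 503 ServiceUnavailable S3 XML error.
fn render(err: &ProxyError) -> S3ErrorResponse {
    match err {
        ProxyError::BackendError(msg) => S3ErrorResponse {
            status: 503,
            body: format!(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>ServiceUnavailable</Code><Message>{}</Message></Error>",
                escape_xml(msg)
            ),
        },
    }
}
```

Because the body starts with `<?xml`, an S3 client can deserialize it into a typed error and apply its normal retry policy for 503 responses.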

Scope / caveats

Covers only the STS exchange — the one outbound call this repo builds directly. The Source API connection-resolve and the backend S3 call go through other clients (SourceCoopRegistry / object_store) and aren't bounded here; if stalls are observed there too, they'd need separate timeouts.
This is a mitigation that converts an unparseable failure into a clean, retryable S3 error. The complementary app-side change (degrade a contents-listing failure to an inline "try again" notice while keeping product metadata + the Edit link) is handled in source.coop.
Timeout is a fixed const; trivially promotable to an env var (à la STS_MAX_SESSION_DURATION) if a deployment needs to tune it.
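If the timeout were promoted to configuration, one possible shape is a parser that falls back to the fixed const. This is only a sketch: `sts_timeout_from` is a hypothetical helper, and it takes the raw value as an `Option<&str>` because how a deployment would expose the setting (variable name, binding lookup) is not specified in this PR.

```rust
use std::time::Duration;

/// Default bound for the STS POST, matching the 10s value described in this PR.
const STS_REQUEST_TIMEOUT: Duration = Duration::from_secs(10);

/// Parse an optional override given in whole seconds.
/// Missing, non-numeric, or zero values fall back to the default.
fn sts_timeout_from(raw: Option<&str>) -> Duration {
    raw.and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|secs| *secs > 0)
        .map(Duration::from_secs)
        .unwrap_or(STS_REQUEST_TIMEOUT)
}
```

Rejecting zero avoids a configuration typo turning every STS call into an immediate timeout.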

Verification

cargo fmt --check, cargo clippy --target wasm32-unknown-unknown -- -D warnings, cargo check --target wasm32-unknown-unknown, and cargo test all pass locally.

🤖 Generated with Claude Code

…eout

Private products federate to AWS STS on the cold path: the OIDC backend-auth middleware POSTs AssumeRoleWithWebIdentity over the shared reqwest client, which was built with `Client::new()` and therefore has no timeout. If that exchange stalls, the whole Worker request hangs until the Cloudflare edge kills it and returns a non-XML `error code: NNNN` plaintext body. The caller's AWS S3 SDK then fails to parse it ("char 'e' is not expected.:1:1"), surfacing as an opaque contents-load failure that self-heals after a few reloads.

It self-heals because the OIDC provider caches credentials across requests in a warm isolate, so STS only runs on a cold isolate / cache miss, exactly when it's slow. Public products list anonymously and never hit this path, which is why only private products are affected.

Bound the STS POST with a 10s per-request timeout (reqwest's wasm backend honors it via AbortController). On a stall the call now returns OidcProviderError::HttpError → ProxyError::BackendError → a proper 503 ServiceUnavailable XML error the client can parse and retry, instead of an unparseable edge timeout body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

claude · 2026-06-29T22:10:19Z

Claude finished @alukach's task in 23s.

✅ No blocking issues — safe to merge.

src/lib.rs:74 — The // ponytail: prefix on the inline comment is an unusual convention that adds noise without being a standard tag (not TODO, FIXME, NOTE, etc.). Minor — linter won't catch it, but worth normalizing to // NOTE: or just removing the prefix. Not blocking.
src/lib.rs:65–73 — The doc-comment on the const is unusually long for a constant (8 lines explaining the failure mode). The PR description already captures this context. Keeping a short one-liner like /// Per-request timeout for the STS AssumeRoleWithWebIdentity POST. would be sufficient. Not blocking.
The core change (.timeout(STS_REQUEST_TIMEOUT) at src/lib.rs:94) is correct: reqwest's wasm/fetch backend honors per-request timeouts via AbortController, the 10 s value is reasonable for an STS call that normally completes in well under a second, and the error propagation path (HttpError → BackendError → 503) is already in place. No issues with the fix itself.

github-actions · 2026-06-29T22:11:26Z

🚀 Latest commit deployed to https://source-data-proxy-pr-172.source-coop.workers.dev

Date: 2026-06-29T22:10:06Z
Commit: 979424b

alukach temporarily deployed to preview June 29, 2026 22:10 — with GitHub Actions Inactive

source-release-bot Bot added the fix label Jun 29, 2026

alukach wants to merge 1 commit into main from fix/sts-request-timeout
