fix(sts): bound the AssumeRoleWithWebIdentity call with a request timeout #172
alukach wants to merge 1 commit
…eout
Private products federate to AWS STS on the cold path: the OIDC backend-auth
middleware POSTs AssumeRoleWithWebIdentity over the shared reqwest client, which
was built with `Client::new()` — no timeout. If that exchange stalls, the whole
Worker request hangs until the Cloudflare edge kills it and returns a non-XML
`error code: NNNN` plaintext body. The caller's AWS S3 SDK then fails to parse
it ("char 'e' is not expected.:1:1"), surfacing as an opaque contents-load
failure that self-heals after a few reloads.
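To make the failure mode concrete, here is a minimal sketch of how a caller could tell an S3-style XML body from the edge's plaintext body before handing it to an XML parser. `BodyKind` and `classify_body` are hypothetical names used only for illustration; they are not part of the proxy or the AWS SDK.

```rust
/// Hypothetical classification of a response body (illustration only).
#[derive(Debug, PartialEq)]
enum BodyKind {
    /// Looks like XML: the first non-whitespace character is '<'.
    Xml,
    /// Plaintext of the form `error code: NNNN`, as returned by the edge.
    EdgeError(String),
    /// Anything else.
    Unknown,
}

/// Inspect the start of a body and decide what kind of payload it is.
fn classify_body(body: &str) -> BodyKind {
    let trimmed = body.trim();
    if trimmed.starts_with('<') {
        return BodyKind::Xml;
    }
    if let Some(code) = trimmed.strip_prefix("error code:") {
        let code = code.trim();
        // Only accept a purely numeric code after the prefix.
        if !code.is_empty() && code.chars().all(|c| c.is_ascii_digit()) {
            return BodyKind::EdgeError(code.to_string());
        }
    }
    BodyKind::Unknown
}
```

An XML parser fed the edge body fails on the very first character, the `e` of `error`, which matches the `1:1` position in the SDK's message.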
It self-heals because the OIDC provider caches credentials across requests in a
warm isolate, so STS only runs on a cold isolate / cache miss — exactly when
it's slow. Public products list anonymously and never hit this path, which is
why only private products are affected.
Bound the STS POST with a 10s per-request timeout (reqwest's wasm backend honors
it via AbortController). On a stall the call now returns
OidcProviderError::HttpError → ProxyError::BackendError → a proper 503
ServiceUnavailable XML error the client can parse and retry, instead of an
unparseable edge timeout body.
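The error chain above can be sketched roughly as follows. The variant names `OidcProviderError::HttpError` and `ProxyError::BackendError` come from this description; the `String` payloads and the helpers `to_s3_error` and `xml_escape` are assumptions made for illustration and may not match the actual code.

```rust
/// Sketch of the error chain. Variant names come from the description;
/// payload types and helper names are assumptions.
#[derive(Debug)]
enum OidcProviderError {
    /// Transport-level failure, e.g. the STS POST hit its timeout.
    HttpError(String),
}

#[derive(Debug)]
enum ProxyError {
    BackendError(String),
}

impl From<OidcProviderError> for ProxyError {
    fn from(err: OidcProviderError) -> Self {
        match err {
            OidcProviderError::HttpError(msg) => ProxyError::BackendError(msg),
        }
    }
}

/// Hypothetical helper: escape the characters that would break XML text.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

/// Hypothetical helper: render a ProxyError as (HTTP status, S3-style XML body).
fn to_s3_error(err: &ProxyError) -> (u16, String) {
    match err {
        ProxyError::BackendError(msg) => (
            503,
            format!(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>ServiceUnavailable</Code><Message>{}</Message></Error>",
                xml_escape(msg)
            ),
        ),
    }
}
```

Because the body starts with `<?xml`, an S3 client can parse it as a regular service error and treat the 503 as retryable, which is the point of failing inside the proxy rather than at the edge.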
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude finished @alukach's task in 23s. ✅ No blocking issues; safe to merge.
🚀 Latest commit deployed to https://source-data-proxy-pr-172.source-coop.workers.dev
Problem
Loading a private product occasionally fails with an unparseable contents-load error, then self-heals after a few reloads. The app-side symptom is the AWS S3 SDK throwing `char 'e' is not expected.:1:1`.
Root cause
Private products federate to AWS STS on the cold path: the OIDC backend-auth middleware POSTs `AssumeRoleWithWebIdentity` over the shared `reqwest` client, which is built with `Client::new()`, that is, with no timeout (`src/lib.rs`).
If that STS exchange stalls, the whole Worker request hangs until the Cloudflare edge terminates it and returns a non-XML `error code: NNNN` plaintext body. The caller's AWS S3 SDK tries to parse that as XML and chokes on byte 1 (`char 'e' is not expected.:1:1`). Nothing in the proxy/multistore Rust emits an `e`-leading body (every proxy response is `<?xml …>`-prefixed XML), so the `e` body originates at the edge, not in our code.
Why it self-heals + why only private products: the OIDC provider caches credentials across requests in a warm isolate, so STS only runs on a cold isolate / cache miss, which is exactly when it's slowest. Subsequent reloads hit a warm isolate with cached creds and skip STS entirely. Public/unlisted products list anonymously and never take this path.
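The warm-versus-cold behaviour can be illustrated with a small cache sketch. This is not the provider's actual implementation: `Credentials`, `CredentialCache`, `get_or_exchange`, and `exchanges_for_two_requests` are hypothetical names, and the closure stands in for the STS call.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for temporary STS credentials.
#[derive(Clone)]
struct Credentials {
    access_key_id: String,
    expires_at: Instant,
}

/// Per-isolate cache: empty on a cold isolate, populated once warm.
#[derive(Default)]
struct CredentialCache {
    cached: Option<Credentials>,
}

impl CredentialCache {
    /// Return cached credentials if still valid; otherwise run `exchange`
    /// (the slow STS call in the real system) and cache its result.
    fn get_or_exchange<F>(&mut self, now: Instant, exchange: F) -> Credentials
    where
        F: FnOnce() -> Credentials,
    {
        if let Some(creds) = &self.cached {
            if creds.expires_at > now {
                return creds.clone();
            }
        }
        let fresh = exchange();
        self.cached = Some(fresh.clone());
        fresh
    }
}

/// Simulate two requests on one isolate and count how many reach STS.
fn exchanges_for_two_requests() -> u32 {
    let mut cache = CredentialCache::default();
    let mut calls = 0;
    let now = Instant::now();
    for _ in 0..2 {
        cache.get_or_exchange(now, || {
            calls += 1;
            Credentials {
                access_key_id: "EXAMPLE".to_string(),
                expires_at: now + Duration::from_secs(3600),
            }
        });
    }
    calls
}
```

In this sketch only the first request pays for the exchange; the second is served from the cache, which mirrors why a reload after a failed cold start tends to succeed.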
Fix
Bound the STS POST with a 10s per-request timeout. reqwest's wasm/`fetch` backend honors `RequestBuilder::timeout()` via `AbortController`, so this works on the Workers runtime (the wasm `ClientBuilder` has no `.timeout()`, hence per-request).
On a stall the call now returns `OidcProviderError::HttpError` → `ProxyError::BackendError` → a proper `503 ServiceUnavailable` S3 XML error the client can parse and retry, instead of an opaque edge-timeout body. STS normally answers in well under a second, so the bound only trips on genuine stalls.
Scope / caveats
- … SourceCoopRegistry/object_store) and aren't bounded here; if stalls are observed there too, they'd need separate timeouts.
- … source.coop.
- … STS_MAX_SESSION_DURATION) if a deployment needs to tune it.
Verification
`cargo fmt --check`, `cargo clippy --target wasm32-unknown-unknown -- -D warnings`, `cargo check --target wasm32-unknown-unknown`, and `cargo test` all pass locally.
🤖 Generated with Claude Code