
HA-safe OAuth state + Stateless StreamableHTTP transport (closes #113)#114

Merged: BorisTyshkevich merged 2 commits into main from feature/ha-jwe-state on May 15, 2026

BorisTyshkevich commented May 14, 2026

Closes #113.

Makes altinity-mcp safe to run with replicas >= 2 behind a non-sticky load balancer by removing the two pieces of per-pod in-memory state that broke under HA: the OAuth pending-auth / auth-code store, and the streamable HTTP session table.

Full reasoning, the alternatives I rejected, and the explicit single-use trade-off are in #113. This description sticks to what changed and how to verify.

Two commits

9f16fd3 — stateless OAuth state via JWE

cmd/altinity-mcp/oauth_server.go:

  • New HKDF labels altinity-mcp/oauth/pending-auth/v1 and altinity-mcp/oauth/auth-code/v1.
  • oauthPendingAuth and oauthIssuedCode now round-trip through encodeOAuthJWE / decodeOAuthJWE (the same helpers and jwe_auth.DeriveKey HKDF primitive already used for the stateless DCR client_id / client_secret / refresh-token paths).
  • oauthStateStore (the two in-memory maps, mutex, eviction, 10k cap) is deleted.
  • randomToken is removed (no callers).
  • Four new application methods: encodePendingAuth, decodePendingAuth, encodeAuthCode, decodeAuthCode.

cmd/altinity-mcp/main.go:

  • application.oauthState, oauthStateMu, and getOAuthStateStore removed.

pkg/jwe_auth/jwe_auth.go:

  • JWE claim whitelist gains resource and upstream_pkce_verifier. Required so decodeOAuthJWE accepts the new claim keys.

058f43d — Stateless StreamableHTTP transport

cmd/altinity-mcp/main.go:

  • Both mcp.NewStreamableHTTPHandler call sites now pass &mcp.StreamableHTTPOptions{Stateless: true}. The transport stops issuing and validating per-pod Mcp-Session-Ids; each request is self-contained.
  • Trade-off: no server-initiated requests (sampling, roots/list, log notifications outside an active request). altinity-mcp only handles client-initiated tool calls today; see #113 for the reasoning behind accepting this.

Test changes

Added TestOAuthStateJWERoundTrip in cmd/altinity-mcp/oauth_server_test.go with subtests:

  • pending_auth_round_trip — encode + decode preserves every field
  • auth_code_round_trip
  • cross_pod_portable_with_shared_secret — two application instances sharing only the signing secret; token minted by one decodes on the other
  • cross_pod_rejected_with_different_secret
  • expired_auth_code_rejected, expired_pending_auth_rejected
  • tampered_token_rejected — single-byte flip in the ciphertext
  • decode_missing_secret_fails_cleanly

Removed: TestOAuthStateStoreSizeCap, TestOAuthStateStore, TestOAuthStateStoreEviction — the in-memory store no longer exists.

Untouched canaries (still pass): TestOAuthForwardModeBrowserLoginUsesUpstreamBearerToken, TestOAuthForwardModeNoRefreshToken, TestOAuthE2EWithMockOIDC, the negative-path tests at lines 1603–1753.

Hard requirements after merge

  • MCP_OAUTH_SIGNING_SECRET must be a shared k8s Secret across replicas. All production deployments already source from <deployment>-mcp-signing-secret.
  • Mixed-version rollouts work only if both versions implement the JWE / Stateless changes.

Replace the per-pod oauthStateStore (in-memory maps for pending-auth
and issued auth codes) with stateless JWE tokens, using the existing
encodeOAuthJWE/decodeOAuthJWE + HKDF infrastructure already proven for
DCR client_id and refresh tokens.

Why: in forward mode altinity-mcp is the OAuth AS — /.well-known points
clients at MCP's own /authorize and /token, so with replicas>=2 and no
sticky sessions the legs of the OAuth dance land on different pods and
the in-memory state lookup fails (~75% of the time). Encoding the state
into the Google `state` parameter and the MCP auth `code` makes any
replica with the shared signing_secret able to decrypt either side.
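
One way to arrive at the ~75% figure, as a sketch under assumptions the commit message does not spell out: two replicas, a load balancer that picks pods uniformly at random, and three legs (authorize, upstream callback, token) where each later leg must reach the pod that stored the previous leg's state. `failureProbability` is a hypothetical helper.

```go
package main

import "math"

// failureProbability returns the chance that at least one follow-up leg of a
// flow misses the pod holding the in-memory state, assuming each request is
// routed uniformly at random across `replicas` pods.
func failureProbability(replicas, legs int) float64 {
	if replicas < 1 || legs < 2 {
		return 0
	}
	// Each of the legs-1 follow-up requests hits the right pod with
	// probability 1/replicas, independently of the others.
	success := math.Pow(1/float64(replicas), float64(legs-1))
	return 1 - success
}
```

Under those assumptions, `failureProbability(2, 3)` evaluates to 0.75, and the failure rate only grows as replicas are added.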

Single-use enforcement on auth codes is intentionally not done server-
side: codes are bound to the client's PKCE verifier (RFC 7636) and live
60s, so replay within the TTL is limited to whoever holds the verifier.
Trading strict RFC 6749 §4.1.2 single-use for zero shared state.
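
For context, the PKCE binding this trade-off relies on is the S256 check from RFC 7636 section 4.6: the token endpoint recomputes the challenge from the presented verifier. A minimal sketch follows; `verifyPKCE` is a hypothetical helper name, not a function from this repository.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
)

// verifyPKCE reports whether verifier matches an S256 code_challenge, i.e.
// challenge == BASE64URL(SHA256(verifier)) without padding.
func verifyPKCE(verifier, challenge string) bool {
	sum := sha256.Sum256([]byte(verifier))
	want := base64.RawURLEncoding.EncodeToString(sum[:])
	// Constant-time comparison avoids leaking how many bytes matched.
	return subtle.ConstantTimeCompare([]byte(want), []byte(challenge)) == 1
}
```

An attacker who captures only the auth code cannot redeem it, because the verifier never travels with the code; it is sent by the client on the token request.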

New HKDF labels:
  altinity-mcp/oauth/pending-auth/v1
  altinity-mcp/oauth/auth-code/v1

Whitelist additions in jwe_auth: resource, upstream_pkce_verifier.

Removed: oauthStateStore, its mutex, eviction logic, maxOAuthStateEntries,
randomToken, application.oauthState/oauthStateMu fields,
getOAuthStateStore. Replaced TestOAuthStateStore*/TestOAuthStateStoreEviction
with TestOAuthStateJWERoundTrip covering round-trip, cross-pod
portability, mismatched-secret rejection, expiry, tamper, and missing
secret.

Affects forward-mode deployments only (antalya, billing, otel-google).
Gating-mode (otel, github via Auth0 CIMD) was already HA-safe — Auth0
owns the OAuth surface there.

NewStreamableHTTPHandler defaults to session-tracked mode where each
pod issues and validates its own Mcp-Session-Id. Under replicas>=2 with
non-sticky load balancing, the MCP `initialize` call lands on whichever
pod the LB picks, the client picks ONE returned session-id, and any
subsequent tool call that lands on the OTHER pod is rejected with code
32600 "Session terminated".
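
The failure mode can be reproduced with a toy model of two replicas. This is not the SDK's implementation: `pod`, `initialize`, and `callTool` are hypothetical names, and the model only captures the per-pod session table.

```go
package main

import (
	"errors"
	"strconv"
)

// pod is a toy model of one replica and its in-memory session table.
type pod struct {
	name      string
	stateless bool
	sessions  map[string]bool
	next      int
}

func newPod(name string, stateless bool) *pod {
	return &pod{name: name, stateless: stateless, sessions: map[string]bool{}}
}

// initialize mimics the MCP initialize call: a session-tracked pod records
// and returns a new session id, a stateless pod issues none.
func (p *pod) initialize() string {
	if p.stateless {
		return ""
	}
	p.next++
	id := p.name + "-" + strconv.Itoa(p.next)
	p.sessions[id] = true
	return id
}

// callTool rejects session ids this pod did not issue, unless it is stateless.
func (p *pod) callTool(sessionID string) error {
	if p.stateless || p.sessions[sessionID] {
		return nil
	}
	return errors.New("unknown session on " + p.name)
}
```

With two session-tracked pods, an id issued by one is unknown to the other, so any request the load balancer routes to the second pod fails; with stateless pods there is no id to look up.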

Switch both NewStreamableHTTPHandler call sites to Stateless: true.
Each request becomes self-contained, no per-pod session table required.
Trade-off: server-initiated requests (sampling, roots/list, log
notifications outside an active request) are not supported. altinity-mcp
only handles client-initiated tool calls, so this is safe today.

Pairs with the JWE OAuth-state refactor in 9f16fd3 — together they make
forward-mode and gating+broker_upstream deployments HA-safe.

BorisTyshkevich merged commit 91de43c into main on May 15, 2026
4 checks passed