From 00fbb43782585ab52979bce0a7be8f7979895e85 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Wed, 24 Jun 2026 23:50:03 +0800 Subject: [PATCH 01/18] docs(adr): openab-agent multi-vendor OAuth & credential storage Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR #1185) auth-trigger model. Surfaced while reviewing PR #1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 267 +++++++++++++++++++++++++++++++++ 1 file changed, 267 insertions(+) create mode 100644 docs/adr/openab-agent-oauth.md diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md new file mode 100644 index 000000000..41a74a8e5 --- /dev/null +++ b/docs/adr/openab-agent-oauth.md @@ -0,0 +1,267 @@ +# ADR: openab-agent — Multi-Vendor LLM-Provider OAuth & Credential Storage + +- **Status:** Proposed +- **Date:** 2026-06-24 +- **Author:** @brettchien +- **Related:** `docs/adr/openab-agent.md` (charter), `docs/adr/openab-agent-mcp.md` §6 (MCP OAuth + §6.1 storage format), PR #1187 (Anthropic OAuth, first provider), PR #1185 (`/auth` device-flow relay), PR #1111 (`--no-browser`) + +--- + +## 1. Context & Motivation + +### 1.1 Why now +`openab-agent` reaches LLM providers in two ways: `ANTHROPIC_API_KEY` (pay-per-token) and an existing +Codex subscription-OAuth tenant in `~/.openab/agent/auth.json`. **PR #1187** adds native **Anthropic +(Claude Pro/Max) OAuth** as a second subscription tenant. This is the moment to set the pattern for *every* +future provider rather than let each PR hand-roll its own flow. + +### 1.2 What PR #1187 surfaced +Reviewing #1187 exposed a latent, **release-blocker-class storage bug** that is independent of any single +provider: `auth.json` is a shared multi-writer file with an **unlocked read-modify-write**, and openab-agent +runs **one process per Discord thread** (`SessionPool` in `crates/openab-core/src/acp/pool.rs` → `crates/openab-core/src/acp/connection.rs` +spawns one `openab-agent` child per thread). So ordinary concurrent multi-thread usage = concurrent +processes refreshing the same OAuth token → refresh-token-rotation reuse → worst case OAuth 2.1 §10.4 +**token-family revocation = fleet-wide logout**. API-key users never hit this (no refresh); **OAuth adoption +is what activates the bug.** + +### 1.3 The wider demand +openab packages 14 agent variants (`kiro, claude, codex, copilot, cursor, gemini, grok, hermes, mimocode, +opencode, antigravity, pi, native, agentcore`). Several wrap a model vendor reachable by subscription OAuth. +A coherent extension model lets openab-agent (the `native` variant) host these directly. PR #1185 already +shipped a Discord `/auth` slash command that relays a device-flow login — the agreed near-term auth UX. + +--- + +## 2. Goals & Non-Goals + +### In scope +- A single **`OAuthVendor` adapter** (auth axis) reused by all subscription-OAuth providers. +- Keeping the **inference axis** (per-provider request/response transport) **separate** from auth. +- A **concurrency-safe credential store**: all `auth.json` writes funnel through one locked + read-modify-write helper (covers MCP `CredentialStore` + provider tenants). +- Support for the OAuth styles real vendors use: **PKCE public**, **PKCE + bundled client_secret**, + **device flow (RFC 8628)**, and **pre-provisioned long-lived token via env** (`CLAUDE_CODE_OAUTH_TOKEN`). +- Compatibility with PR #1185's `/auth` poll-and-exit relay model. + +### Out of scope +- **Layer-3 auto-trigger** (agent auto-launches login on a mid-turn 401). DEFERRED (Brett, 2026-06-24); + the manual `/auth` command is sufficient for now. +- Building every vendor at once. This ADR sets the model; vendors land incrementally. +- Non-OAuth backends: `agentcore` (AWS SigV4/IAM/Bedrock) is explicitly outside the OAuth surface. +- MCP-server OAuth internals — owned by `docs/adr/openab-agent-mcp.md`; this ADR only shares the storage + layer with it. + +--- + +## 3. Prior Art Survey +(Per `docs/adr/pr-contribution-guidelines.md`, OpenClaw + Hermes are mandatory references.) + +- **Pi (`earendil-works/pi`)** — primary source ported for #1187's Anthropic flow (PKCE endpoints, Claude + Code identity headers, system block, tool-name casing). Also ships `CLAUDE_CODE_OAUTH_TOKEN` support + (pi #3591) and provider extensions for Kiro / Cursor / xAI — evidence the per-vendor adapter shape works. +- **OpenClaw** — API keys + subscription OAuth. Anthropic via **setup-token** or **reuse of local Claude + CLI** (no native PKCE login). Codex via full PKCE. Stores per-profile `{access,refresh,expires,accountId}` + and treats the profile file as a **token sink refreshed under a file lock** — direct corroboration for the + locked-RMW decision (§5.4). +- **Hermes Agent (NousResearch)** — `PROVIDER_REGISTRY` dataclasses declare each provider's auth type + + URLs + env vars; one `resolve_runtime_provider()` entry point. Anthropic is **API-key only** (or reuse + `~/.claude/.credentials.json`). `auth.json` guarded with `fcntl`/`msvcrt` file locks — again corroborates + §5.4. The registry pattern is the spiritual model for `OAuthVendor`. +- **Vendor CLIs (evidence for the matrix, §6):** Gemini CLI (`code_assist/oauth2.ts`), Antigravity + (`opencode-antigravity-auth` + `ANTIGRAVITY_API_SPEC.md`), GitHub Copilot CLI, Kiro CLI, xAI/Grok, + Xiaomi MiMo Code — surveyed 2026-06-24 (§6). + +**How this ADR differs:** like OpenClaw/Hermes it keeps one namespaced multi-tenant credential file with +atomic writes + per-refresh rotation handling, and (unlike both) adds native PKCE logins. It adds two things +neither documents cleanly: a **two-axis** auth/inference split, and an explicit **cross-process** locked-RMW +invariant (both flag file locks but for the simpler single-process case). + +--- + +## 4. Design Decision + +1. **Adopt a two-axis model.** Auth (how a credential is obtained/refreshed/stored) and inference (how a + request is sent) are **orthogonal** and must not be coupled. A vendor that serves Claude over Google's + Code Assist envelope (agy — the Antigravity CLI variant; see §6) reuses neither Anthropic's Messages-V1 + transport nor its auth. +2. **Auth axis = one `OAuthVendor` descriptor + shared driver** (§5.1). New vendor = new descriptor; the + PKCE/state/loopback/paste/device-poll/storage driver is written once. +3. **Inference axis = one provider per wire format** (§5.2). Four formats today; no reuse across them. +4. **Credential storage = single locked-RMW funnel** (§5.4). Establishes the invariant: *every* write to + `auth.json` goes through `with_auth_locked`, with refresh performed outside the lock + re-read inside + (single-flight). This is a Consequence of the multi-writer/cross-process reality, not an optional perf + tweak. +5. **Credential-source precedence:** explicit API key → pre-provisioned long-lived OAuth token env + (`CLAUDE_CODE_OAUTH_TOKEN` and equivalents) → stored interactive OAuth tenant. Rationale + why the env + path is the preferred fleet mode: §5.3. + +--- + +## 5. Detailed Design + +### 5.1 `OAuthVendor` (auth axis) +```rust +trait OAuthVendor { + fn namespace(&self) -> &str; // "codex" / "anthropic-oauth" / "antigravity" ... + fn client_id(&self) -> String; // env override + default + fn client_secret(&self) -> Option { None } // Gemini = Some(bundled); agy TBD (§9 Q2); Anthropic/Codex = None + fn authorize_url(&self) -> &str; + fn token_url(&self) -> &str; + fn redirect(&self) -> (u16, &str); + fn scope(&self) -> &str; + fn extra_authorize_params(&self) -> &[(&str,&str)] { &[] } // Anthropic: ("code","true") + fn token_body(&self) -> TokenBodyFormat { TokenBodyFormat::Form } // Anthropic = Json-no-scope + fn grant(&self) -> AuthGrant { AuthGrant::Pkce } // DeviceCode for copilot/kiro +} +enum TokenBodyFormat { Form, Json } +enum AuthGrant { Pkce, DeviceCode } +``` +Shared driver handles browser-loopback, no-browser paste, device-code poll, and refresh once. Fold the +existing Codex flow into the shared `accept_callback_code` helper (its comment already says "fold it in"); +unify the `127.0.0.1` vs `localhost` bind while there. + +### 5.2 Inference providers (inference axis — no reuse) +| Provider | Endpoint | Wire format | Vendors | +|---|---|---|---| +| `AnthropicProvider` (exists) | `api.anthropic.com/v1/messages` | Anthropic Messages V1 | claude; mimocode `/anthropic` mirror | +| `OpenAiProvider` (exists) | OpenAI-style `/v1/chat/completions` | OpenAI Chat/Responses | codex, grok, copilot, mimocode | +| `AntigravityProvider` (new) | `cloudcode-pa.googleapis.com` | Google Code Assist (`{project,model,request}`→`{candidates[]}`) | gemini, agy | +| `AwsQProvider` (new, heaviest) | AWS CodeWhisperer/Q | AWS proprietary event-stream | kiro | + +OAuth-mode request decoration (Bearer + identity headers/system-block/tool-name casing) stays in the +inference provider; if shared, a small `decorate_request()` hook — never folded into `OAuthVendor`. + +### 5.3 Credential-source precedence & the env route +Anthropic offers a route that bypasses interactive login entirely: `claude setup-token` mints a long-lived +subscription OAuth token (~1-year per Anthropic's Claude Code docs) exposed as **`CLAUDE_CODE_OAUTH_TOKEN`**. For pods, ops mints it once and injects +it as a k8s secret — no interactive flow, no `auth.json` write, near-zero race exposure. openab-agent should +read it as a Bearer subscription source, precedence: `ANTHROPIC_API_KEY` → `CLAUDE_CODE_OAUTH_TOKEN` → +stored `anthropic-oauth` tenant. This is the recommended fleet mode; interactive OAuth is for self-service. + +### 5.4 Concurrency & storage invariant (folds in the flock decision) +`auth.json` is multi-tenant (`codex`, `anthropic-oauth`, `mcp:`×N) and written by two code paths +(`save_tokens_for`, `McpCredentialStore`) across **multiple processes** (one per Discord thread). The +required invariant: +```rust +// ALL writers funnel through this. auth.rs storage layer. +fn with_auth_locked(f: impl FnOnce(&mut HashMap) -> R) -> Result { + let _g = flock_exclusive("auth.json.lock")?; // sidecar file (NOT auth.json — rename swaps its inode) + let mut map = read_auth_file(&path)?; // re-read inside lock (anti lost-update + thundering herd) + let r = f(&mut map); + write_auth_file(&path, &map)?; // existing atomic tmp+rename + Ok(r) +} +``` +- **Refresh runs OUTSIDE the lock**, result committed inside (re-read → if file already newer, adopt it → + else write) ⟹ N processes do 1 real refresh. +- **`flock(2)`, not a sentinel lockfile**: kernel auto-releases on fd close / process death → no stale lock. +- **try-lock + timeout** so a hung holder degrades to a graceful error, never a wedged worker. +- **Crate:** `rustix::fs::flock` (already in the tree, safe API), gated `#[cfg(unix)]` with a non-unix + no-op — mirroring the existing atomic-write cfg split. (openab-agent is de-facto unix-only: its + `ci-openab-agent.yml` is linux, deploy is always container; Windows binaries are the broker only.) +- This invariant serves the MCP `CredentialStore` too — see `openab-agent-mcp.md` §6.1. +- **Until this lands**, prefer the `CLAUDE_CODE_OAUTH_TOKEN` env route (§5.3 — no refresh write, no race); + treat interactive Anthropic OAuth as not-yet-hardened for high concurrency. + +--- + +## 6. Vendor feasibility matrix (surveyed 2026-06-24) +``` +Variant OAuth style Inference bucket Native feasibility +────────────────────────────────────────────────────────────────────────────────────────── +claude PKCE public (+env token) Anthropic Messages V1 ✅ done (#1187) + add env route +codex PKCE public / device OpenAI ✅ done (has device flow) +grok (xAI) xai-oauth (sub) / api-key OpenAI-compatible 🟢 easy (reuse OpenAiProvider) +mimocode MiMo Platform OAuth/key OpenAI-compat (+/anthropic) 🟢 easy (dual-bucket; OAuth low-ROI) +copilot GitHub device flow OpenAI-compat (githubcopilot) 🟡 token exchange + CC headers +gemini PKCE + bundled secret Google Code Assist 🟡 new provider +antigravity PKCE + bundled secret Google Code Assist 🟡 same provider; ToS-risk* +kiro AWS Builder ID device flow AWS Q/CodeWhisperer (propr.) 🔴 hard (event-stream) +cursor Cursor browser OAuth Cursor proprietary proxy 🔴 reverse-eng, ToS-risk* +hermes API-key multi ⚪ agent shell, not a vendor +opencode BYO (per-auth plugins) multi ⚪ agent shell +pi BYO (provider extensions) multi ⚪ agent shell +native — — = openab-agent itself +agentcore AWS SigV4/IAM (not OAuth) AWS Bedrock ❌ out of OAuth scope +``` +Concrete values (verified): codex `app_EMoamEEZ73f0CkXaXp7hrann` (no secret, form); claude +`9d1c250a-e61b-44d9-88ed-5944d1962f5e` (no secret, JSON no-scope); gemini +`681255809395-oo8ft2oprdrnp9e3aqf6av3hmdib135j` (+ bundled `GOCSPX-…`, git-safe); agy +`1071006060591-tmhssin2h21lcre235vtolojh4g403ep` (likely bundled — UNCONFIRMED; redirect `localhost:51121`, +scopes add `cclog`/`experimentsandconfigs`; inference `cloudcode-pa`, needs GCP `project` field; one OAuth +unlocks Claude+Gemini+GPT-OSS; agy ≠ Messages V1). + +\* **ToS-risk** = relies on the vendor's official-client OAuth credentials + subscription quota from a +third-party application (openab-agent) rather than the vendor's own client — which may violate that vendor's +Terms of Service. + +--- + +## 7. Auth-trigger UX (PR #1185) +`/auth` is broker-side (`crates/openab-core/src/discord.rs`) — openab-agent advertises no slash commands; +it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMAND`. The relay is +**poll-and-exit**: print URL+code to stdout, poll the AS, exit 0. +- **Anthropic has NO device flow** (claude.ai = authorization_code only; RFC 8628 unshipped, + anthropics/claude-code #22992) and #1187's `--no-browser` reads the code from **stdin**, which the relay + cannot feed → undrivable by `/auth`. +- **Resolution for interactive Claude self-service = two-step, code-as-CLI-arg:** `/auth claude` persists + PKCE verifier+state as a pending entry (reuse the existing `mcp-pending`/`AuthEntry::Pending` machinery) + + prints the `code=true` URL (claude.ai shows a copyable code); `/auth claude ` → + `anthropic-oauth --code ` completes. No stdin. (Fallback: broker pipes a follow-up DM/modal to child + stdin — #1185 v2.) For pods, the §5.3 env route avoids all of this. + +--- + +## 8. Rejected alternatives +- **Per-vendor bespoke flows (status quo):** rejected — N copies of PKCE/loopback/refresh; #1187 already + duplicated the Codex flow. Doesn't scale to 5+ vendors. +- **Force everything through rmcp `CredentialStore`:** rejected — lossy. `TokenStore` (provider) and rmcp + `StoredCredentials` are different on-disk shapes (untagged `AuthEntry`); the translation drops fields + (see `openab-agent-mcp.md` §6.1). The shared layer must sit *below* both (file RMW), not in one's trait. +- **`oauth2` crate for the provider flows:** rejected as the primary path — it enforces RFC form-encoded + token bodies; Anthropic needs JSON-no-scope, fighting the lib. May still supply PKCE/state primitives. + Does **not** solve the race (stateless). +- **In-process `Mutex` / tokio single-flight, or a sentinel lockfile (create→delete), for the race:** + rejected — see §5.4 (in-process locks are useless across the per-thread processes; a sentinel lockfile + deadlocks if a holder dies, whereas `flock(2)` auto-releases on death). +- **Device flow for Anthropic:** not available (Anthropic ships no device endpoint). Hence the env route + + two-step interactive (§7). +- **Layer-3 auto-trigger now:** deferred — `/auth` manual is sufficient (Brett, 2026-06-24). + +--- + +## 9. Decisions & open questions +1. **Default-model staleness — DECIDED (Brett 2026-06-24): no hardcoded default; require via config/env, + fail-loud.** Hardcoding `claude-opus-4-8` is a recurring 404 timebomb: this PR exists because the prior + dated default 404'd on the subscription endpoint, and 4.6+ dateless IDs are **fixed canonical IDs, not + evergreen aliases** — there is no floating "-latest" to lean on, and Messages V1 mandates a `model`. + Resolve model as ACP/CLI `model_override` → `OPENAB_AGENT_MODEL` → **error** (no hardcoded fallback); + drop the three duplicated default sites (`llm.rs:153`, `acp.rs:385/446`). Consequence: removes the + zero-config default (deployments set model via values.yaml/env already; needs a clear error message + + CHANGELOG note). Also eliminates the silent Opus cost bump for API-key users. +2. agy `client_secret` requirement — confirm from the plugin `src/constants.ts` / agy binary. +3. Which heavy/ToS-risk vendors to actually build: kiro (AWS proprietary), cursor (reverse-eng + ToS-risk), + agy (ToS-risk). Needs an explicit go/no-go. +4. Does the locked-RMW fix also subsume `openab-agent-mcp.md` open items #1 (reqwest 0.12/0.13 split) and + #8 (doctor/runtime two-store split)? Likely partial. + +--- + +## 10. References + +### Internal +- `docs/adr/openab-agent.md` — agent charter (4 tools, no SDK, thin HTTP) +- `docs/adr/openab-agent-mcp.md` — MCP client + §6 OAuth + §6.1 storage format +- `docs/adr/pr-contribution-guidelines.md` — prior-art requirements +- PR #1187 (Anthropic OAuth), PR #1185 (`/auth`), PR #1111 (`--no-browser`) + +### External — projects +- Pi `earendil-works/pi` (ported flow; `CLAUDE_CODE_OAUTH_TOKEN` #3591) · OpenClaw · Hermes Agent +- Gemini CLI `code_assist/oauth2.ts` · `NoeFabris/opencode-antigravity-auth` (+ `ANTIGRAVITY_API_SPEC.md`) +- GitHub `copilot-cli` · Kiro CLI / `pi-provider-kiro` / `kiro-gateway` · xAI API / `pi-xai-oauth` +- Xiaomi `MiMo-Code` + +### External — specs +- RFC 8628 (Device Authorization Grant) · OAuth 2.1 §10.4 (refresh-token rotation/reuse) +- anthropics/claude-code #22992 (device-flow request), #20215 (MCP device flow) +- [Documenting Architecture Decisions — Nygard (2011)](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions.html) From b5e46e775dbb617af1bc57067aadfcb4fd69425e Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 00:07:35 +0800 Subject: [PATCH 02/18] =?UTF-8?q?docs(adr):=20address=20review=20=E2=80=94?= =?UTF-8?q?=20prefer=20oauth2=20crate,=20drop=20rollout,=20vendor=20names?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Build the OAuthVendor driver on the official `oauth2` crate (already in-tree via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the Anthropic JSON-token-body quirk is applied via the crate's custom http-client hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative). - Remove the project Rollout-plan section (internal sequencing, not ADR material); keep the race-window mitigation in §5.4. - Use vendor names only; drop internal fleet-agent references from the matrix. - Replace unexplained "ToS-gray" with "ToS-risk" + a definition. - Fix cross-references (crate-qualified paths, §-refs, line numbers) and move the settled model-default decision under "Decisions & open questions". Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index 41a74a8e5..cef14986a 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -84,8 +84,10 @@ invariant (both flag file locks but for the simpler single-process case). request is sent) are **orthogonal** and must not be coupled. A vendor that serves Claude over Google's Code Assist envelope (agy — the Antigravity CLI variant; see §6) reuses neither Anthropic's Messages-V1 transport nor its auth. -2. **Auth axis = one `OAuthVendor` descriptor + shared driver** (§5.1). New vendor = new descriptor; the - PKCE/state/loopback/paste/device-poll/storage driver is written once. +2. **Auth axis = one `OAuthVendor` descriptor + a shared driver built on the official `oauth2` crate** (§5.1; + the crate is already in-tree via the MCP side). New vendor = new descriptor; PKCE/CSRF/auth-code + exchange/refresh come from the crate, **not hand-rolled**. The few vendor quirks (e.g. Anthropic's JSON + token body) are applied through the crate's custom http-client hook, not by forking the flow. 3. **Inference axis = one provider per wire format** (§5.2). Four formats today; no reuse across them. 4. **Credential storage = single locked-RMW funnel** (§5.4). Establishes the invariant: *every* write to `auth.json` goes through `with_auth_locked`, with refresh performed outside the lock + re-read inside @@ -116,9 +118,13 @@ trait OAuthVendor { enum TokenBodyFormat { Form, Json } enum AuthGrant { Pkce, DeviceCode } ``` -Shared driver handles browser-loopback, no-browser paste, device-code poll, and refresh once. Fold the -existing Codex flow into the shared `accept_callback_code` helper (its comment already says "fold it in"); -unify the `127.0.0.1` vs `localhost` bind while there. +The shared driver is built on the **official `oauth2` crate** (already a dependency via the MCP side): it +supplies PKCE, CSRF `state`, the authorization-code exchange, and refresh; the descriptor only feeds it +per-vendor config. Hand-rolled code is limited to what the crate does not cover — the loopback/paste/ +device-code callback plumbing (fold the existing Codex flow into the shared `accept_callback_code` helper — +its comment already says "fold it in"; unify the `127.0.0.1` vs `localhost` bind) and the single +body-encoding override (Anthropic's JSON-no-scope token request, applied via the crate's custom http-client +hook rather than a separate flow). ### 5.2 Inference providers (inference axis — no reuse) | Provider | Endpoint | Wire format | Vendors | @@ -218,9 +224,11 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA - **Force everything through rmcp `CredentialStore`:** rejected — lossy. `TokenStore` (provider) and rmcp `StoredCredentials` are different on-disk shapes (untagged `AuthEntry`); the translation drops fields (see `openab-agent-mcp.md` §6.1). The shared layer must sit *below* both (file RMW), not in one's trait. -- **`oauth2` crate for the provider flows:** rejected as the primary path — it enforces RFC form-encoded - token bodies; Anthropic needs JSON-no-scope, fighting the lib. May still supply PKCE/state primitives. - Does **not** solve the race (stateless). +- **Fully hand-rolled OAuth flow:** rejected — it reimplements PKCE/CSRF/exchange/refresh that the official + `oauth2` crate (already in-tree) provides. The crate is the chosen basis (§4 decision 2, §5.1); its one + friction — it defaults to RFC form-encoded token bodies while Anthropic needs JSON-no-scope — is handled + via the crate's custom http-client hook, not by abandoning it. (`oauth2` is stateless and does **not** + solve the auth.json race — that's the storage-layer's job, §5.4.) - **In-process `Mutex` / tokio single-flight, or a sentinel lockfile (create→delete), for the race:** rejected — see §5.4 (in-process locks are useless across the per-thread processes; a sentinel lockfile deadlocks if a holder dies, whereas `flock(2)` auto-releases on death). From 285b335ccde6b0edcacabd614f7c035e194ae673 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 00:14:54 +0800 Subject: [PATCH 03/18] docs(adr): per-tenant refresh lock + per-user PKCE keying (Mira review r1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fold in two release-blocker-class concurrency fixes from Mira's review: - §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on commit" claim was wrong — N processes each send a refresh with the same RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family revocation. Replace with a per-tenant exclusive lock: network refresh held under the tenant lock only (not the global lock), so exactly one refresh per tenant per expiry with no head-of-line blocking across tenants. Extends to mcp: tenants. - §7 /auth: key the pending PKCE verifier+state by the initiating Discord user id instead of a single global entry — prevents concurrent-user overwrite (PKCE mismatch) and session hijack. - §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 69 +++++++++++++++++++++++++--------- 1 file changed, 52 insertions(+), 17 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index cef14986a..f50a350d4 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -89,10 +89,11 @@ invariant (both flag file locks but for the simpler single-process case). exchange/refresh come from the crate, **not hand-rolled**. The few vendor quirks (e.g. Anthropic's JSON token body) are applied through the crate's custom http-client hook, not by forking the flow. 3. **Inference axis = one provider per wire format** (§5.2). Four formats today; no reuse across them. -4. **Credential storage = single locked-RMW funnel** (§5.4). Establishes the invariant: *every* write to - `auth.json` goes through `with_auth_locked`, with refresh performed outside the lock + re-read inside - (single-flight). This is a Consequence of the multi-writer/cross-process reality, not an optional perf - tweak. +4. **Credential storage = locked-RMW funnel + per-tenant refresh lock** (§5.4). *Every* write to `auth.json` + goes through `with_auth_locked` (global lock — file integrity); *every* token refresh serializes on a + **per-tenant** lock so concurrent processes perform exactly one network refresh per tenant and never + present a rotated `RT_old` twice (which would trigger OAuth 2.1 §10.4 token-family revocation). A + Consequence of the multi-writer/cross-process reality, not an optional perf tweak. 5. **Credential-source precedence:** explicit API key → pre-provisioned long-lived OAuth token env (`CLAUDE_CODE_OAUTH_TOKEN` and equivalents) → stored interactive OAuth tenant. Rationale + why the env path is the preferred fleet mode: §5.3. @@ -109,7 +110,7 @@ trait OAuthVendor { fn client_secret(&self) -> Option { None } // Gemini = Some(bundled); agy TBD (§9 Q2); Anthropic/Codex = None fn authorize_url(&self) -> &str; fn token_url(&self) -> &str; - fn redirect(&self) -> (u16, &str); + fn redirect(&self) -> Option<(u16, &'static str)> { None } // Some((port,path)) for loopback PKCE; None for device flow (no redirect endpoint) fn scope(&self) -> &str; fn extra_authorize_params(&self) -> &[(&str,&str)] { &[] } // Anthropic: ("code","true") fn token_body(&self) -> TokenBodyFormat { TokenBodyFormat::Form } // Anthropic = Json-no-scope @@ -146,26 +147,56 @@ stored `anthropic-oauth` tenant. This is the recommended fleet mode; interactive ### 5.4 Concurrency & storage invariant (folds in the flock decision) `auth.json` is multi-tenant (`codex`, `anthropic-oauth`, `mcp:`×N) and written by two code paths -(`save_tokens_for`, `McpCredentialStore`) across **multiple processes** (one per Discord thread). The -required invariant: +(`save_tokens_for`, `McpCredentialStore`) across **multiple processes** (one per Discord thread). Two +distinct hazards demand two locks. + +**(a) File integrity — one global lock.** Every write funnels through a single locked RMW so concurrent +writers never lost-update the shared file: ```rust // ALL writers funnel through this. auth.rs storage layer. fn with_auth_locked(f: impl FnOnce(&mut HashMap) -> R) -> Result { let _g = flock_exclusive("auth.json.lock")?; // sidecar file (NOT auth.json — rename swaps its inode) - let mut map = read_auth_file(&path)?; // re-read inside lock (anti lost-update + thundering herd) + let mut map = read_auth_file(&path)?; // re-read inside lock (anti lost-update) let r = f(&mut map); write_auth_file(&path, &map)?; // existing atomic tmp+rename Ok(r) } ``` -- **Refresh runs OUTSIDE the lock**, result committed inside (re-read → if file already newer, adopt it → - else write) ⟹ N processes do 1 real refresh. -- **`flock(2)`, not a sentinel lockfile**: kernel auto-releases on fd close / process death → no stale lock. -- **try-lock + timeout** so a hung holder degrades to a graceful error, never a wedged worker. + +**(b) Refresh-token rotation — one lock per tenant.** An earlier draft ran the refresh *outside* the lock +and committed the result inside, claiming "N processes do 1 real refresh." **That is wrong** (Mira review, +2026-06-24): re-read-on-commit only prevents a lost *write* — every process has already *sent* a network +refresh carrying the same `RT_old` before it reaches the commit. Under OAuth 2.1 §10.4 refresh-token +rotation, the second `RT_old` presentation reads as reuse and the AS **revokes the whole token family** = +exactly the fleet-wide logout this ADR exists to prevent. Holding the *global* exclusive lock across the +network refresh would serialize it, but then a slow refresh for one tenant head-of-line-blocks every other +tenant (MCP servers, Codex). So: **one exclusive lock file per tenant**, network I/O held under the tenant +lock only — never under the global lock: +```rust +fn get_or_refresh(tenant: &str) -> Result { + // 1. fast path — fresh token under a shared (read) global lock + if let Some(t) = read_fresh(tenant)? { return Ok(t); } + // 2. serialize refreshes for THIS tenant only (other tenants unaffected) + let _tg = flock_exclusive(&format!("auth.json.{tenant}.lock"))?; + // 3. double-check — another process may have refreshed while we waited on the tenant lock + if let Some(t) = read_fresh(tenant)? { return Ok(t); } + // 4. exactly one network refresh per tenant per expiry — tenant lock held, global lock NOT + let fresh = perform_network_refresh(tenant)?; + // 5. commit under the global lock (fast inode swap, no network I/O inside) + with_auth_locked(|m| m.insert(tenant.into(), fresh.clone()))?; + Ok(fresh.access_token) +} +``` +- **`flock(2)`, not a sentinel lockfile**: kernel auto-releases on fd close / process death → a hung or + killed refresher frees its tenant lock instantly. No stale lock, no manual timeout/orphan cleanup. +- **try-lock + timeout** on the global lock so a wedged writer degrades to a graceful error, never a wedged + worker. - **Crate:** `rustix::fs::flock` (already in the tree, safe API), gated `#[cfg(unix)]` with a non-unix no-op — mirroring the existing atomic-write cfg split. (openab-agent is de-facto unix-only: its `ci-openab-agent.yml` is linux, deploy is always container; Windows binaries are the broker only.) -- This invariant serves the MCP `CredentialStore` too — see `openab-agent-mcp.md` §6.1. +- Each MCP `mcp:` tenant takes its own tenant lock by the same rule, so the MCP `CredentialStore` + refreshes are serialized per server too — the invariant serves it directly (see `openab-agent-mcp.md` + §6.1). - **Until this lands**, prefer the `CLAUDE_CODE_OAUTH_TOKEN` env route (§5.3 — no refresh write, no race); treat interactive Anthropic OAuth as not-yet-hardened for high concurrency. @@ -211,10 +242,14 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA anthropics/claude-code #22992) and #1187's `--no-browser` reads the code from **stdin**, which the relay cannot feed → undrivable by `/auth`. - **Resolution for interactive Claude self-service = two-step, code-as-CLI-arg:** `/auth claude` persists - PKCE verifier+state as a pending entry (reuse the existing `mcp-pending`/`AuthEntry::Pending` machinery) + - prints the `code=true` URL (claude.ai shows a copyable code); `/auth claude ` → - `anthropic-oauth --code ` completes. No stdin. (Fallback: broker pipes a follow-up DM/modal to child - stdin — #1185 v2.) For pods, the §5.3 env route avoids all of this. + the PKCE verifier+state as a pending entry **keyed by the initiating Discord user id** + (`pending:claude:`, reuse the existing `mcp-pending`/`AuthEntry::Pending` machinery) + + prints the `code=true` URL (claude.ai shows a copyable code); `/auth claude ` forwards that same user + id so `anthropic-oauth --code ` loads the matching verifier and completes. No stdin. **Per-user + keying is required** (Mira review, 2026-06-24): a single global pending entry lets a second concurrent + user's verifier overwrite the first's → PKCE mismatch on exchange, and worse, lets user B complete a flow + user A initiated (session hijack). (Fallback: broker pipes a follow-up DM/modal to child stdin — #1185 v2.) + For pods, the §5.3 env route avoids all of this. --- From f3dedfe0abecb1d6d20b0f772abbee536786e1b2 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 00:19:23 +0800 Subject: [PATCH 04/18] docs(adr): bundled-secret storage + pending-entry GC (Mira review r2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX- secret. The value is non-confidential by RFC 8252 / Google docs, but GitHub now push-protects Google secrets by default (changelog 2026-03) and partner-scans them for auto-revoke, so a raw literal is not safe in a public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT a security control) or env-inject at runtime — not raw text. - §7: pending PKCE entries get created_at + a 15-min GC sweep in with_auth_locked so abandoned /auth attempts don't accumulate. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index f50a350d4..85b567dad 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -223,7 +223,7 @@ agentcore AWS SigV4/IAM (not OAuth) AWS Bedrock ❌ out of ``` Concrete values (verified): codex `app_EMoamEEZ73f0CkXaXp7hrann` (no secret, form); claude `9d1c250a-e61b-44d9-88ed-5944d1962f5e` (no secret, JSON no-scope); gemini -`681255809395-oo8ft2oprdrnp9e3aqf6av3hmdib135j` (+ bundled `GOCSPX-…`, git-safe); agy +`681255809395-oo8ft2oprdrnp9e3aqf6av3hmdib135j` (+ bundled `GOCSPX-…` — non-confidential by spec but **not** safe as raw repo text; storage decided in §9 Q2); agy `1071006060591-tmhssin2h21lcre235vtolojh4g403ep` (likely bundled — UNCONFIRMED; redirect `localhost:51121`, scopes add `cclog`/`experimentsandconfigs`; inference `cloudcode-pa`, needs GCP `project` field; one OAuth unlocks Claude+Gemini+GPT-OSS; agy ≠ Messages V1). @@ -250,6 +250,9 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA user's verifier overwrite the first's → PKCE mismatch on exchange, and worse, lets user B complete a flow user A initiated (session hijack). (Fallback: broker pipes a follow-up DM/modal to child stdin — #1185 v2.) For pods, the §5.3 env route avoids all of this. +- **Pending-entry GC** (Mira review, 2026-06-24): stamp each `AuthEntry::Pending` with `created_at`; + `with_auth_locked` opportunistically drops pending entries older than 15 min on every write, so abandoned + `/auth` attempts (user never pastes a code) don't accumulate stale verifiers in `auth.json`. --- @@ -282,7 +285,19 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA drop the three duplicated default sites (`llm.rs:153`, `acp.rs:385/446`). Consequence: removes the zero-config default (deployments set model via values.yaml/env already; needs a clear error message + CHANGELOG note). Also eliminates the silent Opus cost bump for API-key users. -2. agy `client_secret` requirement — confirm from the plugin `src/constants.ts` / agy binary. +2. **Bundled `client_secret` storage — REFINED (Mira review 2026-06-24).** Google Code-Assist vendors + (gemini, agy) ship a `GOCSPX-…` desktop-app secret. By RFC 8252 and Google's own docs this value is + **non-confidential** (installed-app secret, "obviously not treated as a secret") — there is no + confidentiality to protect, so obscuring it adds zero cryptographic security. But it is **not safe as raw + text in a public repo for operational reasons**: GitHub push-protection covers Google secrets **by default** + (changelog 2026-03), so a raw `GOCSPX-` literal blocks contributor `git push`, and GitHub↔Google partner + token-scanning may **auto-revoke** the credential once it lands in a public commit. Decision: do **not** + commit the raw literal; pick per vendor — + (a) **encode-at-rest** (split/base64) in source — purely scanner-evasion for an already-public value, *not* + a security control; label it as such inline so reviewers aren't misled into treating it as a real secret; or + (b) **inject at runtime via env** (no secret in the repo at all) — cleaner provenance, consistent with the + §5.3 env-route preference, at the cost of the bundled zero-config UX. + Still confirm whether agy actually *requires* a secret, from the plugin `src/constants.ts` / agy binary. 3. Which heavy/ToS-risk vendors to actually build: kiro (AWS proprietary), cursor (reverse-eng + ToS-risk), agy (ToS-risk). Needs an explicit go/no-go. 4. Does the locked-RMW fix also subsume `openab-agent-mcp.md` open items #1 (reqwest 0.12/0.13 split) and From 8a6ada102514f73129838be355ebc60b0f274875 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 00:27:01 +0800 Subject: [PATCH 05/18] =?UTF-8?q?docs(adr):=20close=20=C2=A79=20Q2/Q3=20?= =?UTF-8?q?=E2=80=94=20env-injection=20default,=20vendor=20go/no-go=20(Bre?= =?UTF-8?q?tt=20decisions=20r3)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled client_secret; encode-at-rest (a) is the fallback for bundled zero-config binaries only, framed as scanner-evasion (not security). Cite rclone rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical precedent (§10). - §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in, ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6. - §10: add rclone + GitHub/Google secret-scanning references. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 53 ++++++++++++++++++++++++---------- 1 file changed, 38 insertions(+), 15 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index 85b567dad..c8f75db48 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -232,6 +232,9 @@ unlocks Claude+Gemini+GPT-OSS; agy ≠ Messages V1). third-party application (openab-agent) rather than the vendor's own client — which may violate that vendor's Terms of Service. +**Build decision (§9 Q3, Brett 2026-06-24):** GO `gemini`/`grok` (first wave) and `agy` (experimental, +opt-in, ToS caveat); No-Go `cursor`/`kiro`. + --- ## 7. Auth-trigger UX (PR #1185) @@ -285,21 +288,36 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA drop the three duplicated default sites (`llm.rs:153`, `acp.rs:385/446`). Consequence: removes the zero-config default (deployments set model via values.yaml/env already; needs a clear error message + CHANGELOG note). Also eliminates the silent Opus cost bump for API-key users. -2. **Bundled `client_secret` storage — REFINED (Mira review 2026-06-24).** Google Code-Assist vendors - (gemini, agy) ship a `GOCSPX-…` desktop-app secret. By RFC 8252 and Google's own docs this value is - **non-confidential** (installed-app secret, "obviously not treated as a secret") — there is no - confidentiality to protect, so obscuring it adds zero cryptographic security. But it is **not safe as raw - text in a public repo for operational reasons**: GitHub push-protection covers Google secrets **by default** - (changelog 2026-03), so a raw `GOCSPX-` literal blocks contributor `git push`, and GitHub↔Google partner - token-scanning may **auto-revoke** the credential once it lands in a public commit. Decision: do **not** - commit the raw literal; pick per vendor — - (a) **encode-at-rest** (split/base64) in source — purely scanner-evasion for an already-public value, *not* - a security control; label it as such inline so reviewers aren't misled into treating it as a real secret; or - (b) **inject at runtime via env** (no secret in the repo at all) — cleaner provenance, consistent with the - §5.3 env-route preference, at the cost of the bundled zero-config UX. - Still confirm whether agy actually *requires* a secret, from the plugin `src/constants.ts` / agy binary. -3. Which heavy/ToS-risk vendors to actually build: kiro (AWS proprietary), cursor (reverse-eng + ToS-risk), - agy (ToS-risk). Needs an explicit go/no-go. +2. **Bundled `client_secret` storage — DECIDED (Brett 2026-06-24): env-injection default; encode-at-rest + fallback.** Google Code-Assist vendors (gemini, agy) ship a `GOCSPX-…` desktop-app secret. By RFC 8252 and + Google's own docs this value is **non-confidential** (installed-app secret, "obviously not treated as a + secret") — there is no confidentiality to protect, so obscuring it adds zero cryptographic security. But it + is **not safe as raw text in a public repo for operational reasons**: GitHub push-protection covers Google + secrets **by default** (changelog 2026-03), so a raw `GOCSPX-` literal blocks contributor `git push`, and + GitHub↔Google partner token-scanning may **auto-revoke** the credential once it lands in a public commit. + Decision: do **not** commit the raw literal — + - **(b) inject at runtime via env is the default** (no secret in the repo at all): cleanest provenance, + consistent with the §5.3 env-route preference, removes the ToS/leak surface entirely. Costs the bundled + zero-config UX, which fleet/pod deployments don't need (they set env already). + - **(a) encode-at-rest** (split/base64) is the **fallback** only where a bundled zero-config binary is + genuinely required: it is **scanner-evasion for an already-public value, *not* a security control** — + label it as such inline so reviewers aren't misled. This is a well-adopted pattern: **rclone** declares a + `rcloneEncryptedClientSecret` constant and calls `obscure.MustReveal()` at runtime for exactly this + reason (bypass static scanners / partner auto-revoke, explicitly not encryption) — see §10. + Still confirm whether agy actually *requires* a secret, from the plugin `src/constants.ts` / agy binary; + under the (b) default this is moot unless agy is shipped as a bundled zero-config binary. +3. **Vendor go/no-go — DECIDED (Brett 2026-06-24).** + - **GO:** `gemini`, `grok` (high value, clean APIs, low ToS risk) — first wave. + - **GO (experimental, opt-in):** `agy`. Marginal eng cost is low — it shares gemini's Google Code-Assist + `AntigravityProvider`, and one OAuth unlocks Claude+Gemini+GPT-OSS. Its one residual risk is **ToS** + (drives Antigravity's official client + the user's subscription quota from a third-party app), which is + *independent of* the now-solved secret-storage question and cannot be engineered away — so agy stays + behind an **explicit opt-in flag with a documented ToS caveat**, and the user accepts the risk on their + own subscription. Watch for `429 RESOURCE_EXHAUSTED` (shared-quota exhaustion) and `cloudcode-pa` + endpoint drift (semi-internal Google API; agy auto-updates). openab already runs agy in production via + the ECS `antigravity` variant, so the auth/quota behaviour is first-hand-known. + - **No-Go:** `cursor` (reverse-engineered proprietary proxy + high ToS/account-ban risk), `kiro` (AWS Q + event-stream protocol — high maintenance cost). Revisit only on explicit demand. 4. Does the locked-RMW fix also subsume `openab-agent-mcp.md` open items #1 (reqwest 0.12/0.13 split) and #8 (doctor/runtime two-store split)? Likely partial. @@ -318,8 +336,13 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA - Gemini CLI `code_assist/oauth2.ts` · `NoeFabris/opencode-antigravity-auth` (+ `ANTIGRAVITY_API_SPEC.md`) - GitHub `copilot-cli` · Kiro CLI / `pi-provider-kiro` / `kiro-gateway` · xAI API / `pi-xai-oauth` - Xiaomi `MiMo-Code` +- **rclone `rclone/rclone`** — `rcloneEncryptedClientSecret` constant + runtime `obscure.MustReveal()`: the + canonical real-world precedent for §9 Q2 encode-at-rest (scanner-evasion of a non-confidential bundled + OAuth secret, explicitly not encryption). ### External — specs - RFC 8628 (Device Authorization Grant) · OAuth 2.1 §10.4 (refresh-token rotation/reuse) - anthropics/claude-code #22992 (device-flow request), #20215 (MCP device flow) +- GitHub secret-scanning — Google secrets push-protected by default (changelog 2026-03-31); Google + google-auth-library-nodejs #959 (desktop client secret is non-confidential) - [Documenting Architecture Decisions — Nygard (2011)](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions.html) From ed83dc2b1ad4ec63764cba9efdb79819d7af4b4e Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 00:30:15 +0800 Subject: [PATCH 06/18] docs(adr): confirm agy secret + ecosystem evidence (GitHub survey r4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit GitHub survey 2026-06-24 grounds the agy GO decision: - §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX- secret, a public constant >=20 antigravity-auth repos hardcode verbatim (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...). Literal deliberately NOT pasted into the doc — would trip the very §9 Q2 push-protection, so we dogfood the env/encode decision. Redirect confirmed localhost:51121/oauth-callback. - §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/ openclaw plugins + proxies), proving the integration, while the same ecosystem's anti-ban / quota-locking / multi-account-rotation tooling empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate. - §10: add the antigravity OAuth ecosystem reference set. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index c8f75db48..2117ce936 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -224,7 +224,10 @@ agentcore AWS SigV4/IAM (not OAuth) AWS Bedrock ❌ out of Concrete values (verified): codex `app_EMoamEEZ73f0CkXaXp7hrann` (no secret, form); claude `9d1c250a-e61b-44d9-88ed-5944d1962f5e` (no secret, JSON no-scope); gemini `681255809395-oo8ft2oprdrnp9e3aqf6av3hmdib135j` (+ bundled `GOCSPX-…` — non-confidential by spec but **not** safe as raw repo text; storage decided in §9 Q2); agy -`1071006060591-tmhssin2h21lcre235vtolojh4g403ep` (likely bundled — UNCONFIRMED; redirect `localhost:51121`, +`1071006060591-tmhssin2h21lcre235vtolojh4g403ep` (bundled secret **CONFIRMED** 2026-06-24 — the same public +`GOCSPX-…` constant ≥20 antigravity-auth ecosystem repos hardcode verbatim, e.g. `NoeFabris/opencode-antigravity-auth`, +`router-for-me/CLIProxyAPI`; **deliberately not reproduced here** — pasting the literal would trip the very +§9 Q2 push-protection, so we dogfood the env/encode decision; redirect `localhost:51121/oauth-callback`, scopes add `cclog`/`experimentsandconfigs`; inference `cloudcode-pa`, needs GCP `project` field; one OAuth unlocks Claude+Gemini+GPT-OSS; agy ≠ Messages V1). @@ -304,8 +307,9 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA label it as such inline so reviewers aren't misled. This is a well-adopted pattern: **rclone** declares a `rcloneEncryptedClientSecret` constant and calls `obscure.MustReveal()` at runtime for exactly this reason (bypass static scanners / partner auto-revoke, explicitly not encryption) — see §10. - Still confirm whether agy actually *requires* a secret, from the plugin `src/constants.ts` / agy binary; - under the (b) default this is moot unless agy is shipped as a bundled zero-config binary. + agy secret requirement is now **CONFIRMED** (2026-06-24): agy *does* require a `GOCSPX-…` client_secret, a + public ecosystem constant (≥20 antigravity-auth repos hardcode it — §6). Under the (b) env default it is + injected, never committed; (a) applies only if agy is ever shipped as a bundled zero-config binary. 3. **Vendor go/no-go — DECIDED (Brett 2026-06-24).** - **GO:** `gemini`, `grok` (high value, clean APIs, low ToS risk) — first wave. - **GO (experimental, opt-in):** `agy`. Marginal eng cost is low — it shares gemini's Google Code-Assist @@ -315,7 +319,12 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA behind an **explicit opt-in flag with a documented ToS caveat**, and the user accepts the risk on their own subscription. Watch for `429 RESOURCE_EXHAUSTED` (shared-quota exhaustion) and `cloudcode-pa` endpoint drift (semi-internal Google API; agy auto-updates). openab already runs agy in production via - the ECS `antigravity` variant, so the auth/quota behaviour is first-hand-known. + the ECS `antigravity` variant, so the auth/quota behaviour is first-hand-known. **Ecosystem evidence + (GitHub survey 2026-06-24):** agy OAuth is widely ported — ≥20 public repos hardcode the identical + client_id/secret (opencode/pi/hermes/openclaw plugins, standalone proxies) — so the integration is + proven, *and* the same ecosystem is full of "anti-ban", "strict quota locking" and "multi-account + rotation" tooling, which empirically confirms the ToS-ban and `429` quota-exhaustion risks are real, not + theoretical — reinforcing the opt-in gate. - **No-Go:** `cursor` (reverse-engineered proprietary proxy + high ToS/account-ban risk), `kiro` (AWS Q event-stream protocol — high maintenance cost). Revisit only on explicit demand. 4. Does the locked-RMW fix also subsume `openab-agent-mcp.md` open items #1 (reqwest 0.12/0.13 split) and @@ -334,6 +343,10 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA ### External — projects - Pi `earendil-works/pi` (ported flow; `CLAUDE_CODE_OAUTH_TOKEN` #3591) · OpenClaw · Hermes Agent - Gemini CLI `code_assist/oauth2.ts` · `NoeFabris/opencode-antigravity-auth` (+ `ANTIGRAVITY_API_SPEC.md`) +- **Antigravity (agy) OAuth ecosystem** (GitHub survey 2026-06-24, ≥20 repos hardcoding the same client_id/ + secret) — `NoeFabris/opencode-antigravity-auth`, `router-for-me/CLIProxyAPI`, `andyvandaric/opencode-ag-auth` + (quota-locking/anti-ban), `Meapri/hermes-google-antigravity-plugin`, `wbbtmusic/openclaw-antigravity-oauth`; + evidence the integration is proven and that ToS-ban/quota mitigations are a real ecosystem concern (§9 Q3) - GitHub `copilot-cli` · Kiro CLI / `pi-provider-kiro` / `kiro-gateway` · xAI API / `pi-xai-oauth` - Xiaomi `MiMo-Code` - **rclone `rclone/rclone`** — `rcloneEncryptedClientSecret` constant + runtime `obscure.MustReveal()`: the From db88c7a2ed76cda9f7251bec3d86837e7fd2df1e Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 00:35:30 +0800 Subject: [PATCH 07/18] docs(adr): agy-vendor vs agy-CLI/ACP clarification + MCP store revamp in scope (r5) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Grounded by a codebase survey (2026-06-24): - §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist inference provider — it does NOT run the agy CLI. agy speaks no ACP; the existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated agy-acp adapter that shells out to the agy binary per prompt and polls its SQLite DB. The provider path sidesteps ACP entirely and supersedes the CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO. - §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also correct the stale save_tokens_for name and note with_auth_locked is new. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 32 +++++++++++++++++++++++++------- 1 file changed, 25 insertions(+), 7 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index 2117ce936..c09fdb413 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -82,8 +82,13 @@ invariant (both flag file locks but for the simpler single-process case). 1. **Adopt a two-axis model.** Auth (how a credential is obtained/refreshed/stored) and inference (how a request is sent) are **orthogonal** and must not be coupled. A vendor that serves Claude over Google's - Code Assist envelope (agy — the Antigravity CLI variant; see §6) reuses neither Anthropic's Messages-V1 - transport nor its auth. + Code Assist envelope (agy; see §6) reuses neither Anthropic's Messages-V1 transport nor its auth. + **Note — "agy as a GO vendor" ≠ running the agy CLI.** This ADR adds agy as an `OAuthVendor` + Code-Assist + inference *provider* consumed by the **native** openab-agent; it does **not** spawn the agy binary. That + matters because **agy speaks no ACP** — the existing `antigravity` *runtime variant* (Mira on ECS) only + works via a dedicated `agy-acp` adapter that shells out to the agy CLI per prompt and polls its SQLite + conversation DB to synthesize ACP events. The vendor/provider path here **sidesteps ACP entirely** and + supersedes that CLI-wrapper for native use — agy's lack of ACP is irrelevant to it (see §6). 2. **Auth axis = one `OAuthVendor` descriptor + a shared driver built on the official `oauth2` crate** (§5.1; the crate is already in-tree via the MCP side). New vendor = new descriptor; PKCE/CSRF/auth-code exchange/refresh come from the crate, **not hand-rolled**. The few vendor quirks (e.g. Anthropic's JSON @@ -146,9 +151,15 @@ read it as a Bearer subscription source, precedence: `ANTHROPIC_API_KEY` → `CL stored `anthropic-oauth` tenant. This is the recommended fleet mode; interactive OAuth is for self-service. ### 5.4 Concurrency & storage invariant (folds in the flock decision) -`auth.json` is multi-tenant (`codex`, `anthropic-oauth`, `mcp:`×N) and written by two code paths -(`save_tokens_for`, `McpCredentialStore`) across **multiple processes** (one per Discord thread). Two -distinct hazards demand two locks. +`auth.json` is multi-tenant (`codex`, `anthropic-oauth`, `mcp:`×N) and written by **two independent +read-modify-write call sites** across **multiple processes** (one per Discord thread): provider tokens via +`save_tokens` (`openab-agent/src/auth.rs:234`) and **MCP** OAuth creds via `McpCredentialStore::save`/`clear` +(`auth.rs:284-328`), plus the MCP pending-login finalize path. **Today there is *no* lock on `auth.json`** — +only an atomic `tmp+rename` in the shared low-level `write_auth_file`; the two RMW callers each do their own +unlocked `read_auth_file → mutate map → write_auth_file`, so a concurrent provider-refresh and MCP-save +last-writer-wins the *entire* map (lost update). Two distinct hazards demand two locks — **and the fix is not +provider-only: the MCP `CredentialStore` is a co-equal RMW caller that must be routed through the same +invariant** (see §9 Q4). (`with_auth_locked` below is *new* — the thing this ADR introduces.) **(a) File integrity — one global lock.** Every write funnels through a single locked RMW so concurrent writers never lost-update the shared file: @@ -327,8 +338,15 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA theoretical — reinforcing the opt-in gate. - **No-Go:** `cursor` (reverse-engineered proprietary proxy + high ToS/account-ban risk), `kiro` (AWS Q event-stream protocol — high maintenance cost). Revisit only on explicit demand. -4. Does the locked-RMW fix also subsume `openab-agent-mcp.md` open items #1 (reqwest 0.12/0.13 split) and - #8 (doctor/runtime two-store split)? Likely partial. +4. **MCP credential store must be revamped together — IN SCOPE (Brett 2026-06-24).** The locked-RMW + + per-tenant-lock invariant (§5.4) is **not** a provider-only change. `McpCredentialStore::save`/`clear` + (`auth.rs:284-328`) is a co-equal unlocked RMW writer of `auth.json`, and the MCP pending-login finalize + path writes it too. Introducing `with_auth_locked` therefore requires routing **both** the provider + (`save_tokens`) and the MCP `CredentialStore` (+ pending finalize) through it — otherwise the lock is + bypassed by half the writers and the race persists. Land them in the same change. This also likely + subsumes (partially) `openab-agent-mcp.md` open items #1 (reqwest 0.12/0.13 split) and #8 (doctor/runtime + two-store split); confirm against that ADR. Connects to the pending OAuth-revamp follow-up flagged on the + `feat/openab-agent-mcp-resilience` PR. --- From 2c6063b182e9d0982f01c996af62b1697fbd5904 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 01:04:10 +0800 Subject: [PATCH 08/18] =?UTF-8?q?docs(adr):=20fix=20stale=20=C2=A79=20Q4?= =?UTF-8?q?=20cross-reference?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The §9 Q4 tail referenced 'openab-agent-mcp.md open items #1 (reqwest 0.12/0.13 split) and #8 (doctor/runtime two-store split)' — but that ADR's §10 Open Questions has only two items (mcp.json location; native-vs-broker parity), neither matching, and no such numbered items / terms exist anywhere in it. The phantom reference dated to the original draft. Replace with an accurate statement: McpCredentialStore reuses the same TokenStore/auth.json storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both; the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index c09fdb413..475b0cae6 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -343,10 +343,11 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA (`auth.rs:284-328`) is a co-equal unlocked RMW writer of `auth.json`, and the MCP pending-login finalize path writes it too. Introducing `with_auth_locked` therefore requires routing **both** the provider (`save_tokens`) and the MCP `CredentialStore` (+ pending finalize) through it — otherwise the lock is - bypassed by half the writers and the race persists. Land them in the same change. This also likely - subsumes (partially) `openab-agent-mcp.md` open items #1 (reqwest 0.12/0.13 split) and #8 (doctor/runtime - two-store split); confirm against that ADR. Connects to the pending OAuth-revamp follow-up flagged on the - `feat/openab-agent-mcp-resilience` PR. + bypassed by half the writers and the race persists. Land them in the same change. (`McpCredentialStore` + reuses the same `TokenStore`/`auth.json` storage that `openab-agent-mcp.md` §6.1 describes, so the lock + lands once and serves both.) Connects to the pending OAuth-revamp follow-up flagged on the + `feat/openab-agent-mcp-resilience` PR — itself driven by the rmcp/`reqwest` dependency-version conflict its + OAuth adoption surfaced. --- From 515d49b08ce66a6fb80d2ac88bbf6f3d633b7fde Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 01:05:44 +0800 Subject: [PATCH 09/18] =?UTF-8?q?docs(adr):=20=C2=A79=20Q2=20=E2=80=94=20e?= =?UTF-8?q?ncode-at-rest=20is=20the=20default,=20env=20is=20the=20alternat?= =?UTF-8?q?ive=20(Brett)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Brett's call given the 93%-plaintext ecosystem survey: keep the bundled zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style) as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses the prior (b)-default ordering. Framing unchanged: encode-at-rest is scanner-evasion for a non-confidential value, not a security control. Record the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that the real risk mitigated is org-repo push-protection friction, not credential loss. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index 475b0cae6..2e688af1e 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -302,25 +302,30 @@ it exposes CLI subcommands the relay shells out to via `$OPENAB_AGENT_AUTH_COMMA drop the three duplicated default sites (`llm.rs:153`, `acp.rs:385/446`). Consequence: removes the zero-config default (deployments set model via values.yaml/env already; needs a clear error message + CHANGELOG note). Also eliminates the silent Opus cost bump for API-key users. -2. **Bundled `client_secret` storage — DECIDED (Brett 2026-06-24): env-injection default; encode-at-rest - fallback.** Google Code-Assist vendors (gemini, agy) ship a `GOCSPX-…` desktop-app secret. By RFC 8252 and +2. **Bundled `client_secret` storage — DECIDED (Brett 2026-06-24): encode-at-rest default; env-injection + alternative.** Google Code-Assist vendors (gemini, agy) ship a `GOCSPX-…` desktop-app secret. By RFC 8252 and Google's own docs this value is **non-confidential** (installed-app secret, "obviously not treated as a secret") — there is no confidentiality to protect, so obscuring it adds zero cryptographic security. But it is **not safe as raw text in a public repo for operational reasons**: GitHub push-protection covers Google secrets **by default** (changelog 2026-03), so a raw `GOCSPX-` literal blocks contributor `git push`, and GitHub↔Google partner token-scanning may **auto-revoke** the credential once it lands in a public commit. Decision: do **not** commit the raw literal — - - **(b) inject at runtime via env is the default** (no secret in the repo at all): cleanest provenance, - consistent with the §5.3 env-route preference, removes the ToS/leak surface entirely. Costs the bundled - zero-config UX, which fleet/pod deployments don't need (they set env already). - - **(a) encode-at-rest** (split/base64) is the **fallback** only where a bundled zero-config binary is - genuinely required: it is **scanner-evasion for an already-public value, *not* a security control** — - label it as such inline so reviewers aren't misled. This is a well-adopted pattern: **rclone** declares a - `rcloneEncryptedClientSecret` constant and calls `obscure.MustReveal()` at runtime for exactly this - reason (bypass static scanners / partner auto-revoke, explicitly not encryption) — see §10. + - **(a) encode-at-rest is the default** (split/base64, decoded at runtime): keeps the bundled zero-config + UX while dodging push-protection and partner auto-revoke. It is **scanner-evasion for an already-public + value, *not* a security control** — label it as such inline so reviewers aren't misled into treating it + as a real secret. Well-adopted pattern: **rclone** declares a `rcloneEncryptedClientSecret` constant and + calls `obscure.MustReveal()` at runtime for exactly this reason (bypass static scanners / partner + auto-revoke, explicitly not encryption) — see §10. + - **(b) inject at runtime via env is the alternative** (no secret in the repo at all): cleanest provenance, + consistent with the §5.3 env-route preference; preferred where a deployment already sets env (fleet/pod) + and the bundled zero-config UX isn't needed. + Empirically the ecosystem overwhelmingly commits the literal plaintext (survey: 99/107 ≈ 93%; obscure 4, + env 1) and the shared secret is **not** being aggressively auto-revoked despite 100+ public copies — so the + risk we mitigate is mainly **contributor push-protection friction on the `openabdev/openab` org repo**, not + credential loss. Encode-at-rest buys that with zero UX cost, hence the default. agy secret requirement is now **CONFIRMED** (2026-06-24): agy *does* require a `GOCSPX-…` client_secret, a - public ecosystem constant (≥20 antigravity-auth repos hardcode it — §6). Under the (b) env default it is - injected, never committed; (a) applies only if agy is ever shipped as a bundled zero-config binary. + public ecosystem constant (≥20 antigravity-auth repos hardcode it — §6); it ships encode-at-rest by default + per this decision (or via env where (b) applies). 3. **Vendor go/no-go — DECIDED (Brett 2026-06-24).** - **GO:** `gemini`, `grok` (high value, clean APIs, low ToS risk) — first wave. - **GO (experimental, opt-in):** `agy`. Marginal eng cost is low — it shares gemini's Google Code-Assist From f3701103c9f6c79f9a600e561bb1827b798e931f Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 01:18:44 +0800 Subject: [PATCH 10/18] =?UTF-8?q?feat(auth):=20cross-process=20locking=20f?= =?UTF-8?q?or=20auth.json=20=E2=80=94=20codex=20+=20MCP=20(ADR=20=C2=A75.4?= =?UTF-8?q?)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements the ADR §5.4 invariant. auth.json had NO lock: provider save_tokens and McpCredentialStore::save/clear each did an independent unlocked read-modify-write, so a concurrent codex refresh and MCP save last-writer-wins the whole map, and N processes (one openab-agent per Discord thread) could each refresh the codex token with the same RT_old → OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix) with a non-unix no-op: - with_auth_locked: global exclusive lock across the re-read -> mutate -> atomic-write. ALL writers funnel through it — save_tokens (codex) and McpCredentialStore::save/clear (MCP) — so writers merge onto the latest on-disk state instead of lost-updating. Held only for the fast file RMW, never across network I/O. - lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token / force_refresh take the codex tenant lock (non-blocking try + async backoff + 10s timeout, held across the network refresh), with a double-checked re-read so a process that waited adopts the token another already refreshed → exactly one real refresh per tenant per expiry, no RT_old reuse. Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was optimistic). New test asserts the locked RMW merges codex + MCP tenants without lost-update. fmt + clippy -D warnings + test (191 passed) green. Known gap (for review): the MCP *network* refresh is owned by rmcp's AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP writes get the file-integrity lock but MCP refreshes are not yet tenant- serialised. Closing that needs an rmcp-level hook — proposed as follow-up. Co-Authored-By: Claude Opus 4.8 --- openab-agent/src/auth.rs | 255 ++++++++++++++++++++++++++++++++++----- 1 file changed, 222 insertions(+), 33 deletions(-) diff --git a/openab-agent/src/auth.rs b/openab-agent/src/auth.rs index d9a701237..24040fa0a 100644 --- a/openab-agent/src/auth.rs +++ b/openab-agent/src/auth.rs @@ -214,6 +214,145 @@ fn write_auth_file(path: &Path, map: &HashMap) -> Result<()> Ok(()) } +// ── auth.json cross-process locking (ADR §5.4) ────────────────────────────── +// +// `auth.json` is written by multiple processes (one openab-agent per Discord +// thread) and by two code paths within each (`save_tokens` for the codex tenant +// + `McpCredentialStore` for MCP servers). Two hazards, two locks: +// +// (a) File integrity — every read-modify-write funnels through `with_auth_locked`, +// which holds an exclusive `flock` on an `auth.json.global.lock` sidecar +// across the re-read → mutate → atomic-write. The re-read *inside* the lock +// is what makes concurrent writers merge instead of lost-update. +// (b) Refresh-token rotation — `lock_tenant_refresh` serialises the network +// refresh per tenant so concurrent processes present a rotated `RT_old` +// only once, never tripping OAuth 2.1 §10.4 token-family revocation. +// +// `flock(2)` (not a sentinel lockfile) so the kernel auto-releases on fd close / +// process death — no stale lock, no orphan cleanup. The lock lives on a sidecar, +// never on `auth.json` itself, because the atomic tmp+rename swaps that inode out +// from under any lock held on it. `#[cfg(unix)]`; a non-unix build is a no-op +// (openab-agent is de-facto unix-only — see `write_auth_file`). + +/// Sidecar lock path `auth.json..lock`, next to the auth file so a +/// test-injected tempdir locks its own sidecar rather than the real `$HOME` one. +#[cfg(unix)] +fn lock_path_for(auth: &Path, suffix: &str) -> PathBuf { + let dir = auth.parent().unwrap_or_else(|| Path::new(".")); + dir.join(format!("auth.json.{suffix}.lock")) +} + +/// RAII guard releasing the advisory lock on drop. The kernel also drops it on +/// fd close / process death, so a crashed holder never wedges the file. +#[cfg(unix)] +struct AuthFileLock { + file: std::fs::File, +} + +#[cfg(unix)] +impl Drop for AuthFileLock { + fn drop(&mut self) { + use std::os::unix::io::AsRawFd; + // SAFETY: `self.file` owns a valid fd; flock has no memory effects. + unsafe { libc::flock(self.file.as_raw_fd(), libc::LOCK_UN) }; + } +} + +#[cfg(unix)] +fn open_lock_file(lock: &Path) -> Result { + use std::os::unix::fs::OpenOptionsExt; + if let Some(dir) = lock.parent() { + std::fs::create_dir_all(dir)?; + } + Ok(std::fs::OpenOptions::new() + .write(true) + .create(true) + .truncate(false) + .mode(0o600) + .open(lock)?) +} + +/// Blocking exclusive lock. Used ONLY for the global file RMW, which performs no +/// network I/O while held, so acquisition blocks at most for another process's +/// fast tmp+rename — never for a slow refresh (those take the per-tenant lock). +#[cfg(unix)] +fn flock_exclusive(lock: &Path) -> Result { + use std::os::unix::io::AsRawFd; + let file = open_lock_file(lock)?; + // SAFETY: valid fd held by `file`; flock has no memory effects. + let rc = unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_EX) }; + if rc != 0 { + return Err(std::io::Error::last_os_error().into()); + } + Ok(AuthFileLock { file }) +} + +/// Non-blocking exclusive lock; `Ok(None)` means another holder owns it. +#[cfg(unix)] +fn flock_try_exclusive(lock: &Path) -> Result> { + use std::os::unix::io::AsRawFd; + let file = open_lock_file(lock)?; + // SAFETY: valid fd held by `file`; flock has no memory effects. + let rc = unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_EX | libc::LOCK_NB) }; + if rc == 0 { + return Ok(Some(AuthFileLock { file })); + } + let err = std::io::Error::last_os_error(); + if err.raw_os_error() == Some(libc::EWOULDBLOCK) { + Ok(None) + } else { + Err(err.into()) + } +} + +/// (a) File-integrity funnel (ADR §5.4). Holds the global sidecar lock across a +/// re-read → mutate → atomic-write so the codex `save_tokens` path AND the MCP +/// `McpCredentialStore` never lost-update the shared map: each writer merges onto +/// the latest on-disk state. A corrupt file is quarantined by `read_auth_file` +/// and treated as empty (`unwrap_or_default`), matching the prior save behaviour. +fn with_auth_locked( + path: &Path, + f: impl FnOnce(&mut HashMap) -> R, +) -> Result { + #[cfg(unix)] + let _guard = flock_exclusive(&lock_path_for(path, "global"))?; + let mut map = read_auth_file(path).unwrap_or_default(); + let r = f(&mut map); + write_auth_file(path, &map)?; + Ok(r) +} + +/// (b) Per-tenant refresh serialisation (ADR §5.4). The returned guard is held by +/// the caller across the network refresh so concurrent processes do exactly one +/// real refresh per tenant — never presenting a rotated `RT_old` twice (OAuth 2.1 +/// §10.4 family revocation). Non-blocking acquire + async backoff so a refresh in +/// flight elsewhere never blocks this executor thread; on timeout we return `None` +/// and let the caller proceed (degrade to a possible double-refresh, never wedge). +#[cfg(unix)] +async fn lock_tenant_refresh(auth: &Path, tenant: &str) -> Option { + let lock = lock_path_for(auth, &format!("refresh.{tenant}")); + let deadline = std::time::Instant::now() + std::time::Duration::from_secs(10); + loop { + match flock_try_exclusive(&lock) { + Ok(Some(guard)) => return Some(guard), + Ok(None) => { + if std::time::Instant::now() >= deadline { + tracing::warn!( + tenant, + "timed out waiting for refresh lock; proceeding unserialised" + ); + return None; + } + tokio::time::sleep(std::time::Duration::from_millis(100)).await; + } + Err(e) => { + tracing::warn!(tenant, error = %e, "refresh lock unavailable; proceeding unserialised"); + return None; + } + } + } +} + pub fn load_tokens() -> Result { let path = auth_path(); let map = read_auth_file(&path).map_err(|_| { @@ -232,10 +371,9 @@ pub fn load_tokens() -> Result { } fn save_tokens(store: &TokenStore) -> Result<()> { - let path = auth_path(); - let mut map = read_auth_file(&path).unwrap_or_default(); - map.insert(CODEX_NAMESPACE.to_string(), AuthEntry::Token(store.clone())); - write_auth_file(&path, &map) + with_auth_locked(&auth_path(), |map| { + map.insert(CODEX_NAMESPACE.to_string(), AuthEntry::Token(store.clone())); + }) } /// rmcp [`CredentialStore`] backed by the shared `auth.json` file (ADR §6.1 @@ -283,61 +421,89 @@ impl CredentialStore for McpCredentialStore { async fn save(&self, mut credentials: StoredCredentials) -> Result<(), AuthError> { use oauth2::{RefreshToken, TokenResponse}; - let mut map = read_auth_file(&self.path).unwrap_or_default(); // OAuth 2.1 §10.4: when a refresh response omits `refresh_token`, the // prior one stays valid. rmcp's `refresh_token()` rebuilds the stored // credentials from the refresh response alone, so a rotating-but-omitting // AS would lose our fallback — splice the prior refresh_token back in. + // The prior-read happens inside the lock (re-read) so two writers can't + // race the splice. The whole RMW funnels through `with_auth_locked` so an + // interleaving codex `save_tokens` never lost-updates this MCP entry. let incoming_has_refresh = credentials .token_response .as_ref() .and_then(|tr| tr.refresh_token()) .is_some_and(|rt| !rt.secret().is_empty()); - if !incoming_has_refresh { - if let Some(AuthEntry::Mcp(old)) = map.get(&self.key) { - let prior = old - .token_response - .as_ref() - .and_then(|tr| tr.refresh_token()) - .map(|rt| rt.secret().to_string()) - .filter(|s| !s.is_empty()); + let key = self.key.clone(); + with_auth_locked(&self.path, move |map| { + if !incoming_has_refresh { + let prior = match map.get(&key) { + Some(AuthEntry::Mcp(old)) => old + .token_response + .as_ref() + .and_then(|tr| tr.refresh_token()) + .map(|rt| rt.secret().to_string()) + .filter(|s| !s.is_empty()), + _ => None, + }; if let (Some(prior), Some(tr)) = (prior, credentials.token_response.as_mut()) { tr.set_refresh_token(Some(RefreshToken::new(prior))); } } - } - - map.insert(self.key.clone(), AuthEntry::Mcp(credentials)); - write_auth_file(&self.path, &map).map_err(|e| AuthError::InternalError(e.to_string())) + map.insert(key.clone(), AuthEntry::Mcp(credentials)); + }) + .map_err(|e| AuthError::InternalError(e.to_string())) } async fn clear(&self) -> Result<(), AuthError> { - let mut map = match read_auth_file(&self.path) { - Ok(m) => m, - Err(_) => return Ok(()), + // Same global lock as every other writer; preserves the "last entry + // removed → delete the file" behaviour rather than leaving an empty map. + let run = || -> Result<()> { + #[cfg(unix)] + let _guard = flock_exclusive(&lock_path_for(&self.path, "global"))?; + let mut map = match read_auth_file(&self.path) { + Ok(m) => m, + Err(_) => return Ok(()), + }; + if map.remove(&self.key).is_none() { + return Ok(()); + } + if map.is_empty() { + let _ = std::fs::remove_file(&self.path); + return Ok(()); + } + write_auth_file(&self.path, &map) }; - if map.remove(&self.key).is_none() { - return Ok(()); - } - if map.is_empty() { - let _ = std::fs::remove_file(&self.path); - return Ok(()); - } - write_auth_file(&self.path, &map).map_err(|e| AuthError::InternalError(e.to_string())) + run().map_err(|e| AuthError::InternalError(e.to_string())) } } pub async fn get_valid_token() -> Result { - let mut store = load_tokens()?; - if store.is_expired() { - store = refresh_token(&store).await?; - save_tokens(&store)?; + // 1. Fast path: a fresh token needs no lock. + let store = load_tokens()?; + if !store.is_expired() { + return Ok(store.access_token); } - Ok(store.access_token) + // 2. Serialise the refresh for the codex tenant — held across the network + // call so a second process does not present the same RT_old (§5.4 (b)). + #[cfg(unix)] + let _refresh_guard = lock_tenant_refresh(&auth_path(), CODEX_NAMESPACE).await; + // 3. Double-check: another process may have refreshed while we waited. + let store = load_tokens()?; + if !store.is_expired() { + return Ok(store.access_token); + } + // 4. Exactly one network refresh per tenant per expiry (tenant lock held). + let fresh = refresh_token(&store).await?; + // 5. Commit under the global file lock. + save_tokens(&fresh)?; + Ok(fresh.access_token) } pub async fn force_refresh() -> Result { + // Serialise even a forced refresh so two of them can't both rotate RT_old. + #[cfg(unix)] + let _refresh_guard = lock_tenant_refresh(&auth_path(), CODEX_NAMESPACE).await; let store = load_tokens()?; let new_store = refresh_token(&store).await?; save_tokens(&new_store)?; @@ -1090,4 +1256,27 @@ mod tests { let pending = map.get("mcp-pending:srv"); assert!(matches!(pending, Some(AuthEntry::Pending(_)))); } + + #[test] + fn with_auth_locked_merges_concurrent_tenants_no_lost_update() { + // Two locked RMWs against the same file — the codex tenant and an MCP + // tenant — must both survive: the second writer re-reads inside the lock + // and merges, instead of clobbering the first (the §5.4 lost-update fix). + let dir = tempfile::tempdir().unwrap(); + let path = dir.path().join("auth.json"); + + with_auth_locked(&path, |m| { + m.insert("codex".to_string(), AuthEntry::Token(make_store(1))); + }) + .unwrap(); + with_auth_locked(&path, |m| { + m.insert("github".to_string(), AuthEntry::Mcp(make_mcp_creds())); + }) + .unwrap(); + + let map = read_auth_file(&path).unwrap(); + assert_eq!(map.len(), 2, "second write merged, did not lost-update"); + assert_eq!(token_of(map.get("codex")).expires_at, 1); + assert!(matches!(map.get("github"), Some(AuthEntry::Mcp(_)))); + } } From 38007e28eb8b46e1b566b46e1d48b638be7b2ee0 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 01:24:24 +0800 Subject: [PATCH 11/18] =?UTF-8?q?feat(auth):=20address=20Mira=20code=20rev?= =?UTF-8?q?iew=20=E2=80=94=20pending=20GC=20+=20portable=20WouldBlock?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round-2 code review (Mira): - §7 pending-entry GC: add created_at to PendingPasteLogin and sweep AuthEntry::Pending older than 15 min inside with_auth_locked on every write, so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is currently a legacy tombstone with no live writer (PKCE state lives in rmcp's in-memory StateStore); the field + GC land now per ADR §7 and are forward-compatible with the forthcoming /auth two-step flow, and meanwhile sweep legacy stray entries (created_at default 0 reads as ancient). New test covers stale-swept / fresh-kept / real-tenant-untouched. - flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD. - MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) — Mira concurs. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 --- openab-agent/src/auth.rs | 70 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 69 insertions(+), 1 deletion(-) diff --git a/openab-agent/src/auth.rs b/openab-agent/src/auth.rs index 24040fa0a..25c3827db 100644 --- a/openab-agent/src/auth.rs +++ b/openab-agent/src/auth.rs @@ -74,6 +74,14 @@ pub struct PendingPasteLogin { /// (written before this field existed) deserializable. #[serde(default)] pub resource: Option, + /// Unix-seconds stamp set when the pending entry is written; `with_auth_locked` + /// expires entries older than 15 min (ADR §7) so an abandoned `/auth` two-step + /// (verifier written, code never pasted) doesn't accumulate. The two-step flow + /// that *writes* pending state is forthcoming; today this struct is a legacy + /// read-tolerant tombstone, so the field exists mainly for the GC. `#[serde(default)]` + /// (= 0) reads pre-existing/legacy entries as ancient → swept on the next write. + #[serde(default)] + pub created_at: u64, } /// `auth.json` value type. Untagged Serde enum: `TokenStore` has required @@ -298,7 +306,9 @@ fn flock_try_exclusive(lock: &Path) -> Result> { return Ok(Some(AuthFileLock { file })); } let err = std::io::Error::last_os_error(); - if err.raw_os_error() == Some(libc::EWOULDBLOCK) { + // flock can report EWOULDBLOCK or EAGAIN depending on libc/BSD; both map to + // `ErrorKind::WouldBlock`, so use the std abstraction (Mira review). + if err.kind() == std::io::ErrorKind::WouldBlock { Ok(None) } else { Err(err.into()) @@ -318,10 +328,27 @@ fn with_auth_locked( let _guard = flock_exclusive(&lock_path_for(path, "global"))?; let mut map = read_auth_file(path).unwrap_or_default(); let r = f(&mut map); + gc_stale_pending(&mut map); write_auth_file(path, &map)?; Ok(r) } +/// Opportunistic GC (ADR §7): drop `AuthEntry::Pending` entries older than 15 min +/// on every locked write, so abandoned `/auth` two-step attempts don't accumulate. +/// `created_at == 0` (legacy/unstamped entries) reads as ancient and is swept. +const PENDING_TTL_SECONDS: u64 = 15 * 60; + +fn gc_stale_pending(map: &mut HashMap) { + let now = SystemTime::now() + .duration_since(UNIX_EPOCH) + .unwrap_or_default() + .as_secs(); + map.retain(|_, entry| match entry { + AuthEntry::Pending(p) => now.saturating_sub(p.created_at) <= PENDING_TTL_SECONDS, + _ => true, + }); +} + /// (b) Per-tenant refresh serialisation (ADR §5.4). The returned guard is held by /// the caller across the network refresh so concurrent processes do exactly one /// real refresh per tenant — never presenting a rotated `RT_old` twice (OAuth 2.1 @@ -1013,6 +1040,7 @@ mod tests { token_url: "https://example.com/token".to_string(), provider_name: "anthropic-mcp".to_string(), resource: None, + created_at: 0, } } @@ -1279,4 +1307,44 @@ mod tests { assert_eq!(token_of(map.get("codex")).expires_at, 1); assert!(matches!(map.get("github"), Some(AuthEntry::Mcp(_)))); } + + #[test] + fn with_auth_locked_gcs_stale_pending_but_keeps_fresh_and_tokens() { + // ADR §7: a locked write opportunistically sweeps `Pending` entries older + // than 15 min, while fresh pending state and real tenants are untouched. + let dir = tempfile::tempdir().unwrap(); + let path = dir.path().join("auth.json"); + let now = SystemTime::now() + .duration_since(UNIX_EPOCH) + .unwrap() + .as_secs(); + // Seed through the un-GC'd writer so the stale entry exists pre-sweep. + let mut seed = HashMap::new(); + seed.insert("codex".to_string(), AuthEntry::Token(make_store(1))); + seed.insert( + "mcp-pending:stale".to_string(), + AuthEntry::Pending(make_pending()), // created_at = 0 → ancient + ); + seed.insert( + "mcp-pending:fresh".to_string(), + AuthEntry::Pending(PendingPasteLogin { + created_at: now, + ..make_pending() + }), + ); + write_auth_file(&path, &seed).unwrap(); + + with_auth_locked(&path, |_m| {}).unwrap(); + + let map = read_auth_file(&path).unwrap(); + assert!( + map.get("mcp-pending:stale").is_none(), + "stale pending swept" + ); + assert!( + matches!(map.get("mcp-pending:fresh"), Some(AuthEntry::Pending(_))), + "fresh pending kept" + ); + assert!(map.get("codex").is_some(), "real tenant untouched"); + } } From f53f4631d56b700dd3395924d7b80bd1e817000b Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 03:04:40 +0800 Subject: [PATCH 12/18] =?UTF-8?q?feat(auth):=20serialise=20MCP=20token=20r?= =?UTF-8?q?efresh=20cross-process=20(close=20=C2=A75.4=20(b)=20gap)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the known gap from f370110: MCP refreshes now get the same per-tenant serialisation as codex. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in McpRuntimeManager::resolve_oauth_dial via client.get_access_token(). Wrap that call (per-server) with the existing auth::lock_tenant_refresh so only one process refreshes a given server at a time. No explicit double-check needed: rmcp's get_access_token re-load()s auth.json each call and returns the cached token without a network refresh when remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the race adopts the token the winner wrote to the shared file — no second RT_old presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights within one process via its AuthorizationManager Mutex; this closes the cross-process gap. - auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module. - ADR §5.4: document the resolve_oauth_dial serialisation point. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 --- docs/adr/openab-agent-oauth.md | 6 +++++- openab-agent/src/auth.rs | 4 ++-- openab-agent/src/mcp/runtime.rs | 19 +++++++++++++++---- 3 files changed, 22 insertions(+), 7 deletions(-) diff --git a/docs/adr/openab-agent-oauth.md b/docs/adr/openab-agent-oauth.md index 2e688af1e..753e86fb6 100644 --- a/docs/adr/openab-agent-oauth.md +++ b/docs/adr/openab-agent-oauth.md @@ -207,7 +207,11 @@ fn get_or_refresh(tenant: &str) -> Result { `ci-openab-agent.yml` is linux, deploy is always container; Windows binaries are the broker only.) - Each MCP `mcp:` tenant takes its own tenant lock by the same rule, so the MCP `CredentialStore` refreshes are serialized per server too — the invariant serves it directly (see `openab-agent-mcp.md` - §6.1). + §6.1). rmcp owns the MCP refresh internally (no pre-refresh `CredentialStore` hook), but openab drives it + explicitly at `resolve_oauth_dial` (`mcp/runtime.rs`) via `client.get_access_token()`; the per-server + tenant lock wraps that call, and rmcp's `get_access_token` re-`load()`s `auth.json` and skips the network + refresh when the token is already fresh, so the lock-loser adopts the winner's token (cross-process + single-flight) without a second `RT_old` presentation. - **Until this lands**, prefer the `CLAUDE_CODE_OAUTH_TOKEN` env route (§5.3 — no refresh write, no race); treat interactive Anthropic OAuth as not-yet-hardened for high concurrency. diff --git a/openab-agent/src/auth.rs b/openab-agent/src/auth.rs index 25c3827db..2a3d78305 100644 --- a/openab-agent/src/auth.rs +++ b/openab-agent/src/auth.rs @@ -253,7 +253,7 @@ fn lock_path_for(auth: &Path, suffix: &str) -> PathBuf { /// RAII guard releasing the advisory lock on drop. The kernel also drops it on /// fd close / process death, so a crashed holder never wedges the file. #[cfg(unix)] -struct AuthFileLock { +pub(crate) struct AuthFileLock { file: std::fs::File, } @@ -356,7 +356,7 @@ fn gc_stale_pending(map: &mut HashMap) { /// flight elsewhere never blocks this executor thread; on timeout we return `None` /// and let the caller proceed (degrade to a possible double-refresh, never wedge). #[cfg(unix)] -async fn lock_tenant_refresh(auth: &Path, tenant: &str) -> Option { +pub(crate) async fn lock_tenant_refresh(auth: &Path, tenant: &str) -> Option { let lock = lock_path_for(auth, &format!("refresh.{tenant}")); let deadline = std::time::Instant::now() + std::time::Duration::from_secs(10); loop { diff --git a/openab-agent/src/mcp/runtime.rs b/openab-agent/src/mcp/runtime.rs index 0c8ac7936..46c038254 100644 --- a/openab-agent/src/mcp/runtime.rs +++ b/openab-agent/src/mcp/runtime.rs @@ -631,10 +631,21 @@ impl McpRuntimeManager { if !has_refresh { return Err(needs_login()); } - // Expired but refreshable: rmcp needs the authorization server's - // metadata configured before it can exchange the refresh token, so - // discover it now (network) and let `get_access_token` perform the - // rotation. Any failure surfaces as user-actionable `NeedsAuth`. + // Expired but refreshable. Serialise the refresh per-server across + // processes (ADR §5.4 (b)): openab runs one process per Discord thread, + // all sharing this server's stored creds, so two of them would otherwise + // present the same `RT_old` and trip OAuth 2.1 §10.4 token-family + // revocation. rmcp already single-flights within one process (shared + // `AuthorizationManager` `Mutex`); this closes the cross-process gap. + // The lock is held across the network refresh below. The double-check is + // implicit: rmcp's `get_access_token` re-`load()`s `auth.json` and skips + // the network refresh when the token is already fresh, so a process that + // loses the race adopts the token the winner wrote. Non-unix = no-op. + #[cfg(unix)] + let _refresh_guard = crate::auth::lock_tenant_refresh(&self.auth_path, name).await; + // rmcp needs the authorization server's metadata configured before it can + // exchange the refresh token, so discover it now (network) and let + // `get_access_token` perform the rotation. Failures surface as `NeedsAuth`. { let mut mgr = client.auth_manager.lock().await; mgr.initialize_from_store().await.map_err(|e| { From 842c70feccd911b341f5b5dbd0cb818fbc2acdbc Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 04:00:14 +0800 Subject: [PATCH 13/18] =?UTF-8?q?refactor(auth):=20/simplify=20cleanups=20?= =?UTF-8?q?=E2=80=94=20dedup=20lock=20acquire,=20reuse=20refresh=20fd?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Quality-only cleanups from a 4-angle /simplify pass (no behaviour change): - Extract lock_global(path) -> Result> and route both with_auth_locked and McpCredentialStore::clear through it, so the "global" sidecar name + the cfg(unix) acquire live in one place instead of two copy-pasted blocks. clear flattens (drop the inner run closure). - lock_tenant_refresh opens the lock fd once and re-issues flock on it each retry, instead of re-opening (and re-create_dir_all-ing) the file every 100ms under contention. Removes the now-unused flock_try_exclusive helper. - Document lock_tenant_refresh's double-check contract (the lock only serialises; callers must re-check freshness after acquiring). fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 --- openab-agent/src/auth.rs | 121 ++++++++++++++++++++++----------------- 1 file changed, 67 insertions(+), 54 deletions(-) diff --git a/openab-agent/src/auth.rs b/openab-agent/src/auth.rs index 2a3d78305..34bb24f44 100644 --- a/openab-agent/src/auth.rs +++ b/openab-agent/src/auth.rs @@ -295,23 +295,19 @@ fn flock_exclusive(lock: &Path) -> Result { Ok(AuthFileLock { file }) } -/// Non-blocking exclusive lock; `Ok(None)` means another holder owns it. -#[cfg(unix)] -fn flock_try_exclusive(lock: &Path) -> Result> { - use std::os::unix::io::AsRawFd; - let file = open_lock_file(lock)?; - // SAFETY: valid fd held by `file`; flock has no memory effects. - let rc = unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_EX | libc::LOCK_NB) }; - if rc == 0 { - return Ok(Some(AuthFileLock { file })); +/// Acquire the global `auth.json` write lock (a no-op `None` guard off-unix). +/// Both `with_auth_locked` and `McpCredentialStore::clear` — which needs a +/// delete-on-empty tail the funnel can't express — acquire here, so the +/// `"global"` sidecar name and the acquire policy live in exactly one place. +fn lock_global(path: &Path) -> Result> { + #[cfg(unix)] + { + Ok(Some(flock_exclusive(&lock_path_for(path, "global"))?)) } - let err = std::io::Error::last_os_error(); - // flock can report EWOULDBLOCK or EAGAIN depending on libc/BSD; both map to - // `ErrorKind::WouldBlock`, so use the std abstraction (Mira review). - if err.kind() == std::io::ErrorKind::WouldBlock { + #[cfg(not(unix))] + { + let _ = path; Ok(None) - } else { - Err(err.into()) } } @@ -324,8 +320,7 @@ fn with_auth_locked( path: &Path, f: impl FnOnce(&mut HashMap) -> R, ) -> Result { - #[cfg(unix)] - let _guard = flock_exclusive(&lock_path_for(path, "global"))?; + let _guard = lock_global(path)?; let mut map = read_auth_file(path).unwrap_or_default(); let r = f(&mut map); gc_stale_pending(&mut map); @@ -352,31 +347,51 @@ fn gc_stale_pending(map: &mut HashMap) { /// (b) Per-tenant refresh serialisation (ADR §5.4). The returned guard is held by /// the caller across the network refresh so concurrent processes do exactly one /// real refresh per tenant — never presenting a rotated `RT_old` twice (OAuth 2.1 -/// §10.4 family revocation). Non-blocking acquire + async backoff so a refresh in -/// flight elsewhere never blocks this executor thread; on timeout we return `None` -/// and let the caller proceed (degrade to a possible double-refresh, never wedge). +/// §10.4 family revocation). Non-blocking acquire on a single fd + async backoff +/// so a refresh in flight elsewhere never blocks this executor thread; on timeout +/// we return `None` and let the caller proceed (degrade to a possible +/// double-refresh, never wedge). +/// +/// The lock only *serialises* — it does not decide whether to refresh. Callers +/// MUST re-check token freshness after acquiring (the double-check) so a process +/// that lost the race adopts the token the winner wrote instead of refreshing +/// again: `get_valid_token` does this explicitly; the MCP path relies on rmcp's +/// `get_access_token` re-`load()`ing the store. #[cfg(unix)] pub(crate) async fn lock_tenant_refresh(auth: &Path, tenant: &str) -> Option { + use std::os::unix::io::AsRawFd; let lock = lock_path_for(auth, &format!("refresh.{tenant}")); + // Open the lock fd once; re-issue `flock` on it each retry instead of + // re-opening (and re-`create_dir_all`-ing) the same file every 100 ms. + let file = match open_lock_file(&lock) { + Ok(f) => f, + Err(e) => { + tracing::warn!(tenant, error = %e, "refresh lock unavailable; proceeding unserialised"); + return None; + } + }; let deadline = std::time::Instant::now() + std::time::Duration::from_secs(10); loop { - match flock_try_exclusive(&lock) { - Ok(Some(guard)) => return Some(guard), - Ok(None) => { - if std::time::Instant::now() >= deadline { - tracing::warn!( - tenant, - "timed out waiting for refresh lock; proceeding unserialised" - ); - return None; - } - tokio::time::sleep(std::time::Duration::from_millis(100)).await; - } - Err(e) => { - tracing::warn!(tenant, error = %e, "refresh lock unavailable; proceeding unserialised"); - return None; - } + // SAFETY: valid fd held by `file`; flock has no memory effects. + let rc = unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_EX | libc::LOCK_NB) }; + if rc == 0 { + return Some(AuthFileLock { file }); + } + let err = std::io::Error::last_os_error(); + // EWOULDBLOCK/EAGAIN (both `ErrorKind::WouldBlock`) = another holder is + // refreshing; any other errno is a real failure we degrade on. + if err.kind() != std::io::ErrorKind::WouldBlock { + tracing::warn!(tenant, error = %err, "refresh lock unavailable; proceeding unserialised"); + return None; + } + if std::time::Instant::now() >= deadline { + tracing::warn!( + tenant, + "timed out waiting for refresh lock; proceeding unserialised" + ); + return None; } + tokio::time::sleep(std::time::Duration::from_millis(100)).await; } } @@ -483,25 +498,23 @@ impl CredentialStore for McpCredentialStore { } async fn clear(&self) -> Result<(), AuthError> { - // Same global lock as every other writer; preserves the "last entry - // removed → delete the file" behaviour rather than leaving an empty map. - let run = || -> Result<()> { - #[cfg(unix)] - let _guard = flock_exclusive(&lock_path_for(&self.path, "global"))?; - let mut map = match read_auth_file(&self.path) { - Ok(m) => m, - Err(_) => return Ok(()), - }; - if map.remove(&self.key).is_none() { - return Ok(()); - } - if map.is_empty() { - let _ = std::fs::remove_file(&self.path); - return Ok(()); - } - write_auth_file(&self.path, &map) + // Same global lock as every other writer, but with a delete-on-empty tail + // `with_auth_locked` can't express (it always writes), so `clear` acquires + // the shared `lock_global` directly rather than funnelling through it. + let _guard = + lock_global(&self.path).map_err(|e| AuthError::InternalError(e.to_string()))?; + let mut map = match read_auth_file(&self.path) { + Ok(m) => m, + Err(_) => return Ok(()), }; - run().map_err(|e| AuthError::InternalError(e.to_string())) + if map.remove(&self.key).is_none() { + return Ok(()); + } + if map.is_empty() { + let _ = std::fs::remove_file(&self.path); + return Ok(()); + } + write_auth_file(&self.path, &map).map_err(|e| AuthError::InternalError(e.to_string())) } } From 4e6f2467ca23081f7581f4cf8ba77a26ea72cdd6 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 04:07:13 +0800 Subject: [PATCH 14/18] =?UTF-8?q?docs(auth):=20/code-review=20=E2=80=94=20?= =?UTF-8?q?harden=20GC=20+=20refresh-lock=20contract=20comments?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Doc-only clarifications from a /code-review pass (no behaviour change): - PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending entry MUST stamp created_at, else gc_stale_pending sweeps it on the next locked write (latent footgun for the forthcoming /auth two-step writer). - lock_tenant_refresh: correct the contract — reuse-safety comes from loading the refresh token INSIDE the lock (so force_refresh, which always refreshes on a 401, is reuse-safe too); the post-lock expiry re-check is only an optimisation to skip a redundant refresh. Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP server literally named "codex" collides with the codex tenant's auth.json key and refresh lock — committed MCP creds use the bare server name, not mcp: as ADR §5.4 describes) — flagged for a separate change. Co-Authored-By: Claude Opus 4.8 --- openab-agent/src/auth.rs | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/openab-agent/src/auth.rs b/openab-agent/src/auth.rs index 34bb24f44..a11ec5658 100644 --- a/openab-agent/src/auth.rs +++ b/openab-agent/src/auth.rs @@ -80,6 +80,11 @@ pub struct PendingPasteLogin { /// that *writes* pending state is forthcoming; today this struct is a legacy /// read-tolerant tombstone, so the field exists mainly for the GC. `#[serde(default)]` /// (= 0) reads pre-existing/legacy entries as ancient → swept on the next write. + /// + /// **Any code that writes a fresh `Pending` entry MUST set `created_at` to the + /// current Unix time** — an unstamped (0) entry is treated as ancient and is + /// swept by `gc_stale_pending` on the very next locked write, so a verifier + /// written without stamping would vanish before the user pastes the code. #[serde(default)] pub created_at: u64, } @@ -352,11 +357,15 @@ fn gc_stale_pending(map: &mut HashMap) { /// we return `None` and let the caller proceed (degrade to a possible /// double-refresh, never wedge). /// -/// The lock only *serialises* — it does not decide whether to refresh. Callers -/// MUST re-check token freshness after acquiring (the double-check) so a process -/// that lost the race adopts the token the winner wrote instead of refreshing -/// again: `get_valid_token` does this explicitly; the MCP path relies on rmcp's -/// `get_access_token` re-`load()`ing the store. +/// Reuse-safety comes from loading the refresh token *inside* the lock: a +/// process that waited then loads the token the winner just wrote, so it never +/// re-presents a rotated `RT_old`. Re-checking expiry after acquiring is an +/// additional optimisation that skips a redundant network refresh — `get_valid_token` +/// does this explicitly, and the MCP path gets it free from rmcp's `get_access_token` +/// (which re-`load()`s and returns early when the token is already fresh). +/// `force_refresh` intentionally skips that optimisation and always refreshes (it +/// runs on a 401, where the clock-fresh token is already known-bad); it stays +/// reuse-safe because it, too, loads inside the lock. #[cfg(unix)] pub(crate) async fn lock_tenant_refresh(auth: &Path, tenant: &str) -> Option { use std::os::unix::io::AsRawFd; From b40693c31c87f1fa8133f08a30f71b1e860f83c5 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 04:10:43 +0800 Subject: [PATCH 15/18] build(openab-agent): declare standalone [workspace] to fix CI ci-openab-agent.yml runs cargo fmt/clippy/test/build with working-directory: openab-agent, but the crate is not a member of the parent openab workspace (members = crates/openab-core, openab-gateway) and was not excluded, so cargo errors 'current package believes it's in a workspace when it's not' and every step fails before doing any work. openab-agent is intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so the correct fix is an empty [workspace] table making it its own root. Co-Authored-By: Claude Opus 4.8 --- openab-agent/Cargo.toml | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/openab-agent/Cargo.toml b/openab-agent/Cargo.toml index 5f146ca21..57f3d55ae 100644 --- a/openab-agent/Cargo.toml +++ b/openab-agent/Cargo.toml @@ -50,3 +50,11 @@ libc = "0.2" [dev-dependencies] tempfile = "3" temp-env = "0.3.6" + +# openab-agent is a standalone crate (its own version + a dual reqwest 0.12/0.13 +# dep set for rmcp), deliberately NOT a member of the parent `openab` workspace +# (members = crates/openab-core, openab-gateway). Declare an empty workspace so +# cargo, run from this directory, treats it as its own root instead of erroring +# "current package believes it's in a workspace when it's not" — which otherwise +# breaks every cargo step in ci-openab-agent.yml (working-directory: openab-agent). +[workspace] From 752be4c1106d7dd6774cb1549b36ffa8feb89f95 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 04:24:10 +0800 Subject: [PATCH 16/18] Revert "build(openab-agent): declare standalone [workspace] to fix CI" This reverts commit b40693c31c87f1fa8133f08a30f71b1e860f83c5. --- openab-agent/Cargo.toml | 8 -------- 1 file changed, 8 deletions(-) diff --git a/openab-agent/Cargo.toml b/openab-agent/Cargo.toml index 57f3d55ae..5f146ca21 100644 --- a/openab-agent/Cargo.toml +++ b/openab-agent/Cargo.toml @@ -50,11 +50,3 @@ libc = "0.2" [dev-dependencies] tempfile = "3" temp-env = "0.3.6" - -# openab-agent is a standalone crate (its own version + a dual reqwest 0.12/0.13 -# dep set for rmcp), deliberately NOT a member of the parent `openab` workspace -# (members = crates/openab-core, openab-gateway). Declare an empty workspace so -# cargo, run from this directory, treats it as its own root instead of erroring -# "current package believes it's in a workspace when it's not" — which otherwise -# breaks every cargo step in ci-openab-agent.yml (working-directory: openab-agent). -[workspace] From 212d4bbf43366cfc00ea125487421510924bd7d8 Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 04:25:18 +0800 Subject: [PATCH 17/18] ci(openab-agent): append [workspace] in CI + harden ACP smoke test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ci-openab-agent.yml ran cargo from working-directory: openab-agent without the [workspace] table the crate needs (it's standalone, not a parent-workspace member), so every step failed with 'believes it's in a workspace when it's not' — the workflow had been red independently of any PR. Dockerfile.unified already works around this by appending the table at build time; replicate that in the workflow rather than committing it to Cargo.toml (which would double-append in the Dockerfile and break the image build). Also harden the ACP smoke test: build the release binary in its own step and bump the response timeout 5s -> 30s so a loaded runner doesn't flake the agent's first-response window (the binary itself responds in <1s locally, verified incl. a clean-HOME release build). Co-Authored-By: Claude Opus 4.8 --- .github/workflows/ci-openab-agent.yml | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/.github/workflows/ci-openab-agent.yml b/.github/workflows/ci-openab-agent.yml index 11d9290ca..550edb4ac 100644 --- a/.github/workflows/ci-openab-agent.yml +++ b/.github/workflows/ci-openab-agent.yml @@ -24,17 +24,27 @@ jobs: - uses: Swatinem/rust-cache@v2 with: workspaces: openab-agent + # openab-agent is a standalone crate, not a member of the parent openab + # workspace (members = crates/openab-core, openab-gateway) and not excluded, + # so cargo run from this directory errors "believes it's in a workspace when + # it's not". Declare an empty workspace, mirroring Dockerfile.unified, so + # every cargo step below resolves it as its own root. + - run: printf '\n[workspace]\n' >> Cargo.toml - run: cargo fmt --check - run: cargo clippy -- -D warnings - run: cargo test - run: cargo test -- --ignored env: ANTHROPIC_API_KEY: "fake-key-for-ci" + # Build in its own step so a build failure is distinct from a smoke failure + # and the timed run below races only the binary, not a concurrent compile. + - name: Build release binary + run: cargo build --release - name: ACP smoke test run: | - cargo build --release - # Test: initialize returns valid ACP response - RESP=$(echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' | timeout 5 ./target/release/openab-agent 2>/dev/null | head -1) + # Generous timeout so a loaded runner doesn't flake the agent's + # first-response window (the prior 5s was tight right after a release build). + RESP=$(echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' | timeout 30 ./target/release/openab-agent 2>/dev/null | head -1) echo "Response: $RESP" echo "$RESP" | grep -q '"agentInfo"' || (echo "FAIL: no agentInfo in response" && exit 1) echo "$RESP" | grep -q '"openab-agent"' || (echo "FAIL: wrong agent name" && exit 1) From 235945094f8840ce8c62094e6d90f070d8ae41ac Mon Sep 17 00:00:00 2001 From: Brett Chien Date: Thu, 25 Jun 2026 04:37:10 +0800 Subject: [PATCH 18/18] ci(openab-agent): surface ACP smoke-test stderr + exit code for diagnosis The smoke test produces empty stdout only in CI (the binary responds with agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs). Capture stderr + exit code so the next CI run reveals why stdout is empty. --- .github/workflows/ci-openab-agent.yml | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/.github/workflows/ci-openab-agent.yml b/.github/workflows/ci-openab-agent.yml index 550edb4ac..5ba974853 100644 --- a/.github/workflows/ci-openab-agent.yml +++ b/.github/workflows/ci-openab-agent.yml @@ -42,9 +42,14 @@ jobs: run: cargo build --release - name: ACP smoke test run: | - # Generous timeout so a loaded runner doesn't flake the agent's - # first-response window (the prior 5s was tight right after a release build). - RESP=$(echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' | timeout 30 ./target/release/openab-agent 2>/dev/null | head -1) + set +e + echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' \ + | timeout 30 ./target/release/openab-agent > /tmp/acp_out.txt 2> /tmp/acp_err.txt + code=$? + echo "exit code: $code" + echo "--- stdout ---"; cat /tmp/acp_out.txt + echo "--- stderr ---"; cat /tmp/acp_err.txt + RESP=$(head -1 /tmp/acp_out.txt) echo "Response: $RESP" echo "$RESP" | grep -q '"agentInfo"' || (echo "FAIL: no agentInfo in response" && exit 1) echo "$RESP" | grep -q '"openab-agent"' || (echo "FAIL: wrong agent name" && exit 1)