nnemirovsky · nnemirovsky · May 18, 2026 · May 18, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -118,6 +118,14 @@ Two credential types: `static` (default) for API keys and `oauth` for OAuth acce
 
 Runtime flags: `--mcp-base-url` sets the external URL the agent uses to reach sluice's MCP gateway (e.g. `http://sluice:3000`). This is added to `SelfBypass` so sluice does not policy-check its own MCP traffic. Defaults to deriving from `--health-addr`. `--agent <profile>` selects an agent profile (`openclaw`, `hermes`); the profile controls the env file path inside the container, the secrets-reload mechanism, and the MCP wiring command. The default is `openclaw`. May also be set via `SLUICE_AGENT_PROFILE`.
 
+## Channel Feature Parity
+
+Sluice exposes management surfaces over multiple channels: the **CLI**, the **HTTP/REST API** (`api/openapi.yaml`), the **Telegram bot**, and any future channel. **Every store-backed management feature must be reachable from all channels.** When you add or change a feature that reads or writes the SQLite store (policy rules, credentials, bindings, MCP upstreams, credential pools, config/default-verdict, …), implement it on **CLI and REST and Telegram**, not just one.
+
+The mechanism for keeping them honest: put the operation logic in a **channel-agnostic package** (e.g. the `store` methods, or a small `*ops` package like `internal/poolops`) and have each channel be a thin adapter over it. Logic written inline in one channel (the historical cause of the pools-are-CLI-only gap — rotate/status lived in `cmd/sluice/pool.go`) is the anti-pattern; lift it so the channels cannot drift.
+
+The **only** acceptable single-channel features are those with a documented rationale that makes them meaningless elsewhere — e.g. `sluice cert generate` / `sluice audit verify` are local-filesystem/operator tools with no remote-management semantics; OAuth token entry is stdin-only on the CLI specifically to keep secrets out of shell history and out of the REST/Telegram surfaces. State the rationale in the code and docs when you deliberately scope a feature to one channel; absent that rationale, parity is required and a reviewer should block the PR.
+
 ## Agent Profiles
 
 Profiles abstract per-agent runtime conventions so sluice's container managers stay agent-agnostic. Each profile carries `EnvFileRelPath` (where to write phantom-token env vars), `ReloadCmd` (argv to exec for in-place secret reload, or nil), and `WireMCPCmd` (argv to register sluice as an MCP server in the agent's config).

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -97,6 +97,14 @@ make lint-api
 
 Do not edit `internal/api/api.gen.go` manually. It is regenerated from the spec.
 
+## Channel Feature Parity
+
+Sluice manages its state from several channels: the **CLI**, the **REST API**, the **Telegram bot**, and future channels. **A store-backed management feature must land on all channels in the same PR**, not just the one that motivated it. This applies to anything that reads or writes the SQLite store: policy rules, credentials, bindings, MCP upstreams, credential pools, config/default-verdict, etc.
+
+To keep the channels from drifting, put the operation in a **channel-agnostic** place (a `store` method or a small `*ops` package) and make each channel a thin adapter. Do **not** write the logic inline in `cmd/sluice/*` (or only in a REST handler) — that is exactly how credential pools ended up CLI-only.
+
+A feature may be scoped to a single channel **only** with an explicit, documented rationale that makes it meaningless elsewhere (e.g. local operator tools like `sluice cert generate` / `sluice audit verify`; CLI-stdin-only OAuth token entry, which deliberately keeps secrets out of the REST/Telegram surfaces and shell history). State the rationale in the code comment and the relevant doc. Absent that rationale, **reviewers should treat a single-channel feature as incomplete and request parity before merge**, and the PR description should note which channels the change touches.
+
 ## Architecture
 
 - `internal/store/` -- SQLite-backed policy store for all runtime state

diff --git a/docs/plans/20260518-channel-feature-parity.md b/docs/plans/20260518-channel-feature-parity.md
@@ -0,0 +1,111 @@
+# Channel Feature Parity (pools on HTTP API + Telegram) and Token-Host Grant Scoping
+
+## Overview
+
+Two follow-ups surfaced while operating the credential-pool feature live:
+
+1. **Pools are CLI-only.** `sluice pool create|list|status|rotate|remove` exists, but there are **no HTTP API endpoints** and **no Telegram `/pool` commands**. Every other store-backed management surface (policy, credentials, bindings, MCP upstreams) is reachable from CLI **and** REST **and** Telegram. Pools must reach parity.
+2. **Token-host phantom expansion is grant-blind.** sluice rewrites *every* request to a pool's shared OAuth token host (e.g. `auth.openai.com`), including non-refresh grants. A fresh in-container `codex login --device-auth` (a `device_code` grant) is therefore corrupted into `400 token_exchange_user_error`. The expansion must act only on `grant_type=refresh_token`.
+
+This plan also codifies a general **channel feature-parity principle** (CLI / HTTP API / Telegram / future channels expose the same management features unless there is a documented single-channel rationale) in `CLAUDE.md` and `CONTRIBUTING.md` so future store-backed features do not regress into a single channel again.
+
+## Context
+
+- Store (already complete, channel-agnostic): `internal/store/pools.go` — `CreatePoolWithMembers`, `GetPool`, `ListPools`, `RemovePoolIfUnreferenced`, `PoolsForMember`, `SetCredentialHealthIfPoolMemberEpoch`, `ListCredentialHealth`, `ErrCredentialInUseByPool`, `PoolReferencedError`.
+- Pool resolution / status / rotate logic currently lives in `cmd/sluice/pool.go` (`handlePoolStatus`, `handlePoolRotate`) — the epoch-guarded `SetCredentialHealthIfPoolMemberEpoch(active, name, epoch, "cooldown", until, vault.ManualRotateReason)` write and the `vault.NewPoolResolver(...).ResolveActive` derivation. This is **not** reusable by other channels yet — it must be lifted into a shared package.
+- REST: spec-first. `api/openapi.yaml` is the source of truth; `make generate` regenerates `internal/api/api.gen.go`; handlers implement the generated `ServerInterface` in `internal/api/server.go`. Existing resource patterns to mirror: `/api/credentials`, `/api/bindings`, `/api/mcp/upstreams` (collection + `/{id}` item).
+- Telegram: command dispatch switch in `internal/telegram/commands.go` (~line 245: `case "policy" | "cred" | "mcp"`). Mirror an existing multi-subcommand handler (`/mcp add|list|remove`). Pool create carries **no secrets** (member credential names only) so the `/mcp add` auto-delete-message behavior is **not** needed.
+- Token-host expansion: `internal/proxy/addon.go` — the `[ADDON-INJECT] token-host phantom expansion for pool ... (auth.openai.com)` path (`buildPhantomPairs` / `buildOAuthPhantomPairs` token-host branch). The request body / `grant_type` is available on the flow at injection time.
+- Failover attribution already keys on `grant_type=refresh_token` bodies in `internal/proxy/pool_failover.go` (`classifyFailover` token-endpoint path) — reuse the same form-parse helper rather than writing a new one.
+
+## Development Approach
+
+- **Testing approach**: Regular (code first, then tests)
+- Complete each task fully before moving to the next
+- CRITICAL: every task MUST include new/updated tests
+- CRITICAL: all tests must pass before starting the next task
+- gofumpt for Go formatting; scoped Conventional Commits; PR to `main` (never direct push)
+- REST changes follow the spec-first workflow: edit `api/openapi.yaml` → `make generate` → implement handler → `go test ./internal/api/` → `make lint-api`. Never hand-edit `api.gen.go`.
+
+## Testing Strategy
+
+- **Unit tests**: shared pool-ops package (create/list/status/rotate/remove) success + error paths (namespace collision, static member rejection, in-use-by-pool, unknown pool, epoch-raced rotate).
+- **Unit tests**: REST handlers — happy path + 400 (validation) + 404 (unknown pool) + 409 (`ErrCredentialInUseByPool` / `PoolReferencedError`), mirroring the credentials/bindings handler tests.
+- **Unit tests**: Telegram `/pool` handler — each subcommand success + error rendering.
+- **Unit tests**: token-host expansion — a `refresh_token` grant is still expanded (regression guard); a `device_code` / `authorization_code` grant is **passed through untouched** (fail-before/pass-after).
+- **E2E** (optional, if a pool e2e harness slot is cheap): a pooled upstream managed entirely via REST then via Telegram resolves identically to the CLI path.
+
+## Solution Overview
+
+1. **Lift pool ops into a shared, channel-agnostic package** (`internal/poolops` or a method set on `*store.Store` + a small `vault` helper) exposing `Create`, `List`, `Status`, `Rotate`, `Remove` returning typed results/errors. CLI, REST, and Telegram all call it. The epoch-guarded rotate write and the `ResolveActive`-based status derivation live here exactly once, so the three channels cannot drift (this is the root cause of the parity gap — the logic was written inline in `cmd/sluice`).
+2. **REST**: add `/api/pools` (GET list, POST create), `/api/pools/{name}` (GET status, DELETE remove), `/api/pools/{name}/rotate` (POST). Spec-first.
+3. **Telegram**: add `/pool create|list|status|rotate|remove` dispatching to the shared package, rendered like `/mcp`.
+4. **Token-host grant scope**: gate the pool token-host phantom expansion on `grant_type=refresh_token`; pass other grants through unmodified.
+5. **Docs/process**: add the channel-parity principle to `CLAUDE.md` and `CONTRIBUTING.md` (done in this PR for the docs; the code tasks below are the follow-up implementation).
+
+## Implementation Steps
+
+### Task 1: Extract channel-agnostic pool operations
+
+**Files:**
+- Add: `internal/poolops/poolops.go`
+- Add: `internal/poolops/poolops_test.go`
+- Modify: `cmd/sluice/pool.go` (call the shared package instead of inline logic)
+
+- [ ] Define `Create(store, name, strategy, members)`, `List(store)`, `Status(store, name)`, `Rotate(store, name)`, `Remove(store, name)` with typed results and sentinel errors (reuse `store.ErrCredentialInUseByPool`, `store.PoolReferencedError`)
+- [ ] Move the epoch-guarded rotate write (`SetCredentialHealthIfPoolMemberEpoch` + `vault.ManualRotateReason`) and the `vault.NewPoolResolver(...).ResolveActive` status derivation into `Status`/`Rotate`
+- [ ] Rewrite `cmd/sluice/pool.go` handlers to thin wrappers over `internal/poolops` (no behavior change; existing CLI tests must still pass)
+- [ ] Tests: every op, success + each sentinel error; rotate epoch-race no-op
+- [ ] Run tests
+
+### Task 2: REST endpoints for pools (spec-first)
+
+**Files:**
+- Modify: `api/openapi.yaml`
+- Regenerate: `internal/api/api.gen.go` (via `make generate` — do not hand-edit)
+- Modify: `internal/api/server.go`
+- Modify: `internal/api/server_test.go`
+
+- [ ] Add to `api/openapi.yaml`: `/api/pools` (GET, POST), `/api/pools/{name}` (GET, DELETE), `/api/pools/{name}/rotate` (POST), with schemas mirroring the credentials/bindings style
+- [ ] `make generate`; implement the new `ServerInterface` methods in `server.go` calling `internal/poolops`
+- [ ] Map errors: validation → 400, unknown pool → 404, `ErrCredentialInUseByPool` / `PoolReferencedError` → 409, else 500 (mirror the existing cred-removal mapping at `server.go:~1287`)
+- [ ] Tests: list/create/status/rotate/remove happy paths + 400/404/409
+- [ ] `make lint-api`; run `go test ./internal/api/`
+
+### Task 3: Telegram `/pool` commands
+
+**Files:**
+- Modify: `internal/telegram/commands.go`
+- Modify: `internal/telegram/commands_test.go`
+- Modify: command help/menu registration (wherever `/policy`,`/cred`,`/mcp` are listed)
+
+- [ ] Add `case "pool":` to the dispatch switch; subcommands `create|list|status|rotate|remove` calling `internal/poolops`
+- [ ] Render `status` like the CLI (`* [i] member  healthy|cooldown … reason`); render errors as plain text
+- [ ] No message auto-delete (pool args carry no secrets — unlike `/mcp add`); add `/pool` to the grouped help/command menu
+- [ ] Tests: each subcommand success + error
+- [ ] Run tests
+
+### Task 4: Scope pool token-host phantom expansion to refresh grants
+
+**Files:**
+- Modify: `internal/proxy/addon.go`
+- Modify: `internal/proxy/addon_test.go` (or `pool_phantom_test.go`)
+
+- [ ] At the pool token-host expansion site, parse the request body form and only expand when `grant_type == "refresh_token"`; pass `device_code`/`authorization_code`/absent-grant requests through unmodified (reuse the form-parse already used by `classifyFailover`)
+- [ ] Tests (fail-before/pass-after): `refresh_token` grant still expanded (regression guard); `device_code` grant body + headers reach upstream byte-unchanged so a fresh in-container `codex login --device-auth` is not corrupted
+- [ ] Run tests; `go vet ./...` and `go vet -tags=e2e ./e2e/`
+
+### Task 5: Final validation
+
+**Files:** none (validation only)
+
+- [ ] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues
+- [ ] Full `go test ./...` + `-race` on `internal/proxy`,`internal/api`,`internal/telegram`,`internal/poolops`,`internal/store`
+- [ ] `go build ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
+- [ ] Independently verify the committed HEAD (no conflict markers; `git diff --stat HEAD` empty) before pushing
+
+## Out of Scope
+
+- In-flight retry on pooled 429/401 (the agent's own retry + synchronous member switch already covers it; documented non-goal).
+- Streaming-response failover (documented known limitation, separate work).
+- A pool-management web dashboard surface (tracked by `docs/plans/20260406-web-dashboard.md`; once `internal/poolops` exists the dashboard reuses it for free — note the synergy there rather than duplicating here).