diff --git a/CLAUDE.md b/CLAUDE.md index e61b4b4..f66ec25 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -118,6 +118,14 @@ Two credential types: `static` (default) for API keys and `oauth` for OAuth acce Runtime flags: `--mcp-base-url` sets the external URL the agent uses to reach sluice's MCP gateway (e.g. `http://sluice:3000`). This is added to `SelfBypass` so sluice does not policy-check its own MCP traffic. Defaults to deriving from `--health-addr`. `--agent ` selects an agent profile (`openclaw`, `hermes`); the profile controls the env file path inside the container, the secrets-reload mechanism, and the MCP wiring command. The default is `openclaw`. May also be set via `SLUICE_AGENT_PROFILE`. +## Channel Feature Parity + +Sluice exposes management surfaces over multiple channels: the **CLI**, the **HTTP/REST API** (`api/openapi.yaml`), the **Telegram bot**, and any future channel. **Every store-backed management feature must be reachable from all channels.** When you add or change a feature that reads or writes the SQLite store (policy rules, credentials, bindings, MCP upstreams, credential pools, config/default-verdict, …), implement it on **CLI and REST and Telegram**, not just one. + +The mechanism for keeping them honest: put the operation logic in a **channel-agnostic package** (e.g. the `store` methods, or a small `*ops` package like `internal/poolops`) and have each channel be a thin adapter over it. Logic written inline in one channel (the historical cause of the pools-are-CLI-only gap — rotate/status lived in `cmd/sluice/pool.go`) is the anti-pattern; lift it so the channels cannot drift. + +The **only** acceptable single-channel features are those with a documented rationale that makes them meaningless elsewhere — e.g. `sluice cert generate` / `sluice audit verify` are local-filesystem/operator tools with no remote-management semantics; OAuth token entry is stdin-only on the CLI specifically to keep secrets out of shell history and out of the REST/Telegram surfaces. State the rationale in the code and docs when you deliberately scope a feature to one channel; absent that rationale, parity is required and a reviewer should block the PR. + ## Agent Profiles Profiles abstract per-agent runtime conventions so sluice's container managers stay agent-agnostic. Each profile carries `EnvFileRelPath` (where to write phantom-token env vars), `ReloadCmd` (argv to exec for in-place secret reload, or nil), and `WireMCPCmd` (argv to register sluice as an MCP server in the agent's config). diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index cb22863..1caa1dc 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -97,6 +97,14 @@ make lint-api Do not edit `internal/api/api.gen.go` manually. It is regenerated from the spec. +## Channel Feature Parity + +Sluice manages its state from several channels: the **CLI**, the **REST API**, the **Telegram bot**, and future channels. **A store-backed management feature must land on all channels in the same PR**, not just the one that motivated it. This applies to anything that reads or writes the SQLite store: policy rules, credentials, bindings, MCP upstreams, credential pools, config/default-verdict, etc. + +To keep the channels from drifting, put the operation in a **channel-agnostic** place (a `store` method or a small `*ops` package) and make each channel a thin adapter. Do **not** write the logic inline in `cmd/sluice/*` (or only in a REST handler) — that is exactly how credential pools ended up CLI-only. + +A feature may be scoped to a single channel **only** with an explicit, documented rationale that makes it meaningless elsewhere (e.g. local operator tools like `sluice cert generate` / `sluice audit verify`; CLI-stdin-only OAuth token entry, which deliberately keeps secrets out of the REST/Telegram surfaces and shell history). State the rationale in the code comment and the relevant doc. Absent that rationale, **reviewers should treat a single-channel feature as incomplete and request parity before merge**, and the PR description should note which channels the change touches. + ## Architecture - `internal/store/` -- SQLite-backed policy store for all runtime state diff --git a/docs/plans/20260518-channel-feature-parity.md b/docs/plans/20260518-channel-feature-parity.md new file mode 100644 index 0000000..9427a63 --- /dev/null +++ b/docs/plans/20260518-channel-feature-parity.md @@ -0,0 +1,111 @@ +# Channel Feature Parity (pools on HTTP API + Telegram) and Token-Host Grant Scoping + +## Overview + +Two follow-ups surfaced while operating the credential-pool feature live: + +1. **Pools are CLI-only.** `sluice pool create|list|status|rotate|remove` exists, but there are **no HTTP API endpoints** and **no Telegram `/pool` commands**. Every other store-backed management surface (policy, credentials, bindings, MCP upstreams) is reachable from CLI **and** REST **and** Telegram. Pools must reach parity. +2. **Token-host phantom expansion is grant-blind.** sluice rewrites *every* request to a pool's shared OAuth token host (e.g. `auth.openai.com`), including non-refresh grants. A fresh in-container `codex login --device-auth` (a `device_code` grant) is therefore corrupted into `400 token_exchange_user_error`. The expansion must act only on `grant_type=refresh_token`. + +This plan also codifies a general **channel feature-parity principle** (CLI / HTTP API / Telegram / future channels expose the same management features unless there is a documented single-channel rationale) in `CLAUDE.md` and `CONTRIBUTING.md` so future store-backed features do not regress into a single channel again. + +## Context + +- Store (already complete, channel-agnostic): `internal/store/pools.go` — `CreatePoolWithMembers`, `GetPool`, `ListPools`, `RemovePoolIfUnreferenced`, `PoolsForMember`, `SetCredentialHealthIfPoolMemberEpoch`, `ListCredentialHealth`, `ErrCredentialInUseByPool`, `PoolReferencedError`. +- Pool resolution / status / rotate logic currently lives in `cmd/sluice/pool.go` (`handlePoolStatus`, `handlePoolRotate`) — the epoch-guarded `SetCredentialHealthIfPoolMemberEpoch(active, name, epoch, "cooldown", until, vault.ManualRotateReason)` write and the `vault.NewPoolResolver(...).ResolveActive` derivation. This is **not** reusable by other channels yet — it must be lifted into a shared package. +- REST: spec-first. `api/openapi.yaml` is the source of truth; `make generate` regenerates `internal/api/api.gen.go`; handlers implement the generated `ServerInterface` in `internal/api/server.go`. Existing resource patterns to mirror: `/api/credentials`, `/api/bindings`, `/api/mcp/upstreams` (collection + `/{id}` item). +- Telegram: command dispatch switch in `internal/telegram/commands.go` (~line 245: `case "policy" | "cred" | "mcp"`). Mirror an existing multi-subcommand handler (`/mcp add|list|remove`). Pool create carries **no secrets** (member credential names only) so the `/mcp add` auto-delete-message behavior is **not** needed. +- Token-host expansion: `internal/proxy/addon.go` — the `[ADDON-INJECT] token-host phantom expansion for pool ... (auth.openai.com)` path (`buildPhantomPairs` / `buildOAuthPhantomPairs` token-host branch). The request body / `grant_type` is available on the flow at injection time. +- Failover attribution already keys on `grant_type=refresh_token` bodies in `internal/proxy/pool_failover.go` (`classifyFailover` token-endpoint path) — reuse the same form-parse helper rather than writing a new one. + +## Development Approach + +- **Testing approach**: Regular (code first, then tests) +- Complete each task fully before moving to the next +- CRITICAL: every task MUST include new/updated tests +- CRITICAL: all tests must pass before starting the next task +- gofumpt for Go formatting; scoped Conventional Commits; PR to `main` (never direct push) +- REST changes follow the spec-first workflow: edit `api/openapi.yaml` → `make generate` → implement handler → `go test ./internal/api/` → `make lint-api`. Never hand-edit `api.gen.go`. + +## Testing Strategy + +- **Unit tests**: shared pool-ops package (create/list/status/rotate/remove) success + error paths (namespace collision, static member rejection, in-use-by-pool, unknown pool, epoch-raced rotate). +- **Unit tests**: REST handlers — happy path + 400 (validation) + 404 (unknown pool) + 409 (`ErrCredentialInUseByPool` / `PoolReferencedError`), mirroring the credentials/bindings handler tests. +- **Unit tests**: Telegram `/pool` handler — each subcommand success + error rendering. +- **Unit tests**: token-host expansion — a `refresh_token` grant is still expanded (regression guard); a `device_code` / `authorization_code` grant is **passed through untouched** (fail-before/pass-after). +- **E2E** (optional, if a pool e2e harness slot is cheap): a pooled upstream managed entirely via REST then via Telegram resolves identically to the CLI path. + +## Solution Overview + +1. **Lift pool ops into a shared, channel-agnostic package** (`internal/poolops` or a method set on `*store.Store` + a small `vault` helper) exposing `Create`, `List`, `Status`, `Rotate`, `Remove` returning typed results/errors. CLI, REST, and Telegram all call it. The epoch-guarded rotate write and the `ResolveActive`-based status derivation live here exactly once, so the three channels cannot drift (this is the root cause of the parity gap — the logic was written inline in `cmd/sluice`). +2. **REST**: add `/api/pools` (GET list, POST create), `/api/pools/{name}` (GET status, DELETE remove), `/api/pools/{name}/rotate` (POST). Spec-first. +3. **Telegram**: add `/pool create|list|status|rotate|remove` dispatching to the shared package, rendered like `/mcp`. +4. **Token-host grant scope**: gate the pool token-host phantom expansion on `grant_type=refresh_token`; pass other grants through unmodified. +5. **Docs/process**: add the channel-parity principle to `CLAUDE.md` and `CONTRIBUTING.md` (done in this PR for the docs; the code tasks below are the follow-up implementation). + +## Implementation Steps + +### Task 1: Extract channel-agnostic pool operations + +**Files:** +- Add: `internal/poolops/poolops.go` +- Add: `internal/poolops/poolops_test.go` +- Modify: `cmd/sluice/pool.go` (call the shared package instead of inline logic) + +- [ ] Define `Create(store, name, strategy, members)`, `List(store)`, `Status(store, name)`, `Rotate(store, name)`, `Remove(store, name)` with typed results and sentinel errors (reuse `store.ErrCredentialInUseByPool`, `store.PoolReferencedError`) +- [ ] Move the epoch-guarded rotate write (`SetCredentialHealthIfPoolMemberEpoch` + `vault.ManualRotateReason`) and the `vault.NewPoolResolver(...).ResolveActive` status derivation into `Status`/`Rotate` +- [ ] Rewrite `cmd/sluice/pool.go` handlers to thin wrappers over `internal/poolops` (no behavior change; existing CLI tests must still pass) +- [ ] Tests: every op, success + each sentinel error; rotate epoch-race no-op +- [ ] Run tests + +### Task 2: REST endpoints for pools (spec-first) + +**Files:** +- Modify: `api/openapi.yaml` +- Regenerate: `internal/api/api.gen.go` (via `make generate` — do not hand-edit) +- Modify: `internal/api/server.go` +- Modify: `internal/api/server_test.go` + +- [ ] Add to `api/openapi.yaml`: `/api/pools` (GET, POST), `/api/pools/{name}` (GET, DELETE), `/api/pools/{name}/rotate` (POST), with schemas mirroring the credentials/bindings style +- [ ] `make generate`; implement the new `ServerInterface` methods in `server.go` calling `internal/poolops` +- [ ] Map errors: validation → 400, unknown pool → 404, `ErrCredentialInUseByPool` / `PoolReferencedError` → 409, else 500 (mirror the existing cred-removal mapping at `server.go:~1287`) +- [ ] Tests: list/create/status/rotate/remove happy paths + 400/404/409 +- [ ] `make lint-api`; run `go test ./internal/api/` + +### Task 3: Telegram `/pool` commands + +**Files:** +- Modify: `internal/telegram/commands.go` +- Modify: `internal/telegram/commands_test.go` +- Modify: command help/menu registration (wherever `/policy`,`/cred`,`/mcp` are listed) + +- [ ] Add `case "pool":` to the dispatch switch; subcommands `create|list|status|rotate|remove` calling `internal/poolops` +- [ ] Render `status` like the CLI (`* [i] member healthy|cooldown … reason`); render errors as plain text +- [ ] No message auto-delete (pool args carry no secrets — unlike `/mcp add`); add `/pool` to the grouped help/command menu +- [ ] Tests: each subcommand success + error +- [ ] Run tests + +### Task 4: Scope pool token-host phantom expansion to refresh grants + +**Files:** +- Modify: `internal/proxy/addon.go` +- Modify: `internal/proxy/addon_test.go` (or `pool_phantom_test.go`) + +- [ ] At the pool token-host expansion site, parse the request body form and only expand when `grant_type == "refresh_token"`; pass `device_code`/`authorization_code`/absent-grant requests through unmodified (reuse the form-parse already used by `classifyFailover`) +- [ ] Tests (fail-before/pass-after): `refresh_token` grant still expanded (regression guard); `device_code` grant body + headers reach upstream byte-unchanged so a fresh in-container `codex login --device-auth` is not corrupted +- [ ] Run tests; `go vet ./...` and `go vet -tags=e2e ./e2e/` + +### Task 5: Final validation + +**Files:** none (validation only) + +- [ ] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues +- [ ] Full `go test ./...` + `-race` on `internal/proxy`,`internal/api`,`internal/telegram`,`internal/poolops`,`internal/store` +- [ ] `go build ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/` +- [ ] Independently verify the committed HEAD (no conflict markers; `git diff --stat HEAD` empty) before pushing + +## Out of Scope + +- In-flight retry on pooled 429/401 (the agent's own retry + synchronous member switch already covers it; documented non-goal). +- Streaming-response failover (documented known limitation, separate work). +- A pool-management web dashboard surface (tracked by `docs/plans/20260406-web-dashboard.md`; once `internal/poolops` exists the dashboard reuses it for free — note the synergy there rather than duplicating here).