Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,14 @@ Two credential types: `static` (default) for API keys and `oauth` for OAuth acce

Runtime flags: `--mcp-base-url` sets the external URL the agent uses to reach sluice's MCP gateway (e.g. `http://sluice:3000`). This is added to `SelfBypass` so sluice does not policy-check its own MCP traffic. Defaults to deriving from `--health-addr`. `--agent <profile>` selects an agent profile (`openclaw`, `hermes`); the profile controls the env file path inside the container, the secrets-reload mechanism, and the MCP wiring command. The default is `openclaw`. May also be set via `SLUICE_AGENT_PROFILE`.

## Channel Feature Parity

Sluice exposes management surfaces over multiple channels: the **CLI**, the **HTTP/REST API** (`api/openapi.yaml`), the **Telegram bot**, and any future channel. **Every store-backed management feature must be reachable from all channels.** When you add or change a feature that reads or writes the SQLite store (policy rules, credentials, bindings, MCP upstreams, credential pools, config/default-verdict, …), implement it on **CLI and REST and Telegram**, not just one.

The mechanism for keeping them honest: put the operation logic in a **channel-agnostic package** (e.g. the `store` methods, or a small `*ops` package like `internal/poolops`) and have each channel be a thin adapter over it. Logic written inline in one channel (the historical cause of the pools-are-CLI-only gap — rotate/status lived in `cmd/sluice/pool.go`) is the anti-pattern; lift it so the channels cannot drift.

The **only** acceptable single-channel features are those with a documented rationale that makes them meaningless elsewhere — e.g. `sluice cert generate` / `sluice audit verify` are local-filesystem/operator tools with no remote-management semantics; OAuth token entry is stdin-only on the CLI specifically to keep secrets out of shell history and out of the REST/Telegram surfaces. State the rationale in the code and docs when you deliberately scope a feature to one channel; absent that rationale, parity is required and a reviewer should block the PR.

## Agent Profiles

Profiles abstract per-agent runtime conventions so sluice's container managers stay agent-agnostic. Each profile carries `EnvFileRelPath` (where to write phantom-token env vars), `ReloadCmd` (argv to exec for in-place secret reload, or nil), and `WireMCPCmd` (argv to register sluice as an MCP server in the agent's config).
Expand Down
8 changes: 8 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,14 @@ make lint-api

Do not edit `internal/api/api.gen.go` manually. It is regenerated from the spec.

## Channel Feature Parity

Sluice manages its state from several channels: the **CLI**, the **REST API**, the **Telegram bot**, and future channels. **A store-backed management feature must land on all channels in the same PR**, not just the one that motivated it. This applies to anything that reads or writes the SQLite store: policy rules, credentials, bindings, MCP upstreams, credential pools, config/default-verdict, etc.

To keep the channels from drifting, put the operation in a **channel-agnostic** place (a `store` method or a small `*ops` package) and make each channel a thin adapter. Do **not** write the logic inline in `cmd/sluice/*` (or only in a REST handler) — that is exactly how credential pools ended up CLI-only.

A feature may be scoped to a single channel **only** with an explicit, documented rationale that makes it meaningless elsewhere (e.g. local operator tools like `sluice cert generate` / `sluice audit verify`; CLI-stdin-only OAuth token entry, which deliberately keeps secrets out of the REST/Telegram surfaces and shell history). State the rationale in the code comment and the relevant doc. Absent that rationale, **reviewers should treat a single-channel feature as incomplete and request parity before merge**, and the PR description should note which channels the change touches.

## Architecture

- `internal/store/` -- SQLite-backed policy store for all runtime state
Expand Down
111 changes: 111 additions & 0 deletions docs/plans/20260518-channel-feature-parity.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
# Channel Feature Parity (pools on HTTP API + Telegram) and Token-Host Grant Scoping

## Overview

Two follow-ups surfaced while operating the credential-pool feature live:

1. **Pools are CLI-only.** `sluice pool create|list|status|rotate|remove` exists, but there are **no HTTP API endpoints** and **no Telegram `/pool` commands**. Every other store-backed management surface (policy, credentials, bindings, MCP upstreams) is reachable from CLI **and** REST **and** Telegram. Pools must reach parity.
2. **Token-host phantom expansion is grant-blind.** sluice rewrites *every* request to a pool's shared OAuth token host (e.g. `auth.openai.com`), including non-refresh grants. A fresh in-container `codex login --device-auth` (a `device_code` grant) is therefore corrupted into `400 token_exchange_user_error`. The expansion must act only on `grant_type=refresh_token`.

This plan also codifies a general **channel feature-parity principle** (CLI / HTTP API / Telegram / future channels expose the same management features unless there is a documented single-channel rationale) in `CLAUDE.md` and `CONTRIBUTING.md` so future store-backed features do not regress into a single channel again.

## Context

- Store (already complete, channel-agnostic): `internal/store/pools.go` — `CreatePoolWithMembers`, `GetPool`, `ListPools`, `RemovePoolIfUnreferenced`, `PoolsForMember`, `SetCredentialHealthIfPoolMemberEpoch`, `ListCredentialHealth`, `ErrCredentialInUseByPool`, `PoolReferencedError`.
- Pool resolution / status / rotate logic currently lives in `cmd/sluice/pool.go` (`handlePoolStatus`, `handlePoolRotate`) — the epoch-guarded `SetCredentialHealthIfPoolMemberEpoch(active, name, epoch, "cooldown", until, vault.ManualRotateReason)` write and the `vault.NewPoolResolver(...).ResolveActive` derivation. This is **not** reusable by other channels yet — it must be lifted into a shared package.
- REST: spec-first. `api/openapi.yaml` is the source of truth; `make generate` regenerates `internal/api/api.gen.go`; handlers implement the generated `ServerInterface` in `internal/api/server.go`. Existing resource patterns to mirror: `/api/credentials`, `/api/bindings`, `/api/mcp/upstreams` (collection + `/{id}` item).
- Telegram: command dispatch switch in `internal/telegram/commands.go` (~line 245: `case "policy" | "cred" | "mcp"`). Mirror an existing multi-subcommand handler (`/mcp add|list|remove`). Pool create carries **no secrets** (member credential names only) so the `/mcp add` auto-delete-message behavior is **not** needed.
- Token-host expansion: `internal/proxy/addon.go` — the `[ADDON-INJECT] token-host phantom expansion for pool ... (auth.openai.com)` path (`buildPhantomPairs` / `buildOAuthPhantomPairs` token-host branch). The request body / `grant_type` is available on the flow at injection time.
- Failover attribution already keys on `grant_type=refresh_token` bodies in `internal/proxy/pool_failover.go` (`classifyFailover` token-endpoint path) — reuse the same form-parse helper rather than writing a new one.

## Development Approach

- **Testing approach**: Regular (code first, then tests)
- Complete each task fully before moving to the next
- CRITICAL: every task MUST include new/updated tests
- CRITICAL: all tests must pass before starting the next task
- gofumpt for Go formatting; scoped Conventional Commits; PR to `main` (never direct push)
- REST changes follow the spec-first workflow: edit `api/openapi.yaml` → `make generate` → implement handler → `go test ./internal/api/` → `make lint-api`. Never hand-edit `api.gen.go`.

## Testing Strategy

- **Unit tests**: shared pool-ops package (create/list/status/rotate/remove) success + error paths (namespace collision, static member rejection, in-use-by-pool, unknown pool, epoch-raced rotate).
- **Unit tests**: REST handlers — happy path + 400 (validation) + 404 (unknown pool) + 409 (`ErrCredentialInUseByPool` / `PoolReferencedError`), mirroring the credentials/bindings handler tests.
- **Unit tests**: Telegram `/pool` handler — each subcommand success + error rendering.
- **Unit tests**: token-host expansion — a `refresh_token` grant is still expanded (regression guard); a `device_code` / `authorization_code` grant is **passed through untouched** (fail-before/pass-after).
- **E2E** (optional, if a pool e2e harness slot is cheap): a pooled upstream managed entirely via REST then via Telegram resolves identically to the CLI path.

## Solution Overview

1. **Lift pool ops into a shared, channel-agnostic package** (`internal/poolops` or a method set on `*store.Store` + a small `vault` helper) exposing `Create`, `List`, `Status`, `Rotate`, `Remove` returning typed results/errors. CLI, REST, and Telegram all call it. The epoch-guarded rotate write and the `ResolveActive`-based status derivation live here exactly once, so the three channels cannot drift (this is the root cause of the parity gap — the logic was written inline in `cmd/sluice`).
2. **REST**: add `/api/pools` (GET list, POST create), `/api/pools/{name}` (GET status, DELETE remove), `/api/pools/{name}/rotate` (POST). Spec-first.
3. **Telegram**: add `/pool create|list|status|rotate|remove` dispatching to the shared package, rendered like `/mcp`.
4. **Token-host grant scope**: gate the pool token-host phantom expansion on `grant_type=refresh_token`; pass other grants through unmodified.
5. **Docs/process**: add the channel-parity principle to `CLAUDE.md` and `CONTRIBUTING.md` (done in this PR for the docs; the code tasks below are the follow-up implementation).

## Implementation Steps

### Task 1: Extract channel-agnostic pool operations

**Files:**
- Add: `internal/poolops/poolops.go`
- Add: `internal/poolops/poolops_test.go`
- Modify: `cmd/sluice/pool.go` (call the shared package instead of inline logic)

- [ ] Define `Create(store, name, strategy, members)`, `List(store)`, `Status(store, name)`, `Rotate(store, name)`, `Remove(store, name)` with typed results and sentinel errors (reuse `store.ErrCredentialInUseByPool`, `store.PoolReferencedError`)
- [ ] Move the epoch-guarded rotate write (`SetCredentialHealthIfPoolMemberEpoch` + `vault.ManualRotateReason`) and the `vault.NewPoolResolver(...).ResolveActive` status derivation into `Status`/`Rotate`
- [ ] Rewrite `cmd/sluice/pool.go` handlers to thin wrappers over `internal/poolops` (no behavior change; existing CLI tests must still pass)
- [ ] Tests: every op, success + each sentinel error; rotate epoch-race no-op
- [ ] Run tests

### Task 2: REST endpoints for pools (spec-first)

**Files:**
- Modify: `api/openapi.yaml`
- Regenerate: `internal/api/api.gen.go` (via `make generate` — do not hand-edit)
- Modify: `internal/api/server.go`
- Modify: `internal/api/server_test.go`

- [ ] Add to `api/openapi.yaml`: `/api/pools` (GET, POST), `/api/pools/{name}` (GET, DELETE), `/api/pools/{name}/rotate` (POST), with schemas mirroring the credentials/bindings style
- [ ] `make generate`; implement the new `ServerInterface` methods in `server.go` calling `internal/poolops`
- [ ] Map errors: validation → 400, unknown pool → 404, `ErrCredentialInUseByPool` / `PoolReferencedError` → 409, else 500 (mirror the existing cred-removal mapping at `server.go:~1287`)
- [ ] Tests: list/create/status/rotate/remove happy paths + 400/404/409
- [ ] `make lint-api`; run `go test ./internal/api/`

### Task 3: Telegram `/pool` commands

**Files:**
- Modify: `internal/telegram/commands.go`
- Modify: `internal/telegram/commands_test.go`
- Modify: command help/menu registration (wherever `/policy`,`/cred`,`/mcp` are listed)

- [ ] Add `case "pool":` to the dispatch switch; subcommands `create|list|status|rotate|remove` calling `internal/poolops`
- [ ] Render `status` like the CLI (`* [i] member healthy|cooldown … reason`); render errors as plain text
- [ ] No message auto-delete (pool args carry no secrets — unlike `/mcp add`); add `/pool` to the grouped help/command menu
- [ ] Tests: each subcommand success + error
- [ ] Run tests

### Task 4: Scope pool token-host phantom expansion to refresh grants

**Files:**
- Modify: `internal/proxy/addon.go`
- Modify: `internal/proxy/addon_test.go` (or `pool_phantom_test.go`)

- [ ] At the pool token-host expansion site, parse the request body form and only expand when `grant_type == "refresh_token"`; pass `device_code`/`authorization_code`/absent-grant requests through unmodified (reuse the form-parse already used by `classifyFailover`)
- [ ] Tests (fail-before/pass-after): `refresh_token` grant still expanded (regression guard); `device_code` grant body + headers reach upstream byte-unchanged so a fresh in-container `codex login --device-auth` is not corrupted
- [ ] Run tests; `go vet ./...` and `go vet -tags=e2e ./e2e/`

### Task 5: Final validation

**Files:** none (validation only)

- [ ] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues
- [ ] Full `go test ./...` + `-race` on `internal/proxy`,`internal/api`,`internal/telegram`,`internal/poolops`,`internal/store`
- [ ] `go build ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
- [ ] Independently verify the committed HEAD (no conflict markers; `git diff --stat HEAD` empty) before pushing

## Out of Scope

- In-flight retry on pooled 429/401 (the agent's own retry + synchronous member switch already covers it; documented non-goal).
- Streaming-response failover (documented known limitation, separate work).
- A pool-management web dashboard surface (tracked by `docs/plans/20260406-web-dashboard.md`; once `internal/poolops` exists the dashboard reuses it for free — note the synergy there rather than duplicating here).
Loading