From 25fcf54b014a910b42aca14c9c62382540ed5d08 Mon Sep 17 00:00:00 2001 From: amcheste-ai-agent <278991699+amcheste-ai-agent@users.noreply.github.com> Date: Mon, 11 May 2026 17:00:32 -0400 Subject: [PATCH 1/2] chore(brand): em-dash sweep across prose + remove submitted CFP draft MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Brand alignment pass against the alanchester-brand voice rules. 196 em-dashes swept across 22 prose files. The mechanical sweep replaces ` — ` with `. ` (period + space) and capitalizes the following letter when it was lowercase. Code blocks and markdown table rows are protected from substitution. Files swept: README.md, CONTRIBUTING.md, AGENTS.md, ARCHITECTURE.md, SECURITY.md, docs/README.md, docs/index.md, docs/helm-values.md, docs/explanation/{coordination,index,operations,resources}.md, docs/how-to/index.md, docs/how-to/install/{aks,eks,gke}.md, docs/how-to/operate/{budget-alerts,expose-dashboard, shared-storage}.md, docs/tutorials/{getting-started,index}.md, docs/reference/index.md Post-sweep audit (`grep -nE '\. [a-z]'`) found 5 awkward continuations after mechanical replacement. 4 are abbreviation false positives (`e.g.`, `Approx.`, `vs.`) and left as-is. 1 was a real awkward continuation in docs/how-to/operate/expose-dashboard.md where the original em-dash separated a comma-clause; restored to comma form. Out of scope (intentional): - internal/dashboard/templates/layout.html status colors. The dashboard is a tool surface, not a brand surface; the semantic UI palette (gray/amber/blue/green/red phase colors) stays. - docs/cfp/cfp-draft.md is deleted in this PR. The CFP was submitted, the draft no longer needs to live in the repo. Removing it eliminates 30 em-dashes that otherwise would have been flagged in scope. Co-Authored-By: Claude Opus 4.7 (1M context) Co-Authored-By: amcheste <13696614+amcheste@users.noreply.github.com> --- AGENTS.md | 30 ++--- ARCHITECTURE.md | 40 +++--- CONTRIBUTING.md | 52 ++++---- README.md | 28 ++--- SECURITY.md | 18 +-- docs/README.md | 2 +- docs/cfp/cfp-draft.md | 156 ------------------------ docs/explanation/coordination.md | 18 +-- docs/explanation/index.md | 10 +- docs/explanation/operations.md | 14 +-- docs/explanation/resources.md | 22 ++-- docs/helm-values.md | 6 +- docs/how-to/index.md | 14 +-- docs/how-to/install/aks.md | 12 +- docs/how-to/install/eks.md | 12 +- docs/how-to/install/gke.md | 16 +-- docs/how-to/operate/budget-alerts.md | 16 +-- docs/how-to/operate/expose-dashboard.md | 16 +-- docs/how-to/operate/shared-storage.md | 14 +-- docs/index.md | 10 +- docs/reference/index.md | 6 +- docs/tutorials/getting-started.md | 16 +-- docs/tutorials/index.md | 6 +- 23 files changed, 189 insertions(+), 345 deletions(-) delete mode 100644 docs/cfp/cfp-draft.md diff --git a/AGENTS.md b/AGENTS.md index f21b66f..a64c775 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,24 +1,24 @@ -# AGENTS.md — Agent Team Guidelines for claude-teams-operator +# AGENTS.md. Agent Team Guidelines for claude-teams-operator ## When working as a teammate on this project -1. **Check the task list first** — before starting work, check what's assigned to you -2. **Respect module boundaries** — each internal package has a clear scope: - - `internal/controller/` — only reconciliation logic - - `internal/claude/` — only Claude Code file I/O and session management - - `internal/budget/` — only cost estimation - - `internal/webhook/` — only external notifications - - `internal/metrics/` — only Prometheus metrics -3. **Use kubebuilder markers** — all CRD types in `api/v1alpha1/` must have proper `+kubebuilder:` annotations -4. **Test with envtest** — controller tests should use controller-runtime's envtest framework -5. **Follow Kubernetes conventions** — conditions use `metav1.Condition`, status updates are separate from spec changes +1. **Check the task list first**. Before starting work, check what's assigned to you +2. **Respect module boundaries**. Each internal package has a clear scope: + - `internal/controller/`. Only reconciliation logic + - `internal/claude/`. Only Claude Code file I/O and session management + - `internal/budget/`. Only cost estimation + - `internal/webhook/`. Only external notifications + - `internal/metrics/`. Only Prometheus metrics +3. **Use kubebuilder markers**. All CRD types in `api/v1alpha1/` must have proper `+kubebuilder:` annotations +4. **Test with envtest**. Controller tests should use controller-runtime's envtest framework +5. **Follow Kubernetes conventions**. Conditions use `metav1.Condition`, status updates are separate from spec changes ## Architecture rules -- The operator NEVER makes Anthropic API calls directly — it only manages pods that run Claude Code -- All inter-agent communication goes through the shared PVC filesystem — the operator just creates and monitors the volumes -- Budget tracking is estimation-based — we can't read real-time token counts from Claude Code -- Pods use `RestartPolicy: Never` — crashed agents get re-spawned fresh, not restarted +- The operator NEVER makes Anthropic API calls directly. It only manages pods that run Claude Code +- All inter-agent communication goes through the shared PVC filesystem. The operator just creates and monitors the volumes +- Budget tracking is estimation-based. We can't read real-time token counts from Claude Code +- Pods use `RestartPolicy: Never`. Crashed agents get re-spawned fresh, not restarted ## Build verification diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index dfe9eb0..db18a59 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -86,20 +86,20 @@ This approach preserves the native Agent Teams protocol without modification whi ## Storage Requirements -All operator-managed PVCs — `team-state`, `repo`, and (in Cowork mode) `output` — default to `ReadWriteMany` access on a StorageClass named `nfs`. The requirement is not incidental: the lead and every teammate pod must open the same mailbox and task files concurrently, and on a multi-node cluster they will generally land on different nodes. `ReadWriteOnce` can only bind to one node at a time, so it is not a viable default. +All operator-managed PVCs. `team-state`, `repo`, and (in Cowork mode) `output`. Default to `ReadWriteMany` access on a StorageClass named `nfs`. The requirement is not incidental: the lead and every teammate pod must open the same mailbox and task files concurrently, and on a multi-node cluster they will generally land on different nodes. `ReadWriteOnce` can only bind to one node at a time, so it is not a viable default. ### Why ReadWriteMany Each agent pod does two concurrent things against shared state: -- **Writing into peers' inboxes** — the lead writes `teams/{team}/inboxes/{teammate}.json`; each teammate writes to the lead's inbox and occasionally to other teammates'. -- **Claiming tasks** — multiple teammates race to claim items from `tasks/{team}/tasks.json`. +- **Writing into peers' inboxes**. The lead writes `teams/{team}/inboxes/{teammate}.json`; each teammate writes to the lead's inbox and occasionally to other teammates'. +- **Claiming tasks**. Multiple teammates race to claim items from `tasks/{team}/tasks.json`. If the backing PVC cannot be mounted on more than one node, the second pod will fail to schedule (`volume already attached to a different node`) and the team deadlocks before the first mailbox round-trip. ### Supported storage backends -The operator itself has no opinion about the CSI driver — it asks for a PVC with `accessModes: [ReadWriteMany]` and a `storageClassName` that you supply. The table below lists drivers known to satisfy the RWX contract: +The operator itself has no opinion about the CSI driver. It asks for a PVC with `accessModes: [ReadWriteMany]` and a `storageClassName` that you supply. The table below lists drivers known to satisfy the RWX contract: | Platform | Driver | Notes | |----------|--------|-------| @@ -114,11 +114,11 @@ The StorageClass name the operator requests defaults to `nfs` and is overridable ### Single-node fallback -For laptops and CI — Kind, k3d, minikube — a full RWX provisioner is overkill. The operator accepts a `--pvc-access-mode=ReadWriteOnce` flag that switches every managed PVC from `ReadWriteMany` to `ReadWriteOnce`. This works **only** on single-node clusters, because every pod lands on the same node and a hostPath-backed RWO PVC is effectively visible to all of them. +For laptops and CI. Kind, k3d, minikube. A full RWX provisioner is overkill. The operator accepts a `--pvc-access-mode=ReadWriteOnce` flag that switches every managed PVC from `ReadWriteMany` to `ReadWriteOnce`. This works **only** on single-node clusters, because every pod lands on the same node and a hostPath-backed RWO PVC is effectively visible to all of them. `hack/acceptance-setup.sh` uses exactly this trick: it creates an alias StorageClass named `nfs` over `rancher.io/local-path` so the operator's PVC specs still validate, then sets `--pvc-access-mode=ReadWriteOnce` on the controller deployment. -The architectural claim — that a shared mount is sufficient to ferry mailbox JSON between pods — can be verified on any single-node cluster with: +The architectural claim. That a shared mount is sufficient to ferry mailbox JSON between pods. Can be verified on any single-node cluster with: ```bash make acceptance-up @@ -133,10 +133,10 @@ The smoke test reports the effective StorageClass and AccessMode on its PASS lin The native Agent Teams protocol is file-based: -- **Mailboxes** — each agent has a JSON inbox at `~/.claude/teams/{team}/inboxes/{agent}.json`. Agents read their own inbox for messages from teammates. -- **Task list** — a shared JSON file at `~/.claude/tasks/{team}/tasks.json`. The lead writes tasks; teammates claim and update them. +- **Mailboxes**. Each agent has a JSON inbox at `~/.claude/teams/{team}/inboxes/{agent}.json`. Agents read their own inbox for messages from teammates. +- **Task list**. A shared JSON file at `~/.claude/tasks/{team}/tasks.json`. The lead writes tasks; teammates claim and update them. -The operator does not implement or speak this protocol — it only creates the shared PVC that makes the filesystem visible to all pods. Claude Code manages the protocol itself. +The operator does not implement or speak this protocol. It only creates the shared PVC that makes the filesystem visible to all pods. Claude Code manages the protocol itself. ## Coding Mode @@ -148,7 +148,7 @@ When `spec.repository` is set, the operator runs an init Job before deploying po Each teammate pod receives `WORKTREE_PATH=worktrees/{name}`, and the entrypoint `cd`s to that path before launching Claude Code. The lead has no worktree path and works directly from `/workspace/repo`. -Per-worktree isolation prevents git conflicts between concurrent agents — each agent commits to its own branch, and the lead (or an `onComplete` action) handles merging. +Per-worktree isolation prevents git conflicts between concurrent agents. Each agent commits to its own branch, and the lead (or an `onComplete` action) handles merging. ## Cowork Mode @@ -156,7 +156,7 @@ When `spec.workspace` is set (and `spec.repository` is absent or minimal), the o - Creates an output PVC for writable agent output - Mounts workspace inputs (ConfigMaps or existing PVCs) read-only into each pod -- Does not set `WORKTREE_PATH` — agents work in `/workspace/output` or `/workspace/data` +- Does not set `WORKTREE_PATH`. Agents work in `/workspace/output` or `/workspace/data` The entrypoint detects the absence of a git repo gracefully and skips the `git log` startup output. @@ -164,7 +164,7 @@ The entrypoint detects the absence of a git repo gracefully and skips the `git l Claude Code skills live under `~/.claude/skills/{name}/`. The operator mounts ConfigMap-backed skills at `/var/claude-skills/{name}/` and the entrypoint copies them to `~/.claude/skills/{name}/` before launching Claude Code. -Skills are per-agent — the same skill ConfigMap can be mounted into multiple pods independently, so different teammates can have different skill sets. +Skills are per-agent. The same skill ConfigMap can be mounted into multiple pods independently, so different teammates can have different skill sets. ## MCP Servers @@ -193,7 +193,7 @@ The next reconcile loop (within 30 seconds) sees the annotation and spawns the t ## DependsOn Ordering -Teammates can declare `dependsOn` — a list of other teammate names that must reach `Succeeded` phase before this teammate is spawned. The check runs every reconcile loop: +Teammates can declare `dependsOn`. A list of other teammate names that must reach `Succeeded` phase before this teammate is spawned. The check runs every reconcile loop: - In `reconcileInitializing`: initial pod deployment respects dependency order - In `reconcileRunning`: newly unblocked teammates are spawned automatically as their dependencies complete @@ -217,7 +217,7 @@ When the estimate exceeds `budgetLimit`, the operator terminates all pods and se ### Why shared PVC over a message bus? -Agent Teams uses a file-based protocol. Rather than translating it to Redis or NATS, we preserve it exactly by mounting a shared filesystem. This means no changes to Claude Code itself, no protocol versioning concerns, and no additional infrastructure dependencies for simple deployments. The tradeoff is the requirement for ReadWriteMany PVC support — NFS or a cloud-native equivalent like EFS or GCP Filestore. +Agent Teams uses a file-based protocol. Rather than translating it to Redis or NATS, we preserve it exactly by mounting a shared filesystem. This means no changes to Claude Code itself, no protocol versioning concerns, and no additional infrastructure dependencies for simple deployments. The tradeoff is the requirement for ReadWriteMany PVC support. NFS or a cloud-native equivalent like EFS or GCP Filestore. ### Why RestartPolicy: Never? @@ -266,9 +266,9 @@ hack/ ## Roadmap -- **OCI skill artifacts** — pull skills from OCI registries instead of ConfigMaps -- **Real token tracking** — instrument or sidecar Claude Code to capture actual usage -- **envtest integration tests** — full reconcile loop tests against a real API server -- **Horizontal scaling** — multiple operator replicas with leader election -- **Beads/Dolt integration** — persistent task tracking across team runs -- **`AgentTeamRun` controller** — reconciler for the template-instantiation CRD +- **OCI skill artifacts**. Pull skills from OCI registries instead of ConfigMaps +- **Real token tracking**. Instrument or sidecar Claude Code to capture actual usage +- **envtest integration tests**. Full reconcile loop tests against a real API server +- **Horizontal scaling**. Multiple operator replicas with leader election +- **Beads/Dolt integration**. Persistent task tracking across team runs +- **`AgentTeamRun` controller**. Reconciler for the template-instantiation CRD diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 3c3b189..999ced7 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -20,31 +20,31 @@ This project uses **Linear** (team `AMC`, project `claude-teams-operator`) as th For external contributors who don't have Linear access: - File issues directly on GitHub using the [issue templates](https://github.com/amcheste/claude-teams-operator/issues/new/choose). The maintainer will mirror them into Linear. -- Reference the GitHub issue number in your PR (`Fixes #123`) — that works fine. The Linear sync handles the cross-reference. +- Reference the GitHub issue number in your PR (`Fixes #123`). That works fine. The Linear sync handles the cross-reference. For maintainers and regular contributors: - Open or claim issues in Linear directly via [save_issue](https://linear.app/amcheste/project/claude-teams-operator-32aab082f36b) (or the Linear UI). -- PRs to `develop` are required to reference an `AMC-N` ID or carry a `No-Linear-Issue: ` trailer — the `Linear Issue Reference` CI check enforces this. +- PRs to `develop` are required to reference an `AMC-N` ID or carry a `No-Linear-Issue: ` trailer. The `Linear Issue Reference` CI check enforces this. ## Good first issues If you're looking for a way in, browse: -- [`good first issue`](https://github.com/amcheste/claude-teams-operator/labels/good%20first%20issue) — small, well-scoped tasks with clear acceptance criteria -- [`help wanted`](https://github.com/amcheste/claude-teams-operator/labels/help%20wanted) — areas where the maintainer would specifically welcome a hand -- [`documentation`](https://github.com/amcheste/claude-teams-operator/labels/documentation) — content fixes, new tutorials, or how-to guides for the docs site at [kagents.dev](https://kagents.dev) +- [`good first issue`](https://github.com/amcheste/claude-teams-operator/labels/good%20first%20issue). Small, well-scoped tasks with clear acceptance criteria +- [`help wanted`](https://github.com/amcheste/claude-teams-operator/labels/help%20wanted). Areas where the maintainer would specifically welcome a hand +- [`documentation`](https://github.com/amcheste/claude-teams-operator/labels/documentation). Content fixes, new tutorials, or how-to guides for the docs site at [kagents.dev](https://kagents.dev) If nothing on those lists fits, [open a Discussion](https://github.com/amcheste/claude-teams-operator/discussions) describing what you'd like to work on. Better to align before writing code than after. ## Prerequisites -- **Go 1.23+** — `brew install go` or [go.dev/dl](https://go.dev/dl) -- **Docker** — for building container images -- **Kind** — `brew install kind` (local cluster) -- **kubectl** — `brew install kubectl` -- **Helm** — `brew install helm` -- **golangci-lint** — `brew install golangci-lint` +- **Go 1.23+**. `brew install go` or [go.dev/dl](https://go.dev/dl) +- **Docker**. For building container images +- **Kind**. `brew install kind` (local cluster) +- **kubectl**. `brew install kubectl` +- **Helm**. `brew install helm` +- **golangci-lint**. `brew install golangci-lint` Verify your Go installation: @@ -119,7 +119,7 @@ The CRD types live in `api/v1alpha1/`. After modifying them: 3. Run `make install` to apply the updated CRDs to your cluster 4. Commit both the Go source changes **and** the generated files -Do not edit `zz_generated.deepcopy.go` or `config/crd/bases/*.yaml` by hand — they are always regenerated. +Do not edit `zz_generated.deepcopy.go` or `config/crd/bases/*.yaml` by hand. They are always regenerated. ## Testing @@ -144,8 +144,8 @@ In short: branch from `develop`, one logical change per PR, [Conventional Commit This repo extends the canonical commit types with: -- `test:` — adding or updating tests -- `ci:` — CI/CD configuration changes +- `test:`. Adding or updating tests +- `ci:`. CI/CD configuration changes Scopes are encouraged (optional but helpful): `feat(controller):`, `fix(crd):`, `docs(readme):`, `feat(crd)!: rename budgetLimit field`. @@ -182,11 +182,11 @@ The site auto-deploys to `gh-pages` on every push to `main` that touches `docs/` ## How to add a new reconciler feature -The most common contribution path is "add a new field to an `AgentTeam` and have the operator do something with it." Use this worked example as a template — it's the path #13–#16 followed for crash respawn, RBAC, create-pr, and push-branch. +The most common contribution path is "add a new field to an `AgentTeam` and have the operator do something with it." Use this worked example as a template. It's the path #13–#16 followed for crash respawn, RBAC, create-pr, and push-branch. ### 1. Decide where the field belongs -Most lifecycle-related fields live on `LifecycleSpec`; pod-level configuration lives on `LeadSpec`/`TeammateSpec`; cluster-wide defaults live on the Helm chart's `values.yaml`. When in doubt, look at how `MaxRestarts` or `GitCredentialsSecret` are wired — they're representative. +Most lifecycle-related fields live on `LifecycleSpec`; pod-level configuration lives on `LeadSpec`/`TeammateSpec`; cluster-wide defaults live on the Helm chart's `values.yaml`. When in doubt, look at how `MaxRestarts` or `GitCredentialsSecret` are wired. They're representative. ### 2. Extend the CRD type @@ -202,7 +202,7 @@ Edit `api/v1alpha1/agentteam_types.go` (or `template_types.go`). Add the field w MaxRestarts *int32 `json:"maxRestarts,omitempty"` ``` -The doc comment becomes the CRD's OpenAPI description — write it for someone reading `kubectl explain agentteam.spec.lifecycle.maxRestarts`. +The doc comment becomes the CRD's OpenAPI description. Write it for someone reading `kubectl explain agentteam.spec.lifecycle.maxRestarts`. ### 3. Regenerate manifests + deepcopy @@ -214,7 +214,7 @@ This rewrites `config/crd/bases/*.yaml`, `charts/claude-teams-operator/crds/*.ya ### 4. Implement the reconciler change -Find the right phase function — `reconcilePending`, `reconcileInitializing`, `reconcileRunning`, or `reconcileTerminal` — in `internal/controller/agentteam_controller.go`. The phases are documented in [ARCHITECTURE.md § State Machine](ARCHITECTURE.md). +Find the right phase function. `reconcilePending`, `reconcileInitializing`, `reconcileRunning`, or `reconcileTerminal`. In `internal/controller/agentteam_controller.go`. The phases are documented in [ARCHITECTURE.md § State Machine](ARCHITECTURE.md). Add a small helper rather than inlining new logic. The convention is `func (r *AgentTeamReconciler) handleX(ctx, team) (...)` for stateful behavior, and free functions for pure logic. See `handleTeammateFailures` and `newTeamTracker` for examples. @@ -236,9 +236,9 @@ If the existing webhook event types don't fit, add a new one to `internal/webhoo Each PR should add tests at the layers it changes: -- **Unit tests** — fast, fake-client based. Cover validation, branch coverage in your helper, error paths. Add to `internal/controller/agentteam__test.go`. See [TESTING.md](TESTING.md) for the suite breakdown. -- **Integration tests** — envtest-backed Ginkgo specs in `internal/controller/agentteam_integration_test.go` (or a new `agentteam__integration_test.go`). Use these when the behavior depends on the real API server's optimistic concurrency, status subresource handling, or owner references. -- **Acceptance tests** — Kind-cluster Ginkgo specs under `test/acceptance/`. Use when the behavior involves pod lifecycle, PVC mounting, or anything that fake-client can't simulate. Real-API E2E (`test/e2e/`) is reserved for end-to-end verification against Anthropic's API. +- **Unit tests**. Fast, fake-client based. Cover validation, branch coverage in your helper, error paths. Add to `internal/controller/agentteam__test.go`. See [TESTING.md](TESTING.md) for the suite breakdown. +- **Integration tests**. Envtest-backed Ginkgo specs in `internal/controller/agentteam_integration_test.go` (or a new `agentteam__integration_test.go`). Use these when the behavior depends on the real API server's optimistic concurrency, status subresource handling, or owner references. +- **Acceptance tests**. Kind-cluster Ginkgo specs under `test/acceptance/`. Use when the behavior involves pod lifecycle, PVC mounting, or anything that fake-client can't simulate. Real-API E2E (`test/e2e/`) is reserved for end-to-end verification against Anthropic's API. A good rule: if your feature has a state machine, your test count should be ≥ the number of branches in the state machine. @@ -255,9 +255,9 @@ Cluster-wide defaults belong on the operator's CLI flags (read from a ConfigMap ### Reference PRs -These are good examples to skim before opening your first reconciler PR — each one followed this exact recipe: +These are good examples to skim before opening your first reconciler PR. Each one followed this exact recipe: -- [#13 Crash respawn](https://github.com/amcheste/claude-teams-operator/pull/133) — controller state machine + metrics + webhook + tests across all three layers -- [#14 Per-agent RBAC](https://github.com/amcheste/claude-teams-operator/pull/134) — CRD-less feature: just controller logic + scoped Roles + RBAC markers -- [#15 create-pr](https://github.com/amcheste/claude-teams-operator/pull/135) — new internal package (`internal/github`) + controller wiring + httptest-backed tests -- [#16 push-branch](https://github.com/amcheste/claude-teams-operator/pull/148) — async terminal Job + status mirror + envtest integration spec +- [#13 Crash respawn](https://github.com/amcheste/claude-teams-operator/pull/133). Controller state machine + metrics + webhook + tests across all three layers +- [#14 Per-agent RBAC](https://github.com/amcheste/claude-teams-operator/pull/134). CRD-less feature: just controller logic + scoped Roles + RBAC markers +- [#15 create-pr](https://github.com/amcheste/claude-teams-operator/pull/135). New internal package (`internal/github`) + controller wiring + httptest-backed tests +- [#16 push-branch](https://github.com/amcheste/claude-teams-operator/pull/148). Async terminal Job + status mirror + envtest integration spec diff --git a/README.md b/README.md index 9feeb08..d6db3f3 100644 --- a/README.md +++ b/README.md @@ -15,9 +15,9 @@ --- -> **kagents** is the project brand. The implementation lives in the [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator) repository and ships under the `claude.amcheste.io/v1alpha1` API group. Documentation site: [kagents.dev](https://kagents.dev) (under construction — see [v0.7.0 milestone](https://github.com/amcheste/claude-teams-operator/milestone/8)). +> **kagents** is the project brand. The implementation lives in the [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator) repository and ships under the `claude.amcheste.io/v1alpha1` API group. Documentation site: [kagents.dev](https://kagents.dev) (under construction. See [v0.7.0 milestone](https://github.com/amcheste/claude-teams-operator/milestone/8)). -Claude Code [Agent Teams](https://docs.anthropic.com/en/docs/claude-code/agent-teams) let multiple Claude Code instances collaborate — a lead coordinates work via a shared task list while teammates communicate through peer-to-peer mailboxes. Natively this runs on a single machine using tmux. This operator lifts that pattern into Kubernetes so you can run large-scale agent teams on your cluster. +Claude Code [Agent Teams](https://docs.anthropic.com/en/docs/claude-code/agent-teams) let multiple Claude Code instances collaborate. A lead coordinates work via a shared task list while teammates communicate through peer-to-peer mailboxes. Natively this runs on a single machine using tmux. This operator lifts that pattern into Kubernetes so you can run large-scale agent teams on your cluster. ## Modes @@ -32,23 +32,23 @@ Both modes share the same coordination protocol (shared PVCs, mailboxes, task li ## Features -- **Native Agent Teams protocol** — preserves Anthropic's file-based mailbox and task list format over ReadWriteMany PVCs; no protocol translation -- **Per-teammate git worktrees** — each coding agent works on an isolated branch to prevent merge conflicts -- **Cowork mode** — mount ConfigMap/PVC inputs and collect outputs without requiring a git repo -- **Skills as CRD fields** — mount Claude Code skills from ConfigMaps into each agent's `.claude/skills/` -- **MCP servers per agent** — configure Model Context Protocol connections per teammate -- **Approval gates** — pause spawning specific teammates until a human applies an annotation -- **Budget enforcement** — terminate the team if estimated API cost exceeds a configured limit -- **Timeout enforcement** — terminate the team after a configurable wall-clock duration -- **`dependsOn` ordering** — spawn teammates only after their declared dependencies complete -- **Reusable templates** — define team patterns with `AgentTeamTemplate`, instantiate with `AgentTeamRun` +- **Native Agent Teams protocol**. Preserves Anthropic's file-based mailbox and task list format over ReadWriteMany PVCs; no protocol translation +- **Per-teammate git worktrees**. Each coding agent works on an isolated branch to prevent merge conflicts +- **Cowork mode**. Mount ConfigMap/PVC inputs and collect outputs without requiring a git repo +- **Skills as CRD fields**. Mount Claude Code skills from ConfigMaps into each agent's `.claude/skills/` +- **MCP servers per agent**. Configure Model Context Protocol connections per teammate +- **Approval gates**. Pause spawning specific teammates until a human applies an annotation +- **Budget enforcement**. Terminate the team if estimated API cost exceeds a configured limit +- **Timeout enforcement**. Terminate the team after a configurable wall-clock duration +- **`dependsOn` ordering**. Spawn teammates only after their declared dependencies complete +- **Reusable templates**. Define team patterns with `AgentTeamTemplate`, instantiate with `AgentTeamRun` ## Quick Start ### Prerequisites - Kubernetes 1.28+ -- ReadWriteMany PVC support (NFS, EFS, or a compatible CSI driver — see [ARCHITECTURE.md § Storage Requirements](ARCHITECTURE.md#storage-requirements) for options) +- ReadWriteMany PVC support (NFS, EFS, or a compatible CSI driver. See [ARCHITECTURE.md § Storage Requirements](ARCHITECTURE.md#storage-requirements) for options) - Claude Code CLI access (Max subscription or API key) - Opus 4.6 model access (required for Agent Teams) @@ -222,7 +222,7 @@ The primary resource. Defines the full team, its workspace, lifecycle, and obser ### AgentTeamTemplate -A reusable team pattern. Does not run on its own — instantiate with `AgentTeamRun`. +A reusable team pattern. Does not run on its own. Instantiate with `AgentTeamRun`. ### AgentTeamRun diff --git a/SECURITY.md b/SECURITY.md index 6778b5b..96690f6 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -13,28 +13,28 @@ The latest release is the most recent `v*` tag on https://github.com/amcheste/cl ## Reporting a vulnerability -**Please do not open a public issue, Discussion, or pull request for security vulnerabilities.** Use GitHub's [private vulnerability reporting](https://github.com/amcheste/claude-teams-operator/security/advisories/new) instead — that surface lets you submit confidentially, and the maintainer can collaborate with you on a fix without the report being visible to anyone else until it's resolved. +**Please do not open a public issue, Discussion, or pull request for security vulnerabilities.** Use GitHub's [private vulnerability reporting](https://github.com/amcheste/claude-teams-operator/security/advisories/new) instead. That surface lets you submit confidentially, and the maintainer can collaborate with you on a fix without the report being visible to anyone else until it's resolved. Please include in your report: - A clear description of the vulnerability - Steps to reproduce (or a proof-of-concept manifest / kubectl invocation) - The kagents version you observed it on -- Potential impact — what an attacker could achieve, and against what cluster topology +- Potential impact. What an attacker could achieve, and against what cluster topology ## Coordinated disclosure expectations We follow a **coordinated disclosure** process: -1. **Acknowledgement** — within **7 days** of your report, the maintainer will confirm receipt and start triage. -2. **Triage + fix** — within **30 days**, you will receive either a fix candidate, a status update with a clear timeline, or a written explanation of why the report doesn't qualify as a vulnerability. -3. **Embargo** — fix development happens in private. We ask you to keep the issue confidential until the fix ships and is publicly announced. We will not embargo for longer than 90 days from the original report without your agreement. -4. **Public disclosure** — once the fix is released, we publish a [GitHub Security Advisory](https://github.com/amcheste/claude-teams-operator/security/advisories) with the details, affected versions, mitigation steps, and credit to you (unless you ask to remain anonymous). -5. **CVE assignment** — if the issue qualifies, we request a CVE through GitHub's CNA before public disclosure. +1. **Acknowledgement**. Within **7 days** of your report, the maintainer will confirm receipt and start triage. +2. **Triage + fix**. Within **30 days**, you will receive either a fix candidate, a status update with a clear timeline, or a written explanation of why the report doesn't qualify as a vulnerability. +3. **Embargo**. Fix development happens in private. We ask you to keep the issue confidential until the fix ships and is publicly announced. We will not embargo for longer than 90 days from the original report without your agreement. +4. **Public disclosure**. Once the fix is released, we publish a [GitHub Security Advisory](https://github.com/amcheste/claude-teams-operator/security/advisories) with the details, affected versions, mitigation steps, and credit to you (unless you ask to remain anonymous). +5. **CVE assignment**. If the issue qualifies, we request a CVE through GitHub's CNA before public disclosure. ## What counts as a security issue -If you're not sure whether something is a vulnerability or a bug, err on the side of reporting it through the private channel — it's easy to move a non-security report to a public issue, but a public report of a real vulnerability is unfixable damage. +If you're not sure whether something is a vulnerability or a bug, err on the side of reporting it through the private channel. It's easy to move a non-security report to a public issue, but a public report of a real vulnerability is unfixable damage. In-scope examples: @@ -54,4 +54,4 @@ Out of scope (please file as regular GitHub issues): ## Hardening checklist for operators -For users deploying kagents in production, the [Operations explanation](https://kagents.dev/explanation/operations/) covers the defense-in-depth model — per-agent ServiceAccounts, the file-based-protocol threat model, and what RBAC does and doesn't enforce. Reading that page before going live is recommended. +For users deploying kagents in production, the [Operations explanation](https://kagents.dev/explanation/operations/) covers the defense-in-depth model. Per-agent ServiceAccounts, the file-based-protocol threat model, and what RBAC does and doesn't enforce. Reading that page before going live is recommended. diff --git a/docs/README.md b/docs/README.md index 84fc85f..6980717 100644 --- a/docs/README.md +++ b/docs/README.md @@ -17,4 +17,4 @@ A push to `main` that touches `docs/`, `mkdocs.yml`, or `.github/workflows/docs. ## Structure -The site uses the [Diátaxis framework](https://diataxis.fr) — four sections: Tutorials, How-to guides, Reference, Explanation. Section pages will be filled in by the v0.7.0 content issues. For now only the homepage exists. +The site uses the [Diátaxis framework](https://diataxis.fr). Four sections: Tutorials, How-to guides, Reference, Explanation. Section pages will be filled in by the v0.7.0 content issues. For now only the homepage exists. diff --git a/docs/cfp/cfp-draft.md b/docs/cfp/cfp-draft.md deleted file mode 100644 index d502b8f..0000000 --- a/docs/cfp/cfp-draft.md +++ /dev/null @@ -1,156 +0,0 @@ -# KubeCon NA 2026 — kagents CFP Draft - -> **Project:** **kagents** ([kagents.dev](https://kagents.dev)) — implementation in [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator). -> -> Draft submission for issue [#23](https://github.com/amcheste/claude-teams-operator/issues/23). Conference: KubeCon + CloudNativeCon North America 2026, Salt Lake City, Nov 9–12. CFP deadline: **May 31, 2026 at 11:59pm MT**. Submit at https://sessionize.com/kubecon-cloudnativecon-north-america-2026/. -> -> This is a starting draft. Every field below is meant to be edited. Open questions for the maintainer are listed at the bottom. - ---- - -## Submission metadata - -| Field | Recommendation | Rationale | -|-------|----------------|-----------| -| **Submission type** | Session Presentation (30 min) | The form's options are 5 / 30 / 75 minutes. 30 fits the demo-heavy structure without padding. Tutorial (75 min) is the alternative if the maintainer wants hands-on. | -| **Track (primary)** | AI Inference + Agentic | New track for 2026. Direct fit: this is a system for running agent workloads on K8s. | -| **Track (alternate)** | Platform Engineering | Reasonable alternate angle: the operator *is* platform infra for agent teams. Pick AI Inference + Agentic if both feel viable, since the program committee may load-balance between them. | -| **Audience level** | Intermediate | Assumes operator-pattern literacy (CRDs, reconcile loops, RBAC, PVC access modes). Does not assume Claude Code or LLM background. | -| **Case study?** | No | This is a project talk, not a deployment retrospective. | - ---- - -## Abstract title (75 char max) - -**Primary:** - -``` -Reconciling Agent Teams: A Kubernetes Operator for Claude Code -``` - -(62 characters) - -**Alternates worth considering:** - -``` -Stateless Agents, Stateful Cluster: K8s for Claude Code Agent Teams -``` - -(67 characters — leans harder into the architectural narrative from KUBECON.md: the agent forgets, the cluster remembers.) - -``` -The Operator Pattern for Multi-Agent Coding Teams -``` - -(49 characters — most general, drops the Claude Code brand. Use if the program committee tends to read brand-named titles as vendor pitches.) - ---- - -## Abstract (1,300 char max) - -> Most multi-agent orchestration frameworks treat Kubernetes as deployment infrastructure: pods that happen to run an LLM. This talk shows what changes when the cluster becomes the coordination fabric. **kagents** ([kagents.dev](https://kagents.dev)) runs Anthropic's Claude Code Agent Teams as a CRD-driven workload, preserving the native file-based mailbox protocol over a ReadWriteMany PVC instead of inventing a new one. An AgentTeam resource declares a lead, teammates, budget, quality gates, and lifecycle policy in a single spec. The reconciler provisions per-teammate git worktrees, scopes each pod with its own ServiceAccount, and re-spawns crashed agents using the durable task list as recovery state. The agent does not remember the conversation, but the task list on the PVC tells the fresh pod what work remains. The talk walks through the architectural choices that made this work in K8s: why agent state lives on a PVC instead of a CRD status field, why RestartPolicy is Never, what RWX storage you actually need in production versus on a laptop, and how Prometheus metrics, webhooks, and human approval gates plug into the reconcile loop. A live demo deploys a coding team to a Kind cluster, shows mailbox traffic between pods, kills a teammate, and watches the operator respawn it from the task list. - -(~1,290 characters, against a 1,300 limit. The buffer is small. Trim if any field is added during iteration.) - ---- - -## Audience - -> Platform engineers, operator authors, and SREs who run Kubernetes and are evaluating how to host multi-agent LLM workloads without building a custom protocol. Attendees should be comfortable with the operator pattern (CRDs, controllers, reconcile loops), Kubernetes RBAC, and PVC access modes. Familiarity with Claude Code or Agent Teams is helpful but not required; the talk explains the native protocol and the K8s primitives it maps to. Attendees will leave with a clear picture of which Kubernetes building blocks translate cleanly to agent workloads (git worktrees as a concurrency primitive, ServiceAccounts as per-agent capability boundaries, owner references for cascade deletion of team state) and which assumptions break down at scale (CRD status as long-running state, single-node RWO fallbacks, real-time cost tracking). - ---- - -## Benefits to the ecosystem (1,000 char max) - -> Cloud-native multi-agent systems are a 2026 priority for both CNCF and individual platform teams, but most current solutions invent new orchestration protocols and layer them on top of Kubernetes. This talk demonstrates the alternative: model the agent team as a first-class Kubernetes resource and let existing primitives do the coordination work. The architectural patterns generalize beyond Claude Code; any multi-agent system with file-based or shared-state coordination can adopt the same approach. The talk surfaces the honest tradeoffs (ReadWriteMany storage cost, estimation-based budget tracking, the limits of CRD status for long-running state) so attendees can evaluate whether the pattern fits their workloads. The operator is open source under Apache 2.0, ships with a published Helm chart and Prometheus dashboard, and gates every release on a real-Claude end-to-end test in CI. - -(~970 characters) - ---- - -## Open source projects discussed - -- **kagents** ([kagents.dev](https://kagents.dev)) — the operator itself, Apache 2.0; implementation at [claude-teams-operator](https://github.com/amcheste/claude-teams-operator) -- [Kubernetes](https://github.com/kubernetes/kubernetes) — the platform; specifically `controller-runtime`, `kubebuilder`, RBAC, PVC subsystem -- [Prometheus](https://github.com/prometheus/prometheus) and [Grafana](https://github.com/grafana/grafana) — metrics scraping and the published dashboard ConfigMap -- [Helm](https://github.com/helm/helm) — chart packaging and release distribution -- Anthropic's Claude Code Agent Teams protocol — the native file-based coordination format the operator preserves (Claude Code itself is not open source; the protocol behavior is documented and stable enough to wrap as-is) - ---- - -## Reviewer-facing talk outline (~30 min) - -This expands on the abstract — provided in case the Sessionize form exposes a longer description field, and to anchor the demo plan. - -| Time | Beat | -|------|------| -| 0:00 | The problem framing. Most agent frameworks bolt onto Kubernetes; this talk argues for the inverse — Kubernetes primitives doing the coordination work. | -| 2:00 | Native Agent Teams in 60 seconds: file-based JSON mailboxes, shared task list, no session resumption. Why this protocol is unusually well-suited to a shared filesystem. | -| 5:00 | The `AgentTeam` CRD: one spec for a whole team (lead + teammates + lifecycle + budget). Contrast with agent-as-a-resource designs. | -| 8:00 | Phase state machine: `Pending → Initializing → Running → Completed/Failed/TimedOut/BudgetExceeded`. How state transitions map to actual K8s objects (PVCs, init Job, pods). | -| 11:00 | The ReadWriteMany requirement, in detail. Why coordination over a PVC actually works. What fails on RWO. The single-node RWO fallback used in CI and what it can and cannot prove. | -| 14:00 | Per-agent RBAC. Each pod gets its own ServiceAccount with `resourceNames`-restricted Roles on the secrets and PVCs it owns. A free security win that non-native orchestrators have to reinvent. | -| 16:00 | **Demo 1 — Crash recovery.** Deploy a coding team. Show mailbox files appearing on the PVC. Kill a teammate pod. Watch the reconciler respawn it. The fresh agent has no conversation memory, but the task list tells it what is left. | -| 21:00 | `onComplete` actions: `create-pr` opens a real GitHub PR via the REST API; `push-branch` consolidates per-teammate worktree branches into one head via a Job. The worktree-as-concurrency-primitive story. | -| 24:00 | **Demo 2 — Observability.** Prometheus metrics, the Grafana dashboard ConfigMap, an approval gate firing a webhook before a sensitive teammate spawns. | -| 27:00 | Honest tradeoffs we are still working through: estimation-based budget tracking, real multi-node test coverage, the limits of CRD status as a substitute for a workflow engine. | -| 29:00 | Wrap and pointers (repo, Helm chart, contributor docs). | -| 30:00 | Q&A. | - ---- - -## Demo plan - -Two demos, both runnable on a laptop with Kind: - -1. **Crash recovery (5 min, on stage).** Deploy a 3-agent `AgentTeam` from a sample manifest, watch pods come up, observe mailbox JSON appearing on the shared PVC, `kubectl delete pod` one teammate, watch the reconciler respawn it. The point is to show the agent's lost context window does not lose the team's progress, because the task list is durable. - -2. **Observability and gates (3 min, on stage).** Bring up the Grafana dashboard against the operator's Prometheus metrics. Trigger an approval gate so a webhook fires; grant approval via `kubectl annotate`; watch the gated teammate spawn. - -Both demos run today against the shipped v0.5.0 release. Backup recordings will be prepared in case live demo bandwidth fails on the venue Wi-Fi. - ---- - -## Speaker bio - -> _TBD — see open questions._ - ---- - -## Prior speaking history - -> _TBD — see open questions._ - ---- - -## Open questions for the maintainer - -These are the items that need maintainer input before submission: - -1. **Speaker bio** — short paragraph (≤ ~500 chars) covering current role, relevant background, and any past public talks or projects. Include a recent headshot upload-ready. -2. **Prior speaking history** — has the maintainer presented at a CNCF event in the past 12 months? The form asks for video links if so. -3. **Track preference** — primary recommendation here is **AI Inference + Agentic**; the alternate is **Platform Engineering**. Which one does the maintainer want as the primary track? (Submitting to one does not preclude the program committee from re-routing.) -4. **Title preference** — three candidates above. Maintainer's call. -5. **Co-speaker?** — solo or two-speaker? The form allows up to two on a Session Presentation. -6. **Tutorial alternate?** — if the talk lands strongly, a 75-minute Tutorial slot is also viable (deploy a team in real time, walk through the CRD field by field). Worth submitting both? The CFP allows up to three submissions per speaker. -7. **Demo cluster** — confirm the on-stage cluster is Kind on a laptop, vs. an actual cloud cluster. Bandwidth and predictability favor Kind; "real cluster" favors the multi-node RWX story. -8. **Release alignment** — the v0.6.0 (Operator Dashboard) and v1.0.0 (Demo Polish) milestones land before the conference. Should the dashboard be part of the demo, or kept as a parallel track? Including it strengthens the story but adds a moving piece to rehearse. - ---- - -## Notes on substance - -Everything in the abstract and the outline maps to shipped, tested code in v0.1.0–v0.5.0: - -- AgentTeam CRD with single-spec team declaration → [api/v1alpha1/agentteam_types.go](../../api/v1alpha1/agentteam_types.go) -- Reconciler phase state machine → [internal/controller/agentteam_controller.go](../../internal/controller/agentteam_controller.go), see also [ARCHITECTURE.md § Phase State Machine](../../ARCHITECTURE.md#phase-state-machine) -- ReadWriteMany PVC coordination + single-node fallback → [ARCHITECTURE.md § Storage Requirements](../../ARCHITECTURE.md#storage-requirements), [hack/mailbox-smoke-test.sh](../../hack/mailbox-smoke-test.sh) -- Per-agent ServiceAccounts with `resourceNames`-restricted Roles — shipped in v0.4.0 (#14) -- Crash respawn with restart counters — v0.4.0 (#13) -- `onComplete: create-pr` — v0.4.0 (#15); `onComplete: push-branch` — v0.4.0 (#16) -- Prometheus metrics + Grafana dashboard ConfigMap — v0.3.0 -- Webhook engine + approval gates — v0.3.0 -- AgentTeamTemplate + AgentTeamRun controllers — v0.5.0 (#17, #18) -- Real-Claude E2E gate before release publishes — v0.4.0 (#150) - -No claim in this draft refers to unshipped work. diff --git a/docs/explanation/coordination.md b/docs/explanation/coordination.md index d5ce19a..bb17b98 100644 --- a/docs/explanation/coordination.md +++ b/docs/explanation/coordination.md @@ -4,15 +4,15 @@ This is the load-bearing design choice in kagents: agent-to-agent communication ## Why a shared filesystem instead of a message bus? -Anthropic's Claude Code Agent Teams runs natively on a single machine using tmux. Multiple Claude Code instances coordinate via files in `~/.claude/teams/` — JSON inboxes for peer-to-peer messages, a JSON task list for shared work tracking. The protocol is unspecified beyond "look at the files." +Anthropic's Claude Code Agent Teams runs natively on a single machine using tmux. Multiple Claude Code instances coordinate via files in `~/.claude/teams/`. JSON inboxes for peer-to-peer messages, a JSON task list for shared work tracking. The protocol is unspecified beyond "look at the files." We could have translated this to Redis, NATS, or a custom gRPC service. We chose not to: -- **No protocol versioning to track.** Claude Code owns the format. When it ships a v2 mailbox schema, kagents inherits it for free — we never read or write the contents. +- **No protocol versioning to track.** Claude Code owns the format. When it ships a v2 mailbox schema, kagents inherits it for free. We never read or write the contents. - **No translation layer to debug.** When something goes wrong, you can `kubectl exec` into a pod and inspect the actual files Claude Code is reading and writing. There's no opaque protocol bridge in the middle. - **No additional infrastructure.** A bare RWX PVC is enough. No Redis to operate, no message-bus HA story. -The cost is real — ReadWriteMany storage isn't free on every cluster, and we have to be honest about that. +The cost is real. ReadWriteMany storage isn't free on every cluster, and we have to be honest about that. ## Mailbox layout @@ -70,7 +70,7 @@ graph TB style O fill:#f3e5f5,stroke:#7b1fa2 ``` -The `team-state` PVC is the coordination fabric — it carries the mailboxes and the task list. The `repo` PVC (coding mode) carries the git clone and per-teammate worktrees. The `output` PVC (Cowork mode) is where agents write artifacts. +The `team-state` PVC is the coordination fabric. It carries the mailboxes and the task list. The `repo` PVC (coding mode) carries the git clone and per-teammate worktrees. The `output` PVC (Cowork mode) is where agents write artifacts. In practice the operator mounts the team-state PVC into each pod, and the entrypoint symlinks the `teams/` and `tasks/` subdirectories into `~/.claude/`: @@ -92,7 +92,7 @@ If the backing PVC supports only `ReadWriteOnce`, the second pod fails to mount ### Supported backends -The operator has no opinion about the CSI driver — it asks for an RWX PVC and a `storageClassName` you supply. Backends that satisfy the contract: +The operator has no opinion about the CSI driver. It asks for an RWX PVC and a `storageClassName` you supply. Backends that satisfy the contract: | Platform | Driver | Notes | |----------|--------|-------| @@ -104,7 +104,7 @@ The operator has no opinion about the CSI driver — it asks for an RWX PVC and ### Single-node fallback -For laptops, Kind, k3d, minikube — a real RWX provisioner is overkill. The operator accepts a `--pvc-access-mode=ReadWriteOnce` flag. This works **only** because every pod lands on the same node, and a hostPath-backed RWO PVC is then visible to all of them. +For laptops, Kind, k3d, minikube. A real RWX provisioner is overkill. The operator accepts a `--pvc-access-mode=ReadWriteOnce` flag. This works **only** because every pod lands on the same node, and a hostPath-backed RWO PVC is then visible to all of them. !!! danger "Don't use RWO on a multi-node cluster" A second pod scheduled on a different node will fail to mount the PVC and the team will deadlock. The single-node fallback is a development convenience, not a production option. @@ -154,7 +154,7 @@ When `spec.workspace` is set instead of `spec.repository`, the operator skips th - Mounts `workspace.inputs` (ConfigMaps or existing PVCs) read-only into each pod - Doesn't set `WORKTREE_PATH`; agents work in `/workspace/output` or `/workspace/data` -The mailbox protocol is identical — Cowork agents still coordinate via `~/.claude/teams/.../inboxes/`. The only difference is what filesystem they're writing artifacts into. +The mailbox protocol is identical. Cowork agents still coordinate via `~/.claude/teams/.../inboxes/`. The only difference is what filesystem they're writing artifacts into. ## What this means for debugging @@ -168,5 +168,5 @@ There's no opaque coordinator process to dump. Everything Claude Code knows abou ## Where to look next -- [Resource model](resources.md) — the CRDs that compose into a running team -- [Operations](operations.md) — budget, RBAC, and observability for the running team +- [Resource model](resources.md). The CRDs that compose into a running team +- [Operations](operations.md). Budget, RBAC, and observability for the running team diff --git a/docs/explanation/index.md b/docs/explanation/index.md index d67bfcd..852f6f8 100644 --- a/docs/explanation/index.md +++ b/docs/explanation/index.md @@ -1,15 +1,15 @@ # Explanation -The "why" behind kagents — architecture, design tradeoffs, the choices that shaped the project. Read these when you want to understand what's actually happening, not just how to use it. +The "why" behind kagents. Architecture, design tradeoffs, the choices that shaped the project. Read these when you want to understand what's actually happening, not just how to use it. ## Pages -- **[Resource model](resources.md)** — the three CRDs (`AgentTeam`, `AgentTeamTemplate`, `AgentTeamRun`), how they relate, and when to reach for which. -- **[Coordination protocol](coordination.md)** — the file-based mailbox model, why ReadWriteMany is required, per-teammate git worktrees as a concurrency primitive. -- **[Operations](operations.md)** — budget estimation, per-agent RBAC, observability via Prometheus + Grafana + webhooks. +- **[Resource model](resources.md)**. The three CRDs (`AgentTeam`, `AgentTeamTemplate`, `AgentTeamRun`), how they relate, and when to reach for which. +- **[Coordination protocol](coordination.md)**. The file-based mailbox model, why ReadWriteMany is required, per-teammate git worktrees as a concurrency primitive. +- **[Operations](operations.md)**. Budget estimation, per-agent RBAC, observability via Prometheus + Grafana + webhooks. ## Going deeper -The repo's [`ARCHITECTURE.md`](https://github.com/amcheste/claude-teams-operator/blob/main/ARCHITECTURE.md) is the design doc — denser, more focused on rationale than on usage. It overlaps with these pages but goes further into the file-by-file structure of the codebase. +The repo's [`ARCHITECTURE.md`](https://github.com/amcheste/claude-teams-operator/blob/main/ARCHITECTURE.md) is the design doc. Denser, more focused on rationale than on usage. It overlaps with these pages but goes further into the file-by-file structure of the codebase. The [KubeCon NA 2026 talk](https://github.com/amcheste/claude-teams-operator/blob/main/KUBECON.md) frames the same architecture from the conference angle (interesting problems encountered, competitive landscape, design decisions worth surfacing on stage). diff --git a/docs/explanation/operations.md b/docs/explanation/operations.md index f7d9513..415dd90 100644 --- a/docs/explanation/operations.md +++ b/docs/explanation/operations.md @@ -26,7 +26,7 @@ The reconciler compares `status.estimatedCostUsd` against `spec.lifecycle.budget 3. `status.completedAt` is stamped 4. A `webhook.budgetExceeded` event fires (if configured) -There's no grace period — the team stops the moment the estimate crosses. Set the limit with headroom. +There's no grace period. The team stops the moment the estimate crosses. Set the limit with headroom. ### Honest tradeoffs @@ -34,7 +34,7 @@ This is the lightest-touch approach available without instrumenting Claude Code. - **Estimate, not measurement.** Real token usage depends on prompt length, context window growth, and how often the agent reaches for tools. The estimate can be off by 2-3x in either direction. - **Heuristic is per-active-minute.** An agent waiting on `dependsOn` doesn't accrue cost; one running flat out at the same rate as one mostly idle does. The heuristic averages the difference away. -- **Rate table is hardcoded.** The token-per-minute heuristic and the per-million prices live in `internal/budget/tracker.go`. Adjusting them requires a code change and rebuild — config-via-Helm-values is on the roadmap. +- **Rate table is hardcoded.** The token-per-minute heuristic and the per-million prices live in `internal/budget/tracker.go`. Adjusting them requires a code change and rebuild. Config-via-Helm-values is on the roadmap. For production, set `budgetLimit` ~2x what you actually want to spend, and treat the budget as a circuit breaker rather than a precise meter. Real cost tracking via instrumented Claude Code or sidecar log parsing is on the roadmap; until then, the [Anthropic console](https://console.anthropic.com/) is the source of truth for accounting. @@ -77,10 +77,10 @@ The threat model is "a teammate's prompt is malicious or compromised." The blast - ✅ Cannot read another teammate's secrets (different SA) - ✅ Cannot exec into the lead pod (no `pods/exec`) - ✅ Cannot enumerate cluster state (no list verbs on namespace-wide resources) -- ⚠️ Can write to the shared `team-state` PVC — a malicious teammate could poison the task list or write to a peer's inbox. This is inherent to the file-based protocol; mitigations would require Claude Code to authenticate writes. +- ⚠️ Can write to the shared `team-state` PVC. A malicious teammate could poison the task list or write to a peer's inbox. This is inherent to the file-based protocol; mitigations would require Claude Code to authenticate writes. - ⚠️ Can write to the shared `repo` PVC. Worktrees are isolated by branch, but the agent could `cd` to a peer's worktree. -The RBAC model handles the K8s side cleanly; the filesystem-level threats need protocol-level signing to fully address. For most use cases — internal CI, trusted prompts — the filesystem trust model is acceptable. +The RBAC model handles the K8s side cleanly; the filesystem-level threats need protocol-level signing to fully address. For most use cases. Internal CI, trusted prompts. The filesystem trust model is acceptable. ## Observability @@ -163,6 +163,6 @@ Within 30 seconds (the default reconcile interval), the gated teammate spawns an ## Where to look next -- [Resource model](resources.md) — what an `AgentTeam` looks like under the hood -- [Coordination protocol](coordination.md) — how the agents actually talk to each other -- [How-to guides](../how-to/index.md) — concrete operational recipes (coming in v0.7.0) +- [Resource model](resources.md). What an `AgentTeam` looks like under the hood +- [Coordination protocol](coordination.md). How the agents actually talk to each other +- [How-to guides](../how-to/index.md). Concrete operational recipes (coming in v0.7.0) diff --git a/docs/explanation/resources.md b/docs/explanation/resources.md index 9265083..97d5c40 100644 --- a/docs/explanation/resources.md +++ b/docs/explanation/resources.md @@ -76,18 +76,18 @@ Pending ─────► Initializing ─────► Running ──── Failed ``` -Terminal phases (`Completed`, `Failed`, `TimedOut`, `BudgetExceeded`) trigger cleanup — pods get deleted, `status.completedAt` gets stamped, the reconciler stops requeuing. +Terminal phases (`Completed`, `Failed`, `TimedOut`, `BudgetExceeded`) trigger cleanup. Pods get deleted, `status.completedAt` gets stamped, the reconciler stops requeuing. Other status fields worth knowing: -- `status.lead.phase` and `status.teammates[].phase` — per-pod state -- `status.estimatedCostUsd` — budget tracker output (see [Operations](operations.md)) -- `status.consolidatedBranch` — populated when `onComplete: push-branch` runs -- `status.conditions` — Kubernetes-style conditions array +- `status.lead.phase` and `status.teammates[].phase`. Per-pod state +- `status.estimatedCostUsd`. Budget tracker output (see [Operations](operations.md)) +- `status.consolidatedBranch`. Populated when `onComplete: push-branch` runs +- `status.conditions`. Kubernetes-style conditions array ## AgentTeamTemplate -A reusable team blueprint. Does not run on its own — it sits inert until an `AgentTeamRun` references it. +A reusable team blueprint. Does not run on its own. It sits inert until an `AgentTeamRun` references it. ```yaml apiVersion: claude.amcheste.io/v1alpha1 @@ -166,7 +166,7 @@ graph TD style D fill:#e1f5ff,stroke:#0288d1 ``` -The Template+Run pattern shines when you want the same team shape (same lead prompt, same teammate roles) parameterised by repo, branch, or per-run prompt overrides. For a one-off job, the indirection is overhead — just write an `AgentTeam` directly. +The Template+Run pattern shines when you want the same team shape (same lead prompt, same teammate roles) parameterised by repo, branch, or per-run prompt overrides. For a one-off job, the indirection is overhead. Just write an `AgentTeam` directly. ## Worked example: security review across three repos @@ -240,12 +240,12 @@ Three concurrent reviews. One template definition. Updating the template (e.g. t ## Owner references and cascade delete -Every child resource — pods, PVCs, ConfigMaps, the init Job, per-agent ServiceAccounts and Roles — has an owner reference to the `AgentTeam`. Deleting the `AgentTeam` cascades to all of them via Kubernetes garbage collection. +Every child resource. Pods, PVCs, ConfigMaps, the init Job, per-agent ServiceAccounts and Roles. Has an owner reference to the `AgentTeam`. Deleting the `AgentTeam` cascades to all of them via Kubernetes garbage collection. If the team was created by an `AgentTeamRun`, that adds another layer: deleting the `AgentTeamRun` cascades to the `AgentTeam` (which then cascades to everything else). One `kubectl delete agentteamrun` is sufficient teardown. ## Where to look next -- [Coordination protocol](coordination.md) — how the agents actually talk to each other -- [Operations](operations.md) — budget, RBAC, and observability -- [API reference (coming in v0.7.0)](../reference/index.md) — every field, every type, every default +- [Coordination protocol](coordination.md). How the agents actually talk to each other +- [Operations](operations.md). Budget, RBAC, and observability +- [API reference (coming in v0.7.0)](../reference/index.md). Every field, every type, every default diff --git a/docs/helm-values.md b/docs/helm-values.md index 59bb39a..793e1cf 100644 --- a/docs/helm-values.md +++ b/docs/helm-values.md @@ -69,7 +69,7 @@ The operator pod is single-replica and lightweight by default. Bump limits if yo ## Storage -Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the storage class must support `ReadWriteMany` for multi-pod teams (NFS, EFS, CephFS) — see [ARCHITECTURE.md § Storage Requirements](../ARCHITECTURE.md#storage-requirements). +Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the storage class must support `ReadWriteMany` for multi-pod teams (NFS, EFS, CephFS). See [ARCHITECTURE.md § Storage Requirements](../ARCHITECTURE.md#storage-requirements). | Key | Default | Description | |---|---|---| @@ -77,7 +77,7 @@ Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the s | `storage.teamStateSize` | `5Gi` | Size of the team-state PVC (mailboxes + task list). | | `storage.repoSize` | `20Gi` | Size of the per-team repo PVC (clones + worktrees). | -## Metrics — Service + ServiceMonitor +## Metrics. Service + ServiceMonitor | Key | Default | Description | |---|---|---| @@ -88,7 +88,7 @@ Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the s | `metrics.serviceMonitor.interval` | `30s` | Prometheus scrape interval. | | `metrics.serviceMonitor.additionalLabels` | `{}` | Extra labels on the ServiceMonitor. Match your Prometheus CR's selector — e.g. `{release: kube-prometheus-stack}`. | -## Metrics — Grafana dashboard +## Metrics. Grafana dashboard Renders a ConfigMap holding a 10-panel Grafana dashboard for Claude team observability. With kube-prometheus-stack, the Grafana sidecar auto-imports any ConfigMap carrying the configured label. diff --git a/docs/how-to/index.md b/docs/how-to/index.md index 007e678..1c2cde6 100644 --- a/docs/how-to/index.md +++ b/docs/how-to/index.md @@ -1,14 +1,14 @@ # How-to guides -Recipes for solving specific operational tasks. These assume you already have kagents installed and at least a basic working AgentTeam — if not, start with the [Getting Started tutorial](../tutorials/getting-started.md). +Recipes for solving specific operational tasks. These assume you already have kagents installed and at least a basic working AgentTeam. If not, start with the [Getting Started tutorial](../tutorials/getting-started.md). ## Install Cloud-specific install paths covering the ReadWriteMany storage configuration that's the actual deployment friction point on each cloud: -- **[Install on Amazon EKS](install/eks.md)** — EFS CSI driver + EFS file system + Access Points -- **[Install on Google GKE](install/gke.md)** — Filestore CSI driver + Filestore instance -- **[Install on Azure AKS](install/aks.md)** — Azure Files CSI driver + Premium NFS share +- **[Install on Amazon EKS](install/eks.md)**. EFS CSI driver + EFS file system + Access Points +- **[Install on Google GKE](install/gke.md)**. Filestore CSI driver + Filestore instance +- **[Install on Azure AKS](install/aks.md)**. Azure Files CSI driver + Premium NFS share Each guide ends with the same `make mailbox-smoke-test` verification step. @@ -16,9 +16,9 @@ Each guide ends with the same `make mailbox-smoke-test` verification step. Day-to-day operational tasks once kagents is running: -- **[Expose the dashboard](operate/expose-dashboard.md)** — port-forward for dev, Ingress with basic auth for prod, oauth2-proxy for corporate SSO, namespace-scoping -- **[Configure shared storage](operate/shared-storage.md)** — sizing the team-state / repo / output PVCs, backup strategies per cloud backend, performance tuning recipes -- **[Set budget alerts](operate/budget-alerts.md)** — per-team `budgetLimit`, chart-wide default, webhook events to Slack/PagerDuty, Prometheus alert rules +- **[Expose the dashboard](operate/expose-dashboard.md)**. Port-forward for dev, Ingress with basic auth for prod, oauth2-proxy for corporate SSO, namespace-scoping +- **[Configure shared storage](operate/shared-storage.md)**. Sizing the team-state / repo / output PVCs, backup strategies per cloud backend, performance tuning recipes +- **[Set budget alerts](operate/budget-alerts.md)**. Per-team `budgetLimit`, chart-wide default, webhook events to Slack/PagerDuty, Prometheus alert rules ## Looking for something else? diff --git a/docs/how-to/install/aks.md b/docs/how-to/install/aks.md index 6b1ebf1..10a0c7e 100644 --- a/docs/how-to/install/aks.md +++ b/docs/how-to/install/aks.md @@ -8,7 +8,7 @@ This guide walks you from a working AKS cluster to a running kagents operator ba - `kubectl` configured against the cluster - `helm` 3.14+ - `az` CLI authenticated with the subscription that owns the cluster -- The cluster's resource group and node resource group — `az aks show -g -n ` shows them +- The cluster's resource group and node resource group. `az aks show -g -n ` shows them ## 1. Verify the Azure Files CSI driver is enabled @@ -105,7 +105,7 @@ Azure Files Premium (FileStorage SKU) is billed by **provisioned capacity** per - **Price**: ~$0.16/GiB-month for Premium NFS in most regions, plus tiny per-operation fees. - **Network**: free within the same Azure region. -A 100 GiB Premium share is **~$16/month**. That's enough for tens of concurrent teams' worth of mailbox state. For larger teams or longer retention, scale capacity up — Azure Files Premium auto-scales IOPS proportional to provisioned size. +A 100 GiB Premium share is **~$16/month**. That's enough for tens of concurrent teams' worth of mailbox state. For larger teams or longer retention, scale capacity up. Azure Files Premium auto-scales IOPS proportional to provisioned size. The honest range for a small production install is **$15–$50/month** depending on how aggressively you scale capacity for performance. @@ -127,10 +127,10 @@ The honest range for a small production install is **$15–$50/month** depending Azure Files NFS without `nconnect=4` can be 2-3x slower than expected. Add the mount option in the StorageClass and recreate any pods using existing PVCs to pick it up. ??? warning "Cannot use Standard or Premium_ZRS SKU" - Only `Premium_LRS` supports NFS. Standard SMB shares technically support RWX but the file-locking semantics don't work for the mailbox protocol — use Premium NFS. + Only `Premium_LRS` supports NFS. Standard SMB shares technically support RWX but the file-locking semantics don't work for the mailbox protocol. Use Premium NFS. ## Where to look next -- [Resource model](../../explanation/resources.md) — the CRDs you'll be writing -- [Coordination protocol](../../explanation/coordination.md) — why RWX matters in detail -- [Operations](../../explanation/operations.md) — budget, RBAC, observability for the running operator +- [Resource model](../../explanation/resources.md). The CRDs you'll be writing +- [Coordination protocol](../../explanation/coordination.md). Why RWX matters in detail +- [Operations](../../explanation/operations.md). Budget, RBAC, observability for the running operator diff --git a/docs/how-to/install/eks.md b/docs/how-to/install/eks.md index 83db26c..71a0e62 100644 --- a/docs/how-to/install/eks.md +++ b/docs/how-to/install/eks.md @@ -8,7 +8,7 @@ This guide walks you from a working EKS cluster to a running kagents operator ba - `kubectl` configured against the cluster - `helm` 3.14+ - `aws` CLI authenticated with permissions to create EFS file systems and IAM policies -- The cluster's VPC ID and the security group used by your worker nodes — `aws eks describe-cluster --name ` shows them +- The cluster's VPC ID and the security group used by your worker nodes. `aws eks describe-cluster --name ` shows them ## 1. Install the EFS CSI driver @@ -119,7 +119,7 @@ A passing run looks like: PASS StorageClass=nfs AccessMode=ReadWriteMany RoundTripMs=842 ``` -If `AccessMode` reports `ReadWriteOnce` or the test fails to schedule the second pod, your StorageClass isn't actually advertising RWX — re-check step 3. +If `AccessMode` reports `ReadWriteOnce` or the test fails to schedule the second pod, your StorageClass isn't actually advertising RWX. Re-check step 3. ## Cost notes @@ -127,7 +127,7 @@ EFS is billed by storage GB-month + provisioned throughput. For a typical kagent - **Storage**: 1-5 GiB per team. At ~$0.30/GiB-month (Standard storage class), expect $0.50–$2/month for storage. - **Throughput**: in `elastic` mode you pay per byte read/written (~$0.01/GiB). Idle teams cost nothing; active teams during a busy period might generate a few GiB of traffic per day. -- **Per-mount cost**: nothing — EFS mount targets are free. +- **Per-mount cost**: nothing. EFS mount targets are free. The honest range for a small production install is **$5–$30/month**. For larger scale see the [EFS pricing page](https://aws.amazon.com/efs/pricing/). @@ -147,6 +147,6 @@ The honest range for a small production install is **$5–$30/month**. For large ## Where to look next -- [Resource model](../../explanation/resources.md) — the CRDs you'll be writing -- [Coordination protocol](../../explanation/coordination.md) — why RWX matters in detail -- [Operations](../../explanation/operations.md) — budget, RBAC, observability for the running operator +- [Resource model](../../explanation/resources.md). The CRDs you'll be writing +- [Coordination protocol](../../explanation/coordination.md). Why RWX matters in detail +- [Operations](../../explanation/operations.md). Budget, RBAC, observability for the running operator diff --git a/docs/how-to/install/gke.md b/docs/how-to/install/gke.md index ea0584b..25509d7 100644 --- a/docs/how-to/install/gke.md +++ b/docs/how-to/install/gke.md @@ -8,7 +8,7 @@ This guide walks you from a working GKE cluster to a running kagents operator ba - `kubectl` configured against the cluster - `helm` 3.14+ - `gcloud` CLI authenticated with the project that owns the cluster -- The cluster's VPC network and region — `gcloud container clusters describe ` shows them +- The cluster's VPC network and region. `gcloud container clusters describe ` shows them ## 1. Enable the Filestore CSI driver @@ -30,7 +30,7 @@ kubectl get pods -n kube-system -l k8s-app=gcp-filestore-csi-driver ## 2. Create the StorageClass -The driver supports dynamic provisioning, so you don't need to create a Filestore instance manually — the CSI driver creates one when the first PVC binds. +The driver supports dynamic provisioning, so you don't need to create a Filestore instance manually. The CSI driver creates one when the first PVC binds. ```yaml title="storageclass-filestore.yaml" apiVersion: storage.k8s.io/v1 @@ -84,7 +84,7 @@ A passing run reports the effective StorageClass and AccessMode: PASS StorageClass=nfs AccessMode=ReadWriteMany RoundTripMs=623 ``` -The first `make mailbox-smoke-test` run on Filestore takes a few minutes — Filestore instance provisioning is the slow step (~3-5 min). Subsequent test runs reuse the instance and complete in under 30s. +The first `make mailbox-smoke-test` run on Filestore takes a few minutes. Filestore instance provisioning is the slow step (~3-5 min). Subsequent test runs reuse the instance and complete in under 30s. ## Cost notes @@ -94,7 +94,7 @@ Filestore is billed by provisioned capacity per hour, not actual usage: - **Premium tier (SSD)**: ~$0.30/GiB-month. Same 1 TiB minimum. - **Enterprise tier (HA, regional)**: ~$0.60/GiB-month. 2.5 TiB minimum. -Note that **each PVC creates a new Filestore instance by default** with this StorageClass config. If you're running many teams, this gets expensive fast — at least one instance per PVC times the 1 TiB minimum. +Note that **each PVC creates a new Filestore instance by default** with this StorageClass config. If you're running many teams, this gets expensive fast. At least one instance per PVC times the 1 TiB minimum. For multi-team production use, set `volumeHandle` on a manually-provisioned shared Filestore instance and use sub-directory provisioning instead. See [GKE's Filestore docs](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver) for the multi-PVC pattern. @@ -103,7 +103,7 @@ The honest range for a small production install with one shared Filestore instan ## Common gotchas ??? warning "PVC stuck in `Pending` with `does not satisfy capacity`" - Filestore instances have a 1 TiB minimum size. The kagents chart's default `storage.teamStateSize` is `5Gi`, but Filestore will round it up to the tier minimum. The PVC binds successfully — the warning resolves once provisioning completes (3-5 min). + Filestore instances have a 1 TiB minimum size. The kagents chart's default `storage.teamStateSize` is `5Gi`, but Filestore will round it up to the tier minimum. The PVC binds successfully. The warning resolves once provisioning completes (3-5 min). ??? warning "`failed to create filestore instance: insufficient quota`" Filestore instances count against a project-wide quota. `gcloud compute regions describe ` shows current usage. Request a quota increase via the GCP console. @@ -116,6 +116,6 @@ The honest range for a small production install with one shared Filestore instan ## Where to look next -- [Resource model](../../explanation/resources.md) — the CRDs you'll be writing -- [Coordination protocol](../../explanation/coordination.md) — why RWX matters in detail -- [Operations](../../explanation/operations.md) — budget, RBAC, observability for the running operator +- [Resource model](../../explanation/resources.md). The CRDs you'll be writing +- [Coordination protocol](../../explanation/coordination.md). Why RWX matters in detail +- [Operations](../../explanation/operations.md). Budget, RBAC, observability for the running operator diff --git a/docs/how-to/operate/budget-alerts.md b/docs/how-to/operate/budget-alerts.md index 7354438..fb9c612 100644 --- a/docs/how-to/operate/budget-alerts.md +++ b/docs/how-to/operate/budget-alerts.md @@ -20,7 +20,7 @@ spec: budgetLimit: "10.00" # USD ``` -There's no grace period — the team stops the moment the estimate crosses. The estimate is conservative-to-the-low-side (~50K input + 5K output tokens per agent per minute is a rough ballpark), so set the limit with **2x headroom** over what you actually want to spend. +There's no grace period. The team stops the moment the estimate crosses. The estimate is conservative-to-the-low-side (~50K input + 5K output tokens per agent per minute is a rough ballpark), so set the limit with **2x headroom** over what you actually want to spend. ## Chart-wide default @@ -33,11 +33,11 @@ helm upgrade kagents \ --set defaultBudgetLimit=15.00 ``` -This is a safety net, not a recommendation — every team should set its own `budgetLimit` based on the work it's doing. The default exists to prevent a misconfigured team from running unbounded. +This is a safety net, not a recommendation. Every team should set its own `budgetLimit` based on the work it's doing. The default exists to prevent a misconfigured team from running unbounded. ## Webhook events on threshold crossings -The operator fires a `budget.warning` webhook event when a team's estimated cost crosses **80% of its `budgetLimit`** — useful as an early warning before the hard stop fires. +The operator fires a `budget.warning` webhook event when a team's estimated cost crosses **80% of its `budgetLimit`**. Useful as an early warning before the hard stop fires. ### Configure the webhook URL @@ -109,8 +109,8 @@ PagerDuty's [Events API v2](https://developer.pagerduty.com/docs/events-api-v2/o For teams that already have a Prometheus + Alertmanager stack, alert directly on the metrics the chart exposes. The relevant series: -- `claude_team_cost_usd{team_name=...}` — current estimated cost -- `claude_team_budget_remaining_usd{team_name=...}` — `limit - cost` +- `claude_team_cost_usd{team_name=...}`. Current estimated cost +- `claude_team_budget_remaining_usd{team_name=...}`. `limit - cost` ### Alert: budget about to be exceeded @@ -186,6 +186,6 @@ If the estimate is consistently 50% low, double your `budgetLimit` headroom. If ## Where to look next -- [Operations explanation](../../explanation/operations.md) — how the budget is computed in detail -- [Expose the dashboard](expose-dashboard.md) — visual cost view per team -- [Configure shared storage](shared-storage.md) — the other recurring cost on a kagents install +- [Operations explanation](../../explanation/operations.md). How the budget is computed in detail +- [Expose the dashboard](expose-dashboard.md). Visual cost view per team +- [Configure shared storage](shared-storage.md). The other recurring cost on a kagents install diff --git a/docs/how-to/operate/expose-dashboard.md b/docs/how-to/operate/expose-dashboard.md index 00a4c59..35a0ecb 100644 --- a/docs/how-to/operate/expose-dashboard.md +++ b/docs/how-to/operate/expose-dashboard.md @@ -1,6 +1,6 @@ # Expose the dashboard -The dashboard ships with kagents but is **off by default** — installing the chart alone gives you the controller and CRDs only. This guide walks through enabling it and exposing it for the three most common scenarios. +The dashboard ships with kagents but is **off by default**. Installing the chart alone gives you the controller and CRDs only. This guide walks through enabling it and exposing it for the three most common scenarios. For why the dashboard is off by default and what it can show, see the [Operations explanation](../../explanation/operations.md). @@ -39,7 +39,7 @@ kubectl port-forward -n claude-teams-system svc/kagents-dashboard 8080:8080 Open http://localhost:8080. You'll see the team list view; click any team for the detail page with live SSE updates. -`port-forward` is fine for dev but is a single-user tunnel through your local kubeconfig — don't rely on it for shared access. +`port-forward` is fine for dev but is a single-user tunnel through your local kubeconfig. Don't rely on it for shared access. ## Scenario 2: production (Ingress with basic auth) @@ -100,11 +100,11 @@ The pattern: 2. Point your Ingress at oauth2-proxy instead of the dashboard 3. Configure oauth2-proxy's `--upstream` flag to forward authenticated requests to `http://kagents-dashboard:8080` -This is a standard pattern with extensive documentation in the oauth2-proxy project. The dashboard itself doesn't need to change — it stays on the internal Service, and oauth2-proxy handles all authentication and group/role checks before requests reach it. +This is a standard pattern with extensive documentation in the oauth2-proxy project. The dashboard itself doesn't need to change. It stays on the internal Service, and oauth2-proxy handles all authentication and group/role checks before requests reach it. ## Scoping the dashboard to one namespace -By default the dashboard sees AgentTeams in **every** namespace (a `ClusterRoleBinding` grants read across the cluster). To restrict it to a single namespace — e.g. when teams in different namespaces belong to different tenants: +By default the dashboard sees AgentTeams in **every** namespace (a `ClusterRoleBinding` grants read across the cluster). To restrict it to a single namespace, e.g. when teams in different namespaces belong to different tenants: ```bash helm upgrade kagents \ @@ -127,10 +127,10 @@ Once the dashboard is reachable, deploy a quick test team and open the detail vi kubectl apply -n dev-agents -f config/samples/auth-refactor-team.yaml ``` -The list view should show the team. Click in — the detail page streams live status updates via SSE; killing a teammate pod with `kubectl delete pod ...` should cause the page to redraw within a second or two. +The list view should show the team. Click in. The detail page streams live status updates via SSE; killing a teammate pod with `kubectl delete pod ...` should cause the page to redraw within a second or two. ## Where to look next -- [Operations explanation](../../explanation/operations.md) — what the dashboard's metrics and alerts look like -- [Configure shared storage](shared-storage.md) — sizing and tuning the PVC backends -- [Set budget alerts](budget-alerts.md) — wiring webhook alerts on cost overruns +- [Operations explanation](../../explanation/operations.md). What the dashboard's metrics and alerts look like +- [Configure shared storage](shared-storage.md). Sizing and tuning the PVC backends +- [Set budget alerts](budget-alerts.md). Wiring webhook alerts on cost overruns diff --git a/docs/how-to/operate/shared-storage.md b/docs/how-to/operate/shared-storage.md index 72bb39d..038b403 100644 --- a/docs/how-to/operate/shared-storage.md +++ b/docs/how-to/operate/shared-storage.md @@ -36,7 +36,7 @@ spec: ## Backup -For most use cases the team-state PVC can be discarded — the mailbox is intermediate state, and finished teams' artifacts live elsewhere (in the git remote or in the Cowork output PVC). For the cases where you do want backups: +For most use cases the team-state PVC can be discarded. The mailbox is intermediate state, and finished teams' artifacts live elsewhere (in the git remote or in the Cowork output PVC). For the cases where you do want backups: ### EFS (EKS) @@ -90,7 +90,7 @@ The dominant workload is small synchronous writes (mailbox JSON updates) and sma ### EFS -- **Throughput mode**: `elastic` is the right default — pay per byte, scale automatically. Switch to `provisioned` only if you measure consistent saturation in CloudWatch's `BurstCreditBalance` metric. +- **Throughput mode**: `elastic` is the right default. Pay per byte, scale automatically. Switch to `provisioned` only if you measure consistent saturation in CloudWatch's `BurstCreditBalance` metric. - **Performance mode**: `generalPurpose` for <7,000 file ops/sec total across all teams (the typical case). `maxIO` only if you exceed that; it adds 1-3ms latency per op which hurts mailbox round-trips. - **Mount options**: defaults are fine. The CSI driver applies `nfsvers=4.1, rsize=1048576, wsize=1048576` by default. @@ -101,7 +101,7 @@ The dominant workload is small synchronous writes (mailbox JSON updates) and sma ### Azure Files (Premium NFS) -- **Mount option `nconnect=4`** is the single biggest performance win. Without it, expect 2-3x slower mailbox round-trips. Set it in the StorageClass — see the [AKS install guide](../install/aks.md#3-create-the-storageclass). +- **Mount option `nconnect=4`** is the single biggest performance win. Without it, expect 2-3x slower mailbox round-trips. Set it in the StorageClass. See the [AKS install guide](../install/aks.md#3-create-the-storageclass). - **Provisioned IOPS**: Azure Files Premium gives baseline IOPS proportional to provisioned size (1 IOPS per GiB). For a 100 GiB share, you get ~100 IOPS baseline + bursting. Raise capacity for more IOPS, not for more storage you don't need. ## Monitoring storage health @@ -112,10 +112,10 @@ Use the Prometheus metrics the chart exposes (see the [Operations explanation](. - **Filestore**: `nfs/server/operation_count`, `nfs/server/free_bytes_percent` in Cloud Monitoring - **Azure Files**: `Transactions`, `SuccessE2ELatency` in Azure Monitor -A sudden spike in operation count without a corresponding rise in active teams usually indicates a stuck-poll loop in one team — `kubectl describe agentteam ` to investigate. +A sudden spike in operation count without a corresponding rise in active teams usually indicates a stuck-poll loop in one team. `kubectl describe agentteam ` to investigate. ## Where to look next -- [Coordination protocol](../../explanation/coordination.md) — what the storage is actually carrying -- [Set budget alerts](budget-alerts.md) — wiring cost overruns into your alert pipeline -- [Expose the dashboard](expose-dashboard.md) — visual storage-load view +- [Coordination protocol](../../explanation/coordination.md). What the storage is actually carrying +- [Set budget alerts](budget-alerts.md). Wiring cost overruns into your alert pipeline +- [Expose the dashboard](expose-dashboard.md). Visual storage-load view diff --git a/docs/index.md b/docs/index.md index 44dd9f0..cc46df3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -35,13 +35,13 @@ helm install kagents \ --- - One `AgentTeam` CRD declares roles, budget, quality gates, and coordination topology. `AgentTeamTemplate` lets you reuse common team patterns — "3-agent security review," "fullstack feature team" — with one-line instantiation. + One `AgentTeam` CRD declares roles, budget, quality gates, and coordination topology. `AgentTeamTemplate` lets you reuse common team patterns. "3-agent security review," "fullstack feature team". With one-line instantiation. - :material-kubernetes:{ .lg .middle } **K8s as coordination fabric** --- - ServiceAccounts scope what each agent pod can touch. RWX PVCs hold the shared mailboxes. RBAC enforces per-agent capability boundaries. The cluster does the coordination work — kagents just wires it up. + ServiceAccounts scope what each agent pod can touch. RWX PVCs hold the shared mailboxes. RBAC enforces per-agent capability boundaries. The cluster does the coordination work. Kagents just wires it up. - :material-recycle-variant:{ .lg .middle } **Dogfooded** @@ -61,7 +61,7 @@ helm install kagents \ - :material-cog: **[How-to guides](how-to/index.md)** - Recipes for specific operational tasks — install on a cloud, expose the dashboard, tune budgets. + Recipes for specific operational tasks. Install on a cloud, expose the dashboard, tune budgets. - :material-book-open-variant: **[Reference](reference/index.md)** @@ -69,7 +69,7 @@ helm install kagents \ - :material-lightbulb: **[Explanation](explanation/index.md)** - How and why kagents works the way it does — the architecture, the design tradeoffs. + How and why kagents works the way it does. The architecture, the design tradeoffs. @@ -85,6 +85,6 @@ helm install kagents \ - :material-presentation:{ .lg .middle } **Talk** - *Reconciling Agent Teams: A Kubernetes Operator for Claude Code* — KubeCon NA 2026 (submitted). + *Reconciling Agent Teams: A Kubernetes Operator for Claude Code*. KubeCon NA 2026 (submitted). diff --git a/docs/reference/index.md b/docs/reference/index.md index 26b7e3f..887d193 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -1,14 +1,14 @@ # Reference -The lookup tables — every CRD field, every Helm value, every CLI flag, with no narrative wrapping. +The lookup tables. Every CRD field, every Helm value, every CLI flag, with no narrative wrapping. ## Pages -- **[API reference](api/index.md)** — auto-generated field-by-field detail for `AgentTeam`, `AgentTeamTemplate`, and `AgentTeamRun`. Regenerated from the kubebuilder markers in `api/v1alpha1/` on every site build via `make docs-api`. +- **[API reference](api/index.md)**. Auto-generated field-by-field detail for `AgentTeam`, `AgentTeamTemplate`, and `AgentTeamRun`. Regenerated from the kubebuilder markers in `api/v1alpha1/` on every site build via `make docs-api`. ## Coming next -- **Helm chart values** — every chart value documented with defaults and production override recipes (will migrate from the existing in-repo [`docs/helm-values.md`](https://github.com/amcheste/claude-teams-operator/blob/main/docs/helm-values.md)) +- **Helm chart values**. Every chart value documented with defaults and production override recipes (will migrate from the existing in-repo [`docs/helm-values.md`](https://github.com/amcheste/claude-teams-operator/blob/main/docs/helm-values.md)) ## Looking for a tutorial or recipe? diff --git a/docs/tutorials/getting-started.md b/docs/tutorials/getting-started.md index 8b98cd3..6d89051 100644 --- a/docs/tutorials/getting-started.md +++ b/docs/tutorials/getting-started.md @@ -6,7 +6,7 @@ This tutorial walks you from a fresh laptop to a running AgentTeam in about 15 m - A small Cowork-mode AgentTeam that researches a topic and writes a summary file - The know-how to inspect what's happening with `kubectl` and the dashboard -You don't need any cloud accounts or external services — everything runs on your laptop. +You don't need any cloud accounts or external services. Everything runs on your laptop. ## Prerequisites @@ -31,7 +31,7 @@ cd claude-teams-operator make kind-create ``` -This creates a Kind cluster named `claude-teams` with a local-path storage class aliased as `nfs`. On a single-node cluster every pod runs on the same node, so a hostPath volume is visible to all pods — that's our RWX-equivalent for laptop testing. +This creates a Kind cluster named `claude-teams` with a local-path storage class aliased as `nfs`. On a single-node cluster every pod runs on the same node, so a hostPath volume is visible to all pods. That's our RWX-equivalent for laptop testing. !!! note "Production deployments need a real RWX backend" For real multi-node clusters you'll need NFS, EFS, Filestore, or Azure Files. The Kind setup is a single-node convenience, not the production story. See the *Concept: file-based mailbox protocol* page (coming in v0.7.0) for why. @@ -77,7 +77,7 @@ Replace `sk-ant-...` with your actual key from [console.anthropic.com](https://c ## 4. Apply your first AgentTeam -This is a small Cowork-mode team — no git repo, just an output volume. The lead coordinates a single writer agent that produces a Markdown file. +This is a small Cowork-mode team. No git repo, just an output volume. The lead coordinates a single writer agent that produces a Markdown file. ```yaml title="hello-team.yaml" apiVersion: claude.amcheste.io/v1alpha1 @@ -185,18 +185,18 @@ make kind-delete ## What you just did -A real Kubernetes operator just orchestrated two Claude Code instances communicating via a shared filesystem to produce real output, with K8s primitives doing the coordination work — RWX PVC for the mailbox, ServiceAccounts for per-agent identity, owner references for cleanup. No custom protocol, no orchestrator service, no daemon outside the cluster. +A real Kubernetes operator just orchestrated two Claude Code instances communicating via a shared filesystem to produce real output, with K8s primitives doing the coordination work. RWX PVC for the mailbox, ServiceAccounts for per-agent identity, owner references for cleanup. No custom protocol, no orchestrator service, no daemon outside the cluster. ## Where to go next -- **[How-to guides](../how-to/index.md)** — install on a real cloud, expose the dashboard, set budget alerts -- **[Reference](../reference/index.md)** — every CRD field and Helm value documented -- **[Explanation](../explanation/index.md)** — how the file-based mailbox protocol actually works under the hood +- **[How-to guides](../how-to/index.md)**. Install on a real cloud, expose the dashboard, set budget alerts +- **[Reference](../reference/index.md)**. Every CRD field and Helm value documented +- **[Explanation](../explanation/index.md)**. How the file-based mailbox protocol actually works under the hood ## Common errors ??? warning "`PVCs stuck in Pending`" - The operator requires a ReadWriteMany-capable StorageClass. On a Kind cluster, `make kind-create` sets one up under the alias `nfs`. If you're using your own cluster, check `kubectl get sc` — there must be one named `nfs` (or you need to pass `--set storage.storageClassName=` when installing the chart). + The operator requires a ReadWriteMany-capable StorageClass. On a Kind cluster, `make kind-create` sets one up under the alias `nfs`. If you're using your own cluster, check `kubectl get sc`. There must be one named `nfs` (or you need to pass `--set storage.storageClassName=` when installing the chart). ??? warning "`Pod stuck in CrashLoopBackOff`" Check the agent pod logs: `kubectl logs -n dev-agents hello-team-writer`. The most common cause is a missing or invalid Anthropic API key. Re-create the Secret with `kubectl create secret generic anthropic-api-key --namespace dev-agents --from-literal=ANTHROPIC_API_KEY=... --dry-run=client -o yaml | kubectl apply -f -`. diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index ad15ef6..b076289 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -1,12 +1,12 @@ # Tutorials -Step-by-step lessons that take you from zero to a working AgentTeam. Read these top-to-bottom — they assume you're new to the project. +Step-by-step lessons that take you from zero to a working AgentTeam. Read these top-to-bottom. They assume you're new to the project. ## Available tutorials -- **[Getting started](getting-started.md)** — install kagents on a Kind cluster and run your first AgentTeam end-to-end. ~15 minutes. +- **[Getting started](getting-started.md)**. Install kagents on a Kind cluster and run your first AgentTeam end-to-end. ~15 minutes. -More tutorials will be added as the project matures. Have a use case you'd like a tutorial for — a security review team, a doc-generation team, multi-cluster fan-out? [Open a Discussion](https://github.com/amcheste/claude-teams-operator/discussions/categories/ideas) and tell us about it. +More tutorials will be added as the project matures. Have a use case you'd like a tutorial for. A security review team, a doc-generation team, multi-cluster fan-out? [Open a Discussion](https://github.com/amcheste/claude-teams-operator/discussions/categories/ideas) and tell us about it. ## Looking for something else? From f06bf14d075ade617c0684fae2c1bd8bd85b4339 Mon Sep 17 00:00:00 2001 From: amcheste-ai-agent <278991699+amcheste-ai-agent@users.noreply.github.com> Date: Mon, 11 May 2026 17:07:09 -0400 Subject: [PATCH 2/2] chore: declare No-Linear-Issue trailer for CI The brand-alignment sweep in 25fcf54 has no associated Linear ticket. The validate.yml Linear-ref check requires either an AMC-N reference or a No-Linear-Issue trailer. PR body was updated with the trailer; this empty commit re-triggers the synchronize event on the workflow. No-Linear-Issue: brand-alignment doc cleanup, no associated Linear ticket Co-Authored-By: Claude Opus 4.7 (1M context) Co-Authored-By: amcheste <13696614+amcheste@users.noreply.github.com>