diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 0000000..a66ccdf --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,15 @@ +# Disables blank issues so contributors choose one of the structured +# templates above. Routes "I have a question" / "I have an idea" to +# Discussions instead of issues. +blank_issues_enabled: false + +contact_links: + - name: Question or discussion + url: https://github.com/amcheste/claude-teams-operator/discussions + about: For general questions, design discussions, or help requests, please open a Discussion instead of an issue. + - name: Documentation site + url: https://kagents.dev + about: Tutorials, how-to guides, concept pages, and CRD reference are at kagents.dev. + - name: Security vulnerability + url: https://github.com/amcheste/claude-teams-operator/security/advisories/new + about: Report security vulnerabilities privately via GitHub Security Advisories — see SECURITY.md. diff --git a/.github/ISSUE_TEMPLATE/docs_issue.yml b/.github/ISSUE_TEMPLATE/docs_issue.yml new file mode 100644 index 0000000..01db5e5 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/docs_issue.yml @@ -0,0 +1,42 @@ +name: Documentation Issue +description: Something on https://kagents.dev (or in-repo docs) is wrong, missing, confusing, or out of date. +labels: [documentation] +body: + - type: dropdown + id: location + attributes: + label: Where did you encounter this? + options: + - kagents.dev (docs site) + - README.md + - ARCHITECTURE.md + - CONTRIBUTING.md + - In-repo docs (docs/ tree) + - kubectl explain output (CRD docstring) + - Other + validations: + required: true + + - type: input + id: page + attributes: + label: Page or section + placeholder: e.g. https://kagents.dev/tutorials/getting-started/ or "Resource model > AgentTeamRun" + validations: + required: true + + - type: textarea + id: problem + attributes: + label: What's wrong? + description: Describe the issue. Wrong instruction? 
Stale info? Confusing wording? Missing example? + validations: + required: true + + - type: textarea + id: suggestion + attributes: + label: Proposed fix (optional) + description: If you have a specific suggestion — wording, an example, a diagram — share it here. PRs welcome too. + validations: + required: false diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index d3b7998..c182ec2 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -2,6 +2,15 @@ +## Linear + + + +Fixes AMC-XXX + ## Changes diff --git a/.github/release-announcements/v0.7.0-rc.1.md b/.github/release-announcements/v0.7.0-rc.1.md new file mode 100644 index 0000000..0bccbda --- /dev/null +++ b/.github/release-announcements/v0.7.0-rc.1.md @@ -0,0 +1,83 @@ +# v0.7.0-rc.1 — Documentation Site preview + +> Draft announcement for the v0.7.0 release candidate. Tone is transparency-first: this is a preview, we want feedback, here's what's in it. Edit this file freely before posting — it's a working draft, not a contract. + +--- + +## Short version (Discussions / Slack / social card) + +**kagents v0.7.0-rc.1 is live — first preview of the documentation site at https://kagents.dev.** + +Six weeks of work to replace the README-only experience with a proper docs site: tutorials, how-to guides, an auto-generated CRD reference, and concept pages on the architecture. Built with mkdocs-material, deployed via GitHub Pages. + +This is a **preview release** — we're shaking out polish items before the stable v0.7.0 cut. Please poke at it and tell us what's broken, missing, or confusing: + +- Found a typo or broken link? [Docs issue](https://github.com/amcheste/claude-teams-operator/issues/new?template=docs_issue.yml) +- Confused by a concept? [Discussion](https://github.com/amcheste/claude-teams-operator/discussions/categories/q-a) +- Have an idea for a missing tutorial? 
[Idea](https://github.com/amcheste/claude-teams-operator/discussions/categories/ideas) + +We'll hold the stable cut until the rough edges are smoothed. + +— Alan + +--- + +## Long version (GitHub release body) + +### What's in this preview + +The v0.7.0 milestone ships a [polished documentation site at kagents.dev](https://kagents.dev), structured around the [Diátaxis framework](https://diataxis.fr) — four sections, each with a clear purpose: + +**📘 [Tutorials](https://kagents.dev/tutorials/)** +- [Getting Started](https://kagents.dev/tutorials/getting-started/) — install kagents on a Kind cluster and run your first AgentTeam end-to-end. ~15 minutes, no cloud accounts needed. + +**🔧 [How-to guides](https://kagents.dev/how-to/)** +- [Install on Amazon EKS](https://kagents.dev/how-to/install/eks/) (EFS CSI driver + EFS file system + Access Points) +- [Install on Google GKE](https://kagents.dev/how-to/install/gke/) (Filestore CSI driver + Filestore instance) +- [Install on Azure AKS](https://kagents.dev/how-to/install/aks/) (Azure Files CSI driver + Premium NFS share) +- [Expose the dashboard](https://kagents.dev/how-to/operate/expose-dashboard/) (port-forward, Ingress + basic auth, oauth2-proxy) +- [Configure shared storage](https://kagents.dev/how-to/operate/shared-storage/) (sizing, backup, perf tuning per backend) +- [Set budget alerts](https://kagents.dev/how-to/operate/budget-alerts/) (per-team limits, webhook events, Prometheus rules) + +**📚 [Reference](https://kagents.dev/reference/)** +- [API reference](https://kagents.dev/reference/api/) — auto-generated from the kubebuilder markers in `api/v1alpha1/`. CI's `Check API reference docs are up to date` step keeps it in lockstep with the code. + +**💡 [Explanation](https://kagents.dev/explanation/)** +- [Resource model](https://kagents.dev/explanation/resources/) — the `AgentTeam` / `AgentTeamTemplate` / `AgentTeamRun` CRDs and how they relate, with a worked "3-agent security review across multiple repos" example. 
+- [Coordination protocol](https://kagents.dev/explanation/coordination/) — the load-bearing design choice. File-based mailboxes over RWX PVCs, per-teammate git worktrees as a concurrency primitive, the single-node fallback story. +- [Operations](https://kagents.dev/explanation/operations/) — honest breakdown of estimation-based budget tracking, per-agent RBAC's threat model, and the eight Prometheus metrics the operator exposes. + +### Other v0.7.0 changes + +- **`make docs-api`** — new make target regenerates the API reference from kubebuilder markers via [`crd-ref-docs`](https://github.com/elastic/crd-ref-docs). Wired into the lint job's drift check, mirrors the existing `make manifests` pattern. +- **mkdocs-material site infrastructure** — `docs/` tree, `mkdocs.yml`, `docs/requirements.txt`, and `.github/workflows/docs.yml` (deploys to `gh-pages` on every push to `main`). +- **Community baseline** — adopted [Contributor Covenant v2.1](https://github.com/amcheste/claude-teams-operator/blob/main/CODE_OF_CONDUCT.md), polished CONTRIBUTING.md with Linear ↔ GitHub guidance for contributors, hardened SECURITY.md with explicit coordinated-disclosure expectations, added a docs-issue template, enabled GitHub Discussions. +- **kagents brand introduced** — README, KUBECON.md, and CFP draft all aligned around the new public name. Repo, Helm chart, image names, and CRD group (`claude.amcheste.io/v1alpha1`) intentionally unchanged — same pattern as Argo CD, Knative, Kueue. + +### Upgrade notes + +This release is **docs-only** — zero functional changes to the operator, dashboard, or CRDs. Existing v0.6.0 installs continue to work without any migration. Helm chart version is bumped purely for tagging cohesion. 
+ +### What's not yet in v0.7.0 (and won't block the stable cut) + +- The "Helm chart values" page under [Reference](https://kagents.dev/reference/) — the existing in-repo [`docs/helm-values.md`](https://github.com/amcheste/claude-teams-operator/blob/main/docs/helm-values.md) is still the source of truth. Migration to the docs site is tracked for a future minor release. + +### What we want feedback on + +Please tell us if you find: + +- **Broken anything** — links, code blocks that don't run, commands that fail. [Docs issue template](https://github.com/amcheste/claude-teams-operator/issues/new?template=docs_issue.yml). +- **Confusing concepts** — a passage you read three times. [Q&A discussion](https://github.com/amcheste/claude-teams-operator/discussions/categories/q-a). +- **Missing tutorials** — a use case you'd want a step-by-step guide for. [Ideas discussion](https://github.com/amcheste/claude-teams-operator/discussions/categories/ideas). +- **Wrong or stale technical details** — especially in the cloud-install guides. The author wrote them based on documented patterns; real platform engineers on each cloud will catch nuances that haven't been hit yet. + +### What's next + +We'll hold the v0.7.0 stable cut until the obvious rough edges are smoothed — likely 1–2 weeks. After that: + +- **v0.4.0/v0.5.0/v0.6.0 retrospective** — these milestones are all 100% complete in code; we'll close out any straggling Linear hygiene. +- **v1.0.0 — KubeCon Demo Polish** (target: Oct 2026). The on-stage demo script, real-API E2E gating in CI, OCI skill distribution, and a presentation-mode dashboard view. KubeCon NA 2026 is Nov 9–12 in Salt Lake City. + +Thanks for reading. Try the [Getting Started tutorial](https://kagents.dev/tutorials/getting-started/) and let us know how it goes. 
+ +— Alan, with help from the Claude Code agent team that built it diff --git a/.github/workflows/auto-assign.yml b/.github/workflows/auto-assign.yml deleted file mode 100644 index 08c25d5..0000000 --- a/.github/workflows/auto-assign.yml +++ /dev/null @@ -1,21 +0,0 @@ -name: Auto-assign PR creator -on: - pull_request: - types: [opened] - -jobs: - assign: - runs-on: ubuntu-latest - permissions: - pull-requests: write - issues: write - steps: - - uses: actions/github-script@v9 - with: - script: | - await github.rest.issues.addAssignees({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: context.issue.number, - assignees: [context.payload.pull_request.user.login] - }); diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml new file mode 100644 index 0000000..3b434e0 --- /dev/null +++ b/.github/workflows/docs.yml @@ -0,0 +1,46 @@ +name: Deploy Docs + +on: + push: + branches: [main] + paths: + - 'docs/**' + - 'mkdocs.yml' + - '.github/workflows/docs.yml' + workflow_dispatch: + +permissions: + contents: write + +concurrency: + group: docs-deploy + cancel-in-progress: false + +jobs: + deploy: + name: Build and Deploy + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6 + with: + # Required for git-revision-date-localized to read commit history. + fetch-depth: 0 + + - uses: actions/setup-python@v6 + with: + python-version: '3.12' + cache: pip + cache-dependency-path: docs/requirements.txt + + - name: Install mkdocs-material + run: pip install -r docs/requirements.txt + + - name: Configure git for gh-deploy + run: | + git config --global user.name "github-actions[bot]" + git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com" + + # mkdocs gh-deploy builds the site and force-pushes to the gh-pages + # branch in a single command. The branch is created on first run. 
+ - name: Build + deploy to gh-pages + run: mkdocs gh-deploy --force --no-history diff --git a/.github/workflows/release-drafter.yml b/.github/workflows/release-drafter.yml index 1d7df64..44ef789 100644 --- a/.github/workflows/release-drafter.yml +++ b/.github/workflows/release-drafter.yml @@ -16,6 +16,6 @@ jobs: name: Update Release Draft runs-on: ubuntu-latest steps: - - uses: release-drafter/release-drafter@6a93d829887aa2e0748befe2e808c66c0ec6e4c7 # v6 + - uses: release-drafter/release-drafter@c2e2804cc59f45f57076a99af580d0fedb697927 # v7.3.0 env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/scorecard.yml b/.github/workflows/scorecard.yml index 3716ce2..54bcfb7 100644 --- a/.github/workflows/scorecard.yml +++ b/.github/workflows/scorecard.yml @@ -5,7 +5,7 @@ on: schedule: - cron: '30 1 * * 1' # Every Monday at 01:30 UTC push: - branches: [main] + branches: [main, develop] workflow_dispatch: permissions: read-all @@ -28,7 +28,7 @@ jobs: with: results_file: results.sarif results_format: sarif - publish_results: ${{ github.ref == 'refs/heads/main' }} + publish_results: ${{ github.ref_name == github.event.repository.default_branch }} - uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7 with: @@ -36,6 +36,6 @@ jobs: path: results.sarif retention-days: 5 - - uses: github/codeql-action/upload-sarif@d4b3ca9fa7f69d38bfcd667bdc45bc373d16277e # v4 + - uses: github/codeql-action/upload-sarif@68bde559dea0fdcac2102bfdf6230c5f70eb485e # v4 with: sarif_file: results.sarif diff --git a/.github/workflows/validate.yml b/.github/workflows/validate.yml index 1e3bcd5..3df25c7 100644 --- a/.github/workflows/validate.yml +++ b/.github/workflows/validate.yml @@ -65,6 +65,11 @@ jobs: make manifests generate git diff --exit-code || (echo "CRD manifests or generated code is out of date. Run 'make manifests generate' and commit the result." 
&& exit 1) + - name: Check API reference docs are up to date + run: | + make docs-api + git diff --exit-code docs/reference/api/ || (echo "API reference docs are out of date. Run 'make docs-api' and commit the result." && exit 1) + - name: Helm lint run: | curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash diff --git a/.gitignore b/.gitignore index 55e9377..224c97b 100644 --- a/.gitignore +++ b/.gitignore @@ -30,5 +30,8 @@ Thumbs.db # Build dist/ +# mkdocs build output +site/ + # Kind kubeconfig diff --git a/AGENTS.md b/AGENTS.md index f21b66f..a64c775 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,24 +1,24 @@ -# AGENTS.md — Agent Team Guidelines for claude-teams-operator +# AGENTS.md. Agent Team Guidelines for claude-teams-operator ## When working as a teammate on this project -1. **Check the task list first** — before starting work, check what's assigned to you -2. **Respect module boundaries** — each internal package has a clear scope: - - `internal/controller/` — only reconciliation logic - - `internal/claude/` — only Claude Code file I/O and session management - - `internal/budget/` — only cost estimation - - `internal/webhook/` — only external notifications - - `internal/metrics/` — only Prometheus metrics -3. **Use kubebuilder markers** — all CRD types in `api/v1alpha1/` must have proper `+kubebuilder:` annotations -4. **Test with envtest** — controller tests should use controller-runtime's envtest framework -5. **Follow Kubernetes conventions** — conditions use `metav1.Condition`, status updates are separate from spec changes +1. **Check the task list first**. Before starting work, check what's assigned to you +2. **Respect module boundaries**. Each internal package has a clear scope: + - `internal/controller/`. Only reconciliation logic + - `internal/claude/`. Only Claude Code file I/O and session management + - `internal/budget/`. Only cost estimation + - `internal/webhook/`. 
Only external notifications + - `internal/metrics/`. Only Prometheus metrics +3. **Use kubebuilder markers**. All CRD types in `api/v1alpha1/` must have proper `+kubebuilder:` annotations +4. **Test with envtest**. Controller tests should use controller-runtime's envtest framework +5. **Follow Kubernetes conventions**. Conditions use `metav1.Condition`, status updates are separate from spec changes ## Architecture rules -- The operator NEVER makes Anthropic API calls directly — it only manages pods that run Claude Code -- All inter-agent communication goes through the shared PVC filesystem — the operator just creates and monitors the volumes -- Budget tracking is estimation-based — we can't read real-time token counts from Claude Code -- Pods use `RestartPolicy: Never` — crashed agents get re-spawned fresh, not restarted +- The operator NEVER makes Anthropic API calls directly. It only manages pods that run Claude Code +- All inter-agent communication goes through the shared PVC filesystem. The operator just creates and monitors the volumes +- Budget tracking is estimation-based. We can't read real-time token counts from Claude Code +- Pods use `RestartPolicy: Never`. Crashed agents get re-spawned fresh, not restarted ## Build verification diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index dfe9eb0..5dce9d5 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -86,25 +86,25 @@ This approach preserves the native Agent Teams protocol without modification whi ## Storage Requirements -All operator-managed PVCs — `team-state`, `repo`, and (in Cowork mode) `output` — default to `ReadWriteMany` access on a StorageClass named `nfs`. The requirement is not incidental: the lead and every teammate pod must open the same mailbox and task files concurrently, and on a multi-node cluster they will generally land on different nodes. `ReadWriteOnce` can only bind to one node at a time, so it is not a viable default. +All operator-managed PVCs (`team-state`, `repo`, and, in Cowork mode, `output`) default to `ReadWriteMany` access on a StorageClass named `nfs`. 
The requirement is not incidental: the lead and every teammate pod must open the same mailbox and task files concurrently, and on a multi-node cluster they will generally land on different nodes. `ReadWriteOnce` can only bind to one node at a time, so it is not a viable default. ### Why ReadWriteMany Each agent pod does two concurrent things against shared state: -- **Writing into peers' inboxes** — the lead writes `teams/{team}/inboxes/{teammate}.json`; each teammate writes to the lead's inbox and occasionally to other teammates'. -- **Claiming tasks** — multiple teammates race to claim items from `tasks/{team}/tasks.json`. +- **Writing into peers' inboxes**. The lead writes `teams/{team}/inboxes/{teammate}.json`; each teammate writes to the lead's inbox and occasionally to other teammates'. +- **Claiming tasks**. Multiple teammates race to claim items from `tasks/{team}/tasks.json`. If the backing PVC cannot be mounted on more than one node, the second pod will fail to schedule (`volume already attached to a different node`) and the team deadlocks before the first mailbox round-trip. ### Supported storage backends -The operator itself has no opinion about the CSI driver — it asks for a PVC with `accessModes: [ReadWriteMany]` and a `storageClassName` that you supply. The table below lists drivers known to satisfy the RWX contract: +The operator itself has no opinion about the CSI driver. It asks for a PVC with `accessModes: [ReadWriteMany]` and a `storageClassName` that you supply. The table below lists drivers known to satisfy the RWX contract: | Platform | Driver | Notes | |----------|--------|-------| | Kind (multi-node dev) | `nfs-ganesha/nfs-server-provisioner` | Installed by `hack/kind-setup.sh` as StorageClass `nfs`. Real RWX over an in-cluster NFS server. 
| -| Kind (single-node acceptance) | `rancher.io/local-path` under the `nfs` StorageClass alias | Installed by `hack/acceptance-setup.sh`. See "Single-node fallback" — not true RWX. | +| Kind (single-node acceptance) | `rancher.io/local-path` under the `nfs` StorageClass alias | Installed by `hack/acceptance-setup.sh`. See "Single-node fallback"; not true RWX. | | Amazon EKS | [EFS CSI driver](https://github.com/kubernetes-sigs/aws-efs-csi-driver) | StorageClass pointing at an EFS file system. RWX natively. | | Google GKE | [Filestore CSI driver](https://cloud.google.com/filestore/docs/csi-driver) | Enable the Filestore CSI add-on; Filestore instances advertise RWX. | | Azure AKS | [Azure Files CSI driver](https://learn.microsoft.com/azure/aks/azure-files-csi) | SMB or NFS-protocol file shares; both support RWX. | @@ -114,11 +114,11 @@ The StorageClass name the operator requests defaults to `nfs` and is overridable ### Single-node fallback -For laptops and CI — Kind, k3d, minikube — a full RWX provisioner is overkill. The operator accepts a `--pvc-access-mode=ReadWriteOnce` flag that switches every managed PVC from `ReadWriteMany` to `ReadWriteOnce`. This works **only** on single-node clusters, because every pod lands on the same node and a hostPath-backed RWO PVC is effectively visible to all of them. +For laptops and CI (Kind, k3d, minikube), a full RWX provisioner is overkill. The operator accepts a `--pvc-access-mode=ReadWriteOnce` flag that switches every managed PVC from `ReadWriteMany` to `ReadWriteOnce`. This works **only** on single-node clusters, because every pod lands on the same node and a hostPath-backed RWO PVC is effectively visible to all of them. `hack/acceptance-setup.sh` uses exactly this trick: it creates an alias StorageClass named `nfs` over `rancher.io/local-path` so the operator's PVC specs still validate, then sets `--pvc-access-mode=ReadWriteOnce` on the controller deployment. 
 -The architectural claim — that a shared mount is sufficient to ferry mailbox JSON between pods — can be verified on any single-node cluster with: +The architectural claim, that a shared mount is sufficient to ferry mailbox JSON between pods, can be verified on any single-node cluster with: ```bash make acceptance-up @@ -133,10 +133,10 @@ The smoke test reports the effective StorageClass and AccessMode on its PASS lin The native Agent Teams protocol is file-based: -- **Mailboxes** — each agent has a JSON inbox at `~/.claude/teams/{team}/inboxes/{agent}.json`. Agents read their own inbox for messages from teammates. -- **Task list** — a shared JSON file at `~/.claude/tasks/{team}/tasks.json`. The lead writes tasks; teammates claim and update them. +- **Mailboxes**. Each agent has a JSON inbox at `~/.claude/teams/{team}/inboxes/{agent}.json`. Agents read their own inbox for messages from teammates. +- **Task list**. A shared JSON file at `~/.claude/tasks/{team}/tasks.json`. The lead writes tasks; teammates claim and update them. -The operator does not implement or speak this protocol — it only creates the shared PVC that makes the filesystem visible to all pods. Claude Code manages the protocol itself. +The operator does not implement or speak this protocol. It only creates the shared PVC that makes the filesystem visible to all pods. Claude Code manages the protocol itself. ## Coding Mode @@ -148,7 +148,7 @@ When `spec.repository` is set, the operator runs an init Job before deploying po Each teammate pod receives `WORKTREE_PATH=worktrees/{name}`, and the entrypoint `cd`s to that path before launching Claude Code. The lead has no worktree path and works directly from `/workspace/repo`. -Per-worktree isolation prevents git conflicts between concurrent agents — each agent commits to its own branch, and the lead (or an `onComplete` action) handles merging. +Per-worktree isolation prevents git conflicts between concurrent agents. 
Each agent commits to its own branch, and the lead (or an `onComplete` action) handles merging. ## Cowork Mode @@ -156,7 +156,7 @@ When `spec.workspace` is set (and `spec.repository` is absent or minimal), the o - Creates an output PVC for writable agent output - Mounts workspace inputs (ConfigMaps or existing PVCs) read-only into each pod -- Does not set `WORKTREE_PATH` — agents work in `/workspace/output` or `/workspace/data` +- Does not set `WORKTREE_PATH`. Agents work in `/workspace/output` or `/workspace/data` The entrypoint detects the absence of a git repo gracefully and skips the `git log` startup output. @@ -164,7 +164,7 @@ The entrypoint detects the absence of a git repo gracefully and skips the `git l Claude Code skills live under `~/.claude/skills/{name}/`. The operator mounts ConfigMap-backed skills at `/var/claude-skills/{name}/` and the entrypoint copies them to `~/.claude/skills/{name}/` before launching Claude Code. -Skills are per-agent — the same skill ConfigMap can be mounted into multiple pods independently, so different teammates can have different skill sets. +Skills are per-agent. The same skill ConfigMap can be mounted into multiple pods independently, so different teammates can have different skill sets. ## MCP Servers @@ -193,7 +193,7 @@ The next reconcile loop (within 30 seconds) sees the annotation and spawns the t ## DependsOn Ordering -Teammates can declare `dependsOn` — a list of other teammate names that must reach `Succeeded` phase before this teammate is spawned. The check runs every reconcile loop: +Teammates can declare `dependsOn`, a list of other teammate names that must reach `Succeeded` phase before this teammate is spawned. 
The check runs every reconcile loop: - In `reconcileInitializing`: initial pod deployment respects dependency order - In `reconcileRunning`: newly unblocked teammates are spawned automatically as their dependencies complete @@ -217,7 +217,7 @@ When the estimate exceeds `budgetLimit`, the operator terminates all pods and se ### Why shared PVC over a message bus? -Agent Teams uses a file-based protocol. Rather than translating it to Redis or NATS, we preserve it exactly by mounting a shared filesystem. This means no changes to Claude Code itself, no protocol versioning concerns, and no additional infrastructure dependencies for simple deployments. The tradeoff is the requirement for ReadWriteMany PVC support — NFS or a cloud-native equivalent like EFS or GCP Filestore. +Agent Teams uses a file-based protocol. Rather than translating it to Redis or NATS, we preserve it exactly by mounting a shared filesystem. This means no changes to Claude Code itself, no protocol versioning concerns, and no additional infrastructure dependencies for simple deployments. The tradeoff is the requirement for ReadWriteMany PVC support: NFS or a cloud-native equivalent like EFS or GCP Filestore. ### Why RestartPolicy: Never? @@ -266,9 +266,9 @@ hack/ ## Roadmap -- **OCI skill artifacts** — pull skills from OCI registries instead of ConfigMaps -- **Real token tracking** — instrument or sidecar Claude Code to capture actual usage -- **envtest integration tests** — full reconcile loop tests against a real API server -- **Horizontal scaling** — multiple operator replicas with leader election -- **Beads/Dolt integration** — persistent task tracking across team runs -- **`AgentTeamRun` controller** — reconciler for the template-instantiation CRD +- **OCI skill artifacts**. Pull skills from OCI registries instead of ConfigMaps +- **Real token tracking**. Instrument or sidecar Claude Code to capture actual usage +- **envtest integration tests**. 
Full reconcile loop tests against a real API server +- **Horizontal scaling**. Multiple operator replicas with leader election +- **Beads/Dolt integration**. Persistent task tracking across team runs +- **`AgentTeamRun` controller**. Reconciler for the template-instantiation CRD diff --git a/CLAUDE.md b/CLAUDE.md index e49f309..57c626c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -118,7 +118,7 @@ This project is being developed with the goal of presenting at KubeCon NA 2026 ( ### Release Timeline -All milestones and issues are tracked on GitHub. The CFP is **OPEN** with submissions due **May 31 2026 at 11:59pm MT** — see KUBECON.md. +All milestones and issues are tracked on GitHub. The KubeCon CFP has been **submitted** (May 2026) — see KUBECON.md. | Version | GitHub Milestone | Due | What it unlocks | |---------|-----------------|-----|-----------------| @@ -128,9 +128,9 @@ All milestones and issues are tracked on GitHub. The CFP is **OPEN** with submis | **v0.4.0** | Resilience & RBAC | Aug 31 2026 | Crash re-spawn ✅, per-agent ServiceAccounts ✅, `onComplete: create-pr` ✅, `onComplete: push-branch` | | **v0.5.0** | Template Engine & Helm | Sep 30 2026 | `AgentTeamTemplate`/`AgentTeamRun` controllers, production Helm chart, CONTRIBUTING.md | | **v0.6.0** | Operator Dashboard | Oct 5 2026 | Web UI for running AgentTeams: backend API, list + detail views (HTMX + Go templates), live SSE updates, Helm packaging | -| **v1.0.0** | KubeCon Demo Polish | Oct 26 2026 | Demo script, CFP submitted, OCI skill distribution, dashboard presentation mode for stage | +| **v1.0.0** | KubeCon Demo Polish | Oct 26 2026 | Demo script, OCI skill distribution, dashboard presentation mode for stage | -**KubeCon talk:** November 9–12 2026, Salt Lake City. CFP deadline: May 31 2026. +**KubeCon talk:** November 9–12 2026, Salt Lake City. CFP submitted May 2026. ### Current Priority (post-v0.3.0) @@ -138,7 +138,6 @@ The next highest-value issues: 1. 
**#16** — `onComplete: push-branch` — closes out v0.4.0 alongside the already-merged #13/#14/#15 2. **#17 / #18** — AgentTeamTemplate + AgentTeamRun controllers (v0.5.0) 3. **#137–#140** — the operator dashboard (v0.6.0) -4. **#23** — draft and submit the KubeCon CFP by May 31 — this is the hard deadline ### Ask of Claude Code diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..d442294 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,43 @@ +# Code of Conduct + +## Our pledge + +This project follows the [**Contributor Covenant Code of Conduct, version 2.1**](https://www.contributor-covenant.org/version/2/1/code_of_conduct/) — an industry-standard agreement adopted by thousands of open-source projects including Kubernetes, the Cloud Native Computing Foundation, Rust, and Node.js. + +In short: contributors and maintainers commit to making participation in this project a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. + +The full text of the standards expected — both the positive examples (welcoming language, accepting constructive feedback, etc.) and the unacceptable behaviors (sexualized language, trolling, harassment, etc.) — is available at the link above. 
**By contributing to this project, you agree to abide by it.** + +## Scope + +This Code of Conduct applies within all project spaces, including: + +- The GitHub repository (issues, pull requests, Discussions, code review) +- Any chat, mailing list, or social media account associated with the project +- Public events where someone is representing the project (talks, demos, meetups) + +## Reporting + +If you experience or witness a violation, report it confidentially: + +- **Email:** amcheste@gmail.com +- **GitHub:** open a [private security advisory](https://github.com/amcheste/claude-teams-operator/security/advisories/new) and tag it as a Code of Conduct concern in the title + +You will receive an acknowledgement within **7 days** and a resolution or status update within **30 days**. All reports are handled confidentially. Reporters will not be retaliated against. + +## Enforcement + +Project maintainers will follow the Contributor Covenant's [Enforcement Guidelines](https://www.contributor-covenant.org/version/2/1/code_of_conduct/#enforcement-guidelines), which describe a four-stage response: + +1. **Correction** — private warning for a first minor incident +2. **Warning** — formal warning with conditions +3. **Temporary ban** — time-limited removal from project spaces +4. **Permanent ban** — for sustained or severe violations + +The maintainer is the final authority on enforcement decisions for this project. + +## Attribution + +This Code of Conduct adopts the [Contributor Covenant](https://www.contributor-covenant.org/), version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct/. Translations are available at https://www.contributor-covenant.org/translations. + +The Contributor Covenant is licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/). 
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 3e427ad..999ced7 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,13 +1,50 @@ # Contributing +Thank you for considering a contribution to **kagents** (the [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator) repo). This guide covers everything from setting up your dev environment to opening your first PR. + +If you're new here, the fastest orientation: + +1. Read the [Getting Started tutorial](https://kagents.dev/tutorials/getting-started/) to install kagents on a Kind cluster and run a team end-to-end (~15 minutes) +2. Read the [Resource model](https://kagents.dev/explanation/resources/) and [Coordination protocol](https://kagents.dev/explanation/coordination/) explanation pages to understand the architecture +3. Skim this guide +4. Check the [good first issues](https://github.com/amcheste/claude-teams-operator/labels/good%20first%20issue) below for something to pick up + +## Code of Conduct + +This project adopts the [Contributor Covenant Code of Conduct, version 2.1](CODE_OF_CONDUCT.md). By participating you agree to abide by it. Report violations confidentially per the instructions in that file. + +## Issue tracking + +This project uses **Linear** (team `AMC`, project `claude-teams-operator`) as the source of truth for issue tracking. Linear ↔ GitHub sync mirrors issues both ways: a Linear issue gets a GitHub mirror, and a PR that references the Linear ID (`Fixes AMC-N` in the PR body or any commit message) auto-closes the Linear ticket on merge. + +For external contributors who don't have Linear access: + +- File issues directly on GitHub using the [issue templates](https://github.com/amcheste/claude-teams-operator/issues/new/choose). The maintainer will mirror them into Linear. +- Reference the GitHub issue number in your PR (`Fixes #123`). That works fine. The Linear sync handles the cross-reference. 
+ +For maintainers and regular contributors: + +- Open or claim issues in Linear directly via [save_issue](https://linear.app/amcheste/project/claude-teams-operator-32aab082f36b) (or the Linear UI). +- PRs to `develop` are required to reference an `AMC-N` ID or carry a `No-Linear-Issue: ` trailer. The `Linear Issue Reference` CI check enforces this. + +## Good first issues + +If you're looking for a way in, browse: + +- [`good first issue`](https://github.com/amcheste/claude-teams-operator/labels/good%20first%20issue). Small, well-scoped tasks with clear acceptance criteria +- [`help wanted`](https://github.com/amcheste/claude-teams-operator/labels/help%20wanted). Areas where the maintainer would specifically welcome a hand +- [`documentation`](https://github.com/amcheste/claude-teams-operator/labels/documentation). Content fixes, new tutorials, or how-to guides for the docs site at [kagents.dev](https://kagents.dev) + +If nothing on those lists fits, [open a Discussion](https://github.com/amcheste/claude-teams-operator/discussions) describing what you'd like to work on. Better to align before writing code than after. + ## Prerequisites -- **Go 1.23+** — `brew install go` or [go.dev/dl](https://go.dev/dl) -- **Docker** — for building container images -- **Kind** — `brew install kind` (local cluster) -- **kubectl** — `brew install kubectl` -- **Helm** — `brew install helm` -- **golangci-lint** — `brew install golangci-lint` +- **Go 1.23+**. `brew install go` or [go.dev/dl](https://go.dev/dl) +- **Docker**. For building container images +- **Kind**. `brew install kind` (local cluster) +- **kubectl**. `brew install kubectl` +- **Helm**. `brew install helm` +- **golangci-lint**. `brew install golangci-lint` Verify your Go installation: @@ -82,7 +119,7 @@ The CRD types live in `api/v1alpha1/`. After modifying them: 3. Run `make install` to apply the updated CRDs to your cluster 4. 
Commit both the Go source changes **and** the generated files -Do not edit `zz_generated.deepcopy.go` or `config/crd/bases/*.yaml` by hand — they are always regenerated. +Do not edit `zz_generated.deepcopy.go` or `config/crd/bases/*.yaml` by hand. They are always regenerated. ## Testing @@ -107,8 +144,8 @@ In short: branch from `develop`, one logical change per PR, [Conventional Commit This repo extends the canonical commit types with: -- `test:` — adding or updating tests -- `ci:` — CI/CD configuration changes +- `test:`. Adding or updating tests +- `ci:`. CI/CD configuration changes Scopes are encouraged (optional but helpful): `feat(controller):`, `fix(crd):`, `docs(readme):`, `feat(crd)!: rename budgetLimit field`. @@ -122,15 +159,34 @@ make manifests generate fmt vet test All must pass. CI will re-run them. +If your PR touches `api/v1alpha1/*.go`, also run: + +```bash +make docs-api +``` + +This regenerates the auto-generated API reference at `docs/reference/api/index.md`. CI's `Check API reference docs are up to date` step fails if you skip this. + +### Documentation site changes + +The docs site at [kagents.dev](https://kagents.dev) lives under `docs/` and is built with [mkdocs-material](https://squidfunk.github.io/mkdocs-material/). To preview your changes locally: + +```bash +pip install -r docs/requirements.txt +mkdocs serve # http://localhost:8000 +``` + +The site auto-deploys to `gh-pages` on every push to `main` that touches `docs/`, `mkdocs.yml`, or `.github/workflows/docs.yml`. See [`docs/README.md`](docs/README.md) for the dev loop. + --- ## How to add a new reconciler feature -The most common contribution path is "add a new field to an `AgentTeam` and have the operator do something with it." Use this worked example as a template — it's the path #13–#16 followed for crash respawn, RBAC, create-pr, and push-branch. +The most common contribution path is "add a new field to an `AgentTeam` and have the operator do something with it." 
Use this worked example as a template. It's the path #13–#16 followed for crash respawn, RBAC, create-pr, and push-branch. ### 1. Decide where the field belongs -Most lifecycle-related fields live on `LifecycleSpec`; pod-level configuration lives on `LeadSpec`/`TeammateSpec`; cluster-wide defaults live on the Helm chart's `values.yaml`. When in doubt, look at how `MaxRestarts` or `GitCredentialsSecret` are wired — they're representative. +Most lifecycle-related fields live on `LifecycleSpec`; pod-level configuration lives on `LeadSpec`/`TeammateSpec`; cluster-wide defaults live on the Helm chart's `values.yaml`. When in doubt, look at how `MaxRestarts` or `GitCredentialsSecret` are wired. They're representative. ### 2. Extend the CRD type @@ -146,7 +202,7 @@ Edit `api/v1alpha1/agentteam_types.go` (or `template_types.go`). Add the field w MaxRestarts *int32 `json:"maxRestarts,omitempty"` ``` -The doc comment becomes the CRD's OpenAPI description — write it for someone reading `kubectl explain agentteam.spec.lifecycle.maxRestarts`. +The doc comment becomes the CRD's OpenAPI description. Write it for someone reading `kubectl explain agentteam.spec.lifecycle.maxRestarts`. ### 3. Regenerate manifests + deepcopy @@ -158,7 +214,7 @@ This rewrites `config/crd/bases/*.yaml`, `charts/claude-teams-operator/crds/*.ya ### 4. Implement the reconciler change -Find the right phase function — `reconcilePending`, `reconcileInitializing`, `reconcileRunning`, or `reconcileTerminal` — in `internal/controller/agentteam_controller.go`. The phases are documented in [ARCHITECTURE.md § State Machine](ARCHITECTURE.md). +Find the right phase function (`reconcilePending`, `reconcileInitializing`, `reconcileRunning`, or `reconcileTerminal`) in `internal/controller/agentteam_controller.go`. The phases are documented in [ARCHITECTURE.md § State Machine](ARCHITECTURE.md). Add a small helper rather than inlining new logic.
The convention is `func (r *AgentTeamReconciler) handleX(ctx, team) (...)` for stateful behavior, and free functions for pure logic. See `handleTeammateFailures` and `newTeamTracker` for examples. @@ -180,9 +236,9 @@ If the existing webhook event types don't fit, add a new one to `internal/webhoo Each PR should add tests at the layers it changes: -- **Unit tests** — fast, fake-client based. Cover validation, branch coverage in your helper, error paths. Add to `internal/controller/agentteam__test.go`. See [TESTING.md](TESTING.md) for the suite breakdown. -- **Integration tests** — envtest-backed Ginkgo specs in `internal/controller/agentteam_integration_test.go` (or a new `agentteam__integration_test.go`). Use these when the behavior depends on the real API server's optimistic concurrency, status subresource handling, or owner references. -- **Acceptance tests** — Kind-cluster Ginkgo specs under `test/acceptance/`. Use when the behavior involves pod lifecycle, PVC mounting, or anything that fake-client can't simulate. Real-API E2E (`test/e2e/`) is reserved for end-to-end verification against Anthropic's API. +- **Unit tests**. Fast, fake-client based. Cover validation, branch coverage in your helper, error paths. Add to `internal/controller/agentteam__test.go`. See [TESTING.md](TESTING.md) for the suite breakdown. +- **Integration tests**. Envtest-backed Ginkgo specs in `internal/controller/agentteam_integration_test.go` (or a new `agentteam__integration_test.go`). Use these when the behavior depends on the real API server's optimistic concurrency, status subresource handling, or owner references. +- **Acceptance tests**. Kind-cluster Ginkgo specs under `test/acceptance/`. Use when the behavior involves pod lifecycle, PVC mounting, or anything that fake-client can't simulate. Real-API E2E (`test/e2e/`) is reserved for end-to-end verification against Anthropic's API. 
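As a rough illustration of the "free functions for pure logic" convention and the unit-test layer described above, here is a minimal sketch of a pure helper exercised with a table of cases. The name `shouldRespawn`, the constant `defaultMaxRestarts`, and the exact boundary (`restarts < limit`) are assumptions made for this example; they are not taken from the operator's source, and a real unit test would live in a `_test.go` file rather than in `main`.

```go
package main

import "fmt"

// defaultMaxRestarts mirrors the kubebuilder default (3) declared on MaxRestarts.
const defaultMaxRestarts int32 = 3

// shouldRespawn is a hypothetical pure helper, not the operator's real code.
// It reports whether a failed teammate may be re-spawned, given how many
// restarts it has already used. A nil limit falls back to the default.
func shouldRespawn(restarts int32, maxRestarts *int32) bool {
	limit := defaultMaxRestarts
	if maxRestarts != nil {
		limit = *maxRestarts
	}
	return restarts < limit
}

func main() {
	zero, two := int32(0), int32(2)

	// One case per branch: nil limit, explicit limit, and restarts disabled.
	cases := []struct {
		name     string
		restarts int32
		max      *int32
		want     bool
	}{
		{"default, first crash", 0, nil, true},
		{"default, limit reached", 3, nil, false},
		{"explicit limit of 2", 1, &two, true},
		{"explicit limit hit", 2, &two, false},
		{"restarts disabled", 0, &zero, false},
	}

	for _, c := range cases {
		got := shouldRespawn(c.restarts, c.max)
		fmt.Printf("%-24s got=%v want=%v\n", c.name, got, c.want)
	}
}
```

Keeping the decision in a free function means it can be covered without a fake client or envtest; only the code that reads and writes cluster state needs the heavier layers.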
A good rule: if your feature has a state machine, your test count should be ≥ the number of branches in the state machine. @@ -199,9 +255,9 @@ Cluster-wide defaults belong on the operator's CLI flags (read from a ConfigMap ### Reference PRs -These are good examples to skim before opening your first reconciler PR — each one followed this exact recipe: +These are good examples to skim before opening your first reconciler PR. Each one followed this exact recipe: -- [#13 Crash respawn](https://github.com/amcheste/claude-teams-operator/pull/133) — controller state machine + metrics + webhook + tests across all three layers -- [#14 Per-agent RBAC](https://github.com/amcheste/claude-teams-operator/pull/134) — CRD-less feature: just controller logic + scoped Roles + RBAC markers -- [#15 create-pr](https://github.com/amcheste/claude-teams-operator/pull/135) — new internal package (`internal/github`) + controller wiring + httptest-backed tests -- [#16 push-branch](https://github.com/amcheste/claude-teams-operator/pull/148) — async terminal Job + status mirror + envtest integration spec +- [#13 Crash respawn](https://github.com/amcheste/claude-teams-operator/pull/133). Controller state machine + metrics + webhook + tests across all three layers +- [#14 Per-agent RBAC](https://github.com/amcheste/claude-teams-operator/pull/134). CRD-less feature: just controller logic + scoped Roles + RBAC markers +- [#15 create-pr](https://github.com/amcheste/claude-teams-operator/pull/135). New internal package (`internal/github`) + controller wiring + httptest-backed tests +- [#16 push-branch](https://github.com/amcheste/claude-teams-operator/pull/148). Async terminal Job + status mirror + envtest integration spec diff --git a/KUBECON.md b/KUBECON.md index 3a36c31..eaab977 100644 --- a/KUBECON.md +++ b/KUBECON.md @@ -3,7 +3,7 @@ **Project:** **kagents** — run Claude Code Agent Teams as a Kubernetes operator. Site: [kagents.dev](https://kagents.dev) (in progress). 
Repo: [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator). **Conference:** KubeCon + CloudNativeCon North America 2026 **Dates:** November 9–12, 2026 — Salt Lake City, Utah -**CFP:** Open now — deadline **May 31, 2026 at 11:59pm MT**. Submit at https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/program/cfp/ +**CFP:** Submitted (May 2026) — deadline was **May 31, 2026 at 11:59pm MT**. --- diff --git a/Makefile b/Makefile index dbe396f..bbcdd36 100644 --- a/Makefile +++ b/Makefile @@ -176,3 +176,23 @@ CONTROLLER_GEN = $(shell go env GOPATH)/bin/controller-gen .PHONY: controller-gen controller-gen: ## Install controller-gen @test -f $(CONTROLLER_GEN) || go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) + +CRD_REF_DOCS_VERSION ?= v0.3.0 +CRD_REF_DOCS = $(shell go env GOPATH)/bin/crd-ref-docs +.PHONY: crd-ref-docs +crd-ref-docs: ## Install crd-ref-docs (used by docs-api) + @go install github.com/elastic/crd-ref-docs@$(CRD_REF_DOCS_VERSION) + + +##@ Documentation + +.PHONY: docs-api +docs-api: crd-ref-docs ## Regenerate the API reference under docs/reference/api/ from kubebuilder markers + @mkdir -p docs/reference/api + $(CRD_REF_DOCS) \ + --config=hack/crd-ref-docs-config.yaml \ + --source-path=api/v1alpha1 \ + --renderer=markdown \ + --output-path=docs/reference/api/index.md \ + --output-mode=single + @echo "API reference regenerated at docs/reference/api/index.md" diff --git a/README.md b/README.md index b50ca7f..982b5e0 100644 --- a/README.md +++ b/README.md @@ -1,23 +1,23 @@
-kagents mascot +kagents banner # kagents **Run Claude Code Agent Teams as a Kubernetes operator.** [![Validate](https://github.com/amcheste/claude-teams-operator/actions/workflows/validate.yml/badge.svg)](https://github.com/amcheste/claude-teams-operator/actions/workflows/validate.yml) -[![Version](https://img.shields.io/github/v/tag/amcheste/claude-teams-operator?label=version&sort=semver)](https://github.com/amcheste/claude-teams-operator/releases) -[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE) +[![Version](https://img.shields.io/github/v/tag/amcheste/claude-teams-operator?label=version&sort=semver&color=0B0B0C)](https://github.com/amcheste/claude-teams-operator/releases) +[![License](https://img.shields.io/badge/License-Apache_2.0-1F4D3A.svg)](LICENSE) [![Go](https://img.shields.io/badge/Go-1.23-00ADD8)](go.mod)
--- -> **kagents** is the project brand. The implementation lives in the [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator) repository and ships under the `claude.amcheste.io/v1alpha1` API group. Documentation site: [kagents.dev](https://kagents.dev) (under construction — see [v0.7.0 milestone](https://github.com/amcheste/claude-teams-operator/milestone/8)). +> **kagents** is the project brand. The implementation lives in the [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator) repository and ships under the `claude.amcheste.io/v1alpha1` API group. Documentation site: [kagents.dev](https://kagents.dev) (under construction; see [v0.7.0 milestone](https://github.com/amcheste/claude-teams-operator/milestone/8)). -Claude Code [Agent Teams](https://docs.anthropic.com/en/docs/claude-code/agent-teams) let multiple Claude Code instances collaborate — a lead coordinates work via a shared task list while teammates communicate through peer-to-peer mailboxes. Natively this runs on a single machine using tmux. This operator lifts that pattern into Kubernetes so you can run large-scale agent teams on your cluster. +Claude Code [Agent Teams](https://docs.anthropic.com/en/docs/claude-code/agent-teams) let multiple Claude Code instances collaborate. A lead coordinates work via a shared task list while teammates communicate through peer-to-peer mailboxes. Natively this runs on a single machine using tmux. This operator lifts that pattern into Kubernetes so you can run large-scale agent teams on your cluster.
## Modes @@ -32,23 +32,23 @@ Both modes share the same coordination protocol (shared PVCs, mailboxes, task li ## Features -- **Native Agent Teams protocol** — preserves Anthropic's file-based mailbox and task list format over ReadWriteMany PVCs; no protocol translation -- **Per-teammate git worktrees** — each coding agent works on an isolated branch to prevent merge conflicts -- **Cowork mode** — mount ConfigMap/PVC inputs and collect outputs without requiring a git repo -- **Skills as CRD fields** — mount Claude Code skills from ConfigMaps into each agent's `.claude/skills/` -- **MCP servers per agent** — configure Model Context Protocol connections per teammate -- **Approval gates** — pause spawning specific teammates until a human applies an annotation -- **Budget enforcement** — terminate the team if estimated API cost exceeds a configured limit -- **Timeout enforcement** — terminate the team after a configurable wall-clock duration -- **`dependsOn` ordering** — spawn teammates only after their declared dependencies complete -- **Reusable templates** — define team patterns with `AgentTeamTemplate`, instantiate with `AgentTeamRun` +- **Native Agent Teams protocol**. Preserves Anthropic's file-based mailbox and task list format over ReadWriteMany PVCs; no protocol translation +- **Per-teammate git worktrees**. Each coding agent works on an isolated branch to prevent merge conflicts +- **Cowork mode**. Mount ConfigMap/PVC inputs and collect outputs without requiring a git repo +- **Skills as CRD fields**. Mount Claude Code skills from ConfigMaps into each agent's `.claude/skills/` +- **MCP servers per agent**. Configure Model Context Protocol connections per teammate +- **Approval gates**. Pause spawning specific teammates until a human applies an annotation +- **Budget enforcement**. Terminate the team if estimated API cost exceeds a configured limit +- **Timeout enforcement**. 
Terminate the team after a configurable wall-clock duration +- **`dependsOn` ordering**. Spawn teammates only after their declared dependencies complete +- **Reusable templates**. Define team patterns with `AgentTeamTemplate`, instantiate with `AgentTeamRun` ## Quick Start ### Prerequisites - Kubernetes 1.28+ -- ReadWriteMany PVC support (NFS, EFS, or a compatible CSI driver — see [ARCHITECTURE.md § Storage Requirements](ARCHITECTURE.md#storage-requirements) for options) +- ReadWriteMany PVC support (NFS, EFS, or a compatible CSI driver. See [ARCHITECTURE.md § Storage Requirements](ARCHITECTURE.md#storage-requirements) for options) - Claude Code CLI access (Max subscription or API key) - Opus 4.6 model access (required for Agent Teams) @@ -222,7 +222,7 @@ The primary resource. Defines the full team, its workspace, lifecycle, and obser ### AgentTeamTemplate -A reusable team pattern. Does not run on its own — instantiate with `AgentTeamRun`. +A reusable team pattern. Does not run on its own. Instantiate with `AgentTeamRun`. ### AgentTeamRun @@ -298,12 +298,12 @@ This README is the entry point. For deeper dives, every topic lives in a dedicat | Document | Read when you want to… | |----------|-----------------------| -| [ARCHITECTURE.md](ARCHITECTURE.md) | Understand how the operator models Agent Teams — phase state machine, PVC layout, RWX storage backends, coordination protocol, key design tradeoffs. | +| [ARCHITECTURE.md](ARCHITECTURE.md) | Understand how the operator models Agent Teams. Phase state machine, PVC layout, RWX storage backends, coordination protocol, key design tradeoffs. | | [TESTING.md](TESTING.md) | See the test strategy (unit / integration / acceptance / E2E), how to run each suite, and what each one actually verifies. | | [CONTRIBUTING.md](CONTRIBUTING.md) | Set up a dev environment, run the full build/test loop, follow the branch + PR workflow, and walk through "How to add a new reconciler feature." 
| -| [docs/helm-values.md](docs/helm-values.md) | Tune the Helm chart — every value documented with defaults and production override recipes. | +| [docs/helm-values.md](docs/helm-values.md) | Tune the Helm chart. Every value documented with defaults and production override recipes. | | [SECURITY.md](SECURITY.md) | Report a vulnerability or review the project's security policy. | -| [KUBECON.md](KUBECON.md) | See the talk framing and "interesting problems" log — useful context for why specific architectural choices were made. | +| [KUBECON.md](KUBECON.md) | See the talk framing and "interesting problems" log. Useful context for why specific architectural choices were made. | ## Development diff --git a/SECURITY.md b/SECURITY.md index adcd08b..49cb413 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -1,18 +1,57 @@ # Security Policy -## Supported Versions +## Supported versions -Only the latest release is actively maintained. +Only the latest released version is actively maintained. Security fixes are issued against `main` and tagged with the next patch release. -## Reporting a Vulnerability +| Version | Supported | +|---------|:---------:| +| Latest release | ✅ | +| Older releases | ❌ please upgrade | -**Please do not open a public issue for security vulnerabilities.** +The latest release is the most recent `v*` tag on https://github.com/amcheste/claude-teams-operator/releases. -Use GitHub's [private vulnerability reporting](../../security/advisories/new) to report issues confidentially. +## Reporting a vulnerability + +**Please do not open a public issue, Discussion, or pull request for security vulnerabilities.** Use GitHub's [private vulnerability reporting](https://github.com/amcheste/claude-teams-operator/security/advisories/new) instead. That surface lets you submit confidentially, and the maintainer can collaborate with you on a fix without the report being visible to anyone else until it's resolved. 
+ +Please include in your report: -Please include: - A clear description of the vulnerability -- Steps to reproduce -- Potential impact +- Steps to reproduce (or a proof-of-concept manifest / kubectl invocation) +- The kagents version you observed it on +- Potential impact. What an attacker could achieve, and against what cluster topology + +## Coordinated disclosure expectations + +We follow a **coordinated disclosure** process: + +1. **Acknowledgement**. Within **7 days** of your report, the maintainer will confirm receipt and start triage. +2. **Triage + fix**. Within **30 days**, you will receive either a fix candidate, a status update with a clear timeline, or a written explanation of why the report doesn't qualify as a vulnerability. +3. **Embargo**. Fix development happens in private. We ask you to keep the issue confidential until the fix ships and is publicly announced. We will not embargo for longer than 90 days from the original report without your agreement. +4. **Public disclosure**. Once the fix is released, we publish a [GitHub Security Advisory](https://github.com/amcheste/claude-teams-operator/security/advisories) with the details, affected versions, mitigation steps, and credit to you (unless you ask to remain anonymous). +5. **CVE assignment**. If the issue qualifies, we request a CVE through GitHub's CNA before public disclosure. + +## What counts as a security issue + +If you're not sure whether something is a vulnerability or a bug, err on the side of reporting it through the private channel. It's easy to move a non-security report to a public issue, but a public report of a real vulnerability is unfixable damage. 
+ +In-scope examples: + +- Privilege escalation between agent pods within a team or across teams +- Container escape from an agent pod to the node +- Reading secrets that an agent's RBAC scope shouldn't allow +- Operator manipulating arbitrary cluster resources beyond its declared RBAC +- Information disclosure through dashboard endpoints or webhook payloads +- Supply-chain weaknesses in the operator or runner container images + +Out of scope (please file as regular GitHub issues): + +- Bugs without a security impact +- Denial-of-service requiring cluster-admin access to set up +- Issues that require physical access to a node +- Best-practice deviations that don't enable an attack + +## Hardening checklist for operators -You can expect an acknowledgement within **7 days** and a resolution or status update within **30 days**. +For users deploying kagents in production, the [Operations explanation](https://kagents.dev/explanation/operations/) covers the defense-in-depth model: per-agent ServiceAccounts, the file-based-protocol threat model, and what RBAC does and doesn't enforce. Reading that page before going live is recommended. diff --git a/VERSION b/VERSION index a918a2a..faef31a 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.6.0 +0.7.0 diff --git a/api/v1alpha1/agentteam_types.go b/api/v1alpha1/agentteam_types.go index f3b4804..ec86c6c 100644 --- a/api/v1alpha1/agentteam_types.go +++ b/api/v1alpha1/agentteam_types.go @@ -306,7 +306,7 @@ type LifecycleSpec struct { // MaxRestarts bounds how many times each teammate pod may be re-spawned // after a Failed phase before the team itself is marked Failed. The lead - // pod is not subject to this limit — a lead crash always fails the team. + // pod is not subject to this limit; a lead crash always fails the team.
// +kubebuilder:default=3 // +kubebuilder:validation:Minimum=0 // +optional diff --git a/assets/banner.png b/assets/banner.png new file mode 100644 index 0000000..c5324d5 Binary files /dev/null and b/assets/banner.png differ diff --git a/charts/claude-teams-operator/Chart.yaml b/charts/claude-teams-operator/Chart.yaml index 1ee0f84..1ab984e 100644 --- a/charts/claude-teams-operator/Chart.yaml +++ b/charts/claude-teams-operator/Chart.yaml @@ -2,8 +2,8 @@ apiVersion: v2 name: claude-teams-operator description: A Kubernetes operator for running Claude Code Agent Teams as distributed pods type: application -version: 0.6.0 -appVersion: "0.6.0" +version: 0.7.0 +appVersion: "0.7.0" keywords: - claude - ai diff --git a/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamruns.yaml b/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamruns.yaml index 0d574a7..62c1c6e 100644 --- a/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamruns.yaml +++ b/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamruns.yaml @@ -262,7 +262,7 @@ spec: description: |- MaxRestarts bounds how many times each teammate pod may be re-spawned after a Failed phase before the team itself is marked Failed. The lead - pod is not subject to this limit — a lead crash always fails the team. + pod is not subject to this limit; a lead crash always fails the team. format: int32 minimum: 0 type: integer diff --git a/charts/claude-teams-operator/crds/claude.amcheste.io_agentteams.yaml b/charts/claude-teams-operator/crds/claude.amcheste.io_agentteams.yaml index 72acd2a..13f07a8 100644 --- a/charts/claude-teams-operator/crds/claude.amcheste.io_agentteams.yaml +++ b/charts/claude-teams-operator/crds/claude.amcheste.io_agentteams.yaml @@ -307,7 +307,7 @@ spec: description: |- MaxRestarts bounds how many times each teammate pod may be re-spawned after a Failed phase before the team itself is marked Failed. 
The lead - pod is not subject to this limit — a lead crash always fails the team. + pod is not subject to this limit; a lead crash always fails the team. format: int32 minimum: 0 type: integer diff --git a/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamtemplates.yaml b/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamtemplates.yaml index c0135fb..a9ff55d 100644 --- a/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamtemplates.yaml +++ b/charts/claude-teams-operator/crds/claude.amcheste.io_agentteamtemplates.yaml @@ -155,7 +155,7 @@ spec: description: |- MaxRestarts bounds how many times each teammate pod may be re-spawned after a Failed phase before the team itself is marked Failed. The lead - pod is not subject to this limit — a lead crash always fails the team. + pod is not subject to this limit; a lead crash always fails the team. format: int32 minimum: 0 type: integer diff --git a/config/crd/bases/claude.amcheste.io_agentteamruns.yaml b/config/crd/bases/claude.amcheste.io_agentteamruns.yaml index 0d574a7..62c1c6e 100644 --- a/config/crd/bases/claude.amcheste.io_agentteamruns.yaml +++ b/config/crd/bases/claude.amcheste.io_agentteamruns.yaml @@ -262,7 +262,7 @@ spec: description: |- MaxRestarts bounds how many times each teammate pod may be re-spawned after a Failed phase before the team itself is marked Failed. The lead - pod is not subject to this limit — a lead crash always fails the team. + pod is not subject to this limit; a lead crash always fails the team. 
format: int32 minimum: 0 type: integer diff --git a/config/crd/bases/claude.amcheste.io_agentteams.yaml b/config/crd/bases/claude.amcheste.io_agentteams.yaml index 72acd2a..13f07a8 100644 --- a/config/crd/bases/claude.amcheste.io_agentteams.yaml +++ b/config/crd/bases/claude.amcheste.io_agentteams.yaml @@ -307,7 +307,7 @@ spec: description: |- MaxRestarts bounds how many times each teammate pod may be re-spawned after a Failed phase before the team itself is marked Failed. The lead - pod is not subject to this limit — a lead crash always fails the team. + pod is not subject to this limit; a lead crash always fails the team. format: int32 minimum: 0 type: integer diff --git a/config/crd/bases/claude.amcheste.io_agentteamtemplates.yaml b/config/crd/bases/claude.amcheste.io_agentteamtemplates.yaml index c0135fb..a9ff55d 100644 --- a/config/crd/bases/claude.amcheste.io_agentteamtemplates.yaml +++ b/config/crd/bases/claude.amcheste.io_agentteamtemplates.yaml @@ -155,7 +155,7 @@ spec: description: |- MaxRestarts bounds how many times each teammate pod may be re-spawned after a Failed phase before the team itself is marked Failed. The lead - pod is not subject to this limit — a lead crash always fails the team. + pod is not subject to this limit; a lead crash always fails the team. format: int32 minimum: 0 type: integer diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..6980717 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,20 @@ +# Docs site + +This directory holds the [mkdocs-material](https://squidfunk.github.io/mkdocs-material/) source for [kagents.dev](https://kagents.dev). + +## Local development + +```bash +pip install -r docs/requirements.txt +mkdocs serve # http://localhost:8000 +``` + +Edits to any file under `docs/` or `mkdocs.yml` hot-reload in the browser. 
+ +## Deploying + +A push to `main` that touches `docs/`, `mkdocs.yml`, or `.github/workflows/docs.yml` triggers `Deploy Docs`, which builds the site with `mkdocs gh-deploy` and force-pushes the rendered HTML to the `gh-pages` branch. GitHub Pages serves it at https://kagents.dev (and at `amcheste.github.io/claude-teams-operator` until the custom domain DNS resolves). + +## Structure + +The site uses the [Diátaxis framework](https://diataxis.fr). Four sections: Tutorials, How-to guides, Reference, Explanation. Section pages will be filled in by the v0.7.0 content issues. For now only the homepage exists. diff --git a/docs/cfp/cfp-draft.md b/docs/cfp/cfp-draft.md deleted file mode 100644 index d502b8f..0000000 --- a/docs/cfp/cfp-draft.md +++ /dev/null @@ -1,156 +0,0 @@ -# KubeCon NA 2026 — kagents CFP Draft - -> **Project:** **kagents** ([kagents.dev](https://kagents.dev)) — implementation in [`claude-teams-operator`](https://github.com/amcheste/claude-teams-operator). -> -> Draft submission for issue [#23](https://github.com/amcheste/claude-teams-operator/issues/23). Conference: KubeCon + CloudNativeCon North America 2026, Salt Lake City, Nov 9–12. CFP deadline: **May 31, 2026 at 11:59pm MT**. Submit at https://sessionize.com/kubecon-cloudnativecon-north-america-2026/. -> -> This is a starting draft. Every field below is meant to be edited. Open questions for the maintainer are listed at the bottom. - ---- - -## Submission metadata - -| Field | Recommendation | Rationale | -|-------|----------------|-----------| -| **Submission type** | Session Presentation (30 min) | The form's options are 5 / 30 / 75 minutes. 30 fits the demo-heavy structure without padding. Tutorial (75 min) is the alternative if the maintainer wants hands-on. | -| **Track (primary)** | AI Inference + Agentic | New track for 2026. Direct fit: this is a system for running agent workloads on K8s. 
| -| **Track (alternate)** | Platform Engineering | Reasonable alternate angle: the operator *is* platform infra for agent teams. Pick AI Inference + Agentic if both feel viable, since the program committee may load-balance between them. | -| **Audience level** | Intermediate | Assumes operator-pattern literacy (CRDs, reconcile loops, RBAC, PVC access modes). Does not assume Claude Code or LLM background. | -| **Case study?** | No | This is a project talk, not a deployment retrospective. | - ---- - -## Abstract title (75 char max) - -**Primary:** - -``` -Reconciling Agent Teams: A Kubernetes Operator for Claude Code -``` - -(62 characters) - -**Alternates worth considering:** - -``` -Stateless Agents, Stateful Cluster: K8s for Claude Code Agent Teams -``` - -(67 characters — leans harder into the architectural narrative from KUBECON.md: the agent forgets, the cluster remembers.) - -``` -The Operator Pattern for Multi-Agent Coding Teams -``` - -(49 characters — most general, drops the Claude Code brand. Use if the program committee tends to read brand-named titles as vendor pitches.) - ---- - -## Abstract (1,300 char max) - -> Most multi-agent orchestration frameworks treat Kubernetes as deployment infrastructure: pods that happen to run an LLM. This talk shows what changes when the cluster becomes the coordination fabric. **kagents** ([kagents.dev](https://kagents.dev)) runs Anthropic's Claude Code Agent Teams as a CRD-driven workload, preserving the native file-based mailbox protocol over a ReadWriteMany PVC instead of inventing a new one. An AgentTeam resource declares a lead, teammates, budget, quality gates, and lifecycle policy in a single spec. The reconciler provisions per-teammate git worktrees, scopes each pod with its own ServiceAccount, and re-spawns crashed agents using the durable task list as recovery state. The agent does not remember the conversation, but the task list on the PVC tells the fresh pod what work remains. 
The talk walks through the architectural choices that made this work in K8s: why agent state lives on a PVC instead of a CRD status field, why RestartPolicy is Never, what RWX storage you actually need in production versus on a laptop, and how Prometheus metrics, webhooks, and human approval gates plug into the reconcile loop. A live demo deploys a coding team to a Kind cluster, shows mailbox traffic between pods, kills a teammate, and watches the operator respawn it from the task list. - -(~1,290 characters, against a 1,300 limit. The buffer is small. Trim if any field is added during iteration.) - ---- - -## Audience - -> Platform engineers, operator authors, and SREs who run Kubernetes and are evaluating how to host multi-agent LLM workloads without building a custom protocol. Attendees should be comfortable with the operator pattern (CRDs, controllers, reconcile loops), Kubernetes RBAC, and PVC access modes. Familiarity with Claude Code or Agent Teams is helpful but not required; the talk explains the native protocol and the K8s primitives it maps to. Attendees will leave with a clear picture of which Kubernetes building blocks translate cleanly to agent workloads (git worktrees as a concurrency primitive, ServiceAccounts as per-agent capability boundaries, owner references for cascade deletion of team state) and which assumptions break down at scale (CRD status as long-running state, single-node RWO fallbacks, real-time cost tracking). - ---- - -## Benefits to the ecosystem (1,000 char max) - -> Cloud-native multi-agent systems are a 2026 priority for both CNCF and individual platform teams, but most current solutions invent new orchestration protocols and layer them on top of Kubernetes. This talk demonstrates the alternative: model the agent team as a first-class Kubernetes resource and let existing primitives do the coordination work. 
The architectural patterns generalize beyond Claude Code; any multi-agent system with file-based or shared-state coordination can adopt the same approach. The talk surfaces the honest tradeoffs (ReadWriteMany storage cost, estimation-based budget tracking, the limits of CRD status for long-running state) so attendees can evaluate whether the pattern fits their workloads. The operator is open source under Apache 2.0, ships with a published Helm chart and Prometheus dashboard, and gates every release on a real-Claude end-to-end test in CI. - -(~970 characters) - ---- - -## Open source projects discussed - -- **kagents** ([kagents.dev](https://kagents.dev)) — the operator itself, Apache 2.0; implementation at [claude-teams-operator](https://github.com/amcheste/claude-teams-operator) -- [Kubernetes](https://github.com/kubernetes/kubernetes) — the platform; specifically `controller-runtime`, `kubebuilder`, RBAC, PVC subsystem -- [Prometheus](https://github.com/prometheus/prometheus) and [Grafana](https://github.com/grafana/grafana) — metrics scraping and the published dashboard ConfigMap -- [Helm](https://github.com/helm/helm) — chart packaging and release distribution -- Anthropic's Claude Code Agent Teams protocol — the native file-based coordination format the operator preserves (Claude Code itself is not open source; the protocol behavior is documented and stable enough to wrap as-is) - ---- - -## Reviewer-facing talk outline (~30 min) - -This expands on the abstract — provided in case the Sessionize form exposes a longer description field, and to anchor the demo plan. - -| Time | Beat | -|------|------| -| 0:00 | The problem framing. Most agent frameworks bolt onto Kubernetes; this talk argues for the inverse — Kubernetes primitives doing the coordination work. | -| 2:00 | Native Agent Teams in 60 seconds: file-based JSON mailboxes, shared task list, no session resumption. Why this protocol is unusually well-suited to a shared filesystem. 
| -| 5:00 | The `AgentTeam` CRD: one spec for a whole team (lead + teammates + lifecycle + budget). Contrast with agent-as-a-resource designs. | -| 8:00 | Phase state machine: `Pending → Initializing → Running → Completed/Failed/TimedOut/BudgetExceeded`. How state transitions map to actual K8s objects (PVCs, init Job, pods). | -| 11:00 | The ReadWriteMany requirement, in detail. Why coordination over a PVC actually works. What fails on RWO. The single-node RWO fallback used in CI and what it can and cannot prove. | -| 14:00 | Per-agent RBAC. Each pod gets its own ServiceAccount with `resourceNames`-restricted Roles on the secrets and PVCs it owns. A free security win that non-native orchestrators have to reinvent. | -| 16:00 | **Demo 1 — Crash recovery.** Deploy a coding team. Show mailbox files appearing on the PVC. Kill a teammate pod. Watch the reconciler respawn it. The fresh agent has no conversation memory, but the task list tells it what is left. | -| 21:00 | `onComplete` actions: `create-pr` opens a real GitHub PR via the REST API; `push-branch` consolidates per-teammate worktree branches into one head via a Job. The worktree-as-concurrency-primitive story. | -| 24:00 | **Demo 2 — Observability.** Prometheus metrics, the Grafana dashboard ConfigMap, an approval gate firing a webhook before a sensitive teammate spawns. | -| 27:00 | Honest tradeoffs we are still working through: estimation-based budget tracking, real multi-node test coverage, the limits of CRD status as a substitute for a workflow engine. | -| 29:00 | Wrap and pointers (repo, Helm chart, contributor docs). | -| 30:00 | Q&A. | - ---- - -## Demo plan - -Two demos, both runnable on a laptop with Kind: - -1. **Crash recovery (5 min, on stage).** Deploy a 3-agent `AgentTeam` from a sample manifest, watch pods come up, observe mailbox JSON appearing on the shared PVC, `kubectl delete pod` one teammate, watch the reconciler respawn it. 
The point is to show the agent's lost context window does not lose the team's progress, because the task list is durable. - -2. **Observability and gates (3 min, on stage).** Bring up the Grafana dashboard against the operator's Prometheus metrics. Trigger an approval gate so a webhook fires; grant approval via `kubectl annotate`; watch the gated teammate spawn. - -Both demos run today against the shipped v0.5.0 release. Backup recordings will be prepared in case live demo bandwidth fails on the venue Wi-Fi. - ---- - -## Speaker bio - -> _TBD — see open questions._ - ---- - -## Prior speaking history - -> _TBD — see open questions._ - ---- - -## Open questions for the maintainer - -These are the items that need maintainer input before submission: - -1. **Speaker bio** — short paragraph (≤ ~500 chars) covering current role, relevant background, and any past public talks or projects. Include a recent headshot upload-ready. -2. **Prior speaking history** — has the maintainer presented at a CNCF event in the past 12 months? The form asks for video links if so. -3. **Track preference** — primary recommendation here is **AI Inference + Agentic**; the alternate is **Platform Engineering**. Which one does the maintainer want as the primary track? (Submitting to one does not preclude the program committee from re-routing.) -4. **Title preference** — three candidates above. Maintainer's call. -5. **Co-speaker?** — solo or two-speaker? The form allows up to two on a Session Presentation. -6. **Tutorial alternate?** — if the talk lands strongly, a 75-minute Tutorial slot is also viable (deploy a team in real time, walk through the CRD field by field). Worth submitting both? The CFP allows up to three submissions per speaker. -7. **Demo cluster** — confirm the on-stage cluster is Kind on a laptop, vs. an actual cloud cluster. Bandwidth and predictability favor Kind; "real cluster" favors the multi-node RWX story. -8. 
**Release alignment** — the v0.6.0 (Operator Dashboard) and v1.0.0 (Demo Polish) milestones land before the conference. Should the dashboard be part of the demo, or kept as a parallel track? Including it strengthens the story but adds a moving piece to rehearse. - ---- - -## Notes on substance - -Everything in the abstract and the outline maps to shipped, tested code in v0.1.0–v0.5.0: - -- AgentTeam CRD with single-spec team declaration → [api/v1alpha1/agentteam_types.go](../../api/v1alpha1/agentteam_types.go) -- Reconciler phase state machine → [internal/controller/agentteam_controller.go](../../internal/controller/agentteam_controller.go), see also [ARCHITECTURE.md § Phase State Machine](../../ARCHITECTURE.md#phase-state-machine) -- ReadWriteMany PVC coordination + single-node fallback → [ARCHITECTURE.md § Storage Requirements](../../ARCHITECTURE.md#storage-requirements), [hack/mailbox-smoke-test.sh](../../hack/mailbox-smoke-test.sh) -- Per-agent ServiceAccounts with `resourceNames`-restricted Roles — shipped in v0.4.0 (#14) -- Crash respawn with restart counters — v0.4.0 (#13) -- `onComplete: create-pr` — v0.4.0 (#15); `onComplete: push-branch` — v0.4.0 (#16) -- Prometheus metrics + Grafana dashboard ConfigMap — v0.3.0 -- Webhook engine + approval gates — v0.3.0 -- AgentTeamTemplate + AgentTeamRun controllers — v0.5.0 (#17, #18) -- Real-Claude E2E gate before release publishes — v0.4.0 (#150) - -No claim in this draft refers to unshipped work. diff --git a/docs/explanation/coordination.md b/docs/explanation/coordination.md new file mode 100644 index 0000000..bb17b98 --- /dev/null +++ b/docs/explanation/coordination.md @@ -0,0 +1,172 @@ +# Coordination protocol + +This is the load-bearing design choice in kagents: agent-to-agent communication happens through files on a shared PVC, not through a custom RPC protocol. Understanding why explains most of the rest of the architecture. + +## Why a shared filesystem instead of a message bus? 
+ +Anthropic's Claude Code Agent Teams runs natively on a single machine using tmux. Multiple Claude Code instances coordinate via files under `~/.claude/teams/` and `~/.claude/tasks/`: JSON inboxes for peer-to-peer messages and a JSON task list for shared work tracking. The protocol is unspecified beyond "look at the files." + +We could have translated this to Redis, NATS, or a custom gRPC service. We chose not to: + +- **No protocol versioning to track.** Claude Code owns the format. When it ships a v2 mailbox schema, kagents inherits it for free. We never read or write the contents. +- **No translation layer to debug.** When something goes wrong, you can `kubectl exec` into a pod and inspect the actual files Claude Code is reading and writing. There's no opaque protocol bridge in the middle. +- **No additional infrastructure.** A bare RWX PVC is enough. No Redis to operate, no message-bus HA story. + +The cost is real. ReadWriteMany storage isn't free on every cluster, and we have to be honest about that. + +## Mailbox layout + +Each agent has an inbox at a stable path under `~/.claude/teams/`: + +``` +~/.claude/ + teams/ + {team-name}/ + inboxes/ + lead.json + teammate-a.json + teammate-b.json + ... + tasks/ + {team-name}/ + tasks.json +``` + +- **Inboxes** are peer-to-peer. The lead writes to `inboxes/teammate-a.json` to address teammate A; teammate A reads its own inbox to receive messages. +- **The task list** is broadcast: the lead writes tasks; teammates claim them via writes to the same file (with file-locking to handle concurrent claims). + +These paths come from Claude Code itself, not from kagents. The operator just makes the files visible to all pods that need them. + +## Volume topology + +Each team uses up to three PVCs, all `ReadWriteMany`: + +```mermaid +graph TB + subgraph N1[Node 1] + L[Lead Pod
opus] + T1[Teammate Pod
backend-api] + end + subgraph N2[Node 2] + T2[Teammate Pod
frontend-auth] + T3[Teammate Pod
test-coverage] + end + + SS[(team-state PVC
RWX
~/.claude/teams
~/.claude/tasks)] + R[(repo PVC
RWX, coding mode
/workspace)] + O[(output PVC
RWX, Cowork mode
/workspace/output)] + + L -.mount.-> SS + L -.mount.-> R + T1 -.mount.-> SS + T1 -.mount.-> R + T2 -.mount.-> SS + T2 -.mount.-> R + T3 -.mount.-> SS + T3 -.mount.-> R + + style SS fill:#fff3e0,stroke:#f57c00 + style R fill:#e8f5e9,stroke:#388e3c + style O fill:#f3e5f5,stroke:#7b1fa2 +``` + +The `team-state` PVC is the coordination fabric. It carries the mailboxes and the task list. The `repo` PVC (coding mode) carries the git clone and per-teammate worktrees. The `output` PVC (Cowork mode) is where agents write artifacts. + +In practice the operator mounts the team-state PVC into each pod, and the entrypoint symlinks the `teams/` and `tasks/` subdirectories into `~/.claude/`: + +```bash +ln -sfn /var/claude-state/teams ~/.claude/teams +ln -sfn /var/claude-state/tasks ~/.claude/tasks +``` + +This preserves the native paths Claude Code expects without polluting the agent's per-pod `~/.claude/` config. + +## Why ReadWriteMany? + +Two pods need to write to the same file at the same time: + +1. **Mailbox writes.** The lead writes into a teammate's inbox. The teammate reads from its own inbox. Both sides happen continuously. +2. **Task claims.** Multiple teammates race to claim items from the shared task list. + +If the backing PVC supports only `ReadWriteOnce`, the second pod fails to mount with `volume already attached to a different node` and the team deadlocks before the first message round-trip. + +### Supported backends + +The operator has no opinion about the CSI driver. It asks for an RWX PVC and a `storageClassName` you supply. 
Backends that satisfy the contract: + +| Platform | Driver | Notes | +|----------|--------|-------| +| Amazon EKS | [EFS CSI driver](https://github.com/kubernetes-sigs/aws-efs-csi-driver) | Native RWX over NFS protocol | +| Google GKE | [Filestore CSI driver](https://cloud.google.com/filestore/docs/csi-driver) | Filestore instances advertise RWX | +| Azure AKS | [Azure Files CSI driver](https://learn.microsoft.com/azure/aks/azure-files-csi) | SMB or NFS protocol | +| Bare-metal / on-prem | NFS subdir provisioner, Longhorn, Rook/Ceph | Anything with `accessModes: [ReadWriteMany]` | +| Kind (multi-node dev) | NFS server provisioner | Installed by `make kind-create` | + +### Single-node fallback + +For laptops (Kind, k3d, minikube), a real RWX provisioner is overkill. The operator accepts a `--pvc-access-mode=ReadWriteOnce` flag. This works **only** because every pod lands on the same node, and a hostPath-backed RWO PVC is then visible to all of them. + +!!! danger "Don't use RWO on a multi-node cluster" + A second pod scheduled on a different node will fail to mount the PVC and the team will deadlock. The single-node fallback is a development convenience, not a production option. + +## Per-teammate git worktrees (coding mode) + +When `spec.repository` is set, the init Job: + +1. Clones the repository into `/workspace/repo` +2. Creates one git worktree per teammate at `/workspace/worktrees/{teammate-name}` on a dedicated branch named `teammate-{teammate-name}` +3. Initialises the team-state directories and an empty task list + +Each teammate pod receives `WORKTREE_PATH=/workspace/worktrees/{teammate-name}` and the entrypoint `cd`s there before launching Claude Code. The lead has no worktree path and works directly from `/workspace/repo`. + +The branch naming is a deliberate choice. Each teammate's commits go to `teammate-{name}`, completely isolated from peers' work-in-progress. 
There's no possibility of a merge conflict between concurrent agents because they never share a branch. The lead (or an `onComplete` action) handles consolidation at the end. + +```mermaid +graph LR + M[main branch] -.cloned to.-> R[/workspace/repo] + R -.worktree.-> WA[/workspace/worktrees/backend-api
branch: teammate-backend-api] + R -.worktree.-> WB[/workspace/worktrees/frontend-auth
branch: teammate-frontend-auth] + R -.worktree.-> WC[/workspace/worktrees/test-coverage
branch: teammate-test-coverage] + + style R fill:#e8f5e9,stroke:#388e3c + style WA fill:#fff3e0,stroke:#f57c00 + style WB fill:#fff3e0,stroke:#f57c00 + style WC fill:#fff3e0,stroke:#f57c00 +``` + +## Push-branch consolidation (`onComplete`) + +When the team finishes successfully and `lifecycle.onComplete: push-branch` is set, the operator runs a terminal Job that: + +1. Iterates each teammate worktree +2. `git merge --no-ff` each `teammate-{name}` branch into a fresh consolidation branch +3. `git push` the consolidated branch to the remote + +The default consolidated branch name is `teams/{team-name}` (Go template; overridable via `lifecycle.consolidatedBranchTemplate`). The operator sets `status.consolidatedBranch` once the push succeeds. + +If `onComplete: create-pr` is also set (or used alone), the operator opens a GitHub PR with the consolidated branch as the head. PR title and body are configurable via `lifecycle.pullRequest.titleTemplate` and `bodyTemplate`. + +## Cowork mode + +When `spec.workspace` is set instead of `spec.repository`, the operator skips the init Job and the worktree machinery entirely: + +- Creates an `output` PVC for writable agent output +- Mounts `workspace.inputs` (ConfigMaps or existing PVCs) read-only into each pod +- Doesn't set `WORKTREE_PATH`; agents work in `/workspace/output` or `/workspace/data` + +The mailbox protocol is identical. Cowork agents still coordinate via `~/.claude/teams/.../inboxes/`. The only difference is what filesystem they're writing artifacts into. + +## What this means for debugging + +A surprising amount of the system is just files on disk: + +- See what's in a teammate's inbox right now: `kubectl exec -n dev-agents - -- cat ~/.claude/teams//inboxes/.json` +- See the live task list: `kubectl exec -n dev-agents -lead -- cat ~/.claude/tasks//tasks.json` +- See worktree state: `kubectl exec ... -- git -C /workspace/worktrees/ log --oneline` + +There's no opaque coordinator process to dump. 
Everything Claude Code knows about its teammates is on the shared filesystem. + +## Where to look next + +- [Resource model](resources.md). The CRDs that compose into a running team +- [Operations](operations.md). Budget, RBAC, and observability for the running team diff --git a/docs/explanation/index.md b/docs/explanation/index.md new file mode 100644 index 0000000..852f6f8 --- /dev/null +++ b/docs/explanation/index.md @@ -0,0 +1,15 @@ +# Explanation + +The "why" behind kagents. Architecture, design tradeoffs, the choices that shaped the project. Read these when you want to understand what's actually happening, not just how to use it. + +## Pages + +- **[Resource model](resources.md)**. The three CRDs (`AgentTeam`, `AgentTeamTemplate`, `AgentTeamRun`), how they relate, and when to reach for which. +- **[Coordination protocol](coordination.md)**. The file-based mailbox model, why ReadWriteMany is required, per-teammate git worktrees as a concurrency primitive. +- **[Operations](operations.md)**. Budget estimation, per-agent RBAC, observability via Prometheus + Grafana + webhooks. + +## Going deeper + +The repo's [`ARCHITECTURE.md`](https://github.com/amcheste/claude-teams-operator/blob/main/ARCHITECTURE.md) is the design doc. Denser, more focused on rationale than on usage. It overlaps with these pages but goes further into the file-by-file structure of the codebase. + +The [KubeCon NA 2026 talk](https://github.com/amcheste/claude-teams-operator/blob/main/KUBECON.md) frames the same architecture from the conference angle (interesting problems encountered, competitive landscape, design decisions worth surfacing on stage). diff --git a/docs/explanation/operations.md b/docs/explanation/operations.md new file mode 100644 index 0000000..415dd90 --- /dev/null +++ b/docs/explanation/operations.md @@ -0,0 +1,168 @@ +# Operations + +Three concerns once a team is running: how much it costs, what each agent can touch, and how you see what it's doing. 
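For the first of these concerns, cost, the per-minute figures quoted in the budget table on this page follow directly from the stated heuristic of 50,000 input and 5,000 output tokens per active minute. The following is a minimal Python sketch of that arithmetic, assuming the list prices from the table. The function names and the `>=` comparison are hypothetical illustrations, not the operator's Go implementation in `internal/budget`.

```python
# Hypothetical sketch of the budget heuristic; names are illustrative only.

# USD per million tokens (input, output), taken from the table on this page.
PRICES_PER_MILLION = {
    "opus": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
}

# Heuristic from the docs: tokens consumed per active minute, per agent.
INPUT_TOKENS_PER_MIN = 50_000
OUTPUT_TOKENS_PER_MIN = 5_000


def cost_per_minute(model: str) -> float:
    """Estimated USD per active minute for one agent running `model`."""
    input_price, output_price = PRICES_PER_MILLION[model]
    tokens_cost = (INPUT_TOKENS_PER_MIN * input_price
                   + OUTPUT_TOKENS_PER_MIN * output_price)
    return tokens_cost / 1_000_000


def estimated_cost(agents: list[tuple[str, float]]) -> float:
    """Sum the estimate over (model, active_minutes) pairs, one per agent."""
    return sum(cost_per_minute(model) * minutes for model, minutes in agents)


def budget_exceeded(estimate_usd: float, limit_usd: float) -> bool:
    """Assumption: reaching the limit counts as crossing it."""
    return estimate_usd >= limit_usd


# One opus lead plus three sonnet teammates, each active for 60 minutes.
team = [("opus", 60), ("sonnet", 60), ("sonnet", 60), ("sonnet", 60)]
total = estimated_cost(team)
print(f"estimated cost: ${total:.2f}")
print("over a $30.00 limit:", budget_exceeded(total, 30.00))
```

Under these assumptions a four-agent team accrues roughly $63 per hour, which is why a `budgetLimit` sized for the work you expect should leave generous headroom.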
+ +## Budget tracking + +Claude Code does not expose real-time token usage to the outside world. The operator estimates cost from elapsed time and the model assigned to each agent. + +### How the estimate works + +The estimator (in [`internal/budget`](https://github.com/amcheste/claude-teams-operator/tree/main/internal/budget)) treats every active agent session as if it consumes tokens at a fixed rate per minute. The cost is Anthropic's published list price per million tokens applied to a **heuristic of 50,000 input + 5,000 output tokens per active minute** per agent. + +| Model | Input ($/M tokens) | Output ($/M tokens) | Approx. cost / minute / agent | +|-------|-------------------:|--------------------:|------------------------------:| +| `opus` | $5.00 | $25.00 | $0.375 | +| `sonnet` | $3.00 | $15.00 | $0.225 | + +The cost ticks up monotonically while pods are in `Running`. The reconciler aggregates per-teammate cost into `status.estimatedCostUsd`. + +### What triggers BudgetExceeded + +The reconciler compares `status.estimatedCostUsd` against `spec.lifecycle.budgetLimit` on every reconcile loop. When the estimate crosses the limit: + +1. Phase transitions to `BudgetExceeded` +2. All agent pods are deleted via owner-reference cascade +3. `status.completedAt` is stamped +4. A `webhook.budgetExceeded` event fires (if configured) + +There's no grace period. The team stops on the first reconcile after the estimate crosses the limit. Set the limit with headroom. + +### Honest tradeoffs + +This is the lightest-touch approach available without instrumenting Claude Code. The honest limitations: + +- **Estimate, not measurement.** Real token usage depends on prompt length, context window growth, and how often the agent reaches for tools. The estimate can be off by 2-3x in either direction. +- **Heuristic is per-active-minute.** An agent waiting on `dependsOn` doesn't accrue cost, but an agent running flat out accrues cost at the same rate as one that is mostly idle. The heuristic averages the difference away. 
+- **Rate table is hardcoded.** The token-per-minute heuristic and the per-million prices live in `internal/budget/tracker.go`. Adjusting them requires a code change and rebuild. Config-via-Helm-values is on the roadmap. + +For production, set `budgetLimit` ~2x what you actually want to spend, and treat the budget as a circuit breaker rather than a precise meter. Real cost tracking via instrumented Claude Code or sidecar log parsing is on the roadmap; until then, the [Anthropic console](https://console.anthropic.com/) is the source of truth for accounting. + +## Per-agent RBAC + +Every agent pod in a team gets its own `ServiceAccount`, `Role`, and `RoleBinding`. The lead and each teammate are isolated: a compromised teammate cannot read a peer's secrets or PVCs. + +### What gets created + +For an `AgentTeam` with a lead and three teammates, the operator creates: + +- 1 `ServiceAccount` for the lead +- 1 `Role` granting access to the lead's secrets and the team-state PVC +- 1 `RoleBinding` binding the SA to the Role +- 3 `ServiceAccount`s, one per teammate +- 3 `Role`s, scoped to that teammate's secrets and PVCs only +- 3 `RoleBinding`s, one per teammate + +All twelve resources are owned by the `AgentTeam`. Deleting the team garbage-collects everything. + +### What each agent can do + +The Roles use `resourceNames` to scope by name, not just by type. A teammate's Role grants: + +| Resource | Verbs | Scope | +|----------|------:|-------| +| `secrets` | `get` | Only the API key Secret + that teammate's git credentials Secret | +| `persistentvolumeclaims` | `get`, `list`, `watch` | Only the PVCs this team uses | + +Notably absent: + +- No `pods`. Agents cannot list or exec into peer pods. +- No `pods/exec`. The teammate cannot escape the pod by `kubectl exec`. +- No `configmaps`. Skill ConfigMaps are mounted by the operator; the agent cannot enumerate or read other ConfigMaps. + +### What this defends against + +The threat model is "a teammate's prompt is malicious or compromised." 
The blast radius from that scenario is: + +- ✅ Cannot read another teammate's secrets (different SA) +- ✅ Cannot exec into the lead pod (no `pods/exec`) +- ✅ Cannot enumerate cluster state (no list verbs on namespace-wide resources) +- ⚠️ Can write to the shared `team-state` PVC. A malicious teammate could poison the task list or write to a peer's inbox. This is inherent to the file-based protocol; mitigations would require Claude Code to authenticate writes. +- ⚠️ Can write to the shared `repo` PVC. Worktrees are isolated by branch, but the agent could `cd` to a peer's worktree. + +The RBAC model handles the K8s side cleanly; the filesystem-level threats need protocol-level signing to fully address. For most use cases (internal CI, trusted prompts), the filesystem trust model is acceptable. + +## Observability + +The operator exposes Prometheus metrics, ships a Grafana dashboard, and fires webhook events on key state transitions. + +### Prometheus metrics + +The operator binary exposes `/metrics` on port 8080 by default. 
Eight series, all labeled by team name and (where applicable) teammate name + model: + +| Metric | Type | Description | +|--------|------|-------------| +| `claude_team_active_total` | gauge | Count of teams in non-terminal phases | +| `claude_team_duration_seconds` | histogram | Wall-clock time from `Pending` to a terminal phase | +| `claude_teammate_tokens_total` | counter | Estimated tokens consumed per teammate / model | +| `claude_team_cost_usd` | gauge | Current `status.estimatedCostUsd` | +| `claude_team_tasks_completed_total` | counter | Tasks marked complete in the shared task list | +| `claude_teammate_restarts_total` | counter | Pod restarts per teammate | +| `claude_team_budget_remaining_usd` | gauge | `budgetLimit - estimatedCostUsd` | +| `claude_teammate_idle_seconds` | histogram | Time between task completions per teammate | + +Wire them to Prometheus by enabling the chart's ServiceMonitor: + +```bash +helm upgrade kagents ./charts/claude-teams-operator \ + --set metrics.serviceMonitor.enabled=true +``` + +### Grafana dashboard + +The chart ships a curated Grafana dashboard as a ConfigMap with the `grafana_dashboard: "1"` label. With the standard `kube-prometheus-stack`, the Grafana sidecar auto-imports it within ~30 seconds. + +```bash +helm upgrade kagents ./charts/claude-teams-operator \ + --set metrics.serviceMonitor.enabled=true \ + --set metrics.grafanaDashboard.enabled=true +``` + +The dashboard's panels cover active team count, cost rate, per-teammate task throughput, restart count, and idle-time distribution. + +### Webhook events + +The operator's webhook engine POSTs JSON payloads to a configured URL on key transitions. 
Events that fire: + +| Event type | When | +|------------|------| +| `team.started` | The team transitions to `Running` | +| `teammate.error` | A teammate pod enters `CrashLoopBackOff` or `Error` | +| `budget.warning` | Estimated cost crosses 80% of `budgetLimit` | +| `completed` | An approval gate is hit; reconciler is waiting on `kubectl annotate` | + +Configure via the chart's `webhook` values. Each event includes the team name, namespace, phase, and a payload-type-specific extras object. + +### Approval gates + +Approval gates pause spawning a specific teammate until a human applies an annotation. They're useful when one agent's output should be reviewed before subsequent agents see it. + +```yaml +spec: + lifecycle: + approvalGates: + - event: "spawn-email-drafter" + channel: "webhook" + webhookUrl: "https://hooks.example.com/approvals" +``` + +When the reconciler would otherwise spawn the gated teammate, it instead: + +1. Marks the teammate's `status.pendingApproval` field +2. Fires a `completed` webhook event with the gate name +3. Waits for the annotation `approved.claude.amcheste.io/spawn-email-drafter=true` + +Grant approval: + +```bash +kubectl annotate agentteam my-team \ + approved.claude.amcheste.io/spawn-email-drafter=true +``` + +Within 30 seconds (the default reconcile interval), the gated teammate spawns and joins the team. + +## Where to look next + +- [Resource model](resources.md). What an `AgentTeam` looks like under the hood +- [Coordination protocol](coordination.md). How the agents actually talk to each other +- [How-to guides](../how-to/index.md). Concrete operational recipes (coming in v0.7.0) diff --git a/docs/explanation/resources.md b/docs/explanation/resources.md new file mode 100644 index 0000000..873be1b --- /dev/null +++ b/docs/explanation/resources.md @@ -0,0 +1,251 @@ +# Resource model + +kagents manages three custom resources. Most users only ever touch the first one. 
+ +| CRD | What it represents | When to use | +|-----|-------------------|-------------| +| `AgentTeam` | A specific team running a specific job | One-off work, e.g. refactor, code review, report draft | +| `AgentTeamTemplate` | A reusable team blueprint | You'll instantiate the same team shape against many inputs | +| `AgentTeamRun` | One instantiation of a template | Used together with `AgentTeamTemplate` | + +## How they relate + +```mermaid +graph LR + T[AgentTeamTemplate
'3-agent-security-review'] -->|referenced by| R[AgentTeamRun
'q4-platform-review'] + R -->|owns + reconciles| A[AgentTeam
'q4-platform-review-team'] + A -->|owns| P1[lead Pod] + A -->|owns| P2[teammate Pods] + A -->|owns| V[PVCs] + + style T fill:#e1f5ff,stroke:#0288d1 + style R fill:#fff4e1,stroke:#f57c00 + style A fill:#e8f5e9,stroke:#388e3c +``` + +The `AgentTeamRun` controller merges the run's overrides on top of the template defaults and creates a child `AgentTeam`. Status flows back into the `AgentTeamRun` via an `Owns` watch, so `kubectl get agentteamrun` shows progress without users needing to know the child team exists. + +## AgentTeam + +The primary resource. Defines a single team and its lifecycle. + +The three load-bearing fields are `spec.lead`, `spec.teammates`, and either `spec.repository` (coding mode) or `spec.workspace` (Cowork mode): + +```yaml +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeam +metadata: + name: auth-refactor +spec: + repository: # coding mode: git repo + worktrees + url: "git@github.com:acme/backend.git" + branch: "main" + credentialsSecret: "git-credentials" + auth: + apiKeySecret: "anthropic-api-key" + lead: + model: "opus" + prompt: "..." + teammates: + - name: "backend-api" + model: "sonnet" + prompt: "..." + dependsOn: [] + lifecycle: + timeout: "2h" + budgetLimit: "30.00" + onComplete: "create-pr" +``` + +### Status + +The reconciler routes on `status.phase`: + +``` +(new CR) + │ + ▼ +Pending ─────► Initializing ─────► Running ─────► Completed + │ │ │ │ + │ │ init Job failed │ pod failed │ pods deleted, + │ ▼ ▼ │ completedAt stamped + │ Failed Failed/ ▼ + │ TimedOut/ (terminal) + │ BudgetExceeded + ▼ + Failed +``` + +Terminal phases (`Completed`, `Failed`, `TimedOut`, `BudgetExceeded`) trigger cleanup. Pods get deleted, `status.completedAt` gets stamped, the reconciler stops requeuing. + +Other status fields worth knowing: + +- `status.lead.phase` and `status.teammates[].phase`. Per-pod state +- `status.estimatedCostUsd`. Budget tracker output (see [Operations](operations.md)) +- `status.consolidatedBranch`. 
Populated when `onComplete: push-branch` runs +- `status.conditions`. Kubernetes-style conditions array + +## AgentTeamTemplate + +A reusable team blueprint. Does not run on its own. It sits inert until an `AgentTeamRun` references it. + +```yaml +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeamTemplate +metadata: + name: 3-agent-security-review +spec: + lead: + model: "opus" + prompt: | + Run a security audit. Coordinate three reviewers: + - dependency-review for known CVEs + - secrets-scanner for committed credentials + - auth-audit for IAM/permission changes + teammates: + - name: "dependency-review" + model: "sonnet" + prompt: "..." + - name: "secrets-scanner" + model: "sonnet" + prompt: "..." + - name: "auth-audit" + model: "sonnet" + prompt: "..." + lifecycle: + timeout: "4h" + budgetLimit: "20.00" +``` + +The template controller validates the spec on create/update: + +- `dependsOn` references match real teammate names +- Model values are valid (`opus`, `sonnet`, `haiku`) +- No duplicate teammate names + +It writes a `Ready` condition on `status`. The `AgentTeamRun` controller refuses to instantiate templates where `Ready=false`. + +## AgentTeamRun + +One concrete run of a template. The controller merges run-level fields on top of the template's defaults and creates a child `AgentTeam`. + +```yaml +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeamRun +metadata: + name: q4-security-review +spec: + templateRef: + name: 3-agent-security-review + repository: + url: "git@github.com:acme/platform.git" + branch: "release/4.0" + credentialsSecret: "git-credentials" + auth: + apiKeySecret: "anthropic-api-key" + # Optional: override any teammate's prompt at run time + lead: + prompt: "Focus on the new OAuth flow this quarter." +``` + +The `AgentTeam` it creates is owned by the `AgentTeamRun` (set via `ctrl.SetControllerReference` in the controller). Deleting the `AgentTeamRun` cascades to the team and all its child resources. 
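The merge of run-level fields over template defaults, described at the top of this section, can be pictured as a recursive dictionary merge. What follows is a minimal Python sketch under that assumption. `merge_run_over_template` is a hypothetical name, the real controller is written in Go, and its exact merge rules (for example, how lists such as `teammates` are combined) are not documented here, so lists in this sketch are simply replaced.

```python
import copy


def merge_run_over_template(template_spec: dict, run_spec: dict) -> dict:
    """Return a new spec: template defaults with run-level fields on top.

    Hypothetical illustration only. Nested mappings are merged key by key;
    any other value in the run (strings, lists) replaces the template value.
    """
    merged = copy.deepcopy(template_spec)
    for key, value in run_spec.items():
        if key == "templateRef":
            # Assumption: the pointer to the template is not copied
            # into the child AgentTeam spec.
            continue
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_run_over_template(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


template = {
    "lead": {"model": "opus", "prompt": "Run a security audit."},
    "lifecycle": {"timeout": "4h", "budgetLimit": "20.00"},
}
run = {
    "templateRef": {"name": "3-agent-security-review"},
    "repository": {"url": "git@github.com:acme/platform.git",
                   "branch": "release/4.0"},
    "lead": {"prompt": "Focus on the new OAuth flow this quarter."},
}

team_spec = merge_run_over_template(template, run)
print(team_spec["lead"])
```

In this sketch the lead keeps the template's `model` while taking the run's `prompt`, and the template dictionary itself is left unmodified so it can be reused by later runs.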
+ +`status.phase` mirrors the child `AgentTeam`'s phase, so a single `kubectl get agentteamrun` shows the full picture. + +## Which one do I use? + +```mermaid +graph TD + Q[Need to run an agent team] --> A{Will you run this same team shape again?} + A -->|No, one-off| B[Use AgentTeam directly] + A -->|Yes, regularly| C{Multiple inputs? e.g. different repos, different branches} + C -->|Yes| D[Define AgentTeamTemplate once, instantiate with AgentTeamRun per input] + C -->|No, same inputs| B + + style B fill:#e8f5e9,stroke:#388e3c + style D fill:#e1f5ff,stroke:#0288d1 +``` + +The Template+Run pattern shines when you want the same team shape (same lead prompt, same teammate roles) parameterised by repo, branch, or per-run prompt overrides. For a one-off job, the indirection is overhead. Just write an `AgentTeam` directly. + +## Worked example: security review across three repos + +Define the template once: + +```yaml +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeamTemplate +metadata: + name: 3-agent-security-review + namespace: security-team +spec: + lead: + model: opus + prompt: "Coordinate dependency-review, secrets-scanner, and auth-audit." + teammates: + - name: dependency-review + model: sonnet + prompt: "Audit go.mod for CVEs via osv-scanner." + - name: secrets-scanner + model: sonnet + prompt: "Run trufflehog over the repo, report any matches." + - name: auth-audit + model: sonnet + prompt: "Diff RBAC manifests vs main, flag privilege escalation." 
+ lifecycle: + timeout: 2h + budgetLimit: "15.00" + onComplete: create-pr +``` + +Then trigger it on whatever repo needs it: + +```yaml +--- +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeamRun +metadata: { name: payments-review, namespace: security-team } +spec: + templateRef: { name: 3-agent-security-review } + repository: + url: git@github.com:acme/payments.git + branch: main + credentialsSecret: git-credentials + auth: { apiKeySecret: anthropic-api-key } +--- +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeamRun +metadata: { name: identity-review, namespace: security-team } +spec: + templateRef: { name: 3-agent-security-review } + repository: + url: git@github.com:acme/identity.git + branch: main + credentialsSecret: git-credentials + auth: { apiKeySecret: anthropic-api-key } +--- +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeamRun +metadata: { name: notifications-review, namespace: security-team } +spec: + templateRef: { name: 3-agent-security-review } + repository: + url: git@github.com:acme/notifications.git + branch: main + credentialsSecret: git-credentials + auth: { apiKeySecret: anthropic-api-key } +``` + +Three concurrent reviews. One template definition. Updating the template (e.g. tightening the lead prompt) automatically applies to future runs. + +## Owner references and cascade delete + +Every child resource (Pods, PVCs, ConfigMaps, the init Job, per-agent ServiceAccounts and Roles) has an owner reference to the `AgentTeam`. Deleting the `AgentTeam` cascades to all of them via Kubernetes garbage collection. + +If the team was created by an `AgentTeamRun`, that adds another layer: deleting the `AgentTeamRun` cascades to the `AgentTeam` (which then cascades to everything else). A single `kubectl delete agentteamrun` is sufficient for teardown. + +## Where to look next + +- [Coordination protocol](coordination.md). How the agents actually talk to each other +- [Operations](operations.md). 
Budget, RBAC, and observability +- [API reference (coming in v0.7.0)](../reference/index.md). Every field, every type, every default diff --git a/docs/helm-values.md b/docs/helm-values.md index 59bb39a..b160b1a 100644 --- a/docs/helm-values.md +++ b/docs/helm-values.md @@ -69,7 +69,7 @@ The operator pod is single-replica and lightweight by default. Bump limits if yo ## Storage -Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the storage class must support `ReadWriteMany` for multi-pod teams (NFS, EFS, CephFS) — see [ARCHITECTURE.md § Storage Requirements](../ARCHITECTURE.md#storage-requirements). +Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the storage class must support `ReadWriteMany` for multi-pod teams (NFS, EFS, CephFS). See [ARCHITECTURE.md § Storage Requirements](../ARCHITECTURE.md#storage-requirements). | Key | Default | Description | |---|---|---| @@ -77,7 +77,7 @@ Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the s | `storage.teamStateSize` | `5Gi` | Size of the team-state PVC (mailboxes + task list). | | `storage.repoSize` | `20Gi` | Size of the per-team repo PVC (clones + worktrees). | -## Metrics — Service + ServiceMonitor +## Metrics. Service + ServiceMonitor | Key | Default | Description | |---|---|---| @@ -86,9 +86,9 @@ Defaults applied to PVCs the operator creates per AgentTeam. **Required:** the s | `metrics.serviceMonitor.enabled` | `false` | **Production:** set to `true` when running with kube-prometheus-stack. Requires the `monitoring.coreos.com` CRDs. | | `metrics.serviceMonitor.namespace` | `""` | Namespace for the ServiceMonitor. Defaults to the release namespace. Set this to the Prometheus namespace when using a namespace-scoped selector. | | `metrics.serviceMonitor.interval` | `30s` | Prometheus scrape interval. | -| `metrics.serviceMonitor.additionalLabels` | `{}` | Extra labels on the ServiceMonitor. Match your Prometheus CR's selector — e.g. 
`{release: kube-prometheus-stack}`. | +| `metrics.serviceMonitor.additionalLabels` | `{}` | Extra labels on the ServiceMonitor. Match your Prometheus CR's selector, e.g. `{release: kube-prometheus-stack}`. | -## Metrics — Grafana dashboard +## Metrics. Grafana dashboard Renders a ConfigMap holding a 10-panel Grafana dashboard for Claude team observability. With kube-prometheus-stack, the Grafana sidecar auto-imports any ConfigMap carrying the configured label. diff --git a/docs/how-to/index.md b/docs/how-to/index.md new file mode 100644 index 0000000..1c2cde6 --- /dev/null +++ b/docs/how-to/index.md @@ -0,0 +1,27 @@ +# How-to guides + +Recipes for solving specific operational tasks. These assume you already have kagents installed and at least a basic working AgentTeam. If not, start with the [Getting Started tutorial](../tutorials/getting-started.md). + +## Install + +Cloud-specific install paths covering the ReadWriteMany storage configuration that's the actual deployment friction point on each cloud: + +- **[Install on Amazon EKS](install/eks.md)**. EFS CSI driver + EFS file system + Access Points +- **[Install on Google GKE](install/gke.md)**. Filestore CSI driver + Filestore instance +- **[Install on Azure AKS](install/aks.md)**. Azure Files CSI driver + Premium NFS share + +Each guide ends with the same `make mailbox-smoke-test` verification step. + +## Operate + +Day-to-day operational tasks once kagents is running: + +- **[Expose the dashboard](operate/expose-dashboard.md)**. Port-forward for dev, Ingress with basic auth for prod, oauth2-proxy for corporate SSO, namespace-scoping +- **[Configure shared storage](operate/shared-storage.md)**. Sizing the team-state / repo / output PVCs, backup strategies per cloud backend, performance tuning recipes +- **[Set budget alerts](operate/budget-alerts.md)**. Per-team `budgetLimit`, chart-wide default, webhook events to Slack/PagerDuty, Prometheus alert rules + +## Looking for something else? 
+ +- **New to the project?** Start with the [Getting Started tutorial](../tutorials/getting-started.md). +- **Want to understand how it works?** See the [Explanation](../explanation/index.md) section. +- **Need a specific CRD field or Helm value?** See the [Reference](../reference/index.md) section. diff --git a/docs/how-to/install/aks.md b/docs/how-to/install/aks.md new file mode 100644 index 0000000..f6509ec --- /dev/null +++ b/docs/how-to/install/aks.md @@ -0,0 +1,136 @@ +# Install on Azure AKS + +This guide walks you from a working AKS cluster to a running kagents operator backed by Azure Files for the ReadWriteMany storage requirement. + +## Prerequisites + +- An AKS cluster on Kubernetes 1.28+ +- `kubectl` configured against the cluster +- `helm` 3.14+ +- `az` CLI authenticated with the subscription that owns the cluster +- The cluster's resource group and node resource group. `az aks show -g -n ` shows them + +## 1. Verify the Azure Files CSI driver is enabled + +AKS includes the Azure Files CSI driver as a managed add-on, enabled by default on new clusters since 1.21. Verify: + +```bash +kubectl get csidriver file.csi.azure.com +``` + +If the resource doesn't exist, enable it: + +```bash +az aks update -g -n --enable-file-driver +``` + +## 2. Choose the file share protocol + +Azure Files supports two protocols, and only one is suitable for kagents: + +| Protocol | RWX? | Use? | +|----------|------|------| +| **NFS v4.1** | ✅ Yes | **Yes, use this.** | +| **SMB** | ⚠️ Partial | No. POSIX semantics on the agent's mailbox writes don't work reliably. | + +NFS shares require a Premium storage account (FileStorage SKU). The good news is Premium pricing is reasonable for the small share sizes kagents needs. + +## 3. Create the StorageClass + +```yaml title="storageclass-azurefile.yaml" +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + # Match the operator's default StorageClass name. 
+ name: nfs +provisioner: file.csi.azure.com +parameters: + protocol: nfs + skuName: Premium_LRS # FileStorage SKU; required for NFS shares + storageAccount: "" # leave empty for dynamic; populate for an existing account + resourceGroup: "" # leave empty to use the AKS node RG +mountOptions: + - nconnect=4 + - actimeo=30 + - hard +volumeBindingMode: Immediate +allowVolumeExpansion: true +reclaimPolicy: Delete +``` + +Apply it: + +```bash +kubectl apply -f storageclass-azurefile.yaml +``` + +The `nconnect=4` mount option opens four parallel TCP connections per mount, which significantly improves throughput on Azure Files. `actimeo=30` reduces metadata round-trips for the mailbox-poll workload. + +## 4. Install kagents + +```bash +helm install kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --create-namespace +``` + +Wait for the operator: + +```bash +kubectl rollout status deployment/kagents-controller-manager \ + --namespace claude-teams-system --timeout=120s +``` + +## 5. Verify with the mailbox smoke test + +```bash +git clone https://github.com/amcheste/claude-teams-operator.git +cd claude-teams-operator +make mailbox-smoke-test +``` + +A passing run reports: + +``` +PASS StorageClass=nfs AccessMode=ReadWriteMany RoundTripMs=918 +``` + +The first PVC takes longer to provision because the CSI driver creates a storage account if `storageAccount` is empty. Subsequent PVCs in the same RG reuse it. + +## Cost notes + +Azure Files Premium (FileStorage SKU) is billed by **provisioned capacity** per GiB-month, not actual usage: + +- **Provisioned capacity minimum**: 100 GiB per share. +- **Price**: ~$0.16/GiB-month for Premium NFS in most regions, plus tiny per-operation fees. +- **Network**: free within the same Azure region. + +A 100 GiB Premium share is **~$16/month**. That's enough for tens of concurrent teams' worth of mailbox state. For larger teams or longer retention, scale capacity up. 
Azure Files Premium auto-scales IOPS proportional to provisioned size. + +The honest range for a small production install is **$15–$50/month** depending on how aggressively you scale capacity for performance. + +## Common gotchas + +??? warning "`mount.nfs4: Permission denied` or `Stale file handle`" + The most common cause is the AKS subnet missing the `Microsoft.Storage` service endpoint. Add it: `az network vnet subnet update -g --vnet-name -n --service-endpoints Microsoft.Storage`. + +??? warning "PVCs stuck in `Pending` with `failed to provision volume: ... PrincipalNotFound`" + The AKS managed identity (or service principal) lacks `Storage Account Contributor` on the resource group. Grant it: + ```bash + az role assignment create \ + --assignee \ + --role "Storage Account Contributor" \ + --scope /subscriptions//resourceGroups/ + ``` + +??? warning "Slow mailbox round-trips (>5s)" + Azure Files NFS without `nconnect=4` can be 2-3x slower than expected. Add the mount option in the StorageClass and recreate any pods using existing PVCs to pick it up. + +??? warning "Cannot use Standard or Premium_ZRS SKU" + Only `Premium_LRS` supports NFS. Standard SMB shares technically support RWX but the file-locking semantics don't work for the mailbox protocol. Use Premium NFS. + +## Where to look next + +- [Resource model](../../explanation/resources.md). The CRDs you'll be writing +- [Coordination protocol](../../explanation/coordination.md). Why RWX matters in detail +- [Operations](../../explanation/operations.md). Budget, RBAC, observability for the running operator diff --git a/docs/how-to/install/eks.md b/docs/how-to/install/eks.md new file mode 100644 index 0000000..71a0e62 --- /dev/null +++ b/docs/how-to/install/eks.md @@ -0,0 +1,152 @@ +# Install on Amazon EKS + +This guide walks you from a working EKS cluster to a running kagents operator backed by Amazon EFS for the ReadWriteMany storage requirement. 
+ +## Prerequisites + +- An EKS cluster on Kubernetes 1.28+ +- `kubectl` configured against the cluster +- `helm` 3.14+ +- `aws` CLI authenticated with permissions to create EFS file systems and IAM policies +- The cluster's VPC ID and the security group used by your worker nodes. `aws eks describe-cluster --name ` shows them + +## 1. Install the EFS CSI driver + +The official AWS EFS CSI driver supports the `ReadWriteMany` access mode kagents requires. Install via the EKS add-on: + +```bash +aws eks create-addon \ + --cluster-name \ + --addon-name aws-efs-csi-driver \ + --resolve-conflicts OVERWRITE +``` + +Or via Helm if you prefer the upstream chart: + +```bash +helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/ +helm install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \ + --namespace kube-system +``` + +Verify pods are ready: + +```bash +kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-efs-csi-driver +``` + +## 2. Create the EFS file system + +```bash +aws efs create-file-system \ + --creation-token kagents-state \ + --performance-mode generalPurpose \ + --throughput-mode elastic \ + --encrypted \ + --tags Key=Name,Value=kagents-state +``` + +Note the returned `FileSystemId` (looks like `fs-0abc123def456`). + +Add a mount target in each worker subnet so pods on any node can mount it: + +```bash +# For each subnet your nodes live in: +aws efs create-mount-target \ + --file-system-id fs-0abc123def456 \ + --subnet-id subnet-... \ + --security-groups sg-... # the worker node security group +``` + +The security group must allow inbound NFS (TCP 2049) from itself. + +## 3. Create the StorageClass + +```yaml title="storageclass-efs.yaml" +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + # The operator defaults to a class named "nfs"; using that name avoids + # needing to override storage.storageClassName in the chart values. 
+ name: nfs +provisioner: efs.csi.aws.com +parameters: + provisioningMode: efs-ap + fileSystemId: fs-0abc123def456 + directoryPerms: "700" + uid: "65532" + gid: "65532" +reclaimPolicy: Retain +volumeBindingMode: Immediate +``` + +Apply it: + +```bash +kubectl apply -f storageclass-efs.yaml +``` + +The `efs-ap` provisioning mode creates an EFS Access Point per PVC, which gives each team its own permissioned root directory inside the shared file system. + +## 4. Install kagents + +```bash +helm install kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --create-namespace +``` + +Wait for the operator to be ready: + +```bash +kubectl rollout status deployment/kagents-controller-manager \ + --namespace claude-teams-system --timeout=120s +``` + +## 5. Verify with the mailbox smoke test + +The repo includes a smoke test that provisions an `AgentTeam`, lets the lead and a teammate exchange a single mailbox round-trip, and reports the effective StorageClass and AccessMode. + +```bash +git clone https://github.com/amcheste/claude-teams-operator.git +cd claude-teams-operator +make mailbox-smoke-test +``` + +A passing run looks like: + +``` +PASS StorageClass=nfs AccessMode=ReadWriteMany RoundTripMs=842 +``` + +If `AccessMode` reports `ReadWriteOnce` or the test fails to schedule the second pod, your StorageClass isn't actually advertising RWX. Re-check step 3. + +## Cost notes + +EFS is billed by storage GB-month + provisioned throughput. For a typical kagents deployment running 5-10 teams concurrently: + +- **Storage**: 1-5 GiB per team. At ~$0.30/GiB-month (Standard storage class), expect $0.50–$2/month for storage. +- **Throughput**: in `elastic` mode you pay per byte read/written (~$0.01/GiB). Idle teams cost nothing; active teams during a busy period might generate a few GiB of traffic per day. +- **Per-mount cost**: nothing. EFS mount targets are free. 
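The arithmetic behind the two metered components listed above can be sketched as follows. The rates are the approximate ones quoted in that list, not authoritative prices, and the `efs_monthly_cost` helper and its parameters are hypothetical, introduced only for this illustration.

```python
# Approximate rates quoted in the list above; check the EFS pricing page
# for current values in your region.
STORAGE_USD_PER_GIB_MONTH = 0.30
TRANSFER_USD_PER_GIB = 0.01


def efs_monthly_cost(teams: int, gib_per_team: float,
                     traffic_gib_per_day: float, days: int = 30) -> float:
    """Rough monthly estimate: stored data plus bytes read/written."""
    storage = teams * gib_per_team * STORAGE_USD_PER_GIB_MONTH
    transfer = traffic_gib_per_day * days * TRANSFER_USD_PER_GIB
    return round(storage + transfer, 2)


# Five teams holding 1 GiB each, moving about 3 GiB of traffic per day in total.
estimate = efs_monthly_cost(teams=5, gib_per_team=1, traffic_gib_per_day=3)
print(f"${estimate:.2f}/month")
```

The sketch covers only storage and elastic throughput. Plug in your own team count and traffic to see which of the two dominates.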
+ +The honest range for a small production install is **$5–$30/month**. For larger scale see the [EFS pricing page](https://aws.amazon.com/efs/pricing/). + +## Common gotchas + +??? warning "PVCs stuck in `Pending` with `failed to provision volume`" + Almost always one of: + - The EFS CSI driver pod isn't running. `kubectl get pods -n kube-system | grep efs` + - The IAM role attached to the node group lacks `elasticfilesystem:CreateAccessPoint` and `DescribeAccessPoints`. The EKS add-on form attaches the right policy automatically; the upstream Helm install requires manual IAM setup. See [AWS docs](https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources). + - The StorageClass references a `fileSystemId` that doesn't exist or has no mount targets in the right subnets. + +??? warning "Pods get stuck mounting with `mount.nfs4: Connection refused`" + The worker security group doesn't allow inbound NFS from itself. Add a rule: source ``, type `NFS`, port `2049`. + +??? warning "Slow first-mount on a fresh PVC" + EFS Access Point provisioning can take 30-60s on first use. After the first mount, subsequent mounts of the same PVC are fast. This is normal. + +## Where to look next + +- [Resource model](../../explanation/resources.md). The CRDs you'll be writing +- [Coordination protocol](../../explanation/coordination.md). Why RWX matters in detail +- [Operations](../../explanation/operations.md). Budget, RBAC, observability for the running operator diff --git a/docs/how-to/install/gke.md b/docs/how-to/install/gke.md new file mode 100644 index 0000000..25509d7 --- /dev/null +++ b/docs/how-to/install/gke.md @@ -0,0 +1,121 @@ +# Install on Google GKE + +This guide walks you from a working GKE cluster to a running kagents operator backed by Google Filestore for the ReadWriteMany storage requirement. 
+ +## Prerequisites + +- A GKE cluster on Kubernetes 1.28+ +- `kubectl` configured against the cluster +- `helm` 3.14+ +- `gcloud` CLI authenticated with the project that owns the cluster +- The cluster's VPC network and region. `gcloud container clusters describe ` shows them + +## 1. Enable the Filestore CSI driver + +GKE provides the Filestore CSI driver as a managed add-on. Enable it on the cluster: + +```bash +gcloud container clusters update \ + --update-addons=GcpFilestoreCsiDriver=ENABLED \ + --location +``` + +For new clusters you can enable it at create time with `--addons=GcpFilestoreCsiDriver`. + +Verify the driver pods are running: + +```bash +kubectl get pods -n kube-system -l k8s-app=gcp-filestore-csi-driver +``` + +## 2. Create the StorageClass + +The driver supports dynamic provisioning, so you don't need to create a Filestore instance manually. The CSI driver creates one when the first PVC binds. + +```yaml title="storageclass-filestore.yaml" +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + # Match the operator's default StorageClass name. + name: nfs +provisioner: filestore.csi.storage.gke.io +parameters: + tier: standard # or "premium" for SSD-backed; "enterprise" for regional HA + network: default # match your cluster's VPC +volumeBindingMode: WaitForFirstConsumer +allowVolumeExpansion: true +reclaimPolicy: Delete +``` + +Apply it: + +```bash +kubectl apply -f storageclass-filestore.yaml +``` + +`WaitForFirstConsumer` defers provisioning until a pod is scheduled, which lets the Filestore instance land in the right zone for the consuming pod. + +## 3. Install kagents + +```bash +helm install kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --create-namespace +``` + +Wait for the operator: + +```bash +kubectl rollout status deployment/kagents-controller-manager \ + --namespace claude-teams-system --timeout=120s +``` + +## 4. 
Verify with the mailbox smoke test + +```bash +git clone https://github.com/amcheste/claude-teams-operator.git +cd claude-teams-operator +make mailbox-smoke-test +``` + +A passing run reports the effective StorageClass and AccessMode: + +``` +PASS StorageClass=nfs AccessMode=ReadWriteMany RoundTripMs=623 +``` + +The first `make mailbox-smoke-test` run on Filestore takes a few minutes. Filestore instance provisioning is the slow step (~3-5 min). Subsequent test runs reuse the instance and complete in under 30s. + +## Cost notes + +Filestore is billed by provisioned capacity per hour, not actual usage: + +- **Standard tier**: ~$0.20/GiB-month. Minimum instance size is **1 TiB**, so the floor is ~$200/month per Filestore instance. +- **Premium tier (SSD)**: ~$0.30/GiB-month. Same 1 TiB minimum. +- **Enterprise tier (HA, regional)**: ~$0.60/GiB-month. 2.5 TiB minimum. + +Note that **each PVC creates a new Filestore instance by default** with this StorageClass config. If you're running many teams, this gets expensive fast. At least one instance per PVC times the 1 TiB minimum. + +For multi-team production use, set `volumeHandle` on a manually-provisioned shared Filestore instance and use sub-directory provisioning instead. See [GKE's Filestore docs](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver) for the multi-PVC pattern. + +The honest range for a small production install with one shared Filestore instance is **$200–$300/month**. + +## Common gotchas + +??? warning "PVC stuck in `Pending` with `does not satisfy capacity`" + Filestore instances have a 1 TiB minimum size. The kagents chart's default `storage.teamStateSize` is `5Gi`, but Filestore will round it up to the tier minimum. The PVC binds successfully. The warning resolves once provisioning completes (3-5 min). + +??? warning "`failed to create filestore instance: insufficient quota`" + Filestore instances count against a project-wide quota. 
`gcloud compute regions describe ` shows current usage. Request a quota increase via the GCP console. + +??? warning "Pods can't reach the Filestore IP" + The Filestore instance must be in the same VPC as the cluster. The StorageClass `network: default` parameter must match your cluster's VPC name. If you use a custom VPC, set it explicitly. + +??? warning "`Failed to create access mode RWO from RWX SC`" + Don't use `--pvc-access-mode=ReadWriteOnce` on GKE. Filestore is RWX-native; the operator just needs the default `ReadWriteMany` to work. + +## Where to look next + +- [Resource model](../../explanation/resources.md). The CRDs you'll be writing +- [Coordination protocol](../../explanation/coordination.md). Why RWX matters in detail +- [Operations](../../explanation/operations.md). Budget, RBAC, observability for the running operator diff --git a/docs/how-to/operate/budget-alerts.md b/docs/how-to/operate/budget-alerts.md new file mode 100644 index 0000000..fb9c612 --- /dev/null +++ b/docs/how-to/operate/budget-alerts.md @@ -0,0 +1,191 @@ +# Set budget alerts + +This guide covers the four ways to limit and observe spend on a kagents installation: per-team budget limits, the chart-wide default, webhook events on threshold crossings, and Prometheus alert rules. + +For how the budget estimate is computed and its honest limitations, see the [Operations explanation](../../explanation/operations.md). + +## Per-team `budgetLimit` + +The hard stop. When a team's `status.estimatedCostUsd` crosses `spec.lifecycle.budgetLimit`, the operator deletes all the team's pods and transitions the phase to `BudgetExceeded`. + +```yaml +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeam +metadata: + name: nightly-security-review +spec: + # ... + lifecycle: + timeout: 4h + budgetLimit: "10.00" # USD +``` + +There's no grace period. The team stops the moment the estimate crosses. 
The estimate is conservative-to-the-low-side (~50K input + 5K output tokens per agent per minute is a rough ballpark), so set the limit with **2x headroom** over what you actually want to spend. + +## Chart-wide default + +For teams that don't set their own `budgetLimit`, the operator falls back to a chart-level default. The default value is **$50.00**. Override at install time: + +```bash +helm upgrade kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --reuse-values \ + --set defaultBudgetLimit=15.00 +``` + +This is a safety net, not a recommendation. Every team should set its own `budgetLimit` based on the work it's doing. The default exists to prevent a misconfigured team from running unbounded. + +## Webhook events on threshold crossings + +The operator fires a `budget.warning` webhook event when a team's estimated cost crosses **80% of its `budgetLimit`**. Useful as an early warning before the hard stop fires. + +### Configure the webhook URL + +Set the chart-level webhook URL (applies to all teams unless overridden per-team): + +```bash +helm upgrade kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --reuse-values \ + --set webhook.defaultUrl=https://hooks.example.com/kagents +``` + +### Payload shape + +Each event POSTs a JSON body: + +```json +{ + "type": "budget.warning", + "team": { + "namespace": "dev-agents", + "name": "auth-refactor", + "phase": "Running" + }, + "budget": { + "limitUsd": 10.00, + "estimatedCostUsd": 8.42, + "percentOfLimit": 84.2 + }, + "timestamp": "2026-05-02T14:33:21Z" +} +``` + +### Wire to Slack + +For a Slack notification, point the webhook at an [Incoming Webhook URL](https://api.slack.com/messaging/webhooks) and translate the payload via a small Cloud Function or a dedicated webhook-relay service (Slack expects its own message format, not the kagents one). 
+ +A minimal relay in Cloud Run / AWS Lambda / Azure Functions: + +```python title="kagents-to-slack.py" +import json, os, urllib.request + +def handler(event): + body = json.loads(event["body"]) + if body["type"] != "budget.warning": + return {"statusCode": 204} + + msg = { + "text": f":warning: kagents budget warning: " + f"team `{body['team']['namespace']}/{body['team']['name']}` " + f"at {body['budget']['percentOfLimit']:.1f}% of " + f"${body['budget']['limitUsd']:.2f} limit " + f"(${body['budget']['estimatedCostUsd']:.2f} estimated)" + } + req = urllib.request.Request( + os.environ["SLACK_WEBHOOK_URL"], + data=json.dumps(msg).encode(), + headers={"Content-Type": "application/json"} + ) + urllib.request.urlopen(req) + return {"statusCode": 200} +``` + +### Wire to PagerDuty + +PagerDuty's [Events API v2](https://developer.pagerduty.com/docs/events-api-v2/overview/) accepts a similar relay pattern. The dedup key should combine team namespace + name so repeated `budget.warning` events for the same team collapse to a single incident. + +## Prometheus alert rules + +For teams that already have a Prometheus + Alertmanager stack, alert directly on the metrics the chart exposes. The relevant series: + +- `claude_team_cost_usd{team_name=...}`. Current estimated cost +- `claude_team_budget_remaining_usd{team_name=...}`. `limit - cost` + +### Alert: budget about to be exceeded + +```yaml title="kagents-alerts.yaml" +groups: + - name: kagents-budget + rules: + - alert: KagentsBudgetWarning + expr: | + (claude_team_cost_usd / on(team_name) group_left + (claude_team_cost_usd + claude_team_budget_remaining_usd)) + > 0.80 + for: 1m + labels: + severity: warning + annotations: + summary: "kagents team {{ $labels.team_name }} at {{ $value | humanizePercentage }} of budget" + description: | + Team {{ $labels.namespace }}/{{ $labels.team_name }} has + consumed {{ $value | humanizePercentage }} of its budget. + Hard stop fires at 100%. 
+ + - alert: KagentsBudgetExceeded + expr: claude_team_budget_remaining_usd <= 0 + for: 30s + labels: + severity: critical + annotations: + summary: "kagents team {{ $labels.team_name }} hit budget limit and was terminated" +``` + +Apply via your Prometheus operator's `PrometheusRule` CRD if you're using kube-prometheus-stack: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: kagents-budget + namespace: monitoring + labels: + release: kube-prometheus-stack +spec: + groups: + # ...same as above +``` + +### Alert: aggregate cost across all running teams + +For total spend visibility: + +```yaml +- alert: KagentsAggregateCostHigh + expr: sum(claude_team_cost_usd) > 100 + for: 5m + labels: + severity: warning + annotations: + summary: "Total in-flight kagents cost exceeds $100" + description: | + Aggregate estimated cost across all running teams: ${{ $value }}. + Investigate which teams are running with: kubectl get agentteams -A +``` + +## Cross-checking against actual spend + +The operator's estimate is a heuristic, not a meter. Reconcile against ground truth at least weekly: + +- Pull actual spend from the [Anthropic Console](https://console.anthropic.com/) usage API +- Compare against the `claude_team_cost_usd` Prometheus history for the same time window +- Adjust your `budgetLimit` headroom factor based on the observed estimate-vs-actual ratio + +If the estimate is consistently 50% low, double your `budgetLimit` headroom. If it's 200% high, you can tighten limits. + +## Where to look next + +- [Operations explanation](../../explanation/operations.md). How the budget is computed in detail +- [Expose the dashboard](expose-dashboard.md). Visual cost view per team +- [Configure shared storage](shared-storage.md). 
The other recurring cost on a kagents install diff --git a/docs/how-to/operate/expose-dashboard.md b/docs/how-to/operate/expose-dashboard.md new file mode 100644 index 0000000..35a0ecb --- /dev/null +++ b/docs/how-to/operate/expose-dashboard.md @@ -0,0 +1,136 @@ +# Expose the dashboard + +The dashboard ships with kagents but is **off by default**. Installing the chart alone gives you the controller and CRDs only. This guide walks through enabling it and exposing it for the three most common scenarios. + +For why the dashboard is off by default and what it can show, see the [Operations explanation](../../explanation/operations.md). + +## Enable the dashboard + +The dashboard is a chart sub-component gated on `dashboard.enabled`: + +```bash +helm upgrade kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --reuse-values \ + --set dashboard.enabled=true +``` + +This deploys: + +- A read-only `Deployment` running the dashboard binary +- A `ClusterIP` Service on port 8080 +- A dedicated `ServiceAccount` with read-only RBAC on AgentTeam CRs and Pods/log +- Templates for an optional Ingress (off by default) + +Verify the deployment: + +```bash +kubectl rollout status deployment/kagents-dashboard \ + --namespace claude-teams-system --timeout=60s +``` + +## Scenario 1: dev / first-look (port-forward) + +For local development or a quick "is it working" check, port-forward the Service: + +```bash +kubectl port-forward -n claude-teams-system svc/kagents-dashboard 8080:8080 +``` + +Open http://localhost:8080. You'll see the team list view; click any team for the detail page with live SSE updates. + +`port-forward` is fine for dev but is a single-user tunnel through your local kubeconfig. Don't rely on it for shared access. + +## Scenario 2: production (Ingress with basic auth) + +For a small-team production deployment, expose the dashboard via an Ingress with basic auth in front. 
Most ingress controllers can do this without a separate auth proxy. + +### a. Create the basic-auth secret + +```bash +# Generate a random password and print it once; you need it to log in later. +PASSWORD="$(openssl rand -base64 24)" +htpasswd -bc auth admin "$PASSWORD" +echo "admin password: $PASSWORD" +kubectl create secret generic dashboard-basic-auth \ + --namespace claude-teams-system \ + --from-file=auth +``` + +The `auth` file contains an htpasswd-formatted line; the `nginx` ingress controller (like most others) reads this format directly. + +### b. Configure the Ingress via Helm values + +```yaml title="dashboard-values.yaml" +dashboard: + enabled: true + ingress: + enabled: true + className: nginx # or "traefik", "alb", whatever your cluster uses + annotations: + nginx.ingress.kubernetes.io/auth-type: basic + nginx.ingress.kubernetes.io/auth-secret: dashboard-basic-auth + nginx.ingress.kubernetes.io/auth-realm: "kagents dashboard" + cert-manager.io/cluster-issuer: letsencrypt-prod # if using cert-manager + hosts: + - host: kagents.example.com + paths: + - path: / + pathType: Prefix + tls: + - hosts: [kagents.example.com] + secretName: kagents-dashboard-tls +``` + +Apply it: + +```bash +helm upgrade kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --reuse-values \ + -f dashboard-values.yaml +``` + +Point the DNS record for `kagents.example.com` at the Ingress controller's external IP. Once cert-manager provisions the TLS cert (1-3 minutes), browse to https://kagents.example.com and authenticate as `admin` with the password you generated. + +## Scenario 3: corporate (oauth2-proxy + identity provider) + +For larger teams that already have an OIDC identity provider (Okta, Auth0, Google Workspace, GitHub, etc.), put [`oauth2-proxy`](https://oauth2-proxy.github.io/oauth2-proxy/) in front of the dashboard. + +The pattern: + +1. Deploy oauth2-proxy as a separate Deployment + Service in the same namespace +2. Point your Ingress at oauth2-proxy instead of the dashboard +3.
Configure oauth2-proxy's `--upstream` flag to forward authenticated requests to `http://kagents-dashboard:8080` + +This is a standard pattern with extensive documentation in the oauth2-proxy project. The dashboard itself doesn't need to change. It stays on the internal Service, and oauth2-proxy handles all authentication and group/role checks before requests reach it. + +## Scoping the dashboard to one namespace + +By default the dashboard sees AgentTeams in **every** namespace (a `ClusterRoleBinding` grants read across the cluster). To restrict it to a single namespace, e.g. when teams in different namespaces belong to different tenants: + +```bash +helm upgrade kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --reuse-values \ + --set dashboard.enabled=true \ + --set dashboard.namespace=dev-agents +``` + +This: + +- Passes `--namespace=dev-agents` to the dashboard binary, so it only lists teams from that namespace +- Generates a `RoleBinding` scoped to `dev-agents` instead of a `ClusterRoleBinding` + +## Verifying + +Once the dashboard is reachable, deploy a quick test team and open the detail view: + +```bash +kubectl apply -n dev-agents -f config/samples/auth-refactor-team.yaml +``` + +The list view should show the team. Click in. The detail page streams live status updates via SSE; killing a teammate pod with `kubectl delete pod ...` should cause the page to redraw within a second or two. + +## Where to look next + +- [Operations explanation](../../explanation/operations.md). What the dashboard's metrics and alerts look like +- [Configure shared storage](shared-storage.md). Sizing and tuning the PVC backends +- [Set budget alerts](budget-alerts.md). 
Wiring webhook alerts on cost overruns diff --git a/docs/how-to/operate/shared-storage.md b/docs/how-to/operate/shared-storage.md new file mode 100644 index 0000000..29f8d68 --- /dev/null +++ b/docs/how-to/operate/shared-storage.md @@ -0,0 +1,121 @@ +# Configure shared storage + +kagents needs ReadWriteMany PVCs for the team-state, repo (coding mode), and output (Cowork mode) volumes. This guide covers sizing, backup, and per-backend performance tuning once you've picked a backend. + +For the *why* of RWX, see the [Coordination protocol explanation](../../explanation/coordination.md). For initial backend setup, see the cloud-specific install guides ([EKS](../install/eks.md), [GKE](../install/gke.md), [AKS](../install/aks.md)). + +## Sizing + +The chart's default sizes are conservative; raise them if your teams handle large repos or produce big outputs. + +| Volume | Default Helm value | Default size | When to raise | +|--------|-------------------:|-------------:|---------------| +| Team state (mailboxes + tasks) | `storage.teamStateSize` | `5Gi` | Almost never. Mailbox JSON is tiny. 5 GiB holds thousands of messages. | +| Repo (coding mode) | `storage.repoSize` | `20Gi` | If your monorepo + per-teammate worktrees together exceed 20 GiB. Each worktree is roughly the size of your `git checkout`. For a 5-teammate team on a 4 GiB repo, the five worktrees alone come to roughly 20 GiB, so the default leaves no room for the base clone. | +| Output (Cowork mode) | `spec.workspace.output.size` (per-team) | n/a | Set per AgentTeam based on expected artifact volume. 1 GiB is fine for documents; raise for image/video output.
| + +Override at install time: + +```bash +helm upgrade kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --reuse-values \ + --set storage.teamStateSize=10Gi \ + --set storage.repoSize=50Gi +``` + +The Cowork output size is per-team, set in the manifest: + +```yaml +spec: + workspace: + output: + mountPath: /workspace/output + size: 5Gi # adjust per team +``` + +## Backup + +For most use cases the team-state PVC can be discarded. The mailbox is intermediate state, and finished teams' artifacts live elsewhere (in the git remote or in the Cowork output PVC). For the cases where you do want backups: + +### EFS (EKS) + +Use [AWS Backup](https://aws.amazon.com/backup/) with an EFS resource type. A daily backup with 7-day retention is the standard pattern: + +```bash +aws backup create-backup-plan --backup-plan '{ + "BackupPlanName": "kagents-efs-daily", + "Rules": [{ + "RuleName": "DailyBackup", + "TargetBackupVaultName": "Default", + "ScheduleExpression": "cron(0 2 ? * * *)", + "Lifecycle": {"DeleteAfterDays": 7} + }] +}' +``` + +EFS backups are incremental after the first; cost scales with change rate, not full size. + +### Filestore (GKE) + +Use [Filestore Backups](https://cloud.google.com/filestore/docs/backups). They're snapshot-based; the first is full, subsequent are incremental: + +```bash +gcloud filestore backups create kagents-daily-$(date +%Y%m%d) \ + --source-instance \ + --source-instance-region \ + --region +``` + +Schedule via Cloud Scheduler hitting a Cloud Function that runs the above command. 
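
Daily backups named with a date stamp pile up, so whatever job Cloud Scheduler triggers usually also has to decide which backups are old enough to prune. The following is a minimal sketch of that bookkeeping only, assuming the `kagents-daily-YYYYMMDD` naming used above. The function names and the cutoff argument are hypothetical, and the sketch deliberately does not call `gcloud`.

```shell
# Sketch only: helpers for a scheduled Filestore backup job.
# The kagents-daily-YYYYMMDD scheme comes from the command above; the
# function names and the retention cutoff are assumptions for illustration.

# Print the backup name for a YYYYMMDD stamp (defaults to today's date).
backup_name() {
  printf 'kagents-daily-%s\n' "${1:-$(date +%Y%m%d)}"
}

# Read backup names on stdin, one per line, and print those whose date
# stamp is strictly older than the cutoff given as $1 (YYYYMMDD).
expired_backups() {
  cutoff="$1"
  while IFS= read -r name; do
    case "$name" in
      kagents-daily-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
      *) continue ;;  # skip anything outside the naming scheme
    esac
    stamp="${name#kagents-daily-}"
    if [ "$stamp" -lt "$cutoff" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Feed `expired_backups` the list of existing backup names and pass each name it prints to your deletion step. With GNU `date`, a cutoff seven days back can be produced by `date -d '7 days ago' +%Y%m%d`; the seven-day window mirrors the EFS example and is an assumption for Filestore.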
+ +### Azure Files (AKS) + +Premium NFS shares support [Azure Backup](https://learn.microsoft.com/en-us/azure/backup/azure-file-share-backup-overview): + +```bash +az backup vault create -g -n kagents-vault --location +az backup protection enable-for-azurefileshare \ + --vault-name kagents-vault -g \ + --storage-account \ + --azure-file-share \ + --policy-name DefaultPolicy +``` + +The default policy is daily with 30-day retention. Override per the [Azure docs](https://learn.microsoft.com/en-us/azure/backup/manage-afs-backup). + +## Performance tuning + +The dominant workload is small synchronous writes (mailbox JSON updates) and small synchronous reads (mailbox polls). Raw throughput matters less than IOPS and metadata-op latency. + +### EFS + +- **Throughput mode**: `elastic` is the right default. Pay per byte, scale automatically. Switch to `provisioned` only if you measure consistent saturation in CloudWatch's `BurstCreditBalance` metric. +- **Performance mode**: `generalPurpose` for <7,000 file ops/sec total across all teams (the typical case). `maxIO` only if you exceed that; it adds 1-3ms latency per op which hurts mailbox round-trips. +- **Mount options**: defaults are fine. The CSI driver applies `nfsvers=4.1, rsize=1048576, wsize=1048576` by default. + +### Filestore + +- **Tier**: `standard` is HDD-backed and fine for mailbox-polling workloads. Move to `premium` only if you measure IOPS-bound saturation under load (rare with kagents). +- **Capacity scaling**: Filestore IOPS scale linearly with provisioned capacity. If a single shared instance is saturated by many teams, double the capacity rather than splitting into multiple instances. + +### Azure Files (Premium NFS) + +- **Mount option `nconnect=4`** is the single biggest performance win. Without it, expect 2-3x slower mailbox round-trips. Set it in the StorageClass. See the [AKS install guide](../install/aks.md#3-create-the-storageclass). 
+- **Provisioned IOPS**: Azure Files Premium ties baseline IOPS to provisioned size: 3,000 IOPS plus 1 IOPS per provisioned GiB. For a 100 GiB share, that is roughly 3,100 IOPS baseline, plus bursting. Raise capacity when you need more IOPS, even if you don't need the extra storage. + +## Monitoring storage health + +Use the Prometheus metrics the chart exposes (see the [Operations explanation](../../explanation/operations.md)) plus your cloud's native metrics: + +- **EFS**: `IOBytes`, `BurstCreditBalance`, `ClientConnections` in CloudWatch +- **Filestore**: `nfs/server/operation_count`, `nfs/server/free_bytes_percent` in Cloud Monitoring +- **Azure Files**: `Transactions`, `SuccessE2ELatency` in Azure Monitor + +A sudden spike in operation count without a corresponding rise in active teams usually indicates a stuck-poll loop in one team. Run `kubectl describe agentteam ` on the suspect team to investigate. + +## Where to look next + +- [Coordination protocol](../../explanation/coordination.md). What the storage is actually carrying +- [Set budget alerts](budget-alerts.md). Wiring cost overruns into your alert pipeline +- [Expose the dashboard](expose-dashboard.md). Visual storage-load view diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..cc46df3 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,90 @@ +--- +hide: + - navigation + - toc +--- + +# kagents + +**Run Claude Code Agent Teams as a Kubernetes operator.** + +[Get started in 5 minutes :material-rocket-launch:](tutorials/getting-started.md){ .md-button .md-button--primary } +[View on GitHub :material-github:](https://github.com/amcheste/claude-teams-operator){ .md-button } + +--- + +## Quick install + +```bash +helm install kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --create-namespace +``` + +## Why kagents + +
+ +- :material-protocol:{ .lg .middle } **Native protocol fidelity** + + --- + + Wraps Anthropic's file-based mailbox protocol exactly as designed. No custom RPC layer to maintain, no protocol translation, no behavior drift when Claude Code ships an update. + +- :material-account-group:{ .lg .middle } **Team as a first-class resource** + + --- + + One `AgentTeam` CRD declares roles, budget, quality gates, and coordination topology. `AgentTeamTemplate` lets you reuse common team patterns ("3-agent security review," "fullstack feature team") with one-line instantiation. + +- :material-kubernetes:{ .lg .middle } **K8s as coordination fabric** + + --- + + ServiceAccounts scope what each agent pod can touch. RWX PVCs hold the shared mailboxes. RBAC enforces per-agent capability boundaries. The cluster does the coordination work; kagents just wires it up. + +- :material-recycle-variant:{ .lg .middle } **Dogfooded** + + --- + + Built with the same Claude Code agent teams it operates. Every release is shipped by an agent team running in production. The recursion is intentional. + +
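
As a rough illustration of the "team as a first-class resource" idea, the sketch below writes a small `AgentTeam` manifest to a local file. Field names follow the v1alpha1 API reference; the team name, namespace, repository URL, Secret name, and prompts are placeholders invented for this example, not values shipped with the project.

```shell
# Sketch only: write an illustrative AgentTeam manifest to a local file.
# The name, namespace, URL, Secret, and prompts are hypothetical placeholders.
cat > agentteam-sketch.yaml <<'EOF'
apiVersion: claude.amcheste.io/v1alpha1
kind: AgentTeam
metadata:
  name: example-review-team
  namespace: dev-agents
spec:
  repository:
    url: https://example.com/org/repo.git
    branch: main
  auth:
    apiKeySecret: anthropic-api-key   # Secret holding ANTHROPIC_API_KEY
  lead:
    model: opus
    prompt: "Coordinate a review of the auth module."
  teammates:
    - name: reviewer
      model: sonnet
      prompt: "Review the auth module and report findings."
  lifecycle:
    timeout: 4h
    budgetLimit: "10.00"              # USD; the team is terminated past this
    onComplete: notify
  qualityGates:
    requireTests: true
EOF

# Apply it once the placeholders are replaced with real values:
# kubectl apply -f agentteam-sketch.yaml
```

Roles live under `lead` and `teammates`, the budget under `lifecycle.budgetLimit`, and quality gates under `qualityGates`, so the whole team is reviewable as one resource.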
+ +## What you'll find here + +
+ +- :material-school: **[Tutorials](tutorials/index.md)** + + Step-by-step walkthroughs from zero to a running AgentTeam. + +- :material-cog: **[How-to guides](how-to/index.md)** + + Recipes for specific operational tasks: install on a cloud, expose the dashboard, tune budgets. + +- :material-book-open-variant: **[Reference](reference/index.md)** + + CRD field reference, Helm values, CLI flags. + +- :material-lightbulb: **[Explanation](explanation/index.md)** + + How and why kagents works the way it does: the architecture and the design tradeoffs. + +
+ +--- + +
+ +- :fontawesome-brands-github:{ .lg .middle } **Source code** + + Apache 2.0. Issues, PRs, and Discussions welcome. + + [github.com/amcheste/claude-teams-operator](https://github.com/amcheste/claude-teams-operator) + +- :material-presentation:{ .lg .middle } **Talk** + + *Reconciling Agent Teams: A Kubernetes Operator for Claude Code*. KubeCon NA 2026 (submitted). + +
+ diff --git a/docs/reference/api/index.md b/docs/reference/api/index.md new file mode 100644 index 0000000..f9af454 --- /dev/null +++ b/docs/reference/api/index.md @@ -0,0 +1,665 @@ +# API Reference + +## Packages +- [claude.amcheste.io/v1alpha1](#claudeamchesteiov1alpha1) + + +## claude.amcheste.io/v1alpha1 + +Package v1alpha1 contains API Schema definitions for the claude v1alpha1 API group. + +### Resource Types +- [AgentTeam](#agentteam) +- [AgentTeamRun](#agentteamrun) +- [AgentTeamTemplate](#agentteamtemplate) + + + +#### AgentStatus + + + +AgentStatus reports a single agent's state. + + + +_Appears in:_ +- [AgentTeamStatus](#agentteamstatus) +- [TeammateStatus](#teammatestatus) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `podName` _string_ | PodName is the name of the agent's pod. | | | +| `phase` _string_ | Phase of this agent. | | Enum: [Pending Running Idle Completed Failed Waiting]
| + + +#### AgentTeam + + + +AgentTeam is the Schema for the agentteams API. + + + + + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `apiVersion` _string_ | `claude.amcheste.io/v1alpha1` | | | +| `kind` _string_ | `AgentTeam` | | | +| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | +| `spec` _[AgentTeamSpec](#agentteamspec)_ | | | | +| `status` _[AgentTeamStatus](#agentteamstatus)_ | | | | + + +#### AgentTeamRun + + + +AgentTeamRun is an instance of an AgentTeamTemplate applied to a specific repository. + + + + + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `apiVersion` _string_ | `claude.amcheste.io/v1alpha1` | | | +| `kind` _string_ | `AgentTeamRun` | | | +| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | +| `spec` _[AgentTeamRunSpec](#agentteamrunspec)_ | | | | +| `status` _[AgentTeamStatus](#agentteamstatus)_ | | | | + + +#### AgentTeamRunSpec + + + +AgentTeamRunSpec defines an instance of a template applied to a specific repo. + + + +_Appears in:_ +- [AgentTeamRun](#agentteamrun) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `templateRef` _[TemplateReference](#templatereference)_ | TemplateRef references the AgentTeamTemplate to instantiate. | | | +| `repository` _[RepositorySpec](#repositoryspec)_ | Repository configuration for this run (coding mode). | | Optional: \{\}
| +| `workspace` _[WorkspaceSpec](#workspacespec)_ | Workspace configures inputs/outputs for this run (Cowork mode). | | Optional: \{\}
| +| `auth` _[AuthSpec](#authspec)_ | Auth configures API authentication for this run. | | | +| `lead` _[LeadSpec](#leadspec)_ | Lead configures the team lead for this run. | | | +| `lifecycle` _[LifecycleSpec](#lifecyclespec)_ | Lifecycle overrides for this run. | | Optional: \{\}
| + + +#### AgentTeamSpec + + + +AgentTeamSpec defines the desired state of an AgentTeam. + + + +_Appears in:_ +- [AgentTeam](#agentteam) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `repository` _[RepositorySpec](#repositoryspec)_ | Repository configuration for the codebase agents will work on.
Use this for coding tasks. Optional when Workspace is set. | | Optional: \{\}
| +| `workspace` _[WorkspaceSpec](#workspacespec)_ | Workspace configures non-git inputs and outputs for Cowork teams.
Use this for knowledge-work tasks (documents, reports, email, etc.). | | Optional: \{\}
| +| `auth` _[AuthSpec](#authspec)_ | Auth configures how agents authenticate with the Anthropic API. | | | +| `lead` _[LeadSpec](#leadspec)_ | Lead configures the team lead agent. | | | +| `teammates` _[TeammateSpec](#teammatespec) array_ | Teammates defines the worker agents in the team. | | MaxItems: 16
MinItems: 1
| +| `coordination` _[CoordinationSpec](#coordinationspec)_ | Coordination configures how agents communicate. | | Optional: \{\}
| +| `lifecycle` _[LifecycleSpec](#lifecyclespec)_ | Lifecycle configures team runtime behavior and budget. | | Optional: \{\}
| +| `qualityGates` _[QualityGateSpec](#qualitygatespec)_ | QualityGates configures validation before marking team complete. | | Optional: \{\}
| +| `observability` _[ObservabilitySpec](#observabilityspec)_ | Observability configures metrics and notifications. | | Optional: \{\}
| + + +#### AgentTeamStatus + + + +AgentTeamStatus defines the observed state of an AgentTeam. + + + +_Appears in:_ +- [AgentTeam](#agentteam) +- [AgentTeamRun](#agentteamrun) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `phase` _string_ | Phase is the current lifecycle phase of the team. | | Enum: [Pending Initializing Running Completed Failed TimedOut BudgetExceeded]
| +| `startedAt` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#time-v1-meta)_ | StartedAt is when the team began execution. | | Optional: \{\}
| +| `completedAt` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#time-v1-meta)_ | CompletedAt is when the team finished execution. | | Optional: \{\}
| +| `totalTokensUsed` _integer_ | TotalTokensUsed is the estimated total tokens consumed. | | | +| `estimatedCost` _string_ | EstimatedCost is the estimated API cost in USD (e.g. "4.50"). | | | +| `ready` _string_ | Ready reports how many teammate pods are ready vs. declared, in the form
"running+completed/total" (e.g. "3/5"). Shown in `kubectl get` output. | | Optional: \{\}
| +| `lead` _[AgentStatus](#agentstatus)_ | Lead reports the team lead's status. | | Optional: \{\}
| +| `teammates` _[TeammateStatus](#teammatestatus) array_ | Teammates reports each teammate's status. | | Optional: \{\}
| +| `tasks` _[TaskSummary](#tasksummary)_ | Tasks reports aggregate task progress. | | Optional: \{\}
| +| `pullRequest` _[PullRequestStatus](#pullrequeststatus)_ | PullRequest reports PR creation status. | | Optional: \{\}
| +| `consolidatedBranch` _string_ | ConsolidatedBranch is the branch name pushed by OnComplete=push-branch.
Populated once the push-branch Job succeeds; OnComplete=create-pr reads
this as the PR head branch when set, in place of Spec.Repository.Branch. | | Optional: \{\}
| +| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions represent the latest available observations. | | Optional: \{\}
| + + +#### AgentTeamTemplate + + + +AgentTeamTemplate is a reusable team definition. + + + + + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `apiVersion` _string_ | `claude.amcheste.io/v1alpha1` | | | +| `kind` _string_ | `AgentTeamTemplate` | | | +| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | +| `spec` _[AgentTeamTemplateSpec](#agentteamtemplatespec)_ | | | | +| `status` _[AgentTeamTemplateStatus](#agentteamtemplatestatus)_ | | | | + + +#### AgentTeamTemplateSpec + + + +AgentTeamTemplateSpec defines a reusable team pattern. + + + +_Appears in:_ +- [AgentTeamTemplate](#agentteamtemplate) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `description` _string_ | Description explains the template's purpose. | | | +| `teammates` _[TeammateSpec](#teammatespec) array_ | Teammates defines the worker agents in the template. | | MaxItems: 16
MinItems: 1
| +| `coordination` _[CoordinationSpec](#coordinationspec)_ | Coordination configures how agents communicate. | | Optional: \{\}
| +| `lifecycle` _[LifecycleSpec](#lifecyclespec)_ | Lifecycle configures default runtime behavior. | | Optional: \{\}
| +| `qualityGates` _[QualityGateSpec](#qualitygatespec)_ | QualityGates configures default validation steps. | | Optional: \{\}
| + + +#### AgentTeamTemplateStatus + + + +AgentTeamTemplateStatus reports validation state for an AgentTeamTemplate. +The reconciler validates teammate references and writes a Ready condition; +AgentTeamRun controllers should refuse to instantiate templates where +Ready is false. + + + +_Appears in:_ +- [AgentTeamTemplate](#agentteamtemplate) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `ready` _boolean_ | Ready is true when the template has passed validation and is safe to
instantiate via an AgentTeamRun. | | Optional: \{\}
| +| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the latest validation state with structured reasons. | | Optional: \{\}
| + + +#### ApprovalGateSpec + + + +ApprovalGateSpec pauses execution before a named event until human approval is recorded. +Approval is granted by adding the annotation approved.claude.amcheste.io/{event}=true to the AgentTeam. + + + +_Appears in:_ +- [LifecycleSpec](#lifecyclespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `event` _string_ | Event is the gate identifier. Use "spawn-\{teammate-name\}" to gate spawning a specific teammate. | | | +| `channel` _string_ | Channel is how the approval request notification is sent. | none | Enum: [webhook none]
| +| `webhookUrl` _string_ | WebhookURL to POST when this gate is triggered (used when channel is "webhook"). | | Optional: \{\}
| + + +#### AuthSpec + + + +AuthSpec defines Anthropic API authentication. + + + +_Appears in:_ +- [AgentTeamRunSpec](#agentteamrunspec) +- [AgentTeamSpec](#agentteamspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `apiKeySecret` _string_ | APIKeySecret references a Secret containing ANTHROPIC_API_KEY. | | Optional: \{\}
| +| `oauthSecret` _string_ | OAuthSecret references a Secret containing OAuth tokens for subscription auth. | | Optional: \{\}
| + + +#### BeadsSpec + + + +BeadsSpec configures Beads integration. + + + +_Appears in:_ +- [CoordinationSpec](#coordinationspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `enabled` _boolean_ | Enabled turns on Beads tracking. | | | +| `doltServerService` _string_ | DoltServerService is the K8s service name for the Dolt SQL server. | | Optional: \{\}
| +| `doltServerPort` _integer_ | DoltServerPort is the port for the Dolt SQL server. | 3306 | | + + +#### CoordinationSpec + + + +CoordinationSpec configures inter-agent communication. + + + +_Appears in:_ +- [AgentTeamSpec](#agentteamspec) +- [AgentTeamTemplateSpec](#agentteamtemplatespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `mailboxBackend` _string_ | MailboxBackend determines how mailbox messages are transported. | shared-volume | Enum: [shared-volume redis nats]
| +| `taskBackend` _string_ | TaskBackend determines how the shared task list is stored. | shared-volume | Enum: [shared-volume beads]
| +| `beads` _[BeadsSpec](#beadsspec)_ | Beads configures optional Beads integration for persistent tracking. | | Optional: \{\}
| + + +#### LeadSpec + + + +LeadSpec defines the team lead configuration. + + + +_Appears in:_ +- [AgentTeamRunSpec](#agentteamrunspec) +- [AgentTeamSpec](#agentteamspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `model` _string_ | Model to use for the team lead. | opus | Enum: [opus sonnet haiku]
| +| `prompt` _string_ | Prompt is the initial instruction for the team lead. | | | +| `permissionMode` _string_ | PermissionMode controls how the lead handles permission requests. | auto-accept | Enum: [auto-accept plan default]
| +| `skills` _[SkillSpec](#skillspec) array_ | Skills to mount into .claude/skills/ for the lead agent. | | Optional: \{\}
| +| `mcpServers` _[MCPServerSpec](#mcpserverspec) array_ | MCPServers configures Model Context Protocol connections for the lead agent. | | Optional: \{\}
| +| `resources` _[ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#resourcerequirements-v1-core)_ | Resources defines compute resources for the lead pod. | | Optional: \{\}
| + + +#### LifecycleSpec + + + +LifecycleSpec controls team runtime behavior. + + + +_Appears in:_ +- [AgentTeamRunSpec](#agentteamrunspec) +- [AgentTeamSpec](#agentteamspec) +- [AgentTeamTemplateSpec](#agentteamtemplatespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `timeout` _string_ | Timeout is the maximum duration the team can run (e.g. "4h", "30m"). | 4h | | +| `budgetLimit` _string_ | BudgetLimit is the maximum API spend in USD before the team is terminated (e.g. "10.00"). | | Optional: \{\}
| +| `onComplete` _string_ | OnComplete determines what happens when the team finishes. | notify | Enum: [create-pr push-branch notify none]
| +| `pullRequest` _[PullRequestSpec](#pullrequestspec)_ | PullRequest configures PR creation when onComplete is "create-pr". | | Optional: \{\}
| +| `approvalGates` _[ApprovalGateSpec](#approvalgatespec) array_ | ApprovalGates pause execution before specified events until human approval is recorded.
Grant approval by annotating the AgentTeam: kubectl annotate agentteam approved.claude.amcheste.io/=true | | Optional: \{\}
| +| `maxRestarts` _integer_ | MaxRestarts bounds how many times each teammate pod may be re-spawned
after a Failed phase before the team itself is marked Failed. The lead
pod is not subject to this limit; a lead crash always fails the team. | 3 | Minimum: 0
Optional: \{\}
| +| `githubTokenSecret` _string_ | GitHubTokenSecret names a Secret in the team's namespace carrying a
GitHub token under the key GITHUB_TOKEN. Used by OnComplete=create-pr
(and OnComplete=push-branch, once implemented) to authenticate against
the GitHub REST API. | | Optional: \{\}
| +| `prTitleTemplate` _string_ | PRTitleTemplate overrides the title template used by OnComplete=create-pr.
Available variables: .TeamName, .Namespace. When empty, falls back to
Spec.Lifecycle.PullRequest.TitleTemplate, then to the default
"claude-teams: \{\{.TeamName\}\}". | | Optional: \{\}
| +| `gitCredentialsSecret` _string_ | GitCredentialsSecret names a Secret in the team's namespace carrying git
push credentials. The Secret must contain either 'ssh-privatekey' or
'token'. Used by OnComplete=push-branch (and OnComplete=create-pr when
push-branch runs ahead of it). Falls back to Spec.Repository.CredentialsSecret
when unset, so teams that already configured clone credentials with push
scope don't need to duplicate. | | Optional: \{\}
| +| `consolidatedBranchTemplate` _string_ | ConsolidatedBranchTemplate is a Go template rendered to produce the
branch name pushed by OnComplete=push-branch. Available variables:
.TeamName, .Namespace. When empty, defaults to "teams/\{\{.TeamName\}\}". | | Optional: \{\}
| + + +#### MCPServerSpec + + + +MCPServerSpec configures a Model Context Protocol server for an agent. + + + +_Appears in:_ +- [LeadSpec](#leadspec) +- [TeammateSpec](#teammatespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `name` _string_ | Name identifies this MCP server in the agent's config. | | | +| `url` _string_ | URL is the MCP server endpoint. | | | +| `credentialsSecret` _string_ | CredentialsSecret references a Secret containing an 'apiKey' key for bearer auth. | | Optional: \{\}
| + + +#### MetricsSpec + + + +MetricsSpec configures Prometheus metrics. + + + +_Appears in:_ +- [ObservabilitySpec](#observabilityspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `enabled` _boolean_ | Enabled turns on metrics exposition. | | | +| `port` _integer_ | Port for the metrics endpoint. | 9090 | | + + +#### ObservabilitySpec + + + +ObservabilitySpec configures monitoring and notifications. + + + +_Appears in:_ +- [AgentTeamSpec](#agentteamspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `metrics` _[MetricsSpec](#metricsspec)_ | Metrics configures Prometheus metrics exposition. | | Optional: \{\}
| +| `logLevel` _string_ | LogLevel controls operator log verbosity for this team. | info | Enum: [debug info warn error]
| +| `webhook` _[WebhookSpec](#webhookspec)_ | Webhook configures event notifications. | | Optional: \{\}
| + + +#### PullRequestSpec + + + +PullRequestSpec configures automatic PR creation. + + + +_Appears in:_ +- [LifecycleSpec](#lifecyclespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `targetBranch` _string_ | TargetBranch is the branch to open the PR against. | main | | +| `titleTemplate` _string_ | TitleTemplate is a Go template for the PR title.
Available variables: .TeamName, .Namespace | | | +| `reviewers` _string array_ | Reviewers to request on the PR. | | Optional: \{\}
| +| `labels` _string array_ | Labels to apply to the PR. | | Optional: \{\}
| + + +#### PullRequestStatus + + + +PullRequestStatus reports PR creation state. + + + +_Appears in:_ +- [AgentTeamStatus](#agentteamstatus) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `url` _string_ | | | | +| `state` _string_ | | | | + + +#### QualityGateSpec + + + +QualityGateSpec configures validation steps. + + + +_Appears in:_ +- [AgentTeamSpec](#agentteamspec) +- [AgentTeamTemplateSpec](#agentteamtemplatespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `requireTests` _boolean_ | RequireTests ensures tests pass before completion. | | | +| `requireLint` _boolean_ | RequireLint ensures linting passes before completion. | | | +| `validationScript` _string_ | ValidationScript is a custom script to run before marking complete. | | Optional: \{\}
| + + +#### RepositorySpec + + + +RepositorySpec defines the git repository configuration. + + + +_Appears in:_ +- [AgentTeamRunSpec](#agentteamrunspec) +- [AgentTeamSpec](#agentteamspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `url` _string_ | URL is the git clone URL. | | | +| `branch` _string_ | Branch to clone and work from. | main | | +| `worktreeStrategy` _string_ | WorktreeStrategy determines how git worktrees are managed. | per-teammate | Enum: [per-teammate shared]
| +| `credentialsSecret` _string_ | CredentialsSecret references a Secret containing git credentials.
The secret should contain either 'ssh-privatekey' or 'token'. | | Optional: \{\}
| + + +#### ScopeSpec + + + +ScopeSpec restricts file access for a teammate. + + + +_Appears in:_ +- [TeammateSpec](#teammatespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `includePaths` _string array_ | IncludePaths lists paths the teammate should focus on. | | Optional: \{\}
| +| `excludePaths` _string array_ | ExcludePaths lists paths the teammate should not modify. | | Optional: \{\}
| + + +#### SkillSource + + + +SkillSource identifies where to load a skill from. Exactly one field should be set. + + + +_Appears in:_ +- [SkillSpec](#skillspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `configMap` _string_ | ConfigMap references a ConfigMap in the same namespace.
Each key in the ConfigMap becomes a file in the skill directory. | | Optional: \{\}
| +| `oci` _string_ | OCI is an OCI artifact reference containing the skill files (e.g. "ghcr.io/org/skills/web-research:v1"). | | Optional: \{\}
| + + +#### SkillSpec + + + +SkillSpec defines a Claude Code skill to mount into an agent pod. + + + +_Appears in:_ +- [LeadSpec](#leadspec) +- [TeammateSpec](#teammatespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `name` _string_ | Name is the skill directory name under .claude/skills/. | | | +| `source` _[SkillSource](#skillsource)_ | Source identifies where to load the skill from. | | | + + +#### TaskSummary + + + +TaskSummary reports aggregate task progress. + + + +_Appears in:_ +- [AgentTeamStatus](#agentteamstatus) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `total` _integer_ | | | | +| `completed` _integer_ | | | | +| `inProgress` _integer_ | | | | +| `pending` _integer_ | | | | + + +#### TeammateSpec + + + +TeammateSpec defines a single teammate agent. + + + +_Appears in:_ +- [AgentTeamSpec](#agentteamspec) +- [AgentTeamTemplateSpec](#agentteamtemplatespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `name` _string_ | Name is the unique identifier for this teammate. | | Pattern: `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$`
| +| `model` _string_ | Model to use for this teammate. | sonnet | Enum: [opus sonnet haiku]
| +| `prompt` _string_ | Prompt is the spawn instruction for this teammate. | | | +| `scope` _[ScopeSpec](#scopespec)_ | Scope restricts which files this teammate can access. | | Optional: \{\}
| +| `dependsOn` _string array_ | DependsOn lists teammate names that must complete before this one starts. | | Optional: \{\}
| +| `skills` _[SkillSpec](#skillspec) array_ | Skills to mount into .claude/skills/ for this teammate. | | Optional: \{\}
| +| `mcpServers` _[MCPServerSpec](#mcpserverspec) array_ | MCPServers configures Model Context Protocol connections for this teammate. | | Optional: \{\}
| +| `resources` _[ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#resourcerequirements-v1-core)_ | Resources defines compute resources for this teammate's pod. | | Optional: \{\}
| + + +#### TeammateStatus + + + +TeammateStatus reports a teammate's state. + + + +_Appears in:_ +- [AgentTeamStatus](#agentteamstatus) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `podName` _string_ | PodName is the name of the agent's pod. | | | +| `phase` _string_ | Phase of this agent. | | Enum: [Pending Running Idle Completed Failed Waiting]
| +| `name` _string_ | Name matches the teammate's spec name. | | | +| `tasksCompleted` _integer_ | TasksCompleted is the number of tasks this teammate has finished. | | | +| `tasksClaimed` _integer_ | TasksClaimed is the number of tasks currently owned by this teammate. | | | +| `pendingApproval` _string_ | PendingApproval is the approval gate event this teammate is waiting on, if any. | | Optional: \{\}
| +| `restartCount` _integer_ | RestartCount is the number of times this teammate's pod has been
re-spawned after a Failed phase. The team is marked Failed when any
teammate's RestartCount reaches Spec.Lifecycle.MaxRestarts. | | Optional: \{\}
| + + +#### TemplateReference + + + +TemplateReference points to an AgentTeamTemplate. + + + +_Appears in:_ +- [AgentTeamRunSpec](#agentteamrunspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `name` _string_ | Name of the AgentTeamTemplate in the same namespace. | | | + + +#### WebhookSpec + + + +WebhookSpec configures event notifications. + + + +_Appears in:_ +- [ObservabilitySpec](#observabilityspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `url` _string_ | URL to POST events to. | | | +| `events` _string array_ | Events to send notifications for. | | MinItems: 1
| + + +#### WorkspaceInputSpec + + + +WorkspaceInputSpec defines a read-only input mounted into the agent pod. + + + +_Appears in:_ +- [WorkspaceSpec](#workspacespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `configMap` _string_ | ConfigMap references a ConfigMap to mount as a directory. | | Optional: \{\}
| +| `pvc` _string_ | PVC references an existing PersistentVolumeClaim to mount read-only. | | Optional: \{\}
| +| `mountPath` _string_ | MountPath is where to mount this input inside the container. | | | + + +#### WorkspaceOutputSpec + + + +WorkspaceOutputSpec defines the writable output volume for a Cowork team. + + + +_Appears in:_ +- [WorkspaceSpec](#workspacespec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `pvc` _string_ | PVC is the name of an existing PVC to use. If empty, the operator creates one named "\{team\}-output". | | Optional: \{\}
| +| `storageClass` _string_ | StorageClass for the auto-created PVC. Defaults to "nfs". | | Optional: \{\}
| +| `size` _string_ | Size of the auto-created PVC. | 5Gi | | +| `mountPath` _string_ | MountPath inside the container where the output volume is mounted. | /workspace/output | | + + +#### WorkspaceSpec + + + +WorkspaceSpec configures non-git inputs and outputs for Cowork teams. +Use this instead of (or alongside) Repository for knowledge-work tasks. + + + +_Appears in:_ +- [AgentTeamRunSpec](#agentteamrunspec) +- [AgentTeamSpec](#agentteamspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `inputs` _[WorkspaceInputSpec](#workspaceinputspec) array_ | Inputs are read-only volumes mounted into all agent pods. | | Optional: \{\}
| +| `output` _[WorkspaceOutputSpec](#workspaceoutputspec)_ | Output configures the shared writable output volume. | | Optional: \{\}
| + + diff --git a/docs/reference/index.md b/docs/reference/index.md new file mode 100644 index 0000000..887d193 --- /dev/null +++ b/docs/reference/index.md @@ -0,0 +1,17 @@ +# Reference + +The lookup tables. Every CRD field, every Helm value, every CLI flag, with no narrative wrapping. + +## Pages + +- **[API reference](api/index.md)**. Auto-generated field-by-field detail for `AgentTeam`, `AgentTeamTemplate`, and `AgentTeamRun`. Regenerated from the kubebuilder markers in `api/v1alpha1/` on every site build via `make docs-api`. + +## Coming next + +- **Helm chart values**. Every chart value documented with defaults and production override recipes (will migrate from the existing in-repo [`docs/helm-values.md`](https://github.com/amcheste/claude-teams-operator/blob/main/docs/helm-values.md)) + +## Looking for a tutorial or recipe? + +- **Step-by-step learning?** See the [Tutorials](../tutorials/index.md). +- **Solving a specific operational task?** See the [How-to guides](../how-to/index.md). +- **Want to understand the *why*?** See the [Explanation](../explanation/index.md). diff --git a/docs/requirements.txt b/docs/requirements.txt new file mode 100644 index 0000000..6db7b52 --- /dev/null +++ b/docs/requirements.txt @@ -0,0 +1,4 @@ +# Pinned for reproducible site builds. Bump via Dependabot or by hand +# after testing locally with `mkdocs build --strict`. +mkdocs-material==9.5.50 +mkdocs-git-revision-date-localized-plugin==1.2.7 diff --git a/docs/tutorials/getting-started.md b/docs/tutorials/getting-started.md new file mode 100644 index 0000000..b0e533d --- /dev/null +++ b/docs/tutorials/getting-started.md @@ -0,0 +1,205 @@ +# Getting started + +This tutorial walks you from a fresh laptop to a running AgentTeam in about 15 minutes. 
By the end you'll have: + +- A local Kubernetes cluster with kagents installed +- A small Cowork-mode AgentTeam that researches a topic and writes a summary file +- The know-how to inspect what's happening with `kubectl` and the dashboard + +You don't need any cloud accounts or external services. Everything runs on your laptop. + +## Prerequisites + +| Tool | Version | Why | +|------|---------|-----| +| [Docker](https://docs.docker.com/get-docker/) | latest | Runs the Kind cluster | +| [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) | 0.25+ | Single-node Kubernetes for dev | +| [kubectl](https://kubernetes.io/docs/tasks/tools/) | 1.28+ | Interact with the cluster | +| [helm](https://helm.sh/docs/intro/install/) | 3.14+ | Install the operator chart | +| [Anthropic API key](https://console.anthropic.com/) | (any) | Required for agents to actually call Claude | + +You'll also need the kagents repo cloned locally so you can use the included `make kind-create` setup script (which provisions a Kind cluster with the NFS-style RWX storage the operator needs): + +```bash +git clone https://github.com/amcheste/claude-teams-operator.git +cd claude-teams-operator +``` + +## 1. Stand up a local cluster + +```bash +make kind-create +``` + +This creates a Kind cluster named `claude-teams` with a local-path storage class aliased as `nfs`. On a single-node cluster every pod runs on the same node, so a hostPath volume is visible to all pods. That's our RWX-equivalent for laptop testing. + +!!! note "Production deployments need a real RWX backend" + For real multi-node clusters you'll need NFS, EFS, Filestore, or Azure Files. The Kind setup is a single-node convenience, not the production story. See the *Concept: file-based mailbox protocol* page (coming in v0.7.0) for why. + +Verify the cluster is up: + +```bash +kubectl cluster-info --context kind-claude-teams +``` + +## 2. 
Install kagents + +```bash +helm install kagents \ + oci://ghcr.io/amcheste/charts/claude-teams-operator \ + --namespace claude-teams-system --create-namespace +``` + +Wait for the operator pod to be ready: + +```bash +kubectl rollout status deployment/kagents-controller-manager \ + --namespace claude-teams-system --timeout=120s +``` + +You should see `deployment "kagents-controller-manager" successfully rolled out`. + +## 3. Create your Anthropic API key Secret + +The operator reads this Secret from the namespace where your team runs (not the operator's namespace). Create a namespace for the team and put the key there: + +```bash +kubectl create namespace dev-agents +kubectl create secret generic anthropic-api-key \ + --namespace dev-agents \ + --from-literal=ANTHROPIC_API_KEY=sk-ant-... +``` + +Replace `sk-ant-...` with your actual key from [console.anthropic.com](https://console.anthropic.com/). + +!!! warning "Don't commit your API key" + The Secret stays in the cluster. Never paste a real key into a manifest you'll commit. Use `kubectl create secret` from your shell as above, or sealed-secrets / external-secrets for production. + +## 4. Apply your first AgentTeam + +This is a small Cowork-mode team. No git repo, just an output volume. The lead coordinates a single writer agent that produces a Markdown file. + +```yaml title="hello-team.yaml" +apiVersion: claude.amcheste.io/v1alpha1 +kind: AgentTeam +metadata: + name: hello-team + namespace: dev-agents +spec: + workspace: + output: + mountPath: /workspace/output + size: 1Gi + + auth: + apiKeySecret: anthropic-api-key + + lead: + model: opus + prompt: | + Coordinate a one-person team that writes a 200-word overview of + Kubernetes operators to /workspace/output/overview.md. Make sure + the file is written before declaring the work complete. + + teammates: + - name: writer + model: sonnet + prompt: | + Write a 200-word overview of Kubernetes operators to + /workspace/output/overview.md. 
Keep it accessible to a reader + who has never used Kubernetes before. Cover: what an operator + is, what problem it solves, and one concrete example. + + lifecycle: + timeout: 30m + budgetLimit: "1.00" +``` + +Apply it: + +```bash +kubectl apply -f hello-team.yaml +``` + +## 5. Watch the team run + +```bash +kubectl get agentteams -n dev-agents -w +``` + +You'll see the team progress through phases: + +| Phase | Meaning | +|-------|---------| +| `Pending` | Operator received the spec; PVCs being provisioned | +| `Initializing` | Init Job running (sets up worktrees / output volume) | +| `Running` | Agent pods are up and working | +| `Completed` | The lead reported the work done | +| `Failed` / `BudgetExceeded` / `TimedOut` | Terminal failure states | + +A 200-word write usually finishes in 1–3 minutes. + +When it reaches `Completed`, press Ctrl-C to stop watching. + +## 6. Inspect what happened + +The `describe` view shows everything in one place: + +```bash +kubectl describe agentteam hello-team -n dev-agents +``` + +You'll see: + +- The `Status` block with phase, ready count, estimated cost +- A `Lead` and `Teammates` section with each agent's pod status +- Recent `Events` from the operator at every phase transition + +To see the actual file the team produced, exec into the writer pod and read it: + +```bash +kubectl exec -n dev-agents hello-team-writer -- cat /workspace/output/overview.md +``` + +## 7. Clean up + +Delete the team and the namespace: + +```bash +kubectl delete agentteam hello-team -n dev-agents +kubectl delete namespace dev-agents +``` + +The operator will tear down all the team's pods, PVCs, and per-agent ServiceAccounts via owner references. 
To uninstall kagents itself: + +```bash +helm uninstall kagents -n claude-teams-system +kubectl delete namespace claude-teams-system +``` + +To tear down the whole Kind cluster: + +```bash +make kind-delete +``` + +## What you just did + +A Kubernetes operator just orchestrated two Claude Code instances that communicate via a shared filesystem to produce real output, with K8s primitives doing the coordination work: an RWX PVC for the mailbox, ServiceAccounts for per-agent identity, and owner references for cleanup. No custom protocol, no orchestrator service, no daemon outside the cluster. + +## Where to go next + +- **[How-to guides](../how-to/index.md)**. Install on a real cloud, expose the dashboard, set budget alerts +- **[Reference](../reference/index.md)**. Every CRD field and Helm value documented +- **[Explanation](../explanation/index.md)**. How the file-based mailbox protocol actually works under the hood + +## Common errors + +??? warning "`PVCs stuck in Pending`" + The operator requires a ReadWriteMany-capable StorageClass. On a Kind cluster, `make kind-create` sets one up under the alias `nfs`. If you're using your own cluster, run `kubectl get sc` and confirm that a StorageClass named `nfs` exists (or pass `--set storage.storageClassName=` when installing the chart). + +??? warning "`Pod stuck in CrashLoopBackOff`" + Check the agent pod logs: `kubectl logs -n dev-agents hello-team-writer`. The most common cause is a missing or invalid Anthropic API key. Re-create the Secret with `kubectl create secret generic anthropic-api-key --namespace dev-agents --from-literal=ANTHROPIC_API_KEY=... --dry-run=client -o yaml | kubectl apply -f -`. + +??? warning "`AgentTeam stuck in Initializing`" + The init Job may have failed. Inspect it with `kubectl get jobs -n dev-agents` and `kubectl logs -n dev-agents job/`. Most often this is a permission issue with the StorageClass. 
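The phase table in step 5 of the tutorial can also be used from a script instead of watching `kubectl get agentteams -w` by hand. The following is a minimal sketch, not part of the project: the function names `is_terminal_phase` and `wait_for_team` are hypothetical, and the `.status.phase` JSONPath is an assumption about where the AgentTeam status stores its phase. The phase names themselves are taken from the tutorial's table.

```bash
#!/usr/bin/env bash
# Sketch only. Phase names come from the tutorial's phase table; the
# `.status.phase` path is an assumption about the AgentTeam status layout.

# Returns 0 when the given phase is one of the terminal phases
# (Completed, Failed, BudgetExceeded, TimedOut), 1 otherwise.
is_terminal_phase() {
  case "$1" in
    Completed|Failed|BudgetExceeded|TimedOut) return 0 ;;
    *) return 1 ;;
  esac
}

# Polls the team's phase every 5 seconds until it is terminal.
# Exits 0 only if the final phase is Completed.
# Hypothetical helper: it has no overall timeout of its own.
wait_for_team() {
  local name="$1" namespace="$2" phase
  while true; do
    phase="$(kubectl get agentteam "$name" -n "$namespace" \
      -o jsonpath='{.status.phase}')"
    echo "phase: ${phase:-unknown}"
    if is_terminal_phase "$phase"; then
      break
    fi
    sleep 5
  done
  [ "$phase" = "Completed" ]
}
```

With the functions sourced into a shell, `wait_for_team hello-team dev-agents` would block until the tutorial's team finishes and return non-zero on any of the failure phases. Because the loop never gives up on its own, it may be worth wrapping the call in `timeout` when using it in CI.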
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md new file mode 100644 index 0000000..b076289 --- /dev/null +++ b/docs/tutorials/index.md @@ -0,0 +1,15 @@ +# Tutorials + +Step-by-step lessons that take you from zero to a working AgentTeam. Read these top-to-bottom. They assume you're new to the project. + +## Available tutorials + +- **[Getting started](getting-started.md)**. Install kagents on a Kind cluster and run your first AgentTeam end-to-end. ~15 minutes. + +More tutorials will be added as the project matures. Have a use case you'd like a tutorial for, such as a security review team, a doc-generation team, or multi-cluster fan-out? [Open a Discussion](https://github.com/amcheste/claude-teams-operator/discussions/categories/ideas) and tell us about it. + +## Looking for something else? + +- **Solving a specific problem?** Try the [how-to guides](../how-to/index.md). +- **Looking up a CRD field or Helm value?** See the [reference](../reference/index.md). +- **Wondering why something works the way it does?** See the [explanation](../explanation/index.md) section. 
diff --git a/go.mod b/go.mod index fcb2e83..ad9bf87 100644 --- a/go.mod +++ b/go.mod @@ -1,17 +1,17 @@ module github.com/amcheste/claude-teams-operator -go 1.25.0 +go 1.26.0 require ( github.com/go-logr/logr v1.4.3 - github.com/onsi/ginkgo/v2 v2.28.2 - github.com/onsi/gomega v1.39.1 + github.com/onsi/ginkgo/v2 v2.28.3 + github.com/onsi/gomega v1.40.0 github.com/prometheus/client_golang v1.23.2 github.com/stretchr/testify v1.11.1 - k8s.io/api v0.35.4 - k8s.io/apimachinery v0.35.4 - k8s.io/client-go v0.35.4 - sigs.k8s.io/controller-runtime v0.23.3 + k8s.io/api v0.36.0 + k8s.io/apimachinery v0.36.0 + k8s.io/client-go v0.36.0 + sigs.k8s.io/controller-runtime v0.24.0 ) require ( @@ -19,7 +19,7 @@ require ( github.com/beorn7/perks v1.0.1 // indirect github.com/cespare/xxhash/v2 v2.3.0 // indirect github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect - github.com/emicklei/go-restful/v3 v3.12.2 // indirect + github.com/emicklei/go-restful/v3 v3.13.0 // indirect github.com/evanphx/json-patch/v5 v5.9.11 // indirect github.com/fsnotify/fsnotify v1.9.0 // indirect github.com/fxamacker/cbor/v2 v2.9.0 // indirect @@ -28,10 +28,9 @@ require ( github.com/go-openapi/jsonreference v0.20.2 // indirect github.com/go-openapi/swag v0.23.0 // indirect github.com/go-task/slim-sprig/v3 v3.0.0 // indirect - github.com/google/btree v1.1.3 // indirect github.com/google/gnostic-models v0.7.0 // indirect github.com/google/go-cmp v0.7.0 // indirect - github.com/google/pprof v0.0.0-20260115054156-294ebfa9ad83 // indirect + github.com/google/pprof v0.0.0-20260402051712-545e8a4df936 // indirect github.com/google/uuid v1.6.0 // indirect github.com/josharian/intern v1.0.0 // indirect github.com/json-iterator/go v1.1.12 // indirect @@ -42,34 +41,34 @@ require ( github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect github.com/prometheus/client_model v0.6.2 // indirect - 
github.com/prometheus/common v0.66.1 // indirect - github.com/prometheus/procfs v0.16.1 // indirect + github.com/prometheus/common v0.67.5 // indirect + github.com/prometheus/procfs v0.19.2 // indirect github.com/spf13/pflag v1.0.9 // indirect github.com/x448/float16 v0.8.4 // indirect go.uber.org/multierr v1.11.0 // indirect - go.uber.org/zap v1.27.0 // indirect + go.uber.org/zap v1.27.1 // indirect go.yaml.in/yaml/v2 v2.4.3 // indirect go.yaml.in/yaml/v3 v3.0.4 // indirect - golang.org/x/mod v0.32.0 // indirect - golang.org/x/net v0.49.0 // indirect - golang.org/x/oauth2 v0.30.0 // indirect - golang.org/x/sync v0.19.0 // indirect - golang.org/x/sys v0.40.0 // indirect - golang.org/x/term v0.39.0 // indirect - golang.org/x/text v0.33.0 // indirect - golang.org/x/time v0.9.0 // indirect - golang.org/x/tools v0.41.0 // indirect + golang.org/x/mod v0.35.0 // indirect + golang.org/x/net v0.53.0 // indirect + golang.org/x/oauth2 v0.34.0 // indirect + golang.org/x/sync v0.20.0 // indirect + golang.org/x/sys v0.43.0 // indirect + golang.org/x/term v0.42.0 // indirect + golang.org/x/text v0.36.0 // indirect + golang.org/x/time v0.14.0 // indirect + golang.org/x/tools v0.44.0 // indirect gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect - google.golang.org/protobuf v1.36.8 // indirect + google.golang.org/protobuf v1.36.12-0.20260120151049-f2248ac996af // indirect gopkg.in/evanphx/json-patch.v4 v4.13.0 // indirect gopkg.in/inf.v0 v0.9.1 // indirect gopkg.in/yaml.v3 v3.0.1 // indirect - k8s.io/apiextensions-apiserver v0.35.3 // indirect - k8s.io/klog/v2 v2.130.1 // indirect - k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 // indirect - k8s.io/utils v0.0.0-20251002143259-bc988d571ff4 // indirect + k8s.io/apiextensions-apiserver v0.36.0 // indirect + k8s.io/klog/v2 v2.140.0 // indirect + k8s.io/kube-openapi v0.0.0-20260317180543-43fb72c5454a // indirect + k8s.io/utils v0.0.0-20260210185600-b8788abfbbc2 // indirect sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // 
indirect sigs.k8s.io/randfill v1.0.0 // indirect - sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482 // indirect + sigs.k8s.io/structured-merge-diff/v6 v6.3.2 // indirect sigs.k8s.io/yaml v1.6.0 // indirect ) diff --git a/go.sum b/go.sum index 86edcb6..c98c0bc 100644 --- a/go.sum +++ b/go.sum @@ -9,8 +9,8 @@ github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSs github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM= github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= -github.com/emicklei/go-restful/v3 v3.12.2 h1:DhwDP0vY3k8ZzE0RunuJy8GhNpPL6zqLkDf9B/a0/xU= -github.com/emicklei/go-restful/v3 v3.12.2/go.mod h1:6n3XBCmQQb25CM2LCACGz8ukIrRry+4bhvbpWn3mrbc= +github.com/emicklei/go-restful/v3 v3.13.0 h1:C4Bl2xDndpU6nJ4bc1jXd+uTmYPVUwkD6bFY/oTyCes= +github.com/emicklei/go-restful/v3 v3.13.0/go.mod h1:6n3XBCmQQb25CM2LCACGz8ukIrRry+4bhvbpWn3mrbc= github.com/evanphx/json-patch v0.5.2 h1:xVCHIVMUu1wtM/VkR9jVZ45N3FhZfYMMYGorLCR8P3k= github.com/evanphx/json-patch v0.5.2/go.mod h1:ZWS5hhDbVDyob71nXKNL0+PWn6ToqBHMikGIFbs31qQ= github.com/evanphx/json-patch/v5 v5.9.11 h1:/8HVnzMq13/3x9TPvjG08wUGqBTmZBsCWzjTM0wiaDU= @@ -41,8 +41,6 @@ github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1v github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= github.com/goccy/go-yaml v1.18.0 h1:8W7wMFS12Pcas7KU+VVkaiCng+kG8QiFeFwzFb+rwuw= github.com/goccy/go-yaml v1.18.0/go.mod h1:XBurs7gK8ATbW4ZPGKgcbrY1Br56PdM69F7LkFRi1kA= -github.com/google/btree v1.1.3 h1:CVpQJjYgC4VbzxeGVHfvZrv1ctoYCAI8vbl07Fcxlyg= -github.com/google/btree v1.1.3/go.mod h1:qOPhT0dTNdNzV6Z/lhRX0YXUafgPLFUh+gZMl761Gm4= github.com/google/gnostic-models v0.7.0 h1:qwTtogB15McXDaNqTZdzPJRHvaVJlAl+HVQnLmJEJxo= 
github.com/google/gnostic-models v0.7.0/go.mod h1:whL5G0m6dmc5cPxKc5bdKdEN3UjI7OUGxBlw57miDrQ= github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8= @@ -50,8 +48,8 @@ github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg= github.com/google/gofuzz v1.2.0 h1:xRy4A+RhZaiKjJ1bPfwQ8sedCA+YS2YcCHW6ec7JMi0= github.com/google/gofuzz v1.2.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg= -github.com/google/pprof v0.0.0-20260115054156-294ebfa9ad83 h1:z2ogiKUYzX5Is6zr/vP9vJGqPwcdqsWjOt+V8J7+bTc= -github.com/google/pprof v0.0.0-20260115054156-294ebfa9ad83/go.mod h1:MxpfABSjhmINe3F1It9d+8exIHFvUqtLIRCdOGNXqiI= +github.com/google/pprof v0.0.0-20260402051712-545e8a4df936 h1:EwtI+Al+DeppwYX2oXJCETMO23COyaKGP6fHVpkpWpg= +github.com/google/pprof v0.0.0-20260402051712-545e8a4df936/go.mod h1:MxpfABSjhmINe3F1It9d+8exIHFvUqtLIRCdOGNXqiI= github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY= @@ -85,10 +83,10 @@ github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee h1:W5t00kpgFd github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk= github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA= github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ= -github.com/onsi/ginkgo/v2 v2.28.2 h1:DTrMfpqxiNUyQ3Y0zhn1n3cOO2euFgQPYIpkWwxVFps= -github.com/onsi/ginkgo/v2 v2.28.2/go.mod h1:CLtbVInNckU3/+gC8LzkGUb9oF+e8W8TdUsxPwvdOgE= -github.com/onsi/gomega v1.39.1 h1:1IJLAad4zjPn2PsnhH70V4DKRFlrCzGBNrNaru+Vf28= -github.com/onsi/gomega v1.39.1/go.mod 
h1:hL6yVALoTOxeWudERyfppUcZXjMwIMLnuSfruD2lcfg= +github.com/onsi/ginkgo/v2 v2.28.3 h1:4JvMdwtFU0imd8fHx25OJXoDMRexnf8v5NHKYSTTji4= +github.com/onsi/ginkgo/v2 v2.28.3/go.mod h1:+aXOY+vzZ5mu2iI2HpTZUPmM//oQfsNFX6gU9kNcA44= +github.com/onsi/gomega v1.40.0 h1:Vtol0e1MghCD2ZVIilPDIg44XSL9l2QAn8ZNaljWcJc= +github.com/onsi/gomega v1.40.0/go.mod h1:M/Uqpu/8qTjtzCLUA2zJHX9Iilrau25x1PdoSRbWh5A= github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4= github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0= github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= @@ -98,10 +96,10 @@ github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg= github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNwqPLxwZyk= github.com/prometheus/client_model v0.6.2/go.mod h1:y3m2F6Gdpfy6Ut/GBsUqTWZqCUvMVzSfMLjcu6wAwpE= -github.com/prometheus/common v0.66.1 h1:h5E0h5/Y8niHc5DlaLlWLArTQI7tMrsfQjHV+d9ZoGs= -github.com/prometheus/common v0.66.1/go.mod h1:gcaUsgf3KfRSwHY4dIMXLPV0K/Wg1oZ8+SbZk/HH/dA= -github.com/prometheus/procfs v0.16.1 h1:hZ15bTNuirocR6u0JZ6BAHHmwS1p8B4P6MRqxtzMyRg= -github.com/prometheus/procfs v0.16.1/go.mod h1:teAbpZRB1iIAJYREa1LsoWUXykVXA1KlTmWl8x/U+Is= +github.com/prometheus/common v0.67.5 h1:pIgK94WWlQt1WLwAC5j2ynLaBRDiinoAb86HZHTUGI4= +github.com/prometheus/common v0.67.5/go.mod h1:SjE/0MzDEEAyrdr5Gqc6G+sXI67maCxzaT3A2+HqjUw= +github.com/prometheus/procfs v0.19.2 h1:zUMhqEW66Ex7OXIiDkll3tl9a1ZdilUOd/F6ZXw4Vws= +github.com/prometheus/procfs v0.19.2/go.mod h1:M0aotyiemPhBCM0z5w87kL22CxfcH05ZpYlu+b4J7mw= github.com/rogpeppe/go-internal v1.14.1 h1:UQB4HGPB6osV0SQTLymcB4TgvyWu6ZyliaW0tI/otEQ= github.com/rogpeppe/go-internal v1.14.1/go.mod h1:MaRKkUm5W0goXpeCfT7UZI6fk/L7L7so1lCWt35ZSgc= github.com/spf13/pflag v1.0.9 
h1:9exaQaMOCwffKiiiYk6/BndUBv+iRViNW+4lEMi0PvY= @@ -131,34 +129,34 @@ go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto= go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE= go.uber.org/multierr v1.11.0 h1:blXXJkSxSSfBVBlC76pxqeO+LN3aDfLQo+309xJstO0= go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y= -go.uber.org/zap v1.27.0 h1:aJMhYGrd5QSmlpLMr2MftRKl7t8J8PTZPA732ud/XR8= -go.uber.org/zap v1.27.0/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E= +go.uber.org/zap v1.27.1 h1:08RqriUEv8+ArZRYSTXy1LeBScaMpVSTBhCeaZYfMYc= +go.uber.org/zap v1.27.1/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E= go.yaml.in/yaml/v2 v2.4.3 h1:6gvOSjQoTB3vt1l+CU+tSyi/HOjfOjRLJ4YwYZGwRO0= go.yaml.in/yaml/v2 v2.4.3/go.mod h1:zSxWcmIDjOzPXpjlTTbAsKokqkDNAVtZO0WOMiT90s8= go.yaml.in/yaml/v3 v3.0.4 h1:tfq32ie2Jv2UxXFdLJdh3jXuOzWiL1fo0bu/FbuKpbc= go.yaml.in/yaml/v3 v3.0.4/go.mod h1:DhzuOOF2ATzADvBadXxruRBLzYTpT36CKvDb3+aBEFg= -golang.org/x/mod v0.32.0 h1:9F4d3PHLljb6x//jOyokMv3eX+YDeepZSEo3mFJy93c= -golang.org/x/mod v0.32.0/go.mod h1:SgipZ/3h2Ci89DlEtEXWUk/HteuRin+HHhN+WbNhguU= -golang.org/x/net v0.49.0 h1:eeHFmOGUTtaaPSGNmjBKpbng9MulQsJURQUAfUwY++o= -golang.org/x/net v0.49.0/go.mod h1:/ysNB2EvaqvesRkuLAyjI1ycPZlQHM3q01F02UY/MV8= -golang.org/x/oauth2 v0.30.0 h1:dnDm7JmhM45NNpd8FDDeLhK6FwqbOf4MLCM9zb1BOHI= -golang.org/x/oauth2 v0.30.0/go.mod h1:B++QgG3ZKulg6sRPGD/mqlHQs5rB3Ml9erfeDY7xKlU= -golang.org/x/sync v0.19.0 h1:vV+1eWNmZ5geRlYjzm2adRgW2/mcpevXNg50YZtPCE4= -golang.org/x/sync v0.19.0/go.mod h1:9KTHXmSnoGruLpwFjVSX0lNNA75CykiMECbovNTZqGI= -golang.org/x/sys v0.40.0 h1:DBZZqJ2Rkml6QMQsZywtnjnnGvHza6BTfYFWY9kjEWQ= -golang.org/x/sys v0.40.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks= -golang.org/x/term v0.39.0 h1:RclSuaJf32jOqZz74CkPA9qFuVTX7vhLlpfj/IGWlqY= -golang.org/x/term v0.39.0/go.mod h1:yxzUCTP/U+FzoxfdKmLaA0RV1WgE0VY7hXBwKtY/4ww= -golang.org/x/text v0.33.0 
h1:B3njUFyqtHDUI5jMn1YIr5B0IE2U0qck04r6d4KPAxE= -golang.org/x/text v0.33.0/go.mod h1:LuMebE6+rBincTi9+xWTY8TztLzKHc/9C1uBCG27+q8= -golang.org/x/time v0.9.0 h1:EsRrnYcQiGH+5FfbgvV4AP7qEZstoyrHB0DzarOQ4ZY= -golang.org/x/time v0.9.0/go.mod h1:3BpzKBy/shNhVucY/MWOyx10tF3SFh9QdLuxbVysPQM= -golang.org/x/tools v0.41.0 h1:a9b8iMweWG+S0OBnlU36rzLp20z1Rp10w+IY2czHTQc= -golang.org/x/tools v0.41.0/go.mod h1:XSY6eDqxVNiYgezAVqqCeihT4j1U2CCsqvH3WhQpnlg= +golang.org/x/mod v0.35.0 h1:Ww1D637e6Pg+Zb2KrWfHQUnH2dQRLBQyAtpr/haaJeM= +golang.org/x/mod v0.35.0/go.mod h1:+GwiRhIInF8wPm+4AoT6L0FA1QWAad3OMdTRx4tFYlU= +golang.org/x/net v0.53.0 h1:d+qAbo5L0orcWAr0a9JweQpjXF19LMXJE8Ey7hwOdUA= +golang.org/x/net v0.53.0/go.mod h1:JvMuJH7rrdiCfbeHoo3fCQU24Lf5JJwT9W3sJFulfgs= +golang.org/x/oauth2 v0.34.0 h1:hqK/t4AKgbqWkdkcAeI8XLmbK+4m4G5YeQRrmiotGlw= +golang.org/x/oauth2 v0.34.0/go.mod h1:lzm5WQJQwKZ3nwavOZ3IS5Aulzxi68dUSgRHujetwEA= +golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4= +golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0= +golang.org/x/sys v0.43.0 h1:Rlag2XtaFTxp19wS8MXlJwTvoh8ArU6ezoyFsMyCTNI= +golang.org/x/sys v0.43.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw= +golang.org/x/term v0.42.0 h1:UiKe+zDFmJobeJ5ggPwOshJIVt6/Ft0rcfrXZDLWAWY= +golang.org/x/term v0.42.0/go.mod h1:Dq/D+snpsbazcBG5+F9Q1n2rXV8Ma+71xEjTRufARgY= +golang.org/x/text v0.36.0 h1:JfKh3XmcRPqZPKevfXVpI1wXPTqbkE5f7JA92a55Yxg= +golang.org/x/text v0.36.0/go.mod h1:NIdBknypM8iqVmPiuco0Dh6P5Jcdk8lJL0CUebqK164= +golang.org/x/time v0.14.0 h1:MRx4UaLrDotUKUdCIqzPC48t1Y9hANFKIRpNx+Te8PI= +golang.org/x/time v0.14.0/go.mod h1:eL/Oa2bBBK0TkX57Fyni+NgnyQQN4LitPmob2Hjnqw4= +golang.org/x/tools v0.44.0 h1:UP4ajHPIcuMjT1GqzDWRlalUEoY+uzoZKnhOjbIPD2c= +golang.org/x/tools v0.44.0/go.mod h1:KA0AfVErSdxRZIsOVipbv3rQhVXTnlU6UhKxHd1seDI= gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw= gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod 
h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY=
-google.golang.org/protobuf v1.36.8 h1:xHScyCOEuuwZEc6UtSOvPbAT4zRh0xcNRYekJwfqyMc=
-google.golang.org/protobuf v1.36.8/go.mod h1:fuxRtAxBytpl4zzqUh6/eyUujkJdNiuEkXntxiD/uRU=
+google.golang.org/protobuf v1.36.12-0.20260120151049-f2248ac996af h1:+5/Sw3GsDNlEmu7TfklWKPdQ0Ykja5VEmq2i817+jbI=
+google.golang.org/protobuf v1.36.12-0.20260120151049-f2248ac996af/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
 gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
@@ -169,27 +167,27 @@ gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw=
 gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
 gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
 gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
-k8s.io/api v0.35.4 h1:P7nFYKl5vo9AGUp1Z+Pmd3p2tA7bX2wbFWCvDeRv988=
-k8s.io/api v0.35.4/go.mod h1:yl4lqySWOgYJJf9RERXKUwE9g2y+CkuwG+xmcOK8wXU=
-k8s.io/apiextensions-apiserver v0.35.3 h1:2fQUhEO7P17sijylbdwt0nBdXP0TvHrHj0KeqHD8FiU=
-k8s.io/apiextensions-apiserver v0.35.3/go.mod h1:tK4Kz58ykRpwAEkXUb634HD1ZAegEElktz/B3jgETd8=
-k8s.io/apimachinery v0.35.4 h1:xtdom9RG7e+yDp71uoXoJDWEE2eOiHgeO4GdBzwWpds=
-k8s.io/apimachinery v0.35.4/go.mod h1:NNi1taPOpep0jOj+oRha3mBJPqvi0hGdaV8TCqGQ+cc=
-k8s.io/client-go v0.35.4 h1:DN6fyaGuzK64UvnKO5fOA6ymSjvfGAnCAHAR0C66kD8=
-k8s.io/client-go v0.35.4/go.mod h1:2Pg9WpsS4NeOpoYTfHHfMxBG8zFMSAUi4O/qoiJC3nY=
-k8s.io/klog/v2 v2.130.1 h1:n9Xl7H1Xvksem4KFG4PYbdQCQxqc/tTUyrgXaOhHSzk=
-k8s.io/klog/v2 v2.130.1/go.mod h1:3Jpz1GvMt720eyJH1ckRHK1EDfpxISzJ7I9OYgaDtPE=
-k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 h1:Y3gxNAuB0OBLImH611+UDZcmKS3g6CthxToOb37KgwE=
-k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912/go.mod h1:kdmbQkyfwUagLfXIad1y2TdrjPFWp2Q89B3qkRwf/pQ=
-k8s.io/utils v0.0.0-20251002143259-bc988d571ff4 h1:SjGebBtkBqHFOli+05xYbK8YF1Dzkbzn+gDM4X9T4Ck=
-k8s.io/utils v0.0.0-20251002143259-bc988d571ff4/go.mod h1:OLgZIPagt7ERELqWJFomSt595RzquPNLL48iOWgYOg0=
-sigs.k8s.io/controller-runtime v0.23.3 h1:VjB/vhoPoA9l1kEKZHBMnQF33tdCLQKJtydy4iqwZ80=
-sigs.k8s.io/controller-runtime v0.23.3/go.mod h1:B6COOxKptp+YaUT5q4l6LqUJTRpizbgf9KSRNdQGns0=
+k8s.io/api v0.36.0 h1:SgqDhZzHdOtMk40xVSvCXkP9ME0H05hPM3p9AB1kL80=
+k8s.io/api v0.36.0/go.mod h1:m1LVrGPNYax5NBHdO+QuAedXyuzTt4RryI/qnmNvs34=
+k8s.io/apiextensions-apiserver v0.36.0 h1:Wt7E8J+VBCbj4FjiBfDTK/neXDDjyJVJc7xfuOHImZ0=
+k8s.io/apiextensions-apiserver v0.36.0/go.mod h1:kGDjH0msuiIB3tgsYRV0kS9GqpMYMUsQ3GHv7TApyug=
+k8s.io/apimachinery v0.36.0 h1:jZyPzhd5Z+3h9vJLt0z9XdzW9VzNzWAUw+P1xZ9PXtQ=
+k8s.io/apimachinery v0.36.0/go.mod h1:FklypaRJt6n5wUIwWXIP6GJlIpUizTgfo1T/As+Tyxc=
+k8s.io/client-go v0.36.0 h1:pOYi7C4RHChYjMiHpZSpSbIM6ZxVbRXBy7CuiIwqA3c=
+k8s.io/client-go v0.36.0/go.mod h1:ZKKcpwF0aLYfkHFCjillCKaTK/yBkEDHTDXCFY6AS9Y=
+k8s.io/klog/v2 v2.140.0 h1:Tf+J3AH7xnUzZyVVXhTgGhEKnFqye14aadWv7bzXdzc=
+k8s.io/klog/v2 v2.140.0/go.mod h1:o+/RWfJ6PwpnFn7OyAG3QnO47BFsymfEfrz6XyYSSp0=
+k8s.io/kube-openapi v0.0.0-20260317180543-43fb72c5454a h1:xCeOEAOoGYl2jnJoHkC3hkbPJgdATINPMAxaynU2Ovg=
+k8s.io/kube-openapi v0.0.0-20260317180543-43fb72c5454a/go.mod h1:uGBT7iTA6c6MvqUvSXIaYZo9ukscABYi2btjhvgKGZ0=
+k8s.io/utils v0.0.0-20260210185600-b8788abfbbc2 h1:AZYQSJemyQB5eRxqcPky+/7EdBj0xi3g0ZcxxJ7vbWU=
+k8s.io/utils v0.0.0-20260210185600-b8788abfbbc2/go.mod h1:xDxuJ0whA3d0I4mf/C4ppKHxXynQ+fxnkmQH0vTHnuk=
+sigs.k8s.io/controller-runtime v0.24.0 h1:Ck6N2LdS8Lovy1o25BB4r1xjvLEKUl1s2o9kU+KWDE4=
+sigs.k8s.io/controller-runtime v0.24.0/go.mod h1:vFkfY5fGt5xAC/sKb8IBFKgWPNKG9OUG29dR8Y2wImw=
 sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 h1:IpInykpT6ceI+QxKBbEflcR5EXP7sU1kvOlxwZh5txg=
 sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730/go.mod h1:mdzfpAEoE6DHQEN0uh9ZbOCuHbLK5wOm7dK4ctXE9Tg=
 sigs.k8s.io/randfill v1.0.0 h1:JfjMILfT8A6RbawdsK2JXGBR5AQVfd+9TbzrlneTyrU=
 sigs.k8s.io/randfill v1.0.0/go.mod h1:XeLlZ/jmk4i1HRopwe7/aU3H5n1zNUcX6TM94b3QxOY=
-sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482 h1:2WOzJpHUBVrrkDjU4KBT8n5LDcj824eX0I5UKcgeRUs=
-sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482/go.mod h1:M3W8sfWvn2HhQDIbGWj3S099YozAsymCo/wrT5ohRUE=
+sigs.k8s.io/structured-merge-diff/v6 v6.3.2 h1:kwVWMx5yS1CrnFWA/2QHyRVJ8jM6dBA80uLmm0wJkk8=
+sigs.k8s.io/structured-merge-diff/v6 v6.3.2/go.mod h1:M3W8sfWvn2HhQDIbGWj3S099YozAsymCo/wrT5ohRUE=
 sigs.k8s.io/yaml v1.6.0 h1:G8fkbMSAFqgEFgh4b1wmtzDnioxFCUgTZhlbj5P9QYs=
 sigs.k8s.io/yaml v1.6.0/go.mod h1:796bPqUfzR/0jLAl6XjHl3Ck7MiyVv8dbTdyT3/pMf4=
diff --git a/hack/crd-ref-docs-config.yaml b/hack/crd-ref-docs-config.yaml
new file mode 100644
index 0000000..14fd3fa
--- /dev/null
+++ b/hack/crd-ref-docs-config.yaml
@@ -0,0 +1,18 @@
+# Configuration for crd-ref-docs (https://github.com/elastic/crd-ref-docs).
+# Used by `make docs-api` to regenerate docs/reference/api/ from the
+# kubebuilder markers in api/v1alpha1/.
+
+processor:
+  # No type-level ignores — the v0.7.0 reference docs aim to be
+  # comprehensive. Each spec, status, and embedded helper type gets a
+  # section. Add an `ignoreTypes` regex here only if a type genuinely
+  # shouldn't be in the public API surface.
+  # `List` suffix types are kubebuilder-generated pagination wrappers
+  # used only by client-go callers — not part of the user-facing API.
+  ignoreTypes:
+    - "List$"
+  ignoreFields:
+    - "TypeMeta$"
+
+render:
+  kubernetesVersion: "1.31"
diff --git a/internal/controller/agentteam_controller.go b/internal/controller/agentteam_controller.go
index 70e1636..d21ce28 100644
--- a/internal/controller/agentteam_controller.go
+++ b/internal/controller/agentteam_controller.go
@@ -1494,7 +1494,7 @@ func (r *AgentTeamReconciler) executeOnComplete(ctx context.Context, team *claud
 	case "create-pr":
 		if err := r.executeCreatePR(ctx, team); err != nil {
 			log.Error(err, "create-pr failed")
-			r.recordEvent(team, corev1.EventTypeWarning, "PRCreationFailed", err.Error())
+			r.recordEvent(team, corev1.EventTypeWarning, "PRCreationFailed", "%v", err)
 			return err
 		}
 	case "push-branch":
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..2069355
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,118 @@
+site_name: kagents
+site_description: Run Claude Code Agent Teams as a Kubernetes operator
+site_url: https://kagents.dev
+site_author: CAM Labs LLC
+
+repo_name: amcheste/claude-teams-operator
+repo_url: https://github.com/amcheste/claude-teams-operator
+edit_uri: edit/main/docs/
+
+copyright: Copyright © 2026 CAM Labs LLC
+
+# Files inside docs/ that aren't part of the public site.
+# `helm-values.md` and `cfp/` predate the docs site and link to repo
+# files outside docs/; they'll be migrated into the site (helm-values
+# under /reference/, cfp/ stays internal) in follow-up v0.7.0 issues.
+# `README.md` collides with index.md on the auto-generated index — kept
+# in-repo as the contributor-facing dev-loop guide.
+exclude_docs: |
+  README.md
+  helm-values.md
+  cfp/
+
+theme:
+  name: material
+  features:
+    - navigation.instant
+    - navigation.tracking
+    - navigation.sections
+    - navigation.top
+    - navigation.footer
+    - search.suggest
+    - search.highlight
+    - content.code.copy
+    - content.action.edit
+    - toc.follow
+  palette:
+    - media: "(prefers-color-scheme: light)"
+      scheme: default
+      primary: indigo
+      accent: indigo
+      toggle:
+        icon: material/brightness-7
+        name: Switch to dark mode
+    - media: "(prefers-color-scheme: dark)"
+      scheme: slate
+      primary: indigo
+      accent: indigo
+      toggle:
+        icon: material/brightness-4
+        name: Switch to light mode
+  font:
+    text: Inter
+    code: JetBrains Mono
+  icon:
+    repo: fontawesome/brands/github
+
+plugins:
+  - search
+  - git-revision-date-localized:
+      type: date
+      enable_creation_date: true
+      fallback_to_build_date: true
+
+markdown_extensions:
+  - admonition
+  - attr_list
+  - md_in_html
+  - tables
+  - toc:
+      permalink: true
+  - pymdownx.details
+  - pymdownx.highlight:
+      anchor_linenums: true
+      line_spans: __span
+      pygments_lang_class: true
+  - pymdownx.inlinehilite
+  - pymdownx.snippets
+  - pymdownx.superfences:
+      custom_fences:
+        # Mermaid diagrams. Material has built-in renderer support; this
+        # routes ```mermaid fences through it.
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format
+  - pymdownx.tabbed:
+      alternate_style: true
+
+extra:
+  social:
+    - icon: fontawesome/brands/github
+      link: https://github.com/amcheste/claude-teams-operator
+
+# Diátaxis nav: Tutorials (learn) / How-to (solve) / Reference (look up) /
+# Explanation (understand). Section index pages exist as placeholders so
+# the structure is in place even before all content lands.
+nav:
+  - Home: index.md
+  - Tutorials:
+    - tutorials/index.md
+    - Getting started: tutorials/getting-started.md
+  - How-to guides:
+    - how-to/index.md
+    - Install:
+      - On Amazon EKS: how-to/install/eks.md
+      - On Google GKE: how-to/install/gke.md
+      - On Azure AKS: how-to/install/aks.md
+    - Operate:
+      - Expose the dashboard: how-to/operate/expose-dashboard.md
+      - Configure shared storage: how-to/operate/shared-storage.md
+      - Set budget alerts: how-to/operate/budget-alerts.md
+  - Reference:
+    - reference/index.md
+    - API reference: reference/api/index.md
+  - Explanation:
+    - explanation/index.md
+    - Resource model: explanation/resources.md
+    - Coordination protocol: explanation/coordination.md
+    - Operations: explanation/operations.md