diff --git a/docs/runbooks/codex-in-gateway-pod.md b/docs/runbooks/codex-in-gateway-pod.md
new file mode 100644
index 00000000..2e4c8777
--- /dev/null
+++ b/docs/runbooks/codex-in-gateway-pod.md
@@ -0,0 +1,181 @@
+# Running codex (local-CLI wrapper) inside the gateway pod
+
+ADR-005 Stage 2. The `clawdbot-gateway` pod now ships `codex` and `commonly`
+binaries via an init container (`codex-tools-installer`) and a shared
+`/tools` volume on `PATH`. This runbook covers the operator steps to wire
+a `codex` agent into a pod and start the run loop manually.
+
+This is the bridge that lets dev agents (theo / nova / pixel / ops) replace
+their `acpx_run` calls with `@codex` mentions — codex itself runs as a
+first-class Commonly agent in a pod the dev agent shares.
+
+## Prerequisites
+
+- Helm release `commonly-dev` includes the chart at or after the commit that
+  added the `codex-tools-installer` init container (`feat/adr-005-stage-2-codex-image`).
+- Codex `auth.json` is already provisioned by the existing `clawdbot-auth-seed`
+  init container (`/state/.codex/auth.json` → copied to `~/.codex/auth.json`
+  via the gateway container's `lifecycle.postStart`). No new secret needed —
+  the wrapper reuses the same chatgpt account #1 the existing acpx_run path
+  uses.
+- Backend image on dev includes `POST /api/agents/runtime/room` and the
+  agent-room 1:1 enforcement (PR #232 + #235).
+
+## One-time bootstrap
+
+Pick a dev pod or Agent DM where the codex agent should be installed (one
+per developer is fine; the wrapper serves multiple sessions per ADR-005's
+session-per-pod model).
+
+### 1. Open a shell in the gateway pod
+
+```bash
+GATEWAY_POD=$(kubectl get pod -n commonly-dev -l app=clawdbot-gateway -o jsonpath='{.items[0].metadata.name}')
+kubectl exec -n commonly-dev -it "$GATEWAY_POD" -- bash
+```
+
+### 2. Verify the tools landed
+
+```bash
+codex --version     # codex-cli 0.125.0 (or whatever is pinned in values)
+commonly --version  # 0.1.0
+```
+
+If either fails, the init container's soft-fail path may have triggered
+(npm registry unreachable, github clone failed, etc.); the gateway is still
+up but `/tools` is empty. Check from the shell where `GATEWAY_POD` is set:
+
+```bash
+kubectl logs -n commonly-dev "$GATEWAY_POD" -c codex-tools-installer
+```
+
+Look for `[codex-tools-installer] install failed` lines. To re-run the init
+container, restart the pod (`kubectl delete pod -n commonly-dev "$GATEWAY_POD"`).
+
+### 3. Authenticate the commonly CLI to api-dev
+
+The wrapper needs a USER token (not the agent runtime token) to call
+`commonly agent attach`. Get one from your dev account and save it.
+
+```bash
+commonly login --instance https://api-dev.commonly.me --key dev
+# enter email + password at the prompts
+```
+
+(Inside a non-TTY exec, you'd pipe email/password via stdin — but the
+operator step is interactive.)
+
+### 4. Attach the codex agent to a pod
+
+Pick a target pod ID (e.g., a dev-team chat pod or an Agent DM created
+via the Agent Hub "Talk to" button). Then:
+
+```bash
+commonly agent attach codex \
+  --pod \
+  --name codex \
+  --instance dev
+```
+
+This:
+- Registers the codex agent in the kernel (`AgentInstallation` row)
+- Mints a runtime token at `~/.commonly/tokens/codex.json`
+- Reuses the codex CLI's existing `auth.json` for actual model access
+
+### 5. Start the run loop
+
+```bash
+nohup commonly agent run codex > /tmp/commonly-codex-run.log 2>&1 &
+```
+
+Or in a tmux session if you want to watch it:
+
+```bash
+tmux new -s codex
+commonly agent run codex
+# Ctrl+b d to detach
+```
+
+The run loop polls the instance you passed to `commonly login` (here:
+api-dev's URL — your own self-hosted instance is whatever URL you logged
+in to), spawns codex on each `chat.mention` / `dm.message`, and posts the
+response back to the originating pod. Per ADR-005 §Spawning semantics,
+one process serializes spawns — collisions queue, no parallelism.
+
+**Don't run two `commonly agent run codex` processes for the same agent
+name.** ADR-005 invariant #4 explicitly calls this out as unsupported in
+v1: each `run` would poll, ack, and post independently, producing
+duplicate replies. Higher throughput needs a different agent identity
+(separate `commonly agent attach codex-2 ...`) — file as a follow-up if
+the single-process throughput becomes a real bottleneck.
+
+### 6. Smoke test
+
+In the target pod, mention `@codex` from any human or agent member:
+
+```
+@codex please reply with the single word: pong
+```
+
+Within a minute, codex should post `pong` back. Tail the log to confirm:
+
+```bash
+tail -f /tmp/commonly-codex-run.log
+```
+
+Expected:
+
+```
+[codex] polling https://api-dev.commonly.me for events (ctrl+c to stop)
+[codex] [chat.mention] spawning codex
+[codex] [chat.mention] posted 4 bytes
+```
+
+## After bootstrap — letting dev agents use it
+
+Once `@codex` is live in a dev pod, dev agents (theo / nova / pixel / ops)
+can mention it from their HEARTBEAT.md template instead of calling
+`acpx_run`. Cut over one agent at a time:
+
+1. Edit the agent's HEARTBEAT.md in `backend/services/registry.js` (the
+   permanent source of truth — PVC edits get overwritten on
+   `reprovision-all`).
+2. Replace any block that does `acpx_run({ agentId: "codex", ... })` with
+   `commonly_post_message({ podId, content: "@codex " })` plus the
+   agent's pattern for reading the response on the next heartbeat tick.
+3. Run `reprovision-all` so the new HEARTBEAT lands.
+4. Watch the agent's next few heartbeats. Compare end-to-end behavior to
+   the prior `acpx_run` flow.
+
+If parity holds across one heartbeat cycle, roll out to the next agent.
+If it doesn't, revert the HEARTBEAT change (one-line revert) and investigate
+before broadening.
+
+## Operational caveats
+
+- **Shared quota.** All `@codex` invocations use the existing chatgpt
+  account #1's quota — same one the LiteLLM rotator and acpx_run already
+  consume. Hitting the weekly cap will manifest as `turn.failed` JSONL
+  events with "usage limit" messages. The codex adapter (`cli/src/lib/adapters/codex.js`)
+  surfaces these as the agent's reply, so users see a clear error. To
+  raise the ceiling: add a dedicated codex account for the wrapper (a
+  follow-up PR — separate auth.json mounted at a non-shared path, run
+  loop with `CODEX_HOME` env var pointing at it).
+- **Pod restart penalty.** The init container reinstalls `@openai/codex`
+  from npm and `@commonly/cli` from a git clone on every pod start (~30s).
+  emptyDir is intentional (simpler than caching); revisit if restart latency hurts.
+- **Run loop survives pod restarts only as a manual step.** This runbook
+  describes a non-daemonized start. A future iteration will move the run
+  loop into the pod's lifecycle so it auto-starts. Until then, after
+  `kubectl delete pod` or `helm upgrade`, re-run step 5.
+- **Logs go to the pod's filesystem.** `/tmp/commonly-codex-run.log` is
+  ephemeral; stream to stdout if you want it in `kubectl logs`. Or wire
+  through to a sidecar fluent-bit later.
+
+## Related
+
+- `cli/src/lib/adapters/codex.js` — the adapter (PR #231)
+- `cli/src/commands/agent.js` — `attach`, `run`, `detach` commands
+- ADR-005 §Adapter pattern — invariants the adapter holds
+- `_external/clawdbot/extensions/commonly/src/tools.ts` — the `acpx_run`
+  this is replacing (target for removal once all dev agents are cut over)
diff --git a/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml b/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml
index 76d45d11..0dc52570 100644
--- a/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml
+++ b/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml
@@ -464,6 +464,61 @@ spec:
               readOnly: true
             - name: clawdbot-state
               mountPath: /state
+        # ADR-005 Stage 2: install codex CLI + @commonly/cli into a shared
+        # volume the gateway container mounts. Avoids modifying the
+        # _external/clawdbot Dockerfile (a submodule on the openclaw fork) —
+        # the install lives in helm so it ships with the deploy.
+        #
+        # `node:22-bookworm-slim` matches the runtime container's Node major
+        # so any libc-linked binaries are compatible. `git` isn't in slim by
+        # default; we apt-install it because @commonly/cli isn't published
+        # to npm yet (ADR-005 Phase 4) and must be installed from source.
+        #
+        # Soft-fail: if the npm registry or github is unreachable at pod-
+        # restart time, the gateway should still start — the run loop is an
+        # operator-started manual step (Stage 2), and `acpx_run` continues
+        # to work as a fallback. A hard-failing init container would strand
+        # the entire gateway pod for an issue with one optional capability.
+        - name: codex-tools-installer
+          image: node:22-bookworm-slim
+          command:
+            - /bin/sh
+            - -c
+            - |
+              install_codex_tools() (
+                set -e  # honored only because the call below is not part of a ||/if list
+                export NPM_CONFIG_PREFIX=/tools
+                apt-get update >/dev/null 2>&1
+                DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
+                  git ca-certificates >/dev/null 2>&1
+                # Pin codex by version (npm published) — see values.yaml.
+                npm install --global --no-audit --no-fund \
+                  "@openai/codex@{{ .Values.agents.clawdbot.codexTools.codexVersion | default "0.125.0" }}"
+                # @commonly/cli — install from source at the pinned ref. The
+                # cli/ subdirectory is a self-contained npm package with one
+                # runtime dep (commander); ~200KB of source.
+                git clone --depth 1 --branch "{{ .Values.agents.clawdbot.codexTools.commonlyCliRef | default "main" }}" \
+                  https://github.com/Team-Commonly/commonly.git /tmp/commonly-src
+                mkdir -p /tools/lib
+                cp -r /tmp/commonly-src/cli /tools/lib/commonly-cli
+                cd /tools/lib/commonly-cli
+                npm install --omit=dev --no-audit --no-fund
+                ln -sf /tools/lib/commonly-cli/src/index.js /tools/bin/commonly
+                chmod +x /tools/lib/commonly-cli/src/index.js
+                # Verify (best-effort; `|| true` keeps a failed version check
+                # from aborting the subshell under `set -e`).
+                /tools/bin/codex --version || true
+                /tools/bin/commonly --version || true
+                ls -la /tools/bin/ || true
+              )
+              install_codex_tools
+              [ "$?" -eq 0 ] && exit 0
+              echo "[codex-tools-installer] install failed — gateway will start without /tools binaries"
+              echo "[codex-tools-installer] operator: 'commonly agent run codex' is unavailable until next pod restart"
+              exit 0
+          volumeMounts:
+            - name: codex-tools
+              mountPath: /tools
       containers:
         - name: clawdbot-gateway
           image: "{{ .Values.agents.clawdbot.image.repository }}:{{ .Values.agents.clawdbot.image.tag }}"
@@ -485,6 +540,12 @@ spec:
                   - -c
                   - mkdir -p /home/node/.codex && cp /state/.codex/auth.json /home/node/.codex/auth.json 2>/dev/null || true
           env:
+            # ADR-005 Stage 2: prepend the codex-tools volume to PATH so
+            # `codex` and `commonly` are on PATH for operator `kubectl exec`
+            # invocations and the run-loop process. Order matters — these
+            # binaries shadow any same-named ones in the openclaw image.
+            - name: PATH
+              value: "/tools/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
             - name: CLAWDBOT_GATEWAY_PORT
               value: {{ .Values.agents.clawdbot.config.gatewayPort | default 18789 | quote }}
             - name: CLAWDBOT_GATEWAY_BIND
@@ -594,6 +655,12 @@ spec:
               mountPath: /state
             - name: clawdbot-workspace
               mountPath: /workspace
+            # ADR-005 Stage 2 — codex + commonly CLIs installed by the
+            # codex-tools-installer init container. Mounted read-only here
+            # since nothing in the main container needs to write to it.
+            - name: codex-tools
+              mountPath: /tools
+              readOnly: true
       volumes:
         - name: clawdbot-config
           configMap:
@@ -604,6 +671,11 @@ spec:
         - name: clawdbot-workspace
           persistentVolumeClaim:
             claimName: clawdbot-workspace-pvc
+        # emptyDir is fine — re-populated on every pod start by the init
+        # container. Adds ~30s to startup but avoids a PVC + invalidation
+        # logic. Revisit if pod-restart latency becomes a real concern.
+        - name: codex-tools
+          emptyDir: {}
       {{- with .Values.agents.clawdbot.nodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 8 }}
diff --git a/k8s/helm/commonly/values.yaml b/k8s/helm/commonly/values.yaml
index 1ec12f8c..e323797f 100644
--- a/k8s/helm/commonly/values.yaml
+++ b/k8s/helm/commonly/values.yaml
@@ -188,6 +188,14 @@ agents:
       repository: gcr.io/YOUR_GCP_PROJECT_ID/clawdbot-gateway
       tag: latest
       pullPolicy: IfNotPresent
+    # ADR-005 Stage 2 — codex + @commonly/cli installed by an init
+    # container into a shared /tools volume. codexVersion pins the
+    # @openai/codex npm package; commonlyCliRef pins the git ref this
+    # repo's cli/ subdirectory is installed from. @commonly/cli is not
+    # yet on npm (ADR-005 Phase 4), so we install from source.
+    codexTools:
+      codexVersion: "0.125.0"
+      commonlyCliRef: "main"
     config:
       gatewayPort: 18789
       bridgePort: 18790