From 85f3cd0d64d6a58804a1adc1943fc38a86c473af Mon Sep 17 00:00:00 2001
From: Sam Xu
Date: Sat, 25 Apr 2026 16:22:47 -0700
Subject: [PATCH 1/2] =?UTF-8?q?feat(adr-005):=20Stage=202=20=E2=80=94=20co?=
 =?UTF-8?q?dex=20CLI=20+=20@commonly/cli=20in=20the=20gateway=20pod?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a `codex-tools-installer` init container to the clawdbot-gateway
deployment that installs `@openai/codex` and `@commonly/cli` into a
shared `/tools` volume the main container mounts (read-only, with
`PATH=/tools/bin:...` prepended). This is the bridge that lets dev agents
(theo / nova / pixel / ops) eventually replace `acpx_run` with `@codex`
mentions per ADR-005.

Why an init container instead of modifying the openclaw image:
`_external/clawdbot` is a submodule on the Team-Commonly/openclaw fork.
Touching its Dockerfile would require a fork PR + a submodule pointer
bump in this repo. The init-container path lives entirely in the commonly
chart and ships with this deploy. Trade-off: every pod start re-downloads
the two packages (~30s). The emptyDir is intentional because it is
simpler than caching; revisit if restart latency becomes a real concern.

Auth.json reuse: the existing `clawdbot-auth-seed` init container already
provisions chatgpt account-1's codex `auth.json` to
`/state/.codex/auth.json`, and the gateway container's
`lifecycle.postStart` copies it to `~/.codex/auth.json`. The wrapper
reuses that file, so no new ESO secret is required for the Stage 2
minimum. A dedicated codex account for the wrapper is a follow-up if the
shared quota becomes a bottleneck.

Run-loop start is operator-driven for now (`commonly agent attach codex`
+ `commonly agent run codex` inside the pod). Auto-start in the container
lifecycle is a follow-up; the substrate ships first so the manual flow
can be validated end-to-end.

Runbook at docs/runbooks/codex-in-gateway-pod.md covers the operator
bootstrap end-to-end: shell into the pod, verify the tools, login the
commonly CLI, attach the agent, start the run loop, smoke. Also covers
the dev-agent HEARTBEAT cutover plan (one agent at a time, revert on
parity break).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/runbooks/codex-in-gateway-pod.md         | 169 ++++++++++++++++++
 .../templates/agents/clawdbot-deployment.yaml |  40 +++++
 2 files changed, 209 insertions(+)
 create mode 100644 docs/runbooks/codex-in-gateway-pod.md

diff --git a/docs/runbooks/codex-in-gateway-pod.md b/docs/runbooks/codex-in-gateway-pod.md
new file mode 100644
index 00000000..038e5b34
--- /dev/null
+++ b/docs/runbooks/codex-in-gateway-pod.md
@@ -0,0 +1,169 @@
+# Running codex (local-CLI wrapper) inside the gateway pod
+
+ADR-005 Stage 2. The `clawdbot-gateway` pod now ships `codex` and `commonly`
+binaries via an init container (`codex-tools-installer`) and a shared
+`/tools` volume on `PATH`. This runbook covers the operator steps to wire
+a `codex` agent into a pod and start the run loop manually.
+
+This is the bridge that lets dev agents (theo / nova / pixel / ops) replace
+their `acpx_run` calls with `@codex` mentions — codex itself runs as a
+first-class Commonly agent in a pod the dev agent shares.
+
+## Prerequisites
+
+- Helm release `commonly-dev` includes the chart at or after the commit that
+  added the `codex-tools-installer` init container (`feat/adr-005-stage-2-codex-image`).
+- Codex `auth.json` is already provisioned by the existing `clawdbot-auth-seed`
+  init container (`/state/.codex/auth.json` → copied to `~/.codex/auth.json`
+  via the gateway container's `lifecycle.postStart`). No new secret needed —
+  the wrapper reuses the same chatgpt account #1 the existing acpx_run path
+  uses.
+- Backend image on dev includes `POST /api/agents/runtime/room` and the
+  agent-room 1:1 enforcement (PR #232 + #235).
+
+## One-time bootstrap
+
+Pick a dev pod or Agent DM where the codex agent should be installed (one
+per developer is fine; the wrapper serves multiple sessions per ADR-005's
+session-per-pod model).
+
+### 1. Open a shell in the gateway pod
+
+```bash
+GATEWAY_POD=$(kubectl get pod -n commonly-dev -l app=clawdbot-gateway -o jsonpath='{.items[0].metadata.name}')
+kubectl exec -n commonly-dev -it "$GATEWAY_POD" -- bash
+```
+
+### 2. Verify the tools landed
+
+```bash
+codex --version
+commonly --version
+codex login status  # Logged in using ChatGPT
+```
+
+If any of those fail, the init container hasn't finished or didn't install
+cleanly — `kubectl describe pod $GATEWAY_POD` and look at
+`codex-tools-installer` status.
+
+### 3. Authenticate the commonly CLI to api-dev
+
+The wrapper needs a USER token (not the agent runtime token) to call
+`commonly agent attach`. Get one from your dev account and save it.
+
+```bash
+commonly login --instance https://api-dev.commonly.me --key dev
+# enter email + password at the prompts
+```
+
+(Inside a non-TTY exec, you'd pipe email/password via stdin — but the
+operator step is interactive.)
+
+### 4. Attach the codex agent to a pod
+
+Pick a target pod ID (e.g., a dev-team chat pod or an Agent DM created
+via the Agent Hub "Talk to" button). Then:
+
+```bash
+commonly agent attach codex \
+  --pod \
+  --name codex \
+  --instance dev
+```
+
+This:
+- Registers the codex agent in the kernel (`AgentInstallation` row)
+- Mints a runtime token at `~/.commonly/tokens/codex.json`
+- Reuses the codex CLI's existing `auth.json` for actual model access
+
+### 5. Start the run loop
+
+```bash
+nohup commonly agent run codex > /tmp/commonly-codex-run.log 2>&1 &
+```
+
+Or in a tmux session if you want to watch it:
+
+```bash
+tmux new -s codex
+commonly agent run codex
+# Ctrl+b d to detach
+```
+
+The run loop polls `https://api-dev.commonly.me/api/agents/runtime/events`,
+spawns codex on each `chat.mention` / `dm.message`, and posts the response
+back to the originating pod. Per ADR-005 §Spawning semantics, one process
+serializes spawns — collisions queue, no parallelism. For higher
+throughput, run multiple `commonly agent run codex` instances pointing at
+different `auth.json` files.
+
+### 6. Smoke
+
+In the target pod, mention `@codex` from any human or agent member:
+
+```
+@codex please reply with the single word: pong
+```
+
+Within a minute, codex should post `pong` back. Tail the log to confirm:
+
+```bash
+tail -f /tmp/commonly-codex-run.log
+```
+
+Expected:
+
+```
+[codex] polling https://api-dev.commonly.me for events (ctrl+c to stop)
+[codex] [chat.mention] spawning codex
+[codex] [chat.mention] posted 4 bytes
+```
+
+## After bootstrap — letting dev agents use it
+
+Once `@codex` is live in a dev pod, dev agents (theo / nova / pixel / ops)
+can mention it from their HEARTBEAT.md template instead of calling
+`acpx_run`. Cutover one agent at a time:
+
+1. Edit the agent's HEARTBEAT.md in `backend/services/registry.js` (the
+   permanent source of truth — PVC edits get overwritten on
+   `reprovision-all`).
+2. Replace any block that does `acpx_run({ agentId: "codex", ... })` with
+   `commonly_post_message({ podId, content: "@codex " })` plus the
+   agent's pattern for reading the response on the next heartbeat tick.
+3. Run `reprovision-all` so the new HEARTBEAT lands.
+4. Watch the agent's next few heartbeats. Compare end-to-end behavior to
+   the prior `acpx_run` flow.
+
+If the parity holds across one heartbeat cycle, roll out to the next agent.
+If it doesn't, revert the HEARTBEAT change (one-line revert) and investigate
+before broadening.
+
+## Operational caveats
+
+- **Shared quota.** All `@codex` invocations use the existing chatgpt
+  account #1's quota — same one the LiteLLM rotator and acpx_run already
+  consume. Hitting the weekly cap will manifest as `turn.failed` JSONL
+  events with "usage limit" messages. The codex adapter (`cli/src/lib/adapters/codex.js`)
+  surfaces these as the agent's reply, so users see a clear error. To
+  raise the ceiling: add a dedicated codex account for the wrapper (a
+  follow-up PR — separate auth.json mounted at a non-shared path, run
+  loop with `CODEX_HOME` env var pointing at it).
+- **Pod restart penalty.** The init container reinstalls `@openai/codex` +
+  `@commonly/cli` from npm on every pod start (~30s). emptyDir is
+  intentional — simpler than caching; revisit if restart latency hurts.
+- **Run loop survives pod restarts only as a manual step.** This runbook
+  describes a non-daemonized start. A future iteration will move the run
+  loop into the pod's lifecycle so it auto-starts. Until then, after
+  `kubectl delete pod` or `helm upgrade`, re-run step 5.
+- **Logs go to the pod's filesystem.** `/tmp/commonly-codex-run.log` is
+  ephemeral; stream to stdout if you want it in `kubectl logs`. Or wire
+  through to a sidecar fluent-bit later.
+ +## Related + +- `cli/src/lib/adapters/codex.js` — the adapter (PR #231) +- `cli/src/commands/agent.js` — `attach`, `run`, `detach` commands +- ADR-005 §Adapter pattern — invariants the adapter holds +- `_external/clawdbot/extensions/commonly/src/tools.ts` — the `acpx_run` + this is replacing (target for removal once all dev agents are cut over) diff --git a/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml b/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml index 76d45d11..dd881ce1 100644 --- a/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml +++ b/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml @@ -464,6 +464,29 @@ spec: readOnly: true - name: clawdbot-state mountPath: /state + # ADR-005 Stage 2: install codex CLI + @commonly/cli into a shared + # volume the gateway container mounts. Avoids modifying the + # _external/clawdbot Dockerfile (which is a submodule on the openclaw + # fork) — the install lives in helm so it ships with the deploy. + # `node:22-bookworm-slim` matches the runtime container's Node major + # so binaries built for libc are compatible. + - name: codex-tools-installer + image: node:22-bookworm-slim + command: + - /bin/sh + - -c + - | + set -e + export NPM_CONFIG_PREFIX=/tools + npm install --global --no-audit --no-fund \ + @openai/codex@latest \ + @commonly/cli@latest + ls -la /tools/bin/ || true + /tools/bin/codex --version || true + /tools/bin/commonly --version || true + volumeMounts: + - name: codex-tools + mountPath: /tools containers: - name: clawdbot-gateway image: "{{ .Values.agents.clawdbot.image.repository }}:{{ .Values.agents.clawdbot.image.tag }}" @@ -485,6 +508,12 @@ spec: - -c - mkdir -p /home/node/.codex && cp /state/.codex/auth.json /home/node/.codex/auth.json 2>/dev/null || true env: + # ADR-005 Stage 2: prepend the codex-tools volume to PATH so + # `codex` and `commonly` are on PATH for operator `kubectl exec` + # invocations and the run-loop process. 
Order matters — these + # binaries shadow any same-named ones in the openclaw image. + - name: PATH + value: "/tools/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" - name: CLAWDBOT_GATEWAY_PORT value: {{ .Values.agents.clawdbot.config.gatewayPort | default 18789 | quote }} - name: CLAWDBOT_GATEWAY_BIND @@ -594,6 +623,12 @@ spec: mountPath: /state - name: clawdbot-workspace mountPath: /workspace + # ADR-005 Stage 2 — codex + commonly CLIs installed by the + # codex-tools-installer init container. Mounted read-only here + # since nothing in the main container needs to write to it. + - name: codex-tools + mountPath: /tools + readOnly: true volumes: - name: clawdbot-config configMap: @@ -604,6 +639,11 @@ spec: - name: clawdbot-workspace persistentVolumeClaim: claimName: clawdbot-workspace-pvc + # emptyDir is fine — re-populated on every pod start by the init + # container. Adds ~30s to startup but avoids a PVC + invalidation + # logic. Revisit if pod-restart latency becomes a real concern. + - name: codex-tools + emptyDir: {} {{- with .Values.agents.clawdbot.nodeSelector }} nodeSelector: {{- toYaml . | nindent 8 }} From b5ef9f844cebd76f96b20aecf9fcfe6274dcc45e Mon Sep 17 00:00:00 2001 From: Sam Xu Date: Sat, 25 Apr 2026 16:29:55 -0700 Subject: [PATCH 2/2] review(stage-2): address self-review findings on codex-tools-installer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reviewer caught the deployment-blocking issue: @commonly/cli is NOT published to npm (verified — npm 404). The init container as written would fail every gateway-pod start. Fixes (all from the code-reviewer pass): 1. CRITICAL — install @commonly/cli from source. The cli/ subdirectory is a self-contained ~200KB package with one runtime dep (commander). Init container now apt-installs git, clones this repo at the pinned ref, copies cli/ into /tools/lib/commonly-cli, runs npm install --omit=dev, and symlinks the bin. 
@openai/codex still installs from npm (it IS published, current 0.125.0). 2. IMPORTANT — soft-fail on install error. Wrap the install logic in a function called via `install_codex_tools || { echo ...; exit 0; }` so a transient npm/github outage at pod-restart time doesn't strand the gateway pod. The run loop is operator-started (Stage 2), and acpx_run continues to work as a fallback. A hard-failing init container would take down agent routing for an outage in one optional capability. 3. IMPORTANT — pin versions. New `agents.clawdbot.codexTools` block in values.yaml: codexVersion: "0.125.0", commonlyCliRef: "main". Operator can pin to a SHA or tag without a chart PR. 4. IMPORTANT — runbook contradicted ADR-005 invariant #4 ("two terminals for the same agent are unsupported"). Replaced the "run multiple instances" suggestion with an explicit "don't do this in v1; needs a different agent identity for higher throughput" note. 5. NIT — switched the verification step from `codex login status` (output format depends on codex version) to `codex --version`, which is what the adapter's detect() uses too. Also rewrote the runbook's polling-URL phrasing so a self-hosted operator isn't confused by the dev-instance hostname showing up. helm template renders OK; version pin substitutes correctly: "@openai/codex@0.125.0" git clone --depth 1 --branch "main" ... Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/runbooks/codex-in-gateway-pod.md | 36 ++++++++----- .../templates/agents/clawdbot-deployment.yaml | 54 +++++++++++++++---- k8s/helm/commonly/values.yaml | 8 +++ 3 files changed, 75 insertions(+), 23 deletions(-) diff --git a/docs/runbooks/codex-in-gateway-pod.md b/docs/runbooks/codex-in-gateway-pod.md index 038e5b34..2e4c8777 100644 --- a/docs/runbooks/codex-in-gateway-pod.md +++ b/docs/runbooks/codex-in-gateway-pod.md @@ -37,14 +37,20 @@ kubectl exec -n commonly-dev -it "$GATEWAY_POD" -- bash ### 2. 
Verify the tools landed ```bash -codex --version -commonly --version -codex login status # Logged in using ChatGPT +codex --version # codex-cli 0.125.0 (or whatever is pinned in values) +commonly --version # 0.1.0 ``` -If any of those fail, the init container hasn't finished or didn't install -cleanly — `kubectl describe pod $GATEWAY_POD` and look at -`codex-tools-installer` status. +If either fails, the init container's soft-fail path may have triggered +(npm registry unreachable, github clone failed, etc.) — the gateway is +still up but `/tools` is empty. Check: + +```bash +kubectl logs -n commonly-dev "$GATEWAY_POD" -c codex-tools-installer +``` + +Look for `[codex-tools-installer] install failed` lines. Re-running the +init container = restart the pod (`kubectl delete pod "$GATEWAY_POD"`). ### 3. Authenticate the commonly CLI to api-dev @@ -90,12 +96,18 @@ commonly agent run codex # Ctrl+b d to detach ``` -The run loop polls `https://api-dev.commonly.me/api/agents/runtime/events`, -spawns codex on each `chat.mention` / `dm.message`, and posts the response -back to the originating pod. Per ADR-005 §Spawning semantics, one process -serializes spawns — collisions queue, no parallelism. For higher -throughput, run multiple `commonly agent run codex` instances pointing at -different `auth.json` files. +The run loop polls the instance you passed to `commonly login` (here: +api-dev's URL — your own self-hosted instance is whatever URL you logged +in to), spawns codex on each `chat.mention` / `dm.message`, and posts the +response back to the originating pod. Per ADR-005 §Spawning semantics, +one process serializes spawns — collisions queue, no parallelism. + +**Don't run two `commonly agent run codex` processes for the same agent +name.** ADR-005 invariant #4 explicitly calls this out as unsupported in +v1: each `run` would poll, ack, and post independently, producing +duplicate replies. 
Higher throughput needs a different agent identity +(separate `commonly agent attach codex-2 ...`) — file as a follow-up if +the single-process throughput becomes a real bottleneck. ### 6. Smoke diff --git a/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml b/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml index dd881ce1..0dc52570 100644 --- a/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml +++ b/k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml @@ -466,24 +466,56 @@ spec: mountPath: /state # ADR-005 Stage 2: install codex CLI + @commonly/cli into a shared # volume the gateway container mounts. Avoids modifying the - # _external/clawdbot Dockerfile (which is a submodule on the openclaw - # fork) — the install lives in helm so it ships with the deploy. + # _external/clawdbot Dockerfile (a submodule on the openclaw fork) — + # the install lives in helm so it ships with the deploy. + # # `node:22-bookworm-slim` matches the runtime container's Node major - # so binaries built for libc are compatible. + # so any libc-linked binaries are compatible. `git` isn't in slim by + # default; we apt-install it because @commonly/cli isn't published + # to npm yet (ADR-005 Phase 4) and must be installed from source. + # + # Soft-fail: if the npm registry or github is unreachable at pod- + # restart time, the gateway should still start — the run loop is an + # operator-started manual step (Stage 2), and `acpx_run` continues + # to work as a fallback. A hard-failing init container would strand + # the entire gateway pod for an issue with one optional capability. 
- name: codex-tools-installer image: node:22-bookworm-slim command: - /bin/sh - -c - | - set -e - export NPM_CONFIG_PREFIX=/tools - npm install --global --no-audit --no-fund \ - @openai/codex@latest \ - @commonly/cli@latest - ls -la /tools/bin/ || true - /tools/bin/codex --version || true - /tools/bin/commonly --version || true + install_codex_tools() { + set -e + export NPM_CONFIG_PREFIX=/tools + apt-get update >/dev/null 2>&1 + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + git ca-certificates >/dev/null 2>&1 + # Pin codex by version (npm published) — see values.yaml. + npm install --global --no-audit --no-fund \ + "@openai/codex@{{ .Values.agents.clawdbot.codexTools.codexVersion | default "0.125.0" }}" + # @commonly/cli — install from source at the pinned ref. The + # cli/ subdirectory is a self-contained npm package with one + # runtime dep (commander); ~200KB of source. + git clone --depth 1 --branch "{{ .Values.agents.clawdbot.codexTools.commonlyCliRef | default "main" }}" \ + https://github.com/Team-Commonly/commonly.git /tmp/commonly-src + mkdir -p /tools/lib + cp -r /tmp/commonly-src/cli /tools/lib/commonly-cli + cd /tools/lib/commonly-cli + npm install --omit=dev --no-audit --no-fund + ln -sf /tools/lib/commonly-cli/src/index.js /tools/bin/commonly + chmod +x /tools/lib/commonly-cli/src/index.js + # Verify (best-effort; failures here don't block since the + # outer || already swallowed earlier failures we care about). 
+ /tools/bin/codex --version || true + /tools/bin/commonly --version || true + ls -la /tools/bin/ || true + } + install_codex_tools || { + echo "[codex-tools-installer] install failed — gateway will start without /tools binaries" + echo "[codex-tools-installer] operator: 'commonly agent run codex' is unavailable until next pod restart" + exit 0 + } volumeMounts: - name: codex-tools mountPath: /tools diff --git a/k8s/helm/commonly/values.yaml b/k8s/helm/commonly/values.yaml index 1ec12f8c..e323797f 100644 --- a/k8s/helm/commonly/values.yaml +++ b/k8s/helm/commonly/values.yaml @@ -188,6 +188,14 @@ agents: repository: gcr.io/YOUR_GCP_PROJECT_ID/clawdbot-gateway tag: latest pullPolicy: IfNotPresent + # ADR-005 Stage 2 — codex + @commonly/cli installed by an init + # container into a shared /tools volume. codexVersion pins the + # @openai/codex npm package; commonlyCliRef pins the git ref this + # repo's cli/ subdirectory is installed from. @commonly/cli is not + # yet on npm (ADR-005 Phase 4), so we install from source. + codexTools: + codexVersion: "0.125.0" + commonlyCliRef: "main" config: gatewayPort: 18789 bridgePort: 18790
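
Review note on the soft-fail wrapper in PATCH 2/2: POSIX shells ignore `set -e` while running any command on the left side of `||`, and that includes the body of a function called there. With the `install_codex_tools || { ... }` shape, a failed `npm install` or `git clone` would therefore probably not stop the function, and the trailing `|| true` checks would make it return 0, so the `install failed` message would never print. The sketch below reproduces that behavior with stand-in commands and shows one possible alternative: run the body in a subshell as a plain statement, then inspect its exit status. The names `body`, `soft_fail_or_list`, and `soft_fail_subshell` are hypothetical and not part of the chart.

```bash
#!/bin/sh
# Stand-in for install_codex_tools: an early failing step followed by
# best-effort verification, mirroring the shape used in the init container.
body() {
  set -e
  false                        # simulates a failed npm install / git clone
  echo "reached-after-failure"
  true                         # simulates the trailing `... || true` checks
}

# Shape used in the patch: the function is the left operand of `||`, so
# `set -e` is ignored inside it and the fallback on the right never runs.
soft_fail_or_list() {
  ( body || echo "install failed" )
}

# Alternative sketch: run the body in a subshell as a plain command, then
# test its exit status. `set -e` is honored there, so the subshell stops at
# `false` and the fallback message is printed.
soft_fail_subshell() {
  (
    set +e                     # a failure below must not abort this wrapper
    ( body )
    status=$?
    [ "$status" -eq 0 ] || echo "install failed"
  )
}
```

Calling `soft_fail_or_list` prints `reached-after-failure`, while `soft_fail_subshell` prints `install failed`. The second form only behaves this way when it is itself invoked as a plain statement, not as part of another `||` or `if` condition.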
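
Review note on `commonlyCliRef`: the PATCH 2/2 commit message says the ref can be pinned to a SHA or a tag, but `git clone --branch` accepts branch and tag names only, not a raw commit SHA. A SHA pin would likely need a fetch-by-SHA path. The following is a minimal sketch, assuming the remote permits fetching the requested commit directly (GitHub generally does). `is_full_sha` and `fetch_ref` are hypothetical helper names, not part of the chart.

```bash
#!/bin/sh
# Returns 0 when the argument looks like a full 40-character hex commit SHA.
is_full_sha() {
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 40 ]
}

# fetch_ref URL REF DEST: check out REF (branch, tag, or full SHA) into DEST.
fetch_ref() {
  url=$1 ref=$2 dest=$3
  if is_full_sha "$ref"; then
    # --branch cannot take a SHA, so fetch that one commit explicitly.
    git init -q "$dest" &&
      git -C "$dest" fetch -q --depth 1 "$url" "$ref" &&
      git -C "$dest" checkout -q FETCH_HEAD
  else
    # Branch or tag name: the shallow clone used in the patch works as-is.
    git clone -q --depth 1 --branch "$ref" "$url" "$dest"
  fi
}
```

In the init container this could stand in for the `git clone` line, for example `fetch_ref https://github.com/Team-Commonly/commonly.git "$REF" /tmp/commonly-src`, where `REF` is a hypothetical variable holding the rendered `commonlyCliRef` value.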