feat(adr-005): Stage 2 — codex CLI + @commonly/cli in the gateway pod by samxu01 · Pull Request #236 · Team-Commonly/commonly

samxu01 · 2026-04-25T23:23:13Z

Summary

ADR-005 Stage 2: ship codex and commonly CLIs into the gateway pod so dev agents can eventually mention @codex instead of calling acpx_run.

Two pieces, both small:

- Helm chart (k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml): adds a codex-tools-installer init container that runs npm install --global for @openai/codex@latest and @commonly/cli@latest into a shared emptyDir volume codex-tools, mounted at /tools in the main gateway container. The main container's PATH is prepended with /tools/bin so the binaries are reachable by kubectl exec and any future run-loop process.
- Runbook (docs/runbooks/codex-in-gateway-pod.md): covers the operator bootstrap: shell into the pod, verify tools, commonly login, commonly agent attach codex --pod <id>, commonly agent run codex, smoke-mention @codex, see pong come back. Also covers the dev-agent HEARTBEAT cutover plan (one agent at a time, revert on parity break).

Why an init container, not a Dockerfile change

_external/clawdbot is a submodule on Team-Commonly/openclaw. Touching its Dockerfile would require a fork PR + a submodule-pointer bump here. The init-container path lives entirely in the commonly chart and ships in one PR. Trade-off: every pod start re-downloads the two packages (~30s). emptyDir is intentional — simpler than caching; revisit if restart latency becomes a real concern.
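
For reviewers who want a concrete picture of the install step, it could look roughly like the sketch below. This is not the chart's actual script: the use of `--prefix` to target the shared volume, the `TOOLS_DIR` and `CODEX_VERSION` variable names, and both function names are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the init container's npm install step.
# TOOLS_DIR and CODEX_VERSION are illustrative names, not chart values.
TOOLS_DIR="${TOOLS_DIR:-/tools}"
CODEX_VERSION="${CODEX_VERSION:-latest}"

install_codex_from_npm() {
  # With --global and --prefix, npm places executables in $TOOLS_DIR/bin
  # and package contents in $TOOLS_DIR/lib/node_modules.
  npm install --global --prefix "$TOOLS_DIR" "@openai/codex@${CODEX_VERSION}"
}

prepend_tools_to_path() {
  # Mirrors the PATH change applied to the main gateway container.
  PATH="$TOOLS_DIR/bin:$PATH"
  export PATH
}
```

Because the volume is an emptyDir, nothing under `$TOOLS_DIR` survives a pod restart, which is where the re-download cost comes from.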

Auth.json reuse

The existing clawdbot-auth-seed init container already provisions chatgpt account-1's codex auth.json to /state/.codex/auth.json, and the gateway container's lifecycle.postStart copies it to ~/.codex/auth.json. The wrapper reuses that — no new ESO secret required for Stage 2 minimum.

Trade-off: shared quota with the existing acpx_run path. A dedicated codex account for the wrapper is a follow-up if it becomes a bottleneck (separate auth.json mounted at a non-shared path, run loop with CODEX_HOME env var pointing at it).
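
A quick operator check that the credentials are where the codex CLI will look might be sketched as follows. The function name is hypothetical, and the sketch assumes the CLI reads `auth.json` from `$CODEX_HOME` when that variable is set and from `~/.codex` otherwise.

```shell
#!/bin/sh
# Hypothetical operator check; the function name is illustrative.
# Assumes auth.json lives under $CODEX_HOME, falling back to ~/.codex.
check_codex_auth() {
  auth_file="${CODEX_HOME:-$HOME/.codex}/auth.json"
  if [ -s "$auth_file" ]; then
    echo "auth.json present at $auth_file"
  else
    echo "auth.json missing or empty at $auth_file" >&2
    return 1
  fi
}
```

The same check would cover a dedicated account later: export `CODEX_HOME` to the non-shared path before calling it.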

Why the run loop isn't auto-started yet

Auto-starting commonly agent run codex at container start needs a runtime token + an attached agent name to already exist. Bootstrap is a one-time operator step. Stage 2 ships the substrate; auto-start is a follow-up after the manual flow proves out end-to-end.
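
The one-time operator step can be pictured as the sequence below, run from a shell inside the gateway pod. The commands are the ones the runbook lists; the wrapper function, its name, and its argument handling are illustrative only.

```shell
#!/bin/sh
# Sketch of the manual bootstrap; bootstrap_codex_agent is a hypothetical
# wrapper around the runbook's commands, not part of either CLI.
bootstrap_codex_agent() {
  pod_id="$1"
  if [ -z "$pod_id" ]; then
    echo "usage: bootstrap_codex_agent <pod-id>" >&2
    return 2
  fi
  # Confirm both CLIs are on PATH before touching any state.
  codex --version || return 1
  commonly --version || return 1
  commonly login || return 1
  commonly agent attach codex --pod "$pod_id" || return 1
  # Presumably stays in the foreground for as long as the run loop is active.
  commonly agent run codex
}
```

Auto-start would need the login and attach results to exist already, which is why only the final command is a candidate for the container lifecycle.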

What this PR does NOT do

- Doesn't flip deploy-dev.yml to push-on-main. The cluster currently has a LiteLLM CrashLoopBackOff (separate ops issue) that fails helm upgrade --wait. Auto-deploying on every merge would just queue failures. File a follow-up once LiteLLM is unstuck.
- Doesn't replace any dev agent's acpx_run calls. That's the cutover work the runbook describes: one agent at a time, after this PR + the manual bootstrap land cleanly.
- Doesn't touch the _external/clawdbot submodule.

Test plan

- helm template renders cleanly (exit 0)
- init container, /tools volume mount, and PATH env all present in rendered output
- Manual smoke after merge + deploy: kubectl exec into the gateway pod, run codex --version and commonly --version, follow the runbook to attach + run, verify @codex round-trip in a dev pod
- Operator validation that the existing ~/.codex/auth.json path is populated when the codex CLI starts (the lifecycle.postStart copy from /state/.codex/auth.json)
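
The rendered-output check in the test plan could be automated with something like the helper below. It is a sketch: the function name is hypothetical, and the three search strings are assumptions about how the rendered YAML spells the init container name, the volume mount, and the PATH value.

```shell
#!/bin/sh
# Hypothetical helper: reads a rendered manifest on stdin and checks for
# the three items the test plan lists. The patterns are assumptions about
# the rendered YAML, not copied from the chart.
check_rendered_manifest() {
  rendered="$(cat)"
  missing=0
  for needle in "codex-tools-installer" "mountPath: /tools" "/tools/bin"; do
    if ! printf '%s\n' "$rendered" | grep -qF -- "$needle"; then
      echo "missing in rendered output: $needle" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It would be fed from `helm template <release> k8s/helm/commonly | check_rendered_manifest`, with whatever values files the chart normally needs; the release name is left as a placeholder.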

🤖 Generated with Claude Code

Adds a `codex-tools-installer` init container to the clawdbot-gateway deployment that installs `@openai/codex` and `@commonly/cli` into a shared `/tools` volume the main container mounts (read-only, with `PATH=/tools/bin:...` prepended). This is the bridge that lets dev agents (theo / nova / pixel / ops) eventually replace `acpx_run` with `@codex` mentions per ADR-005.

Why an init container instead of modifying the openclaw image: `_external/clawdbot` is a submodule on the Team-Commonly/openclaw fork. Touching its Dockerfile would require a fork PR + a submodule pointer bump in this repo. The init-container path lives entirely in the commonly chart and ships with this deploy. Trade-off: every pod start re-downloads the two packages (~30s); emptyDir is intentional, simpler than caching, revisit if restart latency becomes a real concern.

Auth.json reuse: the existing `clawdbot-auth-seed` init container already provisions chatgpt account-1's codex `auth.json` to `/state/.codex/auth.json`, and the gateway container's `lifecycle.postStart` copies it to `~/.codex/auth.json`. The wrapper reuses that, so no new ESO secret is required for Stage 2 minimum. A dedicated codex account for the wrapper is a follow-up if the shared quota becomes a bottleneck.

Run-loop start is operator-driven for now (`commonly agent attach codex` + `commonly agent run codex` inside the pod). Auto-start in the container lifecycle is a follow-up; wanted to ship the substrate first so the manual flow can be validated end-to-end.

Runbook at docs/runbooks/codex-in-gateway-pod.md covers the operator bootstrap end-to-end: shell into the pod, verify the tools, login the commonly CLI, attach the agent, start the run loop, smoke. Also covers the dev-agent HEARTBEAT cutover plan (one agent at a time, revert on parity break).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer caught the deployment-blocking issue: @commonly/cli is NOT published to npm (verified: npm 404). The init container as written would fail every gateway-pod start.

Fixes (all from the code-reviewer pass):

1. CRITICAL: install @commonly/cli from source. The cli/ subdirectory is a self-contained ~200KB package with one runtime dep (commander). Init container now apt-installs git, clones this repo at the pinned ref, copies cli/ into /tools/lib/commonly-cli, runs npm install --omit=dev, and symlinks the bin. @openai/codex still installs from npm (it IS published, current 0.125.0).
2. IMPORTANT: soft-fail on install error. Wrap the install logic in a function called via `install_codex_tools || { echo ...; exit 0; }` so a transient npm/github outage at pod-restart time doesn't strand the gateway pod. The run loop is operator-started (Stage 2), and acpx_run continues to work as a fallback. A hard-failing init container would take down agent routing for an outage in one optional capability.
3. IMPORTANT: pin versions. New `agents.clawdbot.codexTools` block in values.yaml: codexVersion: "0.125.0", commonlyCliRef: "main". Operator can pin to a SHA or tag without a chart PR.
4. IMPORTANT: runbook contradicted ADR-005 invariant #4 ("two terminals for the same agent are unsupported"). Replaced the "run multiple instances" suggestion with an explicit "don't do this in v1; needs a different agent identity for higher throughput" note.
5. NIT: switched the verification step from `codex login status` (output format depends on codex version) to `codex --version`, which is what the adapter's detect() uses too. Also rewrote the runbook's polling-URL phrasing so a self-hosted operator isn't confused by the dev-instance hostname showing up.

helm template renders OK; version pin substitutes correctly:

"@openai/codex@0.125.0"
git clone --depth 1 --branch "main" ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
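
The from-source install in fix 1 can be sketched as below. This is a reading of the commit message, not the chart's script: `REPO_URL`, `CLI_BIN_ENTRY` (the path of the bin file inside `cli/`), the other variable names, and the function name are all assumptions, and the apt step that installs git is left out.

```shell
#!/bin/sh
# Hypothetical sketch of installing @commonly/cli from a git checkout.
# REPO_URL and CLI_BIN_ENTRY are assumptions; check cli/package.json for
# the real bin entry before relying on the symlink target.
TOOLS_DIR="${TOOLS_DIR:-/tools}"
COMMONLY_CLI_REF="${COMMONLY_CLI_REF:-main}"
REPO_URL="${REPO_URL:-https://github.com/Team-Commonly/commonly.git}"
CLI_BIN_ENTRY="${CLI_BIN_ENTRY:-bin/commonly.js}"

install_commonly_cli_from_source() {
  src="$(mktemp -d)" || return 1
  dest="$TOOLS_DIR/lib/commonly-cli"
  # Shallow clone of the pinned branch or tag.
  git clone --depth 1 --branch "$COMMONLY_CLI_REF" "$REPO_URL" "$src/repo" || return 1
  mkdir -p "$TOOLS_DIR/lib" "$TOOLS_DIR/bin" || return 1
  rm -rf "$dest"
  cp -R "$src/repo/cli" "$dest" || return 1
  # Install runtime dependencies only, inside the copied package.
  (cd "$dest" && npm install --omit=dev) || return 1
  ln -sf "$dest/$CLI_BIN_ENTRY" "$TOOLS_DIR/bin/commonly" || return 1
  rm -rf "$src"
}
```

One caveat when pinning: `git clone --branch` accepts branch and tag names but not a raw commit SHA, so a SHA pin would need a different fetch, for example `git init` followed by `git fetch --depth 1 origin <sha>` and a checkout of `FETCH_HEAD`.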

samxu01 · 2026-04-25T23:30:13Z

Self-review pass via code-reviewer subagent — addressed the critical + 3 important + 1 nit in b5ef9f844c.

| Finding | Severity | Resolution |
| --- | --- | --- |
| `@commonly/cli@latest` is NOT on npm (verified: `npm view` returns 404). Init container as written would 100% fail every gateway-pod start | Critical | Install @commonly/cli from source instead. Init container apt-installs git, clones this repo at the pinned ref, copies `cli/` into `/tools/lib/commonly-cli`, runs `npm install --omit=dev`, and symlinks the bin. `@openai/codex` still installs from npm (it IS published). |
| Hard-failing init container strands the gateway pod for any npm/github transient | Important | Wrapped install logic in `install_codex_tools \|\| { echo failed; exit 0; }`. acpx_run continues working as fallback; operator can re-trigger by deleting the pod. |
| `@latest` floating versions = bad upstream releases break gateway restarts | Important | New `agents.clawdbot.codexTools` block in values.yaml: `codexVersion: "0.125.0"`, `commonlyCliRef: "main"`. Operator can pin to a SHA or tag without a chart PR. |
| Runbook said "run multiple `commonly agent run codex` instances", which directly contradicts ADR-005 invariant #4 (`Two terminals for the same agent are unsupported in v1`) | Important | Rewrote the paragraph as an explicit "don't do this in v1" note with a follow-up suggestion (separate agent identity, e.g. `codex-2`, for true parallelism). |
| `codex login status` output format depends on codex version | Nit | Switched verification to `codex --version` (matches what the adapter's `detect()` uses). |

Reviewer's other items (PATH default, runtime UID, api-dev.commonly.me hostname) acknowledged as non-issues / borderline-but-fine in context.
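
For readers skimming the soft-fail resolution, the pattern behaves like the sketch below. `install_codex_tools` here is a stand-in with a simulated failure switch (`SIMULATE_OK` is a hypothetical variable for the demo), and `run_installer` is an illustrative wrapper; only the `|| { ...; exit 0; }` shape comes from this PR.

```shell
#!/bin/sh
# Sketch of the soft-fail wrapper. install_codex_tools is a placeholder;
# in the chart it would hold the real npm and git install steps.
install_codex_tools() {
  # Placeholder body: fails unless SIMULATE_OK is set, to mimic an outage.
  [ -n "${SIMULATE_OK:-}" ]
}

run_installer() {
  # Runs in a subshell so that `exit 0` ends only the installer.
  (
    install_codex_tools || {
      echo "WARN: codex tools install failed; continuing without them" >&2
      exit 0
    }
    echo "codex tools installed"
  )
}
```

Either way the installer exits 0, so the main gateway container still starts; on failure `/tools/bin` is simply empty and the operator sees the warning in the init container's logs.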

helm template renders OK with the new pin substitutions:

"@openai/codex@0.125.0"
git clone --depth 1 --branch "main" ...

CI re-running on the review-fix commit.

…#236)

Adds a `codex-tools-installer` init container to the clawdbot-gateway deployment that installs `@openai/codex` (pinned via `agents.clawdbot.codexTools.codexVersion`) and the `@commonly/cli` (installed from this repo's `cli/` subdirectory at the pinned git ref) into a shared `/tools` volume the main container mounts read-only with `/tools/bin` prepended to PATH.

This is the bridge ADR-005 Stage 2 needs so dev agents (theo / nova / pixel / ops) can eventually mention `@codex` instead of calling `acpx_run`. The wrapper itself shipped in PR #231; this PR puts the substrate where it can run.

Why an init container, not a Dockerfile change: `_external/clawdbot` is a submodule on the openclaw fork; touching its Dockerfile would require a fork PR + a submodule pointer bump. The init container path lives entirely in the commonly chart and ships with this deploy.

Why @commonly/cli installs from source: it's not on npm yet (ADR-005 Phase 4 publication hasn't shipped). The cli/ subdirectory is a self-contained ~200KB package with one runtime dep. The init container apt-installs git, clones this repo at the pinned ref, copies cli/ into /tools/lib/commonly-cli, runs npm install --omit=dev, and symlinks the bin. Pin the ref to a SHA or tag in values.yaml when stability matters.

Soft-fail: if the npm registry or github is unreachable at pod-start time, the wrapper falls through with a warning rather than failing the init container. Gateway routing keeps working (acpx_run continues as fallback); operator can re-trigger on the next pod restart. Hard-failing the gateway pod for a transient outage in one optional capability would strand all agent traffic, which is the wrong trade-off.

Auth.json reuse: the existing `clawdbot-auth-seed` init container already provisions chatgpt account-1's codex `auth.json` to `/state/.codex/auth.json`, and the gateway container's `lifecycle.postStart` copies it to `~/.codex/auth.json`. The wrapper reuses that, so no new ESO secret is needed. Trade-off: shared quota with the existing acpx_run path. A dedicated codex account for the wrapper is a follow-up if it becomes a bottleneck.

Run loop is operator-driven for now (`commonly agent attach codex` + `commonly agent run codex` inside the pod). Auto-start in the container lifecycle is a follow-up after the manual flow validates end-to-end.

Runbook at `docs/runbooks/codex-in-gateway-pod.md` covers the operator bootstrap end-to-end and the dev-agent HEARTBEAT cutover plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samxu01 · 2026-04-25T23:37:01Z

Squashed and merged manually as commit ff5bc15 to preserve authorship per memory/feedback-pr-merge-pattern.md.

samxu01 and others added 2 commits April 25, 2026 16:22

samxu01 closed this Apr 25, 2026

samxu01 deleted the feat/adr-005-stage-2-codex-image branch April 25, 2026 23:37

samxu01 mentioned this pull request Apr 26, 2026

feat(cli): close cross-agent loop — auto-handle agent.ask + ${COMMONLY_AGENT_TOKEN} substitution #237

Closed

3 tasks
