feat(adr-005): Stage 2 — codex CLI + @commonly/cli in the gateway pod#236
Closed
Adds a `codex-tools-installer` init container to the clawdbot-gateway deployment that installs `@openai/codex` and `@commonly/cli` into a shared `/tools` volume the main container mounts (read-only, with `PATH=/tools/bin:...` prepended). This is the bridge that lets dev agents (theo / nova / pixel / ops) eventually replace `acpx_run` with `@codex` mentions per ADR-005.

Why an init container instead of modifying the openclaw image: `_external/clawdbot` is a submodule on the Team-Commonly/openclaw fork. Touching its Dockerfile would require a fork PR + a submodule pointer bump in this repo. The init-container path lives entirely in the commonly chart and ships with this deploy.

Trade-off: every pod start re-downloads the two packages (~30s). The emptyDir is intentional because it is simpler than caching; revisit if restart latency becomes a real concern.

Auth.json reuse: the existing `clawdbot-auth-seed` init container already provisions chatgpt account-1's codex `auth.json` to `/state/.codex/auth.json`, and the gateway container's `lifecycle.postStart` copies it to `~/.codex/auth.json`. The wrapper reuses that, so no new ESO secret is required for the Stage 2 minimum. A dedicated codex account for the wrapper is a follow-up if the shared quota becomes a bottleneck.

Run-loop start is operator-driven for now (`commonly agent attach codex` + `commonly agent run codex` inside the pod). Auto-start in the container lifecycle is a follow-up; the substrate ships first so the manual flow can be validated end-to-end.

Runbook at docs/runbooks/codex-in-gateway-pod.md covers the operator bootstrap end-to-end: shell into the pod, verify the tools, log in with the commonly CLI, attach the agent, start the run loop, smoke-test. It also covers the dev-agent HEARTBEAT cutover plan (one agent at a time, revert on parity break).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
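The `auth.json` reuse described above amounts to a small copy step. The gateway's actual `lifecycle.postStart` hook is not shown in this PR, so the following is only a sketch of what such a step might look like. The function name `seed_codex_auth` and its two arguments are hypothetical; only the two paths come from the commit message.

```shell
#!/bin/sh
# Sketch only: copies a seeded codex auth file into the user's home.
# seed_codex_auth and its arguments are illustrative, not the chart's real hook.

seed_codex_auth() {
  src=${1:-/state/.codex/auth.json}   # path provisioned by clawdbot-auth-seed
  home_dir=${2:-$HOME}                # target home; ~/.codex/auth.json lives here

  if [ ! -f "$src" ]; then
    # Missing seed is not fatal: warn and let the container keep starting.
    echo "WARN: $src not found; codex CLI will start unauthenticated" >&2
    return 0
  fi

  mkdir -p "$home_dir/.codex" &&
    cp "$src" "$home_dir/.codex/auth.json" &&
    chmod 600 "$home_dir/.codex/auth.json"   # credentials: owner-only access
}

# With no arguments this uses the paths named in the commit message.
seed_codex_auth
```

Treating a missing seed file as a warning rather than an error matches the PR's general preference for not blocking the gateway on an optional capability, though whether the real hook behaves this way is an assumption.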
Reviewer caught the deployment-blocking issue: @commonly/cli is NOT
published to npm (verified — npm 404). The init container as written
would fail every gateway-pod start.
Fixes (all from the code-reviewer pass):
1. CRITICAL — install @commonly/cli from source. The cli/ subdirectory
is a self-contained ~200KB package with one runtime dep (commander).
Init container now apt-installs git, clones this repo at the pinned
ref, copies cli/ into /tools/lib/commonly-cli, runs npm install
--omit=dev, and symlinks the bin. @openai/codex still installs from
npm (it IS published, current 0.125.0).
2. IMPORTANT — soft-fail on install error. Wrap the install logic in a
function called via `install_codex_tools || { echo ...; exit 0; }`
so a transient npm/github outage at pod-restart time doesn't strand
the gateway pod. The run loop is operator-started (Stage 2), and
acpx_run continues to work as a fallback. A hard-failing init
container would take down agent routing for an outage in one
optional capability.
3. IMPORTANT — pin versions. New `agents.clawdbot.codexTools` block
in values.yaml: codexVersion: "0.125.0", commonlyCliRef: "main".
Operator can pin to a SHA or tag without a chart PR.
4. IMPORTANT — runbook contradicted ADR-005 invariant #4 ("two terminals
for the same agent are unsupported"). Replaced the "run multiple
instances" suggestion with an explicit "don't do this in v1; needs
a different agent identity for higher throughput" note.
5. NIT — switched the verification step from `codex login status`
(output format depends on codex version) to `codex --version`,
which is what the adapter's detect() uses too. Also rewrote the
runbook's polling-URL phrasing so a self-hosted operator isn't
confused by the dev-instance hostname showing up.
helm template renders OK; version pin substitutes correctly:
"@openai/codex@0.125.0"
git clone --depth 1 --branch "main" ...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
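Fix 2 relies on a shell detail worth spelling out: when a function is called on the left side of `||`, `set -e` is ignored inside it, so a failing step does not stop the function unless the steps are chained explicitly. A minimal sketch of the pattern follows. `INSTALL_CMD` is a hypothetical stand-in for the real install steps, and `return 0` takes the place of the init container's `exit 0` so the snippet can be sourced safely.

```shell
#!/bin/sh
# Sketch of the soft-fail pattern. INSTALL_CMD is a hypothetical hook that
# stands in for the real npm/git steps so the pattern runs without network.

install_codex_tools() {
  # `set -e` is ignored in a function invoked as `f || ...`, so steps are
  # chained with `&&` to stop at the first failure.
  "${INSTALL_CMD:-false}" &&
    echo "codex tools installed"
}

soft_install() {
  install_codex_tools || {
    # Swallow the failure: the gateway must still start without the tools.
    echo "WARN: codex tools install failed; continuing without them" >&2
    return 0   # the init container would use `exit 0` here
  }
}

# Default INSTALL_CMD is `false`, so this prints the warning and succeeds.
soft_install
```

The cost of this design is that a failed install is visible only in the init container's log, so the `codex --version` check in the runbook remains the way to confirm the tools actually landed.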
Self-review pass via code-reviewer subagent — addressed the critical + 3 important + 1 nit in
Reviewer's other items (PATH default, runtime UID, helm template renders OK with the new pin substitutions: CI re-running on the review-fix commit.
samxu01 added a commit that referenced this pull request on Apr 25, 2026
…#236)

Adds a `codex-tools-installer` init container to the clawdbot-gateway deployment that installs `@openai/codex` (pinned via `agents.clawdbot.codexTools.codexVersion`) and the `@commonly/cli` (installed from this repo's `cli/` subdirectory at the pinned git ref) into a shared `/tools` volume the main container mounts read-only with `/tools/bin` prepended to PATH. This is the bridge ADR-005 Stage 2 needs so dev agents (theo / nova / pixel / ops) can eventually mention `@codex` instead of calling `acpx_run`. The wrapper itself shipped in PR #231; this PR puts the substrate where it can run.

Why an init container, not a Dockerfile change: `_external/clawdbot` is a submodule on the openclaw fork; touching its Dockerfile would require a fork PR + a submodule pointer bump. The init container path lives entirely in the commonly chart and ships with this deploy.

Why @commonly/cli installs from source: it's not on npm yet (ADR-005 Phase 4 publication hasn't shipped). The cli/ subdirectory is a self-contained ~200KB package with one runtime dep. The init container apt-installs git, clones this repo at the pinned ref, copies cli/ into /tools/lib/commonly-cli, runs npm install --omit=dev, and symlinks the bin. Pin the ref to a SHA or tag in values.yaml when stability matters.

Soft-fail: if the npm registry or github is unreachable at pod-start time, the installer falls through with a warning rather than failing the init container. Gateway routing keeps working (acpx_run continues as fallback); the operator can re-trigger the install on the next pod restart. Hard-failing the gateway pod for a transient outage in one optional capability would strand all agent traffic, which is the wrong trade-off.

Auth.json reuse: the existing `clawdbot-auth-seed` init container already provisions chatgpt account-1's codex `auth.json` to `/state/.codex/auth.json`, and the gateway container's `lifecycle.postStart` copies it to `~/.codex/auth.json`. The codex wrapper reuses that, so no new ESO secret is needed. Trade-off: shared quota with the existing acpx_run path. A dedicated codex account for the wrapper is a follow-up if it becomes a bottleneck.

Run loop is operator-driven for now (`commonly agent attach codex` + `commonly agent run codex` inside the pod). Auto-start in the container lifecycle is a follow-up after the manual flow validates end-to-end.

Runbook at `docs/runbooks/codex-in-gateway-pod.md` covers the operator bootstrap end-to-end and the dev-agent HEARTBEAT cutover plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
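The install-from-source sequence in the commit message can be laid out as a dry-run script. This is a sketch under several assumptions, not the chart's actual init script: `REPO_URL` is a placeholder (the source elides the clone URL), the `bin/commonly.js` entry path and the use of `npm --prefix` are guesses, and `DRY_RUN`, `run`, and `install_plan` are hypothetical names. The `apt` step that installs git is omitted. One caveat that may matter for pinning: `git clone --branch` accepts branch and tag names but not a raw commit SHA, so pinning `commonlyCliRef` to a SHA would likely need a different fetch strategy.

```shell
#!/bin/sh
# Dry-run sketch of the installer steps. All variable names and the
# placeholder URL are illustrative; only the versions and target paths
# come from the PR text.

CODEX_VERSION="${CODEX_VERSION:-0.125.0}"            # values.yaml: codexVersion
COMMONLY_CLI_REF="${COMMONLY_CLI_REF:-main}"          # values.yaml: commonlyCliRef
REPO_URL="${REPO_URL:-https://example.invalid/commonly.git}"  # placeholder, not the real repo
TOOLS_DIR="${TOOLS_DIR:-/tools}"
SRC_DIR="${SRC_DIR:-/tmp/commonly-src}"
DRY_RUN="${DRY_RUN:-1}"                               # 1 = print commands only

run() {
  # Print the command in dry-run mode, execute it otherwise.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_plan() {
  run npm install --global --prefix "$TOOLS_DIR" "@openai/codex@${CODEX_VERSION}" &&
    run git clone --depth 1 --branch "$COMMONLY_CLI_REF" "$REPO_URL" "$SRC_DIR" &&
    run mkdir -p "$TOOLS_DIR/lib" "$TOOLS_DIR/bin" &&
    run cp -r "$SRC_DIR/cli" "$TOOLS_DIR/lib/commonly-cli" &&
    run npm install --omit=dev --prefix "$TOOLS_DIR/lib/commonly-cli" &&
    # Assumed entry point; the real bin path inside cli/ is not given in the PR.
    run ln -sf "$TOOLS_DIR/lib/commonly-cli/bin/commonly.js" "$TOOLS_DIR/bin/commonly"
}

install_plan
```

Keeping the default in dry-run mode means the script only prints the six commands it would run, which makes it easy to compare against the rendered `helm template` output before trying it in a pod.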
Squashed and merged manually as commit ff5bc15 to preserve authorship per memory/feedback-pr-merge-pattern.md.
Summary

ADR-005 Stage 2: ship `codex` and `commonly` CLIs into the gateway pod so dev agents can eventually mention `@codex` instead of calling `acpx_run`.

Two pieces, both small:

- Helm chart (`k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml`): adds a `codex-tools-installer` init container that runs `npm install --global` for `@openai/codex@latest` and `@commonly/cli@latest` into a shared `emptyDir` volume `codex-tools`, mounted at `/tools` in the main gateway container. The main container's `PATH` is prepended with `/tools/bin` so the binaries are reachable by `kubectl exec` and any future run-loop process.
- Runbook (`docs/runbooks/codex-in-gateway-pod.md`): the operator bootstrap (shell into the pod, verify tools, `commonly login`, `commonly agent attach codex --pod <id>`, `commonly agent run codex`, smoke-mention `@codex`, see `pong` come back). Also covers the dev-agent HEARTBEAT cutover plan (one agent at a time, revert on parity break).

Why an init container, not a Dockerfile change

`_external/clawdbot` is a submodule on `Team-Commonly/openclaw`. Touching its Dockerfile would require a fork PR + a submodule-pointer bump here. The init-container path lives entirely in the commonly chart and ships in one PR.

Trade-off: every pod start re-downloads the two packages (~30s). emptyDir is intentional, being simpler than caching; revisit if restart latency becomes a real concern.

Auth.json reuse

The existing `clawdbot-auth-seed` init container already provisions chatgpt account-1's codex `auth.json` to `/state/.codex/auth.json`, and the gateway container's `lifecycle.postStart` copies it to `~/.codex/auth.json`. The wrapper reuses that, so no new ESO secret is required for the Stage 2 minimum.

Trade-off: shared quota with the existing `acpx_run` path. A dedicated codex account for the wrapper is a follow-up if it becomes a bottleneck (separate `auth.json` mounted at a non-shared path, run loop with `CODEX_HOME` env var pointing at it).

Why the run loop isn't auto-started yet

Auto-starting `commonly agent run codex` at container start needs a runtime token + an attached agent name to already exist. Bootstrap is a one-time operator step. Stage 2 ships the substrate; auto-start is a follow-up after the manual flow proves out end-to-end.

What this PR does NOT do

- … `deploy-dev.yml` to push-on-main. The cluster currently has a LiteLLM CrashLoopBackOff (separate ops issue) that fails `helm upgrade --wait`. Auto-deploying on every merge would just queue failures. File a follow-up once LiteLLM is unstuck.
- … `acpx_run` calls. That's the cutover work the runbook describes: one agent at a time, after this PR + the manual bootstrap land cleanly.
- … `_external/clawdbot` submodule.

Test plan

- `helm template` renders cleanly (exit 0)
- … `/tools` volume mount, and `PATH` env all present in rendered output
- `kubectl exec` into the gateway pod, run `codex --version` and `commonly --version`, follow the runbook to attach + run, verify `@codex` round-trip in a dev pod
- … `~/.codex/auth.json` path is populated when the codex CLI starts (the `lifecycle.postStart` copy from `/state/.codex/auth.json`)

🤖 Generated with Claude Code
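The tool-verification item in the test plan can be scripted as a quick pre-check. This is a rough sketch, assuming the binaries land in `/tools/bin` as described; the function name `check_tools` and the `TOOLS_BIN` variable are hypothetical. It only checks that the files exist and are executable, so running `codex --version` and `commonly --version` inside the pod is still the real test.

```shell
#!/bin/sh
# Sketch: report which of the two expected CLIs are present in the tools dir.
# check_tools and TOOLS_BIN are illustrative names, not part of the chart.

check_tools() {
  tools_bin=${1:-/tools/bin}
  missing=0
  for tool in codex commonly; do
    if [ -x "$tools_bin/$tool" ]; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  # Exit status is the number of missing tools (0 means both are present).
  return "$missing"
}

check_tools "${TOOLS_BIN:-/tools/bin}" || echo "some tools are missing" >&2
```

Because the installer is allowed to soft-fail, a check like this is one way to tell a healthy pod from one that started without the tools.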