feat(adr-005): Stage 2 — codex CLI + @commonly/cli in the gateway pod#236

Closed
samxu01 wants to merge 2 commits into main from feat/adr-005-stage-2-codex-image

samxu01 commented Apr 25, 2026

Summary

ADR-005 Stage 2: ship codex and commonly CLIs into the gateway pod so dev agents can eventually mention @codex instead of calling acpx_run.

Two pieces, both small:

  1. Helm chart (k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml) — adds a codex-tools-installer init container that runs npm install --global for @openai/codex@latest and @commonly/cli@latest into a shared emptyDir volume codex-tools, mounted at /tools in the main gateway container. The main container's PATH is prepended with /tools/bin so the binaries are reachable by kubectl exec and any future run-loop process.

  2. Runbook (docs/runbooks/codex-in-gateway-pod.md) — the operator bootstrap: shell into the pod, verify tools, commonly login, commonly agent attach codex --pod <id>, commonly agent run codex, smoke-mention @codex, see pong come back. Also covers the dev-agent HEARTBEAT cutover plan (one agent at a time, revert on parity break).
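As a rough illustration of the PATH change in item 1, the effect inside the main container is similar to the function below. `prepend_path` is a hypothetical helper written for this sketch; the chart sets PATH through the container's env rather than through a script, and the base PATH shown is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to a PATH-style string
# unless it is already present. Prints the resulting value.
prepend_path() {
  dir="$1"
  current="$2"
  case ":${current}:" in
    *":${dir}:"*) printf '%s\n' "$current" ;;              # already present, unchanged
    *) printf '%s\n' "${dir}${current:+:${current}}" ;;    # put dir first
  esac
}

# Assumed base PATH; the real container image may differ.
prepend_path /tools/bin "/usr/local/bin:/usr/bin:/bin"
```

Putting `/tools/bin` first means the init-container-installed binaries win over any same-named binary already in the image.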

Why an init container, not a Dockerfile change

_external/clawdbot is a submodule on Team-Commonly/openclaw. Touching its Dockerfile would require a fork PR + a submodule-pointer bump here. The init-container path lives entirely in the commonly chart and ships in one PR. Trade-off: every pod start re-downloads the two packages (~30s). emptyDir is intentional — simpler than caching; revisit if restart latency becomes a real concern.

Auth.json reuse

The existing clawdbot-auth-seed init container already provisions chatgpt account-1's codex auth.json to /state/.codex/auth.json, and the gateway container's lifecycle.postStart copies it to ~/.codex/auth.json. The wrapper reuses that — no new ESO secret required for Stage 2 minimum.

Trade-off: shared quota with the existing acpx_run path. A dedicated codex account for the wrapper is a follow-up if it becomes a bottleneck (separate auth.json mounted at a non-shared path, run loop with CODEX_HOME env var pointing at it).
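A minimal sketch of the copy step and the dedicated-account follow-up, assuming the paths named above. `seed_codex_auth` is a hypothetical function, the `chmod 600` is this sketch's choice rather than something stated in the chart, and the dedicated-account paths in the comments are placeholders.

```shell
#!/bin/sh
# Hypothetical sketch of the postStart copy described above.
# Copies a seeded auth.json into a codex home directory, creating it if needed.
seed_codex_auth() {
  src="$1"         # e.g. /state/.codex/auth.json
  codex_home="$2"  # e.g. "$HOME/.codex"
  if [ ! -s "$src" ]; then
    echo "seed_codex_auth: $src missing or empty" >&2
    return 1
  fi
  mkdir -p "$codex_home" || return 1
  cp "$src" "$codex_home/auth.json" || return 1
  chmod 600 "$codex_home/auth.json"   # credentials: owner read/write only
}

# Shared-account case (Stage 2):
#   seed_codex_auth /state/.codex/auth.json "$HOME/.codex"
# Dedicated-account follow-up (placeholder paths, not from the chart):
#   seed_codex_auth /state/.codex-wrapper/auth.json "$HOME/.codex-wrapper"
#   CODEX_HOME="$HOME/.codex-wrapper" commonly agent run codex
```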

Why the run loop isn't auto-started yet

Auto-starting commonly agent run codex at container start needs a runtime token + an attached agent name to already exist. Bootstrap is a one-time operator step. Stage 2 ships the substrate; auto-start is a follow-up after the manual flow proves out end-to-end.

What this PR does NOT do

  • Doesn't flip deploy-dev.yml to push-on-main. The cluster currently has a LiteLLM CrashLoopBackOff (separate ops issue) that fails helm upgrade --wait. Auto-deploying on every merge would just queue failures. File a follow-up once LiteLLM is unstuck.
  • Doesn't replace any dev agent's acpx_run calls. That's the cutover work the runbook describes — one agent at a time, after this PR + the manual bootstrap land cleanly.
  • Doesn't touch the _external/clawdbot submodule.

Test plan

  • helm template renders cleanly (exit 0)
  • init container, /tools volume mount, and PATH env all present in rendered output
  • Manual smoke after merge + deploy: kubectl exec into the gateway pod, run codex --version and commonly --version, follow the runbook to attach + run, verify @codex round-trip in a dev pod
  • Operator validation that the existing ~/.codex/auth.json path is populated when the codex CLI starts (the lifecycle.postStart copy from /state/.codex/auth.json)
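The manual smoke check could be scripted along these lines. This is a sketch, not part of the runbook: `check_tools` is a hypothetical helper that only confirms the binaries resolve on PATH before the `--version` calls are made.

```shell
#!/bin/sh
# Hypothetical check: report whether each named tool resolves on PATH.
# Returns non-zero if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool -> $(command -v "$tool")"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Inside the gateway pod one might run:
#   check_tools codex commonly && codex --version && commonly --version
```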

🤖 Generated with Claude Code

samxu01 and others added 2 commits April 25, 2026 16:22
Adds a `codex-tools-installer` init container to the clawdbot-gateway
deployment that installs `@openai/codex` and `@commonly/cli` into a
shared `/tools` volume the main container mounts (read-only, with
`PATH=/tools/bin:...` prepended). This is the bridge that lets dev
agents (theo / nova / pixel / ops) eventually replace `acpx_run` with
`@codex` mentions per ADR-005.

Why an init container instead of modifying the openclaw image:
`_external/clawdbot` is a submodule on the Team-Commonly/openclaw fork.
Touching its Dockerfile would require a fork PR + a submodule pointer
bump in this repo. The init-container path lives entirely in the
commonly chart and ships with this deploy. Trade-off: every pod start
re-downloads the two packages (~30s); emptyDir is intentional, simpler
than caching, revisit if restart latency becomes a real concern.

Auth.json reuse: the existing `clawdbot-auth-seed` init container
already provisions chatgpt account-1's codex `auth.json` to
`/state/.codex/auth.json`, and the gateway container's
`lifecycle.postStart` copies it to `~/.codex/auth.json`. The wrapper
reuses that — no new ESO secret required for Stage 2 minimum. A
dedicated codex account for the wrapper is a follow-up if the shared
quota becomes a bottleneck.

Run-loop start is operator-driven for now (`commonly agent attach codex`
+ `commonly agent run codex` inside the pod). Auto-start in the
container lifecycle is a follow-up — wanted to ship the substrate
first so the manual flow can be validated end-to-end.

Runbook at docs/runbooks/codex-in-gateway-pod.md covers the operator
bootstrap end-to-end: shell into the pod, verify the tools, login the
commonly CLI, attach the agent, start the run loop, smoke. Also covers
the dev-agent HEARTBEAT cutover plan (one agent at a time, revert on
parity break).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer caught the deployment-blocking issue: @commonly/cli is NOT
published to npm (verified — npm 404). The init container as written
would fail every gateway-pod start.

Fixes (all from the code-reviewer pass):

1. CRITICAL — install @commonly/cli from source. The cli/ subdirectory
   is a self-contained ~200KB package with one runtime dep (commander).
   Init container now apt-installs git, clones this repo at the pinned
   ref, copies cli/ into /tools/lib/commonly-cli, runs npm install
   --omit=dev, and symlinks the bin. @openai/codex still installs from
   npm (it IS published, current 0.125.0).

2. IMPORTANT — soft-fail on install error. Wrap the install logic in a
   function called via `install_codex_tools || { echo ...; exit 0; }`
   so a transient npm/github outage at pod-restart time doesn't strand
   the gateway pod. The run loop is operator-started (Stage 2), and
   acpx_run continues to work as a fallback. A hard-failing init
   container would take down agent routing for an outage in one
   optional capability.

3. IMPORTANT — pin versions. New `agents.clawdbot.codexTools` block
   in values.yaml: codexVersion: "0.125.0", commonlyCliRef: "main".
   Operator can pin to a SHA or tag without a chart PR.

4. IMPORTANT — runbook contradicted ADR-005 invariant #4 ("two terminals
   for the same agent are unsupported"). Replaced the "run multiple
   instances" suggestion with an explicit "don't do this in v1; needs
   a different agent identity for higher throughput" note.

5. NIT — switched the verification step from `codex login status`
   (output format depends on codex version) to `codex --version`,
   which is what the adapter's detect() uses too. Also rewrote the
   runbook's polling-URL phrasing so a self-hosted operator isn't
   confused by the dev-instance hostname showing up.

helm template renders OK; version pin substitutes correctly:
  "@openai/codex@0.125.0"
  git clone --depth 1 --branch "main" ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samxu01 commented Apr 25, 2026

Self-review pass via code-reviewer subagent — addressed the critical + 3 important + 1 nit in b5ef9f844c.

| Finding | Severity | Resolution |
| --- | --- | --- |
| `@commonly/cli@latest` is NOT on npm (verified: `npm view` returns 404). The init container as written would fail every gateway-pod start. | Critical | Install `@commonly/cli` from source instead. The init container apt-installs git, clones this repo at the pinned ref, copies `cli/` into `/tools/lib/commonly-cli`, runs `npm install --omit=dev`, and symlinks the bin. `@openai/codex` still installs from npm (it IS published). |
| A hard-failing init container strands the gateway pod on any transient npm/GitHub outage. | Important | Wrapped the install logic in `install_codex_tools \|\| { echo failed; exit 0; }`. `acpx_run` continues working as a fallback; the operator can re-trigger by deleting the pod. |
| Floating `@latest` versions mean a bad upstream release breaks gateway restarts. | Important | New `agents.clawdbot.codexTools` block in values.yaml: `codexVersion: "0.125.0"`, `commonlyCliRef: "main"`. The operator can pin to a SHA or tag without a chart PR. |
| The runbook said "run multiple `commonly agent run codex` instances", which directly contradicts ADR-005 invariant #4 (two terminals for the same agent are unsupported in v1). | Important | Rewrote the paragraph as an explicit "don't do this in v1" note with a follow-up suggestion (a separate agent identity, e.g. codex-2, for true parallelism). |
| `codex login status` output format depends on the codex version. | Nit | Switched verification to `codex --version` (matches what the adapter's `detect()` uses). |
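The soft-fail resolution can be sketched as below. `install_codex_tools` here is a stand-in that simply runs the command it is given so the failure path can be exercised; the real function runs the npm and git steps. `run_installer_soft_fail` and the warning text are hypothetical.

```shell
#!/bin/sh
# Stand-in for the real installer: runs whatever command it is given.
install_codex_tools() {
  "$@"
}

# Run the installer in a subshell; on failure, warn and still exit 0 so the
# init container does not block the pod from starting.
run_installer_soft_fail() (
  install_codex_tools "$@" || {
    echo "WARN: codex tools install failed; continuing without them" >&2
    exit 0
  }
)

# Simulate a failed install with `false`; the reported status is still 0.
run_installer_soft_fail false
echo "init container exit status: $?"
```

The cost of this pattern is that a failed install is silent at the pod level: `/tools/bin` may be empty even though the pod is Ready, so checking `codex --version` after a restart remains necessary.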

Reviewer's other items (PATH default, runtime UID, api-dev.commonly.me hostname) acknowledged as non-issues / borderline-but-fine in context.

helm template renders OK with the new pin substitutions:

"@openai/codex@0.125.0"
git clone --depth 1 --branch "main" ...
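One detail that may be worth checking when pinning: `git clone --branch` accepts a branch or tag name but not a commit SHA, so a SHA pin would likely need a fetch-by-commit path instead. The sketch below is an assumption about how an installer could choose between the two, not what the chart does; `clone_plan` and the `REPO_URL` placeholder are hypothetical, and the helper only prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: print the git commands an installer could use for a
# given ref. A 40-character lowercase hex string is treated as a commit SHA;
# anything else is treated as a branch or tag name.
clone_plan() {
  repo="$1"; ref="$2"; dest="$3"
  if printf '%s' "$ref" | grep -Eq '^[0-9a-f]{40}$'; then
    # --branch does not accept a commit SHA, so fetch the commit directly.
    echo "git init -q $dest"
    echo "git -C $dest fetch --depth 1 $repo $ref"
    echo "git -C $dest checkout -q FETCH_HEAD"
  else
    echo "git clone --depth 1 --branch $ref $repo $dest"
  fi
}

clone_plan REPO_URL main /tmp/commonly-src
```

Whether a server allows fetching an arbitrary commit by SHA depends on its configuration, so the SHA branch should be tried against the actual remote before relying on it.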

CI re-running on the review-fix commit.

samxu01 added a commit that referenced this pull request Apr 25, 2026
…#236)

Adds a `codex-tools-installer` init container to the clawdbot-gateway
deployment that installs `@openai/codex` (pinned via
`agents.clawdbot.codexTools.codexVersion`) and the `@commonly/cli`
(installed from this repo's `cli/` subdirectory at the pinned git ref)
into a shared `/tools` volume the main container mounts read-only with
`/tools/bin` prepended to PATH.

This is the bridge ADR-005 Stage 2 needs so dev agents (theo / nova /
pixel / ops) can eventually mention `@codex` instead of calling
`acpx_run`. The wrapper itself shipped in PR #231; this PR puts the
substrate where it can run.

Why an init container, not a Dockerfile change: `_external/clawdbot`
is a submodule on the openclaw fork; touching its Dockerfile would
require a fork PR + a submodule pointer bump. The init container path
lives entirely in the commonly chart and ships with this deploy.

Why @commonly/cli installs from source: it's not on npm yet (ADR-005
Phase 4 publication hasn't shipped). The cli/ subdirectory is a
self-contained ~200KB package with one runtime dep. The init container
apt-installs git, clones this repo at the pinned ref, copies cli/ into
/tools/lib/commonly-cli, runs npm install --omit=dev, and symlinks the
bin. Pin the ref to a SHA or tag in values.yaml when stability matters.
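The copy-and-symlink part of that sequence might look roughly like this. It is a sketch under assumptions: `stage_cli` is a hypothetical function, the entry-point path inside `cli/` is passed as a parameter because this PR does not state it, and the `npm install --omit=dev` step is left as a comment since it needs network access.

```shell
#!/bin/sh
# Hypothetical sketch of staging the CLI into the shared tools volume.
stage_cli() {
  src_dir="$1"     # checked-out cli/ directory
  tools_root="$2"  # e.g. /tools
  entry="$3"       # entry script relative to cli/ (an assumption)
  lib_dir="$tools_root/lib/commonly-cli"
  mkdir -p "$lib_dir" "$tools_root/bin" || return 1
  cp -R "$src_dir/." "$lib_dir/" || return 1
  # The real installer would run `npm install --omit=dev` in $lib_dir here.
  chmod +x "$lib_dir/$entry" || return 1
  ln -sf "$lib_dir/$entry" "$tools_root/bin/commonly"
}

# Example with placeholder paths:
#   stage_cli /tmp/commonly-src/cli /tools path/to/entry-script
```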

Soft-fail: if the npm registry or GitHub is unreachable at pod-start
time, the installer falls through with a warning rather than failing the
init container. Gateway routing keeps working (acpx_run continues as
fallback); operator can re-trigger on the next pod restart. Hard-failing
the gateway pod for a transient outage in one optional capability would
strand all agent traffic, which is the wrong trade-off.

Auth.json reuse: the existing `clawdbot-auth-seed` init container
already provisions chatgpt account-1's codex `auth.json` to
`/state/.codex/auth.json`, and the gateway container's
`lifecycle.postStart` copies it to `~/.codex/auth.json`. The wrapper
reuses that — no new ESO secret. Trade-off: shared quota with the
existing acpx_run path. A dedicated codex account for the wrapper is a
follow-up if it becomes a bottleneck.

Run loop is operator-driven for now (`commonly agent attach codex` +
`commonly agent run codex` inside the pod). Auto-start in the container
lifecycle is a follow-up after the manual flow validates end-to-end.

Runbook at `docs/runbooks/codex-in-gateway-pod.md` covers the operator
bootstrap end-to-end and the dev-agent HEARTBEAT cutover plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samxu01 commented Apr 25, 2026

Squashed and merged manually as commit ff5bc15 to preserve authorship per memory/feedback-pr-merge-pattern.md.
