181 changes: 181 additions & 0 deletions docs/runbooks/codex-in-gateway-pod.md
@@ -0,0 +1,181 @@
# Running codex (local-CLI wrapper) inside the gateway pod

ADR-005 Stage 2. The `clawdbot-gateway` pod now ships `codex` and `commonly`
binaries via an init container (`codex-tools-installer`) and a shared
`/tools` volume on `PATH`. This runbook covers the operator steps to wire
a `codex` agent into a Commonly pod (a chat space, not the Kubernetes pod)
and start the run loop manually.

This is the bridge that lets dev agents (theo / nova / pixel / ops) replace
their `acpx_run` calls with `@codex` mentions — codex itself runs as a
first-class Commonly agent in a pod the dev agent shares.

## Prerequisites

- Helm release `commonly-dev` includes the chart at or after the commit that
added the `codex-tools-installer` init container (`feat/adr-005-stage-2-codex-image`).
- Codex `auth.json` is already provisioned by the existing `clawdbot-auth-seed`
init container (`/state/.codex/auth.json` → copied to `~/.codex/auth.json`
via the gateway container's `lifecycle.postStart`). No new secret needed —
the wrapper reuses the same chatgpt account #1 the existing acpx_run path
uses.
- Backend image on dev includes `POST /api/agents/runtime/room` and the
agent-room 1:1 enforcement (PR #232 + #235).

## One-time bootstrap

Pick a dev pod or Agent DM where the codex agent should be installed (one
per developer is fine; the wrapper serves multiple sessions per ADR-005's
session-per-pod model).

### 1. Open a shell in the gateway pod

```bash
GATEWAY_POD=$(kubectl get pod -n commonly-dev -l app=clawdbot-gateway -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n commonly-dev -it "$GATEWAY_POD" -- bash
```
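
`.items[0]` can select a pod that is still Pending during a rollout. A minimal sketch of a helper that only considers Running pods, assuming the same namespace and label as above; the function name `gateway_pod` is hypothetical and not part of any existing tooling:

```bash
# Print the name of the first Running gateway pod.
# $1: namespace (defaults to commonly-dev, as used in this runbook).
gateway_pod() {
  namespace="${1:-commonly-dev}"
  kubectl get pod -n "$namespace" -l app=clawdbot-gateway \
    --field-selector=status.phase=Running \
    -o jsonpath='{.items[0].metadata.name}'
}
```

With it defined on your workstation, `GATEWAY_POD=$(gateway_pod)` can stand in for the first command above.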

### 2. Verify the tools landed

```bash
codex --version # codex-cli 0.125.0 (or whatever is pinned in values)
commonly --version # 0.1.0
```

If either fails, the init container's soft-fail path may have triggered
(npm registry unreachable, github clone failed, etc.) — the gateway is
still up but `/tools` is empty. Check:

```bash
kubectl logs -n commonly-dev "$GATEWAY_POD" -c codex-tools-installer
```

Look for `[codex-tools-installer] install failed` lines. To re-run the
init container, restart the pod from your workstation (not from the pod
shell): `kubectl delete pod -n commonly-dev "$GATEWAY_POD"`.
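
The two version checks can be folded into one pass. A minimal sketch, run from the pod shell, that reports which expected binaries are missing from `PATH`; `missing_tools` is a hypothetical helper name:

```bash
# Print each tool that is not on PATH, one per line.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `missing_tools codex commonly || echo "check the codex-tools-installer logs"` prints the missing names followed by the hint.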

### 3. Authenticate the commonly CLI to api-dev

The wrapper needs a USER token (not the agent runtime token) to call
`commonly agent attach`. Get one from your dev account and save it.

```bash
commonly login --instance https://api-dev.commonly.me --key dev
# enter email + password at the prompts
```

(Inside a non-TTY exec, you'd pipe email/password via stdin — but the
operator step is interactive.)

### 4. Attach the codex agent to a pod

Pick a target pod ID (e.g., a dev-team chat pod or an Agent DM created
via the Agent Hub "Talk to" button). Then:

```bash
commonly agent attach codex \
--pod <podId> \
--name codex \
--instance dev
```

This:
- Registers the codex agent in the kernel (`AgentInstallation` row)
- Mints a runtime token at `~/.commonly/tokens/codex.json`
- Reuses the codex CLI's existing `auth.json` for actual model access
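
Before starting the run loop, it can help to confirm the attach step actually wrote a token. A minimal sketch, assuming the `~/.commonly/tokens/<name>.json` layout listed above; `runtime_token_ready` is a hypothetical helper and it only checks that the file exists and is non-empty, not that the token is valid:

```bash
# Return 0 if a non-empty runtime token file exists for the given agent name.
runtime_token_ready() {
  agent_name="$1"
  token_file="${HOME}/.commonly/tokens/${agent_name}.json"
  [ -s "$token_file" ]
}
```

For example, `runtime_token_ready codex || echo "re-run commonly agent attach"`.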

### 5. Start the run loop

```bash
nohup commonly agent run codex > /tmp/commonly-codex-run.log 2>&1 &
```

Or in a tmux session if you want to watch it:

```bash
tmux new -s codex
commonly agent run codex
# Ctrl+b d to detach
```

The run loop polls the instance you logged in to with `commonly login`
(here, api-dev; on a self-hosted setup, whatever URL you used), spawns
codex on each `chat.mention` / `dm.message` event, and posts the response
back to the originating pod. Per ADR-005 §Spawning semantics, one process
serializes spawns: concurrent events queue and run one at a time.

**Don't run two `commonly agent run codex` processes for the same agent
name.** ADR-005 invariant #4 explicitly calls this out as unsupported in
v1: each `run` would poll, ack, and post independently, producing
duplicate replies. Higher throughput needs a different agent identity
(separate `commonly agent attach codex-2 ...`) — file as a follow-up if
the single-process throughput becomes a real bottleneck.
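
One way to make an accidental second start fail fast is a lock directory, since `mkdir` is atomic. This is a sketch under stated assumptions, not something the CLI provides: `run_exclusive` and the lock path are hypothetical, and a lock left behind by a killed process has to be removed by hand.

```bash
# Run a command only if no other caller holds the lock directory.
# $1: lock directory path; remaining args: the command to run.
run_exclusive() {
  lock_dir="$1"
  shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "lock $lock_dir is held; refusing to start a second run loop" >&2
    return 1
  fi
  status=0
  "$@" || status=$?
  # Release the lock once the command exits, whatever its status.
  rmdir "$lock_dir"
  return "$status"
}
```

Inside the tmux session from step 5, this would be invoked as `run_exclusive /tmp/commonly-codex-run.lock commonly agent run codex`.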

### 6. Smoke test

In the target pod, mention `@codex` from any human or agent member:

```
@codex please reply with the single word: pong
```

Within a minute, codex should post `pong` back. Tail the log to confirm:

```bash
tail -f /tmp/commonly-codex-run.log
```

Expected:

```
[codex] polling https://api-dev.commonly.me for events (ctrl+c to stop)
[codex] [chat.mention] spawning codex
[codex] [chat.mention] posted 4 bytes
```
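
For a scripted check instead of eyeballing `tail -f`, the log can be searched for the "posted" line. A minimal sketch, assuming the log format matches the expected output above (other CLI versions may log differently); `reply_posted` is a hypothetical helper name:

```bash
# Return 0 if the run-loop log contains a posted reply to a mention.
# $1: path to the run-loop log file.
reply_posted() {
  log_file="$1"
  grep -Eq '\[chat\.mention\] posted [0-9]+ bytes' "$log_file"
}
```

For example, `reply_posted /tmp/commonly-codex-run.log && echo "smoke test passed"`.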

## After bootstrap — letting dev agents use it

Once `@codex` is live in a dev pod, dev agents (theo / nova / pixel / ops)
can mention it from their HEARTBEAT.md template instead of calling
`acpx_run`. Cutover one agent at a time:

1. Edit the agent's HEARTBEAT.md in `backend/services/registry.js` (the
permanent source of truth — PVC edits get overwritten on
`reprovision-all`).
2. Replace any block that does `acpx_run({ agentId: "codex", ... })` with
`commonly_post_message({ podId, content: "@codex <prompt>" })` plus the
agent's pattern for reading the response on the next heartbeat tick.
3. Run `reprovision-all` so the new HEARTBEAT lands.
4. Watch the agent's next few heartbeats. Compare end-to-end behavior to
the prior `acpx_run` flow.

If parity holds over those heartbeats, roll out to the next agent.
If it doesn't, revert the HEARTBEAT change (a one-line revert) and
investigate before rolling out further.

## Operational caveats

- **Shared quota.** All `@codex` invocations use the existing chatgpt
account #1's quota — same one the LiteLLM rotator and acpx_run already
consume. Hitting the weekly cap will manifest as `turn.failed` JSONL
events with "usage limit" messages. The codex adapter (`cli/src/lib/adapters/codex.js`)
surfaces these as the agent's reply, so users see a clear error. To
raise the ceiling: add a dedicated codex account for the wrapper (a
follow-up PR — separate auth.json mounted at a non-shared path, run
loop with `CODEX_HOME` env var pointing at it).
- **Pod restart penalty.** The init container reinstalls `@openai/codex` +
`@commonly/cli` from npm on every pod start (~30s). emptyDir is
intentional — simpler than caching; revisit if restart latency hurts.
- **Run loop does not survive pod restarts.** This runbook describes a
non-daemonized start. A future iteration will move the run loop into the
pod's lifecycle so it auto-starts. Until then, after `kubectl delete pod`
or `helm upgrade`, re-run step 5.
- **Logs go to the pod's filesystem.** `/tmp/commonly-codex-run.log` is
ephemeral and is lost on pod restart. Stream to stdout if you want the
output in `kubectl logs`, or wire it through to a fluent-bit sidecar later.
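
Because the run loop is started by hand, it is easy to miss that it is gone after a restart. A minimal sketch of a liveness check, assuming `pgrep` (from procps) is available in the gateway image, which this runbook does not confirm; `run_loop_alive` is a hypothetical helper name:

```bash
# Return 0 if a run-loop process for the given agent name is running.
# Matches on the full command line, so it relies on the loop having been
# started exactly as `commonly agent run <name>`.
run_loop_alive() {
  pgrep -f "commonly agent run $1" >/dev/null 2>&1
}
```

For example, from the pod shell: `run_loop_alive codex || echo "run loop is down; re-run step 5"`.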

## Related

- `cli/src/lib/adapters/codex.js` — the adapter (PR #231)
- `cli/src/commands/agent.js` — `attach`, `run`, `detach` commands
- ADR-005 §Adapter pattern — invariants the adapter holds
- `_external/clawdbot/extensions/commonly/src/tools.ts` — the `acpx_run`
this is replacing (target for removal once all dev agents are cut over)
72 changes: 72 additions & 0 deletions k8s/helm/commonly/templates/agents/clawdbot-deployment.yaml
@@ -464,6 +464,61 @@ spec:
readOnly: true
- name: clawdbot-state
mountPath: /state
# ADR-005 Stage 2: install codex CLI + @commonly/cli into a shared
# volume the gateway container mounts. Avoids modifying the
# _external/clawdbot Dockerfile (a submodule on the openclaw fork) —
# the install lives in helm so it ships with the deploy.
#
# `node:22-bookworm-slim` matches the runtime container's Node major
# so any libc-linked binaries are compatible. `git` isn't in slim by
# default; we apt-install it because @commonly/cli isn't published
# to npm yet (ADR-005 Phase 4) and must be installed from source.
#
# Soft-fail: if the npm registry or github is unreachable at pod-
# restart time, the gateway should still start — the run loop is an
# operator-started manual step (Stage 2), and `acpx_run` continues
# to work as a fallback. A hard-failing init container would strand
# the entire gateway pod for an issue with one optional capability.
- name: codex-tools-installer
image: node:22-bookworm-slim
command:
- /bin/sh
- -c
- |
install_codex_tools() {
set -e
export NPM_CONFIG_PREFIX=/tools
apt-get update >/dev/null 2>&1
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
git ca-certificates >/dev/null 2>&1
# Pin codex by version (npm published) — see values.yaml.
npm install --global --no-audit --no-fund \
"@openai/codex@{{ .Values.agents.clawdbot.codexTools.codexVersion | default "0.125.0" }}"
# @commonly/cli — install from source at the pinned ref. The
# cli/ subdirectory is a self-contained npm package with one
# runtime dep (commander); ~200KB of source.
git clone --depth 1 --branch "{{ .Values.agents.clawdbot.codexTools.commonlyCliRef | default "main" }}" \
https://github.com/Team-Commonly/commonly.git /tmp/commonly-src
mkdir -p /tools/lib
cp -r /tmp/commonly-src/cli /tools/lib/commonly-cli
cd /tools/lib/commonly-cli
npm install --omit=dev --no-audit --no-fund
ln -sf /tools/lib/commonly-cli/src/index.js /tools/bin/commonly
chmod +x /tools/lib/commonly-cli/src/index.js
# Verify (best-effort: a failing version check is logged but does not
# fail the install).
/tools/bin/codex --version || true
/tools/bin/commonly --version || true
ls -la /tools/bin/ || true
}
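# `set -e` is ignored inside a function called on the left of `||`, so
# run the function in a subshell as a plain command and test its status.
( install_codex_tools )
status=$?
if [ "$status" -ne 0 ]; then
echo "[codex-tools-installer] install failed — gateway will start without /tools binaries"
echo "[codex-tools-installer] operator: 'commonly agent run codex' is unavailable until next pod restart"
fi
exit 0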
volumeMounts:
- name: codex-tools
mountPath: /tools
containers:
- name: clawdbot-gateway
image: "{{ .Values.agents.clawdbot.image.repository }}:{{ .Values.agents.clawdbot.image.tag }}"
@@ -485,6 +540,12 @@ spec:
- -c
- mkdir -p /home/node/.codex && cp /state/.codex/auth.json /home/node/.codex/auth.json 2>/dev/null || true
env:
# ADR-005 Stage 2: prepend the codex-tools volume to PATH so
# `codex` and `commonly` are on PATH for operator `kubectl exec`
# invocations and the run-loop process. Order matters — these
# binaries shadow any same-named ones in the openclaw image.
- name: PATH
value: "/tools/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
- name: CLAWDBOT_GATEWAY_PORT
value: {{ .Values.agents.clawdbot.config.gatewayPort | default 18789 | quote }}
- name: CLAWDBOT_GATEWAY_BIND
@@ -594,6 +655,12 @@ spec:
mountPath: /state
- name: clawdbot-workspace
mountPath: /workspace
# ADR-005 Stage 2 — codex + commonly CLIs installed by the
# codex-tools-installer init container. Mounted read-only here
# since nothing in the main container needs to write to it.
- name: codex-tools
mountPath: /tools
readOnly: true
volumes:
- name: clawdbot-config
configMap:
@@ -604,6 +671,11 @@ spec:
- name: clawdbot-workspace
persistentVolumeClaim:
claimName: clawdbot-workspace-pvc
# emptyDir is fine — re-populated on every pod start by the init
# container. Adds ~30s to startup but avoids a PVC + invalidation
# logic. Revisit if pod-restart latency becomes a real concern.
- name: codex-tools
emptyDir: {}
{{- with .Values.agents.clawdbot.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
8 changes: 8 additions & 0 deletions k8s/helm/commonly/values.yaml
@@ -188,6 +188,14 @@ agents:
repository: gcr.io/YOUR_GCP_PROJECT_ID/clawdbot-gateway
tag: latest
pullPolicy: IfNotPresent
# ADR-005 Stage 2 — codex + @commonly/cli installed by an init
# container into a shared /tools volume. codexVersion pins the
# @openai/codex npm package; commonlyCliRef pins the git ref this
# repo's cli/ subdirectory is installed from. @commonly/cli is not
# yet on npm (ADR-005 Phase 4), so we install from source.
codexTools:
codexVersion: "0.125.0"
commonlyCliRef: "main"
config:
gatewayPort: 18789
bridgePort: 18790