Symptom
When a per-user codex backend dies (timeout, /stop, /new, recycle, or any _kill / shutdown path), the inner Rust codex binary is left running as the target os_user. Reproduced on 2026-05-15 after a "Codex timed out" event:
$ ps | grep 'codex app-server'
45609 ?? S 0:02.59 .../codex-darwin-arm64/vendor/.../codex/codex app-server
PID 45608 (the node wrapper) is gone, but PID 45609 (the Rust binary that node fork-exec'd) survived. The Rust binary kept a session under ~daniel/.codex/sessions/... open and continued consuming whatever resources codex retains between turns. The orphan does not exit on its own, so orphans accumulate across recycles.
Root cause
The #456 escalation, ported to codex.py in PR #484, walks only ONE level below the sudo wrapper. Codex's actual process tree has two levels:
sudo (kai)
└─ node /Users/daniel/.npm-global/bin/codex app-server (daniel)
   └─ /Users/daniel/.npm-global/lib/.../codex/codex app-server (daniel)
pgrep -P <sudo_pid> returns the node PID. _send_signal then runs sudo -n -u daniel /bin/kill -SIGKILL <node_pid>. The node wrapper exits. Its child (the Rust binary) reparents to init and stays alive.
The claude.py escalation does not hit this case because the claude CLI does not have a node-wrapper layer — pgrep -P <sudo_pid> returns the actual claude PID directly. Codex's npm-global install introduces an extra hop the escalation logic was never written for.
Proposed fix
_lookup_inner_codex_pid and _async_lookup_inner_codex_pid in codex.py should walk one level deeper for codex: after finding the first child (node), do a second pgrep -P against that to find the Rust binary. If the second pgrep returns nothing, fall back to the first PID (single-binary install without the node wrapper).
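The two-level walk with its fallback can be sketched roughly as follows. This is a minimal illustration, not the code in codex.py: the function names, signatures, and the injectable children_of parameter are assumptions made so the logic can be exercised without spawning processes.

```python
import subprocess
from typing import Callable, List, Optional


def pgrep_children(pid: int) -> List[int]:
    """Return the direct children of pid via `pgrep -P`, or [] if there are none."""
    result = subprocess.run(
        ["pgrep", "-P", str(pid)],
        capture_output=True,
        text=True,
        check=False,  # pgrep exits non-zero when nothing matches
    )
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


def lookup_inner_codex_pid(
    sudo_pid: int,
    children_of: Callable[[int], List[int]] = pgrep_children,
) -> Optional[int]:
    """Hypothetical helper: walk sudo -> node wrapper -> Rust binary.

    Falls back to the first child when there is no grandchild
    (single-binary install without the node wrapper).
    """
    first = children_of(sudo_pid)
    if not first:
        return None  # sudo has no child, so there is nothing to signal
    wrapper_pid = first[0]
    second = children_of(wrapper_pid)
    return second[0] if second else wrapper_pid
```

With a fake tree such as {100: [200], 200: [300]} passed in through children_of, the helper returns 300. With only {100: [200]} it returns 200, which matches the single-binary fallback described above.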
Alternative shapes (combine or pick):
- Walk the entire descendant tree (pgrep -d ',' -P <pid>, recursively until no children remain) and kill them all. More defensive against future codex packaging changes.
- Use pkill -TERM -P <sudo_pid> semantics that signal all children at once. Note that -P matches direct children only, so reaching every descendant still means applying it at each level. Less precise on what we're killing.
The kill semantics also need to flip from killing the wrapper-PID to killing the leaf-PID, with a fallback kill of any intermediate wrapper PIDs to be safe.
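One possible shape for the "whole tree, leaf first" variant is sketched below. It assumes the tree is snapshotted before any signal is sent, since once the wrapper dies its child reparents to init and pgrep -P can no longer find it from the sudo PID. The function names and the children_of callback are hypothetical, and the argv mirrors the sudo /bin/kill invocation quoted in the root cause.

```python
from typing import Callable, List


def descendants_leaf_first(
    root_pid: int,
    children_of: Callable[[int], List[int]],
) -> List[int]:
    """Return every descendant of root_pid, each listed before its parent."""
    ordered: List[int] = []
    seen = {root_pid}

    def visit(pid: int) -> None:
        for child in children_of(pid):
            if child in seen:
                continue  # guard against a stale or recycled PID forming a loop
            seen.add(child)
            visit(child)  # recurse first so leaves are emitted before wrappers
            ordered.append(child)

    visit(root_pid)
    return ordered


def kill_commands(
    os_user: str,
    pids: List[int],
    signal_name: str = "SIGKILL",
) -> List[List[str]]:
    """Build one `sudo -n -u <os_user> /bin/kill` argv per PID, in the order given."""
    return [
        ["sudo", "-n", "-u", os_user, "/bin/kill", f"-{signal_name}", str(pid)]
        for pid in pids
    ]
```

For the tree in this issue, the snapshot yields the Rust binary first and the node wrapper second, so the leaf is signalled while its PID is still known and the wrapper kill acts as the fallback.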
The sudoers rule (daniel) NOPASSWD: /bin/kill already allows the bot to signal any daniel-owned process, so no install-time changes are needed.
Tests
- tests/test_codex.py: add a TestCodexGrandchildEscalation that mocks pgrep to return two levels of children (node-PID then Rust-PID) and asserts that _send_signal issues sudo kill against the Rust-PID (not the node-PID).
- Regression: existing tests assume a single-level child and need updating to reflect the two-level walk.
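A rough sketch of what such a test could look like is shown here. FakeBackend is a hypothetical stand-in, not the real backend class in codex.py, and its _pgrep_children and _run hooks are assumed names for whatever the real code uses to shell out. The sudo PID 45600 is made up for illustration; 45608 and 45609 are the PIDs from the reproduction above.

```python
import unittest
from unittest import mock


class FakeBackend:
    """Hypothetical stand-in for the codex backend (not the real codex.py API)."""

    def __init__(self, sudo_pid, os_user):
        self.sudo_pid = sudo_pid
        self.os_user = os_user

    def _pgrep_children(self, pid):
        raise NotImplementedError("patched in tests")

    def _run(self, argv):
        raise NotImplementedError("patched in tests")

    def _send_signal(self, signal_name):
        first = self._pgrep_children(self.sudo_pid)
        if not first:
            return
        second = self._pgrep_children(first[0])
        target = second[0] if second else first[0]  # leaf PID, else the first child
        self._run(
            ["sudo", "-n", "-u", self.os_user, "/bin/kill", f"-{signal_name}", str(target)]
        )


class TestCodexGrandchildEscalation(unittest.TestCase):
    def _signal_with_tree(self, tree):
        backend = FakeBackend(sudo_pid=45600, os_user="daniel")
        with mock.patch.object(
            backend, "_pgrep_children", side_effect=lambda pid: tree.get(pid, [])
        ), mock.patch.object(backend, "_run") as run:
            backend._send_signal("SIGKILL")
        return run

    def test_signals_rust_pid_not_node_pid(self):
        run = self._signal_with_tree({45600: [45608], 45608: [45609]})
        run.assert_called_once_with(
            ["sudo", "-n", "-u", "daniel", "/bin/kill", "-SIGKILL", "45609"]
        )

    def test_falls_back_to_first_child_without_wrapper(self):
        run = self._signal_with_tree({45600: [45608]})
        run.assert_called_once_with(
            ["sudo", "-n", "-u", "daniel", "/bin/kill", "-SIGKILL", "45608"]
        )
```

The second case covers the single-binary install, which is also the shape the existing single-level tests presumably exercise today.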
Acceptance
- After _kill on a codex backend, pgrep -u daniel codex returns no results.
- shutdown and restart paths leave no daniel-owned codex processes behind.
Workaround until merged
Operators with an orphaned codex Rust binary can kill it manually:
sudo -u <os_user> pkill -f 'codex/codex app-server'
This kills every codex Rust binary owned by <os_user>, regardless of parent. The persistent backend will spawn a fresh one on the next message.
Related