todo: document zombie reaping gap in orphan_reaper#223

Open
dangtrivan15 wants to merge 7 commits into AutoForgeAI:master from dangtrivan15:fix/zombie-reaping

@dangtrivan15

Summary

  • Adds a TODO documenting that the orphan reaper (introduced in #222, "fix: eliminate orphaned process leaks causing OOM kills") kills live orphaned processes but cannot clean up zombie processes (state Z)
  • Zombies occur in containerized deployments where PID 1 is not a proper init (no tini/dumb-init) and never calls waitpid()
  • Documents two known solutions: container-side (tini) and code-side (PR_SET_CHILD_SUBREAPER + SIGCHLD handler)
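
For illustration, the code-side option could look roughly like the sketch below. It is not taken from this PR: `become_subreaper`, `reap_zombies`, and `install_sigchld_reaper` are hypothetical names, and the sketch assumes a Linux kernel (3.4 or later) where `prctl(PR_SET_CHILD_SUBREAPER)` is available. It only helps if the process that runs it is an ancestor of the orphans, so that they are reparented to it rather than to PID 1.

```python
import ctypes
import os
import signal
import sys

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper() -> bool:
    """Ask the kernel to reparent orphaned descendants to this process.

    Returns True on success, False on non-Linux platforms or on failure.
    """
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == 0


def reap_zombies() -> list:
    """Collect every child that has already exited, without blocking.

    Returns the list of PIDs that were reaped by this call.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def install_sigchld_reaper() -> None:
    """Reap exited children whenever SIGCHLD arrives (call from the main thread)."""
    signal.signal(signal.SIGCHLD, lambda signum, frame: reap_zombies())
```

One caveat with this approach: `os.waitpid(-1, ...)` also collects children started through `subprocess.Popen`, so their real exit status is lost and `Popen.wait()` falls back to reporting 0. A production version would need to account for that, which is one reason the container-side fix (tini or dumb-init as PID 1) is often simpler.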

Context

Observed in a K8s deployment: after PR #222, orphaned chrome/node/esbuild processes are correctly terminated (they become zombies), but the zombie entries accumulate in the PID table because the container's PID 1 (node via autoforge-bin) doesn't reap them.

Zombies hold no user-space memory (RSS=0), so this does not contribute to the OOM issue that #222 addressed. However, each zombie still occupies a PID, so accumulation could eventually exhaust the PID table in long-running containers.
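
To check whether zombies are accumulating in a given container, something like the following sketch can be used. It is an illustration only (the helper names are hypothetical and not part of this PR) and assumes a Linux `/proc` filesystem. It reads `/proc/<pid>/stat`, where the field after the parenthesised command name is the process state and the next one is the parent PID.

```python
import os


def parse_proc_stat(text: str) -> tuple:
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat.

    The command name may itself contain spaces or parentheses, so it is
    delimited by the first "(" and the last ")".
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    fields = text[rparen + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def list_zombies(proc_root: str = "/proc") -> list:
    """Return sorted (pid, comm, ppid) tuples for every process in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state, ppid = parse_proc_stat(fh.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process vanished or is not readable
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return sorted(zombies)


if __name__ == "__main__":
    for pid, comm, ppid in list_zombies():
        print(f"zombie pid={pid} comm={comm} ppid={ppid}")
```

Note that the kernel truncates the command name in this file to 15 characters, so matching on names such as `chrome` or `esbuild` works, but longer names would need a prefix match.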

Relates to #222, #164, #197.

🤖 Generated with Claude Code

dangtrivan15 and others added 7 commits March 8, 2026 19:52
kill_process_tree() now uses os.killpg() when the subprocess was started
with start_new_session=True, killing the entire process group atomically
before falling back to psutil tree walk. This eliminates the race where
children get reparented to PID 1 before psutil.children() runs.

Addresses AutoForgeAI#164, AutoForgeAI#197.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each subprocess now runs in its own process group, enabling atomic
cleanup via os.killpg(). This prevents children from escaping
kill_process_tree() by getting reparented to PID 1.
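
The two changes described above (spawning with `start_new_session=True` and signalling the group with `os.killpg()`) fit together roughly as in this sketch. It is a simplified illustration, not the project's `kill_process_tree()`: the function names are hypothetical, the psutil fallback is omitted, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def spawn_in_own_group(argv) -> subprocess.Popen:
    """Start a subprocess as the leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True)


def kill_process_group(proc: subprocess.Popen, timeout: float = 5.0) -> None:
    """Send SIGTERM to the whole group of `proc`, escalating to SIGKILL on timeout."""
    if proc.poll() is not None:
        return  # already exited and reaped; its PID may have been reused
    try:
        pgid = os.getpgid(proc.pid)
    except ProcessLookupError:
        return
    if pgid == os.getpgrp():
        # The child shares our own group, so a group signal would hit us too.
        proc.terminate()
        proc.wait()
        return
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group disappeared between the lookup and the signal
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()


if __name__ == "__main__":
    child = spawn_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    kill_process_group(child)
    print("exit status:", child.returncode)
```

Because the signal is addressed to the group rather than to individual PIDs, it reaches grandchildren regardless of which process is currently their parent, which is what closes the reparenting race. Descendants that move themselves into a different group or session are not covered by it.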

6 spawn sites updated: 4 in parallel_orchestrator.py,
1 in process_manager.py, 1 in dev_server_manager.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
killProcess() now sends SIGTERM to -pid (the process group), ensuring
all child processes are terminated on Ctrl+C or SIGTERM. Uvicorn is
spawned with detached:true to create a new process group on Unix.
Also switches from execSync to execFileSync to avoid shell injection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scans every 60s for chrome/node/esbuild processes orphaned under PID 1
and kills them after a 30s grace period. Only active on Linux containers.
This is defense-in-depth: catches any orphans that escape the
process-group kill mechanism.
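
The commit does not say how the grace period is tracked. One way to do it, shown purely as an assumption, is to remember when each orphan PID was first seen and only report it once it has stayed orphaned for the full grace period. `GraceTracker` is a hypothetical name; the caller would supply the PIDs found under PID 1 on each 60s scan.

```python
import time


class GraceTracker:
    """Track how long each PID has been continuously observed as an orphan."""

    def __init__(self, grace_seconds: float = 30.0):
        self.grace_seconds = grace_seconds
        self._first_seen = {}  # pid -> timestamp of first sighting

    def update(self, orphan_pids, now=None) -> list:
        """Record the current scan and return the PIDs whose grace period has expired."""
        now = time.monotonic() if now is None else now
        current = set(orphan_pids)
        # Forget PIDs that are no longer orphaned (exited or re-adopted).
        for pid in list(self._first_seen):
            if pid not in current:
                del self._first_seen[pid]
        # Start the clock for newly observed orphans.
        for pid in current:
            self._first_seen.setdefault(pid, now)
        return sorted(
            pid for pid, seen in self._first_seen.items()
            if now - seen >= self.grace_seconds
        )
```

A PID can be reused by an unrelated process between scans, so a more careful version would key on the PID together with the process start time instead of the PID alone.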

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The orphan reaper kills live orphaned processes but cannot clean up
zombies (state Z) when the container PID 1 doesn't call waitpid().
Documents known container-side and code-side solutions.

Relates to AutoForgeAI#222.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>