todo: document zombie reaping gap in orphan_reaper #223
Open
dangtrivan15 wants to merge 7 commits into AutoForgeAI:master from
kill_process_tree() now uses os.killpg() when the subprocess was started with start_new_session=True, killing the entire process group atomically before falling back to psutil tree walk. This eliminates the race where children get reparented to PID 1 before psutil.children() runs. Addresses AutoForgeAI#164, AutoForgeAI#197. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
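To illustrate the group kill described in this commit, here is a minimal sketch, not the PR's actual `kill_process_tree()`. The function name `kill_process_group` and the `grace` parameter are hypothetical, and the psutil tree-walk fallback is omitted. It assumes the child was started with `start_new_session=True`, so its PID doubles as its process-group ID.

```python
import os
import signal
import subprocess


def kill_process_group(proc: subprocess.Popen, grace: float = 3.0) -> None:
    """Terminate proc and every process in its group (POSIX only).

    Assumption: proc was spawned with start_new_session=True, so the
    child leads its own process group and pgid == proc.pid.
    """
    pgid = proc.pid
    try:
        # One signal reaches every member of the group, including
        # grandchildren that psutil.children() might miss after reparenting.
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the group is already empty

    try:
        proc.wait(timeout=grace)  # give the group leader time to exit cleanly
    except subprocess.TimeoutExpired:
        pass

    try:
        # Force-kill anything in the group that ignored SIGTERM.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # nothing left to kill
    proc.wait()  # reap the direct child so it does not linger as a zombie
```

Note that this only reaps the direct child. Grandchildren killed this way are reparented and must be reaped by whichever process adopts them.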
Each subprocess now runs in its own process group, enabling atomic cleanup via os.killpg(). This prevents children from escaping kill_process_tree() by getting reparented to PID 1. 6 spawn sites updated: 4 in parallel_orchestrator.py, 1 in process_manager.py, 1 in dev_server_manager.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
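As a rough sketch of what a spawn site looks like after this change (the helper name `spawn_in_own_group` is hypothetical; the real spawn sites pass their own arguments): `start_new_session=True` makes the child call `setsid()` before exec, so it becomes the leader of a new session and process group.

```python
import os
import subprocess


def spawn_in_own_group(cmd: list[str]) -> subprocess.Popen:
    """Start cmd as the leader of a new session and process group (POSIX)."""
    return subprocess.Popen(cmd, start_new_session=True)


def leads_own_group(proc: subprocess.Popen) -> bool:
    """True if proc is a group leader in a group other than ours."""
    pgid = os.getpgid(proc.pid)
    return pgid == proc.pid and pgid != os.getpgid(0)
```

Because the group ID equals the child's PID, cleanup code can later call `os.killpg(proc.pid, ...)` without tracking the group separately.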
killProcess() now sends SIGTERM to -pid (the process group), ensuring all child processes are terminated on Ctrl+C or SIGTERM. Uvicorn is spawned with detached:true to create a new process group on Unix. Also switches from execSync to execFileSync to avoid shell injection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scans every 60s for chrome/node/esbuild processes orphaned under PID 1 and kills them after a 30s grace period. Only active on Linux containers. This is defense-in-depth: catches any orphans that escape the process-group kill mechanism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
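The scan-then-kill logic could look roughly like the sketch below. It is not the PR's implementation: it reads `/proc` directly instead of using psutil, and the names `find_orphans`, `reap_once`, `TARGET_NAMES`, and `GRACE_SECONDS` are hypothetical. Matching on the kernel `comm` field is also an assumption about how process names are compared.

```python
import os
import signal

TARGET_NAMES = {"chrome", "node", "esbuild"}
GRACE_SECONDS = 30.0


def read_proc_stat(pid: int):
    """Return (comm, state, ppid) from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat", "rb") as f:
            data = f.read().decode("utf-8", "replace")
        # comm is wrapped in parentheses and may itself contain ')' or spaces.
        left, right = data.index("("), data.rindex(")")
        fields = data[right + 2:].split()
        return data[left + 1:right], fields[0], int(fields[1])
    except (OSError, ValueError, IndexError):
        return None


def find_orphans() -> list[int]:
    """PIDs of live target processes whose parent is PID 1."""
    orphans = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        info = read_proc_stat(int(entry))
        if info is None:
            continue
        comm, state, ppid = info
        # Zombies (state Z) are skipped: signals have no effect on them.
        if ppid == 1 and state != "Z" and comm in TARGET_NAMES:
            orphans.append(int(entry))
    return orphans


def reap_once(first_seen: dict, now: float, orphans, kill=os.kill) -> list[int]:
    """One scan step: kill orphans that have outlived the grace period."""
    for pid in list(first_seen):
        if pid not in orphans:
            del first_seen[pid]  # no longer orphaned, or already gone
    killed = []
    for pid in orphans:
        seen_at = first_seen.setdefault(pid, now)
        if now - seen_at >= GRACE_SECONDS:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # exited between the scan and the kill
            killed.append(pid)
            del first_seen[pid]
    return killed
```

A caller would run `reap_once(first_seen, time.monotonic(), find_orphans())` every 60 seconds. With a 30 second grace period, an orphan first seen on one scan is killed on the next.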
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The orphan reaper kills live orphaned processes but cannot clean up zombies (state Z) when the container PID 1 doesn't call waitpid(). Documents known container-side and code-side solutions. Relates to AutoForgeAI#222. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
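For reference, the code-side option mentioned here could be sketched as follows. This is an illustration under assumptions, not code from this PR: the function names are hypothetical, the `prctl` call goes through ctypes, and it is Linux-only (`PR_SET_CHILD_SUBREAPER` is 36 in `<linux/prctl.h>`).

```python
import ctypes
import os
import signal

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>, Linux 3.4+


def become_subreaper() -> None:
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_zombies() -> list[int]:
    """Collect every exited child without blocking; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def install_sigchld_reaper() -> None:
    """Reap whenever a child exits. Must be called from the main thread."""
    signal.signal(signal.SIGCHLD, lambda signum, frame: reap_zombies())
```

One caveat with this approach: `waitpid(-1, ...)` also collects children owned by `subprocess.Popen` objects in the same process, so their real exit status is lost to the `Popen` object. Running the reaping in a small dedicated parent process, or using `tini` as PID 1, avoids that conflict.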
Summary
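- The orphan reaper cannot clean up zombies when the container's PID 1 is not an init process (tini/dumb-init) and never calls waitpid()
- Documents known solutions: container-side (tini) and code-side (PR_SET_CHILD_SUBREAPER + SIGCHLD handler)

Context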
Observed in a K8s deployment: after PR #222, orphaned chrome/node/esbuild processes are correctly terminated (they become zombies), but the zombie entries accumulate in the PID table because the container's PID 1 (node via autoforge-bin) doesn't reap them.

Zombies don't consume memory (RSS=0), so this doesn't contribute to the OOM issue #222 addressed. However, zombie accumulation could eventually exhaust the PID table in long-running containers.
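To check how far the accumulation has gone in a running container, something like the sketch below could be used. The helper `count_zombies` is hypothetical and not part of this PR. It counts entries in state Z by reading `/proc/<pid>/stat`.

```python
import os


def count_zombies(proc_root: str = "/proc") -> int:
    """Count processes in state Z (zombie) under proc_root."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "stat"), "rb") as f:
                data = f.read().decode("utf-8", "replace")
            # The state letter is the first field after the closing ')'
            # of the command name, which may itself contain ')'.
            state = data[data.rindex(")") + 2:].split()[0]
        except (OSError, ValueError, IndexError):
            continue  # process vanished or the entry is unreadable
        if state == "Z":
            count += 1
    return count
```

The result can be compared against `/proc/sys/kernel/pid_max`, or against the pod's PID limit if one is configured, to judge how close the container is to exhaustion.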
Relates to #222, #164, #197.
🤖 Generated with Claude Code