Skip to content

fix: reap completed process pool workers#243

Open
MTG-Thomas wants to merge 4 commits into
jackmusick:mainfrom
MTG-Thomas:codex/process-pool-reap
Open

fix: reap completed process pool workers#243
MTG-Thomas wants to merge 4 commits into
jackmusick:mainfrom
MTG-Thomas:codex/process-pool-reap

Conversation

@MTG-Thomas
Copy link
Copy Markdown
Contributor

Summary

  • reap raw forked process-pool children when they exit
  • join completed on-demand workers before removing them from pool state
  • avoid reporting a completed child as crashed when its result is already queued

Testing

  • python3 -m py_compile src/services/execution/process_pool.py tests/unit/execution/test_process_pool.py
  • PYTHONPATH=$PWD/api pytest api/tests/unit/execution/test_process_pool.py -q -k "PidWrapper or reaps_completed_child or queued_result or crash_detection_replaces_process or sigkilled_worker" (7 passed)

Note: the full test_process_pool.py run reached 42/43 in my ad hoc WSL venv; the remaining failure is an existing admission-control test trying to resolve redis:6379 without the compose test environment.

@MTG-Thomas MTG-Thomas requested a review from jackmusick as a code owner May 13, 2026 12:46
Comment thread api/src/services/execution/process_pool.py Fixed
@MTG-Thomas
Copy link
Copy Markdown
Contributor Author

Updated the PR to cover the clean-exit/result-delivery race that showed up in E2E. The new commit adds a short grace period for exit_code=0 workers whose result pipe has not delivered yet, while preserving the crash path after the grace period.

Verification:

  • CI is green at 9c8556f (lint/type check, unit, E2E, CodeQL).
  • Local Docker E2E on a Linux test VM: 1235 passed, 49 skipped.

Copy link
Copy Markdown
Owner

@jackmusick jackmusick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reaping logic and grace-window are well-scoped. Verified locally: full unit suite (3777) passes on the branch, including the 5 new regression tests pinning the clean-exit + queued-result race and the zombie-reap path. Important fix for on-demand mode as we lean on it harder for Chat V2 sandboxes.

@jackmusick jackmusick enabled auto-merge (squash) May 16, 2026 17:59
@jackmusick jackmusick added the automerge Kodiak: queue this PR for auto-rebase + squash-merge once approved and CI green label May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge Kodiak: queue this PR for auto-rebase + squash-merge once approved and CI green

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants