fix: spawn supervisor child to avoid OpenSSL fork deadlock #55
Merged
4fe56a0 to 4bda435
The parent server opens Redis-over-TLS connections during lifespan
startup, which initialises OpenSSL state. OpenSSL is not fork-safe —
inheriting random-pool state and internal locks via `fork()` can
deadlock the child the first time it calls `ssl.SSLContext.__new__`
(observed as an intermittent "Supervisor child did not register a
node within 30s" handshake timeout, with the child stuck in
`ssl.create_default_context` under `redis-py`'s SSL connect path).
Use `mp.get_context("spawn")` for the supervisor `Process` and for
the IPC queues in `create_task_channel`. Spawn execs a fresh
interpreter so the child gets clean OpenSSL state. The IPC primitives
must be created from the same context — mixing fork-context
`SemLock`s with a spawn-context process raises a RuntimeError.
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
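The change described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `SPAWN_CTX`, `TaskChannel`, `start_supervisor`, `_supervisor_main`, and the simplified `create_task_channel` signature are hypothetical; only `mp.get_context("spawn")`, the context's `Queue()`, and its `Process(...)` are standard-library calls.

```python
import multiprocessing as mp
from dataclasses import dataclass
from multiprocessing.process import BaseProcess
from multiprocessing.queues import Queue

# One context for both the process and its IPC primitives.
SPAWN_CTX = mp.get_context("spawn")


@dataclass
class TaskChannel:
    """Hypothetical pair of queues shared by the parent and the supervisor child."""

    requests: Queue
    responses: Queue


def create_task_channel() -> TaskChannel:
    # Simplified stand-in: queues built from SPAWN_CTX carry spawn-compatible locks.
    return TaskChannel(requests=SPAWN_CTX.Queue(), responses=SPAWN_CTX.Queue())


def _supervisor_main(channel: TaskChannel) -> None:
    # Runs in a fresh interpreter, so no OpenSSL state is inherited from the parent.
    channel.responses.put("registered")


def start_supervisor(channel: TaskChannel) -> BaseProcess:
    # SPAWN_CTX.Process is SpawnProcess, a BaseProcess subclass, hence the wide annotation.
    process = SPAWN_CTX.Process(target=_supervisor_main, args=(channel,), daemon=True)
    process.start()
    return process
```

Because spawn re-imports the main module in the child, a script that calls `start_supervisor(create_task_channel())` should do so under an `if __name__ == "__main__":` guard.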
Newly disclosed CVE in diffusers 0.36.0, fixed in 0.38.0. Bumping is blocked by the same `safetensors>=0.8.0rc0` pre-release requirement that already gates GHSA-j7w6-vpvq-j3gm and GHSA-98h9-4798-4q5v; adding to the existing diffusers row block in the workflow and the advisory table. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
4bda435 to dd00769
`mp.get_context("spawn")` returns a singleton instantiated at
`multiprocessing` import time, so `@functools.cache` on a getter was
redundant and the "lazily constructed" framing was misleading. A
module-level constant is simpler, honest about what it is, and reads
naturally at call sites.
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
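The point of this commit can be checked with a short sketch. It assumes CPython's behaviour of returning the same context object from every `mp.get_context("spawn")` call; `get_spawn_context` and `SPAWN_CTX` are hypothetical stand-ins for the before and after shapes, not necessarily the PR's identifiers.

```python
import functools
import multiprocessing as mp


# Before (hypothetical): a cached getter that only looks lazy.
@functools.cache
def get_spawn_context():
    return mp.get_context("spawn")


# After (hypothetical): a plain module-level constant.
SPAWN_CTX = mp.get_context("spawn")

# Both names refer to the one context object multiprocessing already holds.
print(get_spawn_context() is SPAWN_CTX)
print(mp.get_context("spawn") is SPAWN_CTX)
```

Since the cache never saves any work here, the constant expresses the same thing with less indirection.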
Newly disclosed CVE in starlette 0.52.1, fixed in 1.0.1. Bumping is blocked by `gradio==5.50` (transitive via `vllm-omni==0.18`), which caps `starlette<1.0` — same chain that gates the existing gradio / vllm-omni CVE ignores. Add to the worker-GPU pip-audit invocation (where the failure surfaced) and document the row in the advisory table. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The previous commit added the ignore only to the worker-GPU step because that's where the CVE first surfaced. The lock then resolved starlette to a different (still <1.0.1) version on the server side, exposing the same advisory in the server pip-audit step. Same blocker (gradio 5.50 caps starlette<1.0 via vllm-omni 0.18 — already in the docs table). Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
timzsu (Collaborator) requested changes on May 22, 2026
One minor comment. PTAL.
run: |
  grep -v '@ git+' src/server/requirements.txt > /tmp/requirements-server-audit.txt
  uvx pip-audit==2.9.0 --strict \
    --ignore-vuln PYSEC-2026-161 \
Collaborator
Do we need this line? I experimented locally and found that bumping fastapi to >=0.135.0 removes the need for this ignore. (The doc addition attributes the ignore to vllm-omni, which is not installed in the server environment.)
Collaborator
Author
Fixed by removing the line and bumping the FastAPI bound.
The previous ignore for PYSEC-2026-161 on the server pip-audit step cited the gradio 5.50 / vllm-omni 0.18 cap on `starlette<1.0`, but neither lives in the server requirements — only the worker-GPU layer brings them in. Bumping fastapi's floor lets `pip-audit` install a starlette past the 1.0.1 fix on the server, so the server step audits clean without an ignore. The worker-GPU ignore stays — that chain still caps starlette there. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Purpose
Fixes an intermittent server-startup hang where the FastAPI lifespan times out with `Supervisor child did not register a node within 30s (child still alive)`. The hang reproduces in roughly 1-in-5 to 1-in-10 stack boots and is more likely after a stack restart that overlaps with other docker activity.
Also addresses two newly disclosed CVEs that started failing `pip-audit` on this PR: `starlette` PYSEC-2026-161 is resolved on the server by bumping `fastapi`; `diffusers` GHSA-7wx4-6vff-v64p and the worker-GPU `starlette` exposure are silenced because both upgrades are blocked by existing pins.
Changes
- `src/server/supervisor/supervisor.py`: spawn the supervisor child via `mp.get_context("spawn")` instead of the default fork. Widen the process annotation to `BaseProcess | None` so spawn's `SpawnProcess` type-checks.
- `src/server/utils/concurrent.py`: `create_task_channel` now creates its IPC queues from the same spawn context, so the SemLocks inside are spawn-compatible (mixing fork and spawn primitives raises `RuntimeError` at spawn time).
- `pyproject.toml` / `uv.lock` / `src/server/requirements.txt`: bump `fastapi>=0.135.0` so the server pip-audit picks up a starlette past the PYSEC-2026-161 fix.
- `.github/workflows/security.yml` + `docs/CODE_STYLE.md`: add `--ignore-vuln` entries for `GHSA-7wx4-6vff-v64p` (diffusers, all worker steps) and `PYSEC-2026-161` (starlette, worker-GPU only), with matching rows in the advisory table. Both upgrades are blocked there by existing transitive pins.
Design
The parent server opens Redis-over-TLS connections during lifespan startup, which initialises OpenSSL state in the parent process. OpenSSL is not fork-safe: random-pool state and internal locks inherited via `fork()` can deadlock the child the first time it calls `ssl.SSLContext.__new__`. We caught the child stuck there with a `faulthandler` thread dump, inside `redis-py`'s SSL connect path on `SyncRedisClient.ping()`.
Switching to `spawn` execs a fresh Python interpreter for the child, so OpenSSL initialises cleanly. The spawn overhead is ~0.5–1 s and is paid only once per stack boot.
Test Plan
E2E (local stack, no plugin scenarios): `flowmesh stack up` × 25 consecutive cycles with `worker up cpu 1` mixed in; 25/25 healthy, zero handshake timeouts. Pre-fix, the same loop reproduced the timeout at ~1 in 5–10 cycles.
Test Result
25-cycle stress loop: `PASS=25 FAIL=0`.
Pre-submission Checklist
- `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- `uv sync --all-packages --group ci --frozen`).
- `[BREAKING]` and described migration steps above.