feat: cap SSH session container resources by spec and worker config #50
kaiitunnz wants to merge 10 commits into
Conversation
SSH session containers previously had unbounded access to the host's CPU and memory. Plumb an operator-set cap (SSH_MAX_CPU / SSH_MAX_MEMORY / SSH_MAX_PIDS) from the supervisor to the worker registry as a new SSHLimits record, and apply min(task spec, worker cap) as Docker nano_cpus / mem_limit / pids_limit when spawning the session. The dispatcher's hw_satisfies clamps physical capacity by the cap for SSH tasks so workers that cannot satisfy a request are filtered out before silent under-provisioning at runtime. Unset values mean unbounded (historical behavior), with a worker startup warning when SSH is enabled but uncapped. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
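A minimal sketch of the clamp described above, with illustrative names (`SSHLimits` field shapes, the `clamp` helper, unit conversions); the actual executor code may differ:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SSHLimits:
    """Operator-set caps; None means unbounded (the historical behavior)."""
    max_cpu: Optional[float] = None   # cores
    max_memory: Optional[int] = None  # bytes
    max_pids: Optional[int] = None


def clamp(spec: Optional[float], cap: Optional[float]) -> Optional[float]:
    """min(task spec, worker cap), with None meaning 'no bound on this side'."""
    if spec is None:
        return cap
    if cap is None:
        return spec
    return min(spec, cap)


def resource_kwargs(spec_cpu: Optional[float], spec_memory: Optional[int],
                    limits: SSHLimits) -> dict:
    """Translate the effective limits into docker-py run() kwargs."""
    kwargs: dict = {}
    cpu = clamp(spec_cpu, limits.max_cpu)
    if cpu is not None:
        kwargs["nano_cpus"] = int(cpu * 1_000_000_000)  # cores -> nano-CPUs
    mem = clamp(spec_memory, limits.max_memory)
    if mem is not None:
        kwargs["mem_limit"] = mem                        # bytes
    if limits.max_pids is not None:
        kwargs["pids_limit"] = limits.max_pids           # admin-only; the spec has no PID field
    return kwargs
```

With every field unset the kwargs dict stays empty, which preserves the unbounded pre-PR behavior.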
`EnvVar.min_value` / `max_value` previously only modeled inclusive bounds. `SSH_MAX_CPU=0` slipped past stack-side validation and only failed at worker boot. Add `min_inclusive` / `max_inclusive` flags (default `True`, preserving existing behavior) and apply the strict variant to `SSH_MAX_CPU` so a zero value is rejected at the source. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
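A rough sketch of the inclusive/exclusive bound semantics; the flag names come from the commit, while the class shape and error message are illustrative:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumericBound:
    """Bound check sketch: inclusive by default, strict on opt-in."""
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    min_inclusive: bool = True
    max_inclusive: bool = True

    def validate(self, name: str, value: float) -> None:
        if self.min_value is not None:
            ok = value >= self.min_value if self.min_inclusive else value > self.min_value
            if not ok:
                op = ">=" if self.min_inclusive else ">"
                raise ValueError(f"{name}={value} must be {op} {self.min_value}")
        if self.max_value is not None:
            ok = value <= self.max_value if self.max_inclusive else value < self.max_value
            if not ok:
                op = "<=" if self.max_inclusive else "<"
                raise ValueError(f"{name}={value} must be {op} {self.max_value}")


# With a strict lower bound, SSH_MAX_CPU=0 is rejected at the stack layer:
NumericBound(min_value=0, min_inclusive=False).validate("SSH_MAX_CPU", 0)  # raises ValueError
```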
Adds a third optional ``hardware`` parameter to ``Executor.__init__`` so executors can read probed worker hardware without re-collecting it. ``initialize_executors`` now forwards the value alongside ``config`` and ``lifecycle``; subclasses switch to ``(*args, **kwargs)`` passthrough (matching the existing ``GovernanceMixin`` pattern) so future signature extensions don't break the chain. No executor consumes the new parameter yet — wiring only. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
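A sketch of the cooperative passthrough pattern being described; only the `hardware` parameter and the `(*args, **kwargs)` forwarding are taken from the commit, the class bodies are placeholders:

```python
class Executor:
    def __init__(self, config, lifecycle, hardware=None):
        # Third, optional parameter: the probed WorkerHardware, or None.
        self.config = config
        self.lifecycle = lifecycle
        self.hardware = hardware


class SomeExecutor(Executor):
    # Subclasses forward everything so a future base-class signature
    # extension does not break the inheritance chain.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # subclass-specific setup goes here
```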
MPExecutor now accepts and forwards WorkerHardware to the inner executor subprocess, mirroring the base Executor signature. A shared make_worker_hardware test factory replaces the per-file fixture. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Centralize GPU type normalization, memory parsing, per-device matching, and unified-memory fallback into shared.utils.hardware so dispatcher filtering and worker-side slicing stay in sync. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
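Illustrative helpers for the kind of normalization and parsing being centralized; the function names and the unit table are assumptions, not the actual `shared.utils.hardware` API:

```python
import re
from typing import Optional

_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "K": 10**3, "M": 10**6, "G": 10**9}


def normalize_gpu_type(name: str) -> str:
    """Normalize vendor strings like 'NVIDIA H200 141GB HBM3e' for comparison."""
    return re.sub(r"\s+", " ", name).strip().upper()


def parse_memory(value: str) -> int:
    """Parse '512Mi' / '80G' style memory strings into bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMG]i?)?B?", value.strip())
    if match is None:
        raise ValueError(f"unparseable memory value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS.get(unit, 1))


def device_matches(device_type: str, device_memory: int,
                   wanted_type: Optional[str], wanted_memory: Optional[int]) -> bool:
    """Per-device matching: type substring plus a per-device memory floor."""
    if wanted_type and normalize_gpu_type(wanted_type) not in normalize_gpu_type(device_type):
        return False
    if wanted_memory and device_memory < wanted_memory:
        return False
    return True
```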
SSH containers now expose only the spec-requested subset of the worker's GPUs (filtering by type and per-device memory, with a unified-memory fallback) and write a normalized CUDA_VISIBLE_DEVICES matching the container's 0..N-1 view of the mounted devices. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
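A self-contained sketch of that slicing: pick a matching subset of host GPUs and emit the container-side `CUDA_VISIBLE_DEVICES`. The dict keys and the helper name are hypothetical; only the host-ID vs. 0..N-1 distinction comes from the commit:

```python
from typing import Optional


def resolve_gpu_subset(host_gpus: list, count: Optional[int] = None,
                       gpu_type: Optional[str] = None,
                       memory_bytes: Optional[int] = None):
    """host_gpus: e.g. [{"host_id": "3", "type": "H200", "memory": 150_000_000_000}, ...]."""

    def matches(gpu: dict) -> bool:
        if gpu_type and gpu_type.upper() not in gpu["type"].upper():
            return False
        if memory_bytes and gpu["memory"] < memory_bytes:
            return False
        return True

    subset = [g for g in host_gpus if matches(g)]
    if count is not None:
        if len(subset) < count:
            raise RuntimeError("spec requests more matching GPUs than this worker has")
        subset = subset[:count]

    host_ids = [g["host_id"] for g in subset]  # what goes into device_requests
    # Inside the container the NVIDIA runtime renumbers the mounted devices
    # 0..N-1, so CUDA_VISIBLE_DEVICES uses that view rather than host IDs.
    cuda_visible = ",".join(str(i) for i in range(len(host_ids)))
    return host_ids, cuda_visible
```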
Default off — SSH session containers receive every worker GPU regardless of `resources.hardware.gpu`. Operators turn slicing on with ENABLE_SSH_GPU_LIMIT=true, which propagates from the supervisor through the worker into SSHConfig.from_spec's GPU resolver. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
SSH_MAX_PIDS=
# Whether to apply requested GPU limits to SSH tasks.
# If false, SSH tasks are allocated all available GPUs
# regardless of their resource requests.
I wonder if we should enable the GPU limit by default. In the long term, I think exposing exactly what the user requested is the cleaner default design. I do not see a scenario where we should disable the limit. What do you think?
I’m thinking about hardware utilization here. Unlike CPU and RAM, which are shared across workers and other processes, GPUs are assigned exclusively to a worker. If we only expose the spec-requested GPUs to the SSH task, any remaining GPUs on that worker may sit idle, which can reduce overall utilization and has billing implications as well.
Is it possible to allow multiple SSH sessions to share the same worker?
No, it is not. Only one SSH task can run at a time per worker.
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Safety net for the case where a GPU-requesting SSH task is routed to a worker reporting no host GPUs. Previously the resolver silently returned an empty subset; it now raises ExecutionError, matching the other unsatisfiable-request paths in the same function. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
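The guard amounts to a few lines; `ExecutionError` is the project's exception named in the commit, stubbed here so the sketch stands alone:

```python
class ExecutionError(RuntimeError):
    """Stand-in for the project's ExecutionError."""


def check_gpu_request(spec_requests_gpu: bool, host_gpu_count: int) -> None:
    # Fail loudly instead of silently resolving an empty GPU subset.
    if spec_requests_gpu and host_gpu_count == 0:
        raise ExecutionError("SSH task requests GPUs, but this worker reports no host GPUs")
```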
Purpose
Address the "Fix SSH resource access" item in RFC #48. SSH session containers previously had unbounded host CPU / memory / GPU access — no `nano_cpus`/`mem_limit`/`pids_limit` Docker kwargs were set (or `device_requests` exposed every worker GPU regardless of spec). This PR adds operator caps (`SSH_MAX_CPU`/`SSH_MAX_MEMORY`/`SSH_MAX_PIDS`) and enforces `min(task_spec, worker_cap)` both at dispatch and at runtime, and additionally adds an opt-in `ENABLE_SSH_GPU_LIMIT` flag that limits each SSH session's GPU access to the spec-requested subset.

Changes
- New `SSHLimits` model in `shared/schemas/worker.py`; mirrored on the SDK.
- `DockerWorkerConfig.ssh` gains `max_cpu`/`max_memory`/`max_pids`; `WorkerInfo.ssh_limits` populated when SSH is enabled; SSH env injection gated on `enable_ssh`.
- `Worker.ssh_limits` rehydrated from Redis; `hw_satisfies` clamps the worker's physical capacity by `ssh_limits` for SSH tasks only.
- `WorkerConfig.ssh_limits` parsed from env (fail-fast on bad values); reported on registration; startup warning when SSH is enabled without a cap.
- The SSH executor computes `min(spec, cap)` per session; applies `nano_cpus`/`mem_limit`/`pids_limit` Docker kwargs; logs a `WARNING ... clamping to cap` when the spec exceeds the cap. When `ENABLE_SSH_GPU_LIMIT=true`, also resolves a spec-matching GPU subset and wires both `device_requests` (host IDs) and a normalized `CUDA_VISIBLE_DEVICES` (container-side `0..N-1`); when false (default), every worker GPU is mounted as before.
- `Executor.__init__` and `MPExecutor` thread `WorkerHardware` through to subclasses via cooperative `*args, **kwargs`, so the SSH executor can read the worker's GPU device list when resolving the subset.
- `shared/utils/hardware.py` centralizes GPU type normalization, memory parsing, per-device matching, and the unified-memory fallback so dispatcher filtering and worker-side slicing stay in sync.
- `env_schema.py` and `docs/ENV.md` updated. `EnvVar` gains `min_inclusive`/`max_inclusive` flags so `SSH_MAX_CPU=0` is rejected at the stack-validation layer instead of only failing at worker boot.

Design
Two-sided enforcement. Dispatch refuses to route an SSH task to a worker whose effective ceiling can't satisfy the request, so the common case never reaches the executor. The runtime clamp is defense in depth for config drift between the dispatcher's view of `ssh_limits` and the worker's. `SSH_MAX_PIDS` is admin-only — the spec has no PID field.

GPU slicing is opt-in via `ENABLE_SSH_GPU_LIMIT` and has no operator cap. When enabled, the SSH executor picks the smallest subset of the worker's host GPUs (`WORKER_HOST_GPU_ID`) that satisfies the spec's type and per-device memory floor, passes those host IDs through `device_requests`, and writes `CUDA_VISIBLE_DEVICES=0,1,...,N-1` to match the 0..N-1 view the NVIDIA container runtime exposes inside the session. The default-off behavior is the pre-PR pass-through: every worker GPU is mounted regardless of the spec.

Default is pass-through unlimited; the cap is opt-in. The startup warning makes the unlimited default visible to operators.
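A simplified sketch of the dispatch-side clamp, reduced to scalar CPU/memory; the real `hw_satisfies` also covers GPUs and other fields this sketch omits:

```python
from typing import Optional


def hw_satisfies(worker_cpu: float, worker_memory: int,
                 spec_cpu: Optional[float], spec_memory: Optional[int],
                 is_ssh_task: bool,
                 ssh_max_cpu: Optional[float] = None,
                 ssh_max_memory: Optional[int] = None) -> bool:
    # For SSH tasks, the worker's physical capacity is clamped by the
    # operator cap before the usual spec-vs-capacity comparison.
    cpu_ceiling, mem_ceiling = worker_cpu, worker_memory
    if is_ssh_task:
        if ssh_max_cpu is not None:
            cpu_ceiling = min(cpu_ceiling, ssh_max_cpu)
        if ssh_max_memory is not None:
            mem_ceiling = min(mem_ceiling, ssh_max_memory)
    cpu_ok = spec_cpu is None or spec_cpu <= cpu_ceiling
    mem_ok = spec_memory is None or spec_memory <= mem_ceiling
    return cpu_ok and mem_ok
```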
Test Plan
Local-stack manual verification across four scenarios, with full step-by-step commands in `tmp/e2e_tests/ssh_limit/test_ssh_limit.md`. Per-scenario summary:

- `SSH_MAX_*` empty; expect no Docker resource kwargs on the SSH container and the "SSH resource cap not configured" startup warning.
- `SSH_MAX_CPU=1.5`, `SSH_MAX_MEMORY=512Mi`, `SSH_MAX_PIDS=256`; expect those values to round-trip through the registry and appear on the SSH container's `HostConfig`.
- Spec with a `resources.hardware` block — spec below cap (spec wins), spec above cap on the only worker (dispatcher refuses).
- `flowmesh stack worker up gpu all` with `ENABLE_SSH_GPU_LIMIT=true`; submit SSH tasks with no GPU spec / `count: 1` / `type: <model>` / `memory: <floor>` / `count > N_GPUS`. For each, inspect the SSH container's `HostConfig.DeviceRequests[0].DeviceIDs` and `Config.Env` for `CUDA_VISIBLE_DEVICES`, and `nvidia-smi -L` inside the container.

Each scenario uses `examples/templates/ssh_with_inputs.yaml` — the spec-modifying sub-cases fork it and add a `resources.hardware` block to the `annotate` stage. The session stays alive for `docker inspect`; we cancel the workflow once the inspect is done.
Local-stack (CPU worker, `worker_cpu_0`):

- Caps unset: `flowmesh worker list` showed no SSH limits. SSH container `HostConfig` → `{NanoCpus: 0, Memory: 0, PidsLimit: null}`.
- Caps set: `HostConfig` → `{NanoCpus: 1500000000, Memory: 536870912, PidsLimit: 256}`; in-container cgroup files corroborated.
- Spec below cap: `Memory: 268435456`, `NanoCpus: 1000000000`; no clamp warning.
- Spec above cap: task stayed `PENDING`; cancelled explicitly. The runtime-clamp path is exercised only by unit tests in this PR — end-to-end, any spec above the cap is rejected at dispatch before the executor sees it.

Local-stack (GPU worker, `worker_gpu_all`, 4× H200, image `nvidia/cuda:12.9.1-base-ubuntu24.04`, `ENABLE_SSH_GPU_LIMIT=true`):

- No GPU spec: `DeviceRequests[0].DeviceIDs` listed all 4 host GPU IDs; `CUDA_VISIBLE_DEVICES="0,1,2,3"`; `nvidia-smi -L` reported 4 GPUs.
- Spec with `count: 1`: `DeviceRequests[0].DeviceIDs` contained exactly one host GPU ID (the first in the worker's device list); `CUDA_VISIBLE_DEVICES="0"`; `nvidia-smi -L` reported a single GPU.
- Spec with `type: H200`: matched all 4 devices; with `count: 1` the first H200 was mounted; `CUDA_VISIBLE_DEVICES="0"`.
- Spec with a `memory` floor below per-device capacity: matched all 4 devices; `count: 1` mounted one; `CUDA_VISIBLE_DEVICES="0"`.
- Spec with `count: 5` (more than N_GPUS): dispatcher refused; task stayed `PENDING` with no SSH container launched; cancelled explicitly.

Follow-ups
`/proc` view isolation. Cgroup limits enforce actual usage, but `/proc/cpuinfo`, `/proc/meminfo`, etc. still report the host. `lxcfs` synthesizes per-cgroup `/proc` views via FUSE; a follow-up PR can wire a `SSH_USE_LXCFS=true` flag into `SSHExecutor._build_run_kwargs` to bind-mount `/var/lib/lxcfs/proc/*` into the session container. Separate concern from cgroup enforcement; the same host-level `lxcfs` daemon serves all containers via per-reader cgroup lookup.
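If that follow-up materializes, the wiring could look roughly like this; the file list reflects typical lxcfs mounts and the merge point is hypothetical, not verified against this repo:

```python
LXCFS_PROC_FILES = ["cpuinfo", "meminfo", "stat", "uptime", "diskstats", "swaps"]


def lxcfs_volume_binds(enabled: bool) -> dict:
    """docker-py volume binds that overlay lxcfs' per-cgroup /proc files."""
    if not enabled:
        return {}
    return {
        f"/var/lib/lxcfs/proc/{name}": {"bind": f"/proc/{name}", "mode": "ro"}
        for name in LXCFS_PROC_FILES
    }


# Hypothetical merge point inside the SSH executor's run kwargs:
# run_kwargs.setdefault("volumes", {}).update(lxcfs_volume_binds(use_lxcfs))
```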
Pre-submission Checklist

- Ran `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- Dependencies are in sync (`uv sync --all-packages --group ci --frozen`).
- Breaking changes are marked `[BREAKING]` and migration steps are described above.