feat: cap SSH session container resources by spec and worker config #50
kaiitunnz wants to merge 10 commits into
Conversation
SSH session containers previously had unbounded access to the host's CPU and memory. Plumb an operator-set cap (SSH_MAX_CPU / SSH_MAX_MEMORY / SSH_MAX_PIDS) from the supervisor to the worker registry as a new SSHLimits record, and apply min(task spec, worker cap) as Docker nano_cpus / mem_limit / pids_limit when spawning the session. The dispatcher's hw_satisfies clamps physical capacity by the cap for SSH tasks so workers that cannot satisfy a request are filtered out before silent under-provisioning at runtime. Unset values mean unbounded (historical behavior), with a worker startup warning when SSH is enabled but uncapped. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
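A minimal sketch of the clamp described above, with illustrative names (`SSHLimits` field shapes, the `clamp` helper, unit conversions); the actual executor code may differ:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SSHLimits:
    """Operator-set caps; None means unbounded (the historical behavior)."""
    max_cpu: Optional[float] = None   # cores
    max_memory: Optional[int] = None  # bytes
    max_pids: Optional[int] = None


def clamp(spec: Optional[float], cap: Optional[float]) -> Optional[float]:
    """min(task spec, worker cap), with None meaning 'no bound on this side'."""
    if spec is None:
        return cap
    if cap is None:
        return spec
    return min(spec, cap)


def resource_kwargs(spec_cpu: Optional[float], spec_memory: Optional[int],
                    limits: SSHLimits) -> dict:
    """Translate the effective limits into docker-py run() kwargs."""
    kwargs: dict = {}
    cpu = clamp(spec_cpu, limits.max_cpu)
    if cpu is not None:
        kwargs["nano_cpus"] = int(cpu * 1_000_000_000)  # cores -> nano-CPUs
    mem = clamp(spec_memory, limits.max_memory)
    if mem is not None:
        kwargs["mem_limit"] = mem                        # bytes
    if limits.max_pids is not None:
        kwargs["pids_limit"] = limits.max_pids           # admin-only; the spec has no PID field
    return kwargs
```

With every field unset the kwargs dict stays empty, which preserves the unbounded pre-PR behavior.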
`EnvVar.min_value` / `max_value` previously only modeled inclusive bounds. `SSH_MAX_CPU=0` slipped past stack-side validation and only failed at worker boot. Add `min_inclusive` / `max_inclusive` flags (default `True`, preserving existing behavior) and apply the strict variant to `SSH_MAX_CPU` so a zero value is rejected at the source. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
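A rough sketch of the inclusive/exclusive bound semantics; the flag names come from the commit, while the class shape and error message are illustrative:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumericBound:
    """Bound check sketch: inclusive by default, strict on opt-in."""
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    min_inclusive: bool = True
    max_inclusive: bool = True

    def validate(self, name: str, value: float) -> None:
        if self.min_value is not None:
            ok = value >= self.min_value if self.min_inclusive else value > self.min_value
            if not ok:
                op = ">=" if self.min_inclusive else ">"
                raise ValueError(f"{name}={value} must be {op} {self.min_value}")
        if self.max_value is not None:
            ok = value <= self.max_value if self.max_inclusive else value < self.max_value
            if not ok:
                op = "<=" if self.max_inclusive else "<"
                raise ValueError(f"{name}={value} must be {op} {self.max_value}")


# With a strict lower bound, SSH_MAX_CPU=0 is rejected at the stack layer:
NumericBound(min_value=0, min_inclusive=False).validate("SSH_MAX_CPU", 0)  # raises ValueError
```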
Adds a third optional ``hardware`` parameter to ``Executor.__init__`` so executors can read probed worker hardware without re-collecting it. ``initialize_executors`` now forwards the value alongside ``config`` and ``lifecycle``; subclasses switch to ``(*args, **kwargs)`` passthrough (matching the existing ``GovernanceMixin`` pattern) so future signature extensions don't break the chain. No executor consumes the new parameter yet — wiring only. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
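A sketch of the cooperative passthrough pattern being described; only the `hardware` parameter and the `(*args, **kwargs)` forwarding are taken from the commit, the class bodies are placeholders:

```python
class Executor:
    def __init__(self, config, lifecycle, hardware=None):
        # Third, optional parameter: the probed WorkerHardware, or None.
        self.config = config
        self.lifecycle = lifecycle
        self.hardware = hardware


class SomeExecutor(Executor):
    # Subclasses forward everything so a future base-class signature
    # extension does not break the inheritance chain.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # subclass-specific setup goes here
```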
MPExecutor now accepts and forwards WorkerHardware to the inner executor subprocess, mirroring the base Executor signature. A shared make_worker_hardware test factory replaces the per-file fixture. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Centralize GPU type normalization, memory parsing, per-device matching, and unified-memory fallback into shared.utils.hardware so dispatcher filtering and worker-side slicing stay in sync. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
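Illustrative helpers for the kind of normalization and parsing being centralized; the function names and the unit table are assumptions, not the actual `shared.utils.hardware` API:

```python
import re
from typing import Optional

_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "K": 10**3, "M": 10**6, "G": 10**9}


def normalize_gpu_type(name: str) -> str:
    """Normalize vendor strings like 'NVIDIA H200 141GB HBM3e' for comparison."""
    return re.sub(r"\s+", " ", name).strip().upper()


def parse_memory(value: str) -> int:
    """Parse '512Mi' / '80G' style memory strings into bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMG]i?)?B?", value.strip())
    if match is None:
        raise ValueError(f"unparseable memory value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS.get(unit, 1))


def device_matches(device_type: str, device_memory: int,
                   wanted_type: Optional[str], wanted_memory: Optional[int]) -> bool:
    """Per-device matching: type substring plus a per-device memory floor."""
    if wanted_type and normalize_gpu_type(wanted_type) not in normalize_gpu_type(device_type):
        return False
    if wanted_memory and device_memory < wanted_memory:
        return False
    return True
```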
SSH containers now expose only the spec-requested subset of the worker's GPUs (filtering by type and per-device memory, with a unified-memory fallback) and write a normalized CUDA_VISIBLE_DEVICES matching the container's 0..N-1 view of the mounted devices. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
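A self-contained sketch of that slicing: pick a matching subset of host GPUs and emit the container-side `CUDA_VISIBLE_DEVICES`. The dict keys and the helper name are hypothetical; only the host-ID vs. 0..N-1 distinction comes from the commit:

```python
from typing import Optional


def resolve_gpu_subset(host_gpus: list, count: Optional[int] = None,
                       gpu_type: Optional[str] = None,
                       memory_bytes: Optional[int] = None):
    """host_gpus: e.g. [{"host_id": "3", "type": "H200", "memory": 150_000_000_000}, ...]."""

    def matches(gpu: dict) -> bool:
        if gpu_type and gpu_type.upper() not in gpu["type"].upper():
            return False
        if memory_bytes and gpu["memory"] < memory_bytes:
            return False
        return True

    subset = [g for g in host_gpus if matches(g)]
    if count is not None:
        if len(subset) < count:
            raise RuntimeError("spec requests more matching GPUs than this worker has")
        subset = subset[:count]

    host_ids = [g["host_id"] for g in subset]  # what goes into device_requests
    # Inside the container the NVIDIA runtime renumbers the mounted devices
    # 0..N-1, so CUDA_VISIBLE_DEVICES uses that view rather than host IDs.
    cuda_visible = ",".join(str(i) for i in range(len(host_ids)))
    return host_ids, cuda_visible
```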
Default off — SSH session containers receive every worker GPU regardless of `resources.hardware.gpu`. Operators turn slicing on with ENABLE_SSH_GPU_LIMIT=true, which propagates from the supervisor through the worker into SSHConfig.from_spec's GPU resolver. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
SSH_MAX_PIDS=
# Whether to apply requested GPU limits to SSH tasks.
# If false, SSH tasks are allocated all available GPUs
# regardless of their resource requests.
I wonder if we should enable the GPU limit by default. In the long term, I think exposing exactly what the user requested is the cleaner default design. I do not see a scenario where we should disable the limit. What do you think?
I’m thinking about hardware utilization here. Unlike CPU and RAM, which are shared across workers and other processes, GPUs are assigned exclusively to a worker. If we only expose the spec-requested GPUs to the SSH task, any remaining GPUs on that worker may sit idle, which can reduce overall utilization and has billing implications as well.
Is it possible to allow multiple SSH sessions to share the same worker?
No, it is not. Only one SSH task can run at a time per worker.
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Safety net for the case where a GPU-requesting SSH task is routed to a worker reporting no host GPUs. Previously the resolver silently returned an empty subset; it now raises ExecutionError, matching the other unsatisfiable-request paths in the same function. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
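The guard amounts to a few lines; `ExecutionError` is the project's exception named in the commit, stubbed here so the sketch stands alone:

```python
class ExecutionError(RuntimeError):
    """Stand-in for the project's ExecutionError."""


def check_gpu_request(spec_requests_gpu: bool, host_gpu_count: int) -> None:
    # Fail loudly instead of silently resolving an empty GPU subset.
    if spec_requests_gpu and host_gpu_count == 0:
        raise ExecutionError("SSH task requests GPUs, but this worker reports no host GPUs")
```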
Purpose
Address the "Fix SSH resource access" item in RFC #48. SSH session containers previously had unbounded host CPU / memory / GPU access — no `nano_cpus`/`mem_limit`/`pids_limit` Docker kwargs were set (or `device_requests` exposed every worker GPU regardless of spec). This PR adds operator caps (`SSH_MAX_CPU`/`SSH_MAX_MEMORY`/`SSH_MAX_PIDS`) and enforces `min(task_spec, worker_cap)` both at dispatch and at runtime, and additionally adds an opt-in `ENABLE_SSH_GPU_LIMIT` flag that limits each SSH session's GPU access to the spec-requested subset.

Changes
- New `SSHLimits` model in `shared/schemas/worker.py`; mirrored on the SDK.
- `DockerWorkerConfig.ssh` gains `max_cpu`/`max_memory`/`max_pids`; `WorkerInfo.ssh_limits` populated when SSH is enabled; SSH env injection gated on `enable_ssh`.
- `Worker.ssh_limits` rehydrated from Redis; `hw_satisfies` clamps the worker's physical capacity by `ssh_limits` for SSH tasks only.
- `WorkerConfig.ssh_limits` parsed from env (fail-fast on bad values); reported on registration; startup warning when SSH is enabled without a cap.
- The SSH executor computes `min(spec, cap)` per session; applies `nano_cpus`/`mem_limit`/`pids_limit` Docker kwargs; logs a `WARNING ... clamping to cap` when the spec exceeds the cap. When `ENABLE_SSH_GPU_LIMIT=true`, also resolves a spec-matching GPU subset and wires both `device_requests` (host IDs) and a normalized `CUDA_VISIBLE_DEVICES` (container-side `0..N-1`); when false (default), every worker GPU is mounted as before.
- `Executor.__init__` and `MPExecutor` thread `WorkerHardware` through to subclasses via cooperative `*args, **kwargs`, so the SSH executor can read the worker's GPU device list when resolving the subset.
- `shared/utils/hardware.py` centralizes GPU type normalization, memory parsing, per-device matching, and the unified-memory fallback so dispatcher filtering and worker-side slicing stay in sync.
- `env_schema.py` and `docs/ENV.md` updated. `EnvVar` gains `min_inclusive`/`max_inclusive` flags so `SSH_MAX_CPU=0` is rejected at the stack-validation layer instead of only failing at worker boot.

Design
Two-sided enforcement. Dispatch refuses to route an SSH task to a worker whose effective ceiling can't satisfy the request, so the common case never reaches the executor. The runtime clamp is defense in depth for config drift between the dispatcher's view of `ssh_limits` and the worker's. `SSH_MAX_PIDS` is admin-only — the spec has no PID field.

GPU slicing is opt-in via `ENABLE_SSH_GPU_LIMIT` and has no operator cap. When enabled, the SSH executor picks the smallest subset of the worker's host GPUs (`WORKER_HOST_GPU_ID`) that satisfies the spec's type and per-device memory floor, passes those host IDs through `device_requests`, and writes `CUDA_VISIBLE_DEVICES=0,1,...,N-1` to match the 0..N-1 view the NVIDIA container runtime exposes inside the session. The default-off behavior is the pre-PR pass-through: every worker GPU is mounted regardless of the spec.

Default is pass-through unlimited; the cap is opt-in. The startup warning makes the unlimited default visible to operators.
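A simplified sketch of the dispatch-side clamp, reduced to scalar CPU/memory; the real `hw_satisfies` also covers GPUs and other fields this sketch omits:

```python
from typing import Optional


def hw_satisfies(worker_cpu: float, worker_memory: int,
                 spec_cpu: Optional[float], spec_memory: Optional[int],
                 is_ssh_task: bool,
                 ssh_max_cpu: Optional[float] = None,
                 ssh_max_memory: Optional[int] = None) -> bool:
    # For SSH tasks, the worker's physical capacity is clamped by the
    # operator cap before the usual spec-vs-capacity comparison.
    cpu_ceiling, mem_ceiling = worker_cpu, worker_memory
    if is_ssh_task:
        if ssh_max_cpu is not None:
            cpu_ceiling = min(cpu_ceiling, ssh_max_cpu)
        if ssh_max_memory is not None:
            mem_ceiling = min(mem_ceiling, ssh_max_memory)
    cpu_ok = spec_cpu is None or spec_cpu <= cpu_ceiling
    mem_ok = spec_memory is None or spec_memory <= mem_ceiling
    return cpu_ok and mem_ok
```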
Test Plan
Local-stack manual verification across four scenarios, with full step-by-step commands in `tmp/e2e_tests/ssh_limit/test_ssh_limit.md`. Per-scenario summary:

- `SSH_MAX_*` empty; expect no Docker resource kwargs on the SSH container and the "SSH resource cap not configured" startup warning.
- `SSH_MAX_CPU=1.5`, `SSH_MAX_MEMORY=512Mi`, `SSH_MAX_PIDS=256`; expect those values to round-trip through the registry and appear on the SSH container's `HostConfig`.
- Spec with a `resources.hardware` block — spec below cap (spec wins), spec above cap on the only worker (dispatcher refuses).
- `flowmesh stack worker up gpu all` with `ENABLE_SSH_GPU_LIMIT=true`; submit SSH tasks with no GPU spec / `count: 1` / `type: <model>` / `memory: <floor>` / `count > N_GPUS`. For each, inspect the SSH container's `HostConfig.DeviceRequests[0].DeviceIDs` and `Config.Env` for `CUDA_VISIBLE_DEVICES`, and `nvidia-smi -L` inside the container.

Each scenario uses `examples/templates/ssh_with_inputs.yaml` — the spec-modifying sub-cases fork it and add a `resources.hardware` block to the `annotate` stage. The session stays alive for `docker inspect`; we cancel the workflow once the inspect is done.
Local-stack (CPU worker, `worker_cpu_0`):

- Caps unset: `flowmesh worker list` showed no SSH limits. SSH container `HostConfig` → `{NanoCpus: 0, Memory: 0, PidsLimit: null}`.
- Caps set: `HostConfig` → `{NanoCpus: 1500000000, Memory: 536870912, PidsLimit: 256}`; in-container cgroup files corroborated.
- Spec below cap: `Memory: 268435456`, `NanoCpus: 1000000000`; no clamp warning.
- Spec above cap: task stayed `PENDING`; cancelled explicitly. The runtime-clamp path is exercised only by unit tests in this PR — end-to-end, any spec above the cap is rejected at dispatch before the executor sees it.

Local-stack (GPU worker, `worker_gpu_all`, 4× H200, image `nvidia/cuda:12.9.1-base-ubuntu24.04`, `ENABLE_SSH_GPU_LIMIT=true`):

- No GPU spec: `DeviceRequests[0].DeviceIDs` listed all 4 host GPU IDs; `CUDA_VISIBLE_DEVICES="0,1,2,3"`; `nvidia-smi -L` reported 4 GPUs.
- Spec with `count: 1`: `DeviceRequests[0].DeviceIDs` contained exactly one host GPU ID (the first in the worker's device list); `CUDA_VISIBLE_DEVICES="0"`; `nvidia-smi -L` reported a single GPU.
- Spec with `type: H200`: matched all 4 devices; with `count: 1` the first H200 was mounted; `CUDA_VISIBLE_DEVICES="0"`.
- Spec with a `memory` floor below per-device capacity: matched all 4 devices; `count: 1` mounted one; `CUDA_VISIBLE_DEVICES="0"`.
- Spec with `count: 5` (more than N_GPUS): dispatcher refused; task stayed `PENDING` with no SSH container launched; cancelled explicitly.

Follow-ups
`/proc` view isolation. Cgroup limits enforce actual usage, but `/proc/cpuinfo`, `/proc/meminfo`, etc. still report the host. `lxcfs` synthesizes per-cgroup `/proc` views via FUSE; a follow-up PR can wire a `SSH_USE_LXCFS=true` flag into `SSHExecutor._build_run_kwargs` to bind-mount `/var/lib/lxcfs/proc/*` into the session container. Separate concern from cgroup enforcement; the same host-level `lxcfs` daemon serves all containers via per-reader cgroup lookup.
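If that follow-up materializes, the wiring could look roughly like this; the file list reflects typical lxcfs mounts and the merge point is hypothetical, not verified against this repo:

```python
LXCFS_PROC_FILES = ["cpuinfo", "meminfo", "stat", "uptime", "diskstats", "swaps"]


def lxcfs_volume_binds(enabled: bool) -> dict:
    """docker-py volume binds that overlay lxcfs' per-cgroup /proc files."""
    if not enabled:
        return {}
    return {
        f"/var/lib/lxcfs/proc/{name}": {"bind": f"/proc/{name}", "mode": "ro"}
        for name in LXCFS_PROC_FILES
    }


# Hypothetical merge point inside the SSH executor's run kwargs:
# run_kwargs.setdefault("volumes", {}).update(lxcfs_volume_binds(use_lxcfs))
```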
Pre-submission Checklist

- Ran `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- Dependencies are in sync (`uv sync --all-packages --group ci --frozen`).
- Breaking changes are marked `[BREAKING]` and migration steps are described above.