feat: cap SSH session container resources by spec and worker config #50

Open

kaiitunnz wants to merge 10 commits into main from kaiitunnz/feat/limit-ssh-hw

Conversation


@kaiitunnz kaiitunnz commented May 15, 2026

Purpose

Address the "Fix SSH resource access" item in RFC #48. SSH session containers previously had unbounded host CPU / memory / GPU access: no nano_cpus / mem_limit / pids_limit Docker kwargs were set, and device_requests (when set at all) exposed every worker GPU regardless of spec. This PR adds operator caps (SSH_MAX_CPU / SSH_MAX_MEMORY / SSH_MAX_PIDS), enforces min(task_spec, worker_cap) both at dispatch and at runtime, and adds an opt-in ENABLE_SSH_GPU_LIMIT flag that limits each SSH session's GPU access to the spec-requested subset.

Changes

  • Schema — new SSHLimits model in shared/schemas/worker.py; mirrored on the SDK (see the sketch after this list).
  • Supervisor — DockerWorkerConfig.ssh gains max_cpu / max_memory / max_pids; WorkerInfo.ssh_limits populated when SSH is enabled; SSH env injection gated on enable_ssh.
  • Server registry — Worker.ssh_limits rehydrated from Redis; hw_satisfies clamps the worker's physical capacity by ssh_limits for SSH tasks only.
  • Worker — WorkerConfig.ssh_limits parsed from env (fail-fast on bad values); reported on registration; startup warning when SSH is enabled without a cap.
  • SSH executor — resolves min(spec, cap) per session; applies nano_cpus / mem_limit / pids_limit Docker kwargs; logs a WARNING ... clamping to cap when the spec exceeds the cap. When ENABLE_SSH_GPU_LIMIT=true, also resolves a spec-matching GPU subset and wires both device_requests (host IDs) and a normalized CUDA_VISIBLE_DEVICES (container-side 0..N-1); when false (default), every worker GPU is mounted as before.
  • Executor plumbing — Executor.__init__ and MPExecutor thread WorkerHardware through to subclasses via cooperative *args, **kwargs, so the SSH executor can read the worker's GPU device list when resolving the subset.
  • Shared GPU helpers — new shared/utils/hardware.py centralizes GPU type normalization, memory parsing, per-device matching, and the unified-memory fallback so dispatcher filtering and worker-side slicing stay in sync.
  • Env schema + docs — stack env_schema.py and docs/ENV.md updated. EnvVar gains min_inclusive / max_inclusive flags so SSH_MAX_CPU=0 is rejected at the stack-validation layer instead of only failing at worker boot.
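
For orientation, a minimal sketch of the limits shape, assuming pydantic-style optional fields. The authoritative model is in shared/schemas/worker.py; only the max_cpu / max_memory / max_pids field names come from this PR, the types and units here are assumptions:

```python
# Hypothetical sketch only; the real model lives in shared/schemas/worker.py.
# Unset (None) fields mean "unbounded", matching the pre-PR behavior.
from typing import Optional

from pydantic import BaseModel


class SSHLimits(BaseModel):
    max_cpu: Optional[float] = None     # cores, e.g. 1.5
    max_memory: Optional[int] = None    # bytes; "512Mi" would parse to 536870912
    max_pids: Optional[int] = None      # hard PID ceiling for the session container


# Scenario B below (SSH_MAX_CPU=1.5, SSH_MAX_MEMORY=512Mi, SSH_MAX_PIDS=256):
limits = SSHLimits(max_cpu=1.5, max_memory=512 * 1024 * 1024, max_pids=256)
```

Dispatcher and worker both read the same record, which is what makes the min(spec, cap) logic meaningful on either side.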

Design

Two-sided enforcement. Dispatch refuses to route an SSH task to a worker whose effective ceiling can't satisfy the request, so the common case never reaches the executor. The runtime clamp is defense in depth for config drift between the dispatcher's view of ssh_limits and the worker's. SSH_MAX_PIDS is admin-only — the spec has no PID field.
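
Both sides reduce to simple min logic; a minimal sketch of the idea (function names here are illustrative, not the actual dispatcher or executor code):

```python
# Illustrative only: how the dispatch-time ceiling and the runtime clamp relate.
from typing import Optional


def effective_ceiling(physical: float, cap: Optional[float]) -> float:
    """Dispatcher view: physical capacity clamped by the SSH cap (None = unbounded)."""
    return physical if cap is None else min(physical, cap)


def resolve_runtime_limit(spec: Optional[float], cap: Optional[float]) -> Optional[float]:
    """Executor view: min(task spec, worker cap); None means no Docker kwarg is set."""
    values = [v for v in (spec, cap) if v is not None]
    return min(values) if values else None


# A 32-core worker with SSH_MAX_CPU=1.5 advertises only 1.5 cores to SSH tasks,
# so a 2-core SSH request is refused at dispatch; a 1-core request runs capped at 1.
assert effective_ceiling(32.0, 1.5) == 1.5
assert resolve_runtime_limit(1.0, 1.5) == 1.0
```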

GPU slicing is opt-in via ENABLE_SSH_GPU_LIMIT and has no operator cap. When enabled, the SSH executor picks the smallest subset of the worker's host GPUs (WORKER_HOST_GPU_ID) that satisfies the spec's type and per-device memory floor, passes those host IDs through device_requests, and writes CUDA_VISIBLE_DEVICES=0,1,...,N-1 to match the 0..N-1 view the NVIDIA container runtime exposes inside the session. The default-off behavior is the pre-PR pass-through: every worker GPU is mounted regardless of the spec.
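
A condensed sketch of that selection, assuming a simplified device/spec shape; the real resolver lives in the SSH executor plus the shared GPU helpers, and the names below are hypothetical:

```python
# Hypothetical sketch of the subset resolution; the actual logic lives in the SSH
# executor and shared/utils/hardware.py, and these names are illustrative.
from dataclasses import dataclass


@dataclass
class HostGPU:
    host_id: str      # worker-side ID passed through device_requests
    gpu_type: str     # e.g. "H200"
    memory_mib: int   # per-device memory


def resolve_gpu_subset(devices: list[HostGPU], count: int,
                       gpu_type: str | None, min_memory_mib: int) -> dict:
    matching = [d for d in devices
                if (gpu_type is None or d.gpu_type == gpu_type)
                and d.memory_mib >= min_memory_mib]
    if len(matching) < count:
        raise RuntimeError("spec not satisfiable on this worker")
    chosen = matching[:count]
    return {
        # Host IDs go into the Docker device_requests kwarg...
        "device_ids": [d.host_id for d in chosen],
        # ...while the container sees the same devices renumbered as 0..N-1.
        "cuda_visible_devices": ",".join(str(i) for i in range(len(chosen))),
    }
```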

Default is pass-through unlimited; the cap is opt-in. The startup warning makes the unlimited default visible to operators.

Test Plan

uv run pre-commit run --files <touched files>
uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py

Local-stack manual verification across four scenarios, with full step-by-step commands in tmp/e2e_tests/ssh_limit/test_ssh_limit.md. Per-scenario summary:

  • A — no cap: all SSH_MAX_* empty; expect no Docker resource kwargs on the SSH container and the "SSH resource cap not configured" startup warning.
  • B — worker cap only: SSH_MAX_CPU=1.5, SSH_MAX_MEMORY=512Mi, SSH_MAX_PIDS=256; expect those values to round-trip through the registry and appear on the SSH container's HostConfig.
  • C — spec + cap interaction: sub-cases on an SSH template with a resources.hardware block — spec below cap (spec wins), spec above cap on the only worker (dispatcher refuses).
  • D — GPU slicing: GPU worker brought up via flowmesh stack worker up gpu all with ENABLE_SSH_GPU_LIMIT=true; submit SSH tasks with no GPU spec / count: 1 / type: <model> / memory: <floor> / count > N_GPUS. For each, inspect the SSH container's HostConfig.DeviceRequests[0].DeviceIDs and Config.Env for CUDA_VISIBLE_DEVICES, and nvidia-smi -L inside the container.

Each scenario uses examples/templates/ssh_with_inputs.yaml — the spec-modifying sub-cases fork it and add a resources.hardware block to the annotate stage. The session stays alive for docker inspect; we cancel the workflow once the inspect is done.
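
The inspection steps can also be scripted with the Docker SDK, roughly as below; the container name is a placeholder, and the authoritative step-by-step commands remain in tmp/e2e_tests/ssh_limit/test_ssh_limit.md:

```python
# Rough scripted version of the inspection; the container name is a placeholder.
import docker

client = docker.from_env()
attrs = client.containers.get("<ssh-session-container>").attrs

host_config = attrs["HostConfig"]
print("NanoCpus: ", host_config.get("NanoCpus"))    # scenario B expects 1500000000
print("Memory:   ", host_config.get("Memory"))      # scenario B expects 536870912
print("PidsLimit:", host_config.get("PidsLimit"))   # scenario B expects 256

# Scenario D: mounted host GPUs and the normalized container-side view.
for request in host_config.get("DeviceRequests") or []:
    print("DeviceIDs:", request.get("DeviceIDs"))
print([e for e in attrs["Config"]["Env"] if e.startswith("CUDA_VISIBLE_DEVICES")])
```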

Test Result

$ uv run pre-commit run --all-files
# All passed

$ uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
# 972 passed, 18 warnings in ~30s

Local-stack (CPU worker, worker_cpu_0):

  • A: worker logged the "SSH resource cap not configured" warning. flowmesh worker list showed no SSH limits. SSH container HostConfig {NanoCpus: 0, Memory: 0, PidsLimit: null}.
  • B: worker logged the configured cap. Registry round-tripped the same values. SSH container HostConfig {NanoCpus: 1500000000, Memory: 536870912, PidsLimit: 256}; in-container cgroup files corroborated.
  • C (spec < cap): container Memory: 268435456, NanoCpus: 1000000000; no clamp warning.
  • C (spec exceeds cap, dispatch refuses): task stayed PENDING; cancelled explicitly. The runtime-clamp path is exercised only by unit tests in this PR — end-to-end, any spec above the cap is rejected at dispatch before the executor sees it.

Local-stack (GPU worker, worker_gpu_all, 4× H200, image nvidia/cuda:12.9.1-base-ubuntu24.04, ENABLE_SSH_GPU_LIMIT=true):

  • D (no GPU spec): SSH container's DeviceRequests[0].DeviceIDs listed all 4 host GPU IDs; CUDA_VISIBLE_DEVICES="0,1,2,3"; nvidia-smi -L reported 4 GPUs.
  • D (count: 1): DeviceRequests[0].DeviceIDs contained exactly one host GPU ID (the first in the worker's device list); CUDA_VISIBLE_DEVICES="0"; nvidia-smi -L reported a single GPU.
  • D (type: H200): matched all 4 devices; with count: 1 the first H200 was mounted; CUDA_VISIBLE_DEVICES="0".
  • D (memory floor below per-device): matched all 4 devices; count: 1 mounted one; CUDA_VISIBLE_DEVICES="0".
  • D (count: 5 > N_GPUS): dispatcher refused; task stayed PENDING with no SSH container launched; cancelled explicitly.
  • Heterogeneous-GPU sub-case (different types or memory tiers on the same worker) not tested — the testbed has only 4× H200, so the executor's per-device type/memory filter was exercised only on a homogeneous fleet. The dispatcher-side rejection path was confirmed.

Follow-ups

  • lxcfs for /proc view isolation. Cgroup limits enforce actual usage, but /proc/cpuinfo, /proc/meminfo, etc. still report the host. lxcfs synthesizes per-cgroup /proc views via FUSE; a follow-up PR can wire an SSH_USE_LXCFS=true flag into SSHExecutor._build_run_kwargs to bind-mount /var/lib/lxcfs/proc/* into the session container (sketched after this list). This is a separate concern from cgroup enforcement; the same host-level lxcfs daemon serves all containers via per-reader cgroup lookup.
  • Make env parsing stricter. Raise a warning or an error when env parsing encounters an unexpected type.
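
For the lxcfs item above, the change would roughly amount to bind mounts like these in the session's run kwargs (proposed behavior only; nothing here is implemented in this PR):

```python
# Proposed only (not part of this PR): bind lxcfs-synthesized /proc views into the
# SSH session container when SSH_USE_LXCFS=true. Paths are the standard lxcfs ones.
LXCFS_PROC_FILES = ["cpuinfo", "meminfo", "stat", "uptime", "swaps", "diskstats"]


def lxcfs_volumes() -> dict[str, dict[str, str]]:
    return {
        f"/var/lib/lxcfs/proc/{name}": {"bind": f"/proc/{name}", "mode": "ro"}
        for name in LXCFS_PROC_FILES
    }


# Merged into the same Docker run kwargs that carry nano_cpus / mem_limit / pids_limit:
# run_kwargs["volumes"] = {**run_kwargs.get("volumes", {}), **lxcfs_volumes()}
```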

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

@kaiitunnz kaiitunnz marked this pull request as ready for review May 15, 2026 13:22
@kaiitunnz kaiitunnz requested a review from timzsu as a code owner May 15, 2026 13:22
@timzsu timzsu mentioned this pull request May 15, 2026

@timzsu timzsu left a comment


Have we implemented the GPU resource cap for SSH sessions? Or is it currently based purely on the assumption that the worker has exactly the number of GPUs that the user wants?

Comment thread cli/stack/src/flowmesh_cli_stack/env_schema.py Outdated
Comment thread src/server/env.py
@timzsu timzsu self-requested a review May 15, 2026 15:39
kaiitunnz added 8 commits May 16, 2026 15:18
SSH session containers previously had unbounded access to the host's
CPU and memory. Plumb an operator-set cap (SSH_MAX_CPU / SSH_MAX_MEMORY
/ SSH_MAX_PIDS) from the supervisor to the worker registry as a new
SSHLimits record, and apply min(task spec, worker cap) as Docker
nano_cpus / mem_limit / pids_limit when spawning the session. The
dispatcher's hw_satisfies clamps physical capacity by the cap for SSH
tasks so workers that cannot satisfy a request are filtered out before
silent under-provisioning at runtime. Unset values mean unbounded
(historical behavior), with a worker startup warning when SSH is
enabled but uncapped.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`EnvVar.min_value` / `max_value` previously only modeled inclusive
bounds. `SSH_MAX_CPU=0` slipped past stack-side validation and only
failed at worker boot. Add `min_inclusive` / `max_inclusive` flags
(default `True`, preserving existing behavior) and apply the strict
variant to `SSH_MAX_CPU` so a zero value is rejected at the source.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
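
A minimal sketch of the bound semantics described in this commit; the real EnvVar lives in the stack's env_schema.py, and this stand-in only illustrates the validation rule:

```python
# Stand-in only: exclusive vs. inclusive lower bound for a numeric env var.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundCheck:
    min_value: Optional[float] = None
    min_inclusive: bool = True  # default True preserves the old inclusive behavior

    def ok(self, value: float) -> bool:
        if self.min_value is None:
            return True
        return value >= self.min_value if self.min_inclusive else value > self.min_value


# SSH_MAX_CPU uses the strict variant, so 0 is rejected at stack validation:
assert not BoundCheck(min_value=0, min_inclusive=False).ok(0)
assert BoundCheck(min_value=0, min_inclusive=False).ok(0.5)
```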
Adds a third optional ``hardware`` parameter to ``Executor.__init__`` so
executors can read probed worker hardware without re-collecting it.
``initialize_executors`` now forwards the value alongside ``config`` and
``lifecycle``; subclasses switch to ``(*args, **kwargs)`` passthrough
(matching the existing ``GovernanceMixin`` pattern) so future signature
extensions don't break the chain. No executor consumes the new parameter
yet — wiring only.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
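
The passthrough pattern described above reads roughly like this (class and parameter names are simplified stand-ins for the actual Executor hierarchy):

```python
# Simplified stand-ins for the actual classes; the point is the cooperative
# *args/**kwargs chain, so a new base parameter never breaks subclass signatures.
class Executor:
    def __init__(self, config, lifecycle, hardware=None):
        self.config = config
        self.lifecycle = lifecycle
        self.hardware = hardware  # probed WorkerHardware, unused by most executors


class SSHExecutor(Executor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # hardware flows through without being named
```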
MPExecutor now accepts and forwards WorkerHardware to the inner executor
subprocess, mirroring the base Executor signature. A shared
make_worker_hardware test factory replaces the per-file fixture.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Centralize GPU type normalization, memory parsing, per-device matching,
and unified-memory fallback into shared.utils.hardware so dispatcher
filtering and worker-side slicing stay in sync.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
SSH containers now expose only the spec-requested subset of the worker's
GPUs (filtering by type and per-device memory, with a unified-memory
fallback) and write a normalized CUDA_VISIBLE_DEVICES matching the
container's 0..N-1 view of the mounted devices.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Default off — SSH session containers receive every worker GPU regardless
of `resources.hardware.gpu`. Operators turn slicing on with
ENABLE_SSH_GPU_LIMIT=true, which propagates from the supervisor through
the worker into SSHConfig.from_spec's GPU resolver.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/feat/limit-ssh-hw branch from 581ac46 to 6708603 Compare May 16, 2026 15:18

@timzsu timzsu left a comment


Some comments. PTAL.

Comment thread docs/ENV.md
SSH_MAX_PIDS=
# Whether to apply requested GPU limits to SSH tasks.
# If false, SSH tasks are allocated all available GPUs
# regardless of their resource requests.
Collaborator

I wonder if we should enable the GPU limit by default. In the long term, I think exposing exactly what the user requested is the cleaner default design. I do not see a scenario where we should disable the limit. What do you think?

Collaborator Author

I’m thinking about hardware utilization here. Unlike CPU and RAM, which are shared across workers and other processes, GPUs are assigned exclusively to a worker. If we only expose the spec-requested GPUs to the SSH task, any remaining GPUs on that worker may sit idle, which can reduce overall utilization and has billing implications as well.

Collaborator

Is it possible to allow multiple SSH sessions to share the same worker?

Collaborator Author

No, it is not. Only one SSH task can run at a time per worker.

Comment thread src/worker/executors/ssh_executor.py
kaiitunnz added 2 commits May 16, 2026 16:25
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Safety net for the case where a GPU-requesting SSH task is routed to a
worker reporting no host GPUs. Previously the resolver silently returned
an empty subset; it now raises ExecutionError, matching the other
unsatisfiable-request paths in the same function.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>