
Security hardening: pinned versions, locked deps, CUDA/PyTorch upgrades#13

Open
MadiatorLabs wants to merge 2 commits into runpod-workers:main from MadiatorLabs:main

Conversation

@MadiatorLabs
Contributor

Summary

Centralized version pinning with locked dependency chain

  • All version pins (ComfyUI, custom node SHAs, PyTorch, FileBrowser) controlled from docker-bake.hcl as single source of truth
  • Dockerfiles declare bare ARG names — no hardcoded defaults
  • Every Python dependency is locked at build time via pip-compile --generate-hashes — full hash-verified supply chain
  • Separate PyTorch pins for regular and 5090 images so versions can diverge independently
  • scripts/fetch-hashes.sh added to query GitHub API for latest custom node commit SHAs with HCL-formatted output
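The fetch-hashes.sh flow can be sketched roughly as follows. This is illustrative, not the actual script: the `latest_sha` and `hcl_variable` helpers are assumptions, and the pinned value in the example is just a stand-in taken from the bake file in this PR.

```shell
#!/usr/bin/env bash
# Rough sketch of a fetch-hashes.sh-style helper: query the GitHub API for a
# repo's latest commit SHA and emit it as a docker-bake.hcl variable block.
set -euo pipefail

latest_sha() {
  # $1 = "owner/repo"; requires network access
  curl -fsSL "https://api.github.com/repos/$1/commits?per_page=1" \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)[0]["sha"][:12])'
}

hcl_variable() {
  # $1 = variable name, $2 = value; emits docker-bake.hcl-style syntax
  printf 'variable "%s" {\n  default = "%s"\n}\n' "$1" "$2"
}

# Live usage would be:
#   hcl_variable "MANAGER_SHA" "$(latest_sha ltdrdata/ComfyUI-Manager)"
hcl_variable "MANAGER_SHA" "f41365abe957"
```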

Base image upgrades

  • Ubuntu: 22.04 → 24.04 for both images
  • Regular image: CUDA 12.4 → 12.6 (cu126 wheels), since CUDA 12.4 is no longer receiving PyTorch updates
  • 5090 image: CUDA 12.8 → 12.9 (cu129 wheels) for native Blackwell SASS, compatible with CUDA 12.8 hosts
  • PyTorch: 2.10.0 + torchvision 0.25.0 + torchaudio 2.10.0 for both images

ComfyUI-Manager compatibility

  • ComfyUI and all custom nodes are initialized as git repositories with tagged commits and upstream remotes at build time
  • Manager can detect installed versions, and users can update via the Manager UI at their own risk
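Conceptually, the build-time pattern for one node looks like this (a sketch only: the directory and SHA are placeholders, and `--allow-empty` stands in for the real checked-out sources that exist at build time):

```shell
# Sketch of the build-time git-init pattern for one custom node, so that
# ComfyUI-Manager can detect the installed version. NODE_DIR and NODE_SHA
# are placeholders; in the real build the directory already holds sources.
set -euo pipefail

NODE_DIR="$(mktemp -d)/ComfyUI-KJNodes"
NODE_SHA="deadbeef1234"   # would come from docker-bake.hcl

mkdir -p "$NODE_DIR"
cd "$NODE_DIR"
git init -q
git add -A
git -c user.name=- -c user.email=- commit -q --allow-empty -m "ComfyUI-KJNodes ${NODE_SHA}"
git remote add origin https://github.com/kijai/ComfyUI-KJNodes.git
```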

@MadiatorLabs changed the title from "Security changes now all is static" to "Security hardening: pinned versions, locked deps, CUDA/PyTorch upgrades" on Feb 24, 2026
@ssube-runpod

Regular image: CUDA 12.4 → 12.6 (cu126 wheels), since CUDA 12.4 is no longer receiving PyTorch updates

12.6 is not a widely supported version and not well represented in our fleet. 12.4 was the last LTS before 13.0, which is why we are using that.

5090 image: CUDA 12.8 → 12.9 (cu129 wheels) for native Blackwell SASS, compatible with CUDA 12.8 hosts

Please don't use 12.9 for anything, we are avoiding it across the board, so 12.8 or 13.0 are preferable.

fi

# Tail the log file
tail -f /workspace/runpod-slim/comfyui.log


We still may want this?

Contributor


The removal is safe. Here's why:

What the old loop did: On every boot, iterate all custom_nodes/*/ and re-run pip install -r requirements.txt, python install.py, pip install -e . for each.

Why it's not needed anymore:

  1. Baked nodes: All pre-installed nodes have their deps in the image already (hash-verified lock file). No boot-time install needed.
  2. User-installed nodes via Manager: When a user installs a node through ComfyUI-Manager at runtime, Manager handles the pip install and install.py during that session. Those packages go into the persistent venv on /workspace, so they survive container restarts.
  3. The old loop was wasteful: Re-running install.py on every boot could re-download models, overwrite user configs, or cause other side effects. Most well-written nodes check if setup is already done on import.

The only edge case is a node with a poorly-written install.py that doesn't persist its state — but that's a node bug, not something we should paper over with a boot-time loop that slows down every start.

Recommendation: The removal is correct. Document in the README that runtime-installed custom nodes are the user's responsibility and their deps persist in the venv across reboots.
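For reference, the removed loop's behavior was roughly the following. This is a reconstruction, not the verbatim original; a DRY_RUN guard is added here so the logic can be read and exercised without side effects.

```shell
# Reconstruction of the removed boot-time loop (illustrative, not verbatim).
# DRY_RUN=1 prints the commands instead of running them.
COMFYUI_DIR="${COMFYUI_DIR:-/workspace/runpod-slim/ComfyUI}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

for node_dir in "$COMFYUI_DIR"/custom_nodes/*/; do
  [ -d "$node_dir" ] || continue
  if [ -f "$node_dir/requirements.txt" ]; then
    run pip install -r "$node_dir/requirements.txt"
  fi
  if [ -f "$node_dir/install.py" ]; then
    run python "$node_dir/install.py"
  fi
  if [ -f "$node_dir/setup.py" ]; then
    run pip install -e "$node_dir"
  fi
done
```

On every boot this re-ran all three steps for every node, which is exactly the redundant work the baked, hash-verified image now avoids.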

@TimPietruskyRunPod self-requested a review March 11, 2026 09:21

@TimPietruskyRunPod left a comment


Review Summary

Solid PR — unified Dockerfile, pinned deps with hash verification, and faster cold starts are all significant improvements. A few items to address before merging:

  1. PR description has wrong CUDA versions — description says 12.6/12.9 but code uses 12.8/13.0. Please update the description to match.
  2. Venv path rename (.venv → .venv-cu128) is a breaking change for existing deployments with persistent /workspace. Should be documented in the PR description and README so users know a re-setup will be triggered on next boot.
  3. fetch-hashes.sh uses GNU-only grep -oP — breaks on macOS where developers run it.
  4. prebake-manager-cache.py registry fetch should be non-fatal since it's an optimization.
  5. Verify openssl is available in the runtime stage — needed by start.sh for SSH password generation.

See inline comments for details.

}
variable "MANAGER_SHA" {
default = "f41365abe957"
}

PR description mismatch: The PR description says regular image goes from CUDA 12.4 → 12.6 (cu126) and 5090 from 12.8 → 12.9 (cu129), but the actual code here sets 12-8/cu128 for regular and 13-0/cu130 for 5090. Please update the PR description to match the code.

set -e # Exit the script if any statement returns a non-true return value

COMFYUI_DIR="/workspace/runpod-slim/ComfyUI"
VENV_DIR="$COMFYUI_DIR/.venv"

Breaking change for existing deployments: Renaming from .venv to .venv-cu128 means existing containers with a persistent /workspace will not find VENV_DIR on next boot, triggering a full re-setup.

This is probably intentional (unifying with the old 5090 path), but it should be:

  1. Called out explicitly in the PR description as a migration impact
  2. Documented in the README or a migration note so users know to expect a one-time re-setup on first boot after upgrading
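One way to soften the migration is a one-time boot message when the legacy path is found. A sketch, not a prescription: the variable names mirror the start.sh excerpt above, and the warning text is illustrative.

```shell
# Sketch: detect the legacy venv on boot and warn before the one-time
# re-setup triggered by the .venv -> .venv-cu128 rename. Illustrative only.
COMFYUI_DIR="${COMFYUI_DIR:-/workspace/runpod-slim/ComfyUI}"
VENV_DIR="$COMFYUI_DIR/.venv-cu128"

if [ -d "$COMFYUI_DIR/.venv" ] && [ ! -d "$VENV_DIR" ]; then
  echo "NOTE: legacy venv found at $COMFYUI_DIR/.venv; a one-time re-setup into $VENV_DIR will run."
fi
```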

get_current_hash() {
local var_name="$1"
grep -A2 "variable \"${var_name}\"" "$BAKE_FILE" | grep -oP 'default\s*=\s*"\K[^"]*' || echo "unknown"
}

grep -oP requires GNU grep (Perl regex mode). macOS ships BSD grep which doesn't support -oP. Since this is a developer utility run locally (not in Docker), this will break for macOS developers.

Consider replacing with sed or awk, e.g.:

grep 'default' | sed 's/.*default *= *"\([^"]*\)".*/\1/'

Or at minimum add a comment/check that GNU grep is required.
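Putting the suggestion together, a portable version of the helper might look like this (a sketch; the fallback to "unknown" mirrors the original function):

```shell
# Portable rewrite of get_current_hash using POSIX sed instead of GNU-only
# grep -oP, so it also works with BSD grep/sed on macOS. Sketch, not final code.
get_current_hash() {
  var_name="$1"
  value=$(grep -A2 "variable \"${var_name}\"" "$BAKE_FILE" \
    | sed -n 's/.*default *= *"\([^"]*\)".*/\1/p' | head -1)
  echo "${value:-unknown}"
}
```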

echo "ERROR" ; return
}
echo "$response" | grep -oP '"sha"\s*:\s*"\K[a-f0-9]{12}' | head -1
}

Same GNU grep issue here — grep -oP with \K lookbehind won't work on macOS. Could use:

echo "$response" | grep -o '"sha" *: *"[a-f0-9]\{12\}' | head -1 | grep -o '[a-f0-9]\{12\}$'


# Fetch paginated registry and save as aggregated JSON
print(" Fetching ComfyRegistry (paginated)...")
nodes = fetch_registry_all()

If api.comfy.org is down or returns unexpected JSON (missing totalPages/nodes key), the entire Docker build fails with an unhelpful traceback. Since this cache is a cold-start optimization (not required — Manager re-fetches after 24h), consider wrapping the registry fetch in a try/except that prints a warning and continues:

try:
    nodes = fetch_registry_all()
    registry_data = json.dumps(nodes, separators=(",", ":"))
    fname = cache_filename(REGISTRY_URL)
    (CACHE_DIR / fname).write_bytes(registry_data.encode())
    print(f"  Cached {len(nodes)} registry nodes")
except Exception as e:
    print(f"  WARNING: Registry fetch failed ({e}), skipping — Manager will fetch on first start")

git remote add origin https://github.com/ltdrdata/ComfyUI-Manager.git && \
cd /tmp/build/ComfyUI/custom_nodes/ComfyUI-KJNodes && \
git init && git add -A && git -c user.name=- -c user.email=- commit -q -m "ComfyUI-KJNodes ${KJNODES_SHA}" && \
git remote add origin https://github.com/kijai/ComfyUI-KJNodes.git && \

Verify openssl availability in runtime stage: start.sh uses openssl rand -base64 12 for SSH password generation when no PUBLIC_KEY is set. The runtime stage doesn't explicitly install openssl. It likely comes as a transitive dependency of openssh-server or libssl-dev, but commit 6f77fe5 on main ("chore: add openssl") suggests this was previously an issue.

Please verify it's present in the final image, or add it explicitly to the apt-get install list to be safe.
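The fallback in question can be sketched like this (illustrative: variable names are assumptions about start.sh, and the explicit guard makes a missing openssl binary fail loudly rather than cryptically):

```shell
# Sketch of the start.sh password fallback: when no PUBLIC_KEY is provided,
# generate a random SSH password with openssl. The command -v check turns a
# missing openssl into a clear error instead of an obscure one.
if [ -z "${PUBLIC_KEY:-}" ]; then
  if ! command -v openssl >/dev/null 2>&1; then
    echo "ERROR: openssl not found in image; cannot generate SSH password" >&2
    exit 1
  fi
  SSH_PASSWORD=$(openssl rand -base64 12)
  echo "Generated one-time SSH password (${#SSH_PASSWORD} chars)"
fi
```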
