Skip to content

nvidia-docker: unbreak the runc symlink#280087

Merged
SomeoneSerge merged 5 commits intoNixOS:masterfrom
SomeoneSerge:fix/nvidia-docker-runtime
Jan 11, 2024
Merged

nvidia-docker: unbreak the runc symlink#280087
SomeoneSerge merged 5 commits intoNixOS:masterfrom
SomeoneSerge:fix/nvidia-docker-runtime

Conversation

@SomeoneSerge
Copy link
Contributor

(cherry picked from commit 1e1eb8b)

Description of changes

"${nvidia-docker}/bin/nvidia-container-runtime" was broken referring to a non-existent file; we should probably delete and deprecate nvidia-docker

Note that nvidia-docker had been deprecated upstream for a long time, we should actually remove it. Our libnvidia-docker and nvidia-container-toolkit are terribly outdated too. I started preparing the updates in #279235 but I felt uncomfortable moving forward with them because it turned out that docker run --gpus (and dokcer run --runtime nvidia) is broken even master (at least on my host), both with and without the updates.

All that stuff is out of scope for this PR, this only includes the obvious fixes and a small refactoring that helps with debugging:

  • Easier overrides and inspectability
  • Support nix-update nvidiaCtkPackages.nvidia-container-toolkit-docker, etc
  • Add missing meta to the top-level attributes

I can look into updates later, but first I'd like to learn why --gpus all broke in the first place.

I'm not exposing nvidia-container-toolkit-* as top-level attributes and do not recurse into nvidiaCtkPackages because there's no need to yet (keeping the interface smaller)

I also don't see any reason to keep using the symlinkJoins, but this doesn't matter because they likely go away together with nvidia-docker?

Seeing how this PR is meant to be small, I intend to merge it as soon as Ofborg allows, unless anybody objects.

CC @jmbaur @GTrunSec @cpcloud

Also CC @kirillrdy @siraben @aaronjheng based on git-blame.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

(cherry picked from commit 42ed2f8)
...this way we expose and allow overriding the symlinkJoin constituent components

(cherry picked from commit 1142433)
@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Jan 10, 2024
@SomeoneSerge SomeoneSerge mentioned this pull request Jan 10, 2024
21 tasks
@ofborg ofborg bot added the 8.has: package (new) This PR adds a new package label Jan 10, 2024
@ofborg ofborg bot requested a review from cpcloud January 10, 2024 19:59
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Jan 10, 2024
Copy link
Contributor

@jmbaur jmbaur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. In a future change, I would like to start using Nvidia's tooling for CDI (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html) that AFAIK would allow us to ditch using nvidia-container-runtime stuff altogether and just generate a single yaml file on bootup that describes the hardware, then the container runtime knows how to read that natively. Podman supports it well and it will be in docker v25 which should be released soon. I've gotten it working for jetson devices, but have yet to try it out on x86_64.

@SomeoneSerge SomeoneSerge merged commit 093f4f5 into NixOS:master Jan 11, 2024
@SomeoneSerge
Copy link
Contributor Author

@jmbaur that sounds great, do ping me in the PR!

@aaronmondal
Copy link
Contributor

aaronmondal commented Jan 11, 2024

FYI @jmbaur @SomeoneSerge I'm also trying to get this to work. Adjacent to #280184 I sent #278969 which still works with the docker --gpus approach and can generate CDIs but couldn't use the CDIs. I ran into some runc issues so maybe this patch fixes those issues. I'll try rebasing and seeing how things go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.topic: cuda Parallel computing platform and API 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

3 participants