nvidia-docker: unbreak the runc symlink by SomeoneSerge · Pull Request #280087 · NixOS/nixpkgs

SomeoneSerge · 2024-01-10T18:50:28Z

Description of changes

"${nvidia-docker}/bin/nvidia-container-runtime" was broken: it referred to a non-existent file. We should probably deprecate and delete nvidia-docker.

Note that nvidia-docker has been deprecated upstream for a long time, so we should actually remove it. Our libnvidia-container and nvidia-container-toolkit are terribly outdated too. I started preparing the updates in #279235, but I felt uncomfortable moving forward with them because it turned out that docker run --gpus (and docker run --runtime nvidia) is broken even on master (at least on my host), both with and without the updates.
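
For readers who want to check a built output for the kind of breakage this PR fixes (a symlink in bin/ pointing at a file that does not exist), here is a minimal Python sketch. It is not part of this PR; the function name find_dangling_symlinks is hypothetical, and the directory you pass in would be whatever store path you realized locally.

```python
import os
import tempfile
from pathlib import Path


def find_dangling_symlinks(root: Path) -> list[Path]:
    """Return, sorted, every symlink under `root` whose target does not exist.

    Hypothetical helper for illustration only; not taken from nixpkgs.
    """
    dangling = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists every symlink in dirnames or filenames.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # is_symlink() looks at the link itself; exists() follows it.
            if path.is_symlink() and not path.exists():
                dangling.append(path)
    return sorted(dangling)


if __name__ == "__main__":
    # Self-contained demo on a throwaway directory with one good
    # and one broken link.
    with tempfile.TemporaryDirectory() as tmp:
        bin_dir = Path(tmp) / "bin"
        bin_dir.mkdir()
        (bin_dir / "real").write_text("#!/bin/sh\n")
        (bin_dir / "good-link").symlink_to(bin_dir / "real")
        (bin_dir / "broken-link").symlink_to(bin_dir / "does-not-exist")
        for link in find_dangling_symlinks(Path(tmp)):
            print(f"dangling: {link} -> {os.readlink(link)}")
```

Pointing the same function at a realized package output would list any links like the one described above.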

All of that is out of scope for this PR, which only includes the obvious fixes and a small refactoring that helps with debugging:

Easier overrides and inspectability
Support nix-update nvidiaCtkPackages.nvidia-container-toolkit-docker, etc
Add missing meta to the top-level attributes

I can look into updates later, but first I'd like to learn why --gpus all broke in the first place.

I'm not exposing nvidia-container-toolkit-* as top-level attributes, and I do not recurse into nvidiaCtkPackages, because there's no need to yet (this keeps the interface smaller).

I also don't see any reason to keep using the symlinkJoins, but this doesn't matter because they will likely go away together with nvidia-docker.

Seeing how this PR is meant to be small, I intend to merge it as soon as Ofborg allows, unless anybody objects.

CC @jmbaur @GTrunSec @cpcloud

Also CC @kirillrdy @siraben @aaronjheng based on git-blame.

jmbaur

LGTM. In a future change, I would like to start using Nvidia's tooling for CDI (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html), which AFAIK would allow us to ditch the nvidia-container-runtime stuff altogether and just generate a single YAML file on boot that describes the hardware; the container runtime then knows how to read that file natively. Podman supports it well, and it will be in Docker v25, which should be released soon. I've gotten it working for Jetson devices, but have yet to try it out on x86_64.
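
To make the "single file that describes the hardware" idea concrete, here is a rough Python sketch of what a minimal CDI spec could look like. This is an illustration under assumptions, not what Nvidia's tooling actually emits. The field names (cdiVersion, kind, devices, containerEdits, deviceNodes) follow the CDI specification as I understand it; the version string and device paths are placeholders, and make_cdi_spec is a hypothetical helper. It prints JSON rather than YAML only to stay within the standard library.

```python
import json


def make_cdi_spec(gpu_indices: list[int]) -> dict:
    """Build a minimal CDI-style spec for the given GPU indices.

    Hypothetical helper: a real spec generated by Nvidia's tooling also
    lists driver libraries, hooks, and further device nodes.
    """
    devices = [
        {
            # Device name, combined with `kind` when requesting a device.
            "name": str(index),
            "containerEdits": {
                # Assumed device node path, for illustration only.
                "deviceNodes": [{"path": f"/dev/nvidia{index}"}],
            },
        }
        for index in gpu_indices
    ]
    return {
        "cdiVersion": "0.5.0",  # placeholder version string
        "kind": "nvidia.com/gpu",
        "devices": devices,
        # Edits applied to every container that requests any device.
        "containerEdits": {"deviceNodes": [{"path": "/dev/nvidiactl"}]},
    }


if __name__ == "__main__":
    print(json.dumps(make_cdi_spec([0]), indent=2))
```

A CDI-aware runtime would then resolve a request such as nvidia.com/gpu=0 against a file like this, instead of going through nvidia-container-runtime.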

SomeoneSerge · 2024-01-11T04:28:35Z

@jmbaur that sounds great, do ping me in the PR!

aaronmondal · 2024-01-11T16:56:24Z

FYI @jmbaur @SomeoneSerge I'm also trying to get this to work. Adjacent to #280184, I sent #278969, which still works with the docker --gpus approach and can generate CDI specs but couldn't use them. I ran into some runc issues, so maybe this patch fixes those. I'll try rebasing and see how things go.

SomeoneSerge added 5 commits January 10, 2024 17:50

nvidia-docker: unbreak the runc symlink

2b3eaf5

(cherry picked from commit 1e1eb8b)

libnvidia-container: set mainProgram

88f438f

(cherry picked from commit 42ed2f8)

nvidiaCtkPackages: init

336e221

...this way we expose and allow overriding the symlinkJoin constituent components (cherry picked from commit 1142433)

nvidia-docker: support config.toml as an attrset argument

5e7c297

nvidia-docker: add missing meta

4160504

SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Jan 10, 2024

SomeoneSerge mentioned this pull request Jan 10, 2024

apptainer: unbreak --nv #279235

Closed

21 tasks

ofborg bot added the 8.has: package (new) This PR adds a new package label Jan 10, 2024

ofborg bot requested a review from cpcloud January 10, 2024 19:59

ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Jan 10, 2024

jmbaur approved these changes Jan 11, 2024

SomeoneSerge merged commit 093f4f5 into NixOS:master Jan 11, 2024

aaronmondal mentioned this pull request Jan 11, 2024

nvidia-container-toolkit: 1.9.0 -> 1.15.0-rc.3 #278969

Merged

13 tasks

SomeoneSerge merged 5 commits into NixOS:master from SomeoneSerge:fix/nvidia-docker-runtime