NixOS: Add support for CDI#284507
Conversation
83f87d0 to
6511fb2
Compare
6511fb2 to
46afc98
Compare
553836e to
8d2508a
Compare
pkgs/applications/virtualization/nvidia-container-toolkit/default.nix
Outdated
Show resolved
Hide resolved
0d7dd00 to
95e9d84
Compare
|
The last commit contains more cleanup than is needed, since we support Docker < 25, and thus, we'll still need the container runtime approach, given the CDI support was included in Docker 25. When the final PR is submitted I'll get rid of that cleanup commit. |
aaronmondal
left a comment
There was a problem hiding this comment.
I like how this cleanup makes it easier to keep track of all the nvidia dependencies.
One thing I'd like to add is that some sort of integration test would probably be a good idea. I'm not sure whether this is possible as they'd probably have to run in a NixOS VM though.
@SomeoneSerge Might have some comments on the new layout of the package.nix/default.nix files.
SomeoneSerge
left a comment
There was a problem hiding this comment.
Overall looks much simpler and cleaner than what we had
4ad8cfa to
a252928
Compare
|
I'd not tackle |
SomeoneSerge
left a comment
There was a problem hiding this comment.
Thank you for the latest round of changes! I think the warnings look good. Ofborg is happy. I'm attaching a few more questions of which only the one about cdi.{dynamic,static} makes me hesitate about merging.
This is part of the to-be public interface so my assessment is that it's important to resolve this concern now. I don't know what the actual cost of introducing an interface in nixpkgs-unstable and potentially changing it before the release is.
You've done a lot of work on this PR, and I think we're almost there. Sorry to be annoying.
nixos/modules/services/hardware/nvidia-container-toolkit/cdi-generate.nix
Outdated
Show resolved
Hide resolved
nixos/modules/services/hardware/nvidia-container-toolkit/cdi-generate.nix
Outdated
Show resolved
Hide resolved
nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator/cdi-generate.nix
Outdated
Show resolved
Hide resolved
nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator/cdi-generate.nix
Outdated
Show resolved
Hide resolved
SomeoneSerge
left a comment
There was a problem hiding this comment.
I feel a bit disoriented, but I think all of my remaining comments are mere nitpicks. We also outlined a few TODOs for the immediate future (correct handling of closures, decisions about /usr/bin, libc priorities, deciding about whether to delete nvidia-container-toolkit-cdi-generate as a separate nixos module).
I think I'm convinced enough by the public-facing side of the PR. I also think this should just work for most of the basic use-cases (the overlay experience may need more looking into; and yes, I haven't yet tested this branch myself). In the (hopefully unlikely) case this PR does break somebody's workflow, I am online and respond to github pings, and I expect @ereslibre to be available as well, so we can always discuss and fix stuff.
...Is it time?
Likewise here, I'm happy to take ownership of this bit. |
nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator/cdi-generate.nix
Outdated
Show resolved
Hide resolved
aaronmondal
left a comment
There was a problem hiding this comment.
So I just tested this on an instance with a T4 GPU on GCP. Seems to "just work" out of the box.
One thing I'm noticing is that the generated cdi file is not symlinked to /etc/cdi/nvidia.yaml as described in the nvidia docs. This doesn't seem to be an issue but higher level tooling (like the nvidia-operator for k8s) might have issues with this. I'd expect this to be solvable on the side of said tooling though, so this seems fine to me (and we can always change this later).
Fantastic work!
|
I just want to double check a couple things and answer all remaining conversations on this PR and mark it as ready to review, sorry for this last wait. Final stretch :) Thanks a ton for your confirmation @aaronmondal :) |
We'll have to check with the nvidia-operator and other components, but in principle both /etc/cdi and /var/run/cdi should be inspected by the runtimes (according to the CDI spec, prioritizing /var/run/cdi over /etc/cdi). With this implementation, users can provide their static configuration, that is placed under /etc/cdi, and the "dynamic configuration", that is only implemented for nvidia (as of today) will place the dynamic configuration under /run/cdi (to which /var/run is linked to in NixOS.) |
|
Everything seems to be fine, this is the only thing I cannot explain: What is exactly the meaning of Anyhow, I think we can move on, and fix any issues that might appear due to this change. As I have stated before, I take ownership of this logic and I'm happy to address those issues. |
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/docker-rootless-with-nvidia-support/37069/21 |
|
I just want to say thanks for the work. I was watching this issue closely as I would benefit a lot from this. Thanks! |
nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator/cdi-generate.nix
Show resolved
Hide resolved
nixos/modules/services/hardware/nvidia-container-toolkit-cdi-generator/cdi-generate.nix
Show resolved
Hide resolved
|
Likewise, I have been looking into this issue and just found this pull request. Thank you for all the work on this issue - greatly appreciated. I have linked this pull request to some reddit questions about GPU use with docker/podman on NixOS where others are trying to find their way through it. Many thanks! |
Thanks a lot for this @stuzenz. Just a reminder that this feature was updated in subsequent PR's and that |
Description of changes
This builds on top of #278969. It provides two main features to NixOS related to the Container Device Interface (CDI):
virtualisation.containers.cdi.staticattrset of JSONs. This can be used to provide static valid JSON files for CDI that will be placed under/etc/cdi.virtualisation.containers.cdi.dynamic.nvidia.enableboolean. When enabled, it will use the nvidia-container-toolkit to generate a CDI spec based on the current machine, as a systemd unit file. The generated JSON file will be patched to accomodate the NixOS host system, and will be placed under/run/cdi/nvidia-container-toolkit.json.As per the CDI spec, CDI-compliant runtimes need to inspect
/etc/cdi(static configuration) and/var/run/cdi(dynamic configuration); prioritizing what is under the dynamic configuration over the static configuration in case of clash.So, this implementation is compliant with this specification.
Things done
nix.conf? (See Nix manual)sandbox = relaxedsandbox = truenix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/)Add a 👍 reaction to pull requests you find important.