nvidia-container-toolkit: 1.9.0 -> 1.15.0-rc.3 (#278969)
SomeoneSerge merged 2 commits into NixOS:master
Conversation
FYI bumping to 1.14.3 doesn't seem like a good option, since it segfaults at runtime due to NVIDIA/go-nvml#36.
Force-pushed from 0618e5e to 55efde8.
@aaronmondal Did you try to run podman with the generated CDI spec? I think it would make sense to rework the podman nvidia integration to work on top of CDI.
Is there much point giving them the real ldconfig if ldconfig is never going to do the expected thing on NixOS (and shouldn't be expected to, in general)?
Without this it triggers this error:
nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig: no such file or directory: unknown.
I'm not sure whether this is the right way to fix it though.
I think the correct patch would be to make ldconfig optional. Using ldconfig the way they do is wrong by design.
Replace with a link to coreutils true, perhaps?
In the current state of this PR the nvidia-ctk tool doesn't properly autodetect CUDA.
Do you know which "mode" it's trying to use to discover the libraries? E.g. libnvidia-container, csv, etc
EDIT: also, in case it wasn't clear, this ugly patch was sufficient to make nvidia-container-cli info (from libnvidia-container) work again, and I believe nvidia-ctk uses nvidia-container-cli in one of the "modes": 35b1062#diff-2b4dc4504c07052fdeb991c058ab1cd1b3fc215f2475fddab960ebea2db772e7
Thanks for opening the PR! Fingers crossed jaja
CDI generation auto-detects mode as "nvml".
It's using different modes during container invocations though. In #280184 it's complaining like this, so I guess there it's failing in "legacy" mode?
docker run --gpus=all -it ubuntu bash
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container
process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig: no such file or directory: unknown.
ERRO[0000] error waiting for container:
It works in this patch because of one of the changed /sbin/ldconfig paths; I'm not sure which one though 😅 I'm also not sure whether just brute-forcing this is the way to go. It's certainly curious that this works with the nixpkgs variant of ldconfig. Does it just look for the file but not invoke it?
This particular error should be addressed by the linked patch
I see. The patch approach seems better than the substitutions I'm doing here. Looks like the remaining issue is the runc/crun crash. I believe this is the same issue you also encountered in #279235 (comment). Not being able to get any useful information out of --debug mode certainly makes this somewhat tricky to figure out.
OK, podman doesn't work:
podman run -it --rm --device nvidia.com/gpu=all ubuntu bash
Error: OCI runtime error: crun: {"msg":"error executing hook `/run/current-system/sw/bin/nvidia-ctk` (exit code: 1)","level":"error","time":"..."}
The corresponding CDI spec (/etc/cdi/nvidia.yaml) looks like this:
---
cdiVersion: 0.5.0
containerEdits:
deviceNodes:
- path: /dev/nvidia-modeset
- path: /dev/nvidia-uvm
- path: /dev/nvidia-uvm-tools
- path: /dev/nvidiactl
hooks:
- args:
- nvidia-ctk
- hook
- update-ldcache
- --folder
- /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib
hookName: createContainer
path: /run/current-system/sw/bin/nvidia-ctk
mounts:
- containerPath: /etc/egl/egl_external_platform.d/10_nvidia_wayland.json
hostPath: /etc/egl/egl_external_platform.d/10_nvidia_wayland.json
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /etc/egl/egl_external_platform.d/15_nvidia_gbm.json
hostPath: /etc/egl/egl_external_platform.d/15_nvidia_gbm.json
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libEGL_nvidia.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libEGL_nvidia.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libGLESv1_CM_nvidia.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libGLESv1_CM_nvidia.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libGLESv2_nvidia.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libGLESv2_nvidia.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libGLX_nvidia.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libGLX_nvidia.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libcuda.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libcuda.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libcudadebugger.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libcudadebugger.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libglxserver_nvidia.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libglxserver_nvidia.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvcuvid.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvcuvid.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-allocator.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-allocator.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-cfg.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-cfg.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-egl-gbm.so.1.1.0
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-egl-gbm.so.1.1.0
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-eglcore.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-eglcore.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-encode.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-encode.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-fbc.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-fbc.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-glcore.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-glcore.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-glsi.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-glsi.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-glvkspirv.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-glvkspirv.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-gpucomp.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-gpucomp.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-ml.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-ml.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-ngx.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-ngx.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-nvvm.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-nvvm.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-opencl.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-opencl.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-opticalflow.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-opticalflow.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-pkcs11-openssl3.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-pkcs11-openssl3.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-pkcs11.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-pkcs11.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-ptxjitcompiler.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-ptxjitcompiler.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-rtcore.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-rtcore.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-tls.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvidia-tls.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvoptix.so.545.29.06
hostPath: /nix/store/wyfi50q4mcipw1xr0r1hxyzzaimzm593-nvidia-x11-545.29.06-6.1.69/lib/libnvoptix.so.545.29.06
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /run/current-system/sw/bin/nvidia-cuda-mps-control
hostPath: /run/current-system/sw/bin/nvidia-cuda-mps-control
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /run/current-system/sw/bin/nvidia-cuda-mps-server
hostPath: /run/current-system/sw/bin/nvidia-cuda-mps-server
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /run/current-system/sw/bin/nvidia-debugdump
hostPath: /run/current-system/sw/bin/nvidia-debugdump
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /run/current-system/sw/bin/nvidia-smi
hostPath: /run/current-system/sw/bin/nvidia-smi
options:
- ro
- nosuid
- nodev
- bind
devices:
- containerEdits:
deviceNodes:
- path: /dev/nvidia0
- path: /dev/dri/card0
- path: /dev/dri/renderD128
hooks:
- args:
- nvidia-ctk
- hook
- create-symlinks
- --link
- ../card0::/dev/dri/by-path/pci-0000:00:05.0-card
- --link
- ../renderD128::/dev/dri/by-path/pci-0000:00:05.0-render
hookName: createContainer
path: /run/current-system/sw/bin/nvidia-ctk
- args:
- nvidia-ctk
- hook
- chmod
- --mode
- "755"
- --path
- /dev/dri
hookName: createContainer
path: /run/current-system/sw/bin/nvidia-ctk
name: "0"
- containerEdits:
deviceNodes:
- path: /dev/nvidia0
- path: /dev/dri/card0
- path: /dev/dri/renderD128
hooks:
- args:
- nvidia-ctk
- hook
- create-symlinks
- --link
- ../card0::/dev/dri/by-path/pci-0000:00:05.0-card
- --link
- ../renderD128::/dev/dri/by-path/pci-0000:00:05.0-render
hookName: createContainer
path: /run/current-system/sw/bin/nvidia-ctk
- args:
- nvidia-ctk
- hook
- chmod
- --mode
- "755"
- --path
- /dev/dri
hookName: createContainer
path: /run/current-system/sw/bin/nvidia-ctk
name: all
kind: nvidia.com/gpu
@jmbaur's patches allow adding a new hook:
hooks:
- args:
- nvidia-ctk
- hook
- update-ldcache
- --ldconfig-path
- myldconfigpath
- --folder
- /nix/store/qhw7ag7945046gm7z2sryx266hk5masw-nvidia-x11-545.29.06-6.1.71/lib
hookName: createContainer
path: /run/current-system/sw/bin/nvidia-ctk
But it doesn't seem to have any effect on whatever is failing at runtime.
Thank you for this WIP! I made podman work based on this derivation, but I want to put all the pieces together in a better way. I have set up my NixOS environment with the following setting:
etc."cdi/nvidia.yaml".text = ''
---
cdiVersion: 0.5.0
containerEdits:
deviceNodes:
- path: /dev/nvidia-modeset
- path: /dev/nvidia-uvm
- path: /dev/nvidia-uvm-tools
- path: /dev/nvidiactl
hooks:
- args: []
hookName: createContainer
path: /nix/store/fcjkd0v1ybrgd3fvvpljj2m526wvi4f5-container-toolkit-container-toolkit-1.15.0-rc.1/bin/nvidia-ctk
mounts:
... <more data> ...
''
I generated this CDI content by running: … On this generated file I removed the … I think it would be interesting to expose a NixOS option for CDI, maybe under … I hope to be able to propose something while you are also working on it.
Oh damn, this also works with docker!
docker run \
--rm -ti \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all \
-v /nix:/nix \
ubuntu \
/run/current-system/sw/bin/nvidia-smi
It seems that manually adjusting the generated CDI to have … I guess there are a few things that need changing then:
I wonder whether it would make sense to make CDI the default. It's already the recommended approach for podman, and docker will support CDI for the "--device" syntax in the upcoming release: docker/cli#3864. In the meantime we could use the … If I remember correctly the …
I agree that we should move to CDI and make that the default. We also have to take into account the use case of cross-compiling or building a NixOS system for a remote machine, where we cannot use …
Although I will keep this PR updated with what I find.
pkgs/applications/virtualization/nvidia-container-toolkit/default.nix
Although nvidia-smi works in the case I described, an application relying on CUDA (e.g. ludwigai/ludwig-gpu:latest) does not identify the GPUs.
I don't know yet what exactly update-ldcache refers to, but FHS apps rely on /etc/ld.so.{cache,conf} to discover "global" libraries, whereas NixOS deploys the impure drivers in a predefined location. You'd have to generate the /etc/ld.so.* files inside the container so that they are aware of the drivers' location for FHS apps to work, and you would also have to mount the /run/opengl-driver/lib link farm for Nixpkgs apps to work. Alternatively, you can set LD_LIBRARY_PATH for both.
You could test if the Nixpkgs apps currently work for you by building and docker load-ing something simple like
with import ./. { config.allowUnfree = true; };
dockerTools.buildLayeredImage rec{
name = cudaPackages.saxpy.pname;
tag = "latest";
contents = buildEnv {
inherit name;
paths = [ cudaPackages.saxpy ];
};
}
I'm assuming update-ldcache is useful given that we inject libraries from the host into the container with the CDI hooks. We cannot assume the container image will be NixOS: chances are it won't be.
I think we probably still want the update-ldcache hook for all containerized and packaged software we can run from within a NixOS host, that get the CUDA libraries injected from the host.
This hook is a no-op if it cannot find /etc/ld.so.{cache,conf}, which is also fine.
Okay, I was able to make LocalAI work just fine with podman and CDI. I removed the ldcache hook; it's not really mandatory. I tried a couple of approaches:
- Improve the update-ldcache subcommand of nvidia-container-toolkit by adding -n to the ldconfig call. This made the command succeed, along with a change to point to an existing ldconfig in the Nix store.
- Delete the update-ldcache hook and add a CDI hook that alters the LD_LIBRARY_PATH envvar, like so:
{
"name": "0",
"containerEdits": {
"env": [
"LD_LIBRARY_PATH=/nix/store/l543ki4i4z56gc9gx5p4qzna2m24aywr-nvidia-x11-545.29.06-6.1.72/lib:$LD_LIBRARY_PATH"
],
"deviceNodes": [
{
"path": "/dev/nvidia1"
...
I'm feeling more positive about the second option; I think it's probably less brittle. This means ditching the first option and never needing to call update-ldcache on the container root. When I'm happy with the changes I can share them with you.
Also, do you know if we could create a mechanism that generates a CDI spec by calling nvidia-ctk cdi generate on boot, and writes the result, with our modifications, to /etc/cdi/nvidia.yaml? I think a systemd unit that performs this on boot would be ideal.
WDYT?
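The boot-time generation idea could be sketched as a oneshot systemd unit that runs the generator before the container engines start. This is only an illustration: the unit name, ordering, and paths here are assumptions, not anything merged in this PR.

```ini
# Hypothetical unit sketch; names and ordering are illustrative only.
[Unit]
Description=Generate the NVIDIA CDI specification
# Run after kernel modules are loaded, before container engines come up
After=systemd-modules-load.service
Before=podman.service docker.service

[Service]
Type=oneshot
ExecStart=/run/current-system/sw/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

[Install]
WantedBy=multi-user.target
```

Any post-processing of the generated spec would have to happen inside ExecStart (e.g. via a wrapper script), since nvidia-ctk writes the file directly.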
update-ldcache
There are two tiers of needs: there's what NixOS needs, and there's the fact that the Nixpkgs package is broken. At the bare minimum, we want the NixOS module to work. Ideally, we want the Nixpkgs package to work both in FHS and on NixOS with or without a dedicated module.
For the latter we do need, imo, to patch both libnvidia-container and nvidia-docker-toolkit to skip ldcache if there's a static configuration available (e.g. if we taught it about @driverLink@ at build time and the directory happens to exist at runtime, although that's rather specific).
For the former it'd be seemingly enough to update the module to deploy etc."cdi/nvidia.yaml"
Also, do you know if we could create a mechanism that generates a CDI spec by calling to nvidia-ctk cdi generate on boot, and writes the result with our modifications on /etc/cdi/nvidia.yaml
I would first consider generating a static nvidia.yaml, described in the Nix language?
options.xxxxxxx.cdi.settings = mkOption { type = (pkgs.formats.toml { }).type; /* ... */ } I guess, and etc."cdi/nvidia.yaml" = pkgs.formats.toml.generate config.xxxxxxx.cdi.settings
My main motivation is to avoid any weird nvidia stuff doing mutable operations on boot
Do you mean having something like: … Or do you mean literally having "typed" CDI settings? I would say the latter is not worth the effort. Just double-checking that we are on the same page.
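A minimal sketch of the static approach being discussed, i.e. a CDI spec described in Nix and rendered into /etc/cdi at build time. The option path, attribute contents, and use of pkgs.formats.yaml (rather than toml) are assumptions for illustration, not the actual module:

```nix
{ pkgs, ... }:
{
  # Hypothetical: render a CDI spec written in Nix straight into /etc/cdi,
  # avoiding any mutable generation step at boot.
  environment.etc."cdi/nvidia.yaml".source =
    (pkgs.formats.yaml { }).generate "nvidia-cdi.yaml" {
      cdiVersion = "0.5.0";
      kind = "nvidia.com/gpu";
      containerEdits = { deviceNodes = [ { path = "/dev/nvidiactl"; } ]; };
      devices = [ ];
    };
}
```

The drawback, as noted later in the thread, is that device enumeration and driver store paths are runtime data, so a fully static spec only works if those are known at evaluation time.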
@SomeoneSerge do you know whether, if a runtime supports fully fledged CDI, nvidia-container-runtime is still required instead of running runc/crun directly?
One problem with setting LD_LIBRARY_PATH is that we cannot augment this envvar with CDI, only set/replace it. This is not working, and is not meant to work (confirmed on CNCF Slack, on tag-runtime). The pattern "export VAR=/something/else:$VAR" is not going to fly on …
We probably want to track the migration to CDI and the nvidia-docker deprecation and removal in a dedicated issue (so I stop spamming this PR :P). I don't see an issue open for that; do you have any thoughts on that @SomeoneSerge @aaronmondal?
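The set/replace limitation can be seen without any container runtime: env entries in a CDI/OCI spec are passed to the process verbatim, with no shell involved, so a `$VAR` reference survives as a literal string. A quick local illustration (the path is hypothetical):

```shell
# CDI/OCI env entries are literal; nothing expands $LD_LIBRARY_PATH here.
# env -i clears the environment, then sets the variable to the literal string.
env -i 'LD_LIBRARY_PATH=/drivers/lib:$LD_LIBRARY_PATH' \
  sh -c 'echo "$LD_LIBRARY_PATH"'
# prints: /drivers/lib:$LD_LIBRARY_PATH
```

So appending to a pre-existing container LD_LIBRARY_PATH via CDI env edits is not possible; only wholesale replacement is.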
On this document anchor you can find more information about NVIDIA's rationale for their shift to running ldconfig on the container root vs. using LD_LIBRARY_PATH with the mapped libraries.
Not sure if this is what you are asking, but Podman has full CDI support and I don't think it needs nvidia-container-runtime. |
Thank you. Yes, this was my question. Podman, CRI-O, containerd, and Moby all seem to have integration with the Container Device Interface. I think we can get rid of the runtime wrappers. I was only unsure whether the nvidia runtime performs something that is missing from CDI, as I didn't check exhaustively.
More information: there is ongoing work to allow configuring the ldconfig path as well as the invocation parameters at https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/525 (already mentioned in #278969 (comment)). This is going to help our use case; we will be able to get rid of the ldconfig patching and potentially the tweaking of the generated CDI spec.
This is the WIP that I have, using this rebased PR as a base: …
This is a manual way of defining CDI specs with their content. This instance will be mapped to …
virtualisation.containers.cdi = {
nvidia = builtins.fromJSON ''
{
"cdiVersion": "0.5.0",
...
}
'';
other-provider = builtins.fromJSON ''
{
"cdiVersion": "0.5.0",
...
}
'';
};
This is an automatic way of generating the CDI spec:
virtualisation.containers.cdi.nvidia = "nvidia-ctk-generate";
I am not sold on the types yet, because you could do something like the following:
virtualisation.containers.cdi = {
nvidia = "nvidia-ctk-generate";
other-provider = "nvidia-ctk-generate";
};
And now you would have … Another problem is that when you ask for the auto-generated CDI, the …
I am sharing this early to get your thoughts. Besides the issues I have mentioned, the integration is working flawlessly for me in my tests.
I like this simple way of defining custom CDI resources.
But I was a bit surprised by how the autogeneration works. At first I thought you could pass an arbitrary executable and it would just write the output. But then I would have expected to see … This makes me a bit worried about the maintenance burden. Maybe we can push some more things upstream so that less or no post-processing is needed.
I think generally this is fine, as the generated CDI data is essentially autodetected runtime data, so generating it at bootup is OK. Wouldn't putting the CDI files somewhere in …
I'm going to give it a second thought, yes. I think it might be worth just adding a …
I agree completely. I'll take ownership of this code and try to find a sweet spot for improving the upstream project; they are very welcoming of this kind of improvement.
Yes, I think that might be fine; still, we don't want to leave dangling symlinks around, so a cleanup is necessary IMO.
I do not think it would be good to put NVIDIA CDI devices into the NixOS configuration directly. There is a good amount of logic that …
@jmbaur I've pulled in your …
I haven't yet submitted that patch to their project on GitLab, as I'm not sure it aligns with their goals, but I was messing around with it on an x86_64 workstation I have access to, and it does make … Also to note: my draft PR has a few other patches that address some of the …
Force-pushed from b338e84 to 24134da.
Result of 8 packages built:
My findings at the current stage of this PR:
sudo nvidia-ctk cdi generate --output /etc/cdi/nvidia.yaml
10,15c10
< - args:
< - nvidia-ctk
< - hook
< - update-ldcache
< - --folder
< - /run/opengl-driver/lib
---
> - args: []
Then you can use podman with CDI like so:
podman run \
--rm \
--device nvidia.com/gpu=all \
-v /nix:/nix \
ubuntu \
/run/current-system/sw/bin/nvidia-smi -L
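The manual edit shown in the diff above (replacing the update-ldcache hook args with an empty list) can also be scripted against the generated spec. A rough sketch on a tiny stand-in file; the file name and sed pattern are illustrative, and a YAML-aware tool would be safer on a real spec:

```shell
# Build a tiny stand-in for the generated spec (real specs are much larger).
cat > /tmp/nvidia-cdi-sample.yaml <<'EOF'
hooks:
- args:
  - nvidia-ctk
  - hook
  - update-ldcache
  - --folder
  - /run/opengl-driver/lib
  hookName: createContainer
EOF
# Collapse the update-ldcache argument list into an empty one, as in the diff.
sed -i '/^- args:$/,/\/run\/opengl-driver\/lib/c\- args: []' /tmp/nvidia-cdi-sample.yaml
cat /tmp/nvidia-cdi-sample.yaml
```

After the edit, the hook block reads `- args: []` while the rest of the spec (hookName etc.) is untouched.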
docker run \
--rm -it \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all \
ubuntu nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/bdgz70m507gzfjg52yrqj5sa0b3rf04n-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown flag: --root
See 'docker --help'.
Usage: docker [OPTIONS] COMMAND
...
Looks like some arguments are not passed correctly to …
What does doing this modification solve? Is there some error that occurs later on when spawning containers?
Docker versions < 25 do not support CDI at all. You will need the correct version of docker, and you will also need to start the daemon with …
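For reference, the daemon-side switch this presumably refers to is Docker 25's CDI feature flag in /etc/docker/daemon.json:

```json
{
  "features": {
    "cdi": true
  }
}
```

With that in place (and the daemon restarted), `docker run --device nvidia.com/gpu=all …` should be accepted without going through the nvidia runtime wrapper.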
10,15c10
< - args:
< - nvidia-ctk
< - hook
< - update-ldcache
< - --folder
< - /run/opengl-driver/lib
---
> - args: []
This will prevent some containers from finding the mounted libcuda & friends libraries if they have an ldcache already present, since it will not be refreshed. I'll open a PR for CDI tomorrow with a proposal. I think it's good to have:
But I think, as things stand at this moment, we need to keep the …
Opened the PR to add support for CDI: #284507. There are some mounts I need to re-validate to only mount what is really required. Please provide feedback about that integration over there if you feel like it. :)
Force-pushed from 24134da to a55c829.
SomeoneSerge left a comment:
@aaronmondal could you update the commit message? Smth like nvidia-container-toolkit: 1.9.0 -> 1.15.0-rc.3.
At a (yet another) glance, this looks good, we should probably merge and move on to libnvidia-container and the CDI PR
Force-pushed from a55c829 to 9daafdf.
pkgs/applications/virtualization/nvidia-container-toolkit/default.nix
SomeoneSerge left a comment:
Aside from ldflags, I suppose this is ready?
SomeoneSerge left a comment:
This has been open for a while and there haven't been any objections. I intend to merge this as soon as Ofborg finishes re-evaluation.
Sorry for the delay. Currently travelling. Looks good 👌
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/using-nvidia-container-runtime-with-containerd-on-nixos/27865/30
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/nvidia-gpu-support-in-podman-and-cdi-nvidia-ctk/36286/4
Uhm, I screwed up. Ofborg didn't actually rebuild anything, and my commit updating ldflags broke the whole thing because I assumed the wrong …
Description of changes
This change bumps nvidia-container-toolkit from 1.9.0 to 1.15.0-rc.3. This likely won't be too notable a change yet, but it allows deprecating the ancient nvidia-docker packages in future commits. This change also adds the nvidia-ctk tool, as it's part of the toolkit.
Fixes #278155
Fixes #272235
Things done
- Tested, as applicable, with sandboxing enabled in nix.conf (see Nix manual): sandbox = relaxed or sandbox = true
- Tested via nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD" (note: all changes have to be committed; also see nixpkgs-review usage)
- Tested basic functionality of all binary files (usually in ./result/bin/)

Add a 👍 reaction to pull requests you find important.