RTX 4090, cuFileHandleRegister error: GPUDirect Storage not supported on current file #51

@flyingKangaroo1

Description

NVIDIA Open GPU Kernel Modules Version

570.148.08

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.5 LTS

Kernel Release

Linux 6.8.0-106-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4090

Describe the bug

I am trying to use GPUDirect Storage (GDS) to stream data directly between my NVMe SSD and GPU VRAM, but the gdsio test fails with a file registration error.
Running ./gdscheck gives the following output:

 GDS release version: 1.13.1.3
 nvidia_fs version:  2.28 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : true
 properties.use_compat_mode : false
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 64 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA GeForce RTX 4090 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12080
 Platform: B650M K, Arch: x86_64(Linux 6.8.0-106-generic)
 Platform verification succeeded

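gdscheck reports that the GPU supports GDS and that platform verification succeeded, although every storage driver under DRIVER CONFIGURATION, including NVMe and NVMe P2PDMA, is listed as Unsupported.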
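To make the Unsupported rows easier to spot, here is a minimal sketch that lists them from a saved copy of the gdscheck output. The function name `unsupported_backends` and the idea of saving the report to a file are my own assumptions for illustration, not part of the GDS tools.

```shell
#!/usr/bin/env bash
# List the storage backends that a saved gdscheck report marks as Unsupported.
# $1: path to a text file containing the gdscheck output (hypothetical file).
unsupported_backends() {
    # Rows look like " NVMe               : Unsupported".
    # Split on ':' and strip the padding around the backend name.
    awk -F':' '/: *Unsupported *$/ {
        name = $1
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        print name
    }' "$1"
}
```

For the report above, this would print the twelve backend names from NVMe P2PDMA through Userspace RDMA; the indented `--` RDMA detail rows are skipped because they are not marked Unsupported.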
When I run ./gdsio, it fails with the following error:

/usr/local/cuda/gds/tools$ sudo ./gdsio -f /home/test_gds.dat -d 0 -w 1 -s 2G -i 1M -x 0 -I 0
file register error: GPUDirect Storage not supported on current file filename :/home/test_gds.dat

This is the corresponding cufile.log:

 06-04-2026 14:59:16:997 [pid=44990 tid=44990] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 14:59:16:997 [pid=44990 tid=44990] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 14:59:16:997 [pid=44990 tid=44990] NOTICE  cufio-fs:451 dumping volume attributes: DEVNAME:/dev/nvme0n1p2,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,ext4_journal_mode:ordered,fsid:[FSID_REDACTED],numa_node:-1,queue/logical_block_size:4096,wwid:eui.[REDACTED],
 06-04-2026 14:59:16:997 [pid=44990 tid=44990] ERROR  cufio:297 cuFileHandleRegister error, file checks failed
 06-04-2026 14:59:16:997 [pid=44990 tid=44990] ERROR  cufio:339 cuFileHandleRegister error: GPUDirect Storage not supported on current file
 06-04-2026 14:59:31:762 [pid=45020 tid=45020] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 14:59:31:762 [pid=45020 tid=45020] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 14:59:31:762 [pid=45020 tid=45020] NOTICE  cufio-fs:451 dumping volume attributes: DEVNAME:/dev/nvme0n1p2,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,ext4_journal_mode:ordered,fsid:[FSID_REDACTED],numa_node:-1,queue/logical_block_size:4096,wwid:eui.[REDACTED],
 06-04-2026 14:59:31:762 [pid=45020 tid=45020] ERROR  cufio:297 cuFileHandleRegister error, file checks failed
 06-04-2026 14:59:31:762 [pid=45020 tid=45020] ERROR  cufio:339 cuFileHandleRegister error: GPUDirect Storage not supported on current file
 06-04-2026 14:59:35:560 [pid=45038 tid=45038] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 14:59:35:560 [pid=45038 tid=45038] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 14:59:35:560 [pid=45038 tid=45038] NOTICE  cufio-fs:451 dumping volume attributes: DEVNAME:/dev/nvme0n1p2,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,ext4_journal_mode:ordered,fsid:[FSID_REDACTED],numa_node:-1,queue/logical_block_size:4096,wwid:eui.[REDACTED],
 06-04-2026 14:59:35:560 [pid=45038 tid=45038] ERROR  cufio:297 cuFileHandleRegister error, file checks failed
 06-04-2026 14:59:35:560 [pid=45038 tid=45038] ERROR  cufio:339 cuFileHandleRegister error: GPUDirect Storage not supported on current file
 06-04-2026 15:24:33:8 [pid=45511 tid=45511] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 15:24:33:8 [pid=45511 tid=45511] ERROR  cufio-fs:204 NVMe Driver not registered with nvidia-fs!!!
 06-04-2026 15:24:33:8 [pid=45511 tid=45511] NOTICE  cufio-fs:451 dumping volume attributes: DEVNAME:/dev/nvme0n1p2,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,ext4_journal_mode:ordered,fsid:[FSID_REDACTED],numa_node:-1,queue/logical_block_size:4096,wwid:eui.[REDACTED],
 06-04-2026 15:24:33:8 [pid=45511 tid=45511] ERROR  cufio:297 cuFileHandleRegister error, file checks failed
 06-04-2026 15:24:33:8 [pid=45511 tid=45511] ERROR  cufio:339 cuFileHandleRegister error: GPUDirect Storage not supported on current file
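Since the log repeats the same messages for each run, a small sketch like the following can collapse it into one line per distinct error. The function name `summarize_cufile_errors` is hypothetical, and the parsing assumes the line layout shown above (timestamp, `[pid=… tid=…]`, level, message).

```shell
#!/usr/bin/env bash
# Count how often each distinct ERROR message occurs in a cufile.log.
# $1: path to the log file.
summarize_cufile_errors() {
    awk '/\] ERROR / {
        msg = $0
        sub(/^.*\] ERROR +/, "", msg)   # drop timestamp, pid/tid and level
        count[msg]++
    }
    END {
        for (m in count) printf "%d\t%s\n", count[m], m
    }' "$1" | sort -rn
}
```

On the log above, this reports the "NVMe Driver not registered with nvidia-fs!!!" message 8 times and each of the two cuFileHandleRegister messages 4 times, which suggests that the NVMe registration check is the first thing failing on every run.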

Is this error happening because my NVMe drive doesn't support p2pmem or a Controller Memory Buffer (CMB)? Is there a way to fix it?
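For anyone who wants to check the drive side, here is a rough sketch that looks for the relevant sysfs entries. It assumes the kernel's NVMe PCI driver exposes a `cmb` attribute under `/sys/class/nvme/<controller>/` only when the controller advertises a CMB, and that a `p2pmem` directory appears under the PCI device when memory has been registered for peer-to-peer use. The function name `nvme_p2p_report` is made up for this example, and the result alone may not explain the "not registered with nvidia-fs" message.

```shell
#!/usr/bin/env bash
# Report whether sysfs shows CMB and p2pmem entries for an NVMe controller.
# $1: controller name, e.g. nvme0
# $2: sysfs root (defaults to /sys; overridable so the check can be tried on a fake tree)
nvme_p2p_report() {
    local ctrl="$1" root="${2:-/sys}"
    local dir="$root/class/nvme/$ctrl"

    # Assumption: the cmb attribute is only visible when the controller has a CMB.
    if [ -e "$dir/cmb" ]; then
        echo "cmb: present"
    else
        echo "cmb: absent"
    fi

    # Assumption: p2pmem/ exists on the PCI device only if P2P memory was registered.
    if [ -d "$dir/device/p2pmem" ]; then
        echo "p2pmem: present"
    else
        echo "p2pmem: absent"
    fi
}
```

Calling `nvme_p2p_report nvme0` would cover the controller behind /dev/nvme0n1p2 from the log.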

To Reproduce

Hardware Details:
GPU: NVIDIA GeForce RTX 4090
Motherboard: B650M K
SSD: SK Hynix Platinum P41 1TB

sudo apt-get --purge autoremove nvidia-*
sudo apt-get --purge autoremove libnvidia*

# https://www.nvidia.com/en-us/drivers/details/245523/
chmod +x ./NVIDIA-Linux-x86_64-570.148.08.run
sudo ./NVIDIA-Linux-x86_64-570.148.08.run --no-kernel-module

git clone -b 570.148.08-p2p https://github.com/tinygrad/open-gpu-kernel-modules
cd open-gpu-kernel-modules
./install.sh
# reboot

sudo apt install nvidia-gds-12.8
sudo gedit /etc/cufile.json
# Changes made in /etc/cufile.json:
#   "use_pci_p2pdma": false -> true
#   "allow_compat_mode": true -> false

sudo gedit /etc/default/grub
# Added amd_iommu=off to the default kernel command line:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"

git clone https://github.com/NVIDIA/gds-nvidia-fs
cd gds-nvidia-fs/src
sudo make NVIDIA_SRC_DIR=$HOME/open-gpu-kernel-modules/kernel-open/nvidia \
          CONFIG_DISABLE_NVFS_KERN_RDMA_SUPPORT=1 \
          KBUILD_EXTRA_SYMBOLS=$HOME/open-gpu-kernel-modules/kernel-open/Module.symvers
sudo depmod -a
sudo modprobe nvidia-fs
sudo setpci -s 0000:00:01.1
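After the modprobe step, it can help to confirm that the modules actually ended up loaded. This is a minimal sketch that reads /proc/modules; the helper name `module_loaded` is an assumption of mine, and the module names `nvidia` and `nvidia_fs` are taken from the driver and the gdscheck output above.

```shell
#!/usr/bin/env bash
# Succeed if a kernel module is listed in /proc/modules
# (or in another file with the same format, passed as $2).
module_loaded() {
    local name="$1" modules_file="${2:-/proc/modules}"
    # The first whitespace-separated field of each row is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

for mod in nvidia nvidia_fs; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: not loaded"
    fi
done
```

A driver built into the kernel would not show up in /proc/modules, so this check only applies to loadable modules.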

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

Labels: bug (Something isn't working)
