Description
Hi!
First of all, thank you for the amazing work on ComfyUI-GGUF - it's a fantastic
project and the community really appreciates it!
I'm running ComfyUI on an NVIDIA Jetson Orin NX 16GB (ARM/aarch64) and I've been
trying to get GGUF models working. Unfortunately I'm hitting a consistent crash
that I couldn't resolve despite several attempts. I'm reporting it here hoping
it might help improve compatibility with Jetson devices, which are becoming
increasingly popular for local AI inference.
Thanks in advance for any insight!
Environment
- Device: NVIDIA Jetson Orin NX 16GB (Engineering Reference Developer Kit Super)
- SoC: tegra234
- CUDA Arch: 8.7
- OS: Ubuntu 22.04 (aarch64)
- L4T: 36.4.7
- CUDA: 12.6.85
- cuDNN: 9.19.0.56
- TensorRT: 10.7.0.23
- Python: 3.10.12
- PyTorch: 2.5.0a0+872d972e41.nv24.08 (NVIDIA custom build)
- Also tested with: PyTorch 2.8.0 (from pypi.jetson-ai-lab.io/jp6/cu126)
- ComfyUI-GGUF: 6ea2651 (latest main)
- gguf package: 0.18.0
Model
z_image_turbo-Q4_K_M.gguf, loaded via the Unet Loader (GGUF) node
Error
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at
"/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838
With PyTorch 2.8.0 the same crash occurs at line 1131.
Traceback
The crash occurs in ops.py, lines 45-58, specifically when calling .to(device)
on a GGMLTensor:
File "ComfyUI-GGUF/ops.py", line 58, in to
new = super().to(*args, **kwargs)
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at
"c10/cuda/CUDACachingAllocator.cpp":838
What I tried
- Updated the gguf package from 0.17.1 to 0.18.0 - same error
- Modified get_torch_compiler_disable_decorator() in ops.py to always return
  dummy_decorator, bypassing torch.compile (see the sketch below) - same error
- Upgraded PyTorch to 2.8.0 (from the Jetson AI Lab repo) - same error at a
  different line (1131)
- Fresh ComfyUI install with PyTorch 2.8.0 - same error
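For reference, the decorator bypass mentioned above looked roughly like this
(paraphrased from my local edit, not the upstream code, so the exact surrounding
logic may differ):

```python
# My local patch in ComfyUI-GGUF/ops.py (approximate, from memory): always
# return a pass-through decorator so torch.compile is never involved.
def dummy_decorator(*args, **kwargs):
    # Works whether used bare (@dummy_decorator) or called with arguments:
    # either way, the wrapped function is returned unchanged.
    if len(args) == 1 and callable(args[0]) and not kwargs:
        return args[0]
    return lambda fn: fn

def get_torch_compiler_disable_decorator():
    # Skip the torch-version check and never apply torch.compiler.disable().
    return dummy_decorator
```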
Root cause hypothesis
The CUDACachingAllocator on Jetson crashes when trying to move a custom
torch.Tensor subclass (GGMLTensor) to the CUDA device. This appears to be
a known issue with PyTorch custom tensor subclasses on Jetson's unified memory
architecture (CPU and GPU share the same physical memory pool).
Standard safetensors models (e.g. Juggernaut XL fp16) work perfectly on the
same setup.
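If it helps isolate the problem, I can run a minimal test along these lines
(a bare torch.Tensor subclass with no GGUF code involved; the class name is
just for the test) to check whether the subclass .to("cuda") path alone
triggers the assert:

```python
# Minimal isolation test: does moving a bare torch.Tensor subclass to CUDA
# already trip the NVML assert on Jetson, with no GGUF/quantization involved?
import torch

class PlainSubclass(torch.Tensor):
    # Empty subclass: .to() still goes through the tensor-subclass code path.
    pass

x = torch.zeros(16, 16)
y = x.as_subclass(PlainSubclass)

print(torch.cuda.is_available())    # sanity check
print(x.to("cuda").device)          # plain tensor: works (safetensors models do)
print(y.to("cuda").device)          # subclass: does this reproduce the crash?
```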
Question
Is there a workaround to load GGUF models without triggering .to(cuda)
on the GGMLTensor subclass? Or is there a way to force dequantization on CPU
before moving to GPU?
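To illustrate the second option, something along these lines is what I have in
mind - dequantizing each tensor on the CPU with the gguf package and only then
creating a plain (non-subclassed) CUDA tensor. This is just a sketch: I haven't
verified that gguf.quants.dequantize handles every quant type in this file, the
GGUF-to-ComfyUI shape/name mapping is omitted, and doing this for a whole model
would roughly double peak memory during loading on a 16 GB unified-memory device.

```python
# Hypothetical workaround sketch: dequantize on the CPU first, then move plain
# torch.Tensors to the GPU, so GGMLTensor.to(cuda) is never called.
import numpy as np
import torch
from gguf import GGUFReader
from gguf.quants import dequantize

reader = GGUFReader("z_image_turbo-Q4_K_M.gguf")

state_dict = {}
for t in reader.tensors:
    # Dequantize the raw quantized blocks to float32 on the CPU.
    arr = dequantize(t.data, t.tensor_type)
    # Build an ordinary tensor and move that to CUDA (shape/name mapping
    # into ComfyUI's UNet state dict is omitted here).
    state_dict[t.name] = torch.from_numpy(np.ascontiguousarray(arr)).to("cuda")
```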