RuntimeError: CUDACachingAllocator INTERNAL ASSERT FAILED on Jetson Orin NX (CUDA 12.6, PyTorch nv24.08/2.8.0) #430

@SuperTommy2017

Hi! 👋

First of all, thank you for the amazing work on ComfyUI-GGUF - it's a fantastic
project and the community really appreciates it!

I'm running ComfyUI on an NVIDIA Jetson Orin NX 16GB (ARM/aarch64) and have been
trying to get GGUF models working. Unfortunately I'm hitting a consistent crash
that I haven't been able to resolve despite several attempts. I'm reporting it
here in the hope that it helps improve compatibility with Jetson devices, which
are becoming increasingly popular for local AI inference.

Thanks in advance for any insight! 🙏

Environment

  • Device: NVIDIA Jetson Orin NX 16GB (Engineering Reference Developer Kit Super)
  • SoC: tegra234
  • CUDA Arch: 8.7
  • OS: Ubuntu 22.04 (aarch64)
  • L4T: 36.4.7
  • CUDA: 12.6.85
  • cuDNN: 9.19.0.56
  • TensorRT: 10.7.0.23
  • Python: 3.10.12
  • PyTorch: 2.5.0a0+872d972e41.nv24.08 (NVIDIA custom build)
  • Also tested with: PyTorch 2.8.0 (from pypi.jetson-ai-lab.io/jp6/cu126)
  • ComfyUI-GGUF: 6ea2651 (latest main)
  • gguf package: 0.18.0

Model

  • z_image_turbo-Q4_K_M.gguf loaded via Unet Loader (GGUF) node

Error

RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at 
"/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838

With PyTorch 2.8.0 the same crash occurs at line 1131.

Traceback

The crash occurs in ops.py lines 45-58, specifically when calling .to(device)
on a GGMLTensor:

File "ComfyUI-GGUF/ops.py", line 58, in to
    new = super().to(*args, **kwargs)
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at 
"c10/cuda/CUDACachingAllocator.cpp":838

What I tried

  1. Updated the gguf package from 0.17.1 to 0.18.0 - same error
  2. Modified get_torch_compiler_disable_decorator() in ops.py to always
     return dummy_decorator (bypass torch.compile) - same error
  3. Upgraded PyTorch to 2.8.0 (from the Jetson AI Lab repo) - same error at a
     different line (1131)
  4. Fresh ComfyUI install with PyTorch 2.8.0 - same error

Root cause hypothesis

The CUDACachingAllocator on Jetson crashes when trying to move a custom
torch.Tensor subclass (GGMLTensor) to CUDA device. This appears to be
a known issue with PyTorch custom tensor subclasses on Jetson's unified memory
architecture (CPU+GPU share the same memory pool).

Standard safetensors models (e.g. Juggernaut XL fp16) work perfectly on the
same setup.
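
To help check this hypothesis, here is a minimal sketch that moves a bare
torch.Tensor subclass (with no GGUF data involved) to CUDA. PlainSubclass is a
stand-in I made up to mirror how GGMLTensor wraps torch.Tensor; if the subclass
itself is the trigger, I'd expect this to hit the same assert on the Jetson:

```python
import torch

class PlainSubclass(torch.Tensor):
    """Trivial tensor subclass, standing in for GGMLTensor."""
    pass

# Wrap a plain CPU tensor in the subclass, similar to how
# ComfyUI-GGUF wraps quantized weights.
x = torch.zeros(16, dtype=torch.uint8).as_subclass(PlainSubclass)
assert isinstance(x, PlainSubclass)

if torch.cuda.is_available():
    # On the affected Jetson setup this .to() is where the
    # NVML_SUCCESS assert should fire; elsewhere it just succeeds.
    y = x.to("cuda")
    print(type(y).__name__, y.device.type)
else:
    print("CUDA not available; subclass wrapping alone works on CPU.")
```

If this plain subclass also crashes, the bug is in PyTorch's allocator path on
Jetson rather than anything GGUF-specific.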

Question

Is there a workaround to load GGUF models without triggering .to(cuda)
on the GGMLTensor subclass? Or is there a way to force dequantization on CPU
before moving to GPU?
