Description
Hi!
First of all, thank you for the amazing work on ComfyUI-GGUF - it's a fantastic
project and the community really appreciates it!
I'm running ComfyUI on an NVIDIA Jetson Orin NX 16GB (ARM/aarch64) and I've been
trying to get GGUF models working. Unfortunately I'm hitting a consistent crash
that I couldn't resolve despite several attempts. I'm reporting it here hoping
it might help improve compatibility with Jetson devices, which are becoming
increasingly popular for local AI inference.
Thanks in advance for any insight!
Environment
- Device: NVIDIA Jetson Orin NX 16GB (Engineering Reference Developer Kit Super)
- SoC: tegra234
- CUDA Arch: 8.7
- OS: Ubuntu 22.04 (aarch64)
- L4T: 36.4.7
- CUDA: 12.6.85
- cuDNN: 9.19.0.56
- TensorRT: 10.7.0.23
- Python: 3.10.12
- PyTorch: 2.5.0a0+872d972e41.nv24.08 (NVIDIA custom build)
- Also tested with: PyTorch 2.8.0 (from pypi.jetson-ai-lab.io/jp6/cu126)
- ComfyUI-GGUF: 6ea2651 (latest main)
- gguf package: 0.18.0
Model
z_image_turbo-Q4_K_M.gguf, loaded via the Unet Loader (GGUF) node
Error
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at
"/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838
With PyTorch 2.8.0 the same crash occurs at line 1131.
Traceback
The crash occurs in ops.py, lines 45-58, specifically when calling .to(device)
on a GGMLTensor:
File "ComfyUI-GGUF/ops.py", line 58, in to
new = super().to(*args, **kwargs)
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at
"c10/cuda/CUDACachingAllocator.cpp":838
What I tried
- Updated the gguf package from 0.17.1 to 0.18.0 - same error
- Modified get_torch_compiler_disable_decorator() in ops.py to always return
  dummy_decorator, bypassing torch.compile (see the sketch below) - same error
- Upgraded PyTorch to 2.8.0 (from the Jetson AI Lab repo) - same error at a
  different line (1131)
- Fresh ComfyUI install with PyTorch 2.8.0 - same error
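For reference, the decorator bypass mentioned above looked roughly like this
(paraphrased from my local edit, not the upstream code, so the exact surrounding
logic may differ):

```python
# My local patch in ComfyUI-GGUF/ops.py (approximate, from memory): always
# return a pass-through decorator so torch.compile is never involved.
def dummy_decorator(*args, **kwargs):
    # Works whether used bare (@dummy_decorator) or called with arguments:
    # either way, the wrapped function is returned unchanged.
    if len(args) == 1 and callable(args[0]) and not kwargs:
        return args[0]
    return lambda fn: fn

def get_torch_compiler_disable_decorator():
    # Skip the torch-version check and never apply torch.compiler.disable().
    return dummy_decorator
```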
Root cause hypothesis
The CUDACachingAllocator on Jetson crashes when trying to move a custom
torch.Tensor subclass (GGMLTensor) to the CUDA device. This appears to be
a known issue with PyTorch custom tensor subclasses on Jetson's unified memory
architecture (CPU and GPU share the same physical memory pool).
Standard safetensors models (e.g. Juggernaut XL fp16) work perfectly on the
same setup.
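If it helps isolate the problem, I can run a minimal test along these lines
(a bare torch.Tensor subclass with no GGUF code involved; the class name is
just for the test) to check whether the subclass .to("cuda") path alone
triggers the assert:

```python
# Minimal isolation test: does moving a bare torch.Tensor subclass to CUDA
# already trip the NVML assert on Jetson, with no GGUF/quantization involved?
import torch

class PlainSubclass(torch.Tensor):
    # Empty subclass: .to() still goes through the tensor-subclass code path.
    pass

x = torch.zeros(16, 16)
y = x.as_subclass(PlainSubclass)

print(torch.cuda.is_available())    # sanity check
print(x.to("cuda").device)          # plain tensor: works (safetensors models do)
print(y.to("cuda").device)          # subclass: does this reproduce the crash?
```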
Question
Is there a workaround to load GGUF models without triggering .to(cuda)
on the GGMLTensor subclass? Or is there a way to force dequantization on CPU
before moving to GPU?
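To illustrate the second option, something along these lines is what I have in
mind - dequantizing each tensor on the CPU with the gguf package and only then
creating a plain (non-subclassed) CUDA tensor. This is just a sketch: I haven't
verified that gguf.quants.dequantize handles every quant type in this file, the
GGUF-to-ComfyUI shape/name mapping is omitted, and doing this for a whole model
would roughly double peak memory during loading on a 16 GB unified-memory device.

```python
# Hypothetical workaround sketch: dequantize on the CPU first, then move plain
# torch.Tensors to the GPU, so GGMLTensor.to(cuda) is never called.
import numpy as np
import torch
from gguf import GGUFReader
from gguf.quants import dequantize

reader = GGUFReader("z_image_turbo-Q4_K_M.gguf")

state_dict = {}
for t in reader.tensors:
    # Dequantize the raw quantized blocks to float32 on the CPU.
    arr = dequantize(t.data, t.tensor_type)
    # Build an ordinary tensor and move that to CUDA (shape/name mapping
    # into ComfyUI's UNet state dict is omitted here).
    state_dict[t.name] = torch.from_numpy(np.ascontiguousarray(arr)).to("cuda")
```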