
Add CUDA GPU backend for NVIDIA acceleration#12

Open
0xSufi wants to merge 3 commits into antirez:main from 0xSufi:cuda-backend

Conversation

@0xSufi 0xSufi commented Feb 11, 2026

Summary

  • Add a CUDA/cuBLAS GPU backend for NVIDIA GPUs (SM 8.0+), cutting inference from ~40 s to ~1 s on an RTX 3090
  • Implements BF16 weight caching, custom CUDA kernels (attention, RoPE, norms, activations), and monolithic GPU step functions that run all transformer layers with a single CPU↔GPU sync
  • New make cuda build target; CUDA and Metal are mutually exclusive at compile time

Performance (RTX 3090, 3.6s test audio)

                   Encoder         Decoder (per step)   Decoder (total)
  CPU (OpenBLAS)   10,871 ms       ~500 ms              28,446 ms
  CUDA             225 ms (48x)    12.3 ms (40x)        782 ms (36x)

Architecture

  • cuBLAS matmul: BF16 weights cached on GPU after first use, F32→BF16 activation conversion on-device, TF32 tensor ops on Ampere+
  • Custom CUDA kernels: rms_norm (shared-memory reduction), silu, gelu, add/mul, apply_rope (paired rotation), causal_attention (online softmax, GQA, sliding window), ada_scale, bias_add
  • Monolithic step functions: decoder (26 layers + logits) and encoder (32 layers + final norm) execute entirely on GPU with one cudaStreamSynchronize — no per-layer CPU round-trips
  • Unified memory KV cache: cudaMallocManaged for zero-copy CPU↔GPU KV caches; GPU writes KV entries via device→managed copy to avoid page thrashing
  • Persistent GPU buffers: Single cudaMalloc per component (decoder/encoder), pointer arithmetic for sub-buffers, reused across tokens

New files

  • voxtral_cuda.h — C header with CUDA backend API
  • voxtral_cuda.cu — Full CUDA implementation (~1100 lines)

Modified files

  • Makefile — make cuda target (nvcc + gcc, links cudart/cublas)
  • main.c — CUDA init/shutdown lifecycle
  • voxtral.c — BF16 + F32 weight warmup, unified memory KV cache allocation
  • voxtral_kernels.c — #ifdef USE_CUDA dispatch for bf16 matmul functions
  • voxtral_decoder.c — CUDA monolithic decoder step dispatch, unified memory KV cache
  • voxtral_encoder.c — CUDA monolithic encoder step dispatch, unified memory KV cache

Test plan

  • make cuda builds without errors
  • Short audio transcription correct ("Hello, this is a test...")
  • Long audio transcription works (JFK 11s, MLK 3min)
  • Falls back to CPU path gracefully if CUDA step returns -1
  • Verify on different SM 8.x GPUs (tested on SM 8.6 / RTX 3090)

🤖 Generated with Claude Code

0xSufi and others added 3 commits February 11, 2026 18:03
Add cuBLAS-accelerated matrix multiplication with BF16 weight caching,
giving ~13x overall speedup on RTX 3090 vs CPU-only OpenBLAS
(encoder 5.5x, decoder prefill 51x, decoder per-step 25x).

New files:
- voxtral_cuda.h: C header with CUDA backend API
- voxtral_cuda.cu: cuBLAS GEMM with BF16 weight cache, F32→BF16
  activation conversion kernel, activation buffer pool, and
  cudaMallocManaged for KV caches

Modified files:
- voxtral_kernels.c: #ifdef USE_CUDA dispatch for bf16 matmul functions
- voxtral_encoder.c: CUDA unified memory for encoder KV cache
- voxtral_decoder.c: CUDA unified memory for decoder KV cache + grow
- voxtral.c: CUDA weight warmup at load time, shared memory free
- main.c: vox_cuda_init() / vox_cuda_shutdown()
- Makefile: `make cuda` target (requires CUDA toolkit + OpenBLAS, SM 8.0+)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Custom CUDA kernel with online softmax, GQA support, and sliding window
masking. One block per (head, query_pos) with warp shuffle + shared memory
dot product reduction. Only dispatched for the encoder incremental path
where KV cache is in managed memory — decoder stays on CPU since
single-token attention is faster without kernel launch + page migration
overhead.

Encoder: 1879ms → 634ms (2.97x), decoder unchanged at ~16ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 3 of CUDA backend: keep activations on GPU between operations,
eliminating per-matmul CPU↔GPU round-trips. Adds custom CUDA kernels
(rms_norm, silu, gelu, add/mul, RoPE, ada_scale, bias_add), a device-
pointer cuBLAS helper, persistent GPU buffers, and monolithic decoder
(26 layers) and encoder (32 layers) step functions that execute with a
single cudaStreamSynchronize.

Performance on RTX 3090 (3.6s test audio):
- Encoder: 225ms (was 634ms in Phase 2, 1879ms in Phase 1)
- Decoder: 12.3 ms/step (was 16.8ms in Phase 2, 16.7ms in Phase 1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

0xSufi commented Feb 12, 2026

Hi Salvatore, it seems to me that Opus 4.6 did a good job here, though maybe too Linux-specific? The performance isn't bad.

You're the best!

@Danmoreng

How is this different from the existing CUDA PR here #7 ?


0xSufi commented Feb 14, 2026

> How is this different from the existing CUDA PR here #7 ?

Well, first of all, look at the lines of code this PR added vs. the other one; they are substantially different. Also, admittedly, this PR was created entirely with Claude Code (Opus 4.6); I am not sure about #7. I have skimmed it briefly, and it looks like it is Windows-specific, whereas this PR should be Linux-specific (I asked antirez about it).

Edit: Here's Opus 4.6's answer to your question:

● Both PRs add CUDA GPU backends for NVIDIA acceleration to voxtral.c, but they differ significantly in scope and approach:

  PR #12 — "Add CUDA GPU backend for NVIDIA acceleration"

  - Size: ~1,436 additions, 8 files changed
  - Scope: Focused, clean CUDA backend
  - Implementation: Single .cu file (~1100 lines), cuBLAS for matmul, custom kernels (RMSNorm, SiLU, GeLU, RoPE, causal attention, etc.)
  - Architecture: Monolithic GPU step functions (one cudaStreamSynchronize per step), unified memory (cudaMallocManaged) for KV caches, BF16 weight caching
  - Performance: RTX 3090, 3.6s audio — encoder 48x faster, decoder 40x faster (~1s total vs ~40s CPU)
  - Platform: Linux only
  - State: Open

  PR #7 — "CUDA backend (NVIDIA/WSL2) + faster-than-real-time STT"

  - Size: ~12,156 additions, 30 files changed
  - Scope: Much larger — CUDA backend + Windows support + many optimizations
  - Implementation: Separate voxtral_cuda.c (~6382 lines) + voxtral_cuda_kernels.cu (~2927 lines), uses CUDA Driver API + cuBLAS/cuBLASLt
  - Extra features:
    - Many tuning knobs (VOX_CUDA_FAST, VOX_CUDA_PIPELINE_FULL, VOX_CUDA_LOGITS_INT8, cuBLASLt autotune, etc.)
    - Fused kernels (RMSNorm + ada_scale + BF16 cast in one kernel)
    - CUDA graphs for decoder loop
    - INT8-quantized LM head (optional)
    - Pre-compiled cubin blob (avoids PTX JIT issues on WSL2)
    - Native Windows build (PowerShell scripts, WASAPI microphone capture)
  - Performance: RTX 3080 Ti, 180s audio — 5.5x real-time with CUDA fast + INT8
  - Platform: Linux, WSL2, and Windows
  - State: Open

  Key Differences

  ┌────────────────────┬─────────────────┬────────────────────────────────────┐
  │       Aspect       │     PR #12      │               PR #7                │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Lines added        │ ~1,400          │ ~12,000                            │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Files changed      │ 8               │ 30                                 │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ CUDA API           │ Runtime API     │ Driver API + cuBLASLt              │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ KV cache           │ Unified memory  │ Device memory + lazy host download │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Windows support    │ No              │ Yes (WASAPI mic, PowerShell build) │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Optimization knobs │ Minimal         │ Extensive (graphs, INT8, autotune) │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Complexity         │ Simple, focused │ Feature-rich, many options         │
  └────────────────────┴─────────────────┴────────────────────────────────────┘

  In short: PR #12 is a clean, minimal CUDA backend; PR #7 is a comprehensive, heavily optimized CUDA backend with Windows support and many tuning options.
