CUDA backend (NVIDIA/WSL2) + faster-than-real-time STT#7

Open
HorizonXP wants to merge 35 commits into antirez:main from HorizonXP:codex/cuda-wsl2-upstream
Conversation

HorizonXP commented Feb 7, 2026

Summary

This PR adds an NVIDIA CUDA backend for voxtral.c (Linux/WSL2 tested) and enables faster-than-real-time speech-to-text on an RTX 3080 Ti.

It also adds native Windows build support (PowerShell build + model download scripts) and Windows microphone capture via WASAPI (ported from Danmoreng’s work referenced in PR comment #3867041166).

Key knobs:

  • VOX_CUDA_FAST=1: convenience preset that enables the best-known decoder speedups by default (CUDA graphs + attention v4 (fused KV append; falls back to v3) + merged projections + device RoPE + fused top1-only logits when alternatives are off), unless explicitly overridden.
  • VOX_CUDA_PIPELINE_FULL=1: experimental full CUDA streaming pipeline (keeps adapter embeddings on-device; thread-safe across multiple contexts/streams via serialization).
  • VOX_CUDA_LOGITS_FUSED=1: top1-only logits path (avoids materializing the full logits buffer when only the best token id is needed). Enabled by default under VOX_CUDA_FAST=1 (disable with VOX_DISABLE_CUDA_LOGITS_FUSED=1).
  • VOX_CUDA_LOGITS_INT8=1: opt-in INT8-quantized LM head for top1-only logits (reduces bandwidth of the vocab x dim projection). Default off; may affect accuracy.
  • VOX_CUDA_CUBLASLT_MAX_WS_MB=<MB>: cap cuBLASLt workspace used for M=1 GEMM algo selection (default: 32). Larger values can enable faster kernels at the cost of persistent VRAM.
  • VOX_CUDA_LT_COMPUTE=32F|32F_FAST_16BF|32F_FAST_TF32|32F_FAST_16F: opt-in cuBLASLt compute modes for BF16 M=1 GEMMs (default: 32F). May change outputs slightly; validate with ./scripts/accuracy_regression.sh.
  • VOX_DISABLE_CUBLASLT_AUTOTUNE=1: disable best-effort cuBLASLt autotune for repeated M=1 decoder GEMMs (enabled by default under VOX_CUDA_FAST=1; override with VOX_CUDA_CUBLASLT_AUTOTUNE=0/1). This can reduce prefill overhead on very short clips.

Benchmarks (RTX 3080 Ti, WSL2)

Definitions:

  • Wall transcribe: wall time excluding model load
  • xRT: times real-time = audio_seconds / wall_seconds (higher is better; > 1.0x is faster-than-real-time)

All timings are from VOX_PRINT_TIMINGS=1.

| Sample | Audio | BLAS wall (xRT) | CUDA wall (xRT) | CUDA fast wall (xRT) | CUDA fast + INT8 logits wall (xRT) |
| --- | --- | --- | --- | --- | --- |
| samples/test_speech.wav | 3.64s | 49.77s (0.07x) | 2.60s (1.40x) | 2.55s (1.43x) | 2.93s (1.25x) |
| samples/I_have_a_dream.ogg | 180.02s | skipped (very slow) | 83.48s (2.16x) | 35.22s (5.11x) | 32.53s (5.53x) |

Notes:

  • Model load is printed separately (Model load:) and is small (hundreds of ms here; includes CUDA driver init on first run).
  • For the long sample, the decoder dominates baseline runtime; VOX_CUDA_FAST=1 primarily accelerates the decoder step loop (graphs + v4 + merged weights).
  • VOX_CUDA_LOGITS_INT8=1 is most useful on longer samples: it does a one-time LM-head quantize+upload on first use (INT8 weights are ~384MiB). On very short clips, that one-time work can outweigh the per-step speedup.
  • VOX_CUDA_FAST=1 also enables cuBLASLt autotune for the repeated M=1 decoder GEMMs; on very short clips the one-time tuning shows up as higher prefill. Disable via VOX_DISABLE_CUBLASLT_AUTOTUNE=1 if benchmarking minimal startup latency.
  • CUDA keeps the decoder KV on-device; host KV is lazily downloaded only if CPU attention is used. Perf tweak: prefill KV device->host copies are avoided, and large host-KV memmoves during compaction are skipped while host KV is stale.
  • BLAS is not close to real-time on these tests; for long samples it can take tens of minutes.

Detailed Timing Breakdown

test_speech.wav (3.641750s):

  • BLAS: model load 67 ms, wall 49768 ms, encoder 20643 ms, decoder 29116 ms (prefill 8455 ms + 369.0 ms/step)
  • CUDA: model load 237 ms, wall 2597 ms, encoder 615 ms, decoder 1972 ms (prefill 1298 ms + 12.0 ms/step)
  • CUDA + VOX_CUDA_FAST=1: model load 256 ms, wall 2551 ms, encoder 612 ms, decoder 1928 ms (prefill 1359 ms + 10.2 ms/step)
  • CUDA + VOX_CUDA_FAST=1 VOX_CUDA_LOGITS_INT8=1: model load 246 ms, wall 2925 ms, encoder 716 ms, decoder 2199 ms (prefill 1624 ms + 10.3 ms/step)

I_have_a_dream.ogg (180.021438s; converted to 16kHz mono WAV for the run):

  • CUDA: model load 256 ms, wall 83477 ms, encoder 2588 ms, decoder 80684 ms (prefill 2607 ms + 34.5 ms/step)
  • CUDA + VOX_CUDA_FAST=1: model load 263 ms, wall 35218 ms, encoder 2489 ms, decoder 32525 ms (prefill 1506 ms + 13.7 ms/step)
  • CUDA + VOX_CUDA_FAST=1 VOX_CUDA_LOGITS_INT8=1: model load 244 ms, wall 32529 ms, encoder 2422 ms, decoder 29902 ms (prefill 1474 ms + 12.6 ms/step)

How To Build / Run (Linux / WSL2)

Build:

  • make cuda (requires CUDA Toolkit + nvcc)

Run:

  • ./download_model.sh
  • VOX_PRINT_TIMINGS=1 ./voxtral -d voxtral-model -i samples/test_speech.wav
  • Recommended speed preset: VOX_CUDA_FAST=1 VOX_PRINT_TIMINGS=1 ./voxtral -d voxtral-model -i samples/I_have_a_dream.ogg
  • Optional INT8 logits (accuracy-risky): VOX_CUDA_FAST=1 VOX_CUDA_LOGITS_INT8=1 VOX_PRINT_TIMINGS=1 ./voxtral -d voxtral-model -i samples/I_have_a_dream.ogg

Benchmark helper:

  • ./scripts/benchmark_backends.sh voxtral-model samples/test_speech.wav
  • Skip the slow CPU BLAS run: VOX_BENCH_SKIP_BLAS=1 ./scripts/benchmark_backends.sh voxtral-model samples/I_have_a_dream.ogg
  • Run extra CUDA variants: VOX_BENCH_CUDA_OPTS=1 ...

Windows (Native)

New files:

  • WINDOWS_CUDA_GUIDE.md
  • build.ps1, download_model.ps1, runtest.ps1
  • voxtral_mic_win32.c (WASAPI mic)

Quickstart:

```powershell
.\download_model.ps1
.\build.ps1 -Cuda
.\voxtral.exe -d voxtral-model -i samples\jfk.wav
.\voxtral.exe -d voxtral-model --from-mic -I 0.5
```

Implementation Notes

  • Uses CUDA Driver API + cuBLAS/cuBLASLt.
  • Decoder FFN norm uses a fused CUDA kernel (vox_rms_norm_to_bf16_ada) to combine RMSNorm + (1+ada_scale) + BF16 cast (reduces per-step kernel count).
  • Embeds a nvcc -cubin blob (voxtral_cuda_kernels_cubin.h) to avoid PTX JIT compatibility issues on WSL2.
  • Device BF16 weight cache with conservative VRAM sizing; optional cold-start knobs:
    • VOX_CUDA_PREFETCH=1, VOX_CUDA_HOSTREG_GIB=<GiB>, async alloc mempool (default on; disable via VOX_DISABLE_CUDA_MEMPOOL=1).
  • Decoder KV cache lives on-device; the host KV cache is kept consistent via lazy download on CPU fallback (so if CUDA-full runs for a while, then falls back, CPU attention won’t read stale host KV).

Validation

Ran (WSL2):

  • ./scripts/validate_cuda.sh voxtral-model samples/test_speech.wav
  • ./scripts/validate_cuda_pipeline_compact.sh voxtral-model samples/antirez_speaking_italian_short.ogg
  • ./scripts/stress_cuda_two_streams.sh voxtral-model samples/test_speech.wav
  • ./scripts/accuracy_regression.sh voxtral-model samples/test_speech.wav 0
  • ./runtest.sh

Credits

  • Windows-native CUDA build + WASAPI mic support (plus several portability fixes) were originally implemented by @Danmoreng (see comment #3867041166, branch cuda-fork-merge) and adapted into this PR.

Danmoreng commented Feb 7, 2026

Why WSL, though, and not native CUDA under Windows? I was also trying to get CUDA to work with Gemini; I'm not there yet, unfortunately. However, I have a working Windows CPU AVX512 build (PowerShell build script) here: https://github.com/Danmoreng/voxtral.c/tree/windows-support

Would you mind if I merge your CUDA implementation into my fork?

HorizonXP (Author) commented

> Why WSL though and not native CUDA under windows? I was also trying to get CUDA to work with gemini, I'm not there yet unfortunately. However, I have a working Windows CPU AVX512 build (powershell build script) here: https://github.com/Danmoreng/voxtral.c/tree/windows-support
>
> Would you mind if I merge your CUDA implementation into my fork?

Fair point. To be candid, this approach reflects what I had available and what I’m comfortable running locally. The primary goal was simply to leverage the GPU where possible and prove out the core CUDA path.

My hope was that this would still be reasonably portable, at least across Linux and WSL, and that it could form a solid base for broader CUDA support. I don't consider this complete yet, and I'm entirely fine with whether or not you merge it in its current form.

You’ve already pointed out a real gap that I agree is worth addressing. I suspect it’s better handled as a follow-up PR once the core CUDA functionality has landed. Supporting all CUDA environments cleanly (Linux, WSL, Windows) is doable, but the toolchains and constraints differ enough that it will likely require explicit platform conditionals and some careful structuring.

My thinking was to get the fundamentals right first, correctness and meaningful GPU acceleration, and then iterate toward platform-specific polish once we know the shape of the solution is sound.

HorizonXP (Author) commented

Correctness fix: if the CUDA-full decoder falls back to CPU, the host KV cache can be stale.

Now we track host KV validity (ctx->kv_cache_host_valid_len) and lazily download missing KV rows from the device KV cache before running CPU attention (vox_cuda_kv_cache_download_host(), supports fp16/fp32 device KV).

Cherry-picked commit: 2433096


Danmoreng commented Feb 8, 2026

Awesome, I managed to get the native Windows build working on my machine and even added microphone input. It works like a charm on my laptop RTX 5080 (16GB). If you're interested, you can check it out here: https://github.com/Danmoreng/voxtral.c/tree/cuda-fork-merge

https://github.com/Danmoreng/voxtral.c/blob/cuda-fork-merge/WINDOWS_CUDA_GUIDE.md

If you need MSVC / the CUDA Toolkit under Windows for building, I have a PowerShell script that installs all the requirements for building llama.cpp under Windows here: https://github.com/Danmoreng/llama.cpp-installer/blob/main/install_llama_cpp.ps1

```
PS C:\Development\voxtral.c> .\voxtral.exe -d voxtral-model --from-mic -I 0.5
Loading weights...
Model loaded.
WASAPI: Capture started successfully
Listening (Ctrl+C to stop)...
[cuda] decoder prefill enabled (seq_len=38)
Testing real-time transcription with the Voxtral model and the CUDA build. On a laptop with an RTX 5080, with 16GB of VRAM, and it works like a charm. It works perfectly fine. This is really cool. Thank you. Goodbye.
Stopping...

Encoder: 3005 mel -> 367 tokens (4357 ms)
Decoder: 60 text tokens (329 steps) in 7845 ms (prefill 1871 ms + 18.2 ms/step)
```

- Add CUDA prefill (seq_len>1) to populate the KV cache on-device and sync host KV.
- Cache cuBLASLt descriptors/layouts for M=1 BF16 GEMMs to reduce per-call overhead.
- Document the new env var VOX_DISABLE_CUDA_PREFILL and update benchmark notes.
- Add dynamic KV-append + dynamic attention kernels for graph capture.
- Capture a single-token decoder step graph (opt-in via VOX_CUDA_GRAPHS=1).
- Add a bf16 cache eviction counter for observability.
- Add a fused RMSNorm->BF16 kernel and use it in encoder/decoder attention norms.
- Add a mul_1p_rows kernel to apply ada_scale across prefill sequences in one launch.
- Document the CUDA Graphs opt-in and related env flags.
HorizonXP force-pushed the codex/cuda-wsl2-upstream branch from b9569fa to 1f0efcf on February 8, 2026.