CUDA backend (NVIDIA/WSL2) + faster-than-real-time STT#7

Open
HorizonXP wants to merge 35 commits into antirez:main from HorizonXP:codex/cuda-wsl2-upstream
Conversation

HorizonXP commented Feb 7, 2026

Summary

This PR adds an NVIDIA CUDA backend for voxtral.c (Linux/WSL2 tested) and enables faster-than-real-time speech-to-text on an RTX 3080 Ti.

It also adds native Windows build support (PowerShell build + model download scripts) and Windows microphone capture via WASAPI (ported from Danmoreng’s work referenced in PR comment #3867041166).

Key knobs:

  • VOX_CUDA_FAST=1: convenience preset that enables the best-known decoder speedups by default (CUDA graphs + attention v4 (fused KV append; falls back to v3) + merged projections + device RoPE + fused top1-only logits when alternatives are off), unless explicitly overridden.
  • VOX_CUDA_PIPELINE_FULL=1: experimental full CUDA streaming pipeline (keeps adapter embeddings on-device; thread-safe across multiple contexts/streams via serialization).
  • VOX_CUDA_LOGITS_FUSED=1: top1-only logits path (avoids materializing the full logits buffer when only the best token id is needed). Enabled by default under VOX_CUDA_FAST=1 (disable with VOX_DISABLE_CUDA_LOGITS_FUSED=1).
  • VOX_CUDA_LOGITS_INT8=1: opt-in INT8-quantized LM head for top1-only logits (reduces bandwidth of the vocab x dim projection). Default off; may affect accuracy.
  • VOX_CUDA_CUBLASLT_MAX_WS_MB=<MB>: cap cuBLASLt workspace used for M=1 GEMM algo selection (default: 32). Larger values can enable faster kernels at the cost of persistent VRAM.
  • VOX_CUDA_LT_COMPUTE=32F|32F_FAST_16BF|32F_FAST_TF32|32F_FAST_16F: opt-in cuBLASLt compute modes for BF16 M=1 GEMMs (default: 32F). May change outputs slightly; validate with ./scripts/accuracy_regression.sh.
  • VOX_DISABLE_CUBLASLT_AUTOTUNE=1: disable best-effort cuBLASLt autotune for repeated M=1 decoder GEMMs (enabled by default under VOX_CUDA_FAST=1; override with VOX_CUDA_CUBLASLT_AUTOTUNE=0/1). This can reduce prefill overhead on very short clips.

Benchmarks (RTX 3080 Ti, WSL2)

Definitions:

  • Wall transcribe: wall time excluding model load
  • xRT: times real-time = audio_seconds / wall_seconds (higher is better; > 1.0x is faster-than-real-time)

All timings are from VOX_PRINT_TIMINGS=1.

| Sample | Audio | BLAS wall (xRT) | CUDA wall (xRT) | CUDA fast wall (xRT) | CUDA fast + INT8 logits wall (xRT) |
| --- | --- | --- | --- | --- | --- |
| samples/test_speech.wav | 3.64s | 49.77s (0.07x) | 2.60s (1.40x) | 2.55s (1.43x) | 2.93s (1.25x) |
| samples/I_have_a_dream.ogg | 180.02s | skipped (very slow) | 83.48s (2.16x) | 35.22s (5.11x) | 32.53s (5.53x) |

Notes:

  • Model load is printed separately (Model load:) and is small (hundreds of ms here; includes CUDA driver init on first run).
  • For the long sample, the decoder dominates baseline runtime; VOX_CUDA_FAST=1 primarily accelerates the decoder step loop (graphs + v4 + merged weights).
  • VOX_CUDA_LOGITS_INT8=1 is most useful on longer samples: it does a one-time LM-head quantize+upload on first use (INT8 weights are ~384MiB). On very short clips, that one-time work can outweigh the per-step speedup.
  • VOX_CUDA_FAST=1 also enables cuBLASLt autotune for the repeated M=1 decoder GEMMs; on very short clips the one-time tuning shows up as higher prefill. Disable via VOX_DISABLE_CUBLASLT_AUTOTUNE=1 if benchmarking minimal startup latency.
  • CUDA keeps the decoder KV on-device; host KV is lazily downloaded only if CPU attention is used. Perf tweak: prefill KV device->host copies are avoided, and large host-KV memmoves during compaction are skipped while host KV is stale.
  • BLAS is not close to real-time on these tests; for long samples it can take tens of minutes.

Detailed Timing Breakdown

test_speech.wav (3.641750s):

  • BLAS: model load 67 ms, wall 49768 ms, encoder 20643 ms, decoder 29116 ms (prefill 8455 ms + 369.0 ms/step)
  • CUDA: model load 237 ms, wall 2597 ms, encoder 615 ms, decoder 1972 ms (prefill 1298 ms + 12.0 ms/step)
  • CUDA + VOX_CUDA_FAST=1: model load 256 ms, wall 2551 ms, encoder 612 ms, decoder 1928 ms (prefill 1359 ms + 10.2 ms/step)
  • CUDA + VOX_CUDA_FAST=1 VOX_CUDA_LOGITS_INT8=1: model load 246 ms, wall 2925 ms, encoder 716 ms, decoder 2199 ms (prefill 1624 ms + 10.3 ms/step)

I_have_a_dream.ogg (180.021438s; converted to 16kHz mono WAV for the run):

  • CUDA: model load 256 ms, wall 83477 ms, encoder 2588 ms, decoder 80684 ms (prefill 2607 ms + 34.5 ms/step)
  • CUDA + VOX_CUDA_FAST=1: model load 263 ms, wall 35218 ms, encoder 2489 ms, decoder 32525 ms (prefill 1506 ms + 13.7 ms/step)
  • CUDA + VOX_CUDA_FAST=1 VOX_CUDA_LOGITS_INT8=1: model load 244 ms, wall 32529 ms, encoder 2422 ms, decoder 29902 ms (prefill 1474 ms + 12.6 ms/step)

How To Build / Run (Linux / WSL2)

Build:

  • make cuda (requires CUDA Toolkit + nvcc)

Run:

  • ./download_model.sh
  • VOX_PRINT_TIMINGS=1 ./voxtral -d voxtral-model -i samples/test_speech.wav
  • Recommended speed preset: VOX_CUDA_FAST=1 VOX_PRINT_TIMINGS=1 ./voxtral -d voxtral-model -i samples/I_have_a_dream.ogg
  • Optional INT8 logits (accuracy-risky): VOX_CUDA_FAST=1 VOX_CUDA_LOGITS_INT8=1 VOX_PRINT_TIMINGS=1 ./voxtral -d voxtral-model -i samples/I_have_a_dream.ogg

Benchmark helper:

  • ./scripts/benchmark_backends.sh voxtral-model samples/test_speech.wav
  • Skip the slow CPU BLAS run: VOX_BENCH_SKIP_BLAS=1 ./scripts/benchmark_backends.sh voxtral-model samples/I_have_a_dream.ogg
  • Run extra CUDA variants: VOX_BENCH_CUDA_OPTS=1 ...

Windows (Native)

New files:

  • WINDOWS_CUDA_GUIDE.md
  • build.ps1, download_model.ps1, runtest.ps1
  • voxtral_mic_win32.c (WASAPI mic)

Quickstart:

```powershell
.\download_model.ps1
.\build.ps1 -Cuda
.\voxtral.exe -d voxtral-model -i samples\jfk.wav
.\voxtral.exe -d voxtral-model --from-mic -I 0.5
```

Implementation Notes

  • Uses CUDA Driver API + cuBLAS/cuBLASLt.
  • Decoder FFN norm uses a fused CUDA kernel (vox_rms_norm_to_bf16_ada) to combine RMSNorm + (1+ada_scale) + BF16 cast (reduces per-step kernel count).
  • Embeds a nvcc -cubin blob (voxtral_cuda_kernels_cubin.h) to avoid PTX JIT compatibility issues on WSL2.
  • Device BF16 weight cache with conservative VRAM sizing; optional cold-start knobs:
    • VOX_CUDA_PREFETCH=1, VOX_CUDA_HOSTREG_GIB=<GiB>, async alloc mempool (default on; disable via VOX_DISABLE_CUDA_MEMPOOL=1).
  • Decoder KV cache lives on-device; the host KV cache is kept consistent via lazy download on CPU fallback (so if CUDA-full runs for a while, then falls back, CPU attention won’t read stale host KV).

Validation

Ran (WSL2):

  • ./scripts/validate_cuda.sh voxtral-model samples/test_speech.wav
  • ./scripts/validate_cuda_pipeline_compact.sh voxtral-model samples/antirez_speaking_italian_short.ogg
  • ./scripts/stress_cuda_two_streams.sh voxtral-model samples/test_speech.wav
  • ./scripts/accuracy_regression.sh voxtral-model samples/test_speech.wav 0
  • ./runtest.sh

Credits

  • Windows-native CUDA build + WASAPI mic support (plus several portability fixes) were originally implemented by @Danmoreng (see comment #3867041166, branch cuda-fork-merge) and adapted into this PR.

Danmoreng commented Feb 7, 2026

Why WSL, though, and not native CUDA under Windows? I was also trying to get CUDA to work with Gemini; I'm not there yet, unfortunately. However, I have a working Windows CPU AVX512 build (PowerShell build script) here: https://github.com/Danmoreng/voxtral.c/tree/windows-support

Would you mind if I merge your CUDA implementation into my fork?

HorizonXP (Author) commented

> Why WSL though and not native CUDA under windows? I was also trying to get CUDA to work with gemini, I'm not there yet unfortunately. However, I have a working Windows CPU AVX512 build (powershell build script) here: https://github.com/Danmoreng/voxtral.c/tree/windows-support
>
> Would you mind if I merge your CUDA implementation into my fork?

Fair point. To be candid, this approach reflects what I had available and what I’m comfortable running locally. The primary goal was simply to leverage the GPU where possible and prove out the core CUDA path.

My hope was that this would still be reasonably portable, at least across Linux and WSL, and that it could form a solid base for broader CUDA support. I don't consider this complete yet, and I'm entirely fine with whether or not you merge it in its current form.

You’ve already pointed out a real gap that I agree is worth addressing. I suspect it’s better handled as a follow-up PR once the core CUDA functionality has landed. Supporting all CUDA environments cleanly (Linux, WSL, Windows) is doable, but the toolchains and constraints differ enough that it will likely require explicit platform conditionals and some careful structuring.

My thinking was to get the fundamentals right first, correctness and meaningful GPU acceleration, and then iterate toward platform-specific polish once we know the shape of the solution is sound.

HorizonXP (Author) commented

Correctness fix: if the CUDA-full decoder falls back to CPU, the host KV cache can be stale.

Now we track host KV validity (ctx->kv_cache_host_valid_len) and lazily download missing KV rows from the device KV cache before running CPU attention (vox_cuda_kv_cache_download_host(), supports fp16/fp32 device KV).

Cherry-picked commit: 2433096


Danmoreng commented Feb 8, 2026

Awesome, I managed to get the native Windows build working on my machine and even added microphone input. It works like a charm on my laptop RTX 5080 (16GB). If you're interested, you can check it out here: https://github.com/Danmoreng/voxtral.c/tree/cuda-fork-merge

https://github.com/Danmoreng/voxtral.c/blob/cuda-fork-merge/WINDOWS_CUDA_GUIDE.md

If you need MSVC / the CUDA Toolkit under Windows for building, I have a PowerShell script that installs all the requirements for building llama.cpp under Windows here: https://github.com/Danmoreng/llama.cpp-installer/blob/main/install_llama_cpp.ps1

```
PS C:\Development\voxtral.c> .\voxtral.exe -d voxtral-model --from-mic -I 0.5
Loading weights...
Model loaded.
WASAPI: Capture started successfully
Listening (Ctrl+C to stop)...
[cuda] decoder prefill enabled (seq_len=38)
Testing real-time transcription with the Voxtral model and the CUDA build. On a laptop with an RTX 5080, with 16GB of VRAM, and it works like a charm. It works perfectly fine. This is really cool. Thank you. Goodbye.
Stopping...

Encoder: 3005 mel -> 367 tokens (4357 ms)
Decoder: 60 text tokens (329 steps) in 7845 ms (prefill 1871 ms + 18.2 ms/step)
```

- Add CUDA prefill (seq_len>1) to populate the KV cache on-device and sync host KV.
- Cache cuBLASLt descriptors/layouts for M=1 BF16 GEMMs to reduce per-call overhead.
- Document the new env var VOX_DISABLE_CUDA_PREFILL and update benchmark notes.
- Add dynamic KV-append + dynamic attention kernels for graph capture.
- Capture a single-token decoder step graph (opt-in via VOX_CUDA_GRAPHS=1).
- Add a bf16 cache eviction counter for observability.
- Add a fused RMSNorm->BF16 kernel and use it in encoder/decoder attention norms.
- Add a mul_1p_rows kernel to apply ada_scale across prefill sequences in one launch.
- Document the CUDA Graphs opt-in and related env flags.
HorizonXP force-pushed the codex/cuda-wsl2-upstream branch from b9569fa to 1f0efcf on February 8, 2026.