Add CUDA GPU backend for NVIDIA acceleration #12
Open
0xSufi wants to merge 3 commits into antirez:main from
Conversation
Add cuBLAS-accelerated matrix multiplication with BF16 weight caching, giving ~13x overall speedup on RTX 3090 vs CPU-only OpenBLAS (encoder 5.5x, decoder prefill 51x, decoder per-step 25x).

New files:
- voxtral_cuda.h: C header with CUDA backend API
- voxtral_cuda.cu: cuBLAS GEMM with BF16 weight cache, F32→BF16 activation conversion kernel, activation buffer pool, and cudaMallocManaged for KV caches

Modified files:
- voxtral_kernels.c: #ifdef USE_CUDA dispatch for bf16 matmul functions
- voxtral_encoder.c: CUDA unified memory for encoder KV cache
- voxtral_decoder.c: CUDA unified memory for decoder KV cache + grow
- voxtral.c: CUDA weight warmup at load time, shared memory free
- main.c: vox_cuda_init() / vox_cuda_shutdown()
- Makefile: `make cuda` target (requires CUDA toolkit + OpenBLAS, SM 8.0+)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
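For readers unfamiliar with the F32→BF16 conversion the commit mentions, here is a minimal CPU reference sketch of the usual approach (the actual CUDA kernel in voxtral_cuda.cu may differ; the rounding mode shown, round-to-nearest-even, is an assumption):

```c
#include <stdint.h>
#include <string.h>

/* CPU reference for F32 -> BF16 conversion. BF16 keeps the top 16 bits of
 * an IEEE-754 float; round-to-nearest-even is implemented by adding a bias
 * derived from the bit that becomes the new LSB before truncating. */
static uint16_t f32_to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);      /* reinterpret the float's bits safely */
    uint32_t lsb = (u >> 16) & 1;  /* LSB of the mantissa that survives */
    u += 0x7FFFu + lsb;            /* round to nearest, ties to even */
    return (uint16_t)(u >> 16);
}

static float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;  /* zero-fill the dropped bits */
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

Values already representable in BF16 (such as small powers of two and their simple sums) round-trip exactly; everything else loses the low 16 mantissa bits.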
Custom CUDA kernel with online softmax, GQA support, and sliding-window masking. One block per (head, query_pos) with warp shuffle + shared-memory dot-product reduction. Only dispatched for the encoder incremental path, where the KV cache is in managed memory — the decoder stays on CPU, since single-token attention is faster without kernel-launch and page-migration overhead.

Encoder: 1879ms → 634ms (2.97x); decoder unchanged at ~16ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 3 of the CUDA backend: keep activations on GPU between operations, eliminating per-matmul CPU↔GPU round-trips. Adds custom CUDA kernels (rms_norm, silu, gelu, add/mul, RoPE, ada_scale, bias_add), a device-pointer cuBLAS helper, persistent GPU buffers, and monolithic decoder (26 layers) and encoder (32 layers) step functions that execute with a single cudaStreamSynchronize.

Performance on RTX 3090 (3.6s test audio):
- Encoder: 225ms (was 634ms in Phase 2, 1879ms in Phase 1)
- Decoder: 12.3ms/step (was 16.8ms in Phase 2, 16.7ms in Phase 1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi Salvatore, it seems to me that Opus 4.6 did a good job here, though maybe it's too Linux-specific? You're the best!
How is this different from the existing CUDA PR #7?
Well, first of all, look at the lines of code this PR added vs the other one; they are substantially different. Also, admittedly, this PR was created entirely with Claude Code (Opus 4.6); I am not sure about #7. I have skimmed it briefly and it looks like it's Windows-specific, whereas this PR should be Linux-specific (I asked antirez about it). Edit: Here's Opus 4.6's answer to your question:
Summary
- make cuda build target; CUDA and Metal are mutually exclusive at compile time

Performance (RTX 3090, 3.6s test audio)
Architecture
- Custom kernels: rms_norm (shared-memory reduction), silu, gelu, add/mul, apply_rope (paired rotation), causal_attention (online softmax, GQA, sliding window), ada_scale, bias_add
- Monolithic encoder/decoder step functions run with a single cudaStreamSynchronize — no per-layer CPU round-trips
- cudaMallocManaged for zero-copy CPU↔GPU KV caches; GPU writes KV entries via device→managed copy to avoid page thrashing
- One cudaMalloc per component (decoder/encoder), pointer arithmetic for sub-buffers, reused across tokens

New files
- voxtral_cuda.h — C header with CUDA backend API
- voxtral_cuda.cu — Full CUDA implementation (~1100 lines)

Modified files

- Makefile — make cuda target (nvcc + gcc, links cudart/cublas)
- main.c — CUDA init/shutdown lifecycle
- voxtral.c — BF16 + F32 weight warmup, unified memory KV cache allocation
- voxtral_kernels.c — #ifdef USE_CUDA dispatch for bf16 matmul functions
- voxtral_decoder.c — CUDA monolithic decoder step dispatch, unified memory KV cache
- voxtral_encoder.c — CUDA monolithic encoder step dispatch, unified memory KV cache

Test plan
- make cuda builds without errors

🤖 Generated with Claude Code
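The "one cudaMalloc per component, pointer arithmetic for sub-buffers" scheme in the architecture notes is a classic bump/arena allocator: one big device allocation is carved into aligned slices once, and the same offsets are reused every token. A host-side sketch (names are illustrative, not taken from voxtral_cuda.cu; malloc stands in for cudaMalloc so the sketch runs on the CPU):

```c
#include <stddef.h>
#include <stdint.h>

/* Bump allocator over one pre-allocated block: each request returns the
 * next 256-byte-aligned slice (a common GPU alignment, assumed here).
 * Nothing is freed individually; resetting `used` to 0 reuses the arena. */
typedef struct { uint8_t *base; size_t used, cap; } arena_t;

static void *arena_alloc(arena_t *a, size_t bytes) {
    size_t aligned = (a->used + 255) & ~(size_t)255;  /* round up to 256 */
    if (aligned + bytes > a->cap) return NULL;        /* out of capacity */
    a->used = aligned + bytes;
    return a->base + aligned;
}
```

Since every sub-buffer is a fixed offset into one allocation, there is a single allocation/free per component and no per-token allocator traffic on the hot path.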