
whispercpp

Safe Rust bindings for whisper.cpp speech-to-text inference.


  • Always-bundled build. whispercpp-sys builds a vendored, patched whisper.cpp via CMake; there is no pkg-config / system-install path. The patched source lives on a fork branch with each fix as a reviewable commit (see Memory safety below).
  • Panic-free safe surface. Every FFI call is wrapped in a C++ exception-catching shim, every fallible setter returns WhisperError, every accessor short-circuits on poisoned state.
  • Send + Sync Context; per-Context State is Send. Concurrent inference is serialized through a per-Context mutex so per-call leak budgets are structural, not documentary.
  • Backend matrix. Metal, CoreML, Vulkan, OpenCL, CUDA, ROCm (HIP), oneAPI (SYCL), Moore Threads (MUSA), OpenVINO, OpenBLAS — all opt-in via Cargo features.
  • DTW token timestamps. Built-in token-level timing via DTW over the configured alignment heads (AlignmentHeadsPreset), with safe per-token availability through Token::t_dtw() -> Option<i64>. See DTW timestamps.
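The per-Context mutex behind the Send + Sync claim is the standard serialize-at-the-boundary pattern. A minimal, self-contained sketch with a stand-in Context (a run counter instead of real inference state; not the crate's actual internals):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in Context: real inference state replaced by a run counter.
// Illustrates the serialization pattern only, not the crate's internals.
struct Context {
    inference_lock: Mutex<u32>,
}

impl Context {
    fn new() -> Self {
        Context { inference_lock: Mutex::new(0) }
    }

    // Holds the lock for the whole call, so concurrent callers queue up
    // instead of entering the native inference path simultaneously.
    fn full(&self) {
        let mut runs = self.inference_lock.lock().unwrap();
        *runs += 1;
    }
}

fn run_concurrent(n: u32) -> u32 {
    let ctx = Arc::new(Context::new());
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || ctx.full())
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let runs = *ctx.inference_lock.lock().unwrap();
    runs
}

fn main() {
    // Eight threads share one Context; all eight calls run, one at a time.
    assert_eq!(run_concurrent(8), 8);
}
```

Because the lock is taken inside the call rather than documented as a caller obligation, a leak bound per call holds no matter how the Context is shared.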

Installation

[dependencies]
whispercpp = "0.2"

The default build is plain CPU. Opt into accelerators per-target:

# macOS Apple Silicon
[target.'cfg(all(target_os = "macos", target_arch = "aarch64"))'.dependencies]
whispercpp = { version = "0.2", features = ["metal", "coreml"] }

# Linux + NVIDIA
[target.'cfg(all(target_os = "linux", target_arch = "x86_64"))'.dependencies]
whispercpp = { version = "0.2", features = ["cuda"] }

Examples

A working end-to-end example lives at whispercpp/examples/smoke.rs.

Backends

All backend features chain to the matching whispercpp-sys feature, which toggles the corresponding ggml / whisper CMake flag.
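This feature forwarding is the usual Cargo pattern. A sketch of what the whispercpp side of the chain could look like (illustrative manifest fragment and flag mapping, not the crate's exact contents):

```toml
# whispercpp/Cargo.toml (illustrative)
[features]
metal  = ["whispercpp-sys/metal"]    # sys crate sets the Metal CMake flag
coreml = ["whispercpp-sys/coreml"]   # sys crate sets the CoreML CMake flag
cuda   = ["whispercpp-sys/cuda"]     # sys crate sets the CUDA CMake flag
```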

Feature    Backend                    Platforms
metal      Metal GPU                  Apple
coreml     CoreML / ANE encoder       Apple (with .mlmodelc)
vulkan     Vulkan compute             Linux / Windows / Android / MoltenVK on macOS
opencl     OpenCL (mobile / Adreno)   Linux / Android
cuda       NVIDIA CUDA                Linux / Windows
hipblas    AMD ROCm / HIP             Linux
sycl       Intel oneAPI / Arc         Linux / Windows
musa       Moore Threads MUSA         Linux
openvino   Intel OpenVINO encoder     Linux / Windows
openblas   OpenBLAS CPU               Any
serde      Serialize / Deserialize for Lang (lowercase ISO-639-1)   Any (not a backend)

GPU backends require the corresponding vendor SDK (CUDA Toolkit, ROCm, oneAPI, etc.) installed at link time. CI exercises the bundled CPU path on Linux/macOS/Windows and Metal+CoreML on macOS.

DTW timestamps

Token-level timestamps via DTW over the decoder's cross-attention weights. Enable at Context construction:

use whispercpp::{Context, ContextParams, AlignmentHeadsPreset};

let ctx = Context::new(
    "ggml-large-v3-turbo.bin",
    ContextParams::new()
        .with_use_gpu(true)
        .with_dtw_token_timestamps(true)
        .with_dtw_aheads_preset(AlignmentHeadsPreset::LargeV3Turbo),
)?;

Match AlignmentHeadsPreset to your model: the safe API ships a preset for every standard checkpoint (TinyEn through LargeV3Turbo). A mismatched preset produces noisy timings without erroring. Memory bounds are checked by required_dtw_mem_size_for, and the model is rejected at load if its n_text_ctx exceeds SUPPORTED_DTW_N_TEXT_CTX.

After state.full(&params, &samples), read per-token DTW timing as Option<i64> (centiseconds):

for i in 0..state.n_segments() {
    let seg = state.segment(i).unwrap();
    for j in 0..seg.n_tokens() {
        let token = seg.token(j).unwrap();
        match token.t_dtw() {
            Some(t) => println!("token={} t_dtw={:.2}s",
                token.id(), t as f64 / 100.0),
            None    => /* DTW unavailable for this token */ (),
        }
    }
}

None covers four cases: DTW not enabled at construction, non-text token (special / timestamp), per-segment DTW skip because Params::set_audio_ctx was overridden too small, or audio window too short for the median-filter pass. The underlying C-side patch (whispercpp-sys: dtw t_dtw sentinel init) initialises t_dtw = -1 before every DTW pass so the sentinel uniquely identifies "unavailable" — Some(0) is a valid timestamp (token at audio offset 0), not the sentinel.
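With that sentinel convention, mapping the raw C-side value to the safe Option<i64> reduces to a one-line predicate. A sketch (the crate's actual accessor may be implemented differently):

```rust
// The C side initialises t_dtw = -1 before every DTW pass, so -1 uniquely
// means "unavailable", while 0 is a real timestamp at audio offset 0.
fn t_dtw_from_raw(raw: i64) -> Option<i64> {
    (raw >= 0).then_some(raw)
}

fn main() {
    assert_eq!(t_dtw_from_raw(-1), None);       // sentinel: unavailable
    assert_eq!(t_dtw_from_raw(0), Some(0));     // valid: token at offset 0
    assert_eq!(t_dtw_from_raw(150), Some(150)); // 1.50 s, in centiseconds
}
```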

Constraints (enforced at Context::new):

Constraint                       What it does
dtw + flash_attn                 Rejected. whisper.cpp silently disables DTW under flash-attn; the wrapper refuses the combination explicitly.
dtw + custom n_text_ctx > 448    Rejected. The DTW scratch arena is sized for standard Whisper checkpoints; non-standard models with a larger text context would overflow it.
dtw_mem_size                     Clamped to [MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE], then raised to the per-preset minimum from required_dtw_mem_size_for.
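The dtw_mem_size rule composes two steps: clamp into the global bounds, then raise to the per-preset floor. A sketch with hypothetical byte values (the real constants and required_dtw_mem_size_for live in the crate):

```rust
// Hypothetical bounds; the crate exports the real MIN/MAX constants.
const MIN_DTW_MEM_SIZE: usize = 1 << 20; // 1 MiB (illustrative)
const MAX_DTW_MEM_SIZE: usize = 1 << 30; // 1 GiB (illustrative)

// Hypothetical per-preset minimum; the real value comes from
// required_dtw_mem_size_for(preset).
fn preset_min_mem_size() -> usize {
    8 << 20 // 8 MiB (illustrative)
}

fn effective_dtw_mem_size(requested: usize) -> usize {
    requested
        .clamp(MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE) // step 1: global bounds
        .max(preset_min_mem_size())                // step 2: per-preset floor
}

fn main() {
    assert_eq!(effective_dtw_mem_size(0), 8 << 20);          // raised to preset min
    assert_eq!(effective_dtw_mem_size(64 << 20), 64 << 20);  // already in range
    assert_eq!(effective_dtw_mem_size(usize::MAX), 1 << 30); // clamped to max
}
```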

Native abort paths inside the DTW helper (allocation failures, invalid windows, decoder errors) are all converted to WhisperError::StateLost via the existing exception shim — no abort() is reachable from safe Rust through this surface.

Memory safety

whisper.cpp is a binary parser of attacker-controllable model files plus a substantial C++ inference path. The vendored submodule is pinned to our fork branch (Findit-AI/whisper.cpp@rust), which carries fixes for upstream issues reachable from safe Rust:

  • whisper_kv_cache_free made idempotent (closes a multi-decoder OOM double-free of a ggml backend buffer).
  • whisper_init_state / whisper_init_with_params_no_state / whisper_vad_init_with_params wrapped in RAII so a throw mid-init releases the partial allocation rather than leaking the whisper_context / whisper_state.
  • Tensor headers fully validated: n_dims ∈ [0, 4], name length bounded, ttype < GGML_TYPE_COUNT, per-dim positivity, 64-bit overflow check on nelements.
  • Hparams validated against generous-but-bounded ranges; min n_text_ctx enforced so the decode batch can hold the worst-case prompt.
  • Special-token ids verified to fit n_vocab after the multilingual shift (closes a corrupt-vocab OOB into logits[]).
  • File / buffer loaders throw on partial reads (peek-based EOF detection so clean end-of-tensor-list still terminates).
  • Tensor-name set tracking rejects models that satisfy the loaded-count check by repeating one name.
  • ggml_log_set installed once per process via std::atomic so concurrent create_state + State::full don't race on ggml's static logger globals.
  • vocab.num_languages() synthesis null-checks whisper_lang_str (closes std::string(nullptr) UB).
  • The abort callback is wired through every sched-based graph compute so cancellation interrupts the long-running encoder / decoder paths, not just the gaps between them.
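The tensor-header validation in the list above is pure bounds logic. A self-contained sketch in Rust (the real checks live in the patched C++ loader; the struct, name bound, and type count here are illustrative):

```rust
const GGML_TYPE_COUNT: u32 = 39; // illustrative; tracks upstream ggml
const MAX_NAME_LEN: usize = 64;  // illustrative bound

struct TensorHeader {
    n_dims: i32,
    name_len: usize,
    ttype: u32,
    dims: [i64; 4],
}

fn validate(h: &TensorHeader) -> bool {
    // n_dims must lie in [0, 4]
    if !(0..=4).contains(&h.n_dims) {
        return false;
    }
    // name length bounded
    if h.name_len > MAX_NAME_LEN {
        return false;
    }
    // tensor type must be a known ggml type
    if h.ttype >= GGML_TYPE_COUNT {
        return false;
    }
    // per-dim positivity
    let used = &h.dims[..h.n_dims as usize];
    if used.iter().any(|&d| d <= 0) {
        return false;
    }
    // 64-bit overflow check on nelements
    let mut n: i64 = 1;
    for &d in used {
        n = match n.checked_mul(d) {
            Some(v) => v,
            None => return false,
        };
    }
    true
}

fn main() {
    let ok = TensorHeader { n_dims: 2, name_len: 10, ttype: 0, dims: [80, 384, 1, 1] };
    assert!(validate(&ok));
    let overflow = TensorHeader { n_dims: 2, name_len: 10, ttype: 0, dims: [i64::MAX, 2, 1, 1] };
    assert!(!validate(&overflow)); // nelements would overflow i64
}
```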

A C++ exception-catching shim layer (whispercpp_shim.cpp) sits between the safe Rust API and every throwing entry point. The bindgen allowlist is enumerated symbol-by-symbol — only no-throw raw whisper_* functions are exposed; every throwing function goes through a whispercpp_* shim that catches and surfaces the exception class as a sentinel (WhisperError::ConstructorLost, StateLost, etc.).

build.rs includes a canary that scans the linked source for the required patch markers and hard-fails the build if any are missing.
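The canary amounts to a substring scan over the vendored sources. A minimal sketch (the marker strings here are hypothetical; the real build.rs uses its own marker set and fails the build via panic! when one is absent):

```rust
// Hypothetical patch markers embedded as comments in the patched sources.
const PATCH_MARKERS: &[&str] = &[
    "WCPP_PATCH_KV_CACHE_IDEMPOTENT",
    "WCPP_PATCH_DTW_SENTINEL_INIT",
];

// Returns every marker not found in the given source text.
fn missing_markers<'a>(source: &str, markers: &[&'a str]) -> Vec<&'a str> {
    markers.iter().copied().filter(|m| !source.contains(*m)).collect()
}

fn main() {
    let patched = "/* WCPP_PATCH_KV_CACHE_IDEMPOTENT */ /* WCPP_PATCH_DTW_SENTINEL_INIT */";
    assert!(missing_markers(patched, PATCH_MARKERS).is_empty());

    // A stale checkout missing one patch would hard-fail the build here.
    let stale = "/* WCPP_PATCH_KV_CACHE_IDEMPOTENT */";
    assert_eq!(
        missing_markers(stale, PATCH_MARKERS),
        vec!["WCPP_PATCH_DTW_SENTINEL_INIT"]
    );
}
```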

For the design details, the per-finding analysis lives on the fork branch's commit history.

Crate structure

Crate           Purpose
whispercpp      Safe Rust API (Context, State, Params, Lang, WhisperError). End-user dependency.
whispercpp-sys  Bindgen output + build.rs (cmake build, link directives) + the C++ exception-catching shim.

End users should depend on whispercpp. whispercpp-sys is re-exported as whispercpp::sys for callers who need a raw escape hatch (review every use carefully: only no-throw symbols are exposed, but it is unsafe regardless).

Supported platforms

CI runs on ubuntu-latest, macos-latest, and windows-latest. Sanitizer (ASan + UBSan) and Miri jobs gate the unsafe boundary on every PR. MSRV is pinned in Cargo.toml and enforced via rust-version.

License

whispercpp is distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.

Copyright (c) 2026 FinDIT Studio authors.