feat(dtw): safe Rust API for whisper.cpp DTW token timestamps #7

Merged: al8n merged 3 commits into main from feat/dtw on May 8, 2026

@uqio (Collaborator) commented May 8, 2026

Adds the safe wrapper layer for whisper.cpp's DTW (Dynamic Time Warping) token-level timestamp path, plus the matching native-side hardening that makes the feature reachable from safe Rust without exposing abort paths under resource pressure or malformed input.

Public API

whispercpp::ContextParams gains:

  • with_dtw_token_timestamps(bool) — enable the DTW pass at context construction.
  • with_dtw_aheads_preset(AlignmentHeadsPreset) — pick the alignment-head set; one variant per shipping whisper checkpoint (TinyEn through LargeV3Turbo). None disables DTW even when the flag is on.
  • with_dtw_mem_size(usize) — override the DTW scratch arena. Clamped to [MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE] and raised to the per-preset minimum from required_dtw_mem_size_for(preset).

whispercpp::Token gains:

  • t_dtw() -> Option<i64> — DTW-derived timestamp in centiseconds. None covers DTW disabled, non-text tokens, and per-segment skips (audio_ctx mismatch, short-window medfilt). The native side sets a -1 sentinel before each DTW pass, and the accessor maps it to None.
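
The sentinel-to-`Option` projection could look roughly like this. It is a minimal sketch rather than the crate's code: `T_DTW_UNSET`, `project_t_dtw`, and `centis_to_secs` are hypothetical names, and only the `-1` sentinel, the `Option<i64>` result, and the centisecond unit come from the description.

```rust
/// Sentinel the native side writes into `t_dtw` before each DTW pass.
/// (Hypothetical constant name; the value -1 is from the PR description.)
const T_DTW_UNSET: i64 = -1;

/// Hypothetical helper: map the raw native value to an `Option`.
/// `-1` means "DTW did not produce a timestamp"; `0` is a valid
/// timestamp at audio offset 0 and must stay `Some(0)`.
fn project_t_dtw(raw: i64) -> Option<i64> {
    if raw == T_DTW_UNSET {
        None
    } else {
        Some(raw)
    }
}

/// Convert a centisecond timestamp to seconds for display.
fn centis_to_secs(t: i64) -> f64 {
    t as f64 / 100.0
}
```

Keeping `0` distinct from the sentinel is the point of the design: a token at the very start of the audio still reports a real timestamp.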

New public constants / functions:

  • DEFAULT_DTW_MEM_SIZE, MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE — scratch arena bounds (128 MiB, 128 MiB, 4 GiB).
  • SUPPORTED_DTW_N_TEXT_CTX (= 448) — the standard whisper text-context window the wrapper budgets for; non-standard models with larger n_text_ctx are refused at Context::new when DTW is on.
  • required_dtw_mem_size_for(AlignmentHeadsPreset) — per-preset scratch requirement (callers can pre-size budgets).
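
As an illustration of how the clamp and the per-preset minimum might combine, here is a small sketch. It assumes the 64-bit bounds listed above and uses `u64` so it compiles on any target (the real API takes `usize`); `effective_dtw_mem_size` is a hypothetical name, and the per-preset minimum is passed in as a plain number rather than looked up from a preset.

```rust
const MIB: u64 = 1024 * 1024;
const MIN_DTW_MEM_SIZE: u64 = 128 * MIB;
const MAX_DTW_MEM_SIZE: u64 = 4096 * MIB; // 4 GiB, the 64-bit cap from the description

/// Hypothetical helper: clamp a requested arena size into
/// [MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE], then raise it to the
/// per-preset minimum (itself capped at MAX_DTW_MEM_SIZE).
fn effective_dtw_mem_size(requested: u64, preset_minimum: u64) -> u64 {
    let clamped = requested.clamp(MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE);
    clamped.max(preset_minimum.min(MAX_DTW_MEM_SIZE))
}
```

Under these assumptions, a request of `0` with a 278 MiB preset minimum resolves to 278 MiB, and an oversized request saturates at the cap.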

Configuration rejections enforced at Context::new:

  • DTW + flash_attn (whisper.cpp silently disables DTW under flash-attn; refused explicitly so the Rust API contract isn't violated).
  • DTW + custom n_text_ctx > 448 (the scratch arena is sized for standard checkpoints).
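
The two rejection rules can be sketched as a standalone check. This is an assumption-laden illustration, not the code in `Context::new`: the `DtwConfigError` enum, its variants, and `validate_dtw_config` are hypothetical; only the two rules and the 448 limit come from the description.

```rust
const SUPPORTED_DTW_N_TEXT_CTX: u32 = 448;

/// Hypothetical error type for the two rejected configurations.
#[derive(Debug, PartialEq)]
enum DtwConfigError {
    /// DTW requested together with flash attention.
    FlashAttn,
    /// DTW requested on a model whose text context exceeds the budgeted window.
    NTextCtxTooLarge { n_text_ctx: u32 },
}

/// Hypothetical validation: configurations without DTW pass unconditionally.
fn validate_dtw_config(
    dtw_enabled: bool,
    flash_attn: bool,
    n_text_ctx: u32,
) -> Result<(), DtwConfigError> {
    if !dtw_enabled {
        return Ok(());
    }
    if flash_attn {
        return Err(DtwConfigError::FlashAttn);
    }
    if n_text_ctx > SUPPORTED_DTW_N_TEXT_CTX {
        return Err(DtwConfigError::NTextCtxTooLarge { n_text_ctx });
    }
    Ok(())
}
```

Rejecting up front means a caller who asked for DTW either gets DTW timestamps or an error, never a silently downgraded context.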

Usage

use whispercpp::{Context, ContextParams, AlignmentHeadsPreset};

let ctx = Context::new(
    "ggml-large-v3-turbo.bin",
    ContextParams::new()
        .with_use_gpu(true)
        .with_dtw_token_timestamps(true)
        .with_dtw_aheads_preset(AlignmentHeadsPreset::LargeV3Turbo),
)?;

// ... state.full(&params, &samples)? ...

for i in 0..state.n_segments() {
    let seg = state.segment(i).unwrap();
    for j in 0..seg.n_tokens() {
        let token = seg.token(j).unwrap();
        match token.t_dtw() {
            Some(t) => println!("token={} t_dtw={:.2}s", token.id(), t as f64 / 100.0),
            None    => /* DTW unavailable for this token */ (),
        }
    }
}

Native-side patches (in submodule Findit-AI/whisper.cpp@rust)

Sixteen patches harden every assertion / abort / silent-skip path reachable from safe Rust through the DTW surface, plus several adjacent paths the audit surfaced.

DTW-specific:

  1. DTW scratch RAII guard — gctx wrapped in a unique_ptr so any throw between ggml_init and the explicit ggml_free releases the arena via stack unwinding. Closes a ~dtw_mem_size (default 128 MiB) leak per failed decode.
  2. DTW scratch alloc-fail throws — explicit throw std::bad_alloc() on ggml_init NULL, replacing a silent return that left every Token::t_dtw at zero with no error signal.
  3. DTW token assignment bounded — replaces the nested iterator walk over result_all with a flat list of text-token pointers + bounded index walk. Closes a past-the-end iterator deref on the last token of the last segment (C++ UB).
  4. DTW short-window medfilt clamp — adapts the hardcoded medfilt_width=7 down to the largest odd value strictly less than n_audio_tokens, skipping the median filter entirely for n_audio_tokens <= 1. Closes a WHISPER_ASSERT(filter_width < a->ne[2]) abort on residual short segments.
  5. DTW audio_ctx override guard — replaces WHISPER_ASSERT(n_frames <= n_audio_ctx * 2) with a recoverable WARN+return when callers override Params::set_audio_ctx smaller than the chunk requires.
  6. DTW backtrace impossible-case throws — replaces WHISPER_ASSERT(0) in the impossible branch of the lattice state machine with a std::runtime_error throw.
  7. DTW aheads_cross_QKs invariants throw — post-decode null / dimension checks throw rather than abort.
  8. DTW backend compute throws — checks ggml_backend_init_by_type for NULL and ggml_backend_graph_compute for non-success, throwing on either.
  9. DTW decode failure throws — replaces a WHISPER_ASSERT(0) after a failed whisper_decode_internal pass with std::bad_alloc.
  10. DTW t_dtw sentinel init — sets every text token's t_dtw = -1 BEFORE any skip path, so the safe wrapper's Option<i64> accessor can distinguish "DTW skipped" from "DTW computed at audio offset 0".
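
The width selection in patch 4 is easy to get wrong at the boundaries, so here is the rule written out. The real patch lives in the C++ submodule; this Rust transcription is only a sketch of the stated rule, and `medfilt_width_for` and `DEFAULT_MEDFILT_WIDTH` are hypothetical names.

```rust
/// The hardcoded filter width the patch starts from.
const DEFAULT_MEDFILT_WIDTH: usize = 7;

/// Hypothetical helper: pick the median-filter width for a window of
/// `n_audio_tokens`. Returns `None` when the filter should be skipped
/// (`n_audio_tokens <= 1`), otherwise the largest odd width that is
/// strictly less than `n_audio_tokens` and no larger than 7.
fn medfilt_width_for(n_audio_tokens: usize) -> Option<usize> {
    if n_audio_tokens <= 1 {
        return None;
    }
    // Largest integer strictly less than n_audio_tokens.
    let below = n_audio_tokens - 1;
    // Step down by one if that integer is even.
    let largest_odd = if below % 2 == 1 { below } else { below - 1 };
    Some(largest_odd.min(DEFAULT_MEDFILT_WIDTH))
}
```

With this rule the condition `filter_width < n_audio_tokens` holds for every window the filter actually runs on, which is what the original assertion demanded.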

Adjacent paths surfaced by the audit:

  1. ggml_init OOM-safe context alloc (in ggml/src/ggml.c) — replaces GGML_MALLOC + GGML_ASSERT(ctx->mem_buffer != NULL) with plain malloc + null-handling, so OOM returns NULL instead of abort()-ing.
  2. whispercpp_ggml_init_or_throw wrapper — every unchecked ggml_init call site in whisper.cpp (8 sites: model_load, graph builders for conv/encoder/cross/decoder/vad, vad_init, bench) goes through the wrapper, which throws std::bad_alloc on NULL. Closes 8 SIGSEGV paths reachable from safe Rust.
  3. KV buffer null throws — replaces WHISPER_ASSERT(!!kv_pad.buffer) / WHISPER_ASSERT(!!kv_self.buffer) in the encoder / decoder graph builders with throws.
  4. token_to_str sparse-vocab no-throw — replaces id_to_token.at(token) with .find() returning NULL on miss. Closes a std::out_of_range UB across extern "C" for sparse-vocab models where hparams.n_vocab exceeds the actually-loaded vocab table.
  5. Hparams head divisibility check — rejects n_audio_state % n_audio_head != 0 and n_text_state % n_text_head != 0 at load time. Closes a GGML_ASSERT on shape mismatch during encoder graph build.
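
The load-time divisibility check in patch 5 amounts to two modulo tests. The sketch below restates it in Rust for illustration; the actual check is in the patched C++ loader, and the `Hparams` struct, `HparamsError` enum, and the zero-head handling are assumptions made for this example.

```rust
/// Hypothetical subset of the model hyperparameters.
struct Hparams {
    n_audio_state: u32,
    n_audio_head: u32,
    n_text_state: u32,
    n_text_head: u32,
}

#[derive(Debug, PartialEq)]
enum HparamsError {
    AudioHeadsNotDivisible,
    TextHeadsNotDivisible,
}

/// Reject models whose state width is not a multiple of the head count.
/// A head count of zero is treated as invalid too (an assumption here),
/// which also avoids a division by zero in the modulo.
fn check_head_divisibility(h: &Hparams) -> Result<(), HparamsError> {
    if h.n_audio_head == 0 || h.n_audio_state % h.n_audio_head != 0 {
        return Err(HparamsError::AudioHeadsNotDivisible);
    }
    if h.n_text_head == 0 || h.n_text_state % h.n_text_head != 0 {
        return Err(HparamsError::TextHeadsNotDivisible);
    }
    Ok(())
}
```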

Build-time verification

whispercpp-sys/build.rs::verify_patched_source scans the linked submodule for every patch's sentinel marker (29 markers across src/whisper.cpp and ggml/src/ggml.c) and hard-fails the build if any are missing. The shape changed from a single flat marker list to (file, marker) tuples since some patches now sit in ggml.c alongside the existing whisper.cpp ones.
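
A `(file, marker)` scan of this kind might be structured as below. This is a sketch, not the contents of `build.rs`: `missing_markers` is a hypothetical function, file contents are passed in as strings so the example stays self-contained, and any marker text used with it is a placeholder.

```rust
/// Hypothetical helper: return every `(file, marker)` pair whose marker
/// does not appear in the contents of the named file. `sources` holds
/// `(file name, file contents)` pairs; a file missing from `sources`
/// counts as missing all of its markers.
fn missing_markers<'a>(
    sources: &[(&str, &str)],
    markers: &[(&'a str, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    markers
        .iter()
        .copied()
        .filter(|(file, marker)| {
            !sources
                .iter()
                .any(|(name, contents)| *name == *file && contents.contains(*marker))
        })
        .collect()
}
```

A build script using this shape would read each file with `std::fs::read_to_string` and `panic!` with the returned list when it is non-empty, which is what turns a missing patch into a hard build failure.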

Audit scope

A comprehensive audit of WHISPER_ASSERT / GGML_ASSERT / GGML_ABORT sites in the linked submodule classified each as:

  • HARDENED — reachable from safe Rust + can fire under runtime conditions (16 sites converted to throws).
  • STRUCTURAL — reachable but only fires on programming error (left as-is; hardening would mask wrapper bugs).
  • UNREACHABLE — in functions our wrapper never invokes (VAD, grammar, bench).

The classification is documented in the round-9 commit message, preserved in the per-round commit history on the submodule's rust branch. With the audit's patches plus the earlier ones, every assertion path reachable from safe Rust under runtime conditions is now a throw that the existing exception shim converts to a Rust WhisperError. The remaining sites guard programming-error invariants.

Testing

28 unit tests cover the new public API surface:

  • AlignmentHeadsPreset enum bijection (every variant maps to a distinct C enum value).
  • Per-preset alignment head counts pinned against whisper.cpp's g_aheads_* tables.
  • Per-preset scratch budget pins (LargeV2 = 278 MiB, SmallEn = 230 MiB, MediumEn = 218 MiB).
  • clamp_dtw_mem_size boundary cases (0, MIN-1, MIN, MIN+1, MAX-1, MAX, MAX+1, usize::MAX).
  • with_dtw_mem_size setter clamping (0, max).
  • Context::new rejection of DTW + flash_attn (in two setter orders + a positive no-effect-config case).
  • Token::t_dtw() sentinel mapping (-1 → None, 0 → Some(0)).
  • SUPPORTED_DTW_N_TEXT_CTX constant pin.

Fault-injection tests for OOM scenarios (allocator override, model with non-divisible head dims, sparse vocab) need crafted GGUF fixtures and LD_PRELOAD malloc-fail harnesses, deferred to follow-up infrastructure work — the structural contracts are now correct (failures propagate through C++ unwinding to the FFI shim, never abort from safe Rust).

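Documentation

  • README adds a ## DTW timestamps section: enable-at-construction example, per-token reader pattern, enumerated None cases, rejection rules table.
  • whispercpp/TODO.md removes the "DTW token timestamps" entry from the deliberate-omissions list.
  • GitHub issue #6 (intentional omissions) updated to drop the DTW section.
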
Crate version

Both whispercpp and whispercpp-sys bumped to 0.2.0; the dependency declaration in whispercpp/Cargo.toml was updated to ^0.2.

🤖 Generated with Claude Code

Adds the safe wrapper layer for whisper.cpp's DTW
(Dynamic Time Warping) token-level timestamp path,
plus the matching native-side hardening that makes the
feature reachable from safe Rust without exposing
abort paths under resource pressure or malformed input.

`whispercpp::ContextParams` gains:
- `with_dtw_token_timestamps(bool)` — enable the DTW
  pass at context construction.
- `with_dtw_aheads_preset(AlignmentHeadsPreset)` —
  pick the alignment-head set; one variant per
  shipping whisper checkpoint (`TinyEn` through
  `LargeV3Turbo`). `None` disables DTW even when the
  flag is on.
- `with_dtw_mem_size(usize)` — override the DTW
  scratch arena. Clamped to
  `[MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE]` and raised
  to the per-preset minimum from
  `required_dtw_mem_size_for(preset)`.

`whispercpp::Token` gains:
- `t_dtw() -> Option<i64>` — DTW-derived timestamp in
  centiseconds. `None` covers DTW-disabled, non-text
  tokens, per-segment skips (audio_ctx mismatch,
  short-window medfilt). `-1` sentinel set by the
  native side before each DTW pass.

New public constants / functions:
- `DEFAULT_DTW_MEM_SIZE`, `MIN_DTW_MEM_SIZE`,
  `MAX_DTW_MEM_SIZE` — scratch arena bounds (128 MiB,
  128 MiB, 4 GiB).
- `SUPPORTED_DTW_N_TEXT_CTX` (= 448) — the standard
  whisper text-context window the wrapper budgets
  for; non-standard models with larger `n_text_ctx`
  are refused at `Context::new` when DTW is on.
- `required_dtw_mem_size_for(AlignmentHeadsPreset)`
  — per-preset scratch requirement (callers can
  pre-size budgets).

Configuration rejections enforced at `Context::new`:
- DTW + `flash_attn` (whisper.cpp silently disables
  DTW under flash-attn; refused explicitly so the
  Rust API contract isn't violated).
- DTW + custom `n_text_ctx > 448` (the scratch arena
  is sized for standard checkpoints).

Native-side patches (in submodule `Findit-AI/whisper.cpp@rust`)

Sixteen native-side patches harden every assertion /
abort / silent-skip path reachable from safe Rust
through the DTW surface, plus several adjacent paths
the audit surfaced:

DTW-specific:
1. DTW scratch RAII guard — `gctx` wrapped in a
   `unique_ptr` so any throw between
   `ggml_init` and the explicit `ggml_free` releases
   the arena via stack unwinding. Closes a
   ~`dtw_mem_size` (default 128 MiB) leak per failed
   decode.
2. DTW scratch alloc-fail throws — explicit
   `throw std::bad_alloc()` on `ggml_init` NULL,
   replacing a silent return that left every
   `Token::t_dtw` at zero with no error signal.
3. DTW token assignment bounded — replaces the nested
   iterator walk over `result_all` with a flat list
   of text-token pointers + bounded index walk.
   Closes a past-the-end iterator deref on the last
   token of the last segment.
4. DTW short-window medfilt clamp — adapts the
   hardcoded `medfilt_width=7` down to the largest
   odd value strictly less than `n_audio_tokens`,
   skipping the median filter entirely for
   `n_audio_tokens <= 1`. Closes a
   `WHISPER_ASSERT(filter_width < a->ne[2])` abort
   on residual short segments.
5. DTW `audio_ctx` override guard — replaces
   `WHISPER_ASSERT(n_frames <= n_audio_ctx * 2)`
   with a recoverable WARN+return when callers
   override `Params::set_audio_ctx` smaller than the
   chunk requires.
6. DTW backtrace impossible-case throws —
   replaces `WHISPER_ASSERT(0)` in the impossible
   branch of the lattice state machine with a
   `std::runtime_error` throw.
7. DTW `aheads_cross_QKs` invariants throw —
   post-decode null / dimension checks throw rather
   than abort.
8. DTW backend compute throws — checks
   `ggml_backend_init_by_type` for NULL and
   `ggml_backend_graph_compute` for non-success,
   throwing on either.
9. DTW decode failure throws — replaces a
   `WHISPER_ASSERT(0)` after a failed
   `whisper_decode_internal` pass with
   `std::bad_alloc`.
10. DTW `t_dtw` sentinel init — sets every text
    token's `t_dtw = -1` BEFORE any skip path, so
    the safe wrapper's `Option<i64>` accessor can
    distinguish "DTW skipped" from "DTW computed
    at audio offset 0".

Adjacent paths surfaced by the audit:
11. `ggml_init` OOM-safe context alloc (in
    `ggml/src/ggml.c`) — replaces `GGML_MALLOC` +
    `GGML_ASSERT(ctx->mem_buffer != NULL)` with
    plain `malloc` + null-handling, so OOM returns
    NULL instead of `abort()`-ing.
12. `whispercpp_ggml_init_or_throw` wrapper — every
    unchecked `ggml_init` call site in whisper.cpp
    (8 sites: model_load, graph builders for
    conv/encoder/cross/decoder/vad, vad_init,
    bench) goes through the wrapper, which throws
    `std::bad_alloc` on NULL. Closes 8 SIGSEGV
    paths reachable from safe Rust.
13. KV buffer null throws — replaces
    `WHISPER_ASSERT(!!kv_pad.buffer)` /
    `WHISPER_ASSERT(!!kv_self.buffer)` in the
    encoder / decoder graph builders with throws.
14. `token_to_str` sparse-vocab no-throw — replaces
    `id_to_token.at(token)` with `.find()` returning
    NULL on miss. Closes a
    `std::out_of_range` UB across `extern "C"` for
    sparse-vocab models where `hparams.n_vocab`
    exceeds the actually-loaded vocab table.
15. Hparams head divisibility check — rejects
    `n_audio_state % n_audio_head != 0` and
    `n_text_state % n_text_head != 0` at load time.
    Closes a `GGML_ASSERT` on shape mismatch during
    encoder graph build.

`whispercpp-sys/build.rs::verify_patched_source`
scans the linked submodule for every patch's
sentinel marker (29 markers across `src/whisper.cpp`
and `ggml/src/ggml.c`) and hard-fails the build if
any are missing. The shape changed from a single
flat marker list to `(file, marker)` tuples since
some patches now sit in `ggml.c` alongside the
existing `whisper.cpp` ones.

28 unit tests cover the new public API surface:
- `AlignmentHeadsPreset` enum bijection (every
  variant maps to a distinct C enum value).
- Per-preset alignment head counts pinned against
  whisper.cpp's `g_aheads_*` tables.
- Per-preset scratch budget pins (`LargeV2` =
  278 MiB, `SmallEn` = 230 MiB, `MediumEn` =
  218 MiB).
- `clamp_dtw_mem_size` boundary cases (0, MIN-1,
  MIN, MIN+1, MAX-1, MAX, MAX+1, `usize::MAX`).
- `with_dtw_mem_size` setter clamping (0, max).
- `Context::new` rejection of DTW + `flash_attn`
  (in two setter orders + a positive
  no-effect-config case).
- `Token::t_dtw()` sentinel mapping (`-1` → `None`,
  `0` → `Some(0)`).
- `SUPPORTED_DTW_N_TEXT_CTX` constant pin.

Fault-injection tests for OOM scenarios (allocator
override, model with non-divisible head dims,
sparse vocab) need crafted GGUF fixtures and
`LD_PRELOAD` malloc-fail harnesses, deferred to
follow-up infrastructure work — the structural
contracts are now correct (failures propagate
through C++ unwinding to the FFI shim, never abort
from safe Rust).

- README adds a `## DTW timestamps` section: enable-
  at-construction example, per-token reader pattern,
  enumerated `None` cases, rejection rules table.
- `whispercpp/TODO.md` removes the "DTW token
  timestamps" entry from the deliberate-omissions
  list.
- GitHub issue #6 (intentional omissions) updated to
  drop the DTW section.

Both `whispercpp` and `whispercpp-sys` bumped to
0.2.0; the dependency declaration in
`whispercpp/Cargo.toml` was updated to `^0.2`.

CI's `cargo clippy --workspace --all-targets` failed under
`RUSTFLAGS=-Dwarnings` on:

```
error: this assertion has a constant value
    --> whispercpp/src/context.rs:1289:5
     |
1289 |     assert!(MIN_DTW_MEM_SIZE <= MAX_DTW_MEM_SIZE);
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

Both sides are `const usize`, so the comparison is computed
at compile time. clippy's `assertions_on_constants` lint
wants a `const { ... }` block to make the compile-time
evaluation explicit.

Wrapping the assertion in `const { ... }` preserves the
semantics (the order invariant pin) and satisfies clippy
without needing `#[allow]`.
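
For readers unfamiliar with the pattern, the fix looks roughly like this. The sketch assumes a toolchain with inline `const` blocks (stable since Rust 1.79); the enclosing function name is hypothetical and the 1 GiB upper bound is a placeholder chosen so the snippet compiles on 32-bit and 64-bit targets alike.

```rust
const MIN_DTW_MEM_SIZE: usize = 128 * 1024 * 1024;
// Placeholder value for this sketch only.
const MAX_DTW_MEM_SIZE: usize = 1024 * 1024 * 1024;

/// Hypothetical stand-in for the test that pins the ordering invariant.
fn dtw_bounds_are_ordered() -> bool {
    // The inline const block is evaluated at compile time: if the
    // invariant were violated, the build would fail instead of the
    // assertion firing at run time.
    const { assert!(MIN_DTW_MEM_SIZE <= MAX_DTW_MEM_SIZE) };
    true
}
```

On older toolchains, a module-level `const _: () = assert!(MIN_DTW_MEM_SIZE <= MAX_DTW_MEM_SIZE);` expresses the same compile-time check.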

Copilot AI left a comment

Pull request overview

This PR adds a safe Rust wrapper surface for whisper.cpp’s DTW (Dynamic Time Warping) token-level timestamp path, alongside build-time verification that the bundled native sources include the required hardening patches.

Changes:

  • Introduces DTW configuration on ContextParams (enable flag, alignment-head preset, and bounded scratch-memory sizing) plus load-time validation for unsupported DTW configurations.
  • Exposes DTW-derived per-token timestamps via Token::t_dtw() -> Option<i64> and documents the two timestamp sources (t0/t1 vs t_dtw).
  • Strengthens whispercpp-sys patch verification to check markers across multiple native files and bumps both crates to 0.2.0.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| whispercpp/TODO.md | Removes DTW timestamps from the “intentional omissions” list now that the feature is supported. |
| whispercpp/src/state.rs | Adds t_dtw field/accessor on Token plus unit tests for projection and sentinel mapping. |
| whispercpp/src/lib.rs | Re-exports the new DTW-related API (preset enum, constants, sizing helper). |
| whispercpp/src/context.rs | Implements DTW configuration, memory sizing/clamping, and Context::new validation/rejections; adds extensive unit tests. |
| whispercpp/Cargo.toml | Bumps whispercpp version to 0.2.0 and updates whispercpp-sys dependency to 0.2. |
| whispercpp-sys/Cargo.toml | Bumps whispercpp-sys version to 0.2.0. |
| whispercpp-sys/build.rs | Updates patch-marker verification to validate markers across src/whisper.cpp and ggml/src/ggml.c. |
| README.md | Updates published crate version to 0.2 and documents DTW timestamps usage/constraints. |

Comment thread whispercpp/src/context.rs Outdated
@uqio changed the title from "feat(DTW): support DTW functionalities" to "feat(dtw): safe Rust API for whisper.cpp DTW token timestamps" on May 8, 2026

The constant `MAX_DTW_MEM_SIZE = 4 * 1024 * 1024 * 1024`
evaluates to `2^32`. On 32-bit targets `usize::MAX` is
`2^32 - 1`, so the const-eval overflows and the crate
fails to compile (`error: this arithmetic operation will
overflow`). The crate's CI matrix only covers 64-bit
targets (macos-latest, ubuntu-latest, windows-latest,
aarch64-linux), so the issue didn't surface, but
downstream users on i686 / armv7 / wasm32 etc. would hit
it on first build.

Split via `cfg(target_pointer_width)`:

* 64-bit: 4 GiB (unchanged) — ~14.7× (about one order of
  magnitude) above `required_dtw_mem_size_for(LargeV2) = 278 MiB`,
  so a `usize::MAX` slip still saturates short of
  `ggml_init`'s internal arena-math overflow.
* 32-bit / 16-bit: 1 GiB. Still ~3.7× the per-preset
  worst case, so the safety property (saturate above the
  realistic peak to dodge the `ggml_init` overflow) is
  preserved. `usize::MAX = 2^32 - 1` on 32-bit gives
  the cap 75% headroom under the type's max, so
  `MAX + 1` arithmetic in tests and `clamp_dtw_mem_size`
  doesn't overflow.

`required_dtw_mem_size_for` already clamps its output
against `MAX_DTW_MEM_SIZE`, so the per-preset minimums
(all ≤ 278 MiB) fit comfortably within the smaller
32-bit cap. No test changes needed: existing tests use
`MAX_DTW_MEM_SIZE - 1` and `MAX_DTW_MEM_SIZE + 1`, both
of which now fit in `usize` on every supported pointer
width.
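
The split described above could be written along these lines. It is a sketch under stated assumptions, not the crate's source: it covers only 64-bit and 32-bit targets (a 1 GiB literal would not fit a 16-bit `usize`, so that case is left out here), and the body of `clamp_dtw_mem_size` is a guess at the simplest clamp consistent with the description.

```rust
pub const MIN_DTW_MEM_SIZE: usize = 128 * 1024 * 1024;

// 64-bit targets keep the 4 GiB cap. This item is compiled out on
// narrower targets, so its arithmetic is never evaluated there.
#[cfg(target_pointer_width = "64")]
pub const MAX_DTW_MEM_SIZE: usize = 4 * 1024 * 1024 * 1024;

// 32-bit targets use 1 GiB, which leaves room for `MAX + 1` in `usize`.
#[cfg(target_pointer_width = "32")]
pub const MAX_DTW_MEM_SIZE: usize = 1024 * 1024 * 1024;

/// Assumed shape of the clamp: saturate any request into the bounds.
pub fn clamp_dtw_mem_size(requested: usize) -> usize {
    requested.clamp(MIN_DTW_MEM_SIZE, MAX_DTW_MEM_SIZE)
}
```

Because the cap sits well below `usize::MAX` on both widths, `MAX_DTW_MEM_SIZE + 1` is representable and boundary tests do not need target-specific variants.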

Found by Copilot's review on PR #7.

@al8n merged commit c642ec5 into main on May 8, 2026 (15 checks passed)
@al8n deleted the feat/dtw branch on May 8, 2026, 01:46