Token-stream iterators: Segment / Token #3

@uqio

Add ergonomic iterators to the safe API:

  • State::segments_iter() -> impl Iterator<Item = Segment<'_>>
  • Segment::tokens_iter() -> impl Iterator<Item = Token> (or Token<'_> if a borrow into the state is needed)

Why

Today callers iterate via index:

for i in 0..state.n_segments() {
    let seg = state.segment(i).unwrap();
    for j in 0..seg.n_tokens() {
        let tok = seg.token(j).unwrap();
        ...
    }
}

An iterator API would let callers write for seg in state.segments_iter() directly, and the result would compose with adapter chains such as .filter().map().collect().
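To make the intended call shape concrete, here is a minimal sketch of the simplest possible implementation: a range over the indices, mapped through the existing index accessors. The State and Segment types below are hypothetical Vec-backed stand-ins, not the crate's real types. The u32 token, the Option-returning accessors, and the helper nonempty_token_counts are all assumptions made so the example runs without whisper.cpp.

```rust
// Hypothetical stand-ins for the crate's types, backed by plain Vecs.
// The real State wraps a whisper_state pointer; this mock only mirrors
// the accessor shape shown in the index-based loop.
pub struct State {
    segments: Vec<Vec<u32>>, // token ids per segment (illustrative data)
}

#[derive(Clone, Copy)]
pub struct Segment<'a> {
    tokens: &'a [u32],
}

impl State {
    pub fn n_segments(&self) -> usize {
        self.segments.len()
    }

    // Assumed to return Option, since the issue's loop calls .unwrap().
    pub fn segment(&self, i: usize) -> Option<Segment<'_>> {
        self.segments.get(i).map(|t| Segment { tokens: t.as_slice() })
    }

    // Proposed method: drive the existing accessor from an index range.
    pub fn segments_iter(&self) -> impl Iterator<Item = Segment<'_>> {
        (0..self.n_segments()).filter_map(move |i| self.segment(i))
    }
}

impl Segment<'_> {
    pub fn n_tokens(&self) -> usize {
        self.tokens.len()
    }

    // u32 stands in for the crate's Token type here.
    pub fn token(&self, j: usize) -> Option<u32> {
        self.tokens.get(j).copied()
    }

    // Proposed method: same pattern, one level down.
    pub fn tokens_iter(&self) -> impl Iterator<Item = u32> + '_ {
        (0..self.n_tokens()).filter_map(move |j| self.token(j))
    }
}

// Hypothetical caller: token counts for every non-empty segment,
// written as an adapter chain instead of nested index loops.
pub fn nonempty_token_counts(state: &State) -> Vec<usize> {
    state
        .segments_iter()
        .filter(|seg| seg.n_tokens() > 0)
        .map(|seg| seg.tokens_iter().count())
        .collect()
}
```

The closure-based form keeps the iterator type anonymous, which is enough for for loops and adapter chains. A named iterator type would be needed if callers must store the iterator in a struct field.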

Why this isn't trivial

Segment and Token borrow from State via a raw pointer (NonNull<sys::whisper_state>). A correct iterator needs to project through that borrow without introducing unsound aliasing, which is non-obvious in safe Rust. Specifically:

  • State::segments_iter(&self) returns an iterator that hands out Segment<'_> values borrowing from &self. Each Segment carries a PhantomData<&'a ()>. This needs to be sound when multiple Segments are alive simultaneously (the whisper_state is shared, but no segment mutates it).
  • Segment::tokens_iter(&self) is similar: multiple Token snapshots from one Segment can be alive at the same time.

Look at the existing State::segment(i) / Segment::token(tok_idx) patterns in whispercpp/src/state.rs for the borrow shape. The iterator just needs to drive an index counter and call those methods, but the lifetime annotations on the iterator type need care.
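One way the lifetime annotations could be laid out is sketched below, as a named iterator type that holds &'a State and yields Segment<'a>. This is not the crate's code: RawState is a hypothetical stand-in for sys::whisper_state, State::from_tokens is an invented constructor for the mock, and the Segment field layout is assumed apart from the PhantomData<&'a ()> marker mentioned above.

```rust
use std::marker::PhantomData;
use std::ptr::NonNull;

// Hypothetical stand-in for sys::whisper_state (an opaque C struct in reality).
struct RawState {
    segments: Vec<Vec<i32>>,
}

pub struct State {
    ptr: NonNull<RawState>,
}

impl State {
    // Invented constructor for this sketch: the mock owns its data via a Box.
    pub fn from_tokens(segments: Vec<Vec<i32>>) -> Self {
        let boxed = Box::new(RawState { segments });
        State { ptr: NonNull::from(Box::leak(boxed)) }
    }

    pub fn n_segments(&self) -> usize {
        // SAFETY: ptr came from Box::leak above and is freed only in Drop.
        unsafe { self.ptr.as_ref() }.segments.len()
    }

    pub fn segment(&self, i: usize) -> Option<Segment<'_>> {
        (i < self.n_segments()).then(|| Segment {
            state: self.ptr,
            index: i,
            _borrow: PhantomData,
        })
    }

    pub fn segments_iter(&self) -> SegmentsIter<'_> {
        SegmentsIter { state: self, next: 0, len: self.n_segments() }
    }
}

impl Drop for State {
    fn drop(&mut self) {
        // SAFETY: reclaims the Box leaked in from_tokens. Every Segment<'a>
        // is tied to a &'a State, so none can outlive this point.
        unsafe { drop(Box::from_raw(self.ptr.as_ptr())) }
    }
}

#[derive(Clone, Copy)]
pub struct Segment<'a> {
    state: NonNull<RawState>,
    index: usize,
    _borrow: PhantomData<&'a ()>, // ties the raw pointer to the &State borrow
}

impl Segment<'_> {
    pub fn n_tokens(&self) -> usize {
        // SAFETY: the marker lifetime guarantees the State is still alive.
        unsafe { self.state.as_ref() }.segments[self.index].len()
    }
}

// The iterator keeps the shared borrow and an index counter, nothing else.
pub struct SegmentsIter<'a> {
    state: &'a State,
    next: usize,
    len: usize,
}

impl<'a> Iterator for SegmentsIter<'a> {
    // Items borrow from the State ('a), not from the iterator itself,
    // so several Segments can be held while iteration continues.
    type Item = Segment<'a>;

    fn next(&mut self) -> Option<Segment<'a>> {
        if self.next >= self.len {
            return None;
        }
        let seg = self.state.segment(self.next)?;
        self.next += 1;
        Some(seg)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.len - self.next;
        (remaining, Some(remaining))
    }
}

impl ExactSizeIterator for SegmentsIter<'_> {}
```

The key choice in this sketch is that Item is Segment<'a> rather than a type borrowing from &mut self in next(). That is what lets collect() work and lets earlier segments stay alive. Whether the real raw-pointer reads are sound under this shape is exactly what the Miri job would need to confirm.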

Tests

  • Empty state iterates zero times.
  • Iterator length matches n_segments() / n_tokens().
  • Multiple iterators alive concurrently don't fight (the underlying whisper_state isn't mutated by reads).
  • Miri (the existing CI job) should pass over the new iterator types.
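The first three checks could be sketched as plain functions like the ones below. They run against hypothetical Vec-backed stand-ins for State and Segment (with u32 standing in for Token), so the function names and the mock layout are assumptions. In the crate they would be #[test] functions over the real types, which is where running them under Miri becomes meaningful.

```rust
// Hypothetical Vec-backed stand-ins; the real types wrap whisper_state.
pub struct State {
    segments: Vec<Vec<u32>>,
}

pub struct Segment<'a> {
    tokens: &'a [u32],
}

impl State {
    pub fn n_segments(&self) -> usize {
        self.segments.len()
    }

    pub fn segments_iter(&self) -> impl Iterator<Item = Segment<'_>> {
        self.segments.iter().map(|t| Segment { tokens: t.as_slice() })
    }
}

impl Segment<'_> {
    pub fn n_tokens(&self) -> usize {
        self.tokens.len()
    }

    pub fn tokens_iter(&self) -> impl Iterator<Item = u32> + '_ {
        self.tokens.iter().copied()
    }
}

// Empty state iterates zero times.
pub fn check_empty_state() {
    let state = State { segments: Vec::new() };
    assert_eq!(state.segments_iter().count(), 0);
}

// Iterator length matches n_segments() / n_tokens().
pub fn check_lengths_match(state: &State) {
    assert_eq!(state.segments_iter().count(), state.n_segments());
    for seg in state.segments_iter() {
        assert_eq!(seg.tokens_iter().count(), seg.n_tokens());
    }
}

// Two iterators advanced in lockstep must see the same data.
pub fn check_concurrent_iterators(state: &State) {
    let mut a = state.segments_iter();
    let mut b = state.segments_iter();
    loop {
        match (a.next(), b.next()) {
            (Some(x), Some(y)) => assert!(x.tokens_iter().eq(y.tokens_iter())),
            (None, None) => break,
            _ => panic!("iterators disagree on length"),
        }
    }
}
```

Interleaving the two iterators, rather than draining one and then the other, is a deliberate choice: it keeps both borrows and their yielded items alive at the same time, which is the situation the soundness question is about.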

From whispercpp/TODO.md § 3 "Larger work".

Labels: enhancement, help wanted
