Skip to content

0.1.0#1

Open
uqio wants to merge 222 commits intomainfrom
0.1.0
Open

0.1.0#1
uqio wants to merge 222 commits intomainfrom
0.1.0

Conversation

@uqio
Copy link
Copy Markdown
Collaborator

@uqio uqio commented May 2, 2026

No description provided.

uqio and others added 30 commits April 28, 2026 19:56
Captures the v1 design: a Sans-I/O state-machine core (Transcriber)
that owns cut and dispatch logic with no ML deps, paired with a
default-on runner feature that wires whisper-rs and ort-based
wav2vec2 forced alignment. Output is per-merged-chunk Transcript
with mediatime ranges, language, full text, word-level timestamps,
and provenance back to source VAD segments.

References WhisperX's merge_chunks for the cut stage, adapts the
batched-inference pattern to whisper.cpp's N-state concurrency
model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P0 — emission ordering and back-pressure correctness:
- Dispatch state machine now tracks next_emit_chunk_id and emits
  Event::Transcript / Event::Error in strict chunk-id order via
  flush_in_order_events; out-of-order worker completions stage on
  ChunkPhase::Ready until predecessors emit.
- cut_pending now holds chunk descriptors only (no extracted audio).
  SampleBuffer becomes the sole audio back-pressure choke point;
  buffer_cap_samples bounds total memory under sustained inference
  saturation.

P1 — semantic and interface defects:
- SampleBuffer accepts forward gaps within gap_tolerance_samples
  (zero-fill); rejects only PTS regressions and oversized gaps.
- LanguagePolicy::{Auto, Lock, AutoLockAfter(n)} added; default
  AutoLockAfter(1) to match WhisperX language-locking behaviour.
- Lang::ANY sentinel removed; replaced with AlignerKey::{Lang, Any}
  enum lifted into the type system.
- Backend invariant in §3.4: core renamed Whisper* to Asr*
  (Command::RunAsr, AsrParams, AsrResult, AsrTokenHint,
  inject_asr_result). Whisper-specific fields confined to runner.
- AlignmentPool concurrency clarified: v1 single worker; future
  parallelism conditional on ort EP thread-safety analysis.
- WhisperState/WhisperContext sharing flagged as §13.1 spike.

P2 — algorithmic correctness:
- Cut state machine hard-splits any single VAD segment longer than
  chunk_size; chunk-size bound is now true chunk_size, not
  chunk_size + max(seg_duration).
- Aligner::align is silence-aware (zero-mask outside sub_segments)
  and normalisation-aware (per-language TextNormalizer trait;
  surface-form recovery from normalised tokens to original text).
- condition_on_previous_text rationale rewritten: it controls
  intra-chunk decoder prompt continuation, not cross-chunk
  continuity.
- Transcript.text contradiction resolved: text is verbatim Whisper
  output (punctuated/cased); Word.text is per-word surface form.

P3 — documentation cleanup:
- alignment feature moved to opt-in (default = ["std", "runner"]).
- bundled-tiny dropped from v1 features; moved to future work as a
  build-time fetch with checksum verification.
- All public types use private fields with getters per the
  findit-studio no-public-fields convention; Transcript and Word
  refactored accordingly.
- Concrete error variants for AsrFailureKind and
  AlignmentFailureKind; thiserror, smallvec listed as permanent
  deps.
- Concurrency diagram redrawn with safer ASCII.
- Memory math corrected to include decoder workspace, alignment
  logits, and per-worker model duplication scenarios.
- ort version pinned (= "2.0.0-rc.12") matching siblings.
- 16k → 48k integer-ratio round-trip constraint documented.
- Test matrix expanded with edge cases (single-segment-overrun,
  zero-gap segments, empty whisper output, worker hang, language
  lock).

P4 — open architectural questions:
- §1.6 added: speaker-agnostic semantics; downstream diarization
  joins on Word.range and vad_segments. Confirmation requested
  on whether the diarization plan needs additional anchors.
- §1.7 added: crate vs ThreadService deployment, marked as a
  decision needed before implementation.

P5 — additions to v1:
- Per-call AsrParamsOverride for language hints / per-packet
  param tweaks.
- Worker hang timeout protection on ASR and alignment workers;
  drain_timeout cap on drain().
- Device enum (Cpu/Cuda/Metal/Vulkan) replaces use_gpu: bool.

Added §13 Open Risks for items requiring spike or external input
before implementation begins. Updated Appendix B with resolutions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§1.6: pin the diarization join contract against dia v0.1.0.
  dia::DiarizedSpan.range is mediatime::TimeRange at 1/16000
  timebase, identical to whispery's Word.range. Indexer joins
  by plain interval overlap; whispery does not depend on dia
  and exposes no speaker-aware fields. Inline the relevant
  DiarizedSpan struct shape and a sketch of the indexer-side
  join.

§1.7: lock to crate-only deployment for v1, matching all sibling
  crates (silero, soundevents, textclap, dia). A wrapper service
  binary, if ever needed, is additive and does not change the
  public surface.

§13.4: convert from "open questions, get confirmation" to a
  record of resolutions; flag the dependency on dia v0.1.0's
  current shape so a future dia API change forces a revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two big shifts in this revision: aligning §5.6/§6 with whisper-rs
0.13.x's actual API, and switching whispery to a two-timebase
output model (internal 1/16000 indexing, external caller-chosen
timebase for all emitted ranges).

Audit-level fixes (P0/P1):

B1: Drop public Device enum. whisper-rs selects backends via
  Cargo features; only runtime knobs are use_gpu and gpu_device:
  i32. WhisperPoolConfig now exposes use_gpu + gpu_device +
  flash_attn directly.
B2: WhisperContext::new_with_params(path, WhisperContextParameters)
  is the actual API; documented at the seam.
B3: condition_on_previous_text renamed to no_context, matching
  whisper-rs polarity. Default true (no cross-chunk context). Doc
  rewritten to explain intra-chunk vs cross-chunk semantics.
B4: SamplingStrategy moved to AsrParams (Greedy{best_of} /
  BeamSearch{beam_size, patience}); FullParams is constructed
  fresh per chunk, so beam parameters are runtime knobs after all.
B5: Drop suppress_tokens (whisper-rs has no token-list setter).
  Only suppress_blank and suppress_non_speech_tokens remain.
B6: Drop temperature_schedule. The runner implements its own
  ladder via repeated state.full() calls; AsrParams carries
  initial_temperature, temperature_increment, max_attempts.
B7: Resolve §1.5 vs §5.6 contradiction by deleting AsrTokenHint
  and Command::RunAlignment.token_hints. v1 does not enable DTW;
  wav2vec2 forced alignment doesn't need per-token seeds.
B8: Drop AsrFailureKind::EmptyOutput. Empty whisper output is a
  normal Transcript with empty text and elevated no_speech_prob,
  not a failure.
B9: Pin invariants in §5.5: chunk_id increments on Event::Error
  too; flush_in_order_events runs before trim on every inject path.
B10: Rewrite §6.4 to avoid the inline-send saturation deadlock.
  Dispatch loop uses non-blocking try_send + always-drain pattern,
  with crossbeam_channel::select! for backpressure parking.
B11: Pin Backpressure contract: inputs are accepted (state
  advanced), caller drains before pushing again, never retries
  the same call. would_accept() predicate added for callers
  that need a non-mutating check.

M1: Hard-split rule pinned to ceil(len / chunk_size) equal-length
  parts. MergedChunk.subs becomes Vec<SubRange> with SubOrigin
  tagging Vad{vad_seq} vs HardSplit{vad_seq, part, total_parts}.
  Provenance is reconstructable.
M2: GapExceedsTolerance recovery path: Transcriber::restart_at(pts)
  flushes cut, drains in-flight, re-anchors the buffer.
M3: GPU backends default worker_count = 1 (whisper.cpp serialises
  on a single GPU). CPU defaults to num_cpus / 2.
M4: §6.3.2 step 7 indexing fix. Per-word frame ranges go into
  Vec<Option<...>> of length n (= original_words.len()), not a
  packed list. Words whose audio fell in the silence-mask region
  remain None and are skipped in step 8.
M5: §4.2/§4.3 Word.text doc-comments fixed: original surface form
  with punctuation/casing, recovered via the normalisation map in
  §6.3.2 step 9. Drop the contradictory "lowercased and
  punctuation-stripped" wording.
M7: AlignerKey::Any is registry-miss fallback only. Aligner
  failures emit AlignmentFailed; no silent fall-through.
M8: §6.3.3 Mutex<Aligner> justification rewritten. Real reason is
  &mut self requirement of ort::Session::run with a worker on a
  separate thread, not "forward-looking multi-worker."

Two-timebase output model (response to user guidance):

§4.1 rewritten. whispery indexes internally at 1/16000 (matches
silero, soundevents, dia internally) but emits all public
TimeRanges in the caller's chosen output timebase. The first
push_samples Timestamp's timebase becomes the output timebase
for the Transcriber's lifetime; subsequent pushes must use the
same one. SampleBuffer tracks both base_pts_out (output tb) and
base_sample_idx (16 kHz internal); samples_to_output_range
converts via mediatime::Timebase::rescale_pts.

VadSegment refactored from { range: TimeRange } to
{ start_sample, end_sample } (silero-native 16 kHz indices). The
caller passes silero output verbatim; whispery does the analysis
to output-timebase conversion internally.

§1.6 dia integration adjusted: dia stays at 1/16000; the indexer
joins across timebases via mediatime's 128-bit cross-multiply.
Lossless for integer-ratio output rates (1/48000, 1/8000), exact
to sub-sample for non-integer ratios.

P3 documentation cleanup:

- §2.1 added: honest scope of inherited conventions (vs the
  earlier overclaim that Sans-I/O / mediatime / smol_str /
  smallvec were sibling conventions). VadSegment is documented
  as not isomorphic to silero's SpeechSegment.
- ort version pinned with `=2.0.0-rc.12` (Cargo's pre-release
  semantics need the equality sign).
- Lang derives Hash, Eq, PartialEq, Clone (required for HashMap
  key in AlignmentSet).
- §4.1 explicit on mediatime API: const nz() helper for
  NonZeroU32; no Add/Sub operators on Timestamp; TimeRange::new
  panics on negative span; rescale_to saturates.
- §4.5 documents the two distinct error channels: RunnerError
  (synchronous structural) vs WorkFailure (async per-chunk).
- §6.4 worker hang protection: whisper-rs's
  set_abort_callback_safe used for ASR cooperative timeout.
- §13.1 reduced from 1-day spike to 2-hour verification:
  whisper-rs is officially thread-safe (confirmed via context7
  query); WhisperState is owned in 0.13.x with no lifetime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rface holes

Verified the 3 NEW BLOCKERs the round-2 audit found and applied
the proposed fixes; verified mediatime source confirms M-α
(IntervalTree<TimeRange, ...> is fiction); applied all 9 net-new
MAJORs and the small-item sweep.

NB1 — §5.4 SampleBuffer no longer recomputes base_pts on trim.
Layout becomes (output_tb, base_pts_out_anchor: i64 immutable,
absolute_sample_offset: u64 monotonic, buffer_drop_offset: u64
monotonic, samples). samples_to_output_range always rescales
from the immutable anchor against the absolute sample index, so
truncation error is bounded at ±1 PTS regardless of trim history.
NTSC and MPEG-TS-style non-integer-ratio rates no longer
accumulate drift.

NB2 — §6.4.1 dispatch loop now correctly returns
RunnerError::Backpressure when block_on_full_queue=false and a
try_send returned Full. drive_one_step's signature changed from
Result<(), RunnerError> to Result<bool, RunnerError> so the
saturation wait loop has a real "made progress" signal (M-η).
The block_on_full_queue=true path uses crossbeam_channel::select!
with a 10 ms safety default to wait for any worker result before
retrying.

NB3 — §5.3 hard-split formula pinned to the per-index form:
  start_i = seg.start + (i × len) / n
  end_i   = start_{i+1} for i<n-1; end_{n-1} = seg.end
This guarantees max part length = ceil(len/n) ≤ chunk_size_samples
and parts differ by at most 1 sample. The naive
floor(len/n)+remainder reading (which produces 9,9,11 for len=29
chunk_size=10) is explicitly disallowed.

M-α — Verified mediatime/src/lib.rs at line 436 (TimeRange has
Eq/Hash but no Ord/PartialOrd) and line 411 (only Timestamp has
Ord, via cmp_semantic at lib.rs:303-328). The IntervalTree path
in §1.6 was fiction; replaced with two real paths (canonicalise
to whispery's output tb, or canonicalise to dia's 1/16000) plus
a "roll your own newtype using cmp_semantic" note. The §1.6
indexer pseudo-code rewritten using BTreeMap<i64, _> keyed on
canonicalised end PTS rather than the nonexistent IntervalTree.

M-β — Added restart_at, would_accept, unpoll_command, and
output_timebase to Transcriber's §5.1 public surface. All three
were referenced by the rest of the doc as if they existed but
were missing from the impl block.

M-γ — push_vad_segment doc-comment now states it returns
TranscriberError::OutputTimebaseUnset before the first
push_samples. Added the OutputTimebaseUnset variant to
TranscriberError, plus InconsistentTimebase for cross-timebase
push attempts.

M-δ — §5.4 PTS regression check moved to output-PTS space. delta
is computed as starts_at.pts() - expected_pts_out (in output_tb);
only the zero-fill width converts back to 16k samples. This
eliminates the spurious-PtsRegression failure mode on
non-integer-ratio output timebases where contiguous caller pushes
appear to regress because of round-trip truncation.

M-ε — §5.4.1 restart_at explicitly documents that next_chunk_id
is NOT reset; chunk_id continuity is preserved so in-flight
chunks from before the gap still emit normally. The first
post-restart chunk has chunk_id one larger than the last
pre-restart chunk.

M-ζ — Trim's low-water computation (§5.5) now reads from
cut_pending only, not in_flight. In-flight chunks have their
audio in their own Arc<[f32]> and are decoupled from the live
buffer, so trim doesn't need to consider them. After restart_at,
cut_pending is empty (Cut was flushed), so trim's low-water is
the buffer's high-water — the entire empty buffer is eligible
for drop. This eliminates stale-anchor PTS values poisoning trim
across a restart.

M-η — drive_one_step now returns Result<bool, RunnerError>; the
caller's saturation wait loop has a real progress signal to drive
the spin condition.

M-θ — §6.3.2 word_idx_per_token changed to Vec<Option<usize>>.
wav2vec2 word-delimiter (|), <unk>, and special tokens get None
and are skipped during the Viterbi walk in step 7. Step 7 also
explicitly notes it skips blank-emitting frames (CTC blank token).

M-ι — §6.4.3 worker hang protection got recycle hysteresis. CPU
backends recycle WhisperState on every timeout (cheap); GPU
backends require timeout_streak_threshold consecutive timeouts
(default 3) before recycling, because GPU create_state allocates
KV-cache buffers and stalls the entire pool when worker_count=1.
WhisperPoolConfig.timeout_streak_threshold is now in §8.

M-κ — AsrParams docs explicitly note the layered-ladder concern.
Whisper.cpp has its own internal temperature ladder
(temperature_inc, max_decoding_failures); the runner must set
temperature_inc=0 and max_decoding_failures=1 inside FullParams
so the runner's outer ladder is the sole authority. The
spec acknowledges that if whisper-rs 0.13.x doesn't expose those
setters, the runner pins initial_temperature=0 and the AsrParams
temperature fields are advisory; resolved during the §13.1
verification.

Smaller items:
- §8 defaults table: max_queued_chunks default = worker_count + 4;
  timeout_streak_threshold default 1 (CPU) / 3 (GPU).
- §5.3 step 3 pins current_end ≥ current_start invariant
  explicitly (set current_end = sub.start when current_start is
  initialised), eliminating the underflow guard reliance.
- §6.4.1 drain() now reuses the saturation select-loop with a
  10 ms default and exits on idle.
- §5.6 full_params_from now shows the set_abort_callback_safe
  wire-in for worker hang protection.
- §6.1 builder section now shows the canonical
  WhisperContext::new_with_params construction so reviewers can
  see where the actual whisper-rs API lives.
- §4.3 Word.text doc-comment notes that silence-masked words are
  absent from Transcript.words (so words[].text joined is not
  guaranteed to equal Transcript.text modulo whitespace).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng variants

NB-α — §5.4.1 restart_at now resets BOTH absolute_sample_offset
and buffer_drop_offset to 0. The v3 draft carried
absolute_sample_offset forward as the new buffer_drop_offset,
which made expected_pts_out = anchor + rescale(carried_offset)
on the very push that was supposed to recover the gap. The
first post-restart push tripped PtsRegression every time. With
both offsets reset, the first push computes
expected_pts_out = starts_at.pts() + rescale(0) = starts_at.pts()
exactly, matching the caller. Pre-restart in-flight chunks
already hold their audio in their own Arc<[f32]>s
(extracted at chunk-creation time per §5.4.1 step 5), so
resetting indices is safe.

NB-β — §6.4.1 saturation wait no longer uses
crossbeam_channel::select! with empty arm bodies. select!'s recv
arms perform the actual receive and bind the message to the arm
body; an empty `=> {}` body silently drops the result and the
next drive_one_step's Phase 1 try_recv would find the channel
empty — silent data loss exactly on the saturation path.
Replaced with crossbeam_channel::Select::ready_timeout, which
signals readiness without consuming. drive_one_step's Phase 1
try_recv then drains the now-ready message normally.

NB-γ — RunnerError gains a Backpressure { buffered, cap }
variant that §6.4.1 was constructing without it being declared.
The reviewer's claim that the entire RunnerError enum was
missing was incorrect (it was at line 610 throughout) but the
specific variant referenced by drive_one_step's Backpressure
return path was indeed undeclared.

W3 — §5.1 surface gains pub fn next_expected_starts_at(&self)
-> Option<mediatime::Timestamp>. Strictly-contiguous callers
should use this authoritative value instead of computing their
own per-packet running sum, because per-packet
mediatime::Timebase::rescale_pts truncates and a sequence of
per-packet rescales does not always equal one rescale of the
cumulative sample count (the regression check is anchored on
the cumulative form). Without the accessor, callers on
non-integer-ratio output timebases (1/30001, MPEG-TS, NTSC)
would eventually drift by ±1 PTS and trip a spurious
PtsRegression after enough packets.

M-κ — full_params_from now sets set_temperature(attempt_temp),
set_temperature_inc(0.0), set_max_decoding_failures(1) so the
runner's outer ladder is the sole authority. Each state.full()
call is one decoding attempt at exactly the runner-supplied
temperature; whisper.cpp's internal ladder is disabled.
run_with_temperature_ladder updated to pass attempt_temperature
into full_params_from on each retry. The AsrParams
log_prob_threshold doc-comment rewritten to reflect that the
suppression is wired (not a future concern), with a
compatibility note that the §13.1 verification spike confirms
the precise setter names; if a future whisper-rs renames or
removes them, the implementation falls back to a single attempt
per chunk and downgrades the temperature fields to advisory.

§8 defaults table — removed three stale rows (beam_size,
temperature_schedule, condition_on_previous_text) and added
nine accurate ones for the v3 AsrParams shape: strategy,
initial_temperature, temperature_increment, max_attempts,
no_speech_threshold, log_prob_threshold,
compression_ratio_threshold, no_context, suppress_blank,
suppress_non_speech_tokens, n_threads, initial_prompt. The
no_context row explicitly documents the polarity inversion vs
WhisperX's condition_on_previous_text.

§12 stale AsrParamsOverride field list now reads
{ language_hint, strategy, initial_temperature, initial_prompt }
matching §6.1.

§13.4 cross-timebase wording fixed: now says "mediatime exposes
128-bit cross-multiply via Timestamp::cmp_semantic only —
TimeRange has no Ord — so the indexer canonicalises ranges to
one timebase up front" rather than the v3 draft's claim that
mediatime supports the cross-timebase join "directly via 128-bit
cross-multiply" (which contradicted the §1.6 rewrite).

§4.3 Word.range doc-comment now describes partial-alignment
semantics: when silence-aware alignment drops words, surviving
words have a `range` covering only frames the Viterbi path
attributed to them (not adjacent words' frames or masked-region
frames), and the `text` is still the full original surface form
even if alignment only saw a fraction of the audio.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… c_int

Latent bug — §5.4.1 restart_at now drains cut_pending in step 1
before clearing the buffer in step 3. cut_pending lives on the
dispatch state machine (not the cut state machine that
Cut::flush() drains), so without an explicit drain its entries
would survive into the post-restart frame carrying old-anchor
SampleRange indices. The next trim() would then compute a stale
low-water and the next promotion attempt would extract against
indices that no longer exist in the cleared buffer — hard panic.
The drain promotes each queued entry to in_flight (extract
samples in old-frame indexing, cache TimeRange against the old
anchor, build ChunkRecord). The drain-on-restart path is allowed
to temporarily exceed max_in_flight; restart_at is a one-time
event and dropping the chunks would lose transcription work the
caller had already paid for. The bug existed in v3 too (cut_pending
was already there) but v4's cleaner re-anchor logic surfaced it.

NIT 1 — §5.6 rewritten the "fallback" wording to compile-time
guidance. If a future whisper-rs is found (during §13.1
verification) to lack set_temperature or set_temperature_inc,
the runner module must be built with an explicit alternative
implementation that omits those calls — this is a compile-time
situation requiring a code-path alternative, not a runtime
fallback. In that alternative path the AsrParams temperature
fields become advisory.

set_max_decoding_failures wired as best-effort. Context7
verification confirms set_temperature and set_temperature_inc
in whisper-rs 0.13.x but does not list set_max_decoding_failures.
Since temperature_inc=0 alone fully disables whisper.cpp's
internal ladder (the loop `for t=initial; t<=1.0+1e-6; t+=inc`
iterates exactly once with inc=0), max_decoding_failures is a
secondary safeguard. The doc-comment on log_prob_threshold and
the inline comment in full_params_from now describe this
explicitly; if §13.1 finds the setter absent, omitting that
line is a no-op for behaviour.

§5.6 AsrParams.n_threads type changed from i32 to
std::os::raw::c_int to match whisper-rs's set_n_threads
parameter exactly. Documented inline.

§6.4.1 added a one-line comment about
crossbeam_channel::Select::ready_timeout's documented spurious
wake behaviour. The pattern is harmless under spurious return
(next try_recv returns Empty, outer loop spins) but worth
forestalling future confusion.

§12 future-work entry on AsrParamsOverride now explains the
intentional minimality: per-packet caller can shift the ladder's
starting point (initial_temperature) but not reshape it
(temperature_increment, max_attempts); reshape requires
re-building the runner with asr_params(...). The constraint is
documented as deliberate so future additions don't accidentally
break the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IB1 — §5.1 Transcriber surface gains per-method `Errors:` blocks
listing the exact TranscriberError variants each method can return.
Removes the implementer-guesses-and-tests-lock-it-in failure mode.
push_samples documents PtsRegression / GapExceedsTolerance /
Backpressure / InconsistentTimebase / AfterEof. push_vad_segment
documents OutputTimebaseUnset / PtsRegression{kind:VadSegment} /
AfterEof and pins the strict-monotonic invariant on segment
ordering. signal_eof, restart_at, inject_*_result, inject_failure
get explicit error contracts.

IB2 — §5.5 invariant 4 carves an explicit exception for
restart_at's cut_pending drain: the invariant prohibits exceeding
max_in_flight in steady state, but restart_at is allowed to push
above the cap by however many entries were queued in cut_pending
at restart time. The exceedance is bounded by max_queued_chunks
and decays as workers drain normally. Trim's promotion guard is
documented as suspended for the duration of the drain. Required
because the alternative (dropping queued chunks) loses paid-for
transcription work.

§3.3 — re-exports mediatime::{Timebase, Timestamp, TimeRange} so
consumers don't need a separate mediatime dep just to name them.
InvalidLang added to the public exports.

§5.1 — unpoll_command's visibility narrowed to pub(crate) (was
pub with prose "runner-only"). Same crate as the runner module,
so pub(crate) is the right gate. Out-of-tree consumers driving
the state machine themselves don't need this affordance.

§6.1 — process_packet documents the vad_segments contract
explicitly: strictly monotonically increasing start_sample, no
overlap, no duplicates; violations surface as PtsRegression.
Empty packet (samples.is_empty()) handling documented as a no-op
when delta_pts_out is 0.

§5.3 / VadSegment::new — panics on `end_sample <= start_sample`
(strict inequality). Zero-duration VAD segments would produce
zero-length MergedChunks downstream which break alignment;
silero never produces them anyway.

§6.2 — WhisperPoolConfig.dispatch_idle_poll exposes the previously-
magic 10ms saturation-wait timeout. §6.4.1 saturation wait now
references whisper_pool_config.dispatch_idle_poll instead of the
hard-coded value.

§6.4.3 — added a clarification on streak-vs-chunk_id correspondence:
each timeout is one chunk_id; recycle does not retroactively retry
timed-out chunks (they emit as Event::Error and decay out of
in_flight); streak counting is per-worker, not per-chunk_id.

§4.5 — InvalidLang struct added (returned by Lang::from_iso639_1).

§10 — added §10.4 "v3-v5 regression tests" (8 named tests covering
the specific defects caught during review rounds: PTS drift on
non-integer-ratio TBs, saturation-wait result preservation,
restart_at cut_pending drain, next_expected_starts_at correctness,
layered-ladder suppression, unpoll_command round-trip, empty-packet
handling, zero-duration VadSegment panic, output-PTS-space
regression check). The CI matrix subsection was renumbered to §10.5.

§13.1 — verification bullet 4 rewritten. The original "confirm
threshold setter names" wording was already stale by v4; the
revised bullet pins the actual setters whose presence the v6
implementation relies on (set_temperature, set_temperature_inc,
and the best-effort set_max_decoding_failures), with the
compile-time fallback path described in §5.6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lang is now a typed enum over whisper.cpp's supported language
set, marked #[non_exhaustive] for forward compat without
semver-major bumps, with a fallback Other(SmolStr) variant for
unknown ISO codes that whisper auto-detect surfaces.

§4.4 — replaced the SmolStr newtype declaration with the enum.
99 named variants plus Other; full variant list deferred to
Appendix C to keep §4.4 readable. Documented the canonicalisation
invariant: from_iso639_1 maps known codes to named variants and
NEVER produces Other for an enum-known code, so structural
PartialEq/Hash stay correct (Lang::En != Lang::Other("en") is
fine because no API path can construct Lang::Other("en")).

Discussed why Other(SmolStr) is preferred over
Result<Lang, InvalidLang>: an indexing engine processing
thousands of hours shouldn't fail a chunk on an unfamiliar
language code; Other lets the transcript flow through and the
indexer logs the unusual SmolStr.

Discussed why #[non_exhaustive] in addition to Other: when
whisper.cpp adds a new language, inserting the named variant
later is not a breaking change; external matches already need
a `_` arm so their compiled code keeps working. Variant just
stops appearing under Lang::Other and starts appearing under
its named form.

§4.5 — InvalidLang type removed. from_iso639_1 is now a total
function returning Lang directly, no Result. Documented inline
why InvalidLang is gone for future readers.

§3.3 — InvalidLang re-export dropped from the public surface.

Appendix B — item 1 (Lang typed-enum vs SmolStr deferred decision)
moved to "Resolved" with a pointer to §4.4 / Appendix C.

Appendix C added — full Lang variant list (99 named variants
plus Other), with naming notes for the few non-2-letter codes
(Haw for Hawaiian, Yue for Cantonese, Jw for whisper.cpp's
non-modern Javanese code) and a one-line description of how
new whisper.cpp languages are added (insert alphabetically,
add as_str / from_iso639_1 match arms, no other code paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time SemVer caveat

Three small additions following the round-6 final audit verdict
of "PR-ready":

1. §5.1 VadSegment::new doc-comment now notes that panic in
   const fn is stable on Rust ≥ 1.57, well below the crate's
   ≥ 1.85 MSRV inherited from siblings. Pre-emptive: removes
   the implementer's "wait, is this actually allowed?" check.

2. §10.4 regression tests gain an explicit Lang canonicalisation
   invariant test. For every named variant V in Appendix C,
   assert Lang::from_iso639_1(V.as_str()) == V AND the result
   does not match Lang::Other(_). Plus the symmetric
   pure-Other round-trip ("zzz" → Lang::Other("zzz")). Pins
   the §4.4 contract that no API path produces Lang::Other(s)
   for an enum-known code s.

3. §3.3 mediatime re-export block now documents the SemVer
   coupling: a mediatime major bump is automatically a whispery
   major because the public API names mediatime types directly.
   The mediatime dependency stays pinned to one major in
   Cargo.toml; coordinated bumps when needed.

After 6 rounds of multi-agent adversarial review, the design has
converged. Architecture sound, contracts honored, dispatch loop
deadlock-free, buffer arithmetic drift-free, alignment correct,
dia integration verified, every public method has a documented
error contract, every load-bearing API claim verified against
real codebases or live docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-emptively close the doc-comment MINORs the round-6 audit
flagged as appropriate for the first implementation PR. Fixing
now in the spec costs ~15 lines and saves the implementer the
same set of "spec is silent about X" questions later.

§5.1 — Transcriber documented as Send + !Sync. Multi-threaded
drivers must wrap in Mutex; whispery does not provide internal
synchronisation.

§4.5 — WorkFailure now derives Clone + Debug explicitly. Clone
is needed because the dispatch state machine moves the failure
into Event::Error while runner-side code may also want to log
it; the contained String is the only non-trivial allocation.

§6.2 — AsrWorkItem.asr_timeout provenance documented inline:
sourced from the runner builder's worker_timeouts(asr, _),
stamped per-job at dispatch, fed into the abort_flag watchdog.

§6.3 — Aligner.blank_token_id documented: read from the wav2vec2
tokenizer's special-tokens map at Aligner::from_paths time
(standard <pad> / [PAD] convention; future builder method for
non-standard models).

§5.1 — is_idle() doc-comment expanded: enumerates the queues
that must all be empty, and explicitly notes that pre-restart
in_flight chunks keep is_idle() false until they emit
(restart_at does not synthetically clear them).

§6.4.1 — Disconnected-channel behaviour of Select::ready_timeout
documented: closed worker channels wake the wait, drive_one_step's
next try_recv returns Disconnected, dispatch surfaces
WhisperPoolShutdown. No silent stall on worker panic.

§10.4 — extended the unpoll_command test (M12) with a
park-and-resume case that exercises the full wake/select cycle:
saturate work_tx, park via unpoll, fire result, assert
ready_timeout wakes within dispatch_idle_poll, assert next
drive_one_step lands the parked command without losing the
result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§11 working-memory bullet now notes that the total scales roughly
linearly with worker_count via the per-worker decoder workspace
term (~20 MiB / worker on tiny), with worked examples for 8 and
16 workers. Model weights and the alignment-logits buffer are
called out as independent of worker_count so readers don't
extrapolate the wrong scaling onto them.

§5.6 AsrParams.n_threads doc-comment now adds the SemVer caveat:
c_int is i32 on every supported platform today, so the alias is
a no-op. The alias exists to track whisper-rs's signature
exactly; an upstream type change (or a future c_int variation)
would propagate as a breaking change here, requiring a
coordinated whispery major.

After 7 rounds of multi-agent adversarial audit, every
documentation item flagged across rounds 1-7 has been addressed
in the spec or explicitly deferred with rationale. The design
has converged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation plan covering whispery's no-ML-deps foundation:
public types (Transcript, Word, Lang, errors, VadSegment), the
Sans-I/O core module (cut state machine, sample buffer, dispatch
state machine, Transcriber), benches, and an end-to-end mocked-
backend integration test.

25 tasks total, organised into 11 sections. Each task has 3-5
TDD-style steps with concrete code, exact commands, and explicit
expected outputs. After Plan A merges, whispery is a usable
state-machine crate that can be driven via examples/core_only.rs;
Plan B will add the whisper-rs runner and Plan C the alignment
pipeline.

Spec ref: docs/superpowers/specs/2026-04-28-whispery-cut-batch-whisper-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan A scope: core deps only (mediatime, smol_str, thiserror,
smallvec). Plan B will add the runner feature pulling whisper-rs
and crossbeam-channel; Plan C will add alignment pulling ort,
tokenizers, ndarray.

Spec: docs/superpowers/specs/2026-04-28-whispery-cut-batch-whisper-design.md §3.2.
Code-quality review found that smol_str declared without
default-features=false silently links its std feature even when a
consumer enables whispery with --no-default-features, breaking
the no_std promise lib.rs claims via #![cfg_attr(not(feature =
"std"), no_std)]. Add default-features=false on smol_str and
propagate smol_str/std through whispery's std feature.

thiserror has the same structural pattern but its std feature is
empty in 2.x so there is no current correctness impact;
deferred.

Spec: §3.2.
Wires up the module tree: time, types, core. Crate-level lints
match the silero convention (deny missing_docs, forbid unsafe_code).
Modules are empty placeholders that subsequent tasks fill in.

Spec: §3.1, §3.3.
Forward fix flagged by Task 2 code review: when types code lands
in Task 8 (errors via thiserror) and Task 10 (AsrParams uses
SmallVec), the no_std build path will need these defaults off.
Applying now while the manifest is fresh saves a fix-up commit
later. thiserror v2's std feature is currently empty so this is
a behaviour-no-op today; smallvec works on alloc-only out of the
box.

Spec: §3.2.
The analysis timebase is internal-only; whispery's public
output uses whatever timebase the caller's first push_samples
Timestamp carries.

Spec: §4.1.
Implementation Plan §4. Spec: §4.2 (Transcript.chunk_id), §5.5
(monotonicity contract — id increments on Event::Error too).
Spec §4.1.3: 16 kHz sample-index input, panic on zero or negative
duration. silero never produces zero-duration segments; panic
catches programmer error at the boundary.
Variant table from whisper.cpp's g_lang. as_str() for the
known→ISO direction. The reverse direction (from_iso639_1) and
the canonicalisation invariant test land in the next task.

Added #[allow(missing_docs)] on the enum to satisfy the crate's
#![deny(missing_docs)] — the 99 variant names are ISO 639-1 codes
and are self-documenting by name.

Spec: §4.4, Appendix C.
Plan A Task 6 code review caught an internal inconsistency in
the spec: the prose at §Appendix C said "99 named variants" but
the code-block table immediately above lists 100 (10 rows × 10).
Hand-counted whisper.cpp v1.7.6's g_lang: 100 supported
languages. The table is correct; the prose was wrong.

This commit fixes only the prose. The plan file's Task 7 test
fixture (which asserts known.len() == 99) is corrected when
Task 7's implementer is dispatched.

Spec: §4.4, Appendix C.
Total-function constructor: every &str produces a Lang. Known
codes canonicalise to named variants, unknowns go to Other.
Test verifies Lang::from_iso639_1(V.as_str()) == V for every
named variant AND that the result is never Lang::Other(_).

Spec: §4.4, §10.4 (Lang canonicalisation invariant test).
Two channels: synchronous TranscriberError on push/inject paths,
asynchronous WorkFailure via Event::Error. WorkFailure derives
Clone + Debug so the dispatch loop can move it into Event::Error
while runner code logs it.

Spec: §4.5.
Crate-private constructors used by the dispatch state machine
(Transcript::new) and alignment pipeline (Word::new). A
test-only \`for_test\` module gives downstream tests concise
helpers for synthesising Transcripts and Words.

Spec: §4.2, §4.3.
Backend-agnostic: no whisper-rs names in any of these types. The
runner translates AsrParams into FullParams in Plan B. Default
values match WhisperX (BeamSearch{5,-1.0}, no_context=true,
temperature ladder 0.0/+0.2 × 6 attempts).

Spec: §3.4 (backend invariant), §5.6.
Two cosmetic fixes after Task 10's commit:

1. Cargo.toml gains empty `runner = []` and
   `alignment = ["runner"]` features. The cfg(feature = "alignment")
   branches in src/core/command.rs were emitting `unexpected_cfg`
   warnings on every build because the feature wasn't declared. Plan
   B/C will replace these stubs with real feature payloads.

2. The `pub(crate) type ChunkAudio = Arc<[f32]>` alias is consumed
   by the dispatch state machine in Task 16; allow(dead_code) until
   then.

Spec: §3.2.
… Cut)

State struct only — push_segment and flush land in the next task
with their tests. Crate-private; nothing here crosses the public
surface.

Spec: §5.3.
Per-index split formula: start_i = seg.start + (i*len)/n with the
last part absorbing the remainder by setting end_{n-1} =
seg.end_sample. Guarantees max part length ≤ chunk_size_samples;
the audit's pathological case (len=29, chunk_size=10, n=3) now
produces 9/10/10, never 9/9/11.

Tests cover: empty flush, single short segment, multi-segment
merge, multi-segment with mid-stream flush, over-long single
segment with three-way split, and the strict-bound regression
test for the audit case.

Spec: §5.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio and others added 30 commits May 4, 2026 16:27
…gth))

Codex round-23 [high] finding: `backtrack_beam` cloned each
beam's full `Vec<PathPoint>` on every stay/change branch. With
`beam_width = 2` and 4 branches per outer iteration over `T`
frames, total path-copy cost was O(T²) — ~72 MB of memory
churn for T=1500 (a 30 s chunk), ~3.2 GB for T=10000. The
trellis budget caps `T * num_tokens` at 32 M cells, so a
long-`T`/modest-`num_tokens` case can pass that gate while
still drowning the alignment worker in path-clone allocation
and surfacing as a watchdog timeout instead of a bounded
result.

Replaced the per-beam `Vec<PathPoint>` with a flat arena of
`BeamNode { token_index, time_index, score, point_score, prev:
Option<u32> }`. Active beams are now `Vec<u32>` indices into
the arena. Each branch is O(1): push one new node + record its
predecessor index. The path is reconstructed once at the end
by walking the predecessor chain from the winning beam back to
the seed (O(T) work, O(T) allocation, total).

Memory bound: arena grows by at most `beam_width * 2` per
iteration × T iterations = ~4T nodes worst case. At T=1500 /
beam=2 that's ~96 KB; at T=10000 it's ~1 MB. Linear in T
instead of quadratic.

Path reconstruction uses the natural ascending-time order from
the chain walk (winner has the smallest time, seed has the
largest). The leading-blank fill (frames `[0, winner.t)` →
token-0 + blank emission, WhisperX visualisation convention)
is prepended in ascending time too, so no final reverse is
needed. The previous build-then-reverse code was an artefact
of the path being constructed in descending-time order during
the cloning approach.

Also dropped the now-unused `PathPoint` struct (replaced
end-to-end by `BeamNode` internally and `PathPointPublic` at
the API boundary).

Verification:
- Tests fixed an early bug where prepending the leading-blank
  fill to a vector that was about to be reversed put those
  frames in the wrong relative position. Round-trip path-order
  pinned by the existing `backtrack_beam_simple_two_token_path`
  test (T=3, tokens=[1,2]: path[0].t=0, path[2].t=2).
- 282 lib tests pass.
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998,
  below_0.5=0 across 854 word pairs (bit-identical to the
  pre-refactor numbers — confirms the algorithmic semantics
  are preserved).
… reject

Two findings from `/codex:adversarial-review --base main`:

[medium] `build_speech_frames` did not clamp sub-segment bounds
to `[0, n_samples]`, while `build_speech_mask` did. With the
WhisperX effective ratio's last frame interval `[(T-1)*spf,
T*spf)` extending just past `n_samples`, an overshoot
sub-segment could "credit" the trailing frame with phantom-
sample overlap — marking it as speech against audio that
doesn't exist, while step 0's silence-mask had nothing to
zero in that region. `compose_words` then accepts words at
`coverage >= 0.5` and only clamps the emitted end timestamp,
so a word spanning the last real frame plus the overshoot
frame can survive over mostly silent / nonexistent audio.

Fix: pass `n_samples: u64` into `build_speech_frames` and
clamp `seg_start` / `seg_end` to `[0, n_samples_i64]` exactly
like `build_speech_mask` does. Both helpers now agree on the
same coordinate contract.

Added `build_speech_frames_clamps_overshoot_seg_to_chunk_end`
regression covering full-overshoot and partial-overshoot
sub-segments — both must yield `false` on the trailing frame.

[medium] `WhisperPoolOptions::with_worker_count(0)` /
`set_worker_count(0)` were silently accepted, and
`WhisperPool::new` then spawned no worker threads (`for _ in
0..0`) while still binding the work channel. Chunks would
enter `in_flight` with no receiver capable of producing
results, stalling the pump/drain loop until the configured
timeout fires — a fatal stall instead of a clear construction
error.

Two-layer defence:
- Setter (`set_worker_count`) and builder (`with_worker_count`)
  panic on `value == 0`. Same pattern as
  `Aligner::set_sample_rate(0)` and the transcriber-config
  setters that already panic on degenerate values; catches
  the explicit programmer-error case at the API boundary.
- `WhisperPool::new` returns `RunnerError::WhisperContextLoad`
  with a contract-violation message when `worker_count == 0`
  in the deserialised config — covers the serde path that
  bypasses the setter.

Two new `#[should_panic]` regressions exercise both API
shapes; a follow-up integration test for the
`WhisperPool::new` error path is gated behind needing a real
`WhisperContext`, deferred.

Verification:
- 285 lib tests pass (was 282; +1 overshoot regression, +2
  zero-worker panic regressions).
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean
  0.983-0.998, below_0.5=0 across 854 word pairs (bit-
  identical to pre-fix; the overshoot clamp doesn't bite the
  parity fixtures because their `raw_asr_segments[]` are
  always inside chunk bounds — the fix is defensive against
  malformed external callers).
…sion)

Codex round-22 [medium] finding: `log_softmax_with_finite_guard`
computed `log_z = max + sum.ln() as f32`, then `lp = x - log_z`.
For a row with a large common offset like `[1e20, 1e20]`,
`sum.ln() = ln(2) ≈ 0.69` rounded away when added to `max =
1e20` in f32 — `1e20 + 0.69 ≈ 1e20` — and every `lp` collapsed
to `0.0` instead of the correct `-ln(2) ≈ -0.693`. The
finiteness checks passed but the outputs were no longer
log-probabilities, hiding a backend / export numeric failure
as plausible alignment input.

Fix: keep the subtraction of `max` in shifted f64 throughout.
Compute `log_z_shifted = sum.ln()` (a small f64 in `[0, ln(V)]`
under any finite input), and per element `lp = ((x as f64 -
max as f64) - log_z_shifted) as f32`. Both terms in the
subtraction are bounded — `(x - max) <= 0` and
`log_z_shifted >= 0` whenever any element equals `max` — so
`lp_f64 ∈ (-∞, 0]` regardless of `max`'s magnitude. Casting to
f32 preserves the precision that mattered.

Kept the per-element finiteness check: pathological inputs
like `[f32::MAX, -f32::MAX]` can still underflow `lp_f64 as
f32` to `-inf`, and surfacing that as `ModelInferenceFailed`
keeps the backend-numeric-failure path typed (a `-inf`
slipping into `data` would let Viterbi return `NoAlignmentPath`
recoverably and mask the bug as `words: []`). The existing
`log_softmax_rejects_finite_extremes_that_overflow_lp` test
still pins this behaviour.

Replaced the previous `log_z`-finiteness check with a
`log_z_shifted`-finiteness check (the same value semantically;
only fires on the all-(-inf) case, already covered by
`log_softmax_rejects_all_neg_infinity_row`).

New regression test:
- `log_softmax_large_common_offset_normalises_to_unit_exp_sum`:
  feeds `[1e20, 1e20]`, asserts the output exp-sum is 1.0 and
  each lp is ≈ -ln(2). Pre-fix this test would fail with
  exp-sum=2.0 (each lp=0.0).

Verification:
- 286 lib tests pass (was 285, +1 large-offset regression).
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998,
  below_0.5=0 (bit-identical to pre-fix; ORT's wav2vec2 logits
  are bounded to typical ranges, so the f64-shift produces the
  same f32 outputs as the previous f32-fold for normal inputs
  — the fix is defensive against malformed model exports).
Codex round-25 [medium] finding: for inputs shorter than
wav2vec2's 400-sample receptive field, the encode path pads
the buffer to 400 before invoking ORT. The encoder then emits
`T` frames whose stride corresponds to the PADDED length
(`padded.len() / (T - 1)`), but downstream `samples_per_frame`
uses the original `samples.len() / (T - 1)`. The two views
disagree by exactly the padding ratio: `compose_words` emits
word ranges shrunk by `samples.len() / padded.len()` and
`build_speech_frames` classifies frames against intervals
that don't match the encoder's own view. A short utterance at
the edge can be silently dropped by the speech post-pass or
get a word span projected onto the wrong part of the original
audio rather than recognized as padding-only.

Codex's recommendation offered two paths:
- Carry both lengths explicitly (validate against padded
  length, mark padded frames non-speech, clamp emitted ranges
  to original).
- Skip alignment for sub-400-sample chunks.

Took the second path. At 25 ms or less, a chunk cannot
realistically contain a meaningful CTC path through any
non-trivial transcript anyway — the trellis budget requires
`T >= num_tokens`, and `T` for a sub-400-sample chunk is 1-2
frames. The simplest safe response is an early-out matching
the existing `EmptyText` / empty-tokens short-circuit:
return `Ok(empty AlignmentResult)`. Alignment becomes a no-op
on degenerate input rather than a data-loss path.

The padding code below the guard remains as defense-in-depth
(it's now provably unreachable, but the explicit
`if normalized_samples.len() < 400` branch documents the
contract and protects against a future refactor that
accidentally moves the guard).

Added regression test
`sub_400_sample_chunk_short_circuits_to_empty_result`: feeds
a 200-sample buffer with realistic transcript text and
asserts the call returns `Ok(empty)` rather than `Err` or
mis-projected word ranges. Skips when the wav2vec2 fixture
isn't present (offline / `WHISPERY_OFFLINE=1`); same gating
as the existing `empty_normalised_text_returns_empty_alignment_result`
test.

Verification:
- 287 lib tests pass (was 286, +1 sub-400 regression).
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean
  0.983-0.998, below_0.5=0 (bit-identical to pre-fix; real
  fixtures don't trigger the guard, all segments are >> 400
  samples).
…sample-rate setter

Codex round-26 surfaced two findings, both legitimate.

[medium] Backtracker reserve can exceed the trellis budget
(`trellis_beam.rs:486`).
The trellis budget caps `T * num_tokens` at 32 M cells, NOT
`T` alone. A degenerate `num_tokens = 1` pass-through can
therefore drive `T` up to 32 M frames while still passing the
trellis guard. The previous backtracker code then issued a
`Vec::with_capacity(1 + beam_width * 2 * T)` pre-reserve —
~96 KB at typical T ≈ 1500, but ~3 GB at T = 32 M, allocated
up-front before the per-iteration abort check could fire.
Dropped the pre-reserve entirely (`Vec::new()`); push-driven
growth is amortised O(1) and bounded by the same per-iteration
abort flag the loop already honours. For typical T ≈ 1500 the
doubling churn is ~400 KB total — dwarfed by the trellis
itself. Avoids OOM/panic on pathological input by letting the
existing abort path bound work in-band rather than front-
loading allocation.

[medium] Public sample-rate override is ignored
(`aligner.rs:192-204`).
`set_sample_rate` / `with_sample_rate` mutated `self.sample_rate`,
but the alignment algorithm never consumed that field —
silence-mask, frame timebase, and stride checks all hard-coded
`SAMPLE_RATE_HZ` (16 kHz). A caller using the override for a
non-16 kHz model got plausible-but-wrong masks and word
timestamps instead of a configuration error.

Removed the `sample_rate: u32` struct field entirely along
with both setters and builders. The `sample_rate()` getter
now returns `SAMPLE_RATE_HZ` directly — informational only,
documents what the aligner expects. Non-16 kHz wav2vec2
support is out of scope for v1; if it becomes in-scope, the
right shape is to wire the override through silence-mask,
frame-timebase, and stride checks together (with end-to-end
tests) rather than expose a knob that mutates a dead field.

This is a public API breaking change (callers of
`set_sample_rate` / `with_sample_rate` will fail to compile),
but pre-1.0 those methods never worked correctly anyway.

Verification:
- 287 lib tests pass.
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998,
  below_0.5=0 (bit-identical to pre-fix; both fixes are
  behaviour-neutral on real chunks — the arena pre-reserve
  was a memory-budget concern, not a correctness one, and
  the sample-rate setters were never read in the algorithm).
…to union

Codex round-27 [medium] finding: `build_speech_frames` summed
each sub-segment's per-frame intersection independently, so two
overlapping ranges (e.g. `[0, 100]` and `[50, 150]` inside frame
0's `[0, 320)` interval) contributed `100 + 100 = 200` to the
frame's overlap counter even though their UNION only covers 150
samples. With `min_overlap_samples = 160`, the frame would clear
the threshold via raw sum (200 ≥ 160) despite its union being
below it (150 < 160), disagreeing with `build_speech_mask`'s
union semantics (per-sample boolean OR) and letting
`compose_words` retain words whose audio is mostly masked
silence.

The contract is "≥ 50 % of the frame's samples are inside the
VAD speech UNION", which matches `build_speech_mask`'s
boolean-OR aggregation. Fix: clamp + sort + merge sub-segments
into a non-overlapping union before per-frame accumulation.
Touching segments (`s == last.1`) merge too, so [0, 80] +
[80, 160] becomes a contiguous [0, 160] (matches the
`build_speech_mask` per-sample OR semantics for a single
contiguous voiced span).

Real wav2vec2 fixture VAD segments don't overlap (silero / VAD
upstream guarantees disjoint intervals), so the parity numbers
are bit-identical to pre-fix. The fix is defensive against
external `align_chunk` callers that pass arbitrary VAD output —
which the public API documents as accepting in chunk-local
1/16000 timebase but doesn't otherwise validate.

Three new regression tests in `compose::tests`:
- `build_speech_frames_uses_union_not_sum_for_overlapping_segments`:
  two overlapping segs whose UNION (150) < threshold (160) but
  SUM (200) > threshold — must classify silent. Plus a
  triple-overlap stress case (sum 240, union 120, classifies
  silent).
- `build_speech_frames_treats_adjacent_segments_as_contiguous`:
  touching segs `[0, 80]` + `[80, 160]` form a [0, 160] union
  that exactly hits the threshold → speech.

Verification:
- 289 lib tests pass (was 287; +2 union tests, +1 adjacent
  test).
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998,
  below_0.5=0 (bit-identical to pre-fix).
Codex round-28 [medium] finding: `tokenize_with_word_map` took
`wildcard_chars_per_word: &[u32]` — a single TOTAL count per
word — and pushed every wildcard at the END of the word's
encoded chars. That made leading punctuation like `"hello`
indistinguishable from trailing `hello"` once it reached the
CTC graph: both produced `[h, e, l, l, o, *]`. The wildcard
frames always landed at the trailing-frames zone, biasing the
word's emit-end timing past where the leading-punct silence
actually sat in the audio. WhisperX's reference behaviour
keeps `*` placeholders in source order.

Fix: split the per-word count into `(prefix, suffix)` tuples
end-to-end:

- `NormalizedText::wildcard_chars_per_word: Vec<u32>` →
  `NormalizedText::wildcard_boundary_per_word: Vec<(u32, u32)>`.
  `with_wildcards` now takes the tuple form; getter renamed to
  `wildcard_boundary_per_word()`.

- `EnglishNormalizer` computes prefix vs suffix separately by
  splitting `strip_word_punct` into a left-trim then a
  right-trim and counting each side. Hyphenated words: the
  FIRST piece inherits the original word's prefix-stripped
  count, the LAST piece inherits the suffix-stripped count;
  middle pieces get `(0, 0)` (the hyphen is treated as a real
  word boundary, not a wildcard frame).

- `tokenize_with_word_map` takes
  `wildcard_boundary_per_word: &[(u32, u32)]`. For each word it
  pushes prefix wildcards BEFORE the encoded chars and suffix
  wildcards (plus any internal-skipped chars discovered during
  this pass) AFTER. Internal-skipped (`.` for abbreviations)
  still lumps into the suffix bucket — placing it exactly in
  source order would require interleaving wildcards mid-word
  during per-char encoding, which the current flow doesn't
  support. The suffix-leaning approximation is strictly better
  than the previous all-at-end design for the common
  leading-vs-trailing case Codex flagged; the rare internal
  case takes the same hit it already had.

Three new regression tests in `tokenize::tests`:

- `leading_wildcards_land_before_encoded_chars` — pins the new
  prefix-injection contract for `"hello` (prefix=1, suffix=0):
  `*` must land at index 0.
- `trailing_wildcards_land_after_encoded_chars` — renamed +
  hardened version of the previous test; pins suffix-only
  behaviour (`hello"`).
- `paired_wildcards_bracket_encoded_chars` — pins the
  paired-punctuation case `(hello)`: prefix at start, suffix
  at end, only encoded chars in the middle.

The length-mismatch test was renamed to
`wildcard_boundary_per_word_length_mismatch_errors`.

Verification:
- 291 lib tests pass (was 289; +3 wildcard placement tests,
  −1 renamed/dropped from the previous total-count contract).
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998,
  below_0.5=0 (bit-identical to pre-fix; the dia fixtures'
  source text doesn't trigger the leading-punctuation case
  where the wildcard order matters — the fix is defensive
  against transcripts that DO have leading boundary
  punctuation).
…tcher

Codex round-29 findings:

1. min_speech_coverage validation (high). Out-of-range values were
   stored verbatim, so a config typo of 1.5 silently dropped every
   word (no word's coverage can exceed 1.0). Now coerce: clamp to
   [0.0, 1.0]; treat NaN as the documented default (0.5).

2. inject_wordlevel_model_type used naive substring search and
   brace counting (medium). A user tokenizer.json whose string
   values contained "model", `{`/`}`, or "type" substrings could
   be misdetected and patched at the wrong byte range. Replaced
   with a quote-aware scanner that tracks in_string / escape
   state across three helpers (find_top_level_object_value_open,
   find_matching_close_brace, has_top_level_key).

5-fixture parity bit-identical to round-28
(median 0.995-0.999, below_0.5=0 across all).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-30 findings:

1. Per-packet ASR override binding (medium). The override snapshot
   was taken at chunk *emission* time. A chunk whose audio was
   pushed under packet A's override but whose VAD-driven close
   happened in packet B silently picked up B's override (or
   None) — language hints, prompts, and strategy overrides got
   attached to the wrong audio under normal streaming VAD
   latency. Fix: cut state machine snapshots `current_override`
   when accumulation starts (current_start: None → Some), carries
   it on `MergedChunk.override_at_start`, and dispatch.on_emit
   reads from there. First-override-wins semantics.

   New regression test (src/core/transcriber.rs):
   `override_binds_to_packet_that_started_chunk_not_packet_that_closed_it`.

2. poll_transcript / poll_error silently dropped fatal runner
   errors (high). `let _ = self.drive_one_step()` dropped
   `WhisperPoolShutdown` and `Backpressure` errors — a dead pool
   looked like an empty stream. Fix: change signatures to
   `Result<Option<_>, RunnerError>`; stash any drive_one_step
   error in a `pending_fatal` slot; surface it once buffered
   transcripts/errors have been drained (so already-arrived
   results aren't lost).

5-fixture parity bit-identical to round-29
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-31 finding: whisper-rs's set_language and
set_initial_prompt build CStrings via CString::new(...).expect(...).
A SmolStr or Lang::Other carrying an interior NUL byte therefore
panicked the worker thread mid-inference. Single-worker pools
shut down; multi-worker pools left the chunk stranded with no
WorkFailure (drain hangs until drain_timeout). Public callers
reach these strings through AsrParamsOverride::with_initial_prompt
and Lang::Other without validation.

Validate before crossing the FFI boundary in full_params_from
and surface the failure as an in-band
WorkFailure::AsrFailed { kind: BackendError } for the offending
chunk. The pool stays alive; subsequent chunks transcribe
normally. Diagnostic message includes byte length but not
content (initial_prompt may be sensitive).

Two new unit regressions cover Lang::Other("xx\0yy") and
initial_prompt = "hint\0poison".

5-fixture parity bit-identical to round-30
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-32 findings:

1. ASR timeout silent fail-open (high). run_with_temperature_ladder
   only consulted abort_flag in the Err path. If the watchdog fired
   during inference but whisper.cpp finished before observing the
   next abort-callback poll, an apparently-successful AsrResult
   reached the caller — violating the per-job timeout AND skipping
   timeout-streak recycling (was_timeout was derived from the
   un-rewritten outcome). Fix: post-watchdog-join, re-check
   abort_flag and rewrite outcome to WorkerHangTimeout. Mirrors
   alignment_pool.rs:266's existing pattern.

2. Unbounded Lang::Other intern leak (medium). intern_lang_str
   leaks one &'static str per distinct language hint and never
   evicts. Lang::Other(SmolStr) is public, so a tenant or buggy
   caller producing unique high-cardinality hints would grow
   process memory without bound. Fix: validate before interning.
   Restrict to lowercase ASCII letters [a-z], len 1..=8 — covers
   every whisper.cpp-recognized code with comfortable headroom.
   Reject as in-band WorkFailure::AsrFailed (consistent with the
   round-31 NUL-rejection style); diagnostic message does NOT
   echo offending bytes (hint can be public input).

5-fixture parity bit-identical to round-31
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-33 findings:

1. Serde bypassed VadSegment invariant (high). derive(Deserialize)
   skipped VadSegment::new's end_sample > start_sample check.
   Reversed-range serialized input would wrap sample_count() to
   ~u64::MAX and trip the cut state machine's hard-split assert.
   Fix: replace derive with a custom Deserialize that re-validates
   at the serde boundary and surfaces a typed deserialize error.

2. max_attempts=0 silently dropped all ASR (medium). The retry
   ladder iterates `for _attempt in 0..max_attempts` — zero meant
   zero state.full() calls and AllTemperaturesFailed for every
   chunk. Fix: panic on 0 in set/with_max_attempts (matches the
   codebase's existing zero-rejection style for sizing config like
   worker_count, chunk_size, max_in_flight); add a serde
   deserialize_with that surfaces the violation as a typed error.

3. is_idle() ignored pending_fatal (medium). A poll that stashed
   a fatal could be shadowed by is_idle()=true on the next call,
   so callers polling "until idle" would stop without ever
   observing the structural failure. Fix: add !pending_fatal to
   the conjunction. Extracted is_idle_inner const fn so the
   contract is testable without a real ManagedTranscriber.

5-fixture parity bit-identical to round-32
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-34 findings:

1. Invalid n_threads crossed FFI unchecked (high). whisper.cpp's
   decoder loops do `vec<thread>(n_threads - 1)` when n_threads
   isn't 1; 0 underflows to a huge alloc, negatives abort. The
   public AsrParams::set/with_n_threads accepted any i32 and
   forwarded to set_n_threads. Fix: panic on `< 1` in setters,
   serde deserialize_with rejects, defensive check in
   full_params_from returns chunk-level WorkFailure.

2. Dead GPU-default cfg (medium). default_worker_count and
   default_timeout_streak_threshold checked feature names like
   `_whisper_cuda` that don't exist in Cargo.toml — the cfg
   branches always returned false, so GPU users silently got
   CPU defaults (oversubscribing one GPU with multiple workers).
   Fix: drop the dead cfg, add explicit
   WhisperPoolOptions::new_for_gpu(model_path) constructor that
   picks single-worker / longer-streak defaults appropriate for
   serialised GPU inference. CPU `new()` unchanged.

3. Lang serde shape mismatched docs (medium). derive(Serialize,
   Deserialize) produced Rust variant names `"En"` and
   `{"Other":"xx"}`, contradicting the documented lowercase ISO
   string format. Fix: custom Serialize/Deserialize that writes
   `as_str()` and reads via `Lang::from_iso639_1`, with the same
   shape validation as round-32 (lowercase ASCII letters,
   1..=8 bytes). Canonicalisation invariant preserved across
   serde — `"en"` deserialises to Lang::En, not
   Lang::Other("en"). Legacy derive-shaped JSON now rejected.

5-fixture parity bit-identical to round-33
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Configs are usually human-edited and humans naturally use mixed
case ("EN", "En", "en"). The ISO 639 standard treats codes as
case-insensitive — make Lang's serde deserializer accept any
case and normalize to canonical lowercase before
Lang::from_iso639_1.

Wire format and FFI behavior unchanged: serialization always
emits lowercase, the alignment-stage validation in
runner/whisper_pool.rs is unchanged (it never sees uppercase
because both named variants and post-deserialize Lang::Other
inner strings are always lowercase).

Round-trip asymmetry by design: "EN" in → Lang::En → "en" out.

Side benefit for migration from the round-34 derive switch:
legacy configs using Rust-variant-shape strings like "En"
(uppercase first letter) continue to work; only the
externally-tagged {"Other":"xx"} form remains rejected.

Two new regression tests; existing serde_rejects_uppercase
removed (it's no longer the contract).

5-fixture parity bit-identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-35 findings:

1. process_packet partial-commit (high). The previous order
   (push_samples → push_vads loop) committed samples first, then
   pushed VADs one at a time. A failing VAD #N left samples + VADs
   0..N-1 committed; caller could not safely retry the packet
   (same starts_at would trip PtsRegression). Fix: add
   precheck_vad_segments on Transcriber that validates eof,
   timebase, ordering, and high-water (against a floor projection
   of post-push high water) WITHOUT mutating state. Call it before
   push_samples in process_packet so process_packet is now
   either fully atomic or returns Err with no commit. The
   precheck is conservative (floor projection assumes no
   gap-fill), which can false-reject VADs targeting the rare
   zero-fill region — silero never emits such segments in
   practice. False-accept is impossible.

2. drain idle/timeout reporting (medium). drain() used
   self.core.is_idle() which ignored pending_transcripts,
   pending_errors, AND the round-30 pending_fatal slot. A fatal
   recorded behind buffered output could coexist with drain
   returning Ok. Fix: drain now (a) takes pending_fatal eagerly
   if set and returns Err, (b) checks self.is_idle() (the
   pub predicate including all four conjuncts) for the success
   case, and (c) reports DrainTimeout.in_flight via a new
   in_flight_chunk_count() that returns cut_pending.len() +
   in_flight.len() — the documented chunk count, not the
   sample count that buffered_samples() returned (which can
   read 0 after trim while real chunks remain).

7 new precheck regression tests cover valid/invalid orderings,
ahead-of-projected-water, AfterEof, and OutputTimebaseUnset.

5-fixture parity bit-identical
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-36 [critical]: the previous
`recv_timeout(timeout).is_err()` matched both
`Err(Timeout)` (real hang) AND `Err(Disconnected)` (clean cancel
when the worker drops cancel_tx after fast inference). Round-32's
post-watchdog `abort_flag` re-check then rewrote every successful
`AsrResult` into `WorkFailure::WorkerHangTimeout` — every
transcript would be lost in production ASR runs. The 5-fixture
parity suite hides this because it injects pre-computed WhisperX
ASR dumps and only exercises the alignment path.

Fix: extract `watchdog_should_signal_timeout` helper that
pattern-matches on `RecvTimeoutError`. Only `Timeout` flips the
flag; `Ok(())` and `Disconnected` are clean exits. Mirrors the
alignment pool's existing watchdog pattern at
runner/alignment_pool.rs:209-220.

5 new regression tests cover: helper-level discrimination of
each variant; end-to-end "drop tx does NOT flip flag";
end-to-end "real timeout DOES flip flag".

Parity unchanged (alignment-only path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-37 findings (both high):

1. AsrParamsOverride.{language_hint,initial_prompt} use
   Option<Option<T>> for the absent / clear / set distinction,
   but the derived Deserialize collapsed JSON `null` and "field
   absent" into the same outer None — making
   `{"language_hint": null}` indistinguishable from omitting
   the field, defeating the documented "clear this override"
   contract. Locked-language streams could not unlock via JSON
   config. Fix: deserialize_with helpers
   `deserialize_double_option_lang` /
   `deserialize_double_option_smolstr` preserve the three-way
   distinction at the serde boundary. Lang's case-insensitive
   ISO deserializer continues to apply through the helper.
   Four new round-trip tests cover absent/null/value plus the
   three-state round-trip.

2. SampleBuffer::append committed the anchor, zero-filled, and
   advanced absolute_sample_offset for empty packets at a
   forward delta within gap tolerance. A "heartbeat" call at a
   slightly future PTS would create phantom audio and reject
   the next real packet at the originally-expected PTS as
   PtsRegression — caller had no recovery path. Fix: empty
   packets at delta > 0 are now a true no-op for stream state
   (validation still runs). delta == 0 empty packets and
   non-empty packets unchanged. Three new regression tests cover
   the heartbeat-then-real-audio sequence, empty-first-then-
   forward-heartbeat-then-real-audio, and exactly-on-time empty
   no-op.

5-fixture parity bit-identical
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… leak

Codex round-38, all four findings:

1. CString leak per ASR attempt (high). whisper-rs's set_language
   / set_initial_prompt use CString::into_raw() with no Drop, so
   each attempt leaked one C string. Build the FullParams template
   once per chunk and clone for each temperature attempt. Reduces
   leak from O(attempts × chunks) to O(chunks) — 6× at default
   max_attempts = 6. FullParams: Clone shallow-copies the inner
   struct including the leaked *const c_char pointers; aliasing
   is safe because the strings have static (leaked) lifetime.
   Inter-chunk dedup deferred (abort_callback captures per-job
   state, more involved).

2. drain_timeout bypassed while a command is parked (high).
   pump_until_idle_or_progress looped forever on parked +
   no-progress; the drain deadline wasn't checked inside the loop.
   A worker stuck in state.full made drain hang past its
   configured timeout. Thread an Option<Instant> deadline through
   the pump; check before wait_for_progress. process_packet /
   signal_eof keep deadline = None (existing behavior); drain
   passes Some(start + timeout). Now matches the documented
   timeout contract.

3. no_speech_threshold was a public no-op (medium). The knob
   existed in AsrParams but full_params_from never forwarded it
   and the accept gate only checked avg_logprob / compression
   ratio. Forward to set_no_speech_thold AND check post-decode:
   when mean no_speech_prob > threshold AND avg_logprob is too
   low to trust, short-circuit to empty AsrResult (no temperature
   retries — temperature can't conjure speech the model already
   evaluated as absent). Mirrors WhisperX / OpenAI Whisper's
   silence detection.

4. First empty packet still committed the anchor (medium).
   Extends round-37's empty-packet fix: an empty FIRST push has
   delta == 0 by definition (no prior anchor) so the round-37
   bail (delta > 0) didn't fire, and the first-push commit path
   ran. Result: a heartbeat at PTS X locked the stream there,
   and real audio at PTS 0 tripped PtsRegression /
   InconsistentTimebase. Now empty packets are a true no-op for
   stream state regardless of first-push. Real first audio at
   any PTS becomes the actual first push.

5-fixture parity bit-identical
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant