Captures the v1 design: a Sans-I/O state-machine core (Transcriber) that owns cut and dispatch logic with no ML deps, paired with a default-on runner feature that wires whisper-rs and ort-based wav2vec2 forced alignment. Output is a per-merged-chunk Transcript with mediatime ranges, language, full text, word-level timestamps, and provenance back to source VAD segments. References WhisperX's merge_chunks for the cut stage and adapts the batched-inference pattern to whisper.cpp's N-state concurrency model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P0 — emission ordering and back-pressure correctness:
- Dispatch state machine now tracks next_emit_chunk_id and emits
Event::Transcript / Event::Error in strict chunk-id order via
flush_in_order_events; out-of-order worker completions stage on
ChunkPhase::Ready until predecessors emit.
- cut_pending now holds chunk descriptors only (no extracted audio).
SampleBuffer becomes the sole audio back-pressure choke point;
buffer_cap_samples bounds total memory under sustained inference
saturation.
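The strict chunk-id emission order can be sketched as a tiny staging map. This is a hedged illustration only: names (next_emit_chunk_id, flush_in_order_events) come from the changelog, but the event type is simplified to a String and the real dispatch state machine carries far more state.

```rust
use std::collections::BTreeMap;

// Out-of-order worker completions stage in `ready`; events are released
// only as the contiguous run starting at next_emit_chunk_id fills in.
struct Dispatch {
    next_emit_chunk_id: u64,
    ready: BTreeMap<u64, String>, // staged out-of-order completions
}

impl Dispatch {
    fn complete(&mut self, chunk_id: u64, event: String) -> Vec<String> {
        self.ready.insert(chunk_id, event);
        self.flush_in_order_events()
    }

    fn flush_in_order_events(&mut self) -> Vec<String> {
        let mut out = Vec::new();
        // Release the longest contiguous run starting at next_emit_chunk_id.
        while let Some(ev) = self.ready.remove(&self.next_emit_chunk_id) {
            out.push(ev);
            self.next_emit_chunk_id += 1;
        }
        out
    }
}
```

A completion for chunk 1 arriving before chunk 0 stages silently; chunk 0's completion then releases both in order.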
P1 — semantic and interface defects:
- SampleBuffer accepts forward gaps within gap_tolerance_samples
(zero-fill); rejects only PTS regressions and oversized gaps.
- LanguagePolicy::{Auto, Lock, AutoLockAfter(n)} added; default
AutoLockAfter(1) to match WhisperX language-locking behaviour.
- Lang::ANY sentinel removed; replaced with AlignerKey::{Lang, Any}
enum lifted into the type system.
- Backend invariant in §3.4: core renamed Whisper* to Asr*
(Command::RunAsr, AsrParams, AsrResult, AsrTokenHint,
inject_asr_result). Whisper-specific fields confined to runner.
- AlignmentPool concurrency clarified: v1 single worker; future
parallelism conditional on ort EP thread-safety analysis.
- WhisperState/WhisperContext sharing flagged as §13.1 spike.
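The LanguagePolicy shape above can be sketched as follows. The enum variants are from the changelog; the lock-after-n bookkeeping is an assumed minimal interpretation (the real type presumably uses Lang rather than the &'static str stand-in used here).

```rust
/// Sketch of LanguagePolicy::{Auto, Lock, AutoLockAfter(n)}.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanguagePolicy {
    /// Detect language independently on every chunk.
    Auto,
    /// Always force the given language.
    Lock(&'static str),
    /// Auto-detect for the first `n` chunks, then lock to the detection.
    AutoLockAfter(u32),
}

pub struct LanguageState {
    policy: LanguagePolicy,
    detections: u32,
    locked: Option<&'static str>,
}

impl LanguageState {
    pub fn new(policy: LanguagePolicy) -> Self {
        Self { policy, detections: 0, locked: None }
    }

    /// Language hint for the next chunk; None means auto-detect.
    pub fn hint(&self) -> Option<&'static str> {
        match self.policy {
            LanguagePolicy::Lock(l) => Some(l),
            _ => self.locked,
        }
    }

    /// Feed back the language whisper detected for a completed chunk.
    pub fn observe(&mut self, detected: &'static str) {
        if let LanguagePolicy::AutoLockAfter(n) = self.policy {
            self.detections += 1;
            if self.detections >= n && self.locked.is_none() {
                self.locked = Some(detected);
            }
        }
    }
}
```

With the default AutoLockAfter(1), the first chunk auto-detects and every later chunk reuses that detection, matching the WhisperX behaviour the changelog cites.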
P2 — algorithmic correctness:
- Cut state machine hard-splits any single VAD segment longer than
chunk_size; chunk-size bound is now true chunk_size, not
chunk_size + max(seg_duration).
- Aligner::align is silence-aware (zero-mask outside sub_segments)
and normalisation-aware (per-language TextNormalizer trait;
surface-form recovery from normalised tokens to original text).
- condition_on_previous_text rationale rewritten: it controls
intra-chunk decoder prompt continuation, not cross-chunk
continuity.
- Transcript.text contradiction resolved: text is verbatim Whisper
output (punctuated/cased); Word.text is per-word surface form.
P3 — documentation cleanup:
- alignment feature moved to opt-in (default = ["std", "runner"]).
- bundled-tiny dropped from v1 features; moved to future work as a
build-time fetch with checksum verification.
- All public types use private fields with getters per the
findit-studio no-public-fields convention; Transcript and Word
refactored accordingly.
- Concrete error variants for AsrFailureKind and
AlignmentFailureKind; thiserror, smallvec listed as permanent
deps.
- Concurrency diagram redrawn with safer ASCII.
- Memory math corrected to include decoder workspace, alignment
logits, and per-worker model duplication scenarios.
- ort version pinned (= "2.0.0-rc.12") matching siblings.
- 16k → 48k integer-ratio round-trip constraint documented.
- Test matrix expanded with edge cases (single-segment-overrun,
zero-gap segments, empty whisper output, worker hang, language
lock).
P4 — open architectural questions:
- §1.6 added: speaker-agnostic semantics; downstream diarization
joins on Word.range and vad_segments. Confirmation requested
on whether the diarization plan needs additional anchors.
- §1.7 added: crate vs ThreadService deployment, marked as a
decision needed before implementation.
P5 — additions to v1:
- Per-call AsrParamsOverride for language hints / per-packet
param tweaks.
- Worker hang timeout protection on ASR and alignment workers;
drain_timeout cap on drain().
- Device enum (Cpu/Cuda/Metal/Vulkan) replaces use_gpu: bool.
Added §13 Open Risks for items requiring spike or external input
before implementation begins. Updated Appendix B with resolutions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§1.6: pin the diarization join contract against dia v0.1.0. dia::DiarizedSpan.range is mediatime::TimeRange at 1/16000 timebase, identical to whispery's Word.range. The indexer joins by plain interval overlap; whispery does not depend on dia and exposes no speaker-aware fields. Inline the relevant DiarizedSpan struct shape and a sketch of the indexer-side join.
§1.7: lock to crate-only deployment for v1, matching all sibling crates (silero, soundevents, textclap, dia). A wrapper service binary, if ever needed, is additive and does not change the public surface.
§13.4: convert from "open questions, get confirmation" to a record of resolutions; flag the dependency on dia v0.1.0's current shape so a future dia API change forces a revisit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
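The indexer-side join is plain interval overlap, which a hedged sketch makes concrete. Struct shapes here are illustrative stand-ins, not the real dia or whispery types; both ranges are assumed half-open and in the same 1/16000 timebase.

```rust
#[derive(Debug, Clone, Copy)]
struct Range { start: i64, end: i64 } // half-open [start, end)

// Two half-open intervals overlap iff each starts before the other ends.
fn overlaps(a: Range, b: Range) -> bool {
    a.start < b.end && b.start < a.end
}

struct DiarizedSpan { range: Range, speaker: u32 } // stand-in for dia's type
struct Word { range: Range }                       // stand-in for whispery's type

/// Indexer-side join: the speakers whose spans overlap a given word.
fn speakers_for_word(word: &Word, spans: &[DiarizedSpan]) -> Vec<u32> {
    spans.iter()
        .filter(|s| overlaps(word.range, s.range))
        .map(|s| s.speaker)
        .collect()
}
```

Because whispery exposes only Word.range and vad_segments, this join lives entirely downstream, which is exactly the speaker-agnostic contract §1.6 pins.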
Two big shifts in this revision: aligning §5.6/§6 with whisper-rs
0.13.x's actual API, and switching whispery to a two-timebase
output model (internal 1/16000 indexing, external caller-chosen
timebase for all emitted ranges).
Audit-level fixes (P0/P1):
B1: Drop public Device enum. whisper-rs selects backends via
Cargo features; only runtime knobs are use_gpu and gpu_device:
i32. WhisperPoolConfig now exposes use_gpu + gpu_device +
flash_attn directly.
B2: WhisperContext::new_with_params(path, WhisperContextParameters)
is the actual API; documented at the seam.
B3: condition_on_previous_text renamed to no_context, matching
whisper-rs polarity. Default true (no cross-chunk context). Doc
rewritten to explain intra-chunk vs cross-chunk semantics.
B4: SamplingStrategy moved to AsrParams (Greedy{best_of} /
BeamSearch{beam_size, patience}); FullParams is constructed
fresh per chunk, so beam parameters are runtime knobs after all.
B5: Drop suppress_tokens (whisper-rs has no token-list setter).
Only suppress_blank and suppress_non_speech_tokens remain.
B6: Drop temperature_schedule. The runner implements its own
ladder via repeated state.full() calls; AsrParams carries
initial_temperature, temperature_increment, max_attempts.
B7: Resolve §1.5 vs §5.6 contradiction by deleting AsrTokenHint
and Command::RunAlignment.token_hints. v1 does not enable DTW;
wav2vec2 forced alignment doesn't need per-token seeds.
B8: Drop AsrFailureKind::EmptyOutput. Empty whisper output is a
normal Transcript with empty text and elevated no_speech_prob,
not a failure.
B9: Pin invariants in §5.5: chunk_id increments on Event::Error
too; flush_in_order_events runs before trim on every inject path.
B10: Rewrite §6.4 to avoid the inline-send saturation deadlock.
Dispatch loop uses non-blocking try_send + always-drain pattern,
with crossbeam_channel::select! for backpressure parking.
B11: Pin Backpressure contract: inputs are accepted (state
advanced), caller drains before pushing again, never retries
the same call. would_accept() predicate added for callers
that need a non-mutating check.
M1: Hard-split rule pinned to ceil(len / chunk_size) equal-length
parts. MergedChunk.subs becomes Vec<SubRange> with SubOrigin
tagging Vad{vad_seq} vs HardSplit{vad_seq, part, total_parts}.
Provenance is reconstructable.
M2: GapExceedsTolerance recovery path: Transcriber::restart_at(pts)
flushes cut, drains in-flight, re-anchors the buffer.
M3: GPU backends default worker_count = 1 (whisper.cpp serialises
on a single GPU). CPU defaults to num_cpus / 2.
M4: §6.3.2 step 7 indexing fix. Per-word frame ranges go into
Vec<Option<...>> of length n (= original_words.len()), not a
packed list. Words whose audio fell in the silence-mask region
remain None and are skipped in step 8.
M5: §4.2/§4.3 Word.text doc-comments fixed: original surface form
with punctuation/casing, recovered via the normalisation map in
§6.3.2 step 9. Drop the contradictory "lowercased and
punctuation-stripped" wording.
M7: AlignerKey::Any is registry-miss fallback only. Aligner
failures emit AlignmentFailed; no silent fall-through.
M8: §6.3.3 Mutex<Aligner> justification rewritten. Real reason is
&mut self requirement of ort::Session::run with a worker on a
separate thread, not "forward-looking multi-worker."
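The B11 backpressure contract (inputs always accepted, caller drains before pushing again, never retries the same call) can be sketched with a would_accept gate. The buffer internals here are invented for illustration; only the predicate's name and the accept-then-drain discipline come from the changelog.

```rust
struct SampleBuffer {
    samples: Vec<f32>,
    cap: usize,
}

impl SampleBuffer {
    /// Non-mutating admission check — the B11 would_accept() predicate.
    fn would_accept(&self, n: usize) -> bool {
        self.samples.len() + n <= self.cap
    }

    /// A push always advances state; the contract forbids retrying the
    /// same call, so callers gate on would_accept() and drain in between.
    fn push(&mut self, chunk: &[f32]) {
        self.samples.extend_from_slice(chunk);
    }

    /// Stand-in for the event-drain side freeing buffer space.
    fn drain(&mut self, n: usize) {
        let n = n.min(self.samples.len());
        self.samples.drain(..n);
    }
}
```

The point of the predicate is that a caller never learns about saturation by having an input rejected: it either checks up front or drains first.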
Two-timebase output model (response to user guidance):
§4.1 rewritten. whispery indexes internally at 1/16000 (matches
silero, soundevents, dia internally) but emits all public
TimeRanges in the caller's chosen output timebase. The first
push_samples Timestamp's timebase becomes the output timebase
for the Transcriber's lifetime; subsequent pushes must use the
same one. SampleBuffer tracks both base_pts_out (output tb) and
base_sample_idx (16 kHz internal); samples_to_output_range
converts via mediatime::Timebase::rescale_pts.
VadSegment refactored from { range: TimeRange } to
{ start_sample, end_sample } (silero-native 16 kHz indices). The
caller passes silero output verbatim; whispery does the analysis
to output-timebase conversion internally.
§1.6 dia integration adjusted: dia stays at 1/16000; the indexer
joins across timebases via mediatime's 128-bit cross-multiply.
Lossless for integer-ratio output rates (1/48000, 1/8000), exact
to sub-sample for non-integer ratios.
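The two-timebase conversion can be sketched with plain integer arithmetic. This is a standalone approximation of what the changelog attributes to mediatime's rescale_pts (flooring 128-bit cross-multiply); the function names and the anchor-plus-offset layout follow the text, but the signatures are invented.

```rust
// Floor-rescale a 16 kHz sample count into an output timebase with
// denominator `out_den`, using a 128-bit intermediate to avoid overflow.
fn rescale_from_16k(samples: i64, out_den: i64) -> i64 {
    ((samples as i128 * out_den as i128) / 16_000) as i64
}

/// Convert an internal 16 kHz sample range into output-timebase PTS,
/// always rescaling from the immutable anchor (the NB1 fix: no drift
/// accumulates across trims because the anchor never moves).
fn samples_to_output_range(
    anchor_pts_out: i64,
    out_den: i64,
    start_sample: u64,
    end_sample: u64,
) -> (i64, i64) {
    (
        anchor_pts_out + rescale_from_16k(start_sample as i64, out_den),
        anchor_pts_out + rescale_from_16k(end_sample as i64, out_den),
    )
}
```

For an integer-ratio output rate like 1/48000, one second of audio (16 000 samples) lands on exactly 48 000 PTS, which is the lossless round-trip property the text claims.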
P3 documentation cleanup:
- §2.1 added: honest scope of inherited conventions (vs the
earlier overclaim that Sans-I/O / mediatime / smol_str /
smallvec were sibling conventions). VadSegment is documented
as not isomorphic to silero's SpeechSegment.
- ort version pinned with `=2.0.0-rc.12` (Cargo's pre-release
semantics need the equality sign).
- Lang derives Hash, Eq, PartialEq, Clone (required for HashMap
key in AlignmentSet).
- §4.1 explicit on mediatime API: const nz() helper for
NonZeroU32; no Add/Sub operators on Timestamp; TimeRange::new
panics on negative span; rescale_to saturates.
- §4.5 documents the two distinct error channels: RunnerError
(synchronous structural) vs WorkFailure (async per-chunk).
- §6.4 worker hang protection: whisper-rs's
set_abort_callback_safe used for ASR cooperative timeout.
- §13.1 reduced from 1-day spike to 2-hour verification:
whisper-rs is officially thread-safe (confirmed via context7
query); WhisperState is owned in 0.13.x with no lifetime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rface holes
Verified the 3 NEW BLOCKERs the round-2 audit found and applied
the proposed fixes; verified mediatime source confirms M-α
(IntervalTree<TimeRange, ...> is fiction); applied all 9 net-new
MAJORs and the small-item sweep.
NB1 — §5.4 SampleBuffer no longer recomputes base_pts on trim.
Layout becomes (output_tb, base_pts_out_anchor: i64 immutable,
absolute_sample_offset: u64 monotonic, buffer_drop_offset: u64
monotonic, samples). samples_to_output_range always rescales
from the immutable anchor against the absolute sample index, so
truncation error is bounded at ±1 PTS regardless of trim history.
NTSC and MPEG-TS-style non-integer-ratio rates no longer
accumulate drift.
NB2 — §6.4.1 dispatch loop now correctly returns
RunnerError::Backpressure when block_on_full_queue=false and a
try_send returned Full. drive_one_step's signature changed from
Result<(), RunnerError> to Result<bool, RunnerError> so the
saturation wait loop has a real "made progress" signal (M-η).
The block_on_full_queue=true path uses crossbeam_channel::select!
with a 10 ms safety default to wait for any worker result before
retrying.
NB3 — §5.3 hard-split formula pinned to the per-index form:
start_i = seg.start + (i × len) / n
end_i = start_{i+1} for i<n-1; end_{n-1} = seg.end
This guarantees max part length = ceil(len/n) ≤ chunk_size_samples
and parts differ by at most 1 sample. The naive
floor(len/n)+remainder reading (which produces 9,9,11 for len=29
chunk_size=10) is explicitly disallowed.
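The per-index formula transcribes directly into code. A hedged sketch (function name and signature are invented; the arithmetic is NB3's):

```rust
/// Hard-split [start, end) into n = ceil(len / chunk_size) parts using
/// start_i = start + (i * len) / n, so parts differ by at most 1 sample
/// and the max part length is ceil(len / n) <= chunk_size.
fn hard_split(start: u64, end: u64, chunk_size: u64) -> Vec<(u64, u64)> {
    let len = end - start;
    let n = len.div_ceil(chunk_size); // number of parts
    (0..n)
        .map(|i| {
            let s = start + (i * len) / n;
            let e = if i + 1 < n { start + ((i + 1) * len) / n } else { end };
            (s, e)
        })
        .collect()
}
```

For len = 29 and chunk_size = 10 this yields parts of 9, 10, and 10 samples, never the disallowed 9, 9, 11 of the naive floor-plus-remainder reading.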
M-α — Verified mediatime/src/lib.rs at line 436 (TimeRange has
Eq/Hash but no Ord/PartialOrd) and line 411 (only Timestamp has
Ord, via cmp_semantic at lib.rs:303-328). The IntervalTree path
in §1.6 was fiction; replaced with two real paths (canonicalise
to whispery's output tb, or canonicalise to dia's 1/16000) plus
a "roll your own newtype using cmp_semantic" note. The §1.6
indexer pseudo-code rewritten using BTreeMap<i64, _> keyed on
canonicalised end PTS rather than the nonexistent IntervalTree.
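The BTreeMap-based join can be sketched as follows. This is an illustrative interpretation of the rewritten §1.6 pseudo-code: spans canonicalised to one timebase, keyed on end PTS, with a half-open overlap query; the Span shape and function names are invented.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Unbounded};

#[derive(Debug, Clone, Copy, PartialEq)]
struct Span { start: i64, end: i64, speaker: u32 }

/// Spans whose half-open range overlaps [qs, qe). Keys are canonicalised
/// end PTS: any span with end > qs is a candidate, and candidates whose
/// start < qe overlap the query.
fn overlapping(map: &BTreeMap<i64, Vec<Span>>, qs: i64, qe: i64) -> Vec<Span> {
    map.range((Excluded(qs), Unbounded))
        .flat_map(|(_, spans)| spans.iter())
        .filter(|s| s.start < qe)
        .copied()
        .collect()
}
```

Unlike an interval tree, the scan over ends past qs is linear in the candidate count, which is fine for diarization-density span sets and avoids depending on an Ord that TimeRange does not have.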
M-β — Added restart_at, would_accept, unpoll_command, and
output_timebase to Transcriber's §5.1 public surface. All four
were referenced by the rest of the doc as if they existed but
were missing from the impl block.
M-γ — push_vad_segment doc-comment now states it returns
TranscriberError::OutputTimebaseUnset before the first
push_samples. Added the OutputTimebaseUnset variant to
TranscriberError, plus InconsistentTimebase for cross-timebase
push attempts.
M-δ — §5.4 PTS regression check moved to output-PTS space. delta
is computed as starts_at.pts() - expected_pts_out (in output_tb);
only the zero-fill width converts back to 16k samples. This
eliminates the spurious-PtsRegression failure mode on
non-integer-ratio output timebases where contiguous caller pushes
appear to regress because of round-trip truncation.
M-ε — §5.4.1 restart_at explicitly documents that next_chunk_id
is NOT reset; chunk_id continuity is preserved so in-flight
chunks from before the gap still emit normally. The first
post-restart chunk has chunk_id one larger than the last
pre-restart chunk.
M-ζ — Trim's low-water computation (§5.5) now reads from
cut_pending only, not in_flight. In-flight chunks have their
audio in their own Arc<[f32]> and are decoupled from the live
buffer, so trim doesn't need to consider them. After restart_at,
cut_pending is empty (Cut was flushed), so trim's low-water is
the buffer's high-water — the entire empty buffer is eligible
for drop. This eliminates stale-anchor PTS values poisoning trim
across a restart.
M-η — drive_one_step now returns Result<bool, RunnerError>; the
caller's saturation wait loop has a real progress signal to drive
the spin condition.
M-θ — §6.3.2 word_idx_per_token changed to Vec<Option<usize>>.
wav2vec2 word-delimiter (|), <unk>, and special tokens get None
and are skipped during the Viterbi walk in step 7. Step 7 also
explicitly notes it skips blank-emitting frames (CTC blank token).
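The M-θ mapping can be sketched directly. The token list and word-splitting convention are illustrative assumptions (standard wav2vec2 vocabularies use | as the word delimiter); the Option-per-token shape is the changelog's.

```rust
/// Map each wav2vec2 vocabulary token to Some(word_idx), or None for the
/// word delimiter and special tokens, so the Viterbi walk can skip them.
fn word_idx_per_token(tokens: &[&str]) -> Vec<Option<usize>> {
    let mut word = 0usize;
    let mut in_word = false;
    tokens
        .iter()
        .map(|t| match *t {
            // Word delimiter: close the current word if one was open.
            "|" => {
                if in_word {
                    word += 1;
                    in_word = false;
                }
                None
            }
            // Special tokens carry no word identity.
            "<unk>" | "<pad>" | "<s>" | "</s>" => None,
            // Ordinary character token: belongs to the current word.
            _ => {
                in_word = true;
                Some(word)
            }
        })
        .collect()
}
```

The output length equals the token count, matching the M4 requirement that per-word structures stay index-aligned rather than packed.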
M-ι — §6.4.3 worker hang protection got recycle hysteresis. CPU
backends recycle WhisperState on every timeout (cheap); GPU
backends require timeout_streak_threshold consecutive timeouts
(default 3) before recycling, because GPU create_state allocates
KV-cache buffers and stalls the entire pool when worker_count=1.
WhisperPoolConfig.timeout_streak_threshold is now in §8.
M-κ — AsrParams docs explicitly note the layered-ladder concern.
Whisper.cpp has its own internal temperature ladder
(temperature_inc, max_decoding_failures); the runner must set
temperature_inc=0 and max_decoding_failures=1 inside FullParams
so the runner's outer ladder is the sole authority. The
spec acknowledges that if whisper-rs 0.13.x doesn't expose those
setters, the runner pins initial_temperature=0 and the AsrParams
temperature fields are advisory; resolved during the §13.1
verification.
Smaller items:
- §8 defaults table: max_queued_chunks default = worker_count + 4;
timeout_streak_threshold default 1 (CPU) / 3 (GPU).
- §5.3 step 3 pins current_end ≥ current_start invariant
explicitly (set current_end = sub.start when current_start is
initialised), eliminating the underflow guard reliance.
- §6.4.1 drain() now reuses the saturation select-loop with a
10 ms default and exits on idle.
- §5.6 full_params_from now shows the set_abort_callback_safe
wire-in for worker hang protection.
- §6.1 builder section now shows the canonical
WhisperContext::new_with_params construction so reviewers can
see where the actual whisper-rs API lives.
- §4.3 Word.text doc-comment notes that silence-masked words are
absent from Transcript.words (so words[].text joined is not
guaranteed to equal Transcript.text modulo whitespace).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng variants
NB-α — §5.4.1 restart_at now resets BOTH absolute_sample_offset
and buffer_drop_offset to 0. The v3 draft carried
absolute_sample_offset forward as the new buffer_drop_offset,
which made expected_pts_out = anchor + rescale(carried_offset)
on the very push that was supposed to recover the gap. The
first post-restart push tripped PtsRegression every time. With
both offsets reset, the first push computes
expected_pts_out = starts_at.pts() + rescale(0) = starts_at.pts()
exactly, matching the caller. Pre-restart in-flight chunks
already hold their audio in their own Arc<[f32]>s
(extracted at chunk-creation time per §5.4.1 step 5), so
resetting indices is safe.
NB-β — §6.4.1 saturation wait no longer uses
crossbeam_channel::select! with empty arm bodies. select!'s recv
arms perform the actual receive and bind the message to the arm
body; an empty `=> {}` body silently drops the result and the
next drive_one_step's Phase 1 try_recv would find the channel
empty — silent data loss exactly on the saturation path.
Replaced with crossbeam_channel::Select::ready_timeout, which
signals readiness without consuming. drive_one_step's Phase 1
try_recv then drains the now-ready message normally.
NB-γ — RunnerError gains a Backpressure { buffered, cap }
variant that §6.4.1 was constructing without it being declared.
The reviewer's claim that the entire RunnerError enum was
missing was incorrect (it was at line 610 throughout) but the
specific variant referenced by drive_one_step's Backpressure
return path was indeed undeclared.
W3 — §5.1 surface gains pub fn next_expected_starts_at(&self)
-> Option<mediatime::Timestamp>. Strictly-contiguous callers
should use this authoritative value instead of computing their
own per-packet running sum, because per-packet
mediatime::Timebase::rescale_pts truncates and a sequence of
per-packet rescales does not always equal one rescale of the
cumulative sample count (the regression check is anchored on
the cumulative form). Without the accessor, callers on
non-integer-ratio output timebases (1/30001, MPEG-TS, NTSC)
would eventually drift by ±1 PTS and trip a spurious
PtsRegression after enough packets.
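The truncation drift that motivates next_expected_starts_at is easy to demonstrate numerically. A hedged sketch with invented helper names (the flooring arithmetic mirrors what the text attributes to rescale_pts):

```rust
// Floor-rescale a 16 kHz sample count to an output timebase denominator.
fn rescale(samples: i64, out_den: i64) -> i64 {
    ((samples as i128 * out_den as i128) / 16_000) as i64
}

/// Difference between rescaling the cumulative sample count once and
/// summing per-packet rescales — nonzero on non-integer-ratio timebases.
fn drift_after(packets: i64, packet_len: i64, out_den: i64) -> i64 {
    let per_packet_sum = packets * rescale(packet_len, out_den);
    let cumulative = rescale(packets * packet_len, out_den);
    cumulative - per_packet_sum
}
```

With a 1/30001 output timebase, 1000-sample packets each rescale to floor(1875.0625) = 1875 PTS, so after 16 packets the per-packet running sum reads 30 000 while the cumulative rescale reads 30 001: one PTS of drift, exactly the spurious-PtsRegression trigger the accessor prevents. Integer-ratio timebases like 1/48000 show zero drift.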
M-κ — full_params_from now sets set_temperature(attempt_temp),
set_temperature_inc(0.0), set_max_decoding_failures(1) so the
runner's outer ladder is the sole authority. Each state.full()
call is one decoding attempt at exactly the runner-supplied
temperature; whisper.cpp's internal ladder is disabled.
run_with_temperature_ladder updated to pass attempt_temperature
into full_params_from on each retry. The AsrParams
log_prob_threshold doc-comment rewritten to reflect that the
suppression is wired (not a future concern), with a
compatibility note that the §13.1 verification spike confirms
the precise setter names; if a future whisper-rs renames or
removes them, the implementation falls back to a single attempt
per chunk and downgrades the temperature fields to advisory.
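The runner's outer ladder can be sketched with the whisper call abstracted away. The function name run_with_temperature_ladder and the AsrParams field names come from the changelog; the closure-based shape is an assumption (real code would rebuild FullParams per attempt with set_temperature(t) and set_temperature_inc(0.0)).

```rust
struct LadderParams {
    initial_temperature: f32,
    temperature_increment: f32,
    max_attempts: u32,
}

/// Outer temperature ladder: one decode attempt per runner-supplied
/// temperature, with whisper.cpp's internal ladder disabled so this
/// loop is the sole authority. `decode` returning None models a failed
/// attempt (thresholds tripped).
fn run_with_temperature_ladder<F>(p: &LadderParams, mut decode: F) -> Option<String>
where
    F: FnMut(f32) -> Option<String>,
{
    let mut t = p.initial_temperature;
    for _ in 0..p.max_attempts {
        if let Some(text) = decode(t) {
            return Some(text);
        }
        t += p.temperature_increment;
    }
    None // all attempts failed; runner emits a per-chunk failure
}
```

Each state.full() call in the real runner corresponds to one decode(t) here, which is the B6 design of carrying initial_temperature / temperature_increment / max_attempts on AsrParams rather than a schedule.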
§8 defaults table — removed three stale rows (beam_size,
temperature_schedule, condition_on_previous_text) and added
twelve accurate ones for the v3 AsrParams shape: strategy,
initial_temperature, temperature_increment, max_attempts,
no_speech_threshold, log_prob_threshold,
compression_ratio_threshold, no_context, suppress_blank,
suppress_non_speech_tokens, n_threads, initial_prompt. The
no_context row explicitly documents the polarity inversion vs
WhisperX's condition_on_previous_text.
§12 stale AsrParamsOverride field list now reads
{ language_hint, strategy, initial_temperature, initial_prompt }
matching §6.1.
§13.4 cross-timebase wording fixed: now says "mediatime exposes
128-bit cross-multiply via Timestamp::cmp_semantic only —
TimeRange has no Ord — so the indexer canonicalises ranges to
one timebase up front" rather than the v3 draft's claim that
mediatime supports the cross-timebase join "directly via 128-bit
cross-multiply" (which contradicted the §1.6 rewrite).
§4.3 Word.range doc-comment now describes partial-alignment
semantics: when silence-aware alignment drops words, surviving
words have a `range` covering only frames the Viterbi path
attributed to them (not adjacent words' frames or masked-region
frames), and the `text` is still the full original surface form
even if alignment only saw a fraction of the audio.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… c_int
Latent bug — §5.4.1 restart_at now drains cut_pending in step 1
before clearing the buffer in step 3. cut_pending lives on the
dispatch state machine (not the cut state machine that Cut::flush()
drains), so without an explicit drain its entries would survive
into the post-restart frame carrying old-anchor SampleRange
indices. The next trim() would then compute a stale low-water, and
the next promotion attempt would extract against indices that no
longer exist in the cleared buffer — hard panic. The drain promotes
each queued entry to in_flight (extract samples in old-frame
indexing, cache TimeRange against the old anchor, build
ChunkRecord). The drain-on-restart path is allowed to temporarily
exceed max_in_flight; restart_at is a one-time event, and dropping
the chunks would lose transcription work the caller had already
paid for. The bug existed in v3 too (cut_pending was already
there), but v4's cleaner re-anchor logic surfaced it.
NIT 1 — §5.6's "fallback" wording rewritten as compile-time
guidance. If a future whisper-rs is found (during §13.1
verification) to lack set_temperature or set_temperature_inc, the
runner module must be built with an explicit alternative
implementation that omits those calls — a compile-time situation
requiring a code-path alternative, not a runtime fallback. In that
alternative path the AsrParams temperature fields become advisory.
set_max_decoding_failures wired as best-effort. Context7
verification confirms set_temperature and set_temperature_inc in
whisper-rs 0.13.x but does not list set_max_decoding_failures.
Since whisper.cpp expands its internal temperature ladder only
when temperature_inc > 0 — with inc = 0 the ladder collapses to a
single attempt at the initial temperature — max_decoding_failures
is a secondary safeguard. The doc-comment on log_prob_threshold
and the inline comment in full_params_from now describe this
explicitly; if §13.1 finds the setter absent, omitting that line
is a behavioural no-op.
§5.6 AsrParams.n_threads type changed from i32 to
std::os::raw::c_int to match whisper-rs's set_n_threads parameter
exactly. Documented inline.
§6.4.1 added a one-line comment about
crossbeam_channel::Select::ready_timeout's documented spurious-wake
behaviour. The pattern is harmless under a spurious return (the
next try_recv returns Empty and the outer loop spins) but worth
forestalling future confusion.
§12 future-work entry on AsrParamsOverride now explains the
intentional minimality: a per-packet caller can shift the ladder's
starting point (initial_temperature) but not reshape it
(temperature_increment, max_attempts); reshaping requires
rebuilding the runner with asr_params(...). The constraint is
documented as deliberate so future additions don't accidentally
break the contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IB1 — §5.1 Transcriber surface gains per-method `Errors:` blocks
listing the exact TranscriberError variants each method can return.
Removes the implementer-guesses-and-tests-lock-it-in failure mode.
push_samples documents PtsRegression / GapExceedsTolerance /
Backpressure / InconsistentTimebase / AfterEof. push_vad_segment
documents OutputTimebaseUnset / PtsRegression{kind:VadSegment} /
AfterEof and pins the strict-monotonic invariant on segment
ordering. signal_eof, restart_at, inject_*_result, inject_failure
get explicit error contracts.
IB2 — §5.5 invariant 4 carves an explicit exception for
restart_at's cut_pending drain: the invariant prohibits exceeding
max_in_flight in steady state, but restart_at is allowed to push
above the cap by however many entries were queued in cut_pending
at restart time. The exceedance is bounded by max_queued_chunks
and decays as workers drain normally. Trim's promotion guard is
documented as suspended for the duration of the drain. Required
because the alternative (dropping queued chunks) loses paid-for
transcription work.
§3.3 — re-exports mediatime::{Timebase, Timestamp, TimeRange} so
consumers don't need a separate mediatime dep just to name them.
InvalidLang added to the public exports.
§5.1 — unpoll_command's visibility narrowed to pub(crate) (was
pub with prose "runner-only"). Same crate as the runner module,
so pub(crate) is the right gate. Out-of-tree consumers driving
the state machine themselves don't need this affordance.
§6.1 — process_packet documents the vad_segments contract
explicitly: strictly monotonically increasing start_sample, no
overlap, no duplicates; violations surface as PtsRegression.
Empty packet (samples.is_empty()) handling documented as a no-op
when delta_pts_out is 0.
§5.3 / VadSegment::new — panics on `end_sample <= start_sample`
(strict inequality). Zero-duration VAD segments would produce
zero-length MergedChunks downstream which break alignment;
silero never produces them anyway.
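The VadSegment constructor contract above can be sketched directly; the struct shape follows the §4.1 refactor to silero-native 16 kHz sample indices, while the method names beyond new are illustrative.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VadSegment {
    start_sample: u64,
    end_sample: u64,
}

impl VadSegment {
    /// Panics when `end_sample <= start_sample` (strict inequality):
    /// zero-duration segments would produce zero-length MergedChunks
    /// downstream, which break alignment.
    pub const fn new(start_sample: u64, end_sample: u64) -> Self {
        assert!(end_sample > start_sample, "zero-duration VAD segment");
        Self { start_sample, end_sample }
    }

    /// Segment length in 16 kHz samples.
    pub const fn len_samples(&self) -> u64 {
        self.end_sample - self.start_sample
    }
}
```

Panicking inside a const fn is fine here: assert! in const contexts has been stable since Rust 1.57, well under the crate's stated MSRV.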
§6.2 — WhisperPoolConfig.dispatch_idle_poll exposes the previously-
magic 10ms saturation-wait timeout. §6.4.1 saturation wait now
references whisper_pool_config.dispatch_idle_poll instead of the
hard-coded value.
§6.4.3 — added a clarification on streak-vs-chunk_id correspondence:
each timeout is one chunk_id; recycle does not retroactively retry
timed-out chunks (they emit as Event::Error and decay out of
in_flight); streak counting is per-worker, not per-chunk_id.
§4.5 — InvalidLang struct added (returned by Lang::from_iso639_1).
§10 — added §10.4 "v3-v5 regression tests" (nine named tests covering
the specific defects caught during review rounds: PTS drift on
non-integer-ratio TBs, saturation-wait result preservation,
restart_at cut_pending drain, next_expected_starts_at correctness,
layered-ladder suppression, unpoll_command round-trip, empty-packet
handling, zero-duration VadSegment panic, output-PTS-space
regression check). The CI matrix subsection was renumbered to §10.5.
§13.1 — verification bullet 4 rewritten. The original "confirm
threshold setter names" wording was already stale by v4; the
revised bullet pins the actual setters whose presence the v6
implementation relies on (set_temperature, set_temperature_inc,
and the best-effort set_max_decoding_failures), with the
compile-time fallback path described in §5.6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lang is now a typed enum over whisper.cpp's supported language
set, marked #[non_exhaustive] for forward compat without
semver-major bumps, with a fallback Other(SmolStr) variant for
unknown ISO codes that whisper auto-detect surfaces.
§4.4 — replaced the SmolStr newtype declaration with the enum.
99 named variants plus Other; full variant list deferred to
Appendix C to keep §4.4 readable. Documented the canonicalisation
invariant: from_iso639_1 maps known codes to named variants and
NEVER produces Other for an enum-known code, so structural
PartialEq/Hash stay correct (Lang::En != Lang::Other("en") is
fine because no API path can construct Lang::Other("en")).
Discussed why Other(SmolStr) is preferred over
Result<Lang, InvalidLang>: an indexing engine processing
thousands of hours shouldn't fail a chunk on an unfamiliar
language code; Other lets the transcript flow through and the
indexer logs the unusual SmolStr.
Discussed why #[non_exhaustive] in addition to Other: when
whisper.cpp adds a new language, inserting the named variant
later is not a breaking change; external matches already need
a `_` arm so their compiled code keeps working. Variant just
stops appearing under Lang::Other and starts appearing under
its named form.
§4.5 — InvalidLang type removed. from_iso639_1 is now a total
function returning Lang directly, no Result. Documented inline
why InvalidLang is gone for future readers.
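The §4.4 contract can be sketched with a three-variant stand-in for the 99-variant enum (String standing in for SmolStr). The canonicalisation invariant is the load-bearing part: from_iso639_1 is total and never produces Other for an enum-known code.

```rust
/// Three-variant stand-in for the full Appendix C enum.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
#[non_exhaustive]
pub enum Lang {
    En,
    De,
    Yue, // Cantonese — one of the non-2-letter codes
    /// Unknown ISO code surfaced by whisper auto-detect.
    Other(String),
}

impl Lang {
    /// Total function: known codes map to named variants, unknown codes
    /// to Other. No path produces Other("en"), so structural Eq/Hash
    /// stay correct.
    pub fn from_iso639_1(code: &str) -> Lang {
        match code {
            "en" => Lang::En,
            "de" => Lang::De,
            "yue" => Lang::Yue,
            other => Lang::Other(other.to_owned()),
        }
    }

    pub fn as_str(&self) -> &str {
        match self {
            Lang::En => "en",
            Lang::De => "de",
            Lang::Yue => "yue",
            Lang::Other(s) => s,
        }
    }
}
```

Adding a new whisper.cpp language later means a code that used to round-trip through Other starts returning its named variant; #[non_exhaustive] means external matches already carry the `_` arm that absorbs this.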
§3.3 — InvalidLang re-export dropped from the public surface.
Appendix B — item 1 (Lang typed-enum vs SmolStr deferred decision)
moved to "Resolved" with a pointer to §4.4 / Appendix C.
Appendix C added — full Lang variant list (99 named variants
plus Other), with naming notes for the few non-2-letter codes
(Haw for Hawaiian, Yue for Cantonese, Jw for whisper.cpp's
non-modern Javanese code) and a one-line description of how
new whisper.cpp languages are added (insert alphabetically,
add as_str / from_iso639_1 match arms, no other code paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time SemVer caveat
Three small additions following the round-6 final audit verdict
of "PR-ready":
1. §5.1 VadSegment::new doc-comment now notes that panic in
const fn is stable on Rust ≥ 1.57, well below the crate's
≥ 1.85 MSRV inherited from siblings. Pre-emptive: removes
the implementer's "wait, is this actually allowed?" check.
2. §10.4 regression tests gain an explicit Lang canonicalisation
invariant test. For every named variant V in Appendix C,
assert Lang::from_iso639_1(V.as_str()) == V AND the result
does not match Lang::Other(_). Plus the symmetric
pure-Other round-trip ("zzz" → Lang::Other("zzz")). Pins
the §4.4 contract that no API path produces Lang::Other(s)
for an enum-known code s.
3. §3.3 mediatime re-export block now documents the SemVer
coupling: a mediatime major bump is automatically a whispery
major because the public API names mediatime types directly.
The mediatime dependency stays pinned to one major in
Cargo.toml; coordinated bumps when needed.
After 6 rounds of multi-agent adversarial review, the design has
converged. Architecture sound, contracts honored, dispatch loop
deadlock-free, buffer arithmetic drift-free, alignment correct,
dia integration verified, every public method has a documented
error contract, every load-bearing API claim verified against
real codebases or live docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-emptively close the doc-comment MINORs the round-6 audit
flagged as appropriate for the first implementation PR. Fixing
now in the spec costs ~15 lines and saves the implementer the same
set of "spec is silent about X" questions later.
§5.1 — Transcriber documented as Send + !Sync. Multi-threaded
drivers must wrap in Mutex; whispery does not provide internal
synchronisation.
§4.5 — WorkFailure now derives Clone + Debug explicitly. Clone is
needed because the dispatch state machine moves the failure into
Event::Error while runner-side code may also want to log it; the
contained String is the only non-trivial allocation.
§6.2 — AsrWorkItem.asr_timeout provenance documented inline:
sourced from the runner builder's worker_timeouts(asr, _), stamped
per-job at dispatch, fed into the abort_flag watchdog.
§6.3 — Aligner.blank_token_id documented: read from the wav2vec2
tokenizer's special-tokens map at Aligner::from_paths time
(standard <pad> / [PAD] convention; future builder method for
non-standard models).
§5.1 — is_idle() doc-comment expanded: enumerates the queues that
must all be empty, and explicitly notes that pre-restart in_flight
chunks keep is_idle() false until they emit (restart_at does not
synthetically clear them).
§6.4.1 — Disconnected-channel behaviour of Select::ready_timeout
documented: closed worker channels wake the wait, drive_one_step's
next try_recv returns Disconnected, dispatch surfaces
WhisperPoolShutdown. No silent stall on worker panic.
§10.4 — extended the unpoll_command test (M12) with a
park-and-resume case that exercises the full wake/select cycle:
saturate work_tx, park via unpoll, fire result, assert
ready_timeout wakes within dispatch_idle_poll, assert next
drive_one_step lands the parked command without losing the result.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§11 working-memory bullet now notes that the total scales roughly
linearly with worker_count via the per-worker decoder workspace
term (~20 MiB / worker on tiny), with worked examples for 8 and
16 workers. Model weights and the alignment-logits buffer are
called out as independent of worker_count so readers don't
extrapolate the wrong scaling onto them.
§5.6 AsrParams.n_threads doc-comment now adds the SemVer caveat:
c_int is i32 on every supported platform today, so the alias is a
no-op. The alias exists to track whisper-rs's signature exactly;
an upstream type change (or a future c_int variation) would
propagate as a breaking change here, requiring a coordinated
whispery major.
After 7 rounds of multi-agent adversarial audit, every
documentation item flagged across rounds 1-7 has been addressed in
the spec or explicitly deferred with rationale. The design has
converged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation plan covering whispery's no-ML-deps foundation: public types (Transcript, Word, Lang, errors, VadSegment), the Sans-I/O core module (cut state machine, sample buffer, dispatch state machine, Transcriber), benches, and an end-to-end mocked-backend integration test. 25 tasks total, organised into 11 sections. Each task has 3-5 TDD-style steps with concrete code, exact commands, and explicit expected outputs. After Plan A merges, whispery is a usable state-machine crate that can be driven via examples/core_only.rs; Plan B will add the whisper-rs runner and Plan C the alignment pipeline. Spec ref: docs/superpowers/specs/2026-04-28-whispery-cut-batch-whisper-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan A scope: core deps only (mediatime, smol_str, thiserror, smallvec). Plan B will add the runner feature pulling whisper-rs and crossbeam-channel; Plan C will add alignment pulling ort, tokenizers, ndarray. Spec: docs/superpowers/specs/2026-04-28-whispery-cut-batch-whisper-design.md §3.2.
Code-quality review found that smol_str declared without default-features=false silently links its std feature even when a consumer enables whispery with --no-default-features, breaking the no_std promise lib.rs claims via #![cfg_attr(not(feature = "std"), no_std)]. Add default-features=false on smol_str and propagate smol_str/std through whispery's std feature. thiserror has the same structural pattern but its std feature is empty in 2.x so there is no current correctness impact; deferred. Spec: §3.2.
Wires up the module tree: time, types, core. Crate-level lints match the silero convention (deny missing_docs, forbid unsafe_code). Modules are empty placeholders that subsequent tasks fill in. Spec: §3.1, §3.3.
Forward fix flagged by Task 2 code review: when types code lands in Task 8 (errors via thiserror) and Task 10 (AsrParams uses SmallVec), the no_std build path will need these defaults off. Applying now while the manifest is fresh saves a fix-up commit later. thiserror v2's std feature is currently empty so this is a behaviour-no-op today; smallvec works on alloc-only out of the box. Spec: §3.2.
The analysis timebase is internal-only; whispery's public output uses whatever timebase the caller's first push_samples Timestamp carries. Spec: §4.1.
Implementation Plan §4. Spec: §4.2 (Transcript.chunk_id), §5.5 (monotonicity contract — id increments on Event::Error too).
Spec §4.1.3: 16 kHz sample-index input, panic on zero or negative duration. silero never produces zero-duration segments; panic catches programmer error at the boundary.
Variant table from whisper.cpp's g_lang. as_str() for the known→ISO direction. The reverse direction (from_iso639_1) and the canonicalisation invariant test land in the next task. Added #[allow(missing_docs)] on the enum to satisfy the crate's #![deny(missing_docs)] — the 99 variant names are ISO 639-1 codes and are self-documenting by name. Spec: §4.4, Appendix C.
Plan A Task 6 code review caught an internal inconsistency in the spec: the prose at §Appendix C said "99 named variants" but the code-block table immediately above lists 100 (10 rows × 10). Hand-counted whisper.cpp v1.7.6's g_lang: 100 supported languages. The table is correct; the prose was wrong. This commit fixes only the prose. The plan file's Task 7 test fixture (which asserts known.len() == 99) is corrected when Task 7's implementer is dispatched. Spec: §4.4, Appendix C.
Total-function constructor: every &str produces a Lang. Known codes canonicalise to named variants, unknowns go to Other. Test verifies Lang::from_iso639_1(V.as_str()) == V for every named variant AND that the result is never Lang::Other(_). Spec: §4.4, §10.4 (Lang canonicalisation invariant test).
Two channels: synchronous TranscriberError on push/inject paths, asynchronous WorkFailure via Event::Error. WorkFailure derives Clone + Debug so the dispatch loop can move it into Event::Error while runner code logs it. Spec: §4.5.
Crate-private constructors used by the dispatch state machine (Transcript::new) and alignment pipeline (Word::new). A test-only `for_test` module gives downstream tests concise helpers for synthesising Transcripts and Words. Spec: §4.2, §4.3.
Backend-agnostic: no whisper-rs names in any of these types. The
runner translates AsrParams into FullParams in Plan B. Default
values match WhisperX (BeamSearch{5,-1.0}, no_context=true,
temperature ladder 0.0/+0.2 × 6 attempts).
Spec: §3.4 (backend invariant), §5.6.
Two cosmetic fixes after Task 10's commit: 1. Cargo.toml gains empty `runner = []` and `alignment = ["runner"]` features. The cfg(feature = "alignment") branches in src/core/command.rs were emitting `unexpected_cfgs` warnings on every build because the feature wasn't declared. Plan B/C will replace these stubs with real feature payloads. 2. The `pub(crate) type ChunkAudio = Arc<[f32]>` alias is consumed by the dispatch state machine in Task 16; allow(dead_code) until then. Spec: §3.2.
… Cut) State struct only — push_segment and flush land in the next task with their tests. Crate-private; nothing here crosses the public surface. Spec: §5.3.
Per-index split formula: start_i = seg.start + (i*len)/n with the
last part absorbing the remainder by setting end_{n-1} =
seg.end_sample. Guarantees max part length ≤ chunk_size_samples;
the audit's pathological case (len=29, chunk_size=10, n=3) now
produces 9/10/10, never 9/9/11.
Tests cover: empty flush, single short segment, multi-segment
merge, multi-segment with mid-stream flush, over-long single
segment with three-way split, and the strict-bound regression
test for the audit case.
Spec: §5.3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
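The per-index split formula above can be sketched as a small standalone helper (hypothetical free function; the real cut state machine operates on richer chunk descriptors):

```rust
/// Split a [start, end) sample range into `n` parts using
/// start_i = start + (i * len) / n, with the last part absorbing the
/// rounding remainder. Illustrative sketch of the formula described above.
fn split_segment(start: u64, end: u64, n: u64) -> Vec<(u64, u64)> {
    assert!(end > start && n > 0);
    let len = end - start;
    (0..n)
        .map(|i| {
            let s = start + (i * len) / n;
            let e = if i + 1 == n { end } else { start + ((i + 1) * len) / n };
            (s, e)
        })
        .collect()
}

fn main() {
    // The audit's pathological case: len = 29 split three ways.
    let parts = split_segment(0, 29, 3);
    let lens: Vec<u64> = parts.iter().map(|(s, e)| e - s).collect();
    assert_eq!(lens, vec![9, 10, 10]); // never 9/9/11
    // Every part respects the strict bound ceil(29 / 3) = 10.
    assert!(lens.iter().all(|&l| l <= 10));
}
```

Because each part length is floor((i+1)·len/n) − floor(i·len/n) ≤ ceil(len/n), choosing n = ceil(len / chunk_size) keeps every part within chunk_size.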
…gth))
Codex round-23 [high] finding: `backtrack_beam` cloned each
beam's full `Vec<PathPoint>` on every stay/change branch. With
`beam_width = 2` and 4 branches per outer iteration over `T`
frames, total path-copy cost was O(T²) — ~72 MB of memory
churn for T=1500 (a 30 s chunk), ~3.2 GB for T=10000. The
trellis budget caps `T * num_tokens` at 32 M cells, so a
long-`T`/modest-`num_tokens` case can pass that gate while
still drowning the alignment worker in path-clone allocation
and surfacing as a watchdog timeout instead of a bounded
result.
Replaced the per-beam `Vec<PathPoint>` with a flat arena of
`BeamNode { token_index, time_index, score, point_score, prev:
Option<u32> }`. Active beams are now `Vec<u32>` indices into
the arena. Each branch is O(1): push one new node + record its
predecessor index. The path is reconstructed once at the end
by walking the predecessor chain from the winning beam back to
the seed (O(T) work, O(T) allocation, total).
Memory bound: arena grows by at most `beam_width * 2` per
iteration × T iterations = ~4T nodes worst case. At T=1500 /
beam=2 that's ~96 KB; at T=10000 it's ~1 MB. Linear in T
instead of quadratic.
Path reconstruction uses the natural ascending-time order from
the chain walk (winner has the smallest time, seed has the
largest). The leading-blank fill (frames `[0, winner.t)` →
token-0 + blank emission, WhisperX visualisation convention)
is prepended in ascending time too, so no final reverse is
needed. The previous build-then-reverse code was an artefact
of the path being constructed in descending-time order during
the cloning approach.
Also dropped the now-unused `PathPoint` struct (replaced
end-to-end by `BeamNode` internally and `PathPointPublic` at
the API boundary).
Verification:
- Testing caught an early bug where prepending the leading-blank
fill to a vector that was about to be reversed put those
frames in the wrong relative position. Round-trip path order
is pinned by the existing `backtrack_beam_simple_two_token_path`
test (T=3, tokens=[1,2]: path[0].t=0, path[2].t=2).
- 282 lib tests pass.
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998,
below_0.5=0 across 854 word pairs (bit-identical to the
pre-refactor numbers — confirms the algorithmic semantics
are preserved).
… reject Two findings from `/codex:adversarial-review --base main`: [medium] `build_speech_frames` did not clamp sub-segment bounds to `[0, n_samples]`, while `build_speech_mask` did. With the WhisperX effective ratio's last frame interval `[(T-1)*spf, T*spf)` extending just past `n_samples`, an overshoot sub-segment could "credit" the trailing frame with phantom-sample overlap — marking it as speech against audio that doesn't exist, while step 0's silence-mask had nothing to zero in that region. `compose_words` then accepts words at `coverage >= 0.5` and only clamps the emitted end timestamp, so a word spanning the last real frame plus the overshoot frame can survive over mostly silent / nonexistent audio. Fix: pass `n_samples: u64` into `build_speech_frames` and clamp `seg_start` / `seg_end` to `[0, n_samples_i64]` exactly like `build_speech_mask` does. Both helpers now agree on the same coordinate contract. Added `build_speech_frames_clamps_overshoot_seg_to_chunk_end` regression covering full-overshoot and partial-overshoot sub-segments — both must yield `false` on the trailing frame. [medium] `WhisperPoolOptions::with_worker_count(0)` / `set_worker_count(0)` were silently accepted, and `WhisperPool::new` then spawned no worker threads (`for _ in 0..0`) while still binding the work channel. Chunks would enter `in_flight` with no receiver capable of producing results, stalling the pump/drain loop until the configured timeout fires — a fatal stall instead of a clear construction error. Two-layer defence: - Setter (`set_worker_count`) and builder (`with_worker_count`) panic on `value == 0`. Same pattern as `Aligner::set_sample_rate(0)` and the transcriber-config setters that already panic on degenerate values; catches the explicit programmer-error case at the API boundary.
- `WhisperPool::new` returns `RunnerError::WhisperContextLoad` with a contract-violation message when `worker_count == 0` in the deserialised config — covers the serde path that bypasses the setter. Two new `#[should_panic]` regressions exercise both API shapes; a follow-up integration test for the `WhisperPool::new` error path is gated behind needing a real `WhisperContext`, deferred. Verification: - 285 lib tests pass (was 282; +1 overshoot regression, +2 zero-worker panic regressions). - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across 854 word pairs (bit-identical to pre-fix; the overshoot clamp doesn't bite the parity fixtures because their `raw_asr_segments[]` are always inside chunk bounds — the fix is defensive against malformed external callers).
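The shared clamping contract from the first finding can be sketched as a tiny standalone helper (illustrative names; the real helpers operate on chunk-local frame intervals):

```rust
/// Clamp a sub-segment to [0, n_samples] before any overlap accounting,
/// so a segment overshooting the chunk end cannot credit phantom samples.
/// Fully out-of-range segments contribute nothing. Illustrative sketch.
fn clamp_seg(start: i64, end: i64, n_samples: i64) -> Option<(i64, i64)> {
    let s = start.clamp(0, n_samples);
    let e = end.clamp(0, n_samples);
    if e > s { Some((s, e)) } else { None }
}

fn main() {
    // Chunk of 1000 samples; an overshooting segment is truncated.
    assert_eq!(clamp_seg(900, 1200, 1000), Some((900, 1000)));
    // A segment entirely past the chunk end vanishes.
    assert_eq!(clamp_seg(1100, 1200, 1000), None);
    // Negative starts (before the chunk) are clamped up to 0.
    assert_eq!(clamp_seg(-50, 100, 1000), Some((0, 100)));
}
```

Applying the same clamp in both helpers is what restores the shared coordinate contract the finding describes.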
…sion) Codex round-22 [medium] finding: `log_softmax_with_finite_guard` computed `log_z = max + sum.ln() as f32`, then `lp = x - log_z`. For a row with a large common offset like `[1e20, 1e20]`, `sum.ln() = ln(2) ≈ 0.69` rounded away when added to `max = 1e20` in f32 — `1e20 + 0.69 ≈ 1e20` — and every `lp` collapsed to `0.0` instead of the correct `-ln(2) ≈ -0.693`. The finiteness checks passed but the outputs were no longer log-probabilities, hiding a backend / export numeric failure as plausible alignment input. Fix: keep the subtraction of `max` in shifted f64 throughout. Compute `log_z_shifted = sum.ln()` (a small f64 in `[0, ln(V)]` under any finite input), and per element `lp = ((x as f64 - max as f64) - log_z_shifted) as f32`. Both terms in the subtraction are bounded — `(x - max) <= 0` and `log_z_shifted >= 0` whenever any element equals `max` — so `lp_f64 ∈ (-∞, 0]` regardless of `max`'s magnitude. Casting to f32 preserves the precision that mattered. Kept the per-element finiteness check: pathological inputs like `[f32::MAX, -f32::MAX]` can still underflow `lp_f64 as f32` to `-inf`, and surfacing that as `ModelInferenceFailed` keeps the backend-numeric-failure path typed (a `-inf` slipping into `data` would let Viterbi return `NoAlignmentPath` recoverably and mask the bug as `words: []`). The existing `log_softmax_rejects_finite_extremes_that_overflow_lp` test still pins this behaviour. Replaced the previous `log_z`-finiteness check with a `log_z_shifted`-finiteness check (the same value semantically; only fires on the all-(-inf) case, already covered by `log_softmax_rejects_all_neg_infinity_row`). New regression test: - `log_softmax_large_common_offset_normalises_to_unit_exp_sum`: feeds `[1e20, 1e20]`, asserts the output exp-sum is 1.0 and each lp is ≈ -ln(2). Pre-fix this test would fail with exp-sum=2.0 (each lp=0.0). Verification: - 286 lib tests pass (was 285, +1 large-offset regression). - 7 whisperx_unit_parity tests pass. 
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; ORT's wav2vec2 logits are bounded to typical ranges, so the f64-shift produces the same f32 outputs as the previous f32-fold for normal inputs — the fix is defensive against malformed model exports).
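The shifted-f64 computation described above can be sketched as a standalone function (the real code additionally guards finiteness and surfaces failures as typed errors):

```rust
/// Log-softmax with the max-shift kept in f64, as described above.
/// Both terms of the final subtraction are bounded — (x - max) <= 0 and
/// log_z_shifted in [0, ln(V)] — so the f32 cast preserves the precision
/// that matters even when max is huge. Illustrative sketch.
fn log_softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Sum of exp in the shifted domain; each term is <= 1, no overflow.
    let sum: f64 = row.iter().map(|&x| ((x as f64) - (max as f64)).exp()).sum();
    let log_z_shifted = sum.ln();
    row.iter()
        .map(|&x| (((x as f64) - (max as f64)) - log_z_shifted) as f32)
        .collect()
}

fn main() {
    // The failing case: a large common offset must give lp ≈ -ln(2),
    // not 0.0 (the pre-fix f32 fold collapsed 1e20 + ln(2) back to 1e20).
    let lp = log_softmax(&[1e20, 1e20]);
    let expected = -(2.0f32).ln();
    assert!((lp[0] - expected).abs() < 1e-6);
    // Outputs are genuine log-probabilities: exp-sum is 1.
    let exp_sum: f32 = lp.iter().map(|&p| p.exp()).sum();
    assert!((exp_sum - 1.0).abs() < 1e-6);
}
```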
Codex round-25 [medium] finding: for inputs shorter than wav2vec2's 400-sample receptive field, the encode path pads the buffer to 400 before invoking ORT. The encoder then emits `T` frames whose stride corresponds to the PADDED length (`padded.len() / (T - 1)`), but downstream `samples_per_frame` uses the original `samples.len() / (T - 1)`. The two views disagree by exactly the padding ratio: `compose_words` emits word ranges shrunk by `samples.len() / padded.len()` and `build_speech_frames` classifies frames against intervals that don't match the encoder's own view. A short utterance at the edge can be silently dropped by the speech post-pass or get a word span projected onto the wrong part of the original audio rather than recognized as padding-only. Codex's recommendation offered two paths: - Carry both lengths explicitly (validate against padded length, mark padded frames non-speech, clamp emitted ranges to original). - Skip alignment for sub-400-sample chunks. Took the second path. At 25 ms or less, a chunk cannot realistically contain a meaningful CTC path through any non-trivial transcript anyway — the trellis budget requires `T >= num_tokens`, and `T` for a sub-400-sample chunk is 1-2 frames. The simplest safe response is an early-out matching the existing `EmptyText` / empty-tokens short-circuit: return `Ok(empty AlignmentResult)`. Alignment becomes a no-op on degenerate input rather than a data-loss path. The padding code below the guard remains as defense-in-depth (it's now provably unreachable, but the explicit `if normalized_samples.len() < 400` branch documents the contract and protects against a future refactor that accidentally moves the guard). Added regression test `sub_400_sample_chunk_short_circuits_to_empty_result`: feeds a 200-sample buffer with realistic transcript text and asserts the call returns `Ok(empty)` rather than `Err` or mis-projected word ranges. 
Skips when the wav2vec2 fixture isn't present (offline / `WHISPERY_OFFLINE=1`); same gating as the existing `empty_normalised_text_returns_empty_alignment_result` test. Verification: - 287 lib tests pass (was 286, +1 sub-400 regression). - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; real fixtures don't trigger the guard, all segments are >> 400 samples).
…sample-rate setter Codex round-26 surfaced two findings, both legitimate. [medium] Backtracker reserve can exceed the trellis budget (`trellis_beam.rs:486`). The trellis budget caps `T * num_tokens` at 32 M cells, NOT `T` alone. A degenerate `num_tokens = 1` pass-through can therefore drive `T` up to 32 M frames while still passing the trellis guard. The previous backtracker code then issued a `Vec::with_capacity(1 + beam_width * 2 * T)` pre-reserve — ~96 KB at typical T ≈ 1500, but ~3 GB at T = 32 M, allocated up-front before the per-iteration abort check could fire. Dropped the pre-reserve entirely (`Vec::new()`); push-driven growth is amortised O(1) and bounded by the same per-iteration abort flag the loop already honours. For typical T ≈ 1500 the doubling churn is ~400 KB total — dwarfed by the trellis itself. Avoids OOM/panic on pathological input by letting the existing abort path bound work in-band rather than front-loading allocation. [medium] Public sample-rate override is ignored (`aligner.rs:192-204`). `set_sample_rate` / `with_sample_rate` mutated `self.sample_rate`, but the alignment algorithm never consumed that field — silence-mask, frame timebase, and stride checks all hard-coded `SAMPLE_RATE_HZ` (16 kHz). A caller using the override for a non-16 kHz model got plausible-but-wrong masks and word timestamps instead of a configuration error. Removed the `sample_rate: u32` struct field entirely along with both setters and builders. The `sample_rate()` getter now returns `SAMPLE_RATE_HZ` directly — informational only, documents what the aligner expects. Non-16 kHz wav2vec2 support is out of scope for v1; if it becomes in-scope, the right shape is to wire the override through silence-mask, frame-timebase, and stride checks together (with end-to-end tests) rather than expose a knob that mutates a dead field.
This is a public API breaking change (callers of `set_sample_rate` / `with_sample_rate` will fail to compile), but pre-1.0 those methods never worked correctly anyway. Verification: - 287 lib tests pass. - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; both fixes are behaviour-neutral on real chunks — the arena pre-reserve was a memory-budget concern, not a correctness one, and the sample-rate setters were never read in the algorithm).
…to union Codex round-27 [medium] finding: `build_speech_frames` summed each sub-segment's per-frame intersection independently, so two overlapping ranges (e.g. `[0, 100]` and `[50, 150]` inside frame 0's `[0, 320)` interval) contributed `100 + 100 = 200` to the frame's overlap counter even though their UNION only covers 150 samples. With `min_overlap_samples = 160`, the frame would clear the threshold via raw sum (200 ≥ 160) despite its union being below it (150 < 160), disagreeing with `build_speech_mask`'s union semantics (per-sample boolean OR) and letting `compose_words` retain words whose audio is mostly masked silence. The contract is "≥ 50 % of the frame's samples are inside the VAD speech UNION", which matches `build_speech_mask`'s boolean-OR aggregation. Fix: clamp + sort + merge sub-segments into a non-overlapping union before per-frame accumulation. Touching segments (`s == last.1`) merge too, so [0, 80] + [80, 160] becomes a contiguous [0, 160] (matches the `build_speech_mask` per-sample OR semantics for a single contiguous voiced span). Real wav2vec2 fixture VAD segments don't overlap (silero / VAD upstream guarantees disjoint intervals), so the parity numbers are bit-identical to pre-fix. The fix is defensive against external `align_chunk` callers that pass arbitrary VAD output — which the public API documents as accepting in chunk-local 1/16000 timebase but doesn't otherwise validate. Three new regression tests in `compose::tests`: - `build_speech_frames_uses_union_not_sum_for_overlapping_segments`: two overlapping segs whose UNION (150) < threshold (160) but SUM (200) > threshold — must classify silent. Plus a triple-overlap stress case (sum 240, union 120, classifies silent). - `build_speech_frames_treats_adjacent_segments_as_contiguous`: touching segs `[0, 80]` + `[80, 160]` form a [0, 160] union that exactly hits the threshold → speech. Verification: - 289 lib tests pass (was 287; +2 union tests, +1 adjacent test). 
- 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix).
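The clamp + sort + merge step described above is the classic interval-union pass; a minimal sketch (illustrative names, clamping omitted for brevity):

```rust
/// Collapse sub-segments into a disjoint union before per-frame overlap
/// accumulation, so overlapping ranges cannot double-count samples.
/// Touching ranges (s == last end) merge too, matching the per-sample
/// boolean-OR semantics described above. Illustrative sketch.
fn merge_to_union(mut segs: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    segs.sort_unstable();
    let mut union: Vec<(u64, u64)> = Vec::new();
    for (s, e) in segs {
        match union.last_mut() {
            // Overlapping or touching: extend the previous range.
            Some(last) if s <= last.1 => last.1 = last.1.max(e),
            _ => union.push((s, e)),
        }
    }
    union
}

fn main() {
    // The finding's case: raw sum of lengths is 200, union is only 150.
    assert_eq!(merge_to_union(vec![(0, 100), (50, 150)]), vec![(0, 150)]);
    // Touching segments form one contiguous voiced span.
    assert_eq!(merge_to_union(vec![(0, 80), (80, 160)]), vec![(0, 160)]);
}
```

Accumulating per-frame overlap against this union instead of the raw segment list is what makes the ≥ 50 % threshold test the union coverage, not the sum.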
Codex round-28 [medium] finding: `tokenize_with_word_map` took `wildcard_chars_per_word: &[u32]` — a single TOTAL count per word — and pushed every wildcard at the END of the word's encoded chars. That made leading punctuation like `"hello` indistinguishable from trailing `hello"` once it reached the CTC graph: both produced `[h, e, l, l, o, *]`. The wildcard frames always landed at the trailing-frames zone, biasing the word's emit-end timing past where the leading-punct silence actually sat in the audio. WhisperX's reference behaviour keeps `*` placeholders in source order. Fix: split the per-word count into `(prefix, suffix)` tuples end-to-end: - `NormalizedText::wildcard_chars_per_word: Vec<u32>` → `NormalizedText::wildcard_boundary_per_word: Vec<(u32, u32)>`. `with_wildcards` now takes the tuple form; getter renamed to `wildcard_boundary_per_word()`. - `EnglishNormalizer` computes prefix vs suffix separately by splitting `strip_word_punct` into a left-trim then a right-trim and counting each side. Hyphenated words: the FIRST piece inherits the original word's prefix-stripped count, the LAST piece inherits the suffix-stripped count; middle pieces get `(0, 0)` (the hyphen is treated as a real word boundary, not a wildcard frame). - `tokenize_with_word_map` takes `wildcard_boundary_per_word: &[(u32, u32)]`. For each word it pushes prefix wildcards BEFORE the encoded chars and suffix wildcards (plus any internal-skipped chars discovered during this pass) AFTER. Internal-skipped (`.` for abbreviations) still lumps into the suffix bucket — placing it exactly in source order would require interleaving wildcards mid-word during per-char encoding, which the current flow doesn't support. The suffix-leaning approximation is strictly better than the previous all-at-end design for the common leading-vs-trailing case Codex flagged; the rare internal case takes the same hit it already had. 
Three new regression tests in `tokenize::tests`: - `leading_wildcards_land_before_encoded_chars` — pins the new prefix-injection contract for `"hello` (prefix=1, suffix=0): `*` must land at index 0. - `trailing_wildcards_land_after_encoded_chars` — renamed + hardened version of the previous test; pins suffix-only behaviour (`hello"`). - `paired_wildcards_bracket_encoded_chars` — pins the paired-punctuation case `(hello)`: prefix at start, suffix at end, only encoded chars in the middle. The length-mismatch test was renamed to `wildcard_boundary_per_word_length_mismatch_errors`. Verification: - 291 lib tests pass (was 289; +3 wildcard placement tests, −1 renamed/dropped from the previous total-count contract). - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; the dia fixtures' source text doesn't trigger the leading-punctuation case where the wildcard order matters — the fix is defensive against transcripts that DO have leading boundary punctuation).
…tcher
Codex round-29 findings:
1. min_speech_coverage validation (high). Out-of-range values were
stored verbatim, so a config typo of 1.5 silently dropped every
word (no word's coverage can exceed 1.0). Now coerce: clamp to
[0.0, 1.0]; treat NaN as the documented default (0.5).
2. inject_wordlevel_model_type used naive substring search and
brace counting (medium). A user tokenizer.json whose string
values contained "model", `{`/`}`, or "type" substrings could
be misdetected and patched at the wrong byte range. Replaced
with a quote-aware scanner that tracks in_string / escape
state across three helpers (find_top_level_object_value_open,
find_matching_close_brace, has_top_level_key).
5-fixture parity bit-identical to round-28
(median 0.995-0.999, below_0.5=0 across all).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
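The coercion in finding 1 above reduces to a tiny total function (illustrative standalone sketch; the real code stores the result on the config struct):

```rust
/// Coerce min_speech_coverage into a usable range: clamp to [0.0, 1.0],
/// and treat NaN as the documented default of 0.5. Illustrative sketch.
fn coerce_min_speech_coverage(value: f32) -> f32 {
    if value.is_nan() { 0.5 } else { value.clamp(0.0, 1.0) }
}

fn main() {
    // The config typo from the finding: 1.5 no longer drops every word,
    // since no word's coverage can exceed 1.0.
    assert_eq!(coerce_min_speech_coverage(1.5), 1.0);
    assert_eq!(coerce_min_speech_coverage(-0.1), 0.0);
    assert_eq!(coerce_min_speech_coverage(f32::NAN), 0.5);
    assert_eq!(coerce_min_speech_coverage(0.5), 0.5);
}
```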
Codex round-30 findings: 1. Per-packet ASR override binding (medium). The override snapshot was taken at chunk *emission* time. A chunk whose audio was pushed under packet A's override but whose VAD-driven close happened in packet B silently picked up B's override (or None) — language hints, prompts, and strategy overrides got attached to the wrong audio under normal streaming VAD latency. Fix: cut state machine snapshots `current_override` when accumulation starts (current_start: None → Some), carries it on `MergedChunk.override_at_start`, and dispatch.on_emit reads from there. First-override-wins semantics. New regression test (src/core/transcriber.rs): `override_binds_to_packet_that_started_chunk_not_packet_that_closed_it`. 2. poll_transcript / poll_error silently dropped fatal runner errors (high). `let _ = self.drive_one_step()` dropped `WhisperPoolShutdown` and `Backpressure` errors — a dead pool looked like an empty stream. Fix: change signatures to `Result<Option<_>, RunnerError>`; stash any drive_one_step error in a `pending_fatal` slot; surface it once buffered transcripts/errors have been drained (so already-arrived results aren't lost). 5-fixture parity bit-identical to round-29 (median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-31 finding: whisper-rs's set_language and
set_initial_prompt build CStrings via CString::new(...).expect(...).
A SmolStr or Lang::Other carrying an interior NUL byte therefore
panicked the worker thread mid-inference. Single-worker pools
shut down; multi-worker pools left the chunk stranded with no
WorkFailure (drain hangs until drain_timeout). Public callers
reach these strings through AsrParamsOverride::with_initial_prompt
and Lang::Other without validation.
Validate before crossing the FFI boundary in full_params_from
and surface the failure as an in-band
WorkFailure::AsrFailed { kind: BackendError } for the offending
chunk. The pool stays alive; subsequent chunks transcribe
normally. Diagnostic message includes byte length but not
content (initial_prompt may be sensitive).
Two new unit regressions cover Lang::Other("xx\0yy") and
initial_prompt = "hint\0poison".
5-fixture parity bit-identical to round-30
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
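The pre-FFI validation described above can be sketched as a standalone check (illustrative names and error type; the real code returns a typed WorkFailure):

```rust
/// Reject interior NUL bytes before CString construction, so a poisoned
/// language hint or prompt becomes an in-band per-chunk failure instead
/// of a worker-thread panic. The diagnostic reports byte length only —
/// never content, since prompts may be sensitive. Illustrative sketch.
fn validate_no_interior_nul(field: &str, value: &str) -> Result<(), String> {
    if value.bytes().any(|b| b == 0) {
        Err(format!(
            "{field} contains an interior NUL byte (len {})",
            value.len()
        ))
    } else {
        Ok(())
    }
}

fn main() {
    assert!(validate_no_interior_nul("initial_prompt", "hint").is_ok());
    let err = validate_no_interior_nul("initial_prompt", "hint\0poison").unwrap_err();
    assert!(err.contains("interior NUL"));
    assert!(!err.contains("poison")); // content is never echoed
}
```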
Codex round-32 findings: 1. ASR timeout silent fail-open (high). run_with_temperature_ladder only consulted abort_flag in the Err path. If the watchdog fired during inference but whisper.cpp finished before observing the next abort-callback poll, an apparently-successful AsrResult reached the caller — violating the per-job timeout AND skipping timeout-streak recycling (was_timeout was derived from the un-rewritten outcome). Fix: post-watchdog-join, re-check abort_flag and rewrite outcome to WorkerHangTimeout. Mirrors alignment_pool.rs:266's existing pattern. 2. Unbounded Lang::Other intern leak (medium). intern_lang_str leaks one &'static str per distinct language hint and never evicts. Lang::Other(SmolStr) is public, so a tenant or buggy caller producing unique high-cardinality hints would grow process memory without bound. Fix: validate before interning. Restrict to lowercase ASCII letters [a-z], len 1..=8 — covers every whisper.cpp-recognized code with comfortable headroom. Reject as in-band WorkFailure::AsrFailed (consistent with the round-31 NUL-rejection style); diagnostic message does NOT echo offending bytes (hint can be public input). 5-fixture parity bit-identical to round-31 (median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
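The intern guard from finding 2 above is a one-line shape check (illustrative standalone predicate; the real code rejects failures as an in-band WorkFailure before interning):

```rust
/// Only lowercase ASCII codes of length 1..=8 may reach the intern table,
/// bounding its cardinality against high-cardinality hint spam.
/// Illustrative sketch of the validation described above.
fn is_valid_lang_hint(hint: &str) -> bool {
    (1..=8).contains(&hint.len()) && hint.bytes().all(|b| b.is_ascii_lowercase())
}

fn main() {
    assert!(is_valid_lang_hint("en"));
    assert!(is_valid_lang_hint("haw")); // longer whisper.cpp codes fit too
    assert!(!is_valid_lang_hint("")); // empty rejected
    assert!(!is_valid_lang_hint("EN")); // uppercase rejected
    assert!(!is_valid_lang_hint("en\0")); // NUL is not ASCII lowercase
    assert!(!is_valid_lang_hint("waytoolong")); // > 8 bytes rejected
}
```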
Codex round-33 findings: 1. Serde bypassed VadSegment invariant (high). derive(Deserialize) skipped VadSegment::new's end_sample > start_sample check. Reversed-range serialized input would wrap sample_count() to ~u64::MAX and trip the cut state machine's hard-split assert. Fix: replace derive with a custom Deserialize that re-validates at the serde boundary and surfaces a typed deserialize error. 2. max_attempts=0 silently dropped all ASR (medium). The retry ladder iterates `for _attempt in 0..max_attempts` — zero meant zero state.full() calls and AllTemperaturesFailed for every chunk. Fix: panic on 0 in set/with_max_attempts (matches the codebase's existing zero-rejection style for sizing config like worker_count, chunk_size, max_in_flight); add a serde deserialize_with that surfaces the violation as a typed error. 3. is_idle() ignored pending_fatal (medium). A poll that stashed a fatal could be shadowed by is_idle()=true on the next call, so callers polling "until idle" would stop without ever observing the structural failure. Fix: add !pending_fatal to the conjunction. Extracted is_idle_inner const fn so the contract is testable without a real ManagedTranscriber. 5-fixture parity bit-identical to round-32 (median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-34 findings:
1. Invalid n_threads crossed FFI unchecked (high). whisper.cpp's
decoder loops do `vec<thread>(n_threads - 1)` when n_threads
isn't 1; 0 underflows to a huge alloc, negatives abort. The
public AsrParams::set/with_n_threads accepted any i32 and
forwarded to set_n_threads. Fix: panic on `< 1` in setters,
serde deserialize_with rejects, defensive check in
full_params_from returns chunk-level WorkFailure.
2. Dead GPU-default cfg (medium). default_worker_count and
default_timeout_streak_threshold checked feature names like
`_whisper_cuda` that don't exist in Cargo.toml — the cfg
branches always returned false, so GPU users silently got
CPU defaults (oversubscribing one GPU with multiple workers).
Fix: drop the dead cfg, add explicit
WhisperPoolOptions::new_for_gpu(model_path) constructor that
picks single-worker / longer-streak defaults appropriate for
serialised GPU inference. CPU `new()` unchanged.
3. Lang serde shape mismatched docs (medium). derive(Serialize,
Deserialize) produced Rust variant names `"En"` and
`{"Other":"xx"}`, contradicting the documented lowercase ISO
string format. Fix: custom Serialize/Deserialize that writes
`as_str()` and reads via `Lang::from_iso639_1`, with the same
shape validation as round-32 (lowercase ASCII letters,
1..=8 bytes). Canonicalisation invariant preserved across
serde — `"en"` deserialises to Lang::En, not
Lang::Other("en"). Legacy derive-shaped JSON now rejected.
5-fixture parity bit-identical to round-33
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Configs are usually human-edited and humans naturally use mixed
case ("EN", "En", "en"). The ISO 639 standard treats codes as
case-insensitive — make Lang's serde deserializer accept any
case and normalize to canonical lowercase before
Lang::from_iso639_1.
Wire format and FFI behavior unchanged: serialization always
emits lowercase, the alignment-stage validation in
runner/whisper_pool.rs is unchanged (it never sees uppercase
because both named variants and post-deserialize Lang::Other
inner strings are always lowercase).
Round-trip asymmetry by design: "EN" in → Lang::En → "en" out.
Side benefit for migration from the round-34 derive switch:
legacy configs using Rust-variant-shape strings like "En"
(uppercase first letter) continue to work; only the
externally-tagged {"Other":"xx"} form remains rejected.
Two new regression tests; existing serde_rejects_uppercase
removed (it's no longer the contract).
5-fixture parity bit-identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-35 findings: 1. process_packet partial-commit (high). The previous order (push_samples → push_vads loop) committed samples first, then pushed VADs one at a time. A failing VAD #N left samples + VADs 0..N-1 committed; caller could not safely retry the packet (same starts_at would trip PtsRegression). Fix: add precheck_vad_segments on Transcriber that validates eof, timebase, ordering, and high-water (against a floor projection of post-push high water) WITHOUT mutating state. Call it before push_samples in process_packet so process_packet is now either fully atomic or returns Err with no commit. The precheck is conservative (floor projection assumes no gap-fill), which can false-reject VADs targeting the rare zero-fill region — silero never emits such segments in practice. False-accept is impossible. 2. drain idle/timeout reporting (medium). drain() used self.core.is_idle() which ignored pending_transcripts, pending_errors, AND the round-30 pending_fatal slot. A fatal recorded behind buffered output could coexist with drain returning Ok. Fix: drain now (a) takes pending_fatal eagerly if set and returns Err, (b) checks self.is_idle() (the pub predicate including all four conjuncts) for the success case, and (c) reports DrainTimeout.in_flight via a new in_flight_chunk_count() that returns cut_pending.len() + in_flight.len() — the documented chunk count, not the sample count that buffered_samples() returned (which can read 0 after trim while real chunks remain). 7 new precheck regression tests cover valid/invalid orderings, ahead-of-projected-water, AfterEof, and OutputTimebaseUnset. 5-fixture parity bit-identical (median 0.995-0.999, mean 0.983-0.998, below_0.5=0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-36 [critical]: the previous
`recv_timeout(timeout).is_err()` matched both `Err(Timeout)` (a
real hang) AND `Err(Disconnected)` (the clean cancel when the
worker drops cancel_tx after fast inference). Round-32's
post-watchdog `abort_flag` re-check then rewrote every successful
`AsrResult` into `WorkFailure::WorkerHangTimeout` — every
transcript would be lost in production ASR runs. The 5-fixture
parity suite hides this because it injects pre-computed WhisperX
ASR dumps and only exercises the alignment path.
Fix: extract a `watchdog_should_signal_timeout` helper that
pattern-matches on `RecvTimeoutError`. Only `Timeout` flips the
flag; `Ok(())` and `Disconnected` are clean exits. Mirrors the
alignment pool's existing watchdog pattern at
runner/alignment_pool.rs:209-220.
5 new regression tests cover: helper-level discrimination of each
variant; end-to-end "drop tx does NOT flip flag"; end-to-end
"real timeout DOES flip flag". Parity unchanged
(alignment-only path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
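The variant discrimination at the heart of round-36 can be sketched with std's mpsc channel, which the entry's `RecvTimeoutError` implies. The helper name matches the commit text; the two demo functions are invented to show the failure mode and the fix.

```rust
// Sketch of the round-36 watchdog fix: only a genuine Timeout should
// trip the hang flag. Ok(()) and Disconnected (the worker finished and
// dropped cancel_tx) are clean exits — the old `.is_err()` check
// wrongly treated Disconnected as a hang.
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

pub fn watchdog_should_signal_timeout(res: Result<(), RecvTimeoutError>) -> bool {
    matches!(res, Err(RecvTimeoutError::Timeout))
}

/// Fast inference: the worker drops its end immediately, so the
/// watchdog sees Disconnected, not Timeout.
pub fn fast_worker_drops_tx() -> bool {
    let (tx, rx) = mpsc::channel::<()>();
    thread::spawn(move || drop(tx));
    watchdog_should_signal_timeout(rx.recv_timeout(Duration::from_secs(5)))
}

/// Real hang: nobody signals within the deadline, so the watchdog
/// sees Timeout and must flip the flag.
pub fn hung_worker() -> bool {
    let (tx, rx) = mpsc::channel::<()>();
    let res = rx.recv_timeout(Duration::from_millis(10));
    drop(tx);
    watchdog_should_signal_timeout(res)
}
```

With the old `.is_err()` predicate, `fast_worker_drops_tx` would have returned true and every successful result would have been rewritten into a hang failure.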
Codex round-37 findings (both high):
1. AsrParamsOverride.{language_hint,initial_prompt} use
Option<Option<T>> for the absent / clear / set distinction,
but the derived Deserialize collapsed JSON `null` and "field
absent" into the same outer None — making
`{"language_hint": null}` indistinguishable from omitting
the field, defeating the documented "clear this override"
contract. Locked-language streams could not unlock via JSON
config. Fix: deserialize_with helpers
`deserialize_double_option_lang` /
`deserialize_double_option_smolstr` preserve the three-way
distinction at the serde boundary. Lang's case-insensitive
ISO deserializer continues to apply through the helper.
Four new round-trip tests cover absent/null/value plus the
three-state round-trip.
2. SampleBuffer::append committed the anchor, zero-filled, and
advanced absolute_sample_offset for empty packets at a
forward delta within gap tolerance. A "heartbeat" call at a
slightly future PTS would create phantom audio and reject
the next real packet at the originally-expected PTS as
PtsRegression — caller had no recovery path. Fix: empty
packets at delta > 0 are now a true no-op for stream state
(validation still runs). delta == 0 empty packets and
non-empty packets unchanged. Three new regression tests cover
the heartbeat-then-real-audio sequence, empty-first-then-
forward-heartbeat-then-real-audio, and exactly-on-time empty
no-op.
5-fixture parity bit-identical
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
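The absent / clear / set distinction from finding 1 can be made concrete without serde. The FieldUpdate enum and classify helper below are invented for exposition; in the real code the outer Option comes from serde, where a derived Deserialize maps both a missing field and JSON null to the outer None, and the deserialize_with helpers restore the three-way split by always wrapping a present value in Some.

```rust
// Illustrative (serde-free) sketch of the three states that
// Option<Option<T>> encodes for AsrParamsOverride fields.
#[derive(Debug, PartialEq)]
pub enum FieldUpdate<T> {
    Absent,  // field omitted from the JSON object: keep the current override
    Clear,   // `"language_hint": null`: drop the override (unlock the language)
    Set(T),  // `"language_hint": "en"`: install a new override
}

pub fn classify<T>(raw: Option<Option<T>>) -> FieldUpdate<T> {
    match raw {
        None => FieldUpdate::Absent,
        Some(None) => FieldUpdate::Clear,
        Some(Some(v)) => FieldUpdate::Set(v),
    }
}
```

The round-37 bug was exactly the collapse of the first two arms: with the derived impl, `{"language_hint": null}` arrived as the outer None and was indistinguishable from omitting the field, so Clear was unreachable from JSON.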
Codex round-38, all four findings:
1. CString leak per ASR attempt (high). whisper-rs's set_language /
   set_initial_prompt use CString::into_raw() with no Drop, so each
   attempt leaked one C string. Build the FullParams template once
   per chunk and clone it for each temperature attempt. Reduces the
   leak from O(attempts × chunks) to O(chunks) — 6× at the default
   max_attempts = 6. FullParams: Clone shallow-copies the inner
   struct including the leaked *const c_char pointers; aliasing is
   safe because the strings have static (leaked) lifetime.
   Inter-chunk dedup deferred (abort_callback captures per-job
   state, more involved).
2. drain_timeout bypassed while a command is parked (high).
   pump_until_idle_or_progress looped forever on parked +
   no-progress; the drain deadline wasn't checked inside the loop.
   A worker stuck in state.full made drain hang past its configured
   timeout. Thread an Option<Instant> deadline through the pump and
   check it before wait_for_progress. process_packet / signal_eof
   keep deadline = None (existing behavior); drain passes
   Some(start + timeout). Now matches the documented timeout
   contract.
3. no_speech_threshold was a public no-op (medium). The knob
   existed in AsrParams, but full_params_from never forwarded it
   and the accept gate only checked avg_logprob / compression
   ratio. Forward it to set_no_speech_thold AND check post-decode:
   when mean no_speech_prob > threshold AND avg_logprob is too low
   to trust, short-circuit to an empty AsrResult with no
   temperature retries — temperature can't conjure speech the
   model already evaluated as absent. Mirrors WhisperX / OpenAI
   Whisper's silence detection.
4. First empty packet still committed the anchor (medium). Extends
   round-37's empty-packet fix: an empty FIRST push has delta == 0
   by definition (no prior anchor), so the round-37 bail
   (delta > 0) didn't fire and the first-push commit path ran.
   Result: a heartbeat at PTS X locked the stream there, and real
   audio at PTS 0 tripped PtsRegression / InconsistentTimebase.
   Now empty packets are a true no-op for stream state regardless
   of first-push; real first audio at any PTS becomes the actual
   first push.
5-fixture parity bit-identical
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
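The combined round-37/38 empty-packet rule can be sketched as below. This is a minimal hypothetical SampleBuffer: field and error names are invented, and gap-tolerance zero-fill is elided; the point is only that an empty packet never mutates stream state, even on the very first push.

```rust
// Minimal sketch of the empty-packet no-op rule from rounds 37-38.
#[derive(Default)]
pub struct SampleBuffer {
    anchor: Option<u64>, // PTS (in samples) of the first *real* audio
    len: u64,            // samples committed since the anchor
}

impl SampleBuffer {
    pub fn append(&mut self, pts: u64, samples: &[f32]) -> Result<(), String> {
        if samples.is_empty() {
            // Heartbeat: validation would still run here, but the packet is
            // a true no-op for stream state — no anchor commit, no zero-fill.
            return Ok(());
        }
        // Real first audio at any PTS becomes the actual first push.
        let anchor = *self.anchor.get_or_insert(pts);
        let expected = anchor + self.len;
        if pts < expected {
            return Err("PtsRegression".into());
        }
        // A forward gap within tolerance would zero-fill here (elided).
        self.len = (pts - anchor) + samples.len() as u64;
        Ok(())
    }
}
```

Before the round-38 fix, the first empty push took the first-push commit path and locked the anchor at the heartbeat's PTS, so real audio at an earlier PTS was rejected as a regression.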