Captures the v1 design: a Sans-I/O state-machine core (Transcriber) that owns cut and dispatch logic with no ML deps, paired with a default-on runner feature that wires whisper-rs and ort-based wav2vec2 forced alignment. Output is a per-merged-chunk Transcript with mediatime ranges, language, full text, word-level timestamps, and provenance back to source VAD segments. References WhisperX's merge_chunks for the cut stage and adapts the batched-inference pattern to whisper.cpp's N-state concurrency model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P0 — emission ordering and back-pressure correctness:
- Dispatch state machine now tracks next_emit_chunk_id and emits
Event::Transcript / Event::Error in strict chunk-id order via
flush_in_order_events; out-of-order worker completions stage on
ChunkPhase::Ready until predecessors emit.
- cut_pending now holds chunk descriptors only (no extracted audio).
SampleBuffer becomes the sole audio back-pressure choke point;
buffer_cap_samples bounds total memory under sustained inference
saturation.
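The strict chunk-id emission order can be sketched as a tiny staging map. This is a hedged illustration only: names (next_emit_chunk_id, flush_in_order_events) come from the changelog, but the event type is simplified to a String and the real dispatch state machine carries far more state.

```rust
use std::collections::BTreeMap;

// Out-of-order worker completions stage in `ready`; events are released
// only as the contiguous run starting at next_emit_chunk_id fills in.
struct Dispatch {
    next_emit_chunk_id: u64,
    ready: BTreeMap<u64, String>, // staged out-of-order completions
}

impl Dispatch {
    fn complete(&mut self, chunk_id: u64, event: String) -> Vec<String> {
        self.ready.insert(chunk_id, event);
        self.flush_in_order_events()
    }

    fn flush_in_order_events(&mut self) -> Vec<String> {
        let mut out = Vec::new();
        // Release the longest contiguous run starting at next_emit_chunk_id.
        while let Some(ev) = self.ready.remove(&self.next_emit_chunk_id) {
            out.push(ev);
            self.next_emit_chunk_id += 1;
        }
        out
    }
}
```

A completion for chunk 1 arriving before chunk 0 stages silently; chunk 0's completion then releases both in order.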
P1 — semantic and interface defects:
- SampleBuffer accepts forward gaps within gap_tolerance_samples
(zero-fill); rejects only PTS regressions and oversized gaps.
- LanguagePolicy::{Auto, Lock, AutoLockAfter(n)} added; default
AutoLockAfter(1) to match WhisperX language-locking behaviour.
- Lang::ANY sentinel removed; replaced with AlignerKey::{Lang, Any}
enum lifted into the type system.
- Backend invariant in §3.4: core renamed Whisper* to Asr*
(Command::RunAsr, AsrParams, AsrResult, AsrTokenHint,
inject_asr_result). Whisper-specific fields confined to runner.
- AlignmentPool concurrency clarified: v1 single worker; future
parallelism conditional on ort EP thread-safety analysis.
- WhisperState/WhisperContext sharing flagged as §13.1 spike.
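The LanguagePolicy shape above can be sketched as follows. The enum variants are from the changelog; the lock-after-n bookkeeping is an assumed minimal interpretation (the real type presumably uses Lang rather than the &'static str stand-in used here).

```rust
/// Sketch of LanguagePolicy::{Auto, Lock, AutoLockAfter(n)}.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanguagePolicy {
    /// Detect language independently on every chunk.
    Auto,
    /// Always force the given language.
    Lock(&'static str),
    /// Auto-detect for the first `n` chunks, then lock to the detection.
    AutoLockAfter(u32),
}

pub struct LanguageState {
    policy: LanguagePolicy,
    detections: u32,
    locked: Option<&'static str>,
}

impl LanguageState {
    pub fn new(policy: LanguagePolicy) -> Self {
        Self { policy, detections: 0, locked: None }
    }

    /// Language hint for the next chunk; None means auto-detect.
    pub fn hint(&self) -> Option<&'static str> {
        match self.policy {
            LanguagePolicy::Lock(l) => Some(l),
            _ => self.locked,
        }
    }

    /// Feed back the language whisper detected for a completed chunk.
    pub fn observe(&mut self, detected: &'static str) {
        if let LanguagePolicy::AutoLockAfter(n) = self.policy {
            self.detections += 1;
            if self.detections >= n && self.locked.is_none() {
                self.locked = Some(detected);
            }
        }
    }
}
```

With the default AutoLockAfter(1), the first chunk auto-detects and every later chunk reuses that detection, matching the WhisperX behaviour the changelog cites.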
P2 — algorithmic correctness:
- Cut state machine hard-splits any single VAD segment longer than
chunk_size; chunk-size bound is now true chunk_size, not
chunk_size + max(seg_duration).
- Aligner::align is silence-aware (zero-mask outside sub_segments)
and normalisation-aware (per-language TextNormalizer trait;
surface-form recovery from normalised tokens to original text).
- condition_on_previous_text rationale rewritten: it controls
intra-chunk decoder prompt continuation, not cross-chunk
continuity.
- Transcript.text contradiction resolved: text is verbatim Whisper
output (punctuated/cased); Word.text is per-word surface form.
P3 — documentation cleanup:
- alignment feature moved to opt-in (default = ["std", "runner"]).
- bundled-tiny dropped from v1 features; moved to future work as a
build-time fetch with checksum verification.
- All public types use private fields with getters per the
findit-studio no-public-fields convention; Transcript and Word
refactored accordingly.
- Concrete error variants for AsrFailureKind and
AlignmentFailureKind; thiserror, smallvec listed as permanent
deps.
- Concurrency diagram redrawn with safer ASCII.
- Memory math corrected to include decoder workspace, alignment
logits, and per-worker model duplication scenarios.
- ort version pinned (= "2.0.0-rc.12") matching siblings.
- 16k → 48k integer-ratio round-trip constraint documented.
- Test matrix expanded with edge cases (single-segment-overrun,
zero-gap segments, empty whisper output, worker hang, language
lock).
P4 — open architectural questions:
- §1.6 added: speaker-agnostic semantics; downstream diarization
joins on Word.range and vad_segments. Confirmation requested
on whether the diarization plan needs additional anchors.
- §1.7 added: crate vs ThreadService deployment, marked as a
decision needed before implementation.
P5 — additions to v1:
- Per-call AsrParamsOverride for language hints / per-packet
param tweaks.
- Worker hang timeout protection on ASR and alignment workers;
drain_timeout cap on drain().
- Device enum (Cpu/Cuda/Metal/Vulkan) replaces use_gpu: bool.
Added §13 Open Risks for items requiring spike or external input
before implementation begins. Updated Appendix B with resolutions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§1.6: pin the diarization join contract against dia v0.1.0. dia::DiarizedSpan.range is mediatime::TimeRange at 1/16000 timebase, identical to whispery's Word.range. The indexer joins by plain interval overlap; whispery does not depend on dia and exposes no speaker-aware fields. Inline the relevant DiarizedSpan struct shape and a sketch of the indexer-side join.
§1.7: lock to crate-only deployment for v1, matching all sibling crates (silero, soundevents, textclap, dia). A wrapper service binary, if ever needed, is additive and does not change the public surface.
§13.4: convert from "open questions, get confirmation" to a record of resolutions; flag the dependency on dia v0.1.0's current shape so a future dia API change forces a revisit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
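The indexer-side join is plain interval overlap, which a hedged sketch makes concrete. Struct shapes here are illustrative stand-ins, not the real dia or whispery types; both ranges are assumed half-open and in the same 1/16000 timebase.

```rust
#[derive(Debug, Clone, Copy)]
struct Range { start: i64, end: i64 } // half-open [start, end)

// Two half-open intervals overlap iff each starts before the other ends.
fn overlaps(a: Range, b: Range) -> bool {
    a.start < b.end && b.start < a.end
}

struct DiarizedSpan { range: Range, speaker: u32 } // stand-in for dia's type
struct Word { range: Range }                       // stand-in for whispery's type

/// Indexer-side join: the speakers whose spans overlap a given word.
fn speakers_for_word(word: &Word, spans: &[DiarizedSpan]) -> Vec<u32> {
    spans.iter()
        .filter(|s| overlaps(word.range, s.range))
        .map(|s| s.speaker)
        .collect()
}
```

Because whispery exposes only Word.range and vad_segments, this join lives entirely downstream, which is exactly the speaker-agnostic contract §1.6 pins.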
Two big shifts in this revision: aligning §5.6/§6 with whisper-rs
0.13.x's actual API, and switching whispery to a two-timebase
output model (internal 1/16000 indexing, external caller-chosen
timebase for all emitted ranges).
Audit-level fixes (P0/P1):
B1: Drop public Device enum. whisper-rs selects backends via
Cargo features; only runtime knobs are use_gpu and gpu_device:
i32. WhisperPoolConfig now exposes use_gpu + gpu_device +
flash_attn directly.
B2: WhisperContext::new_with_params(path, WhisperContextParameters)
is the actual API; documented at the seam.
B3: condition_on_previous_text renamed to no_context, matching
whisper-rs polarity. Default true (no cross-chunk context). Doc
rewritten to explain intra-chunk vs cross-chunk semantics.
B4: SamplingStrategy moved to AsrParams (Greedy{best_of} /
BeamSearch{beam_size, patience}); FullParams is constructed
fresh per chunk, so beam parameters are runtime knobs after all.
B5: Drop suppress_tokens (whisper-rs has no token-list setter).
Only suppress_blank and suppress_non_speech_tokens remain.
B6: Drop temperature_schedule. The runner implements its own
ladder via repeated state.full() calls; AsrParams carries
initial_temperature, temperature_increment, max_attempts.
B7: Resolve §1.5 vs §5.6 contradiction by deleting AsrTokenHint
and Command::RunAlignment.token_hints. v1 does not enable DTW;
wav2vec2 forced alignment doesn't need per-token seeds.
B8: Drop AsrFailureKind::EmptyOutput. Empty whisper output is a
normal Transcript with empty text and elevated no_speech_prob,
not a failure.
B9: Pin invariants in §5.5: chunk_id increments on Event::Error
too; flush_in_order_events runs before trim on every inject path.
B10: Rewrite §6.4 to avoid the inline-send saturation deadlock.
Dispatch loop uses non-blocking try_send + always-drain pattern,
with crossbeam_channel::select! for backpressure parking.
B11: Pin Backpressure contract: inputs are accepted (state
advanced), caller drains before pushing again, never retries
the same call. would_accept() predicate added for callers
that need a non-mutating check.
M1: Hard-split rule pinned to ceil(len / chunk_size) equal-length
parts. MergedChunk.subs becomes Vec<SubRange> with SubOrigin
tagging Vad{vad_seq} vs HardSplit{vad_seq, part, total_parts}.
Provenance is reconstructable.
M2: GapExceedsTolerance recovery path: Transcriber::restart_at(pts)
flushes cut, drains in-flight, re-anchors the buffer.
M3: GPU backends default worker_count = 1 (whisper.cpp serialises
on a single GPU). CPU defaults to num_cpus / 2.
M4: §6.3.2 step 7 indexing fix. Per-word frame ranges go into
Vec<Option<...>> of length n (= original_words.len()), not a
packed list. Words whose audio fell in the silence-mask region
remain None and are skipped in step 8.
M5: §4.2/§4.3 Word.text doc-comments fixed: original surface form
with punctuation/casing, recovered via the normalisation map in
§6.3.2 step 9. Drop the contradictory "lowercased and
punctuation-stripped" wording.
M7: AlignerKey::Any is registry-miss fallback only. Aligner
failures emit AlignmentFailed; no silent fall-through.
M8: §6.3.3 Mutex<Aligner> justification rewritten. Real reason is
&mut self requirement of ort::Session::run with a worker on a
separate thread, not "forward-looking multi-worker."
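The B11 backpressure contract (inputs always accepted, caller drains before pushing again, never retries the same call) can be sketched with a would_accept gate. The buffer internals here are invented for illustration; only the predicate's name and the accept-then-drain discipline come from the changelog.

```rust
struct SampleBuffer {
    samples: Vec<f32>,
    cap: usize,
}

impl SampleBuffer {
    /// Non-mutating admission check — the B11 would_accept() predicate.
    fn would_accept(&self, n: usize) -> bool {
        self.samples.len() + n <= self.cap
    }

    /// A push always advances state; the contract forbids retrying the
    /// same call, so callers gate on would_accept() and drain in between.
    fn push(&mut self, chunk: &[f32]) {
        self.samples.extend_from_slice(chunk);
    }

    /// Stand-in for the event-drain side freeing buffer space.
    fn drain(&mut self, n: usize) {
        let n = n.min(self.samples.len());
        self.samples.drain(..n);
    }
}
```

The point of the predicate is that a caller never learns about saturation by having an input rejected: it either checks up front or drains first.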
Two-timebase output model (response to user guidance):
§4.1 rewritten. whispery indexes internally at 1/16000 (matches
silero, soundevents, dia internally) but emits all public
TimeRanges in the caller's chosen output timebase. The first
push_samples Timestamp's timebase becomes the output timebase
for the Transcriber's lifetime; subsequent pushes must use the
same one. SampleBuffer tracks both base_pts_out (output tb) and
base_sample_idx (16 kHz internal); samples_to_output_range
converts via mediatime::Timebase::rescale_pts.
VadSegment refactored from { range: TimeRange } to
{ start_sample, end_sample } (silero-native 16 kHz indices). The
caller passes silero output verbatim; whispery does the analysis
to output-timebase conversion internally.
§1.6 dia integration adjusted: dia stays at 1/16000; the indexer
joins across timebases via mediatime's 128-bit cross-multiply.
Lossless for integer-ratio output rates (1/48000, 1/8000), exact
to sub-sample for non-integer ratios.
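The two-timebase conversion can be sketched with plain integer arithmetic. This is a standalone approximation of what the changelog attributes to mediatime's rescale_pts (flooring 128-bit cross-multiply); the function names and the anchor-plus-offset layout follow the text, but the signatures are invented.

```rust
// Floor-rescale a 16 kHz sample count into an output timebase with
// denominator `out_den`, using a 128-bit intermediate to avoid overflow.
fn rescale_from_16k(samples: i64, out_den: i64) -> i64 {
    ((samples as i128 * out_den as i128) / 16_000) as i64
}

/// Convert an internal 16 kHz sample range into output-timebase PTS,
/// always rescaling from the immutable anchor (the NB1 fix: no drift
/// accumulates across trims because the anchor never moves).
fn samples_to_output_range(
    anchor_pts_out: i64,
    out_den: i64,
    start_sample: u64,
    end_sample: u64,
) -> (i64, i64) {
    (
        anchor_pts_out + rescale_from_16k(start_sample as i64, out_den),
        anchor_pts_out + rescale_from_16k(end_sample as i64, out_den),
    )
}
```

For an integer-ratio output rate like 1/48000, one second of audio (16 000 samples) lands on exactly 48 000 PTS, which is the lossless round-trip property the text claims.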
P3 documentation cleanup:
- §2.1 added: honest scope of inherited conventions (vs the
earlier overclaim that Sans-I/O / mediatime / smol_str /
smallvec were sibling conventions). VadSegment is documented
as not isomorphic to silero's SpeechSegment.
- ort version pinned with `=2.0.0-rc.12` (Cargo's pre-release
semantics need the equality sign).
- Lang derives Hash, Eq, PartialEq, Clone (required for HashMap
key in AlignmentSet).
- §4.1 explicit on mediatime API: const nz() helper for
NonZeroU32; no Add/Sub operators on Timestamp; TimeRange::new
panics on negative span; rescale_to saturates.
- §4.5 documents the two distinct error channels: RunnerError
(synchronous structural) vs WorkFailure (async per-chunk).
- §6.4 worker hang protection: whisper-rs's
set_abort_callback_safe used for ASR cooperative timeout.
- §13.1 reduced from 1-day spike to 2-hour verification:
whisper-rs is officially thread-safe (confirmed via context7
query); WhisperState is owned in 0.13.x with no lifetime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rface holes
Verified the 3 NEW BLOCKERs the round-2 audit found and applied
the proposed fixes; verified mediatime source confirms M-α
(IntervalTree<TimeRange, ...> is fiction); applied all 9 net-new
MAJORs and the small-item sweep.
NB1 — §5.4 SampleBuffer no longer recomputes base_pts on trim.
Layout becomes (output_tb, base_pts_out_anchor: i64 immutable,
absolute_sample_offset: u64 monotonic, buffer_drop_offset: u64
monotonic, samples). samples_to_output_range always rescales
from the immutable anchor against the absolute sample index, so
truncation error is bounded at ±1 PTS regardless of trim history.
NTSC and MPEG-TS-style non-integer-ratio rates no longer
accumulate drift.
NB2 — §6.4.1 dispatch loop now correctly returns
RunnerError::Backpressure when block_on_full_queue=false and a
try_send returned Full. drive_one_step's signature changed from
Result<(), RunnerError> to Result<bool, RunnerError> so the
saturation wait loop has a real "made progress" signal (M-η).
The block_on_full_queue=true path uses crossbeam_channel::select!
with a 10 ms safety default to wait for any worker result before
retrying.
NB3 — §5.3 hard-split formula pinned to the per-index form:
start_i = seg.start + (i × len) / n
end_i = start_{i+1} for i<n-1; end_{n-1} = seg.end
This guarantees max part length = ceil(len/n) ≤ chunk_size_samples
and parts differ by at most 1 sample. The naive
floor(len/n)+remainder reading (which produces 9,9,11 for len=29
chunk_size=10) is explicitly disallowed.
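The per-index formula transcribes directly into code. A hedged sketch (function name and signature are invented; the arithmetic is NB3's):

```rust
/// Hard-split [start, end) into n = ceil(len / chunk_size) parts using
/// start_i = start + (i * len) / n, so parts differ by at most 1 sample
/// and the max part length is ceil(len / n) <= chunk_size.
fn hard_split(start: u64, end: u64, chunk_size: u64) -> Vec<(u64, u64)> {
    let len = end - start;
    let n = len.div_ceil(chunk_size); // number of parts
    (0..n)
        .map(|i| {
            let s = start + (i * len) / n;
            let e = if i + 1 < n { start + ((i + 1) * len) / n } else { end };
            (s, e)
        })
        .collect()
}
```

For len = 29 and chunk_size = 10 this yields parts of 9, 10, and 10 samples, never the disallowed 9, 9, 11 of the naive floor-plus-remainder reading.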
M-α — Verified mediatime/src/lib.rs at line 436 (TimeRange has
Eq/Hash but no Ord/PartialOrd) and line 411 (only Timestamp has
Ord, via cmp_semantic at lib.rs:303-328). The IntervalTree path
in §1.6 was fiction; replaced with two real paths (canonicalise
to whispery's output tb, or canonicalise to dia's 1/16000) plus
a "roll your own newtype using cmp_semantic" note. The §1.6
indexer pseudo-code rewritten using BTreeMap<i64, _> keyed on
canonicalised end PTS rather than the nonexistent IntervalTree.
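The BTreeMap-based join can be sketched as follows. This is an illustrative interpretation of the rewritten §1.6 pseudo-code: spans canonicalised to one timebase, keyed on end PTS, with a half-open overlap query; the Span shape and function names are invented.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Unbounded};

#[derive(Debug, Clone, Copy, PartialEq)]
struct Span { start: i64, end: i64, speaker: u32 }

/// Spans whose half-open range overlaps [qs, qe). Keys are canonicalised
/// end PTS: any span with end > qs is a candidate, and candidates whose
/// start < qe overlap the query.
fn overlapping(map: &BTreeMap<i64, Vec<Span>>, qs: i64, qe: i64) -> Vec<Span> {
    map.range((Excluded(qs), Unbounded))
        .flat_map(|(_, spans)| spans.iter())
        .filter(|s| s.start < qe)
        .copied()
        .collect()
}
```

Unlike an interval tree, the scan over ends past qs is linear in the candidate count, which is fine for diarization-density span sets and avoids depending on an Ord that TimeRange does not have.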
M-β — Added restart_at, would_accept, unpoll_command, and
output_timebase to Transcriber's §5.1 public surface. All four
were referenced by the rest of the doc as if they existed but
were missing from the impl block.
M-γ — push_vad_segment doc-comment now states it returns
TranscriberError::OutputTimebaseUnset before the first
push_samples. Added the OutputTimebaseUnset variant to
TranscriberError, plus InconsistentTimebase for cross-timebase
push attempts.
M-δ — §5.4 PTS regression check moved to output-PTS space. delta
is computed as starts_at.pts() - expected_pts_out (in output_tb);
only the zero-fill width converts back to 16k samples. This
eliminates the spurious-PtsRegression failure mode on
non-integer-ratio output timebases where contiguous caller pushes
appear to regress because of round-trip truncation.
M-ε — §5.4.1 restart_at explicitly documents that next_chunk_id
is NOT reset; chunk_id continuity is preserved so in-flight
chunks from before the gap still emit normally. The first
post-restart chunk has chunk_id one larger than the last
pre-restart chunk.
M-ζ — Trim's low-water computation (§5.5) now reads from
cut_pending only, not in_flight. In-flight chunks have their
audio in their own Arc<[f32]> and are decoupled from the live
buffer, so trim doesn't need to consider them. After restart_at,
cut_pending is empty (Cut was flushed), so trim's low-water is
the buffer's high-water — the entire empty buffer is eligible
for drop. This eliminates stale-anchor PTS values poisoning trim
across a restart.
M-η — drive_one_step now returns Result<bool, RunnerError>; the
caller's saturation wait loop has a real progress signal to drive
the spin condition.
M-θ — §6.3.2 word_idx_per_token changed to Vec<Option<usize>>.
wav2vec2 word-delimiter (|), <unk>, and special tokens get None
and are skipped during the Viterbi walk in step 7. Step 7 also
explicitly notes it skips blank-emitting frames (CTC blank token).
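The M-θ mapping can be sketched directly. The token list and word-splitting convention are illustrative assumptions (standard wav2vec2 vocabularies use | as the word delimiter); the Option-per-token shape is the changelog's.

```rust
/// Map each wav2vec2 vocabulary token to Some(word_idx), or None for the
/// word delimiter and special tokens, so the Viterbi walk can skip them.
fn word_idx_per_token(tokens: &[&str]) -> Vec<Option<usize>> {
    let mut word = 0usize;
    let mut in_word = false;
    tokens
        .iter()
        .map(|t| match *t {
            // Word delimiter: close the current word if one was open.
            "|" => {
                if in_word {
                    word += 1;
                    in_word = false;
                }
                None
            }
            // Special tokens carry no word identity.
            "<unk>" | "<pad>" | "<s>" | "</s>" => None,
            // Ordinary character token: belongs to the current word.
            _ => {
                in_word = true;
                Some(word)
            }
        })
        .collect()
}
```

The output length equals the token count, matching the M4 requirement that per-word structures stay index-aligned rather than packed.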
M-ι — §6.4.3 worker hang protection got recycle hysteresis. CPU
backends recycle WhisperState on every timeout (cheap); GPU
backends require timeout_streak_threshold consecutive timeouts
(default 3) before recycling, because GPU create_state allocates
KV-cache buffers and stalls the entire pool when worker_count=1.
WhisperPoolConfig.timeout_streak_threshold is now in §8.
M-κ — AsrParams docs explicitly note the layered-ladder concern.
Whisper.cpp has its own internal temperature ladder
(temperature_inc, max_decoding_failures); the runner must set
temperature_inc=0 and max_decoding_failures=1 inside FullParams
so the runner's outer ladder is the sole authority. The
spec acknowledges that if whisper-rs 0.13.x doesn't expose those
setters, the runner pins initial_temperature=0 and the AsrParams
temperature fields are advisory; resolved during the §13.1
verification.
Smaller items:
- §8 defaults table: max_queued_chunks default = worker_count + 4;
timeout_streak_threshold default 1 (CPU) / 3 (GPU).
- §5.3 step 3 pins current_end ≥ current_start invariant
explicitly (set current_end = sub.start when current_start is
initialised), eliminating the underflow guard reliance.
- §6.4.1 drain() now reuses the saturation select-loop with a
10 ms default and exits on idle.
- §5.6 full_params_from now shows the set_abort_callback_safe
wire-in for worker hang protection.
- §6.1 builder section now shows the canonical
WhisperContext::new_with_params construction so reviewers can
see where the actual whisper-rs API lives.
- §4.3 Word.text doc-comment notes that silence-masked words are
absent from Transcript.words (so words[].text joined is not
guaranteed to equal Transcript.text modulo whitespace).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng variants
NB-α — §5.4.1 restart_at now resets BOTH absolute_sample_offset
and buffer_drop_offset to 0. The v3 draft carried
absolute_sample_offset forward as the new buffer_drop_offset,
which made expected_pts_out = anchor + rescale(carried_offset)
on the very push that was supposed to recover the gap. The
first post-restart push tripped PtsRegression every time. With
both offsets reset, the first push computes
expected_pts_out = starts_at.pts() + rescale(0) = starts_at.pts()
exactly, matching the caller. Pre-restart in-flight chunks
already hold their audio in their own Arc<[f32]>s
(extracted at chunk-creation time per §5.4.1 step 5), so
resetting indices is safe.
NB-β — §6.4.1 saturation wait no longer uses
crossbeam_channel::select! with empty arm bodies. select!'s recv
arms perform the actual receive and bind the message to the arm
body; an empty `=> {}` body silently drops the result and the
next drive_one_step's Phase 1 try_recv would find the channel
empty — silent data loss exactly on the saturation path.
Replaced with crossbeam_channel::Select::ready_timeout, which
signals readiness without consuming. drive_one_step's Phase 1
try_recv then drains the now-ready message normally.
NB-γ — RunnerError gains a Backpressure { buffered, cap }
variant that §6.4.1 was constructing without it being declared.
The reviewer's claim that the entire RunnerError enum was
missing was incorrect (it was at line 610 throughout) but the
specific variant referenced by drive_one_step's Backpressure
return path was indeed undeclared.
W3 — §5.1 surface gains pub fn next_expected_starts_at(&self)
-> Option<mediatime::Timestamp>. Strictly-contiguous callers
should use this authoritative value instead of computing their
own per-packet running sum, because per-packet
mediatime::Timebase::rescale_pts truncates and a sequence of
per-packet rescales does not always equal one rescale of the
cumulative sample count (the regression check is anchored on
the cumulative form). Without the accessor, callers on
non-integer-ratio output timebases (1/30001, MPEG-TS, NTSC)
would eventually drift by ±1 PTS and trip a spurious
PtsRegression after enough packets.
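The truncation drift that motivates next_expected_starts_at is easy to demonstrate numerically. A hedged sketch with invented helper names (the flooring arithmetic mirrors what the text attributes to rescale_pts):

```rust
// Floor-rescale a 16 kHz sample count to an output timebase denominator.
fn rescale(samples: i64, out_den: i64) -> i64 {
    ((samples as i128 * out_den as i128) / 16_000) as i64
}

/// Difference between rescaling the cumulative sample count once and
/// summing per-packet rescales — nonzero on non-integer-ratio timebases.
fn drift_after(packets: i64, packet_len: i64, out_den: i64) -> i64 {
    let per_packet_sum = packets * rescale(packet_len, out_den);
    let cumulative = rescale(packets * packet_len, out_den);
    cumulative - per_packet_sum
}
```

With a 1/30001 output timebase, 1000-sample packets each rescale to floor(1875.0625) = 1875 PTS, so after 16 packets the per-packet running sum reads 30 000 while the cumulative rescale reads 30 001: one PTS of drift, exactly the spurious-PtsRegression trigger the accessor prevents. Integer-ratio timebases like 1/48000 show zero drift.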
M-κ — full_params_from now sets set_temperature(attempt_temp),
set_temperature_inc(0.0), set_max_decoding_failures(1) so the
runner's outer ladder is the sole authority. Each state.full()
call is one decoding attempt at exactly the runner-supplied
temperature; whisper.cpp's internal ladder is disabled.
run_with_temperature_ladder updated to pass attempt_temperature
into full_params_from on each retry. The AsrParams
log_prob_threshold doc-comment rewritten to reflect that the
suppression is wired (not a future concern), with a
compatibility note that the §13.1 verification spike confirms
the precise setter names; if a future whisper-rs renames or
removes them, the implementation falls back to a single attempt
per chunk and downgrades the temperature fields to advisory.
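The runner's outer ladder can be sketched with the whisper call abstracted away. The function name run_with_temperature_ladder and the AsrParams field names come from the changelog; the closure-based shape is an assumption (real code would rebuild FullParams per attempt with set_temperature(t) and set_temperature_inc(0.0)).

```rust
struct LadderParams {
    initial_temperature: f32,
    temperature_increment: f32,
    max_attempts: u32,
}

/// Outer temperature ladder: one decode attempt per runner-supplied
/// temperature, with whisper.cpp's internal ladder disabled so this
/// loop is the sole authority. `decode` returning None models a failed
/// attempt (thresholds tripped).
fn run_with_temperature_ladder<F>(p: &LadderParams, mut decode: F) -> Option<String>
where
    F: FnMut(f32) -> Option<String>,
{
    let mut t = p.initial_temperature;
    for _ in 0..p.max_attempts {
        if let Some(text) = decode(t) {
            return Some(text);
        }
        t += p.temperature_increment;
    }
    None // all attempts failed; runner emits a per-chunk failure
}
```

Each state.full() call in the real runner corresponds to one decode(t) here, which is the B6 design of carrying initial_temperature / temperature_increment / max_attempts on AsrParams rather than a schedule.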
§8 defaults table — removed three stale rows (beam_size,
temperature_schedule, condition_on_previous_text) and added
twelve accurate ones for the v3 AsrParams shape: strategy,
initial_temperature, temperature_increment, max_attempts,
no_speech_threshold, log_prob_threshold,
compression_ratio_threshold, no_context, suppress_blank,
suppress_non_speech_tokens, n_threads, initial_prompt. The
no_context row explicitly documents the polarity inversion vs
WhisperX's condition_on_previous_text.
§12 stale AsrParamsOverride field list now reads
{ language_hint, strategy, initial_temperature, initial_prompt }
matching §6.1.
§13.4 cross-timebase wording fixed: now says "mediatime exposes
128-bit cross-multiply via Timestamp::cmp_semantic only —
TimeRange has no Ord — so the indexer canonicalises ranges to
one timebase up front" rather than the v3 draft's claim that
mediatime supports the cross-timebase join "directly via 128-bit
cross-multiply" (which contradicted the §1.6 rewrite).
§4.3 Word.range doc-comment now describes partial-alignment
semantics: when silence-aware alignment drops words, surviving
words have a `range` covering only frames the Viterbi path
attributed to them (not adjacent words' frames or masked-region
frames), and the `text` is still the full original surface form
even if alignment only saw a fraction of the audio.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… c_int
Latent bug — §5.4.1 restart_at now drains cut_pending in step 1
before clearing the buffer in step 3. cut_pending lives on the
dispatch state machine (not the cut state machine that Cut::flush()
drains), so without an explicit drain its entries would survive
into the post-restart frame carrying old-anchor SampleRange
indices. The next trim() would then compute a stale low-water, and
the next promotion attempt would extract against indices that no
longer exist in the cleared buffer — hard panic. The drain promotes
each queued entry to in_flight (extract samples in old-frame
indexing, cache TimeRange against the old anchor, build
ChunkRecord). The drain-on-restart path is allowed to temporarily
exceed max_in_flight; restart_at is a one-time event, and dropping
the chunks would lose transcription work the caller had already
paid for. The bug existed in v3 too (cut_pending was already
there), but v4's cleaner re-anchor logic surfaced it.
NIT 1 — §5.6's "fallback" wording rewritten as compile-time
guidance. If a future whisper-rs is found (during §13.1
verification) to lack set_temperature or set_temperature_inc, the
runner module must be built with an explicit alternative
implementation that omits those calls — a compile-time situation
requiring a code-path alternative, not a runtime fallback. In that
alternative path the AsrParams temperature fields become advisory.
set_max_decoding_failures wired as best-effort. Context7
verification confirms set_temperature and set_temperature_inc in
whisper-rs 0.13.x but does not list set_max_decoding_failures.
Since whisper.cpp expands its internal temperature ladder only
when temperature_inc > 0 — with inc = 0 the ladder collapses to a
single attempt at the initial temperature — max_decoding_failures
is a secondary safeguard. The doc-comment on log_prob_threshold
and the inline comment in full_params_from now describe this
explicitly; if §13.1 finds the setter absent, omitting that line
is a behavioural no-op.
§5.6 AsrParams.n_threads type changed from i32 to
std::os::raw::c_int to match whisper-rs's set_n_threads parameter
exactly. Documented inline.
§6.4.1 added a one-line comment about
crossbeam_channel::Select::ready_timeout's documented spurious-wake
behaviour. The pattern is harmless under a spurious return (the
next try_recv returns Empty and the outer loop spins) but worth
forestalling future confusion.
§12 future-work entry on AsrParamsOverride now explains the
intentional minimality: a per-packet caller can shift the ladder's
starting point (initial_temperature) but not reshape it
(temperature_increment, max_attempts); reshaping requires
rebuilding the runner with asr_params(...). The constraint is
documented as deliberate so future additions don't accidentally
break the contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IB1 — §5.1 Transcriber surface gains per-method `Errors:` blocks
listing the exact TranscriberError variants each method can return.
Removes the implementer-guesses-and-tests-lock-it-in failure mode.
push_samples documents PtsRegression / GapExceedsTolerance /
Backpressure / InconsistentTimebase / AfterEof. push_vad_segment
documents OutputTimebaseUnset / PtsRegression{kind:VadSegment} /
AfterEof and pins the strict-monotonic invariant on segment
ordering. signal_eof, restart_at, inject_*_result, inject_failure
get explicit error contracts.
IB2 — §5.5 invariant 4 carves an explicit exception for
restart_at's cut_pending drain: the invariant prohibits exceeding
max_in_flight in steady state, but restart_at is allowed to push
above the cap by however many entries were queued in cut_pending
at restart time. The exceedance is bounded by max_queued_chunks
and decays as workers drain normally. Trim's promotion guard is
documented as suspended for the duration of the drain. Required
because the alternative (dropping queued chunks) loses paid-for
transcription work.
§3.3 — re-exports mediatime::{Timebase, Timestamp, TimeRange} so
consumers don't need a separate mediatime dep just to name them.
InvalidLang added to the public exports.
§5.1 — unpoll_command's visibility narrowed to pub(crate) (was
pub with prose "runner-only"). Same crate as the runner module,
so pub(crate) is the right gate. Out-of-tree consumers driving
the state machine themselves don't need this affordance.
§6.1 — process_packet documents the vad_segments contract
explicitly: strictly monotonically increasing start_sample, no
overlap, no duplicates; violations surface as PtsRegression.
Empty packet (samples.is_empty()) handling documented as a no-op
when delta_pts_out is 0.
§5.3 / VadSegment::new — panics on `end_sample <= start_sample`
(strict inequality). Zero-duration VAD segments would produce
zero-length MergedChunks downstream which break alignment;
silero never produces them anyway.
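The VadSegment constructor contract above can be sketched directly; the struct shape follows the §4.1 refactor to silero-native 16 kHz sample indices, while the method names beyond new are illustrative.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VadSegment {
    start_sample: u64,
    end_sample: u64,
}

impl VadSegment {
    /// Panics when `end_sample <= start_sample` (strict inequality):
    /// zero-duration segments would produce zero-length MergedChunks
    /// downstream, which break alignment.
    pub const fn new(start_sample: u64, end_sample: u64) -> Self {
        assert!(end_sample > start_sample, "zero-duration VAD segment");
        Self { start_sample, end_sample }
    }

    /// Segment length in 16 kHz samples.
    pub const fn len_samples(&self) -> u64 {
        self.end_sample - self.start_sample
    }
}
```

Panicking inside a const fn is fine here: assert! in const contexts has been stable since Rust 1.57, well under the crate's stated MSRV.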
§6.2 — WhisperPoolConfig.dispatch_idle_poll exposes the previously-
magic 10ms saturation-wait timeout. §6.4.1 saturation wait now
references whisper_pool_config.dispatch_idle_poll instead of the
hard-coded value.
§6.4.3 — added a clarification on streak-vs-chunk_id correspondence:
each timeout is one chunk_id; recycle does not retroactively retry
timed-out chunks (they emit as Event::Error and decay out of
in_flight); streak counting is per-worker, not per-chunk_id.
§4.5 — InvalidLang struct added (returned by Lang::from_iso639_1).
§10 — added §10.4 "v3-v5 regression tests" (nine named tests covering
the specific defects caught during review rounds: PTS drift on
non-integer-ratio TBs, saturation-wait result preservation,
restart_at cut_pending drain, next_expected_starts_at correctness,
layered-ladder suppression, unpoll_command round-trip, empty-packet
handling, zero-duration VadSegment panic, output-PTS-space
regression check). The CI matrix subsection was renumbered to §10.5.
§13.1 — verification bullet 4 rewritten. The original "confirm
threshold setter names" wording was already stale by v4; the
revised bullet pins the actual setters whose presence the v6
implementation relies on (set_temperature, set_temperature_inc,
and the best-effort set_max_decoding_failures), with the
compile-time fallback path described in §5.6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lang is now a typed enum over whisper.cpp's supported language
set, marked #[non_exhaustive] for forward compat without
semver-major bumps, with a fallback Other(SmolStr) variant for
unknown ISO codes that whisper auto-detect surfaces.
§4.4 — replaced the SmolStr newtype declaration with the enum.
99 named variants plus Other; full variant list deferred to
Appendix C to keep §4.4 readable. Documented the canonicalisation
invariant: from_iso639_1 maps known codes to named variants and
NEVER produces Other for an enum-known code, so structural
PartialEq/Hash stay correct (Lang::En != Lang::Other("en") is
fine because no API path can construct Lang::Other("en")).
Discussed why Other(SmolStr) is preferred over
Result<Lang, InvalidLang>: an indexing engine processing
thousands of hours shouldn't fail a chunk on an unfamiliar
language code; Other lets the transcript flow through and the
indexer logs the unusual SmolStr.
Discussed why #[non_exhaustive] in addition to Other: when
whisper.cpp adds a new language, inserting the named variant
later is not a breaking change; external matches already need
a `_` arm so their compiled code keeps working. Variant just
stops appearing under Lang::Other and starts appearing under
its named form.
§4.5 — InvalidLang type removed. from_iso639_1 is now a total
function returning Lang directly, no Result. Documented inline
why InvalidLang is gone for future readers.
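The §4.4 contract can be sketched with a three-variant stand-in for the 99-variant enum (String standing in for SmolStr). The canonicalisation invariant is the load-bearing part: from_iso639_1 is total and never produces Other for an enum-known code.

```rust
/// Three-variant stand-in for the full Appendix C enum.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
#[non_exhaustive]
pub enum Lang {
    En,
    De,
    Yue, // Cantonese — one of the non-2-letter codes
    /// Unknown ISO code surfaced by whisper auto-detect.
    Other(String),
}

impl Lang {
    /// Total function: known codes map to named variants, unknown codes
    /// to Other. No path produces Other("en"), so structural Eq/Hash
    /// stay correct.
    pub fn from_iso639_1(code: &str) -> Lang {
        match code {
            "en" => Lang::En,
            "de" => Lang::De,
            "yue" => Lang::Yue,
            other => Lang::Other(other.to_owned()),
        }
    }

    pub fn as_str(&self) -> &str {
        match self {
            Lang::En => "en",
            Lang::De => "de",
            Lang::Yue => "yue",
            Lang::Other(s) => s,
        }
    }
}
```

Adding a new whisper.cpp language later means a code that used to round-trip through Other starts returning its named variant; #[non_exhaustive] means external matches already carry the `_` arm that absorbs this.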
§3.3 — InvalidLang re-export dropped from the public surface.
Appendix B — item 1 (Lang typed-enum vs SmolStr deferred decision)
moved to "Resolved" with a pointer to §4.4 / Appendix C.
Appendix C added — full Lang variant list (99 named variants
plus Other), with naming notes for the few non-2-letter codes
(Haw for Hawaiian, Yue for Cantonese, Jw for whisper.cpp's
non-modern Javanese code) and a one-line description of how
new whisper.cpp languages are added (insert alphabetically,
add as_str / from_iso639_1 match arms, no other code paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time SemVer caveat
Three small additions following the round-6 final audit verdict
of "PR-ready":
1. §5.1 VadSegment::new doc-comment now notes that panic in
const fn is stable on Rust ≥ 1.57, well below the crate's
≥ 1.85 MSRV inherited from siblings. Pre-emptive: removes
the implementer's "wait, is this actually allowed?" check.
2. §10.4 regression tests gain an explicit Lang canonicalisation
invariant test. For every named variant V in Appendix C,
assert Lang::from_iso639_1(V.as_str()) == V AND the result
does not match Lang::Other(_). Plus the symmetric
pure-Other round-trip ("zzz" → Lang::Other("zzz")). Pins
the §4.4 contract that no API path produces Lang::Other(s)
for an enum-known code s.
3. §3.3 mediatime re-export block now documents the SemVer
coupling: a mediatime major bump is automatically a whispery
major because the public API names mediatime types directly.
The mediatime dependency stays pinned to one major in
Cargo.toml; coordinated bumps when needed.
After 6 rounds of multi-agent adversarial review, the design has
converged. Architecture sound, contracts honored, dispatch loop
deadlock-free, buffer arithmetic drift-free, alignment correct,
dia integration verified, every public method has a documented
error contract, every load-bearing API claim verified against
real codebases or live docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-emptively close the doc-comment MINORs the round-6 audit
flagged as appropriate for the first implementation PR. Fixing
now in the spec costs ~15 lines and saves the implementer the same
set of "spec is silent about X" questions later.
§5.1 — Transcriber documented as Send + !Sync. Multi-threaded
drivers must wrap in Mutex; whispery does not provide internal
synchronisation.
§4.5 — WorkFailure now derives Clone + Debug explicitly. Clone is
needed because the dispatch state machine moves the failure into
Event::Error while runner-side code may also want to log it; the
contained String is the only non-trivial allocation.
§6.2 — AsrWorkItem.asr_timeout provenance documented inline:
sourced from the runner builder's worker_timeouts(asr, _), stamped
per-job at dispatch, fed into the abort_flag watchdog.
§6.3 — Aligner.blank_token_id documented: read from the wav2vec2
tokenizer's special-tokens map at Aligner::from_paths time
(standard <pad> / [PAD] convention; future builder method for
non-standard models).
§5.1 — is_idle() doc-comment expanded: enumerates the queues that
must all be empty, and explicitly notes that pre-restart in_flight
chunks keep is_idle() false until they emit (restart_at does not
synthetically clear them).
§6.4.1 — Disconnected-channel behaviour of Select::ready_timeout
documented: closed worker channels wake the wait, drive_one_step's
next try_recv returns Disconnected, dispatch surfaces
WhisperPoolShutdown. No silent stall on worker panic.
§10.4 — extended the unpoll_command test (M12) with a
park-and-resume case that exercises the full wake/select cycle:
saturate work_tx, park via unpoll, fire result, assert
ready_timeout wakes within dispatch_idle_poll, assert next
drive_one_step lands the parked command without losing the result.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§11 working-memory bullet now notes that the total scales roughly
linearly with worker_count via the per-worker decoder workspace
term (~20 MiB / worker on tiny), with worked examples for 8 and
16 workers. Model weights and the alignment-logits buffer are
called out as independent of worker_count so readers don't
extrapolate the wrong scaling onto them.
§5.6 AsrParams.n_threads doc-comment now adds the SemVer caveat:
c_int is i32 on every supported platform today, so the alias is a
no-op. The alias exists to track whisper-rs's signature exactly;
an upstream type change (or a future c_int variation) would
propagate as a breaking change here, requiring a coordinated
whispery major.
After 7 rounds of multi-agent adversarial audit, every
documentation item flagged across rounds 1-7 has been addressed in
the spec or explicitly deferred with rationale. The design has
converged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation plan covering whispery's no-ML-deps foundation: public types (Transcript, Word, Lang, errors, VadSegment), the Sans-I/O core module (cut state machine, sample buffer, dispatch state machine, Transcriber), benches, and an end-to-end mocked-backend integration test. 25 tasks total, organised into 11 sections. Each task has 3-5 TDD-style steps with concrete code, exact commands, and explicit expected outputs. After Plan A merges, whispery is a usable state-machine crate that can be driven via examples/core_only.rs; Plan B will add the whisper-rs runner and Plan C the alignment pipeline. Spec ref: docs/superpowers/specs/2026-04-28-whispery-cut-batch-whisper-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan A scope: core deps only (mediatime, smol_str, thiserror, smallvec). Plan B will add the runner feature pulling whisper-rs and crossbeam-channel; Plan C will add alignment pulling ort, tokenizers, ndarray. Spec: docs/superpowers/specs/2026-04-28-whispery-cut-batch-whisper-design.md §3.2.
Code-quality review found that smol_str declared without default-features=false silently links its std feature even when a consumer enables whispery with --no-default-features, breaking the no_std promise lib.rs claims via #![cfg_attr(not(feature = "std"), no_std)]. Add default-features=false on smol_str and propagate smol_str/std through whispery's std feature. thiserror has the same structural pattern but its std feature is empty in 2.x so there is no current correctness impact; deferred. Spec: §3.2.
Wires up the module tree: time, types, core. Crate-level lints match the silero convention (deny missing_docs, forbid unsafe_code). Modules are empty placeholders that subsequent tasks fill in. Spec: §3.1, §3.3.
Forward fix flagged by Task 2 code review: when types code lands in Task 8 (errors via thiserror) and Task 10 (AsrParams uses SmallVec), the no_std build path will need these defaults off. Applying now while the manifest is fresh saves a fix-up commit later. thiserror v2's std feature is currently empty so this is a behaviour-no-op today; smallvec works on alloc-only out of the box. Spec: §3.2.
The analysis timebase is internal-only; whispery's public output uses whatever timebase the caller's first push_samples Timestamp carries. Spec: §4.1.
Implementation Plan §4. Spec: §4.2 (Transcript.chunk_id), §5.5 (monotonicity contract — id increments on Event::Error too).
Spec §4.1.3: 16 kHz sample-index input, panic on zero or negative duration. silero never produces zero-duration segments; panic catches programmer error at the boundary.
Variant table from whisper.cpp's g_lang. as_str() for the known→ISO direction. The reverse direction (from_iso639_1) and the canonicalisation invariant test land in the next task. Added #[allow(missing_docs)] on the enum to satisfy the crate's #![deny(missing_docs)] — the 99 variant names are ISO 639-1 codes and are self-documenting by name. Spec: §4.4, Appendix C.
Plan A Task 6 code review caught an internal inconsistency in the spec: the prose at §Appendix C said "99 named variants" but the code-block table immediately above lists 100 (10 rows × 10). Hand-counted whisper.cpp v1.7.6's g_lang: 100 supported languages. The table is correct; the prose was wrong. This commit fixes only the prose. The plan file's Task 7 test fixture (which asserts known.len() == 99) is corrected when Task 7's implementer is dispatched. Spec: §4.4, Appendix C.
Total-function constructor: every &str produces a Lang. Known codes canonicalise to named variants, unknowns go to Other. Test verifies Lang::from_iso639_1(V.as_str()) == V for every named variant AND that the result is never Lang::Other(_). Spec: §4.4, §10.4 (Lang canonicalisation invariant test).
Two channels: synchronous TranscriberError on push/inject paths, asynchronous WorkFailure via Event::Error. WorkFailure derives Clone + Debug so the dispatch loop can move it into Event::Error while runner code logs it. Spec: §4.5.
Crate-private constructors used by the dispatch state machine (Transcript::new) and alignment pipeline (Word::new). A test-only `for_test` module gives downstream tests concise helpers for synthesising Transcripts and Words. Spec: §4.2, §4.3.
Backend-agnostic: no whisper-rs names in any of these types. The
runner translates AsrParams into FullParams in Plan B. Default
values match WhisperX (BeamSearch{5,-1.0}, no_context=true,
temperature ladder 0.0/+0.2 × 6 attempts).
Spec: §3.4 (backend invariant), §5.6.
Two cosmetic fixes after Task 10's commit: 1. Cargo.toml gains empty `runner = []` and `alignment = ["runner"]` features. The cfg(feature = "alignment") branches in src/core/command.rs were emitting `unexpected_cfgs` warnings on every build because the feature wasn't declared. Plan B/C will replace these stubs with real feature payloads. 2. The `pub(crate) type ChunkAudio = Arc<[f32]>` alias is consumed by the dispatch state machine in Task 16; allow(dead_code) until then. Spec: §3.2.
… Cut) State struct only — push_segment and flush land in the next task with their tests. Crate-private; nothing here crosses the public surface. Spec: §5.3.
Per-index split formula: start_i = seg.start + (i*len)/n with the
last part absorbing the remainder by setting end_{n-1} =
seg.end_sample. Guarantees max part length ≤ chunk_size_samples;
the audit's pathological case (len=29, chunk_size=10, n=3) now
produces 9/10/10, never 9/9/11.
Tests cover: empty flush, single short segment, multi-segment
merge, multi-segment with mid-stream flush, over-long single
segment with three-way split, and the strict-bound regression
test for the audit case.
Spec: §5.3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
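The per-index split formula above can be sketched as a small standalone helper (hypothetical free function; the real cut state machine operates on richer chunk descriptors):

```rust
/// Split a [start, end) sample range into `n` parts using
/// start_i = start + (i * len) / n, with the last part absorbing the
/// rounding remainder. Illustrative sketch of the formula described above.
fn split_segment(start: u64, end: u64, n: u64) -> Vec<(u64, u64)> {
    assert!(end > start && n > 0);
    let len = end - start;
    (0..n)
        .map(|i| {
            let s = start + (i * len) / n;
            let e = if i + 1 == n { end } else { start + ((i + 1) * len) / n };
            (s, e)
        })
        .collect()
}

fn main() {
    // The audit's pathological case: len = 29 split three ways.
    let parts = split_segment(0, 29, 3);
    let lens: Vec<u64> = parts.iter().map(|(s, e)| e - s).collect();
    assert_eq!(lens, vec![9, 10, 10]); // never 9/9/11
    // Every part respects the strict bound ceil(29 / 3) = 10.
    assert!(lens.iter().all(|&l| l <= 10));
}
```

Because each part length is floor((i+1)·len/n) − floor(i·len/n) ≤ ceil(len/n), choosing n = ceil(len / chunk_size) keeps every part within chunk_size.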
…gth))
Codex round-23 [high] finding: `backtrack_beam` cloned each
beam's full `Vec<PathPoint>` on every stay/change branch. With
`beam_width = 2` and 4 branches per outer iteration over `T`
frames, total path-copy cost was O(T²) — ~72 MB of memory
churn for T=1500 (a 30 s chunk), ~3.2 GB for T=10000. The
trellis budget caps `T * num_tokens` at 32 M cells, so a
long-`T`/modest-`num_tokens` case can pass that gate while
still drowning the alignment worker in path-clone allocation
and surfacing as a watchdog timeout instead of a bounded
result.
Replaced the per-beam `Vec<PathPoint>` with a flat arena of
`BeamNode { token_index, time_index, score, point_score, prev:
Option<u32> }`. Active beams are now `Vec<u32>` indices into
the arena. Each branch is O(1): push one new node + record its
predecessor index. The path is reconstructed once at the end
by walking the predecessor chain from the winning beam back to
the seed (O(T) work, O(T) allocation, total).
Memory bound: arena grows by at most `beam_width * 2` per
iteration × T iterations = ~4T nodes worst case. At T=1500 /
beam=2 that's ~96 KB; at T=10000 it's ~1 MB. Linear in T
instead of quadratic.
Path reconstruction uses the natural ascending-time order from
the chain walk (winner has the smallest time, seed has the
largest). The leading-blank fill (frames `[0, winner.t)` →
token-0 + blank emission, WhisperX visualisation convention)
is prepended in ascending time too, so no final reverse is
needed. The previous build-then-reverse code was an artefact
of the path being constructed in descending-time order during
the cloning approach.
Also dropped the now-unused `PathPoint` struct (replaced
end-to-end by `BeamNode` internally and `PathPointPublic` at
the API boundary).
Verification:
- Testing caught an early bug where prepending the leading-blank
fill to a vector that was about to be reversed put those
frames in the wrong relative position. Round-trip path order
is pinned by the existing `backtrack_beam_simple_two_token_path`
test (T=3, tokens=[1,2]: path[0].t=0, path[2].t=2).
- 282 lib tests pass.
- 7 whisperx_unit_parity tests pass.
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998,
below_0.5=0 across 854 word pairs (bit-identical to the
pre-refactor numbers — confirms the algorithmic semantics
are preserved).
… reject Two findings from `/codex:adversarial-review --base main`: [medium] `build_speech_frames` did not clamp sub-segment bounds to `[0, n_samples]`, while `build_speech_mask` did. With the WhisperX effective ratio's last frame interval `[(T-1)*spf, T*spf)` extending just past `n_samples`, an overshoot sub-segment could "credit" the trailing frame with phantom-sample overlap — marking it as speech against audio that doesn't exist, while step 0's silence-mask had nothing to zero in that region. `compose_words` then accepts words at `coverage >= 0.5` and only clamps the emitted end timestamp, so a word spanning the last real frame plus the overshoot frame can survive over mostly silent / nonexistent audio. Fix: pass `n_samples: u64` into `build_speech_frames` and clamp `seg_start` / `seg_end` to `[0, n_samples_i64]` exactly like `build_speech_mask` does. Both helpers now agree on the same coordinate contract. Added `build_speech_frames_clamps_overshoot_seg_to_chunk_end` regression covering full-overshoot and partial-overshoot sub-segments — both must yield `false` on the trailing frame. [medium] `WhisperPoolOptions::with_worker_count(0)` / `set_worker_count(0)` were silently accepted, and `WhisperPool::new` then spawned no worker threads (`for _ in 0..0`) while still binding the work channel. Chunks would enter `in_flight` with no receiver capable of producing results, stalling the pump/drain loop until the configured timeout fires — a fatal stall instead of a clear construction error. Two-layer defence: - Setter (`set_worker_count`) and builder (`with_worker_count`) panic on `value == 0`. Same pattern as `Aligner::set_sample_rate(0)` and the transcriber-config setters that already panic on degenerate values; catches the explicit programmer-error case at the API boundary.
- `WhisperPool::new` returns `RunnerError::WhisperContextLoad` with a contract-violation message when `worker_count == 0` in the deserialised config — covers the serde path that bypasses the setter. Two new `#[should_panic]` regressions exercise both API shapes; a follow-up integration test for the `WhisperPool::new` error path is gated behind needing a real `WhisperContext`, deferred. Verification: - 285 lib tests pass (was 282; +1 overshoot regression, +2 zero-worker panic regressions). - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across 854 word pairs (bit-identical to pre-fix; the overshoot clamp doesn't bite the parity fixtures because their `raw_asr_segments[]` are always inside chunk bounds — the fix is defensive against malformed external callers).
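The shared clamping contract from the first finding can be sketched as a tiny standalone helper (illustrative names; the real helpers operate on chunk-local frame intervals):

```rust
/// Clamp a sub-segment to [0, n_samples] before any overlap accounting,
/// so a segment overshooting the chunk end cannot credit phantom samples.
/// Fully out-of-range segments contribute nothing. Illustrative sketch.
fn clamp_seg(start: i64, end: i64, n_samples: i64) -> Option<(i64, i64)> {
    let s = start.clamp(0, n_samples);
    let e = end.clamp(0, n_samples);
    if e > s { Some((s, e)) } else { None }
}

fn main() {
    // Chunk of 1000 samples; an overshooting segment is truncated.
    assert_eq!(clamp_seg(900, 1200, 1000), Some((900, 1000)));
    // A segment entirely past the chunk end vanishes.
    assert_eq!(clamp_seg(1100, 1200, 1000), None);
    // Negative starts (before the chunk) are clamped up to 0.
    assert_eq!(clamp_seg(-50, 100, 1000), Some((0, 100)));
}
```

Applying the same clamp in both helpers is what restores the shared coordinate contract the finding describes.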
…sion) Codex round-22 [medium] finding: `log_softmax_with_finite_guard` computed `log_z = max + sum.ln() as f32`, then `lp = x - log_z`. For a row with a large common offset like `[1e20, 1e20]`, `sum.ln() = ln(2) ≈ 0.69` rounded away when added to `max = 1e20` in f32 — `1e20 + 0.69 ≈ 1e20` — and every `lp` collapsed to `0.0` instead of the correct `-ln(2) ≈ -0.693`. The finiteness checks passed but the outputs were no longer log-probabilities, hiding a backend / export numeric failure as plausible alignment input. Fix: keep the subtraction of `max` in shifted f64 throughout. Compute `log_z_shifted = sum.ln()` (a small f64 in `[0, ln(V)]` under any finite input), and per element `lp = ((x as f64 - max as f64) - log_z_shifted) as f32`. Both terms in the subtraction are bounded — `(x - max) <= 0` and `log_z_shifted >= 0` whenever any element equals `max` — so `lp_f64 ∈ (-∞, 0]` regardless of `max`'s magnitude. Casting to f32 preserves the precision that mattered. Kept the per-element finiteness check: pathological inputs like `[f32::MAX, -f32::MAX]` can still underflow `lp_f64 as f32` to `-inf`, and surfacing that as `ModelInferenceFailed` keeps the backend-numeric-failure path typed (a `-inf` slipping into `data` would let Viterbi return `NoAlignmentPath` recoverably and mask the bug as `words: []`). The existing `log_softmax_rejects_finite_extremes_that_overflow_lp` test still pins this behaviour. Replaced the previous `log_z`-finiteness check with a `log_z_shifted`-finiteness check (the same value semantically; only fires on the all-(-inf) case, already covered by `log_softmax_rejects_all_neg_infinity_row`). New regression test: - `log_softmax_large_common_offset_normalises_to_unit_exp_sum`: feeds `[1e20, 1e20]`, asserts the output exp-sum is 1.0 and each lp is ≈ -ln(2). Pre-fix this test would fail with exp-sum=2.0 (each lp=0.0). Verification: - 286 lib tests pass (was 285, +1 large-offset regression). - 7 whisperx_unit_parity tests pass. 
- 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; ORT's wav2vec2 logits are bounded to typical ranges, so the f64-shift produces the same f32 outputs as the previous f32-fold for normal inputs — the fix is defensive against malformed model exports).
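The shifted-f64 computation described above can be sketched as a standalone function (the real code additionally guards finiteness and surfaces failures as typed errors):

```rust
/// Log-softmax with the max-shift kept in f64, as described above.
/// Both terms of the final subtraction are bounded — (x - max) <= 0 and
/// log_z_shifted in [0, ln(V)] — so the f32 cast preserves the precision
/// that matters even when max is huge. Illustrative sketch.
fn log_softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Sum of exp in the shifted domain; each term is <= 1, no overflow.
    let sum: f64 = row.iter().map(|&x| ((x as f64) - (max as f64)).exp()).sum();
    let log_z_shifted = sum.ln();
    row.iter()
        .map(|&x| (((x as f64) - (max as f64)) - log_z_shifted) as f32)
        .collect()
}

fn main() {
    // The failing case: a large common offset must give lp ≈ -ln(2),
    // not 0.0 (the pre-fix f32 fold collapsed 1e20 + ln(2) back to 1e20).
    let lp = log_softmax(&[1e20, 1e20]);
    let expected = -(2.0f32).ln();
    assert!((lp[0] - expected).abs() < 1e-6);
    // Outputs are genuine log-probabilities: exp-sum is 1.
    let exp_sum: f32 = lp.iter().map(|&p| p.exp()).sum();
    assert!((exp_sum - 1.0).abs() < 1e-6);
}
```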
Codex round-25 [medium] finding: for inputs shorter than wav2vec2's 400-sample receptive field, the encode path pads the buffer to 400 before invoking ORT. The encoder then emits `T` frames whose stride corresponds to the PADDED length (`padded.len() / (T - 1)`), but downstream `samples_per_frame` uses the original `samples.len() / (T - 1)`. The two views disagree by exactly the padding ratio: `compose_words` emits word ranges shrunk by `samples.len() / padded.len()` and `build_speech_frames` classifies frames against intervals that don't match the encoder's own view. A short utterance at the edge can be silently dropped by the speech post-pass or get a word span projected onto the wrong part of the original audio rather than recognized as padding-only. Codex's recommendation offered two paths: - Carry both lengths explicitly (validate against padded length, mark padded frames non-speech, clamp emitted ranges to original). - Skip alignment for sub-400-sample chunks. Took the second path. At 25 ms or less, a chunk cannot realistically contain a meaningful CTC path through any non-trivial transcript anyway — the trellis budget requires `T >= num_tokens`, and `T` for a sub-400-sample chunk is 1-2 frames. The simplest safe response is an early-out matching the existing `EmptyText` / empty-tokens short-circuit: return `Ok(empty AlignmentResult)`. Alignment becomes a no-op on degenerate input rather than a data-loss path. The padding code below the guard remains as defense-in-depth (it's now provably unreachable, but the explicit `if normalized_samples.len() < 400` branch documents the contract and protects against a future refactor that accidentally moves the guard). Added regression test `sub_400_sample_chunk_short_circuits_to_empty_result`: feeds a 200-sample buffer with realistic transcript text and asserts the call returns `Ok(empty)` rather than `Err` or mis-projected word ranges. 
Skips when the wav2vec2 fixture isn't present (offline / `WHISPERY_OFFLINE=1`); same gating as the existing `empty_normalised_text_returns_empty_alignment_result` test. Verification: - 287 lib tests pass (was 286, +1 sub-400 regression). - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; real fixtures don't trigger the guard, all segments are >> 400 samples).
…sample-rate setter Codex round-26 surfaced two findings, both legitimate. [medium] Backtracker reserve can exceed the trellis budget (`trellis_beam.rs:486`). The trellis budget caps `T * num_tokens` at 32 M cells, NOT `T` alone. A degenerate `num_tokens = 1` pass-through can therefore drive `T` up to 32 M frames while still passing the trellis guard. The previous backtracker code then issued a `Vec::with_capacity(1 + beam_width * 2 * T)` pre-reserve — ~96 KB at typical T ≈ 1500, but ~3 GB at T = 32 M, allocated up-front before the per-iteration abort check could fire. Dropped the pre-reserve entirely (`Vec::new()`); push-driven growth is amortised O(1) and bounded by the same per-iteration abort flag the loop already honours. For typical T ≈ 1500 the doubling churn is ~400 KB total — dwarfed by the trellis itself. Avoids OOM/panic on pathological input by letting the existing abort path bound work in-band rather than front-loading allocation. [medium] Public sample-rate override is ignored (`aligner.rs:192-204`). `set_sample_rate` / `with_sample_rate` mutated `self.sample_rate`, but the alignment algorithm never consumed that field — silence-mask, frame timebase, and stride checks all hard-coded `SAMPLE_RATE_HZ` (16 kHz). A caller using the override for a non-16 kHz model got plausible-but-wrong masks and word timestamps instead of a configuration error. Removed the `sample_rate: u32` struct field entirely along with both setters and builders. The `sample_rate()` getter now returns `SAMPLE_RATE_HZ` directly — informational only, documents what the aligner expects. Non-16 kHz wav2vec2 support is out of scope for v1; if it becomes in-scope, the right shape is to wire the override through silence-mask, frame-timebase, and stride checks together (with end-to-end tests) rather than expose a knob that mutates a dead field.
This is a public API breaking change (callers of `set_sample_rate` / `with_sample_rate` will fail to compile), but pre-1.0 those methods never worked correctly anyway. Verification: - 287 lib tests pass. - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; both fixes are behaviour-neutral on real chunks — the arena pre-reserve was a memory-budget concern, not a correctness one, and the sample-rate setters were never read in the algorithm).
…to union Codex round-27 [medium] finding: `build_speech_frames` summed each sub-segment's per-frame intersection independently, so two overlapping ranges (e.g. `[0, 100]` and `[50, 150]` inside frame 0's `[0, 320)` interval) contributed `100 + 100 = 200` to the frame's overlap counter even though their UNION only covers 150 samples. With `min_overlap_samples = 160`, the frame would clear the threshold via raw sum (200 ≥ 160) despite its union being below it (150 < 160), disagreeing with `build_speech_mask`'s union semantics (per-sample boolean OR) and letting `compose_words` retain words whose audio is mostly masked silence. The contract is "≥ 50 % of the frame's samples are inside the VAD speech UNION", which matches `build_speech_mask`'s boolean-OR aggregation. Fix: clamp + sort + merge sub-segments into a non-overlapping union before per-frame accumulation. Touching segments (`s == last.1`) merge too, so [0, 80] + [80, 160] becomes a contiguous [0, 160] (matches the `build_speech_mask` per-sample OR semantics for a single contiguous voiced span). Real wav2vec2 fixture VAD segments don't overlap (silero / VAD upstream guarantees disjoint intervals), so the parity numbers are bit-identical to pre-fix. The fix is defensive against external `align_chunk` callers that pass arbitrary VAD output — which the public API documents as accepting in chunk-local 1/16000 timebase but doesn't otherwise validate. Three new regression tests in `compose::tests`: - `build_speech_frames_uses_union_not_sum_for_overlapping_segments`: two overlapping segs whose UNION (150) < threshold (160) but SUM (200) > threshold — must classify silent. Plus a triple-overlap stress case (sum 240, union 120, classifies silent). - `build_speech_frames_treats_adjacent_segments_as_contiguous`: touching segs `[0, 80]` + `[80, 160]` form a [0, 160] union that exactly hits the threshold → speech. Verification: - 289 lib tests pass (was 287; +2 union tests, +1 adjacent test). 
- 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix).
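The clamp + sort + merge step described above is the classic interval-union pass; a minimal sketch (illustrative names, clamping omitted for brevity):

```rust
/// Collapse sub-segments into a disjoint union before per-frame overlap
/// accumulation, so overlapping ranges cannot double-count samples.
/// Touching ranges (s == last end) merge too, matching the per-sample
/// boolean-OR semantics described above. Illustrative sketch.
fn merge_to_union(mut segs: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    segs.sort_unstable();
    let mut union: Vec<(u64, u64)> = Vec::new();
    for (s, e) in segs {
        match union.last_mut() {
            // Overlapping or touching: extend the previous range.
            Some(last) if s <= last.1 => last.1 = last.1.max(e),
            _ => union.push((s, e)),
        }
    }
    union
}

fn main() {
    // The finding's case: raw sum of lengths is 200, union is only 150.
    assert_eq!(merge_to_union(vec![(0, 100), (50, 150)]), vec![(0, 150)]);
    // Touching segments form one contiguous voiced span.
    assert_eq!(merge_to_union(vec![(0, 80), (80, 160)]), vec![(0, 160)]);
}
```

Accumulating per-frame overlap against this union instead of the raw segment list is what makes the ≥ 50 % threshold test the union coverage, not the sum.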
Codex round-28 [medium] finding: `tokenize_with_word_map` took `wildcard_chars_per_word: &[u32]` — a single TOTAL count per word — and pushed every wildcard at the END of the word's encoded chars. That made leading punctuation like `"hello` indistinguishable from trailing `hello"` once it reached the CTC graph: both produced `[h, e, l, l, o, *]`. The wildcard frames always landed at the trailing-frames zone, biasing the word's emit-end timing past where the leading-punct silence actually sat in the audio. WhisperX's reference behaviour keeps `*` placeholders in source order. Fix: split the per-word count into `(prefix, suffix)` tuples end-to-end: - `NormalizedText::wildcard_chars_per_word: Vec<u32>` → `NormalizedText::wildcard_boundary_per_word: Vec<(u32, u32)>`. `with_wildcards` now takes the tuple form; getter renamed to `wildcard_boundary_per_word()`. - `EnglishNormalizer` computes prefix vs suffix separately by splitting `strip_word_punct` into a left-trim then a right-trim and counting each side. Hyphenated words: the FIRST piece inherits the original word's prefix-stripped count, the LAST piece inherits the suffix-stripped count; middle pieces get `(0, 0)` (the hyphen is treated as a real word boundary, not a wildcard frame). - `tokenize_with_word_map` takes `wildcard_boundary_per_word: &[(u32, u32)]`. For each word it pushes prefix wildcards BEFORE the encoded chars and suffix wildcards (plus any internal-skipped chars discovered during this pass) AFTER. Internal-skipped (`.` for abbreviations) still lumps into the suffix bucket — placing it exactly in source order would require interleaving wildcards mid-word during per-char encoding, which the current flow doesn't support. The suffix-leaning approximation is strictly better than the previous all-at-end design for the common leading-vs-trailing case Codex flagged; the rare internal case takes the same hit it already had. 
Three new regression tests in `tokenize::tests`: - `leading_wildcards_land_before_encoded_chars` — pins the new prefix-injection contract for `"hello` (prefix=1, suffix=0): `*` must land at index 0. - `trailing_wildcards_land_after_encoded_chars` — renamed + hardened version of the previous test; pins suffix-only behaviour (`hello"`). - `paired_wildcards_bracket_encoded_chars` — pins the paired-punctuation case `(hello)`: prefix at start, suffix at end, only encoded chars in the middle. The length-mismatch test was renamed to `wildcard_boundary_per_word_length_mismatch_errors`. Verification: - 291 lib tests pass (was 289; +3 wildcard placement tests, −1 renamed/dropped from the previous total-count contract). - 7 whisperx_unit_parity tests pass. - 5 dia parity fixtures: median 0.995-0.999, mean 0.983-0.998, below_0.5=0 (bit-identical to pre-fix; the dia fixtures' source text doesn't trigger the leading-punctuation case where the wildcard order matters — the fix is defensive against transcripts that DO have leading boundary punctuation).
…tcher
Codex round-29 findings:
1. min_speech_coverage validation (high). Out-of-range values were
stored verbatim, so a config typo of 1.5 silently dropped every
word (no word's coverage can exceed 1.0). Now coerce: clamp to
[0.0, 1.0]; treat NaN as the documented default (0.5).
2. inject_wordlevel_model_type used naive substring search and
brace counting (medium). A user tokenizer.json whose string
values contained "model", `{`/`}`, or "type" substrings could
be misdetected and patched at the wrong byte range. Replaced
with a quote-aware scanner that tracks in_string / escape
state across three helpers (find_top_level_object_value_open,
find_matching_close_brace, has_top_level_key).
5-fixture parity bit-identical to round-28
(median 0.995-0.999, below_0.5=0 across all).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
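The coercion in finding 1 above reduces to a tiny total function (illustrative standalone sketch; the real code stores the result on the config struct):

```rust
/// Coerce min_speech_coverage into a usable range: clamp to [0.0, 1.0],
/// and treat NaN as the documented default of 0.5. Illustrative sketch.
fn coerce_min_speech_coverage(value: f32) -> f32 {
    if value.is_nan() { 0.5 } else { value.clamp(0.0, 1.0) }
}

fn main() {
    // The config typo from the finding: 1.5 no longer drops every word,
    // since no word's coverage can exceed 1.0.
    assert_eq!(coerce_min_speech_coverage(1.5), 1.0);
    assert_eq!(coerce_min_speech_coverage(-0.1), 0.0);
    assert_eq!(coerce_min_speech_coverage(f32::NAN), 0.5);
    assert_eq!(coerce_min_speech_coverage(0.5), 0.5);
}
```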
Codex round-30 findings: 1. Per-packet ASR override binding (medium). The override snapshot was taken at chunk *emission* time. A chunk whose audio was pushed under packet A's override but whose VAD-driven close happened in packet B silently picked up B's override (or None) — language hints, prompts, and strategy overrides got attached to the wrong audio under normal streaming VAD latency. Fix: cut state machine snapshots `current_override` when accumulation starts (current_start: None → Some), carries it on `MergedChunk.override_at_start`, and dispatch.on_emit reads from there. First-override-wins semantics. New regression test (src/core/transcriber.rs): `override_binds_to_packet_that_started_chunk_not_packet_that_closed_it`. 2. poll_transcript / poll_error silently dropped fatal runner errors (high). `let _ = self.drive_one_step()` dropped `WhisperPoolShutdown` and `Backpressure` errors — a dead pool looked like an empty stream. Fix: change signatures to `Result<Option<_>, RunnerError>`; stash any drive_one_step error in a `pending_fatal` slot; surface it once buffered transcripts/errors have been drained (so already-arrived results aren't lost). 5-fixture parity bit-identical to round-29 (median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-31 finding: whisper-rs's set_language and
set_initial_prompt build CStrings via CString::new(...).expect(...).
A SmolStr or Lang::Other carrying an interior NUL byte therefore
panicked the worker thread mid-inference. Single-worker pools
shut down; multi-worker pools left the chunk stranded with no
WorkFailure (drain hangs until drain_timeout). Public callers
reach these strings through AsrParamsOverride::with_initial_prompt
and Lang::Other without validation.
Validate before crossing the FFI boundary in full_params_from
and surface the failure as an in-band
WorkFailure::AsrFailed { kind: BackendError } for the offending
chunk. The pool stays alive; subsequent chunks transcribe
normally. Diagnostic message includes byte length but not
content (initial_prompt may be sensitive).
Two new unit regressions cover Lang::Other("xx\0yy") and
initial_prompt = "hint\0poison".
5-fixture parity bit-identical to round-30
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
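The pre-FFI validation described above can be sketched as a standalone check (illustrative names and error type; the real code returns a typed WorkFailure):

```rust
/// Reject interior NUL bytes before CString construction, so a poisoned
/// language hint or prompt becomes an in-band per-chunk failure instead
/// of a worker-thread panic. The diagnostic reports byte length only —
/// never content, since prompts may be sensitive. Illustrative sketch.
fn validate_no_interior_nul(field: &str, value: &str) -> Result<(), String> {
    if value.bytes().any(|b| b == 0) {
        Err(format!(
            "{field} contains an interior NUL byte (len {})",
            value.len()
        ))
    } else {
        Ok(())
    }
}

fn main() {
    assert!(validate_no_interior_nul("initial_prompt", "hint").is_ok());
    let err = validate_no_interior_nul("initial_prompt", "hint\0poison").unwrap_err();
    assert!(err.contains("interior NUL"));
    assert!(!err.contains("poison")); // content is never echoed
}
```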
Codex round-32 findings: 1. ASR timeout silent fail-open (high). run_with_temperature_ladder only consulted abort_flag in the Err path. If the watchdog fired during inference but whisper.cpp finished before observing the next abort-callback poll, an apparently-successful AsrResult reached the caller — violating the per-job timeout AND skipping timeout-streak recycling (was_timeout was derived from the un-rewritten outcome). Fix: post-watchdog-join, re-check abort_flag and rewrite outcome to WorkerHangTimeout. Mirrors alignment_pool.rs:266's existing pattern. 2. Unbounded Lang::Other intern leak (medium). intern_lang_str leaks one &'static str per distinct language hint and never evicts. Lang::Other(SmolStr) is public, so a tenant or buggy caller producing unique high-cardinality hints would grow process memory without bound. Fix: validate before interning. Restrict to lowercase ASCII letters [a-z], len 1..=8 — covers every whisper.cpp-recognized code with comfortable headroom. Reject as in-band WorkFailure::AsrFailed (consistent with the round-31 NUL-rejection style); diagnostic message does NOT echo offending bytes (hint can be public input). 5-fixture parity bit-identical to round-31 (median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
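The intern guard from finding 2 above is a one-line shape check (illustrative standalone predicate; the real code rejects failures as an in-band WorkFailure before interning):

```rust
/// Only lowercase ASCII codes of length 1..=8 may reach the intern table,
/// bounding its cardinality against high-cardinality hint spam.
/// Illustrative sketch of the validation described above.
fn is_valid_lang_hint(hint: &str) -> bool {
    (1..=8).contains(&hint.len()) && hint.bytes().all(|b| b.is_ascii_lowercase())
}

fn main() {
    assert!(is_valid_lang_hint("en"));
    assert!(is_valid_lang_hint("haw")); // longer whisper.cpp codes fit too
    assert!(!is_valid_lang_hint("")); // empty rejected
    assert!(!is_valid_lang_hint("EN")); // uppercase rejected
    assert!(!is_valid_lang_hint("en\0")); // NUL is not ASCII lowercase
    assert!(!is_valid_lang_hint("waytoolong")); // > 8 bytes rejected
}
```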
Codex round-33 findings: 1. Serde bypassed VadSegment invariant (high). derive(Deserialize) skipped VadSegment::new's end_sample > start_sample check. Reversed-range serialized input would wrap sample_count() to ~u64::MAX and trip the cut state machine's hard-split assert. Fix: replace derive with a custom Deserialize that re-validates at the serde boundary and surfaces a typed deserialize error. 2. max_attempts=0 silently dropped all ASR (medium). The retry ladder iterates `for _attempt in 0..max_attempts` — zero meant zero state.full() calls and AllTemperaturesFailed for every chunk. Fix: panic on 0 in set/with_max_attempts (matches the codebase's existing zero-rejection style for sizing config like worker_count, chunk_size, max_in_flight); add a serde deserialize_with that surfaces the violation as a typed error. 3. is_idle() ignored pending_fatal (medium). A poll that stashed a fatal could be shadowed by is_idle()=true on the next call, so callers polling "until idle" would stop without ever observing the structural failure. Fix: add !pending_fatal to the conjunction. Extracted is_idle_inner const fn so the contract is testable without a real ManagedTranscriber. 5-fixture parity bit-identical to round-32 (median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-34 findings:
1. Invalid n_threads crossed FFI unchecked (high). whisper.cpp's
decoder loops do `vec<thread>(n_threads - 1)` when n_threads
isn't 1; 0 underflows to a huge alloc, negatives abort. The
public AsrParams::set/with_n_threads accepted any i32 and
forwarded to set_n_threads. Fix: panic on `< 1` in setters,
serde deserialize_with rejects, defensive check in
full_params_from returns chunk-level WorkFailure.
2. Dead GPU-default cfg (medium). default_worker_count and
default_timeout_streak_threshold checked feature names like
`_whisper_cuda` that don't exist in Cargo.toml — the cfg
branches always returned false, so GPU users silently got
CPU defaults (oversubscribing one GPU with multiple workers).
Fix: drop the dead cfg, add explicit
WhisperPoolOptions::new_for_gpu(model_path) constructor that
picks single-worker / longer-streak defaults appropriate for
serialised GPU inference. CPU `new()` unchanged.
3. Lang serde shape mismatched docs (medium). derive(Serialize,
Deserialize) produced Rust variant names `"En"` and
`{"Other":"xx"}`, contradicting the documented lowercase ISO
string format. Fix: custom Serialize/Deserialize that writes
`as_str()` and reads via `Lang::from_iso639_1`, with the same
shape validation as round-32 (lowercase ASCII letters,
1..=8 bytes). Canonicalisation invariant preserved across
serde — `"en"` deserialises to Lang::En, not
Lang::Other("en"). Legacy derive-shaped JSON now rejected.
5-fixture parity bit-identical to round-33
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0 across all).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Configs are usually human-edited and humans naturally use mixed
case ("EN", "En", "en"). The ISO 639 standard treats codes as
case-insensitive — make Lang's serde deserializer accept any
case and normalize to canonical lowercase before
Lang::from_iso639_1.
Wire format and FFI behavior unchanged: serialization always
emits lowercase, the alignment-stage validation in
runner/whisper_pool.rs is unchanged (it never sees uppercase
because both named variants and post-deserialize Lang::Other
inner strings are always lowercase).
Round-trip asymmetry by design: "EN" in → Lang::En → "en" out.
Side benefit for migration from the round-34 derive switch:
legacy configs using Rust-variant-shape strings like "En"
(uppercase first letter) continue to work; only the
externally-tagged {"Other":"xx"} form remains rejected.
Two new regression tests; existing serde_rejects_uppercase
removed (it's no longer the contract).
5-fixture parity bit-identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-35 findings: 1. process_packet partial-commit (high). The previous order (push_samples → push_vads loop) committed samples first, then pushed VADs one at a time. A failing VAD #N left samples + VADs 0..N-1 committed; caller could not safely retry the packet (same starts_at would trip PtsRegression). Fix: add precheck_vad_segments on Transcriber that validates eof, timebase, ordering, and high-water (against a floor projection of post-push high water) WITHOUT mutating state. Call it before push_samples in process_packet so process_packet is now either fully atomic or returns Err with no commit. The precheck is conservative (floor projection assumes no gap-fill), which can false-reject VADs targeting the rare zero-fill region — silero never emits such segments in practice. False-accept is impossible. 2. drain idle/timeout reporting (medium). drain() used self.core.is_idle() which ignored pending_transcripts, pending_errors, AND the round-30 pending_fatal slot. A fatal recorded behind buffered output could coexist with drain returning Ok. Fix: drain now (a) takes pending_fatal eagerly if set and returns Err, (b) checks self.is_idle() (the pub predicate including all four conjuncts) for the success case, and (c) reports DrainTimeout.in_flight via a new in_flight_chunk_count() that returns cut_pending.len() + in_flight.len() — the documented chunk count, not the sample count that buffered_samples() returned (which can read 0 after trim while real chunks remain). 7 new precheck regression tests cover valid/invalid orderings, ahead-of-projected-water, AfterEof, and OutputTimebaseUnset. 5-fixture parity bit-identical (median 0.995-0.999, mean 0.983-0.998, below_0.5=0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-36 [critical]: the previous
`recv_timeout(timeout).is_err()` matched both `Err(Timeout)` (a
real hang) AND `Err(Disconnected)` (the clean cancel when the
worker drops cancel_tx after fast inference). Round-32's
post-watchdog `abort_flag` re-check then rewrote every successful
`AsrResult` into `WorkFailure::WorkerHangTimeout` — every
transcript would be lost in production ASR runs. The 5-fixture
parity suite hides this because it injects pre-computed WhisperX
ASR dumps and only exercises the alignment path.
Fix: extract a `watchdog_should_signal_timeout` helper that
pattern-matches on `RecvTimeoutError`. Only `Timeout` flips the
flag; `Ok(())` and `Disconnected` are clean exits. Mirrors the
alignment pool's existing watchdog pattern at
runner/alignment_pool.rs:209-220.
5 new regression tests cover: helper-level discrimination of each
variant; end-to-end "drop tx does NOT flip flag"; end-to-end
"real timeout DOES flip flag". Parity unchanged
(alignment-only path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
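The variant discrimination at the heart of round-36 can be sketched with std's mpsc channel, which the entry's `RecvTimeoutError` implies. The helper name matches the commit text; the two demo functions are invented to show the failure mode and the fix.

```rust
// Sketch of the round-36 watchdog fix: only a genuine Timeout should
// trip the hang flag. Ok(()) and Disconnected (the worker finished and
// dropped cancel_tx) are clean exits — the old `.is_err()` check
// wrongly treated Disconnected as a hang.
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

pub fn watchdog_should_signal_timeout(res: Result<(), RecvTimeoutError>) -> bool {
    matches!(res, Err(RecvTimeoutError::Timeout))
}

/// Fast inference: the worker drops its end immediately, so the
/// watchdog sees Disconnected, not Timeout.
pub fn fast_worker_drops_tx() -> bool {
    let (tx, rx) = mpsc::channel::<()>();
    thread::spawn(move || drop(tx));
    watchdog_should_signal_timeout(rx.recv_timeout(Duration::from_secs(5)))
}

/// Real hang: nobody signals within the deadline, so the watchdog
/// sees Timeout and must flip the flag.
pub fn hung_worker() -> bool {
    let (tx, rx) = mpsc::channel::<()>();
    let res = rx.recv_timeout(Duration::from_millis(10));
    drop(tx);
    watchdog_should_signal_timeout(res)
}
```

With the old `.is_err()` predicate, `fast_worker_drops_tx` would have returned true and every successful result would have been rewritten into a hang failure.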
Codex round-37 findings (both high):
1. AsrParamsOverride.{language_hint,initial_prompt} use
Option<Option<T>> for the absent / clear / set distinction,
but the derived Deserialize collapsed JSON `null` and "field
absent" into the same outer None — making
`{"language_hint": null}` indistinguishable from omitting
the field, defeating the documented "clear this override"
contract. Locked-language streams could not unlock via JSON
config. Fix: deserialize_with helpers
`deserialize_double_option_lang` /
`deserialize_double_option_smolstr` preserve the three-way
distinction at the serde boundary. Lang's case-insensitive
ISO deserializer continues to apply through the helper.
Four new round-trip tests cover absent/null/value plus the
three-state round-trip.
2. SampleBuffer::append committed the anchor, zero-filled, and
advanced absolute_sample_offset for empty packets at a
forward delta within gap tolerance. A "heartbeat" call at a
slightly future PTS would create phantom audio and reject
the next real packet at the originally-expected PTS as
PtsRegression — caller had no recovery path. Fix: empty
packets at delta > 0 are now a true no-op for stream state
(validation still runs). delta == 0 empty packets and
non-empty packets unchanged. Three new regression tests cover
the heartbeat-then-real-audio sequence, empty-first-then-
forward-heartbeat-then-real-audio, and exactly-on-time empty
no-op.
5-fixture parity bit-identical
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
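The absent / clear / set distinction from finding 1 can be made concrete without serde. The FieldUpdate enum and classify helper below are invented for exposition; in the real code the outer Option comes from serde, where a derived Deserialize maps both a missing field and JSON null to the outer None, and the deserialize_with helpers restore the three-way split by always wrapping a present value in Some.

```rust
// Illustrative (serde-free) sketch of the three states that
// Option<Option<T>> encodes for AsrParamsOverride fields.
#[derive(Debug, PartialEq)]
pub enum FieldUpdate<T> {
    Absent,  // field omitted from the JSON object: keep the current override
    Clear,   // `"language_hint": null`: drop the override (unlock the language)
    Set(T),  // `"language_hint": "en"`: install a new override
}

pub fn classify<T>(raw: Option<Option<T>>) -> FieldUpdate<T> {
    match raw {
        None => FieldUpdate::Absent,
        Some(None) => FieldUpdate::Clear,
        Some(Some(v)) => FieldUpdate::Set(v),
    }
}
```

The round-37 bug was exactly the collapse of the first two arms: with the derived impl, `{"language_hint": null}` arrived as the outer None and was indistinguishable from omitting the field, so Clear was unreachable from JSON.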
Codex round-38, all four findings:
1. CString leak per ASR attempt (high). whisper-rs's set_language /
   set_initial_prompt use CString::into_raw() with no Drop, so each
   attempt leaked one C string. Build the FullParams template once
   per chunk and clone it for each temperature attempt. Reduces the
   leak from O(attempts × chunks) to O(chunks) — 6× at the default
   max_attempts = 6. FullParams: Clone shallow-copies the inner
   struct including the leaked *const c_char pointers; aliasing is
   safe because the strings have static (leaked) lifetime.
   Inter-chunk dedup deferred (abort_callback captures per-job
   state, more involved).
2. drain_timeout bypassed while a command is parked (high).
   pump_until_idle_or_progress looped forever on parked +
   no-progress; the drain deadline wasn't checked inside the loop.
   A worker stuck in state.full made drain hang past its configured
   timeout. Thread an Option<Instant> deadline through the pump and
   check it before wait_for_progress. process_packet / signal_eof
   keep deadline = None (existing behavior); drain passes
   Some(start + timeout). Now matches the documented timeout
   contract.
3. no_speech_threshold was a public no-op (medium). The knob
   existed in AsrParams, but full_params_from never forwarded it
   and the accept gate only checked avg_logprob / compression
   ratio. Forward it to set_no_speech_thold AND check post-decode:
   when mean no_speech_prob > threshold AND avg_logprob is too low
   to trust, short-circuit to an empty AsrResult with no
   temperature retries — temperature can't conjure speech the
   model already evaluated as absent. Mirrors WhisperX / OpenAI
   Whisper's silence detection.
4. First empty packet still committed the anchor (medium). Extends
   round-37's empty-packet fix: an empty FIRST push has delta == 0
   by definition (no prior anchor), so the round-37 bail
   (delta > 0) didn't fire and the first-push commit path ran.
   Result: a heartbeat at PTS X locked the stream there, and real
   audio at PTS 0 tripped PtsRegression / InconsistentTimebase.
   Now empty packets are a true no-op for stream state regardless
   of first-push; real first audio at any PTS becomes the actual
   first push.
5-fixture parity bit-identical
(median 0.995-0.999, mean 0.983-0.998, below_0.5=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
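The combined round-37/38 empty-packet rule can be sketched as below. This is a minimal hypothetical SampleBuffer: field and error names are invented, and gap-tolerance zero-fill is elided; the point is only that an empty packet never mutates stream state, even on the very first push.

```rust
// Minimal sketch of the empty-packet no-op rule from rounds 37-38.
#[derive(Default)]
pub struct SampleBuffer {
    anchor: Option<u64>, // PTS (in samples) of the first *real* audio
    len: u64,            // samples committed since the anchor
}

impl SampleBuffer {
    pub fn append(&mut self, pts: u64, samples: &[f32]) -> Result<(), String> {
        if samples.is_empty() {
            // Heartbeat: validation would still run here, but the packet is
            // a true no-op for stream state — no anchor commit, no zero-fill.
            return Ok(());
        }
        // Real first audio at any PTS becomes the actual first push.
        let anchor = *self.anchor.get_or_insert(pts);
        let expected = anchor + self.len;
        if pts < expected {
            return Err("PtsRegression".into());
        }
        // A forward gap within tolerance would zero-fill here (elided).
        self.len = (pts - anchor) + samples.len() as u64;
        Ok(())
    }
}
```

Before the round-38 fix, the first empty push took the first-push commit path and locked the anchor at the heartbeat's PTS, so real audio at an earlier PTS was rejected as a regression.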