Merged
6 changes: 3 additions & 3 deletions .codecov.yml
@@ -2,9 +2,9 @@ codecov:
require_ci_to_pass: false

ignore:
- - **benches/*
- - **examples/*
- - **tests/*
+ - "**benches/*"
+ - "**examples/*"
+ - "**tests/*"

coverage:
status:
342 changes: 308 additions & 34 deletions .github/workflows/ci.yml

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions .gitignore
@@ -16,3 +16,37 @@

/target
Cargo.lock
/models/*.onnx
/models/*.onnx.data
/models/*.pt
# Segmentation model is committed to git so it ships with the crate
# (`SegmentModel::bundled` under feature `bundled-segmentation`).
!/models/segmentation-3.0.onnx
# WeSpeaker ResNet34-LM is too large for crates.io (27 MB), but is
# committed to git so it can be served as a GitHub release asset.
# It is excluded from the crate tarball via `[package] exclude` in
# Cargo.toml so `cargo publish` doesn't accidentally ship it.
# The `.onnx.data` sidecar is no longer used — we ship the
# single-file packed form (works on ORT's CoreML EP loader, which
# fails to relocate external initializers).
!/models/wespeaker_resnet34_lm.onnx
**.claude/
docs/

# Spike-specific (kaldi-native-fbank parity)
spikes/kaldi_fbank/python/.venv/
spikes/kaldi_fbank/python/__pycache__/
spikes/kaldi_fbank/python/*.egg-info/
spikes/kaldi_fbank/python/uv.lock
spikes/kaldi_fbank/rust.csv
spikes/kaldi_fbank/python.csv

# Phase-0 parity capture: large local artifacts.
tests/parity/fixtures/*/clip_16k.wav
# verify_capture.py writes a backup before re-running.
tests/parity/fixtures/.*.backup/

# uv venv + setuptools editable-install scaffolding for tests/parity/python/.
tests/parity/python/.venv/
tests/parity/python/uv.lock
tests/parity/python/*.egg-info/
213 changes: 212 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,218 @@
# UNRELEASED

- # 0.1.2 (January 6th, 2022)
The pyannote-community-1 offline + streaming-offline pipelines now
ship in full: VBx clustering, PLDA, AHC, centroid + Hungarian
assignment, reconstruction, RTTM emission. The crate exposes both
the offline pipeline (one-shot batch) and the streaming-offline
variant (push voice ranges, finalize once at the end). End-to-end
DER vs pyannote 4.0.4 on the in-repo fixture suite is ≤ 0.4% on the
worst clip and bit-exact on the rest.

PUBLIC SURFACE

- **`diarization::offline`** — `OfflineInput` / `diarize_offline`:
caller-supplied segmentation + embedding tensors → diarization +
RTTM spans. No ORT inference inside; pair with `OwnedDiarizationPipeline`
(under `feature = "ort"`) for the full audio entrypoint.
- **`diarization::streaming::StreamingOfflineDiarizer`** — push
voice ranges incrementally via `push_voice_range(&mut seg, &mut emb,
...)`, call `finalize(&plda)` once to produce RTTM spans. Same
numerics as `diarize_offline` modulo plumbing.
- **`diarization::segment`** — `SegmentModel::bundled()` /
`from_file` / `from_memory` (default + `_with_options` variants);
segmentation-3.0 ONNX is embedded via `include_bytes!` under the
default `bundled-segmentation` feature.
- **`diarization::embed`** — `EmbedModel::from_file` /
`from_memory` (and `from_torchscript_file` under `feature = "tch"`).
WeSpeaker ResNet34-LM is BYO; fetch it from
`FinDIT-Studio/dia-models` on HuggingFace. The single-file packed
ONNX is the canonical form.
- **`diarization::plda`** — `PldaTransform::new()` (no args; weights
embedded via `include_bytes!`); CC-BY-4.0 with attribution
preserved in `NOTICE` and `models/plda/SOURCE.md`.
- **`diarization::cluster`** — `ahc`, `vbx`, `centroid`, `hungarian`
submodules expose the algorithmic primitives directly for callers
who want to wire their own pipeline.
- **`diarization::pipeline::assign_embeddings`** — the AHC + VBx +
centroid + Hungarian core, callable on already-projected
post-PLDA features.
- **`diarization::reconstruct`** — discrete grid + RTTM span emission
+ `try_discrete_to_spans` (fallible variant for direct callers).
- **`diarization::aggregate::count_pyannote`** — overlap-add per-frame
speaker-count tensor, bit-exact with pyannote.
- **`diarization::ep`** — opt-in ORT execution providers (CoreML,
CUDA, TensorRT, DirectML, ROCm, OpenVINO, WebGPU, …) gated by
per-EP cargo features and the `gpu` meta-feature. `auto_providers()`
helper picks compiled-in EPs at runtime.
- **`diarization::spill`** — `SpillOptions` + `SpillBytes` /
`SpillBytesMut` for file-backed mmap fallback above the
configurable threshold; protects multi-hour inputs from
OOM-aborting the pipeline.
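
The overlap-add idea behind the per-frame speaker count can be sketched roughly as below. This is a simplified illustration and not the crate's `count_pyannote` implementation: `overlap_add_mean`, `speaker_count`, the `(start_frame, values)` window layout, uniform window weights, and the final rounding step are all assumptions made for the example.

```rust
/// Average per-frame values from overlapping windows onto one global timeline.
///
/// Each window is `(start_frame, values)`. Hypothetical helper with uniform
/// weights; frames covered by no window are reported as 0.0.
fn overlap_add_mean(windows: &[(usize, Vec<f32>)], total_frames: usize) -> Vec<f32> {
    let mut sum = vec![0.0f32; total_frames];
    let mut hits = vec![0u32; total_frames];
    for (start, values) in windows {
        for (i, v) in values.iter().enumerate() {
            let frame = start + i;
            // Ignore any part of a window that runs past the timeline.
            if frame < total_frames {
                sum[frame] += *v;
                hits[frame] += 1;
            }
        }
    }
    sum.iter()
        .zip(&hits)
        .map(|(&s, &h)| if h == 0 { 0.0 } else { s / h as f32 })
        .collect()
}

/// Round the averaged activity to an integer speaker count per frame.
/// `f32::round` rounds half-way cases away from zero.
fn speaker_count(mean: &[f32]) -> Vec<u8> {
    mean.iter().map(|m| m.round() as u8).collect()
}
```

With two four-frame windows starting at frames 0 and 2, the two overlapping frames receive the mean of both windows and the remaining frames keep the single contributing value.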

ASYMMETRIC EP DEFAULT

- `SegmentModel::bundled()` / `::from_file()` auto-register
`diarization::ep::auto_providers()` so any compiled-in per-EP feature
accelerates segmentation with no caller code change.
- `EmbedModel::from_file()` does **NOT** auto-register EPs.
Empirically, ORT's CoreML EP miscompiles the WeSpeaker
ResNet34-LM graph and emits NaN/Inf on most realistic inputs
across every CoreML compute unit / model format / static-shape
knob; auto-on would crash the embed pipeline. Callers on a vetted
EP host opt in via `EmbedModelOptions::default().with_providers(...)`
and `EmbedModel::from_file_with_options(path, opts)`. See
`crate::ep` and `crate::embed::EmbedModel::from_file` docs.
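
A caller vetting an execution provider before opting the embed model into it might guard against the NaN/Inf failure mode described above with a check along these lines. This is a minimal sketch: `embedding_is_finite` is a hypothetical helper and is not part of the crate's API.

```rust
/// Returns true when the embedding is non-empty and every component is a
/// finite number (neither NaN nor +/- infinity).
///
/// Hypothetical helper for illustration only.
fn embedding_is_finite(embedding: &[f32]) -> bool {
    !embedding.is_empty() && embedding.iter().all(|v| v.is_finite())
}
```

Running such a check on a few representative clips per host is one way to decide whether a given EP is safe to pass to `with_providers(...)`.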

CORRECTNESS GUARANTEES

- **Bit-exact pyannote 4.0.4 parity** on the in-repo fixture suite:
01_dialogue, 02_pyannote_sample, 03_dual_speaker, 04_three_speaker,
and 05_four_speaker score DER 0.0000–0.0037. 06_long_recording
is `#[ignore]`d at the strict bit-level due to GEMM-roundoff drift
past T=1004; its per-frame coverage at DER 0.0019 is the
release-blocking metric.
- **SpillBytes / SpillBytesMut** are `Send + Sync`; the runtime EP
registration is per-session.
- **Cross-platform** spill: `posix_fallocate` on Linux,
`F_PREALLOCATE` on macOS, `SetFileValidData`/`SetEndOfFile` on
Windows; reservations happen before any mapped writes so we
never `SIGBUS` on `ENOSPC` mid-run.

TESTING

- 495 lib unit tests pass on default features; full DER suite
(in-repo + speakrs clips) at the bit-exact baseline.
- Parity tests under `src/*/parity_tests.rs` skip cleanly via
`parity_fixtures_or_skip!` when `tests/parity/fixtures/` is
absent (the published crate tarball excludes the fixtures to
stay under the 10 MiB crates.io limit).
- `tests/parity/run.sh` is a manual harness for end-to-end DER
validation against pyannote-on-disk; provide your own clip path
if running outside a workspace checkout.
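
The "skip cleanly when fixtures are absent" pattern can be sketched as a small macro. The version below is an assumption about how such a guard might look, not the crate's actual `parity_fixtures_or_skip!`; the name `fixtures_or_skip!` and the `run_parity_check` function are hypothetical.

```rust
/// Hypothetical stand-in for a "skip when fixtures are absent" guard.
/// Returns early from the enclosing function if the directory is missing.
macro_rules! fixtures_or_skip {
    ($dir:expr) => {
        if !std::path::Path::new($dir).is_dir() {
            eprintln!("skipping: fixture directory {:?} not found", $dir);
            return;
        }
    };
}

/// Example test body: `ran` records whether the guarded section executed.
fn run_parity_check(fixture_dir: &str, ran: &mut bool) {
    fixtures_or_skip!(fixture_dir);
    // Real comparisons against captured reference tensors would go here.
    *ran = true;
}
```

Because the macro expands to a plain `return`, a skipped test still reports as passed, which is why the skip message goes to stderr.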

BUILD

- Rust edition 2024, MSRV 1.95.
- `nalgebra 0.34`, `kodama 0.3` (AHC linkage), `kaldi-native-fbank 0.1`,
`pathfinding 4.15` (Hungarian), `mediatime`, `thiserror`,
`memmapix 0.9` + `bytemuck 1` + `tempfile 3` + `fs4 1` for the
spill backend, `rustix` (Linux/Android only, for `O_TMPFILE`).
- Optional features: `serde`, `tch`, `silero-vad`, plus 16 per-EP
features (`coreml`, `cuda`, `tensorrt`, `directml`, `rocm`,
`migraphx`, `openvino`, `webgpu`, `xnnpack`, `onednn`, `cann`,
`acl`, `qnn`, `nnapi`, `tvm`, `azure`) and a `gpu` meta-feature.

KNOWN LIMITATIONS / DEFERRED

- WeSpeaker embed model (~26 MiB) exceeds the crates.io 10 MiB
hard limit; not bundled. Fetch from
`FinDIT-Studio/dia-models` on HuggingFace at the pinned revision
documented in `scripts/download-embed-model.sh`, or set
`DIA_EMBED_MODEL_PATH` if you keep the model elsewhere.
- ORT CoreML EP cannot run the WeSpeaker graph correctly; the
asymmetric default (seg-auto, embed-CPU) ships as the workaround.
- FP16 / INT8 ONNX variants and TensorRT / OpenVINO IR / CoreML
`.mlpackage` formats are not provided; the canonical FP32
single-file ONNX runs on every ORT EP that doesn't have the
WeSpeaker miscompile.
- 06_long_recording (T=1004) hits a GEMM-roundoff partition drift
at the strict bit-exact level; tolerant per-frame coverage is in
`reconstruct::parity_tests::reconstruct_within_tolerance_06_long_recording`.
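
Resolving the embed model location from `DIA_EMBED_MODEL_PATH` could look roughly like the sketch below. The fallback path `models/wespeaker_resnet34_lm.onnx` is taken from the `.gitignore` entry in this change, but using it as the default, treating an empty variable as unset, and the function names are assumptions for illustration.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Assumed default location, relative to the working directory.
const DEFAULT_EMBED_MODEL: &str = "models/wespeaker_resnet34_lm.onnx";

/// Pick the override when it is set and non-empty, else the default.
/// Kept separate from the environment read so it is easy to test.
fn resolve_embed_model_path(env_override: Option<OsString>) -> PathBuf {
    match env_override {
        Some(p) if !p.is_empty() => PathBuf::from(p),
        _ => PathBuf::from(DEFAULT_EMBED_MODEL),
    }
}

/// Read `DIA_EMBED_MODEL_PATH` from the process environment.
#[allow(dead_code)]
fn embed_model_path() -> PathBuf {
    resolve_embed_model_path(std::env::var_os("DIA_EMBED_MODEL_PATH"))
}
```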

# 0.1.0 (2026-04-26)

Initial release. Ships the `diarization::segment` module — Sans-I/O speaker
segmentation backed by `pyannote/segmentation-3.0` ONNX.

FEATURES

- **Sans-I/O state machine** (`diarization::segment::Segmenter`) with no `ort`
dependency. Caller pumps audio in via `push_samples`, drains `Action`s
via `poll`, runs ONNX inference externally, and pushes scores back via
`push_inference`. The state machine is exercisable in unit tests with
synthetic scores — no model file required.
- **Layer 2 streaming driver** (`Segmenter::process_samples` and
`finish_stream`) gated on the default `ort` feature. Mirrors silero's
`Session::process_stream` callback idiom.
- **`SegmentModel`** wraps `ort::Session` for `pyannote/segmentation-3.0`
with `from_file`, `from_memory`, and `*_with_options` constructors.
- **`SegmentModelOptions`** builder for `GraphOptimizationLevel`,
`ExecutionProviderDispatch`, intra/inter thread counts. Both `ort`
types are re-exported from `diarization::segment`.
- **`mediatime`-based time types** (`TimeRange`, `Timestamp`, `Duration`)
for every sample range and duration crossing the public API.
- **Sliding-window scheduling** with configurable step (default 2.5 s)
and tail-anchored window for end-of-stream coverage.
- **Powerset decoding** (7-class → 3-speaker additive marginals + voice
probability), **per-frame voice-timeline stitching** (overlap-add mean,
~1.7 MB/hour storage), **streaming hysteresis** with onset/offset
thresholds, **window-local `SpeakerActivity`** records, and
**`voice_merge_gap`** post-processing.
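
The powerset decoding step can be illustrated with a short sketch. It assumes the seven class probabilities are already softmax-normalized and ordered as no speaker, the three single speakers, then the three speaker pairs; it also assumes voice probability is one minus the no-speaker probability. The function name and the class ordering are assumptions, not the crate's confirmed layout.

```rust
/// Assumed class order: [none, {0}, {1}, {2}, {0,1}, {0,2}, {1,2}].
const CLASS_MEMBERS: [&[usize]; 7] = [&[], &[0], &[1], &[2], &[0, 1], &[0, 2], &[1, 2]];

/// Convert 7 powerset class probabilities for one frame into
/// per-speaker additive marginals and a voice probability.
fn powerset_to_marginals(probs: &[f32; 7]) -> ([f32; 3], f32) {
    let mut speakers = [0.0f32; 3];
    for (class, members) in CLASS_MEMBERS.iter().enumerate() {
        // A speaker's marginal is the sum over every class that contains it.
        for &s in members.iter() {
            speakers[s] += probs[class];
        }
    }
    // Assumption: voice probability is the complement of the "none" class.
    let voice = 1.0 - probs[0];
    (speakers, voice)
}
```

Note that the marginals can sum to more than the voice probability, because a frame assigned to a speaker pair counts toward both speakers.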

CORRECTNESS GUARANTEES

- **Generation-counter `WindowId`** (process-wide `AtomicU64`): stale
inference results from before a `clear()` and cross-`Segmenter` ID
collisions both reject as `Error::UnknownWindow`.
- **Pending-aware finalization boundary**: out-of-order `push_inference`
cannot prematurely finalize frames whose other contributing windows
haven't yet reported.
- **Tail-window activity clamping** to `total_samples`.
- **Frame-to-sample conversion** uses integer-rounded division
(`frame_to_sample`) bit-for-bit equivalent to Python's
`int(round(...))`. **Sample-to-frame conversion** uses floor
(`frame_index_of`) for boundary safety.
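
The two conversions can be sketched with exact integer arithmetic. The names `frame_to_sample` and `frame_index_of` come from the notes above, but the bodies below are an illustrative guess, and the constants (a 160 000-sample window mapped to 589 frames) are assumptions about the segmentation model rather than values stated here. Ties round to even, as Python's `round` does.

```rust
/// Assumed window geometry: 10 s at 16 kHz mapped to 589 output frames.
const WINDOW_SAMPLES: u64 = 160_000;
const FRAMES_PER_WINDOW: u64 = 589;

/// Integer division rounded to nearest, with ties going to the even quotient.
fn div_round_half_even(num: u64, den: u64) -> u64 {
    let q = num / den;
    let twice_rem = 2 * (num % den);
    if twice_rem > den || (twice_rem == den && q % 2 == 1) {
        q + 1
    } else {
        q
    }
}

/// Frame index to sample offset, rounded to the nearest sample.
fn frame_to_sample(frame: u64) -> u64 {
    div_round_half_even(frame * WINDOW_SAMPLES, FRAMES_PER_WINDOW)
}

/// Sample offset to the frame index, rounded down (floor).
fn frame_index_of(sample: u64) -> u64 {
    sample * FRAMES_PER_WINDOW / WINDOW_SAMPLES
}
```

Rounding up in `frame_index_of` could name a frame that starts after the sample in question, which is why the floor is the boundary-safe choice.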

OBSERVABILITY

- `Segmenter::pending_inferences()` and `Segmenter::buffered_samples()`
introspection for backpressure detection.
- Compile-time `Send + Sync` assertion on `Segmenter`; compile-time `Send`
assertion on `SegmentModel` (which is `!Sync` because `ort::Session` is).
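
Compile-time auto-trait assertions of this kind are commonly written as below. The two structs are stand-ins defined only for the example (the real `Segmenter` and `SegmentModel` have different fields); a `Cell` field is used here to make the stand-in model `Send` but not `Sync`.

```rust
use std::cell::Cell;

/// Stand-in for a plain-data state machine: Send + Sync.
#[allow(dead_code)]
struct Segmenter {
    buffered: Vec<f32>,
}

/// Stand-in for a session wrapper: `Cell` makes it Send but not Sync.
#[allow(dead_code)]
struct SegmentModel {
    calls: Cell<u64>,
}

const fn assert_send<T: Send>() {}
const fn assert_sync<T: Sync>() {}

// Evaluated at compile time; a missing auto trait is a build error.
const _: () = {
    assert_send::<Segmenter>();
    assert_sync::<Segmenter>();
    assert_send::<SegmentModel>();
    // assert_sync::<SegmentModel>(); // would fail to compile: Cell is !Sync
};
```

The assertions cost nothing at runtime and catch the case where a newly added field silently removes `Send` or `Sync` from a public type.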

EXAMPLES, TESTS, BENCHES

- `examples/stream_layer1.rs`: Sans-I/O usage with synthetic inferencer
(no model file, no `ort` feature).
- `examples/stream_from_wav.rs`: full Layer-2 pipeline streaming a 16 kHz
mono WAV file in 100 ms chunks.
- `tests/integration_segment.rs`: gated `#[ignore]` smoke test against a
real downloaded model.
- `benches/segment.rs`: Layer-1 throughput on one minute of audio.
- 54 unit tests covering options, powerset, hysteresis, RLE, sliding-window
planning, per-frame stitching, segmenter end-to-end, out-of-order
`push_inference`, cross-`Segmenter` ID collision, stale-id rejection,
empty-stream handling, tail-window activity clamping.

BUILD

- Edition 2024, Rust 1.95.
- Default features `["std", "ort"]`. `--no-default-features --features
std` builds without `ort` and exposes only Layer 1.
- Lints aligned with sibling crates (silero, soundevents, scenesdetect,
mediatime).

KNOWN LIMITATIONS

- **No load-time ONNX shape verification.** The `ort` 2.0.0-rc.12 metadata
API doesn't expose dimensions in a way matching the spec's assumption;
shape mismatches surface on first inference as
`Error::InferenceShapeMismatch`. The `Error::IncompatibleModel` variant
is reserved for the eventual load-time check. Matches silero's pragmatic
stance.
- **Sample-rate is the caller's responsibility.** `push_samples` accepts
`&[f32]` without validating that the input is 16 kHz mono. Feeding the
wrong rate produces silently corrupted output.
- **No bundled model.** Run `scripts/download-model.sh` to fetch
`pyannote/segmentation-3.0` from Hugging Face.

DEFERRED FOR v0.2

- `diarization::embed` module (speaker embedding via WeSpeaker ResNet34).
- `infer_batch` for cross-stream batching, `IoBinding`-based
reusable-output-buffer fast path, `Arc<[f32]>` in `Action::NeedsInference`.
- `serde` derives on output types.
- `step_samples` typed as `Duration`.
- Soft-cap `try_push_samples` for backpressure enforcement.
- Bundled model behind a Cargo feature.
- F1 numerical-parity tests vs `pyannote.audio`.