
v0.1.0: Sans-I/O speaker diarization with pyannote-equivalent accuracy #2

Merged
uqio merged 1 commit into main from 0.1.0
May 6, 2026

@uqio uqio commented May 2, 2026

Summary

Initial release of diarization — a Rust port of pyannote.audio's speaker-diarization pipeline, restructured around a Sans-I/O design (push PCM → get spans). Targets pyannote-equivalent accuracy on the captured community-1 fixtures.

DER on the six captured fixtures via the streaming-offline path:

| Fixture | DER |
| --- | --- |
| 01_dialogue | 0.37 % |
| 02_pyannote_sample | 0 % |
| 03_dual_speaker | 0 % |
| 04_three_speaker | 0 % |
| 05_four_speaker | 0 % |
| 06_long_recording (16 min) | 0.19 % |
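
For readers unfamiliar with the metric: DER is conventionally the sum of false-alarm, missed-detection, and speaker-confusion time, divided by the total reference speech time. The sketch below shows only that arithmetic. `ErrorDurations` and `diarization_error_rate` are illustrative names, not part of this crate, and the parity harness may compute the metric differently (collar, overlap handling).

```rust
/// Error durations in seconds.
/// Hypothetical struct for illustration; not part of the crate.
#[derive(Debug, Clone, Copy)]
pub struct ErrorDurations {
    pub false_alarm: f64,
    pub missed: f64,
    pub confusion: f64,
}

/// DER = (false alarm + missed detection + confusion) / reference speech.
/// Returns `None` when the reference contains no speech, since the ratio
/// is undefined in that case.
pub fn diarization_error_rate(err: ErrorDurations, reference_speech: f64) -> Option<f64> {
    if reference_speech <= 0.0 {
        return None;
    }
    Some((err.false_alarm + err.missed + err.confusion) / reference_speech)
}
```

With made-up inputs of 1 s false alarm, 2 s missed, and 1 s confusion over 100 s of reference speech, this yields 0.04, i.e. 4 %.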

Pipeline

audio → segmentation → embedding → PLDA → AHC → VBx → centroid → cosine → Hungarian → reconstruct → RTTM
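
The last stage emits RTTM. As a rough illustration of that output, the sketch below formats one span as a conventional ten-field RTTM `SPEAKER` line. `Span` and `rttm_line` are hypothetical names, and the three-decimal precision is an assumption rather than something this PR states.

```rust
/// One diarized span.
/// Hypothetical type for illustration; the crate's span type may differ.
pub struct Span {
    pub start: f64,    // seconds from the start of the recording
    pub duration: f64, // seconds
    pub speaker: String,
}

/// Formats a span as an RTTM `SPEAKER` line:
/// type, file id, channel, onset, duration, then `<NA>` placeholders
/// around the speaker label.
pub fn rttm_line(uri: &str, span: &Span) -> String {
    format!(
        "SPEAKER {} 1 {:.3} {:.3} <NA> <NA> {} <NA> <NA>",
        uri, span.start, span.duration, span.speaker
    )
}
```

A span starting at 0.5 s, lasting 1.25 s, and labelled `SPEAKER_00` in file `clip` formats as `SPEAKER clip 1 0.500 1.250 <NA> <NA> SPEAKER_00 <NA> <NA>`.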

Two public entrypoints, both running the full pyannote cluster_vbx flow:

  • offline::OwnedDiarizationPipeline — owned-audio batch path. Caller passes the entire 16 kHz mono PCM at once.
  • streaming::StreamingOfflineDiarizer — voice-range-driven streaming path. Caller drives a VAD externally and pushes one voice range at a time; heavy stages run eagerly per range, global clustering deferred to finalize. Same DER as the offline path.
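
The split between eager per-range work and a deferred global step can be sketched with a toy type. Everything here is hypothetical: `ToyDiarizer` is not the crate's API, mean absolute amplitude stands in for segmentation plus embedding, and a split at the global mean stands in for clustering. The point is only the Sans-I/O shape: the caller owns all I/O and VAD, pushes samples, and asks for spans at the end.

```rust
/// Toy stand-in for a voice-range-driven diarizer.
/// Hypothetical; not the crate's API.
pub struct ToyDiarizer {
    // (start_sample, end_sample, per-range feature)
    ranges: Vec<(usize, usize, f32)>,
}

impl ToyDiarizer {
    pub fn new() -> Self {
        Self { ranges: Vec::new() }
    }

    /// Eager per-range work. No file or device access happens here:
    /// the caller decides where the samples come from.
    pub fn push_voice_range(&mut self, start_sample: usize, samples: &[f32]) {
        if samples.is_empty() {
            return;
        }
        // Stand-in for the heavy stages (segmentation + embedding).
        let feature = samples.iter().map(|s| s.abs()).sum::<f32>() / samples.len() as f32;
        self.ranges.push((start_sample, start_sample + samples.len(), feature));
    }

    /// Deferred global step: it needs every range, so it runs only here.
    /// Returns (start_sample, end_sample, cluster_id) triples.
    pub fn finalize(self) -> Vec<(usize, usize, usize)> {
        if self.ranges.is_empty() {
            return Vec::new();
        }
        let mean = self.ranges.iter().map(|r| r.2).sum::<f32>() / self.ranges.len() as f32;
        self.ranges
            .iter()
            .map(|&(start, end, feature)| (start, end, usize::from(feature > mean)))
            .collect()
    }
}
```

Pushing a quiet range at sample 0 and a loud range at sample 16 000, then calling `finalize()`, assigns the two ranges to different toy clusters. Deferring the global step is what lets a streaming path see the same complete set of embeddings as a batch path.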

Bundled model artifacts

  • models/segmentation-3.0.onnx (~6 MB, MIT) — embedded by default via SegmentModel::bundled(). Off-switch: default-features = false for callers shipping a fine-tuned variant.
  • models/plda/*.bin (~530 KB total, CC-BY-4.0) — PLDA whitening weights from pyannote/speaker-diarization-community-1, loaded by PldaTransform::new().
  • WeSpeaker ResNet34-LM embedding ONNX (~27 MB) — not bundled, exceeds crates.io's 10 MB cap. Fetch via scripts/download-embed-model.sh.

.crate tarball: ~6.4 MiB. License SPDX: (MIT OR Apache-2.0) AND MIT AND CC-BY-4.0.

Public API shape

  • SegmentModel / EmbedModel: from_file / from_memory / bundled constructors with options-builder.
  • PldaTransform::new() — loads embedded PLDA weights.
  • OwnedDiarizationPipeline::run(&mut seg, &mut emb, &plda, &samples) — owned audio.
  • StreamingOfflineDiarizer::push_voice_range(...) + .finalize(&plda) — VAD-driven streaming.
  • Algorithm-level entrypoints (diarize_offline, assign_embeddings, reconstruct, count_pyannote) take builder-style input structs (OfflineInput::new(...).with_threshold(...).with_fa(...)).

All public structs use accessor patterns (no public fields). Hyperparameters default to community-1 values; override via with_* builders.
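
A minimal sketch of the accessor-plus-builder style described above. `SketchInput` is a hypothetical stand-in, not the crate's `OfflineInput`: its fields are invented for illustration and its defaults are placeholders, not the community-1 values.

```rust
/// Hypothetical input struct mirroring the `new(...).with_*(...)` style.
/// Fields are private; reads go through accessors.
#[derive(Debug, Clone)]
pub struct SketchInput<'a> {
    embeddings: &'a [f32],
    threshold: f64,
    fa: f64,
}

impl<'a> SketchInput<'a> {
    /// Required data goes through `new`; hyperparameters get defaults.
    /// The defaults below are placeholders, NOT the community-1 values.
    pub fn new(embeddings: &'a [f32]) -> Self {
        Self { embeddings, threshold: 0.5, fa: 0.1 }
    }

    /// Consuming setters, so calls can be chained.
    pub fn with_threshold(mut self, threshold: f64) -> Self {
        self.threshold = threshold;
        self
    }

    pub fn with_fa(mut self, fa: f64) -> Self {
        self.fa = fa;
        self
    }

    /// Read-only accessors instead of public fields.
    pub fn embeddings(&self) -> &'a [f32] {
        self.embeddings
    }

    pub fn threshold(&self) -> f64 {
        self.threshold
    }

    pub fn fa(&self) -> f64 {
        self.fa
    }
}
```

Calling `SketchInput::new(&data).with_threshold(0.7)` overrides one hyperparameter and leaves the other at its default. Private fields also leave room to add hyperparameters later without a breaking change.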

SIMD policy

  • NEON ≡ scalar bit-exact on aarch64 at the primitive level (verified by ops::differential_tests).
  • AHC pdist and Hungarian-feeding cosine use ops::scalar directly on every architecture — they feed discrete decisions where ulp drift could flip a partition.
  • VBx EM, centroid sums, embed aggregation use SIMD via ops::dot / ops::axpy — continuous/iterative paths where ulp drift smooths instead of flipping decisions.
  • A guard band on SP_ALIVE_THRESHOLD rejects pathological VBx outputs that could land within SIMD ulp drift of the alive-cluster cut.
  • nalgebra/matrixmultiply GEMMs in VBx are uncontrolled; cross-arch determinism end-to-end is not claimed for T>200 inputs but is empirically validated under SDE-emulated AVX2 + AVX-512 in CI.
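
To see why reduction order matters for the discrete paths, here is a deliberately exaggerated sketch. `dot_scalar` and `dot_two_lanes` are illustrative functions, not the crate's `ops::scalar` or `ops::dot`. The second one only imitates how a lane-wise reduction regroups additions, using two accumulators instead of real SIMD.

```rust
/// Sequential scalar dot product: one accumulator, fixed left-to-right order.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0f32, |acc, (x, y)| acc + x * y)
}

/// Two-accumulator dot product, imitating how a 2-lane reduction groups
/// the additions: even indices go to lane 0, odd indices to lane 1, and
/// the lanes are summed at the end.
pub fn dot_two_lanes(a: &[f32], b: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 2];
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        lanes[i % 2] += x * y;
    }
    lanes[0] + lanes[1]
}
```

For `a = [1e8, 1.0, -1e8, 1.0]` and `b = [1.0; 4]`, the sequential version returns 1.0 (the first `+ 1.0` is absorbed by `1e8` in `f32`), while the two-lane version returns 2.0. Realistic embeddings drift by a few ulps rather than by a factor of two, but a value sitting right at a merge threshold or an assignment tie could still land on the other side, which is the concern behind keeping the decision-feeding paths scalar.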

Testing

  • 355 in-tree lib tests, including bit-exact pyannote parity for PLDA, AHC, VBx, centroid, count tensor, reconstruct, RTTM on all six captured fixtures (06_long_recording strict pipeline parity is #[ignore]d due to GEMM-roundoff drift; covered by the tolerant Hungarian-permuted per-frame match in reconstruct::parity_tests).
  • CI matrix: ASan, Miri (SB+TB), AVX2 SDE, AVX-512 SDE.
  • Streaming parity harness at tests/parity/run.sh measures DER against pyannote captures; results table above.

Test plan

  • cargo test --lib (355 tests; 10 #[ignore]d for documented reasons)
  • cargo clippy --lib --tests --features 'ort bundled-segmentation' clean
  • cargo build --examples --features ort clean
  • RUSTFLAGS=-Dwarnings cargo check --no-default-features --lib (used by SDE CI lanes)
  • bash tests/parity/run.sh tests/parity/fixtures/<fixture>/clip_16k.wav for each of the six fixtures; DER table above.

🤖 Generated with Claude Code

@al8n al8n requested a review from Copilot May 2, 2026 00:28

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@uqio uqio changed the title from "0.1.0" to "v0.1.0: Sans-I/O speaker diarization with pyannote-equivalent accuracy" May 2, 2026
uqio added a commit that referenced this pull request May 4, 2026
# This is the 1st commit message:

update

# This is the commit message #2:

update
@al8n al8n requested a review from Copilot May 6, 2026 09:30

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

…ccuracy

v0.1.0 ships:
- diarization::segment — speaker segmentation (pyannote/segmentation-3.0).
  Bundled by default (~6 MB, MIT) via SegmentModel::bundled().
- diarization::embed — speaker fingerprint (WeSpeaker ResNet34 ONNX +
  kaldi fbank). Caller-fetched (27 MB, exceeds crates.io 10 MB cap).
- diarization::plda — pyannote/speaker-diarization-community-1 PLDA
  whitening. Bundled by default (CC-BY-4.0) via PldaTransform::new().
- diarization::cluster + pipeline — pyannote cluster_vbx primitives
  (PLDA → AHC → VBx → centroid → cosine → Hungarian → reconstruct).
- diarization::offline::OwnedDiarizationPipeline — owned-audio batch
  entrypoint.
- diarization::streaming::StreamingOfflineDiarizer — voice-range-driven
  streaming entrypoint with the same per-fixture DER as offline.
@uqio uqio merged commit 07a684f into main May 6, 2026
62 checks passed
@uqio uqio deleted the 0.1.0 branch May 6, 2026 09:41