Add SpeakerKit with Pyannote speaker diarization support by a2they · Pull Request #440 · argmaxinc/WhisperKit

a2they · 2026-03-12T23:48:29Z

On-device speaker diarization using the Pyannote v4 (community-1) pipeline (https://huggingface.co/argmaxinc/speakerkit-coreml).

swift run whisperkit-cli diarize --audio-path audio.wav [--verbose] \
  [--num-speakers N] [--cluster-distance-threshold F] \
  [--model-path /path] [--model-repo org/repo] [--download-model-path /path] \
  [--rttm-path output.rttm]

Output is standard RTTM. Use --model-path for existing models; otherwise models are resolved/downloaded as needed (e.g. via --download-model-path).
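
For readers unfamiliar with RTTM, each speaker turn is one line of ten space-separated fields, with `<NA>` in the unused slots. The sketch below shows that layout; the `RTTMRecord` type, its field names, and the `SPEAKER_00` label are illustrative assumptions, not SpeakerKit's own `RTTMLine`.

```swift
import Foundation

/// One RTTM "SPEAKER" record. The type name and fields are hypothetical;
/// only the ten-field line layout follows the common RTTM convention.
struct RTTMRecord {
    var fileID: String
    var channel: Int = 1
    var onset: Double      // turn start, in seconds
    var duration: Double   // turn length, in seconds
    var speaker: String

    /// Renders the ten space-separated fields; unused fields are "<NA>".
    var line: String {
        let times = String(format: "%.3f %.3f", onset, duration)
        return "SPEAKER \(fileID) \(channel) \(times) <NA> <NA> \(speaker) <NA> <NA>"
    }
}

let record = RTTMRecord(fileID: "audio", onset: 1.25, duration: 2.5, speaker: "SPEAKER_00")
print(record.line)
// SPEAKER audio 1 1.250 2.500 <NA> <NA> SPEAKER_00 <NA> <NA>
```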

swift run whisperkit-cli transcribe --audio-path audio.wav --diarization [--verbose] \
  [--diarization-model-path /path] [--diarization-model-repo org/repo] \
  [--diarization-num-speakers N]

Runs transcription and diarization in one pass. Word timestamps are enabled when --diarization is set so speaker labels align at the subsegment level. RTTM lines are printed to stdout.
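
The word-level alignment described above can be pictured as assigning each timestamped word to the diarization turn it overlaps most. This is a minimal sketch under that assumption: `Word`, `Turn`, and `dominantSpeaker` are hypothetical names, and SpeakerKit's actual alignment strategies may work differently.

```swift
// Hypothetical stand-ins for a timestamped word and a diarized speaker turn.
struct Word { let text: String; let start: Double; let end: Double }
struct Turn { let speaker: Int; let start: Double; let end: Double }

/// Returns the speaker whose turn overlaps the word the longest,
/// or nil when no turn overlaps the word at all.
func dominantSpeaker(for word: Word, in turns: [Turn]) -> Int? {
    var best: (speaker: Int, overlap: Double)? = nil
    for turn in turns {
        // Length of the intersection of the two time intervals (negative if disjoint).
        let overlap = min(word.end, turn.end) - max(word.start, turn.start)
        if overlap > 0, overlap > (best?.overlap ?? 0) {
            best = (turn.speaker, overlap)
        }
    }
    return best?.speaker
}

let turns = [Turn(speaker: 0, start: 0.0, end: 2.0), Turn(speaker: 1, start: 1.8, end: 4.0)]
let words = [Word(text: "hello", start: 0.5, end: 1.0), Word(text: "there", start: 1.7, end: 2.4)]
let labels = words.map { dominantSpeaker(for: $0, in: turns) }
```

The second word straddles both turns, which is the case where segment-level timestamps alone would be too coarse to pick a speaker.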

WhisperAX shows full SpeakerKit integration:

- Pipeline version picker (Pyannote 4 or None)
- Speakers tab with per-segment speaker labels

SpeakerKit
├── SpeakerKit.swift             — public API (diarize, generateRTTM)
├── SpeakerKitModelManager.swift — download + load Pyannote models
├── PyannoteConfig.swift         — config + DiarizationOptionsProtocol
├── DiarizationResult.swift      — result type + transcription alignment
├── SpeakerSegment.swift
├── SpeakerInfo.swift
├── RTTMLine.swift
└── Pyannote/
    ├── PyannoteDiarizer.swift
    ├── SpeakerSegmenterModel.swift
    ├── SpeakerEmbedderModel.swift
    ├── SpeakerPreEmbedderModel.swift
    ├── SpeakerClustering.swift      — clustering protocol + config
    ├── VBxClustering.swift          — variational Bayes HMM
    ├── ClusteringAlgorithms.swift   — fast linkage
    └── MathOps.swift                — matrix ops (Accelerate)
ArgmaxCore
├── ModelDownloader.swift        — downloads + ModelInfo struct
├── ModelState.swift
└── (other shared utilities: Logging, ModelUtilities, etc.)
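
To illustrate the kind of helper `MathOps.swift` is described as providing, here is a sketch of a cosine distance between two speaker embeddings, using Accelerate where available and plain Swift elsewhere. The function name and the threshold value are assumptions for illustration only; how SpeakerKit's clustering actually consumes `--cluster-distance-threshold` is not shown in this PR description.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

/// Cosine distance (1 - cosine similarity) between two equal-length embeddings.
/// Hypothetical helper, not the MathOps API.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty, "embeddings must match in length")
    #if canImport(Accelerate)
    let dot = vDSP.dot(a, b)
    let normA = vDSP.sumOfSquares(a).squareRoot()
    let normB = vDSP.sumOfSquares(b).squareRoot()
    #else
    // Portable fallback for platforms without Accelerate.
    let dot = zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    #endif
    // Treat a zero vector as maximally distant rather than dividing by zero.
    guard normA > 0, normB > 0 else { return 1 }
    return 1 - dot / (normA * normB)
}

// Assumed threshold, for illustration: pairs closer than this would be merged.
let assumedThreshold: Float = 0.6
let likelySameSpeaker = cosineDistance([1, 0, 0], [0.9, 0.1, 0]) < assumedThreshold
```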

Copilot

Pull request overview

Adds a new SpeakerKit module to the Swift package to support on-device speaker diarization (Pyannote v4 pipeline), and wires it into both the CLI and the WhisperAX example app, alongside new shared model-download utilities and test coverage.

Changes:

- Introduces SpeakerKit (models, Pyannote diarizer, clustering, RTTM generation, transcription alignment types).
- Extends whisperkit-cli with a new diarize subcommand and adds diarization support to transcribe.
- Adds ArgmaxCore.ModelDownloader and updates docs/build/test setup for the new product and workflows.

Reviewed changes

Copilot reviewed 36 out of 38 changed files in this pull request and generated 4 comments.


File	Description
Tests/WhisperKitTests/UnitTestsPlan.xctestplan	Updates test plan targets to include SpeakerKit tests and keeps WhisperKit skips.
Tests/SpeakerKitTests/SpeakerEmbedderContextTests.swift	Unit tests for embedder context stride/index math and bounds behavior.
Tests/SpeakerKitTests/RTTMLineTests.swift	Tests RTTM formatting and RTTM generation from diarization + transcription.
Tests/SpeakerKitTests/PyannoteIntegrationTests.swift	Integration tests exercising end-to-end diarization on bundled audio.
Tests/SpeakerKitTests/MathOpsTests.swift	Unit + perf tests for Accelerate-backed math utilities.
Tests/SpeakerKitTests/ExclusiveReconciliationTests.swift	Validates exclusive reconciliation reduces overlaps.
Tests/SpeakerKitTests/DiarizerPostProcessingTests.swift	Validates post-processing produces sane segment/frame distributions.
Tests/SpeakerKitTests/DiarizationResultTests.swift	Tests diarization result segment creation and speaker alignment strategies.
Tests/SpeakerKitTests/ClusterAlgorithmsTests.swift	Tests clustering and VBx edge cases/guards.
Sources/WhisperKitCLI/WhisperKitCLI.swift	Registers the new `diarize` CLI subcommand.
Sources/WhisperKitCLI/TranscribeCLI.swift	Adds diarization flags and runs diarization after transcription.
Sources/WhisperKitCLI/DiarizeCLI.swift	New standalone diarization command that outputs RTTM.
Sources/TTSKit/TTSKit.swift	Switches file listing / downloads to the new `ModelDownloader` API.
Sources/SpeakerKit/SpeakerSegment.swift	Introduces `SpeakerSegment` model type for diarization + aligned transcription segments.
Sources/SpeakerKit/SpeakerKitModelManager.swift	Downloads and loads Pyannote CoreML bundles (segmenter/embedder/PLDA).
Sources/SpeakerKit/SpeakerKit.swift	Public SpeakerKit API (`diarize`, `generateRTTM`) + shared error types.
Sources/SpeakerKit/SpeakerInfo.swift	Adds speaker info enums/strategies and word timing wrappers.
Sources/SpeakerKit/RTTMLine.swift	Adds RTTM line type + grouping from word-level speaker assignments.
Sources/SpeakerKit/PyannoteConfig.swift	Adds Pyannote config, diarization options, and diarization timings structures.
Sources/SpeakerKit/Pyannote/VBxClustering.swift	Implements VBx clustering actor + fallback paths for non-trainable embeddings.
Sources/SpeakerKit/Pyannote/SpeakerSegmenterModel.swift	CoreML segmenter wrapper and chunked concurrent inference pipeline.
Sources/SpeakerKit/Pyannote/SpeakerPreEmbedderModel.swift	CoreML preprocessor wrapper for embedder input preparation.
Sources/SpeakerKit/Pyannote/SpeakerEmbedderModel.swift	CoreML embedder + PLDA embeddings + embedding extraction logic.
Sources/SpeakerKit/Pyannote/SpeakerClustering.swift	Defines clustering config/result and clustering protocol.
Sources/SpeakerKit/Pyannote/PyannoteDiarizer.swift	End-to-end diarization pipeline actor (segment → embed → cluster → postprocess).
Sources/SpeakerKit/Pyannote/MathOps.swift	Adds Accelerate-backed math helpers (mmul/transpose/softmax/cosine/argmax).
Sources/SpeakerKit/DiarizationResult.swift	Adds diarization result type + transcription alignment strategies.
Sources/ArgmaxCore/ModelUtilities.swift	Adds recursive CoreML model bundle discovery helper.
Sources/ArgmaxCore/ModelDownloader.swift	Adds shared HuggingFace model download/list/resolve utilities.
README.md	Documents SpeakerKit usage, options, RTTM output, and CLI examples.
Package.swift	Adds SpeakerKit product/targets and ArgmaxCore Hub dependency.
Makefile	Adds SpeakerKit model repo setup + download target.
Examples/WhisperAX/WhisperAX/Views/ContentView.swift	Adds UI + settings for diarization, plus sequential/concurrent diarization flows.
Examples/WhisperAX/WhisperAX.xcodeproj/project.pbxproj	Links SpeakerKit into the WhisperAX example app target.
.swiftpm/xcode/xcshareddata/xcschemes/whisperkit-Package.xcscheme	Adds SpeakerKitTests to the shared Xcode scheme.


Sources/SpeakerKit/SpeakerSegment.swift

+        let start = speakerWords.first?.wordTiming.start ?? startTime
+        let end = speakerWords.last?.wordTiming.end ?? endTime
+        let text = text.trimmingCharacters(in: .whitespaces)
+        return String(format: "\(speaker) [%.2f-%.2fs]: \(text)", start, end)
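
One thing worth noting about this hunk: `speaker` and `text` are interpolated into the format string itself, so a `%` in the transcript would be read by `String(format:)` as a format specifier. A possible alternative, sketched here with a hypothetical free function rather than the PR's actual method, formats only the numeric fields and interpolates the strings afterwards:

```swift
import Foundation

/// Formats only the timestamps with `String(format:)` and interpolates the
/// free-form strings afterwards, so a "%" in the transcript is left untouched.
/// Hypothetical helper; the parameter names mirror the hunk above.
func describeSegment(speaker: String, text: String, start: Float, end: Float) -> String {
    let range = String(format: "%.2f-%.2fs", start, end)
    let trimmed = text.trimmingCharacters(in: .whitespaces)
    return "\(speaker) [\(range)]: \(trimmed)"
}

print(describeSegment(speaker: "Speaker 0", text: " 50% done ", start: 1.0, end: 2.5))
// Speaker 0 [1.00-2.50s]: 50% done
```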


Sources/SpeakerKit/SpeakerKitModelManager.swift

+        let segmenterURL = ModelUtilities.detectModelURL(inFolder: segmenterVersionDir, named: "SpeakerSegmenter")
+        let embedderPreprocessorURL = ModelUtilities.detectModelURL(inFolder: embedderVersionDir, named: "SpeakerEmbedderPreprocessor")
+        let embedderURL = ModelUtilities.detectModelURL(inFolder: embedderVersionDir, named: "SpeakerEmbedder")
+        let pldaURL = ModelUtilities.detectModelURL(inFolder: pldaVersionDir, named: "PldaProjector")
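
The body of `ModelUtilities.detectModelURL(inFolder:named:)` is not shown in this hunk. As a rough picture of what a recursive bundle lookup could look like, the sketch below walks a folder and matches on the exact bundle name. `findModelBundle` and the `.mlmodelc` / `.mlpackage` extensions are assumptions, not the ArgmaxCore implementation.

```swift
import Foundation

/// Hypothetical helper: walks `folder` recursively and returns the first bundle
/// named exactly `<name>.mlmodelc` or `<name>.mlpackage`, or nil if none exists.
func findModelBundle(named name: String, in folder: URL) -> URL? {
    let candidates: Set<String> = ["\(name).mlmodelc", "\(name).mlpackage"]
    guard let walker = FileManager.default.enumerator(at: folder, includingPropertiesForKeys: nil) else {
        return nil
    }
    for case let url as URL in walker {
        // Exact match on the last path component, not a prefix match.
        if candidates.contains(url.lastPathComponent) { return url }
    }
    return nil
}
```

Matching the full bundle name rather than a prefix matters when names share a stem, as `SpeakerEmbedder` and `SpeakerEmbedderPreprocessor` do in the hunk above.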


Sources/SpeakerKit/SpeakerKitModelManager.swift

+
+import Foundation
+import CoreML
+import Hub


Sources/WhisperKitCLI/TranscribeCLI.swift

+        if diarization {
+            for (audioPath, result) in zip(resolvedAudioPaths, transcribeResult) {
+                do {
+                    let partialResult = try result.get()
+                    try await runDiarization(audioPath: audioPath, transcriptionResults: partialResult)
+                } catch {
+                    print("Error during diarization for \(audioPath): \(error)")
+                }
+            }
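
In the loop above, a single `catch` handles both `result.get()` and `runDiarization`, so a failed transcription would also be reported as "Error during diarization". A possible way to keep the two cases apart is to switch on the `Result` first. The sketch below is synchronous and uses stand-in types (`DemoError`, `runDiarizationStub`, string segments) that are not part of the CLI:

```swift
struct DemoError: Error { let message: String }

/// Hypothetical stand-in for the CLI's diarization step.
func runDiarizationStub(audioPath: String, segments: [String]) throws -> String {
    if segments.isEmpty { throw DemoError(message: "nothing to align") }
    return "\(audioPath): \(segments.count) segment(s) diarized"
}

/// Reports transcription failures and diarization failures with distinct messages.
func diarizeAll(paths: [String], results: [Result<[String], Error>]) -> [String] {
    var log: [String] = []
    for (path, result) in zip(paths, results) {
        switch result {
        case .failure(let error):
            // Transcription never succeeded, so diarization is skipped for this file.
            log.append("Skipping diarization for \(path): transcription failed (\(error))")
        case .success(let segments):
            do {
                log.append(try runDiarizationStub(audioPath: path, segments: segments))
            } catch {
                log.append("Error during diarization for \(path): \(error)")
            }
        }
    }
    return log
}
```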


Co-authored-by: ZachNagengast <znagengast@gmail.com>

a2they requested review from EduardoPach, ZachNagengast, atiorh and Copilot March 12, 2026 23:48

Copilot started reviewing on behalf of a2they March 12, 2026 23:48

Copilot AI reviewed Mar 12, 2026


ZachNagengast approved these changes Mar 13, 2026


EduardoPach approved these changes Mar 13, 2026


a2they force-pushed the oss/speakerkit branch from 346c780 to 5c53997 March 13, 2026 14:59

a2they merged commit 26577ce into main Mar 13, 2026
11 of 13 checks passed

a2they deleted the oss/speakerkit branch March 13, 2026 15:25

a2they merged 1 commit into main from oss/speakerkit

4 participants
