Add SpeakerKit with Pyannote speaker diarization support #440

Merged: a2they merged 1 commit into main from oss/speakerkit on Mar 13, 2026
Conversation

@a2they (Contributor) commented Mar 12, 2026

On-device speaker diarization using the [Pyannote v4 (community-1)](https://huggingface.co/argmaxinc/speakerkit-coreml) pipeline.

swift run whisperkit-cli diarize --audio-path audio.wav [--verbose] \
  [--num-speakers N] [--cluster-distance-threshold F] \
  [--model-path /path] [--model-repo org/repo] [--download-model-path /path] \
  [--rttm-path output.rttm]

Output is standard RTTM. Use --model-path for existing models; otherwise models are resolved/downloaded as needed (e.g. via --download-model-path).
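
For readers unfamiliar with RTTM, the sketch below shows how one speaker turn maps to one `SPEAKER` line with its ten space-separated fields. `RTTMEntry` and its property names are hypothetical and chosen for illustration; the PR's own `RTTMLine` type may be shaped differently, and the three-decimal precision is an assumption.

```swift
import Foundation

// Hypothetical value type for illustration; SpeakerKit's RTTMLine may differ.
struct RTTMEntry {
    let fileID: String    // recording identifier, usually the audio file's base name
    let start: Double     // turn onset in seconds
    let duration: Double  // turn length in seconds
    let speaker: String   // speaker label, e.g. "SPEAKER_00"

    // RTTM fields: type, file, channel, onset, duration, ortho, stype, name, conf, slat.
    var line: String {
        let onset = String(format: "%.3f", start)
        let length = String(format: "%.3f", duration)
        return "SPEAKER \(fileID) 1 \(onset) \(length) <NA> <NA> \(speaker) <NA> <NA>"
    }
}

let entries = [
    RTTMEntry(fileID: "audio", start: 0.0, duration: 2.5, speaker: "SPEAKER_00"),
    RTTMEntry(fileID: "audio", start: 2.5, duration: 1.25, speaker: "SPEAKER_01"),
]
let rttm = entries.map { $0.line }.joined(separator: "\n")
print(rttm)
```

Each turn becomes exactly one line, so the output can be written to a file or piped into standard diarization scoring tools.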

swift run whisperkit-cli transcribe --audio-path audio.wav --diarization [--verbose] \
  [--diarization-model-path /path] [--diarization-model-repo org/repo] \
  [--diarization-num-speakers N]

Runs transcription and diarization in one pass. Word timestamps are enabled when --diarization is set so speaker labels align at the subsegment level. RTTM lines are printed to stdout.
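
To illustrate why word timestamps matter here, the sketch below assigns each word to the speaker turn it overlaps most in time. This is one plausible alignment strategy under stated assumptions, not necessarily the one SpeakerKit implements; `Word`, `Turn`, and `assignSpeakers` are hypothetical names.

```swift
// Hypothetical types for illustration; not SpeakerKit's API.
struct Word { let text: String; let start: Double; let end: Double }
struct Turn { let speaker: Int; let start: Double; let end: Double }

// Seconds of overlap between a word and a speaker turn (0 when disjoint).
func overlap(_ word: Word, _ turn: Turn) -> Double {
    return max(0, min(word.end, turn.end) - max(word.start, turn.start))
}

// Assign each word to the turn it overlaps most; nil when it overlaps no turn.
func assignSpeakers(words: [Word], turns: [Turn]) -> [Int?] {
    var result: [Int?] = []
    for word in words {
        var bestSpeaker: Int? = nil
        var bestOverlap = 0.0
        for turn in turns {
            let shared = overlap(word, turn)
            if shared > bestOverlap {
                bestOverlap = shared
                bestSpeaker = turn.speaker
            }
        }
        result.append(bestSpeaker)
    }
    return result
}

let turns = [
    Turn(speaker: 0, start: 0.0, end: 1.0),
    Turn(speaker: 1, start: 1.0, end: 2.0),
]
let words = [
    Word(text: "hello", start: 0.0, end: 0.4),
    Word(text: "well", start: 0.9, end: 1.3),   // straddles the turn boundary
    Word(text: "late", start: 2.5, end: 2.8),   // outside every turn
]
let speakers = assignSpeakers(words: words, turns: turns)
print(speakers)
```

With only segment-level timestamps, a segment that straddles a speaker change would have to take a single label; per-word timing lets the label change mid-segment.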

WhisperAX shows full SpeakerKit integration:

  • Pipeline version picker (Pyannote 4 or None)
  • Speakers tab with per-segment speaker labels

SpeakerKit
├── SpeakerKit.swift             — public API (diarize, generateRTTM)
├── SpeakerKitModelManager.swift — download + load Pyannote models
├── PyannoteConfig.swift         — config + DiarizationOptionsProtocol
├── DiarizationResult.swift      — result type + transcription alignment
├── SpeakerSegment.swift
├── SpeakerInfo.swift
├── RTTMLine.swift
└── Pyannote/
    ├── PyannoteDiarizer.swift
    ├── SpeakerSegmenterModel.swift
    ├── SpeakerEmbedderModel.swift
    ├── SpeakerPreEmbedderModel.swift
    ├── SpeakerClustering.swift      — clustering protocol + config
    ├── VBxClustering.swift          — variational Bayes HMM
    ├── ClusteringAlgorithms.swift   — fast linkage
    └── MathOps.swift                — matrix ops (Accelerate)
ArgmaxCore
├── ModelDownloader.swift        — downloads + ModelInfo struct
├── ModelState.swift
└── (other shared utilities: Logging, ModelUtilities, etc.)
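
As a rough picture of what the clustering stage works with, the sketch below computes cosine distance between two speaker embeddings and compares it against a threshold. It is plain Swift for portability, whereas the PR's `MathOps` is Accelerate-backed. Treating `--cluster-distance-threshold` as a cosine-distance cutoff is an assumption, and `cosineDistance` and `sameSpeaker` are hypothetical names.

```swift
// Plain-Swift cosine distance: 0 for identical directions, 1 for orthogonal vectors.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have equal length")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = (normA * normB).squareRoot()
    // A zero-length vector has no direction; treat it as maximally distant.
    return denominator > 0 ? 1 - dot / denominator : 1
}

// Hypothetical decision rule: two embeddings belong together when they are
// closer than the threshold.
func sameSpeaker(_ a: [Float], _ b: [Float], threshold: Float) -> Bool {
    return cosineDistance(a, b) < threshold
}

print(cosineDistance([1, 0], [2, 0]))
print(sameSpeaker([1, 0], [0, 1], threshold: 0.5))
```

A lower threshold merges fewer embeddings and tends to produce more speakers; a higher one merges more aggressively.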

Copilot AI left a comment

Pull request overview

Adds a new SpeakerKit module to the Swift package to support on-device speaker diarization (Pyannote v4 pipeline), and wires it into both the CLI and the WhisperAX example app, alongside new shared model-download utilities and test coverage.

Changes:

  • Introduces SpeakerKit (models, Pyannote diarizer, clustering, RTTM generation, transcription alignment types).
  • Extends whisperkit-cli with a new diarize subcommand and adds diarization support to transcribe.
  • Adds ArgmaxCore.ModelDownloader and updates docs/build/test setup for the new product and workflows.

Reviewed changes

Copilot reviewed 36 out of 38 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `Tests/WhisperKitTests/UnitTestsPlan.xctestplan` | Updates test plan targets to include SpeakerKit tests and keeps WhisperKit skips. |
| `Tests/SpeakerKitTests/SpeakerEmbedderContextTests.swift` | Unit tests for embedder context stride/index math and bounds behavior. |
| `Tests/SpeakerKitTests/RTTMLineTests.swift` | Tests RTTM formatting and RTTM generation from diarization + transcription. |
| `Tests/SpeakerKitTests/PyannoteIntegrationTests.swift` | Integration tests exercising end-to-end diarization on bundled audio. |
| `Tests/SpeakerKitTests/MathOpsTests.swift` | Unit + perf tests for Accelerate-backed math utilities. |
| `Tests/SpeakerKitTests/ExclusiveReconciliationTests.swift` | Validates exclusive reconciliation reduces overlaps. |
| `Tests/SpeakerKitTests/DiarizerPostProcessingTests.swift` | Validates post-processing produces sane segment/frame distributions. |
| `Tests/SpeakerKitTests/DiarizationResultTests.swift` | Tests diarization result segment creation and speaker alignment strategies. |
| `Tests/SpeakerKitTests/ClusterAlgorithmsTests.swift` | Tests clustering and VBx edge cases/guards. |
| `Sources/WhisperKitCLI/WhisperKitCLI.swift` | Registers the new diarize CLI subcommand. |
| `Sources/WhisperKitCLI/TranscribeCLI.swift` | Adds diarization flags and runs diarization after transcription. |
| `Sources/WhisperKitCLI/DiarizeCLI.swift` | New standalone diarization command that outputs RTTM. |
| `Sources/TTSKit/TTSKit.swift` | Switches file listing / downloads to the new ModelDownloader API. |
| `Sources/SpeakerKit/SpeakerSegment.swift` | Introduces SpeakerSegment model type for diarization + aligned transcription segments. |
| `Sources/SpeakerKit/SpeakerKitModelManager.swift` | Downloads and loads Pyannote CoreML bundles (segmenter/embedder/PLDA). |
| `Sources/SpeakerKit/SpeakerKit.swift` | Public SpeakerKit API (diarize, generateRTTM) + shared error types. |
| `Sources/SpeakerKit/SpeakerInfo.swift` | Adds speaker info enums/strategies and word timing wrappers. |
| `Sources/SpeakerKit/RTTMLine.swift` | Adds RTTM line type + grouping from word-level speaker assignments. |
| `Sources/SpeakerKit/PyannoteConfig.swift` | Adds Pyannote config, diarization options, and diarization timings structures. |
| `Sources/SpeakerKit/Pyannote/VBxClustering.swift` | Implements VBx clustering actor + fallback paths for non-trainable embeddings. |
| `Sources/SpeakerKit/Pyannote/SpeakerSegmenterModel.swift` | CoreML segmenter wrapper and chunked concurrent inference pipeline. |
| `Sources/SpeakerKit/Pyannote/SpeakerPreEmbedderModel.swift` | CoreML preprocessor wrapper for embedder input preparation. |
| `Sources/SpeakerKit/Pyannote/SpeakerEmbedderModel.swift` | CoreML embedder + PLDA embeddings + embedding extraction logic. |
| `Sources/SpeakerKit/Pyannote/SpeakerClustering.swift` | Defines clustering config/result and clustering protocol. |
| `Sources/SpeakerKit/Pyannote/PyannoteDiarizer.swift` | End-to-end diarization pipeline actor (segment → embed → cluster → postprocess). |
| `Sources/SpeakerKit/Pyannote/MathOps.swift` | Adds Accelerate-backed math helpers (mmul/transpose/softmax/cosine/argmax). |
| `Sources/SpeakerKit/DiarizationResult.swift` | Adds diarization result type + transcription alignment strategies. |
| `Sources/ArgmaxCore/ModelUtilities.swift` | Adds recursive CoreML model bundle discovery helper. |
| `Sources/ArgmaxCore/ModelDownloader.swift` | Adds shared HuggingFace model download/list/resolve utilities. |
| `README.md` | Documents SpeakerKit usage, options, RTTM output, and CLI examples. |
| `Package.swift` | Adds SpeakerKit product/targets and ArgmaxCore Hub dependency. |
| `Makefile` | Adds SpeakerKit model repo setup + download target. |
| `Examples/WhisperAX/WhisperAX/Views/ContentView.swift` | Adds UI + settings for diarization, plus sequential/concurrent diarization flows. |
| `Examples/WhisperAX/WhisperAX.xcodeproj/project.pbxproj` | Links SpeakerKit into the WhisperAX example app target. |
| `.swiftpm/xcode/xcshareddata/xcschemes/whisperkit-Package.xcscheme` | Adds SpeakerKitTests to the shared Xcode scheme. |

let start = speakerWords.first?.wordTiming.start ?? startTime
let end = speakerWords.last?.wordTiming.end ?? endTime
let text = text.trimmingCharacters(in: .whitespaces)
return String(format: "\(speaker) [%.2f-%.2fs]: \(text)", start, end)
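
One thing to watch in a snippet like the one above: interpolating `speaker` and `text` into the format string makes them part of the format itself, so a `%` in the transcript (for example "100% sure") is parsed as a conversion specifier. A minimal alternative, assuming the same inputs, formats only the numbers and interpolates the strings afterwards. `describeSegment` is a hypothetical helper name, not code from the PR.

```swift
import Foundation

// Only the numbers pass through the format string; speaker and text are
// interpolated afterwards, so a "%" in the transcript is kept literally.
func describeSegment(speaker: String, text: String, start: Double, end: Double) -> String {
    let range = String(format: "%.2f-%.2fs", start, end)
    let trimmed = text.trimmingCharacters(in: .whitespaces)
    return "\(speaker) [\(range)]: \(trimmed)"
}

let described = describeSegment(speaker: "Speaker 0", text: " 100% sure ", start: 1.5, end: 2.25)
print(described)  // Speaker 0 [1.50-2.25s]: 100% sure
```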
Comment on lines +190 to +193
let segmenterURL = ModelUtilities.detectModelURL(inFolder: segmenterVersionDir, named: "SpeakerSegmenter")
let embedderPreprocessorURL = ModelUtilities.detectModelURL(inFolder: embedderVersionDir, named: "SpeakerEmbedderPreprocessor")
let embedderURL = ModelUtilities.detectModelURL(inFolder: embedderVersionDir, named: "SpeakerEmbedder")
let pldaURL = ModelUtilities.detectModelURL(inFolder: pldaVersionDir, named: "PldaProjector")
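
For context on what a lookup like `detectModelURL(inFolder:named:)` has to decide, the sketch below resolves a model name to a bundle on disk, preferring a compiled `.mlmodelc` and falling back to an `.mlpackage`. That preference order is an assumption, and `findModelBundle` is a hypothetical stand-in rather than the ArgmaxCore implementation.

```swift
import Foundation

// Hypothetical lookup, not the ArgmaxCore implementation: prefer a compiled
// .mlmodelc bundle and fall back to an .mlpackage with the same base name.
func findModelBundle(inFolder folder: URL, named name: String) -> URL? {
    let candidates = ["\(name).mlmodelc", "\(name).mlpackage"]
    for candidate in candidates {
        let url = folder.appendingPathComponent(candidate)
        if FileManager.default.fileExists(atPath: url.path) {
            return url
        }
    }
    return nil
}

// Usage with a throwaway folder that contains only an .mlpackage directory.
let modelFolder = FileManager.default.temporaryDirectory
    .appendingPathComponent("speakerkit-sketch-\(UUID().uuidString)")
let packageURL = modelFolder.appendingPathComponent("SpeakerSegmenter.mlpackage")
_ = try? FileManager.default.createDirectory(at: packageURL, withIntermediateDirectories: true)

let found = findModelBundle(inFolder: modelFolder, named: "SpeakerSegmenter")
let missing = findModelBundle(inFolder: modelFolder, named: "SpeakerEmbedder")
print(found?.lastPathComponent ?? "not found")
print(missing?.lastPathComponent ?? "not found")
```

Returning an optional makes a missing bundle explicit at the call site, which helps when four separate models must all be present before loading.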

import Foundation
import CoreML
import Hub

Comment on lines +261 to +269
if diarization {
    for (audioPath, result) in zip(resolvedAudioPaths, transcribeResult) {
        do {
            let partialResult = try result.get()
            try await runDiarization(audioPath: audioPath, transcriptionResults: partialResult)
        } catch {
            print("Error during diarization for \(audioPath): \(error)")
        }
    }

Co-authored-by: ZachNagengast <znagengast@gmail.com>
@a2they merged commit 26577ce into main on Mar 13, 2026
11 of 13 checks passed
@a2they deleted the oss/speakerkit branch on March 13, 2026 15:25
4 participants