Add SpeakerKit with Pyannote speaker diarization support #440
Merged
Pull request overview
Adds a new SpeakerKit module to the Swift package to support on-device speaker diarization (Pyannote v4 pipeline), and wires it into both the CLI and the WhisperAX example app, alongside new shared model-download utilities and test coverage.
Changes:
- Introduces `SpeakerKit` (models, Pyannote diarizer, clustering, RTTM generation, transcription alignment types).
- Extends `whisperkit-cli` with a new `diarize` subcommand and adds diarization support to `transcribe`.
- Adds `ArgmaxCore.ModelDownloader` and updates docs/build/test setup for the new product and workflows.
Reviewed changes
Copilot reviewed 36 out of 38 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Tests/WhisperKitTests/UnitTestsPlan.xctestplan | Updates test plan targets to include SpeakerKit tests and keeps WhisperKit skips. |
| Tests/SpeakerKitTests/SpeakerEmbedderContextTests.swift | Unit tests for embedder context stride/index math and bounds behavior. |
| Tests/SpeakerKitTests/RTTMLineTests.swift | Tests RTTM formatting and RTTM generation from diarization + transcription. |
| Tests/SpeakerKitTests/PyannoteIntegrationTests.swift | Integration tests exercising end-to-end diarization on bundled audio. |
| Tests/SpeakerKitTests/MathOpsTests.swift | Unit + perf tests for Accelerate-backed math utilities. |
| Tests/SpeakerKitTests/ExclusiveReconciliationTests.swift | Validates exclusive reconciliation reduces overlaps. |
| Tests/SpeakerKitTests/DiarizerPostProcessingTests.swift | Validates post-processing produces sane segment/frame distributions. |
| Tests/SpeakerKitTests/DiarizationResultTests.swift | Tests diarization result segment creation and speaker alignment strategies. |
| Tests/SpeakerKitTests/ClusterAlgorithmsTests.swift | Tests clustering and VBx edge cases/guards. |
| Sources/WhisperKitCLI/WhisperKitCLI.swift | Registers the new diarize CLI subcommand. |
| Sources/WhisperKitCLI/TranscribeCLI.swift | Adds diarization flags and runs diarization after transcription. |
| Sources/WhisperKitCLI/DiarizeCLI.swift | New standalone diarization command that outputs RTTM. |
| Sources/TTSKit/TTSKit.swift | Switches file listing / downloads to the new ModelDownloader API. |
| Sources/SpeakerKit/SpeakerSegment.swift | Introduces SpeakerSegment model type for diarization + aligned transcription segments. |
| Sources/SpeakerKit/SpeakerKitModelManager.swift | Downloads and loads Pyannote CoreML bundles (segmenter/embedder/PLDA). |
| Sources/SpeakerKit/SpeakerKit.swift | Public SpeakerKit API (diarize, generateRTTM) + shared error types. |
| Sources/SpeakerKit/SpeakerInfo.swift | Adds speaker info enums/strategies and word timing wrappers. |
| Sources/SpeakerKit/RTTMLine.swift | Adds RTTM line type + grouping from word-level speaker assignments. |
| Sources/SpeakerKit/PyannoteConfig.swift | Adds Pyannote config, diarization options, and diarization timings structures. |
| Sources/SpeakerKit/Pyannote/VBxClustering.swift | Implements VBx clustering actor + fallback paths for non-trainable embeddings. |
| Sources/SpeakerKit/Pyannote/SpeakerSegmenterModel.swift | CoreML segmenter wrapper and chunked concurrent inference pipeline. |
| Sources/SpeakerKit/Pyannote/SpeakerPreEmbedderModel.swift | CoreML preprocessor wrapper for embedder input preparation. |
| Sources/SpeakerKit/Pyannote/SpeakerEmbedderModel.swift | CoreML embedder + PLDA embeddings + embedding extraction logic. |
| Sources/SpeakerKit/Pyannote/SpeakerClustering.swift | Defines clustering config/result and clustering protocol. |
| Sources/SpeakerKit/Pyannote/PyannoteDiarizer.swift | End-to-end diarization pipeline actor (segment → embed → cluster → postprocess). |
| Sources/SpeakerKit/Pyannote/MathOps.swift | Adds Accelerate-backed math helpers (mmul/transpose/softmax/cosine/argmax). |
| Sources/SpeakerKit/DiarizationResult.swift | Adds diarization result type + transcription alignment strategies. |
| Sources/ArgmaxCore/ModelUtilities.swift | Adds recursive CoreML model bundle discovery helper. |
| Sources/ArgmaxCore/ModelDownloader.swift | Adds shared HuggingFace model download/list/resolve utilities. |
| README.md | Documents SpeakerKit usage, options, RTTM output, and CLI examples. |
| Package.swift | Adds SpeakerKit product/targets and ArgmaxCore Hub dependency. |
| Makefile | Adds SpeakerKit model repo setup + download target. |
| Examples/WhisperAX/WhisperAX/Views/ContentView.swift | Adds UI + settings for diarization, plus sequential/concurrent diarization flows. |
| Examples/WhisperAX/WhisperAX.xcodeproj/project.pbxproj | Links SpeakerKit into the WhisperAX example app target. |
| .swiftpm/xcode/xcshareddata/xcschemes/whisperkit-Package.xcscheme | Adds SpeakerKitTests to the shared Xcode scheme. |
    let start = speakerWords.first?.wordTiming.start ?? startTime
    let end = speakerWords.last?.wordTiming.end ?? endTime
    let text = text.trimmingCharacters(in: .whitespaces)
    return String(format: "\(speaker) [%.2f-%.2fs]: \(text)", start, end)
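A side note on the last line of this snippet, offered as a sketch and not as the text of the review comment (which is not preserved here): `String(format:)` treats its first argument as a format string, so a literal `%` inside the interpolated `text` would be read as a format specifier. One possible alternative passes only the numeric range through the format string. The helper name and its parameters below are hypothetical and do not come from the PR.

```swift
import Foundation

// Hypothetical helper; the name and parameters are illustrative only.
func speakerLine(speaker: String, text: String, start: Double, end: Double) -> String {
    // Only the numeric range goes through the format string.
    let range = String(format: "[%.2f-%.2fs]", start, end)
    // Speaker label and transcript text are interpolated as plain strings,
    // so a "%" in the transcript is left untouched.
    let trimmed = text.trimmingCharacters(in: .whitespaces)
    return "\(speaker) \(range): \(trimmed)"
}

let line = speakerLine(speaker: "Speaker 0", text: " 100% sure ", start: 0, end: 1.5)
print(line)
```

Keeping user-derived text out of the format string means the transcript content cannot change how the numeric arguments are consumed.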
Comment on lines +190 to +193

    let segmenterURL = ModelUtilities.detectModelURL(inFolder: segmenterVersionDir, named: "SpeakerSegmenter")
    let embedderPreprocessorURL = ModelUtilities.detectModelURL(inFolder: embedderVersionDir, named: "SpeakerEmbedderPreprocessor")
    let embedderURL = ModelUtilities.detectModelURL(inFolder: embedderVersionDir, named: "SpeakerEmbedder")
    let pldaURL = ModelUtilities.detectModelURL(inFolder: pldaVersionDir, named: "PldaProjector")

    import Foundation
    import CoreML
    import Hub
Comment on lines +261 to +269

    if diarization {
        for (audioPath, result) in zip(resolvedAudioPaths, transcribeResult) {
            do {
                let partialResult = try result.get()
                try await runDiarization(audioPath: audioPath, transcriptionResults: partialResult)
            } catch {
                print("Error during diarization for \(audioPath): \(error)")
            }
        }
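The loop above hands each file's transcription to the diarization step. One common way to combine the two outputs is to give each word the speaker whose turn overlaps it the most. The following is a minimal sketch of that idea only; `SpeakerTurn`, `Word`, and `assignSpeakers` are hypothetical stand-ins, and SpeakerKit's actual alignment strategies may work differently.

```swift
import Foundation

// Hypothetical stand-ins; SpeakerKit's real types differ.
struct SpeakerTurn { let speaker: Int; let start: Double; let end: Double }
struct Word { let text: String; let start: Double; let end: Double }

/// Length in seconds of the time overlap between a word and a turn (0 if disjoint).
func overlap(_ w: Word, _ t: SpeakerTurn) -> Double {
    max(0, min(w.end, t.end) - max(w.start, t.start))
}

/// Assigns each word the speaker whose turn overlaps it most; nil if no turn overlaps.
func assignSpeakers(words: [Word], turns: [SpeakerTurn]) -> [(word: Word, speaker: Int?)] {
    words.map { word in
        let best = turns
            .map { (turn: $0, overlap: overlap(word, $0)) }
            .filter { $0.overlap > 0 }
            .max { $0.overlap < $1.overlap }
        return (word: word, speaker: best?.turn.speaker)
    }
}

let turns = [
    SpeakerTurn(speaker: 0, start: 0.0, end: 1.0),
    SpeakerTurn(speaker: 1, start: 1.0, end: 2.0),
]
let words = [
    Word(text: "hello", start: 0.1, end: 0.6),
    Word(text: "there", start: 0.8, end: 1.6),   // straddles both turns
    Word(text: "bye", start: 2.5, end: 2.9),     // outside every turn
]
let assigned = assignSpeakers(words: words, turns: turns)
for item in assigned {
    print(item.word.text, item.speaker.map { "speaker \($0)" } ?? "none")
}
```

Returning `nil` for words that overlap no turn leaves the policy for unassigned words (drop, nearest turn, previous speaker) to the caller.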
ZachNagengast approved these changes on Mar 13, 2026
EduardoPach approved these changes on Mar 13, 2026
On-device speaker diarization using the [Pyannote v4 (community-1)](https://huggingface.co/argmaxinc/speakerkit-coreml) pipeline.

```
swift run whisperkit-cli diarize --audio-path audio.wav [--verbose] \
    [--num-speakers N] [--cluster-distance-threshold F] \
    [--model-path /path] [--model-repo org/repo] [--download-model-path /path] \
    [--rttm-path output.rttm]
```

Output is standard RTTM. Use --model-path for existing models; otherwise models are resolved/downloaded as needed (e.g. via --download-model-path).

```
swift run whisperkit-cli transcribe --audio-path audio.wav --diarization [--verbose] \
    [--diarization-model-path /path] [--diarization-model-repo org/repo] \
    [--diarization-num-speakers N]
```

Runs transcription and diarization in one pass. Word timestamps are enabled when --diarization is set so speaker labels align at the subsegment level. RTTM lines are printed to stdout.

WhisperAX shows full SpeakerKit integration:
- Pipeline version picker (Pyannote 4 or None)
- Speakers tab with per-segment speaker labels

```
SpeakerKit
├── SpeakerKit.swift                - public API (diarize, generateRTTM)
├── SpeakerKitModelManager.swift    - download + load Pyannote models
├── PyannoteConfig.swift            - config + DiarizationOptionsProtocol
├── DiarizationResult.swift         - result type + transcription alignment
├── SpeakerSegment.swift
├── SpeakerInfo.swift
├── RTTMLine.swift
└── Pyannote/
    ├── PyannoteDiarizer.swift
    ├── SpeakerSegmenterModel.swift
    ├── SpeakerEmbedderModel.swift
    ├── SpeakerPreEmbedderModel.swift
    ├── SpeakerClustering.swift     - clustering protocol + config
    ├── VBxClustering.swift         - variational Bayes HMM
    ├── ClusteringAlgorithms.swift  - fast linkage
    └── MathOps.swift               - matrix ops (Accelerate)

ArgmaxCore
├── ModelDownloader.swift           - downloads + ModelInfo struct
├── ModelState.swift
└── (other shared utilities: Logging, ModelUtilities, etc.)
```

---------

Co-authored-by: ZachNagengast <znagengast@gmail.com>
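For readers unfamiliar with RTTM, each `SPEAKER` record is a single space-separated line with ten fields: type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, and lookahead, with `<NA>` for unused fields. The sketch below shows how such a line could be produced; `Turn` and `rttmLine` are hypothetical names for illustration, and the three-decimal precision is an assumption, not necessarily what SpeakerKit's `RTTMLine` emits.

```swift
import Foundation

/// Hypothetical stand-in for a diarized speaker turn; not SpeakerKit's actual type.
struct Turn {
    let speaker: String
    let start: Double   // seconds
    let end: Double     // seconds
}

/// Formats one turn as an RTTM SPEAKER record.
/// Field order: type, file ID, channel, onset, duration,
/// orthography, speaker type, speaker name, confidence, lookahead.
func rttmLine(fileID: String, turn: Turn) -> String {
    let onset = String(format: "%.3f", turn.start)
    let duration = String(format: "%.3f", turn.end - turn.start)
    return "SPEAKER \(fileID) 1 \(onset) \(duration) <NA> <NA> \(turn.speaker) <NA> <NA>"
}

let turns = [
    Turn(speaker: "SPEAKER_00", start: 0.0, end: 2.5),
    Turn(speaker: "SPEAKER_01", start: 2.5, end: 4.0),
]
let lines = turns.map { rttmLine(fileID: "audio", turn: $0) }
lines.forEach { print($0) }
```

Note that RTTM stores onset and duration, not start and end, so the end time has to be converted before writing.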