feat (stt): Webserver: Diarization, export, Voxtral 3B Mini, larger audio files and enhanced model/language select by Gotanius · Pull Request #643 · Blaizzy/mlx-audio

Gotanius · 2026-04-10T16:09:54Z

Warning

This is one of my first pull requests ever. If I have done anything incorrectly, please let me know and I will fix it.

Context

I wanted to use the webserver to transcribe audio files with Voxtral 3B Mini, but it was not working as expected. I also wanted speaker diarization. While working on that, I made some UX improvements to the frontend.

Summary

This PR applies only to offline transcription, not realtime transcription.

Added STT features to the webserver

Option to enable diarization.
Language selection based on the selected model, and model selection based on the selected language.
Export transcription as txt, srt, vtt, and/or json.
Support for larger audio files.

Fixes for the webserver

Voxtral 3B Mini STT support.

Diarization support

The frontend now allows the user to enable diarization and select a diarization model. For now, the available diarization model is Sortformer 4spk v2.1 fp16.

Depending on the selected STT model, the backend in server.py will either:

diarize first and then transcribe, for STT models that do not generate segments, or
transcribe first and then diarize, for STT models that do generate segments.
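The two orderings above can be sketched roughly as follows. This is a minimal illustration, not the code in server.py: `Segment`, `label_segments`, `transcribe_with_diarization`, and the callables passed in are all hypothetical names, and assigning each segment to the speaker turn with the largest time overlap is an assumed strategy that the PR does not confirm.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class Segment:
    """A time span with optional text and speaker label (hypothetical type)."""
    start: float
    end: float
    text: str = ""
    speaker: Optional[str] = None


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the intersection of two spans (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def label_segments(segments: List[Segment], turns: List[Segment]) -> List[Segment]:
    """Give each transcript segment the speaker of the best-overlapping turn."""
    labeled = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        speaker = best.speaker if best and overlap(seg, best) > 0 else None
        labeled.append(replace(seg, speaker=speaker))
    return labeled


def transcribe_with_diarization(
    audio: Any,
    model_generates_segments: bool,
    transcribe_segments: Callable[[Any], List[Segment]],
    transcribe_span: Callable[[Any, float, float], str],
    diarize: Callable[[Any], List[Segment]],
) -> List[Segment]:
    if model_generates_segments:
        # Transcribe first, then diarize and attach speakers to the segments.
        segments = transcribe_segments(audio)
        turns = diarize(audio)
        return label_segments(segments, turns)
    # Diarize first, then transcribe each speaker turn on its own.
    turns = diarize(audio)
    return [replace(t, text=transcribe_span(audio, t.start, t.end)) for t in turns]
```

In both branches the result is a list of speaker-labeled segments, so the transcript page could render either one the same way.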

The detailed transcript page now shows speaker-labeled output.

Language and model selection

The language selector now supports more languages and includes search functionality. The model selector now uses a similar searchable dropdown.

The language and model selectors are linked, so unsupported combinations are dimmed and cannot be selected.
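The linking logic can be thought of as a lookup against a model-to-languages support table. The sketch below is written in Python for illustration only; the actual selectors live in the TypeScript frontend, and the table contents (`model-a`, `model-b`, and their language codes) as well as the function names are placeholders, not the real support matrix. Language codes are lowercased before comparison, in line with the lowercase-codes commit in this PR.

```python
from typing import Dict, List, Optional, Set, Tuple

# Placeholder support table: model id -> supported lowercase language codes.
SUPPORTED: Dict[str, Set[str]] = {
    "model-a": {"en", "nl", "de"},
    "model-b": {"en"},
}


def language_options(selected_model: Optional[str]) -> List[Tuple[str, bool]]:
    """Return (language, enabled) pairs; disabled entries would be dimmed."""
    all_langs = sorted(set().union(*SUPPORTED.values()))
    if selected_model is None:
        return [(lang, True) for lang in all_langs]
    allowed = SUPPORTED.get(selected_model, set())
    return [(lang, lang in allowed) for lang in all_langs]


def model_options(selected_language: Optional[str]) -> List[Tuple[str, bool]]:
    """Return (model, enabled) pairs for the currently selected language."""
    if selected_language is None:
        return [(model, True) for model in sorted(SUPPORTED)]
    lang = selected_language.lower()  # normalize before comparing
    return [(model, lang in SUPPORTED[model]) for model in sorted(SUPPORTED)]
```

With this table, `model_options("NL")` enables `model-a` and dims `model-b`, while `language_options("model-b")` enables only `en`.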

Transcript export

The detailed transcription page previously had a placeholder export function. This is now implemented.

Selecting an export format from the dropdown will download the transcript in the chosen format:

txt
srt
vtt
json
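The srt and vtt formats differ mainly in the file header and the millisecond separator in timestamps (a comma in SRT, a period in WebVTT). A minimal sketch of the conversion, written in Python for illustration although the export itself runs in the frontend: the segment shape (`start`, `end`, `text`, optional `speaker`) and all function names are assumptions, not taken from the PR.

```python
from typing import Dict, List


def format_timestamp(seconds: float, sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def cue_text(seg: Dict) -> str:
    """Prefix the text with the speaker label when one is present."""
    speaker = seg.get("speaker")
    return f"{speaker}: {seg['text']}" if speaker else seg["text"]


def to_srt(segments: List[Dict]) -> str:
    """SRT: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{cue_text(seg)}\n")
    return "\n".join(blocks)


def to_vtt(segments: List[Dict]) -> str:
    """WebVTT: 'WEBVTT' header, period before the milliseconds."""
    cues = []
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        cues.append(f"{start} --> {end}\n{cue_text(seg)}\n")
    return "WEBVTT\n\n" + "\n".join(cues)
```

The txt and json exports need no timestamp formatting: txt can join the cue texts line by line, and json can serialize the segment list directly.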

Support for larger audio files

Added a DB_Store to store audio files for transcription, bypassing the previous 5MB file size limit.

Changes in the codebase

Frontend components:

Added ui/components/modelLanguageSelect for matching languages against models.
Added ui/components/SearchableLanguageSelect for a searchable language dropdown.
Added ui/components/SearchableSTTModelSelect for a searchable model dropdown.
Added audio-db for storing audio files.
Updated ui/app/speech-to-text/page.tsx to include the searchable dropdowns, new diarization flow, audio-db and export functionality.
Modified server.py to support diarization integration and Voxtral 3B Mini.

… files page. Language select now contains more languages and the dropdown has search functionality. Model select now has a dropdown similar to the new language select. The model and language selects now dim options that don't match. You can't select a language not supported by the selected model, and vice versa.

Changed the display of language and date in the output page. It had a fallback to "english" and "yesterday". Now it shows the correct language and date

Moved from localStorage to IndexedDB for larger audio file support. Added diarization support; for now only Sortformer 4spk v2.1 fp16 is supported. Diarization results are also visible on the detailed transcription pages. For STT models that generate segments, diarization will run after transcription. For STT models that do not generate segments, diarization will run before the STT model. Added export functionality for exporting and downloading transcriptions as txt, srt, vtt and json.

Gotanius · 2026-04-17T12:03:40Z

Updated the PR message to reflect the new changes.

lucasnewman · 2026-04-24T16:15:29Z

+                synced_stream = False
                for s in streams:
-                    mx.synchronize(s)
+                    try:


@Gotanius What is this guarding against?

Gotanius closed this Apr 11, 2026

Gotanius reopened this Apr 11, 2026

Gotanius changed the title ~~Webserver: Expanded model and language select on STT Transcribe page~~ feat (stt): Webserver: Expanded model and language select on STT Transcribe page Apr 12, 2026

Blaizzy and others added 5 commits April 14, 2026 11:51

Merge branch 'main' into main

2291902

Slight bugfix: Voxtral 3B now allows selecting Dutch.

594f6cc

Made language codes lowercase to appease models (Voxtral).

98ca8e3

Changed the display of language and date in the output page. It had a fallback to "english" and "yesterday". Now it shows the correct language and date

Small syntax fix in server.py

ce6d28b

Gotanius changed the title ~~feat (stt): Webserver: Expanded model and language select on STT Transcribe page~~ feat (stt): Webserver: Diarization, export and Voxtral 3B Mini, larger audio files Apr 17, 2026

Gotanius changed the title ~~feat (stt): Webserver: Diarization, export and Voxtral 3B Mini, larger audio files~~ feat (stt): Webserver: Diarization, export, Voxtral 3B Mini, larger audio files and enhanced model/language select Apr 17, 2026

lucasnewman reviewed Apr 24, 2026

Gotanius wants to merge 6 commits into Blaizzy:main from Gotanius:main
