feat (stt): Webserver: Diarization, export, Voxtral 3B Mini, larger audio files and enhanced model/language select #643

Open
Gotanius wants to merge 6 commits into Blaizzy:main from Gotanius:main

@Gotanius Gotanius commented Apr 10, 2026

Warning

This is one of my first pull requests ever. If I have done anything incorrectly, please let me know and I will fix it.

Context

I wanted to use the webserver to transcribe audio files with Voxtral 3B Mini, but it was not working as expected. I also wanted speaker diarization. While working on that, I made some UX improvements to the frontend.

Summary

This PR applies only to offline transcription, not realtime transcription.

Added STT features to the webserver

  • Option to enable diarization.
  • Language selection based on the selected model, and model selection based on the selected language.
  • Export transcription as txt, srt, vtt, and/or json.
  • Support for larger audio files.

Fixed for the webserver

  • Voxtral 3B Mini STT support.

Diarization support

The frontend now allows the user to enable diarization and select a diarization model. For now, the only available diarization model is Sortformer 4spk v2.1 fp16.

Depending on the selected STT model, the backend in server.py will either:

  • diarize first and then transcribe, for STT models that do not generate segments, or
  • transcribe first and then diarize, for STT models that do generate segments.

The detailed transcript page now shows speaker-labeled output.
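
To make the ordering rule concrete, here is a minimal Python sketch. It is not the code in server.py: the `Turn` and `Segment` types, the function names, and the "largest time overlap" rule for attaching speakers to segments are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """A diarization result: one speaker talking from start to end (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass
class Segment:
    """A transcribed piece of text with timestamps (seconds)."""
    text: str
    start: float
    end: float
    speaker: Optional[str] = None


def pipeline_order(stt_generates_segments: bool) -> List[str]:
    """Return the step order described in the PR text."""
    if stt_generates_segments:
        # The STT model already yields timestamped segments, so diarization
        # can run afterwards and be matched against them.
        return ["transcribe", "diarize"]
    # No segments from the STT model: diarize first, then transcribe.
    return ["diarize", "transcribe"]


def label_segments(segments: List[Segment], turns: List[Turn]) -> List[Segment]:
    """Assign each segment the speaker whose turn overlaps it the most.

    This matching rule is an assumption for the sketch, not taken from the PR.
    """
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            overlap = min(seg.end, turn.end) - max(seg.start, turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = turn.speaker, overlap
        seg.speaker = best_speaker
    return segments
```

A segment that overlaps no turn keeps `speaker=None`, which a UI could render as an unlabeled line.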

Language and model selection

The language selector now supports more languages and includes search functionality. The model selector now uses a similar searchable dropdown.

The language and model selectors are linked, so unsupported combinations are dimmed and cannot be selected.
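
The linking rule can be sketched as a lookup against a support table. The PR implements this in the TypeScript frontend; the Python below only illustrates the logic, and the model names, language codes, and the convention that `None` means "no language restriction" are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical support table; the real model IDs and language lists
# live in the PR's frontend components.
MODEL_LANGUAGES: Dict[str, Optional[Set[str]]] = {
    "model-a": {"en", "de", "fr"},
    "model-b": {"en"},
    "model-c": None,  # assumption: None means the model accepts any language
}


def is_compatible(model: str, language: str) -> bool:
    """True if the (model, language) pair may be selected together."""
    if model not in MODEL_LANGUAGES:
        return False
    supported = MODEL_LANGUAGES[model]
    return supported is None or language in supported


def selectable_languages(model: str, languages: Iterable[str]) -> Dict[str, bool]:
    """Map each language to True (selectable) or False (dimmed) for a model."""
    return {lang: is_compatible(model, lang) for lang in languages}


def selectable_models(language: str) -> Dict[str, bool]:
    """Map each model to True (selectable) or False (dimmed) for a language."""
    return {model: is_compatible(model, language) for model in MODEL_LANGUAGES}
```

Keeping a single table and deriving both directions from it avoids the two selectors disagreeing about which combinations are valid.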

Transcript export

The detailed transcription page previously had a placeholder export function. This is now implemented.

Selecting an export format from the dropdown will download the transcript in the chosen format:

  • txt
  • srt
  • vtt
  • json
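
For reference, the srt and vtt formats differ mainly in the millisecond separator and the `WEBVTT` header. The sketch below is Python rather than the frontend's TypeScript, and the segment shape (`start`, `end`, `text`, optional `speaker`) and the `Speaker: text` prefix are assumptions, not the PR's actual data model.

```python
from typing import Dict, List


def format_timestamp(seconds: float, ms_separator: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (',' for SRT, '.' for WebVTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{ms_separator}{ms:03d}"


def cue_text(segment: Dict) -> str:
    """Prefix the text with the speaker label when one is present."""
    speaker = segment.get("speaker")
    return f"{speaker}: {segment['text']}" if speaker else segment["text"]


def to_srt(segments: List[Dict]) -> str:
    """Render segments as SubRip: numbered cues, comma before milliseconds."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{cue_text(seg)}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(segments: List[Dict]) -> str:
    """Render segments as WebVTT: header line, dot before milliseconds."""
    blocks = ["WEBVTT"]
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        blocks.append(f"{start} --> {end}\n{cue_text(seg)}")
    return "\n\n".join(blocks) + "\n"
```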

Support for larger audio files

Added a DB_Store to hold the audio files for transcription, which bypasses the previous 5 MB file size limit.

Changes in the codebase

Frontend components:

  • Added ui/components/modelLanguageSelect for matching languages against models.
  • Added ui/components/SearchableLanguageSelect for a searchable language dropdown.
  • Added ui/components/SearchableSTTModelSelect for a searchable model dropdown.
  • Added audio-db for storing audio files.
  • Updated ui/app/speech-to-text/page.tsx to include the searchable dropdowns, new diarization flow, audio-db and export functionality.
  • Modified server.py to support diarization integration and Voxtral 3B Mini.

… files page.

The language select now contains more languages, and the dropdown has search functionality.
The model select now uses a dropdown similar to the new language select.
The model and language selects now dim options that don't match. You can't select a language that the selected model doesn't support, and vice versa.
@Gotanius Gotanius closed this Apr 11, 2026
@Gotanius Gotanius reopened this Apr 11, 2026
@Gotanius Gotanius changed the title from "Webserver: Expanded model and language select on STT Transcribe page" to "feat (stt): Webserver: Expanded model and language select on STT Transcribe page" Apr 12, 2026
Blaizzy and others added 5 commits April 14, 2026 11:51
Changed the display of language and date on the output page. It previously fell back to "english" and "yesterday"; it now shows the correct language and date.
Moved audio file storage from LocalStorage to IndexedDB to support larger audio files.
Added diarization support; for now only Sortformer 4spk v2.1 fp16 is supported. Diarization results are also visible on the detailed transcription pages.
For STT models that generate segments, diarization runs after transcription. For STT models that do not generate segments, diarization runs before transcription.
Added export functionality for downloading transcriptions as txt, srt, vtt, and json.
@Gotanius Gotanius changed the title from "feat (stt): Webserver: Expanded model and language select on STT Transcribe page" to "feat (stt): Webserver: Diarization, export and Voxtral 3B Mini, larger audio files" Apr 17, 2026
@Gotanius Gotanius changed the title from "feat (stt): Webserver: Diarization, export and Voxtral 3B Mini, larger audio files" to "feat (stt): Webserver: Diarization, export, Voxtral 3B Mini, larger audio files and enhanced model/language select" Apr 17, 2026
@Gotanius
Author

Updated the PR message to reflect the new changes.

Comment thread mlx_audio/stt/generate.py
synced_stream = False
for s in streams:
mx.synchronize(s)
try:
Collaborator

@Gotanius What is this guarding against?

3 participants