feat (stt): Webserver: Diarization, export, Voxtral 3B Mini, larger audio files and enhanced model/language select #643
Open
Gotanius wants to merge 6 commits into Blaizzy:main
… files page. The language select now contains more languages, and the dropdown has search functionality. The model select now has a dropdown similar to the new language select. The model and language selects now dim options that don't match: you can't select a language that the selected model doesn't support, and vice versa.
Changed the display of language and date on the output page. It previously fell back to "english" and "yesterday"; it now shows the correct language and date.
Moved from LocalStorage to IndexedDB for larger audio file support. Added diarization support; for now only Sortformer 4spk v2.1 fp16 is supported. Diarization results are also visible on the detailed transcription pages. For STT models that generate segments, diarization runs after transcription. For STT models that do not generate segments, diarization runs before the STT model. Added export functionality for downloading transcriptions as txt, srt, vtt and json.
Author
Updated the PR message to reflect the new changes.
lucasnewman reviewed Apr 24, 2026
synced_stream = False
for s in streams:
    mx.synchronize(s)
try:
Warning
This is one of my first pull requests ever. If I have done anything incorrectly, please let me know and I will fix it.
Context
I wanted to use the webserver to transcribe audio files with Voxtral 3B Mini, but it was not working as expected. I also wanted speaker diarization. While working on that, I made some UX improvements to the frontend.
Summary
This PR applies only to offline transcription, not realtime transcription.
Added STT features to the webserver
… txt, srt, vtt, and/or json.
Fixed for the webserver
Diarization support
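The frontend now allows the user to enable diarization and select a diarization model. For now, the only available diarization model is Sortformer 4spk v2.1 fp16.
Depending on the selected STT model, the backend in server.py will either:
- run diarization after transcription, for STT models that generate segments, or
- run diarization before the STT model, for STT models that do not generate segments.
The detailed transcript page now shows speaker-labeled output.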
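For reviewers, here is a minimal sketch of the ordering rule described in this PR: diarization runs after transcription when the STT model generates segments, and before it otherwise. It is written in TypeScript for illustration only; the actual logic lives in server.py. All names (`planSteps`, `labelSegments`, `Segment`, `SpeakerTurn`) are hypothetical, and the "largest overlap wins" rule for attaching a speaker to a segment is an assumption, not something this PR states.

```typescript
// Hypothetical shapes; the real backend is Python and may differ.
interface Segment { start: number; end: number; text: string; }
interface SpeakerTurn { start: number; end: number; speaker: string; }
type Step = "transcribe" | "diarize";

// Diarize after transcription if the STT model yields segments, before otherwise.
function planSteps(modelEmitsSegments: boolean): Step[] {
  return modelEmitsSegments ? ["transcribe", "diarize"] : ["diarize", "transcribe"];
}

// Assumed labeling rule: each segment gets the speaker whose turn overlaps it most.
// Segments with no overlapping turn keep speaker = null.
function labelSegments(
  segments: Segment[],
  turns: SpeakerTurn[]
): (Segment & { speaker: string | null })[] {
  return segments.map((seg) => {
    let best: string | null = null;
    let bestOverlap = 0;
    for (const turn of turns) {
      const overlap = Math.min(seg.end, turn.end) - Math.max(seg.start, turn.start);
      if (overlap > bestOverlap) {
        bestOverlap = overlap;
        best = turn.speaker;
      }
    }
    return { ...seg, speaker: best };
  });
}
```

With segments available, the labeled result is enough to render speaker-tagged lines on the detailed transcript page.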
Language and model selection
The language selector now supports more languages and includes search functionality. The model selector now uses a similar searchable dropdown.
The language and model selectors are linked, so unsupported combinations are dimmed and cannot be selected.
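The linking between the two selectors can be sketched roughly as below. This is an illustrative sketch, not the code in ui/components/modelLanguageSelect: the `SttModel` shape, the `Option` type, the function names, and the placeholder model ids and language codes are all assumptions.

```typescript
// Hypothetical catalog entry: a model id plus the language codes it supports.
interface SttModel { id: string; languages: string[]; }

// A dropdown option; dimmed options are shown but cannot be selected.
interface Option { value: string; dimmed: boolean; }

// Dim every language the currently selected model does not support.
function languageOptions(allLanguages: string[], selectedModel: SttModel | null): Option[] {
  return allLanguages.map((lang) => ({
    value: lang,
    dimmed: selectedModel !== null && selectedModel.languages.indexOf(lang) === -1,
  }));
}

// Dim every model that does not support the currently selected language.
function modelOptions(models: SttModel[], selectedLanguage: string | null): Option[] {
  return models.map((model) => ({
    value: model.id,
    dimmed: selectedLanguage !== null && model.languages.indexOf(selectedLanguage) === -1,
  }));
}
```

Computing both option lists from the same catalog keeps the two dropdowns consistent in either direction.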
Transcript export
The detailed transcription page previously had a placeholder export function. This is now implemented.
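To give a sense of what the subtitle formats involve, here is a minimal sketch that converts timed segments to SRT and WebVTT text. It is not the implementation in page.tsx; the `Cue` shape and the `formatTimestamp`, `toSrt`, and `toVtt` names are hypothetical, and times are assumed to be in seconds.

```typescript
// Hypothetical cue shape: start/end in seconds plus the spoken text.
interface Cue { start: number; end: number; text: string; }

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as HH:MM:SS<sep>mmm. SRT uses "," and WebVTT uses ".".
function formatTimestamp(seconds: number, msSeparator: "," | "."): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${msSeparator}${pad(ms, 3)}`;
}

// SRT: numbered cues, comma before milliseconds, blank line between cues.
function toSrt(cues: Cue[]): string {
  return cues
    .map((c, i) => `${i + 1}\n${formatTimestamp(c.start, ",")} --> ${formatTimestamp(c.end, ",")}\n${c.text}\n`)
    .join("\n");
}

// WebVTT: "WEBVTT" header, blank line, then cues with a period before milliseconds.
function toVtt(cues: Cue[]): string {
  const body = cues
    .map((c) => `${formatTimestamp(c.start, ".")} --> ${formatTimestamp(c.end, ".")}\n${c.text}\n`)
    .join("\n");
  return `WEBVTT\n\n${body}`;
}
```

The txt and json exports need no timestamp formatting: the former is the joined segment text and the latter can be produced with `JSON.stringify`.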
Selecting an export format from the dropdown will download the transcript in the chosen format:
- txt
- srt
- vtt
- json

Support for larger audio files
Added a DB_Store to hold the audio files for transcription, which bypasses the previous 5 MB file size limit.
Changes in the codebase
Frontend components:
- ui/components/modelLanguageSelect for matching languages against models.
- ui/components/SearchableLanguageSelect for a searchable language dropdown.
- ui/components/SearchableSTTModelSelect for a searchable model dropdown.
- ui/app/speech-to-text/page.tsx to include the searchable dropdowns, new diarization flow, audio-db and export functionality.
- server.py to support diarization integration and Voxtral 3B Mini.