Asr tweak by Luke-Bilhorn · Pull Request #1029 · genesis-ai-dev/codex-editor

Luke-Bilhorn · 2026-06-05T04:47:24Z

PR Title

979-when-transcribing-audio-there-is-an-unknown-tag

Summary

Closes #979

Originally a fix for the "unknown" language tag on transcribed cells, but the work uncovered deeper issues. This PR wires OmniASR's ?lang= end-to-end instead of always attempting to transcribe in English, gives the user a Language + Script picker for transcription, and adds a Re-transcribe button.

The mms-zeroshot-asr Modal app is left running, though we do have a new endpoint with a more descriptive name and LID running if we'd like it. The only problems with switching would be handling users who haven't updated yet, and auto-detect users paying an extra second of latency for LID to tell them what language is being transcribed. That detected language isn't even strictly accurate, since the MMS LID result could differ from what Omni transcribes. So I didn't think switching was worth it, and I did not end up using the new endpoint, since the only upgrade was the name.

Changes

Language plumbing: new sharedUtils/asrLanguageUtils.ts + supporting data files resolve project metadata to OmniASR {iso639_3}_{Script} codes.
UI: dead phonetic/IPA branches removed. Transcribe button always visible, greys out while running, flips to Re-transcribe afterwards. A gear is glued to it as a split-button with a popover for Language (Project / Auto-detect) and Script (Default / Latin / Other ISO 15924 tag).
Language badge hidden for now, since we are not doing anything that makes it convey information the user does not already know.
codex-asr Modal app (docs/asr/codex_asr_modal.py): renamed for model-agnosticism, pinned omnilingual-asr==0.2.0 + torch==2.8.0 cu128, added a fairseq2 gang-context shim, added MMS-LID 2048 that runs only when ?lang= is omitted and pairs the detected base with a default script.
Settings: removed asrPhonetic; added asrLanguageMode and asrScriptPref; defaults updated for asrEndpoint / asrProvider / asrModel. Frontier auth-proxy URL still wins when authenticated.
Types: asrConfig gains lang / languageMode / scriptPref / projectLanguageName; updateCellAfterTranscription.language is now string | null; new setAsrLanguageMode / setAsrScriptPref messages.
Docs: rewrote asr-proxy-endpoint.md and AUTH_SERVER_ASR_IMPLEMENTATION.md for the HTTP POST contract; added an Action Items section for the auth-proxy team; old summary marked historical.

Follow-ups if we want to... my guess is we don't

Auth-proxy upstream migration mms-zeroshot-asr → codex-asr
Re-enable language badge once the proxy migration lands.
Decommission mms-zeroshot-asr after a release cycle.

Testing Checklist

Editor — Transcribe / Re-transcribe

Transcribe greys out while running, then label changes to Re-transcribe.
Re-transcribe re-runs against the current audio and replaces the saved result.
Insert Transcription still works.

Editor — Gear popover

Project mode sends ?lang=<resolved code>.
Auto-detect sends no ?lang=.
Script → Latin on a non-Latin language (e.g. Urdu) sends urd_Latn.
Script → Other reveals the input immediately; Apply enables only on a valid 4-letter tag (ISO 15924); Enter also commits.
Language badge is not rendered on transcribed cells (intentional).
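
The "valid 4-letter tag" check for Script → Other could look roughly like the sketch below. The helper name `normalizeScriptTag` is hypothetical, and the check only covers the shape of the tag (four ASCII letters, normalised to ISO 15924 title case). Whether the real popover also checks the tag against the ISO 15924 registry is not stated in this PR.

```typescript
// Hypothetical helper (not from the PR): accept exactly four ASCII letters
// and normalise to ISO 15924 title case, e.g. "latn" -> "Latn".
// Returns null for anything else, which would keep Apply disabled.
function normalizeScriptTag(input: string): string | null {
  const trimmed = input.trim();
  if (!/^[A-Za-z]{4}$/.test(trimmed)) {
    return null;
  }
  return trimmed.charAt(0).toUpperCase() + trimmed.slice(1).toLowerCase();
}
```

A shape-only check keeps the input responsive. An unsupported but well-formed tag would still be caught later, when the resolved code is compared against the supported-codes list.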

`codex-asr` smoke

POST /transcribe (no lang) returns lang + lid_s.
POST /transcribe?lang=eng_Latn echoes eng_Latn, no lid_s.
?lang=foo_Bar → 400 with supported-langs hint.

Regression

Previously-transcribed cells still display their text on load.

Screenshots

Adds four new files in sharedUtils/, all importable from both the extension host and the webview bundles:

- omniAsrSupportedLangs.ts: 1672 supported {iso639_3}_{Script} codes, snapshotted from the live GET /languages endpoint.
- omniAsrDefaultScripts.ts: per-base "best guess" script for the 19 multi-script bases (urd→Arab, cmn→Hans, uig→Arab, yue→Hant, ...). All others have exactly one supported script, so no entry is needed.
- omniAsrFriendlyNames.ts: 1650 base→Ref_Name map, for rendering the transcription badge.
- asrLanguageUtils.ts: pure helpers resolveOmniAsrCode(meta, scriptPref) and labelForTranscriptionLanguage(serverLang, sentCode, projectLanguageName).

Nothing is wired up yet; that comes in subsequent commits. Each file has a header explaining how to regenerate after a model/endpoint change.
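
As a rough illustration of how the data files and the resolver fit together, here is a minimal sketch. It is not the actual resolveOmniAsrCode: the function name `resolveCodeSketch` is hypothetical, it takes an ISO 639-3 base directly instead of project metadata, the two tables are tiny stand-ins for the real data files, and returning null for an unsupported base/script pair is an assumption.

```typescript
// Stand-ins for omniAsrSupportedLangs.ts and omniAsrDefaultScripts.ts.
// The real files hold far more entries; these are illustrative only.
const SUPPORTED: string[] = [
  "eng_Latn", "swh_Latn", "urd_Arab", "urd_Latn", "cmn_Hans", "cmn_Hant",
];
const DEFAULT_SCRIPTS: Record<string, string> = { urd: "Arab", cmn: "Hans" };

// scriptPref is "auto", "latin", or a 4-letter ISO 15924 tag.
function resolveCodeSketch(base: string, scriptPref: string): string | null {
  const candidates = SUPPORTED.filter((c) => c.indexOf(base + "_") === 0);
  if (candidates.length === 0) {
    return null; // base is not supported at all
  }
  let script: string;
  if (scriptPref === "latin") {
    script = "Latn";
  } else if (scriptPref !== "auto") {
    script = scriptPref; // explicit ISO 15924 tag
  } else {
    // Multi-script bases use the default table; single-script bases
    // fall back to their only supported script.
    script = DEFAULT_SCRIPTS[base] || candidates[0].split("_")[1];
  }
  const code = `${base}_${script}`;
  // Assumption: an unsupported pairing resolves to null rather than a guess.
  return SUPPORTED.indexOf(code) !== -1 ? code : null;
}
```

With these stand-in tables, `resolveCodeSketch("urd", "latin")` gives `urd_Latn`, matching the Urdu case in the testing checklist.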

OmniASR doesn't support IPA output (only its now-deprecated MMS predecessor's eSpeak companion did), so the phonetic flag was never doing anything. Removes:

- `codex-editor-extension.asrPhonetic` workspace setting
- `phonetic` field from the asrConfig payload and the inline types in TextCellEditor + CodexCellEditor's batch transcription path
- `phonetic` read/write in the Copilot + MainMenu settings panels (settings panel UI itself is unchanged; the field just stops being wired)

Also nudges the stale default endpoints / provider / model strings toward OmniASR-correct values (the endpoint default is unused in production, since the live endpoint comes from getAsrEndpoint(), but the old default leaked the deprecated WebSocket URL).

Extension-host side. The webview now receives `lang` (resolved OmniASR code), `languageMode` ("project" | "auto"), and `projectLanguageName` in the `asrConfig` message; updateCellAfterTranscription stops defaulting `language` to "unknown". The field now carries the actual OmniASR code the server echoed (or `null` when auto-detect gave us nothing to report).

New workspace settings persisting user choices from the gear menu:

- `codex-editor-extension.asrLanguageMode` ("project" | "auto")
- `codex-editor-extension.asrScriptPref` ("auto" | "latin" | 4-letter ISO 15924 tag)

New message commands the webview calls when the user toggles these:

- `setAsrLanguageMode`
- `setAsrScriptPref`

Both rebroadcast `asrConfig` so the live webview state stays in sync without a reload.
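
The two message commands could be modelled as a small discriminated union, sketched below. Only the command names and the setting values come from this PR. The payload field names (`mode`, `scriptPref`), the `AsrSettings` shape, and the `applyAsrSettingMessage` helper are assumptions for illustration. The real host also persists to workspace settings and rebroadcasts `asrConfig`, which this sketch leaves out.

```typescript
type LanguageMode = "project" | "auto";

// Hypothetical in-memory view of the two persisted settings.
interface AsrSettings {
  languageMode: LanguageMode;
  scriptPref: string; // "auto" | "latin" | 4-letter ISO 15924 tag
}

// Command names are from the PR; payload field names are assumed.
type AsrSettingMessage =
  | { command: "setAsrLanguageMode"; mode: LanguageMode }
  | { command: "setAsrScriptPref"; scriptPref: string };

// Pure update step: returns the next settings object, which the host
// would then persist and rebroadcast to the webview.
function applyAsrSettingMessage(
  settings: AsrSettings,
  msg: AsrSettingMessage
): AsrSettings {
  switch (msg.command) {
    case "setAsrLanguageMode":
      return { languageMode: msg.mode, scriptPref: settings.scriptPref };
    case "setAsrScriptPref":
      return { languageMode: settings.languageMode, scriptPref: msg.scriptPref };
  }
}
```

Keeping the update step pure makes the "rebroadcast after every change" rule easy to test without a VS Code host.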

WhisperTranscriptionClient

- transcribe() now takes { lang?, timeoutMs? } and forwards lang as ?lang= when provided
- parses back result.lang (or result.language, for back-compat with the Frontier proxy's earlier field name) and returns it alongside the text

CodexCellEditor / TextCellEditor

- both transcription paths (per-cell button + batch run) now send the resolved OmniASR code in project mode, omit it in auto-detect mode, and persist whatever the server echoes (or what they sent) via updateCellAfterTranscription
- the badge label is now computed via labelForTranscriptionLanguage: server echo → sent code → project name → "Auto Detect" (only when that's the user's chosen mode); falls through to nothing when in project mode and we have no signal, so it never lies about the language
- deletes the dead toIso3() lookup table; the resolver handles macrolang/ISO-1→3 mapping now

AudioWaveformWithTranscription

- Transcribe button is always visible (no longer hidden once a transcription exists); flips label to "Re-transcribe" and stays disabled while transcribing, mirroring the Re-record button
- new gear-icon popover next to it surfaces two advanced settings:
    Language: Project (default) / Auto-detect
    Script: Best guess (default) / Latin / Custom (ISO 15924 tag)
  Hidden on source editors where transcription policy isn't user-driven. Selections post back to the host (setAsrLanguageMode / setAsrScriptPref) which persists them to workspace settings and rebroadcasts asrConfig so the live state stays in sync.

Types

- asrConfig content gains lang, languageMode, scriptPref, projectLanguageName
- updateCellAfterTranscription.content.language is now `string | null` (was always the hardcoded "unknown")
- new EditorPostMessages: setAsrLanguageMode, setAsrScriptPref
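
The client-side half of the ?lang= contract can be sketched as two small pure functions. Both function names are hypothetical and this is not the actual WhisperTranscriptionClient code. It only shows the two behaviours described above: append ?lang= only when a code is provided, and read the echoed language from `lang`, falling back to the older `language` field.

```typescript
// Hypothetical helper: add ?lang= only when a code is present, so that
// auto-detect requests go out with no lang parameter at all.
function buildTranscribeUrl(endpoint: string, lang?: string | null): string {
  const url = new URL(endpoint);
  if (lang) {
    url.searchParams.set("lang", lang);
  }
  return url.toString();
}

// Hypothetical helper: prefer `lang`, fall back to the earlier `language`
// field name, and report null when the server echoed nothing usable.
function extractServerLang(result: { lang?: unknown; language?: unknown }): string | null {
  const candidates = [result.lang, result.language];
  for (const value of candidates) {
    if (typeof value === "string" && value.length > 0) {
      return value;
    }
  }
  return null;
}
```

Returning null instead of a placeholder string keeps the "never defaults to unknown" rule enforceable at the type level, since the cell update accepts `string | null`.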

The two specs (asr-proxy-endpoint.md, AUTH_SERVER_ASR_IMPLEMENTATION.md) were stuck describing the WebSocket / MMS / phonetic era and were actively misleading. Replaces them with the current contract:

- Multipart HTTP POST, not WebSocket.
- OmniASR upstream, not MMS, and explicitly future-proofed via the model-agnostic Modal app name (codex-asr), with a note on the historical mms-zeroshot-asr URL.
- Language is sent as ?lang={iso639_3}_{Script} (e.g. swh_Latn) and omitted for auto-detect. The proxy passes it through verbatim.
- Reference FastAPI implementation updated to match.

The third doc (asr-auth-proxy-implementation-summary.md) is left intact as a changelog of the original WebSocket-era work, with a header pointing readers to the current spec.

Also tightens the webview CSP: drops the dead `wss://ryderwishart--...` allow entry and adds `https://*.modal.run` so the new HTTPS endpoint works under the policy.

The Modal app source was living only on the deployed instance. This commit makes it discoverable and reviewable in-repo at `docs/asr/codex_asr_modal.py`, with a README describing the deploy workflow and migration plan.

Substantive changes vs. the currently-deployed source:

- `modal.App("mms-zeroshot-asr")` → `modal.App("codex-asr")` so the Modal URL stops hard-coding the (long-since-replaced) model family. Deploying this file creates `genesis-ai-dev--codex-asr-serve.modal.run`. The old `mms-zeroshot-asr` deployment stays warm during the rollout; both serve identical responses.
- Module docstring spells out the naming rationale, migration plan, and the auto-detect LID gap.
- Comments in transcribe_audio() clarify that the absent `lang` field on auto-detect responses is intentional (no built-in LID), not a bug.

Functional contract is unchanged: same `/`, `/health`, `/languages`, `/transcribe` endpoints, same response shape.

Deployment is a follow-up step (requires `modal token new` and a deploy). The Frontier auth proxy + the client's default endpoint must be updated to the new URL once deployed; see the handoff note (separate commit).

The unpinned image was pulling torch wheels built against CUDA 13, which fails to load on Modal's debian_slim (libcudart.so.13 missing). omnilingual-asr 0.2.0 also exposes the omniASR_LLM_1B_v2 model card the legacy mms-zeroshot-asr deployment was using. Add a small _ensure_gang_context() shim around the inference path: fairseq2 0.6 stores the current-gangs stack on a threading.local() that is only initialised on the importing thread, so FastAPI worker threads otherwise blow up with AttributeError on every request.

- package.json: point default asrEndpoint at the new codex-asr Modal app, default asrProvider to "omniasr" (was "mms"), and asrModel to omniASR_LLM_1B_v2. Drop the "ASR WebSocket Endpoint" framing, since the endpoint is HTTPS multipart.
- docs/AUTH_SERVER_ASR_IMPLEMENTATION.md: add an action-items section for the Frontier auth-proxy team (new upstream URL, forward ?lang= verbatim, drop legacy fields) and link the in-repo Modal source.

When the client omits ?lang= we now run facebook/mms-lid-2048 first to detect the ISO 639-3 base, pair it with the default script for that base, and feed the resolved {iso639_3}_{Script} code into OmniASR. The same code is echoed back in the response so the client can render a real "detected language" badge. LID adds ~70-130 ms when warm, ~12 s on the first call after a cold-start. If LID fails for any reason (silence, unrecognised language, base not in OmniASR's set) we fall through to unconditioned transcription and omit `lang` in the response. The script-default table is a Python mirror of sharedUtils/omniAsrDefaultScripts.ts — keep both in sync.
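
The server-side decision flow described above, together with the smoke checks in the testing checklist, can be summarised in a short sketch. The real logic lives in the Python Modal app. This TypeScript version is an illustration only: the names `planTranscription`, `codeForBase`, and `Plan` are hypothetical and the tables are tiny stand-ins. The point is that LID runs only when ?lang= is omitted, and that a failed or unsupported detection falls through to a null language.

```typescript
type Plan =
  | { kind: "reject"; status: 400 }
  | { kind: "transcribe"; lang: string | null; ranLid: boolean };

// Stand-in tables; the real server mirrors the sharedUtils data files.
const SUPPORTED_CODES: string[] = ["eng_Latn", "urd_Arab", "urd_Latn"];
const DEFAULT_SCRIPT_FOR_BASE: Record<string, string> = { urd: "Arab" };

// Pair a detected ISO 639-3 base with its default (or only) script.
function codeForBase(base: string): string | null {
  const matches = SUPPORTED_CODES.filter((c) => c.indexOf(base + "_") === 0);
  if (matches.length === 0) {
    return null; // base not in the supported set
  }
  const preferred = DEFAULT_SCRIPT_FOR_BASE[base];
  const code = preferred ? `${base}_${preferred}` : matches[0];
  return SUPPORTED_CODES.indexOf(code) !== -1 ? code : null;
}

// detectBase stands in for the LID model; it returns null on failure.
function planTranscription(
  requestLang: string | undefined,
  detectBase: () => string | null
): Plan {
  if (requestLang !== undefined) {
    if (SUPPORTED_CODES.indexOf(requestLang) === -1) {
      return { kind: "reject", status: 400 }; // e.g. ?lang=foo_Bar
    }
    return { kind: "transcribe", lang: requestLang, ranLid: false };
  }
  const base = detectBase(); // LID runs only on this path
  const lang = base === null ? null : codeForBase(base);
  return { kind: "transcribe", lang: lang, ranLid: true };
}
```

A null `lang` in the plan corresponds to unconditioned transcription with `lang` omitted from the response.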

Now that the Modal endpoint runs MMS-LID in auto-detect mode, the language badge on transcribed text is informative again (it reflects the server's resolved code, which is the LID result in auto-detect mode and the user-supplied code in project mode).

Relabel the Script dropdown to use plainer language:

- "Best guess (default)" → "Default"
- "Custom (ISO 15924 tag)" → "Other (ISO 15924 tag)"

Two issues stacking made auto-detect render the project language ("Arabic") even when the user spoke clear English:

1. Badge labeller bug: in auto-detect mode we still passed `projectLanguageName` to `labelForTranscriptionLanguage`, so when the server didn't echo a `lang` (e.g. the legacy endpoint without LID) the labeller fell through to its project-name last-ditch fallback. Now we pass null for both `sentLang` and `projectName` in auto mode, so the only label source is the server's echo, and the explicit "Auto Detect" branch handles the missing-echo case.
2. The Frontier auth proxy still points its ASR upstream at the legacy `mms-zeroshot-asr` Modal app, which doesn't run LID. When the proxy hands us that URL we now detect the legacy host and fall back to the configured `asrEndpoint` (defaulted to `codex-asr`, which does run LID). The bypass becomes a no-op once the proxy migrates its upstream; see docs/AUTH_SERVER_ASR_IMPLEMENTATION.md.
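
The corrected fallback order for the badge label can be written out as a small sketch. This is not the actual labelForTranscriptionLanguage, which takes three arguments and maps codes to friendly names. The function name `labelSketch` and the explicit `mode` parameter are assumptions made so the auto-detect rule is visible in one place.

```typescript
// Hypothetical condensed version of the label fallback chain.
// In auto mode the only real signal is the server's echo; the sent code
// and project name are ignored so the badge cannot show a stale guess.
function labelSketch(
  mode: "project" | "auto",
  serverLang: string | null,
  sentCode: string | null,
  projectName: string | null
): string | null {
  if (serverLang) {
    return serverLang; // server echo always wins
  }
  if (mode === "auto") {
    return "Auto Detect"; // missing echo in auto mode
  }
  // Project mode: sent code, then project name, then no badge at all.
  return sentCode || projectName || null;
}
```

With this ordering, the bug case above (auto mode, no echo, project language "Arabic") yields "Auto Detect" rather than the project name.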

Revert the auth-proxy bypass added in 2760a8b — this PR should behave like main and route ASR through the Frontier auth-proxy, which still forwards to the legacy `mms-zeroshot-asr` Modal app. Changing that routing is a separate decision that needs sign-off. Since the legacy upstream doesn't run LID and never echoes `lang`, the transcription language badge would either silently say "Auto Detect" or fall back to the project language — neither of which is honest. Hide the badge with a TODO comment pointing at the auth-proxy migration; the `transcriptionLanguageLabel` prop and all the plumbing through it stay wired so re-enabling is a one-line change once the proxy upstream moves to `codex-asr`. The `codex-asr` Modal app (with MMS-LID baked in) stays deployed and ready — see docs/asr/codex_asr_modal.py and docs/AUTH_SERVER_ASR_IMPLEMENTATION.md.

Luke-Bilhorn added 12 commits June 4, 2026 15:19

TimRl self-requested a review June 5, 2026 18:24


Luke-Bilhorn wants to merge 12 commits into main from ASR-tweak
