Asr tweak #1029
Open
Luke-Bilhorn wants to merge 12 commits into
Adds four new files in sharedUtils/, all importable from both the
extension host and the webview bundles:
- omniAsrSupportedLangs.ts: 1672 supported {iso639_3}_{Script} codes,
snapshotted from the live GET /languages endpoint.
- omniAsrDefaultScripts.ts: per-base "best guess" script for the 19
multi-script bases (urd→Arab, cmn→Hans, uig→Arab, yue→Hant, ...).
All others have exactly one supported script so no entry needed.
- omniAsrFriendlyNames.ts: 1650 base→Ref_Name map, for rendering the
transcription badge.
- asrLanguageUtils.ts: pure helpers — resolveOmniAsrCode(meta,
scriptPref) and labelForTranscriptionLanguage(serverLang, sentCode,
projectLanguageName).
Nothing is wired up yet; that comes in subsequent commits. Each file
has a header explaining how to regenerate after a model/endpoint
change.
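A minimal sketch of how the resolver described above might pick a code, assuming tiny stand-in tables instead of the generated data files. The `ProjectLanguageMeta` shape, the table contents, and the fallback order when the preferred script is unsupported are assumptions for illustration; the real `resolveOmniAsrCode` body is not shown in this PR text.

```typescript
// Hypothetical, trimmed-down stand-ins for omniAsrSupportedLangs.ts and
// omniAsrDefaultScripts.ts (the real files hold 1672 codes and 19 bases).
const SUPPORTED_LANGS: readonly string[] = [
    "swh_Latn", "eng_Latn", "urd_Arab", "urd_Latn", "cmn_Hans", "cmn_Hant",
];
const DEFAULT_SCRIPTS: Readonly<Record<string, string>> = { urd: "Arab", cmn: "Hans" };

// Assumed shape: the actual project metadata type is not shown in the PR.
interface ProjectLanguageMeta {
    iso639_3?: string;
}

// "auto", "latin", or a 4-letter ISO 15924 tag such as "Hant".
type ScriptPref = string;

function resolveOmniAsrCode(meta: ProjectLanguageMeta, scriptPref: ScriptPref): string | undefined {
    const base = meta.iso639_3?.toLowerCase();
    if (!base) return undefined;

    const prefix = `${base}_`;
    const candidates = SUPPORTED_LANGS.filter((code) => code.indexOf(prefix) === 0);
    if (candidates.length === 0) return undefined;

    // An explicit preference wins when the server supports that pairing.
    if (scriptPref !== "auto") {
        const script = scriptPref === "latin" ? "Latn" : scriptPref;
        const wanted = `${base}_${script}`;
        if (candidates.indexOf(wanted) !== -1) return wanted;
        // Assumption: an unsupported preference falls back to the default below.
    }

    // Multi-script bases use the "best guess" table.
    const defaultScript: string | undefined = DEFAULT_SCRIPTS[base];
    if (defaultScript !== undefined) {
        const guess = `${base}_${defaultScript}`;
        if (candidates.indexOf(guess) !== -1) return guess;
    }

    // Single-script bases need no table entry.
    return candidates.length === 1 ? candidates[0] : undefined;
}
```

Under these assumptions, `urd` with the `"auto"` preference resolves through the default table, while `swh` resolves without one because only a single script is supported.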
OmniASR doesn't support IPA output (only its now-deprecated MMS
predecessor's ESPeak companion did), so the phonetic flag was never
doing anything. Removes:
- `codex-editor-extension.asrPhonetic` workspace setting
- `phonetic` field from the asrConfig payload and the inline types
in TextCellEditor + CodexCellEditor's batch transcription path
- `phonetic` read/write in the Copilot + MainMenu settings panels
(settings panel UI itself is unchanged — the field just stops
being wired)
Also nudges the stale default endpoints / provider / model strings
toward OmniASR-correct values (the endpoint default is unused in
production — the live endpoint comes from getAsrEndpoint() — but the
old default leaked the deprecated WebSocket URL).
Extension-host side. The webview now receives `lang` (resolved OmniASR
code), `languageMode` ("project" | "auto"), and `projectLanguageName`
in the `asrConfig` message; updateCellAfterTranscription stops
defaulting `language` to "unknown" — the field now carries the actual
OmniASR code the server echoed (or `null` when auto-detect gave us
nothing to report).
New workspace settings persisting user choices from the gear menu:
- `codex-editor-extension.asrLanguageMode` ("project" | "auto")
- `codex-editor-extension.asrScriptPref` ("auto" | "latin" | 4-letter
ISO 15924 tag)
New message commands the webview calls when the user toggles these:
- `setAsrLanguageMode`
- `setAsrScriptPref`
Both rebroadcast `asrConfig` so the live webview state stays in sync
without a reload.
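To make the value space of the two settings concrete, here is a hedged sketch of how raw setting values could be normalized before being persisted or posted. The helper names (`normalizeScriptPref`, `normalizeLanguageMode`, `scriptPrefMessage`), the fallback to `"auto"` / `"project"` on invalid input, and the `content` payload shapes are assumptions; only the setting values and command names come from the text above.

```typescript
type AsrLanguageMode = "project" | "auto";

// "auto", "latin", or a 4-letter ISO 15924 tag such as "Arab".
type AsrScriptPref = string;

// Assumed message shapes; only the command names appear in the PR text.
type AsrSettingsMessage =
    | { command: "setAsrLanguageMode"; content: { mode: AsrLanguageMode } }
    | { command: "setAsrScriptPref"; content: { scriptPref: AsrScriptPref } };

function normalizeLanguageMode(raw: unknown): AsrLanguageMode {
    // Anything other than an explicit "auto" is treated as the default.
    return raw === "auto" ? "auto" : "project";
}

function normalizeScriptPref(raw: unknown): AsrScriptPref {
    if (typeof raw !== "string") return "auto";
    const value = raw.trim();
    const lower = value.toLowerCase();
    if (lower === "auto" || lower === "latin") return lower;
    // ISO 15924 tags are four letters, conventionally title-cased ("Arab", "Hans").
    if (/^[A-Za-z]{4}$/.test(value)) {
        return value.charAt(0).toUpperCase() + value.slice(1).toLowerCase();
    }
    return "auto";
}

function scriptPrefMessage(raw: unknown): AsrSettingsMessage {
    return { command: "setAsrScriptPref", content: { scriptPref: normalizeScriptPref(raw) } };
}
```

Normalizing on both sides of the message boundary would keep a hand-edited workspace setting from reaching the resolver in an unexpected form.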
WhisperTranscriptionClient
- transcribe() now takes { lang?, timeoutMs? } and forwards lang as
?lang= when provided
- parses back result.lang (or result.language, for back-compat with
the Frontier proxy's earlier field name) and returns it alongside
the text
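The request and response handling described above can be sketched as two small pure helpers. This is an illustration under assumptions, not the client's actual code: the helper names, the `text` response field, and the example endpoint are hypothetical, and the real `transcribe()` also handles the multipart upload and timeout, which are omitted here.

```typescript
interface TranscribeOptions {
    lang?: string;
    timeoutMs?: number;
}

interface TranscribeResult {
    text: string;
    lang: string | null;
}

// Appends ?lang= only when a code is provided; auto-detect sends no parameter.
function buildTranscribeUrl(endpoint: string, opts: TranscribeOptions = {}): string {
    if (!opts.lang) return endpoint;
    const separator = endpoint.indexOf("?") === -1 ? "?" : "&";
    return `${endpoint}${separator}lang=${encodeURIComponent(opts.lang)}`;
}

// Prefers `lang`, then falls back to the earlier `language` field name.
function parseTranscribeResponse(raw: unknown): TranscribeResult {
    const body = (raw ?? {}) as { text?: unknown; lang?: unknown; language?: unknown };
    let echoed: string | null = null;
    if (typeof body.lang === "string" && body.lang !== "") {
        echoed = body.lang;
    } else if (typeof body.language === "string" && body.language !== "") {
        echoed = body.language;
    }
    return { text: typeof body.text === "string" ? body.text : "", lang: echoed };
}
```

For example, `buildTranscribeUrl("https://example.invalid/transcribe", { lang: "swh_Latn" })` yields a URL ending in `?lang=swh_Latn`, while omitting `lang` leaves the endpoint untouched.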
CodexCellEditor / TextCellEditor
- both transcription paths (per-cell button + batch run) now send
the resolved OmniASR code in project mode, omit it in auto-detect
mode, and persist whatever the server echoes (or what they sent)
via updateCellAfterTranscription
- the badge label is now computed via labelForTranscriptionLanguage:
server echo → sent code → project name → "Auto Detect" (only when
that's the user's chosen mode); falls through to nothing when in
project mode and we have no signal — never lies about the language
- deletes the dead toIso3() lookup table; the resolver handles
macrolang/ISO-1→3 mapping now
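The badge fallback chain above can be sketched as follows. The function is deliberately named `pickBadgeLabel` rather than `labelForTranscriptionLanguage` because it takes an extra `mode` argument, which is an assumption about how the real helper learns the user's chosen mode; the two-entry name table is a hypothetical stand-in for `omniAsrFriendlyNames.ts`.

```typescript
type AsrLanguageMode = "project" | "auto";

// Hypothetical two-entry stand-in for the base -> Ref_Name map.
const FRIENDLY_NAMES: Readonly<Record<string, string>> = { swh: "Swahili", eng: "English" };

// Maps "swh_Latn" to "Swahili"; unknown bases fall back to the raw code.
function friendlyName(code: string): string {
    const base = code.split("_")[0] ?? code;
    const name: string | undefined = FRIENDLY_NAMES[base];
    return name ?? code;
}

function pickBadgeLabel(
    serverLang: string | null,
    sentCode: string | null,
    projectLanguageName: string | null,
    mode: AsrLanguageMode, // assumed extra parameter, see lead-in
): string | null {
    if (serverLang) return friendlyName(serverLang); // 1. server echo
    if (sentCode) return friendlyName(sentCode); // 2. code we sent
    if (projectLanguageName) return projectLanguageName; // 3. project name
    // 4. "Auto Detect" only when that is the chosen mode; otherwise no badge.
    return mode === "auto" ? "Auto Detect" : null;
}
```

Returning `null` in project mode with no signal is what keeps the badge from claiming a language it cannot back up.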
AudioWaveformWithTranscription
- Transcribe button is always visible (no longer hidden once a
transcription exists); flips label to "Re-transcribe" and stays
disabled while transcribing — mirrors the Re-record button
- new gear-icon popover next to it surfaces two advanced settings:
Language: Project (default) / Auto-detect
Script: Best guess (default) / Latin / Custom (ISO 15924 tag)
Hidden on source editors where transcription policy isn't user-driven.
Selections post back to the host (setAsrLanguageMode /
setAsrScriptPref) which persists them to workspace settings and
rebroadcasts asrConfig so the live state stays in sync.
Types
- asrConfig content gains lang, languageMode, scriptPref,
projectLanguageName
- updateCellAfterTranscription.content.language is now `string | null`
(was always the hardcoded "unknown")
- new EditorPostMessages: setAsrLanguageMode, setAsrScriptPref
The two specs (asr-proxy-endpoint.md, AUTH_SERVER_ASR_IMPLEMENTATION.md)
were stuck describing the WebSocket / MMS / phonetic era and were
actively misleading. Replaces them with the current contract:
- Multipart HTTP POST, not WebSocket.
- OmniASR upstream, not MMS — and explicitly future-proofed via the
model-agnostic Modal app name (codex-asr), with a note on the
historical mms-zeroshot-asr URL.
- Language is sent as ?lang={iso639_3}_{Script} (e.g. swh_Latn) and
omitted for auto-detect. The proxy passes it through verbatim.
- Reference FastAPI implementation updated to match.
The third doc (asr-auth-proxy-implementation-summary.md) is left intact
as a changelog of the original WebSocket-era work, with a header
pointing readers to the current spec.
Also tightens the webview CSP — drops the dead
`wss://ryderwishart--...` allow entry and adds `https://*.modal.run` so
the new HTTPS endpoint works under the policy.
The Modal app source was living only on the deployed instance — this
commit makes it discoverable and reviewable in-repo at
`docs/asr/codex_asr_modal.py`, with a README describing the deploy
workflow and migration plan.
Substantive changes vs. the currently-deployed source:
- `modal.App("mms-zeroshot-asr")` → `modal.App("codex-asr")` so the
Modal URL stops hard-coding the (long-since-replaced) model
family. Deploying this file creates
`genesis-ai-dev--codex-asr-serve.modal.run`. The old `mms-zeroshot-asr` deployment stays
warm during the rollout — both serve identical responses.
- Module docstring spells out the naming rationale, migration
plan, and the auto-detect LID gap.
- Comments in transcribe_audio() clarify that the absent `lang`
field on auto-detect responses is intentional (no built-in LID),
not a bug.
Functional contract is unchanged — same `/`, `/health`, `/languages`,
`/transcribe` endpoints, same response shape.
Deployment is a follow-up step (requires `modal token new` and a
deploy). The Frontier auth proxy + the client's default endpoint must
be updated to the new URL once deployed — see the handoff note
(separate commit).
The unpinned image was pulling torch wheels built against CUDA 13, which fails to load on Modal's debian_slim (libcudart.so.13 missing). omnilingual-asr 0.2.0 also exposes the omniASR_LLM_1B_v2 model card the legacy mms-zeroshot-asr deployment was using. Add a small _ensure_gang_context() shim around the inference path: fairseq2 0.6 stores the current-gangs stack on a threading.local() that is only initialised on the importing thread, so FastAPI worker threads otherwise blow up with AttributeError on every request.
- package.json: point default asrEndpoint at the new codex-asr Modal app, default asrProvider to "omniasr" (was "mms"), and asrModel to omniASR_LLM_1B_v2. Drop the "ASR WebSocket Endpoint" framing; the endpoint is HTTPS multipart.
- docs/AUTH_SERVER_ASR_IMPLEMENTATION.md: add an action-items section for the Frontier auth-proxy team (new upstream URL, forward ?lang= verbatim, drop legacy fields) and link the in-repo Modal source.
When the client omits ?lang= we now run facebook/mms-lid-2048 first
to detect the ISO 639-3 base, pair it with the default script for
that base, and feed the resolved {iso639_3}_{Script} code into
OmniASR. The same code is echoed back in the response so the client
can render a real "detected language" badge.
LID adds ~70-130 ms when warm, ~12 s on the first call after a
cold-start. If LID fails for any reason (silence, unrecognised
language, base not in OmniASR's set) we fall through to
unconditioned transcription and omit `lang` in the response.
The script-default table is a Python mirror of
sharedUtils/omniAsrDefaultScripts.ts — keep both in sync.
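The Modal app itself is Python, but the decision logic described above is small enough to illustrate in TypeScript alongside the table it mirrors. This is a sketch under assumptions: the function names, the tiny tables, and the response shape (`text` plus an optional `lang`) are hypothetical, and the LID model call is represented only by its result.

```typescript
// Hypothetical, trimmed-down stand-ins for the supported-code list and the
// script-default table.
const SUPPORTED_CODES: readonly string[] = ["eng_Latn", "swh_Latn", "urd_Arab", "urd_Latn"];
const DEFAULT_SCRIPT_BY_BASE: Readonly<Record<string, string>> = { urd: "Arab" };

interface TranscribeResponse {
    text: string;
    lang?: string; // omitted when LID gave nothing usable
}

// lidBase is the ISO 639-3 base from LID, or null when detection failed.
function codeForDetectedBase(lidBase: string | null): string | null {
    if (!lidBase) return null;

    const prefix = `${lidBase}_`;
    const candidates = SUPPORTED_CODES.filter((code) => code.indexOf(prefix) === 0);
    if (candidates.length === 0) return null; // base not in the supported set

    const script: string | undefined = DEFAULT_SCRIPT_BY_BASE[lidBase];
    if (script !== undefined) {
        const code = `${lidBase}_${script}`;
        return candidates.indexOf(code) !== -1 ? code : null;
    }
    // Single-script bases have no table entry.
    return candidates.length === 1 ? (candidates[0] ?? null) : null;
}

// A null code means unconditioned transcription with no `lang` in the response.
function buildResponse(text: string, resolved: string | null): TranscribeResponse {
    return resolved ? { text, lang: resolved } : { text };
}
```

Omitting the `lang` key entirely, rather than sending `null`, matches the described contract in which a missing field signals that nothing was detected.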
Now that the Modal endpoint runs MMS-LID in auto-detect mode, the language badge on transcribed text is informative again (it reflects the server's resolved code, which is the LID result in auto-detect mode and the user-supplied code in project mode).
Relabel the Script dropdown to use plainer language:
- "Best guess (default)" → "Default"
- "Custom (ISO 15924 tag)" → "Other (ISO 15924 tag)"
Two issues stacking made auto-detect render the project language
("Arabic") even when the user spoke clear English:
1. Badge labeller bug: in auto-detect mode we still passed
`projectLanguageName` to `labelForTranscriptionLanguage`, so when
the server didn't echo a `lang` (e.g. the legacy endpoint without
LID) the labeller fell through to its project-name last-ditch
fallback. Now we pass null for both `sentLang` and `projectName`
in auto mode, so the only label source is the server's echo, and
the explicit "Auto Detect" branch handles the missing-echo case.
2. The Frontier auth proxy still points its ASR upstream at the
legacy `mms-zeroshot-asr` Modal app, which doesn't run LID. When
the proxy hands us that URL we now detect the legacy host and
fall back to the configured `asrEndpoint` (defaulted to
`codex-asr`, which does run LID). The bypass becomes a no-op once
the proxy migrates its upstream — see
docs/AUTH_SERVER_ASR_IMPLEMENTATION.md.
Revert the auth-proxy bypass added in 2760a8b. This PR should behave like main and route ASR through the Frontier auth-proxy, which still forwards to the legacy `mms-zeroshot-asr` Modal app. Changing that routing is a separate decision that needs sign-off.
Since the legacy upstream doesn't run LID and never echoes `lang`, the transcription language badge would either silently say "Auto Detect" or fall back to the project language, neither of which is honest. Hide the badge with a TODO comment pointing at the auth-proxy migration; the `transcriptionLanguageLabel` prop and all the plumbing through it stay wired so re-enabling is a one-line change once the proxy upstream moves to `codex-asr`.
The `codex-asr` Modal app (with MMS-LID baked in) stays deployed and ready; see docs/asr/codex_asr_modal.py and docs/AUTH_SERVER_ASR_IMPLEMENTATION.md.
PR Title
979-when-transcribing-audio-there-is-an-unknown-tag
Summary
Closes #979
Originally a fix for the "unknown" language tag on transcribed cells, but the work uncovered deeper issues. This PR wires OmniASR's `?lang=` end-to-end instead of always attempting to transcribe in English, gives the user a Language + Script picker for transcription, and adds a Re-transcribe button.
The `mms-zeroshot-asr` Modal app is left running, though we do have a new endpoint with a more descriptive name and LID running if we'd like it. The only problems with switching would be dealing with users who haven't updated yet, and auto-detect users paying an extra second of latency for LID to tell them what language is being transcribed. That label isn't even strictly accurate, since the MMS LID result could differ from Omni. So I didn't think switching was worth it, and I did not end up using the new endpoint, since the only upgrade was the name.
Changes
- `sharedUtils/asrLanguageUtils.ts` + supporting data files resolve project metadata to OmniASR `{iso639_3}_{Script}` codes.
- `codex-asr` Modal app (`docs/asr/codex_asr_modal.py`): renamed for model-agnosticism, pinned `omnilingual-asr==0.2.0` + `torch==2.8.0` cu128, added a fairseq2 gang-context shim, added MMS-LID 2048 that runs only when `?lang=` is omitted and pairs the detected base with a default script.
- Removed `asrPhonetic`; added `asrLanguageMode` and `asrScriptPref`; defaults updated for `asrEndpoint` / `asrProvider` / `asrModel`. Frontier auth-proxy URL still wins when authenticated.
- `asrConfig` gains `lang` / `languageMode` / `scriptPref` / `projectLanguageName`; `updateCellAfterTranscription.language` is now `string | null`; new `setAsrLanguageMode` / `setAsrScriptPref` messages.
- Rewrote `asr-proxy-endpoint.md` and `AUTH_SERVER_ASR_IMPLEMENTATION.md` for the HTTP POST contract; added an Action Items section for the auth-proxy team; old summary marked historical.
Follow-ups if we want to... my guess is we don't
- `mms-zeroshot-asr` → `codex-asr`
- `mms-zeroshot-asr` after a release cycle.
Testing Checklist
Editor — Transcribe / Re-transcribe
Editor — Gear popover
- `?lang=<resolved code>`
- `?lang=`.
- `urd_Latn`.
`codex-asr` smoke
- `POST /transcribe` (no `lang`) returns `lang` + `lid_s`.
- `POST /transcribe?lang=eng_Latn` echoes `eng_Latn`, no `lid_s`.
- `?lang=foo_Bar` → 400 with supported-langs hint.
Regression
Screenshots