Asr tweak #1029

Open
Luke-Bilhorn wants to merge 12 commits into main from ASR-tweak

Conversation

Luke-Bilhorn (Contributor) commented Jun 5, 2026

PR Title

979-when-transcribing-audio-there-is-an-unknown-tag

Summary

Closes #979

Originally a fix for the "unknown" language tag on transcribed cells, but the work uncovered deeper issues. This PR wires OmniASR's ?lang= parameter end-to-end instead of always attempting to transcribe in English, gives the user a Language + Script picker for transcription, and adds a Re-transcribe button.

The mms-zeroshot-asr Modal app is left running. A new endpoint with a more descriptive name and LID is available if we'd like it, but switching has two costs: we would have to deal with users who haven't updated yet, and users who use auto-detect would pay an extra second of latency for LID to tell them what language is being transcribed. That report isn't even strictly accurate, since the MMS LID result could differ from what Omni transcribes. I didn't think switching was worth it, so this PR does not use the new endpoint; the only upgrade would have been the name.

Changes

  • Language plumbing: new sharedUtils/asrLanguageUtils.ts + supporting data files resolve project metadata to OmniASR {iso639_3}_{Script} codes.
  • UI: dead phonetic/IPA branches removed. Transcribe button always visible, greys out while running, flips to Re-transcribe afterwards. A gear is glued to it as a split-button with a popover for Language (Project / Auto-detect) and Script (Default / Latin / Other ISO 15924 tag).
  • Language badge hidden for now, since we are not doing anything that makes it convey information the user does not already know.
  • codex-asr Modal app (docs/asr/codex_asr_modal.py): renamed for model-agnosticism, pinned omnilingual-asr==0.2.0 + torch==2.8.0 cu128, added a fairseq2 gang-context shim, added MMS-LID 2048 that runs only when ?lang= is omitted and pairs the detected base with a default script.
  • Settings: removed asrPhonetic; added asrLanguageMode and asrScriptPref; defaults updated for asrEndpoint / asrProvider / asrModel. Frontier auth-proxy URL still wins when authenticated.
  • Types: asrConfig gains lang / languageMode / scriptPref / projectLanguageName; updateCellAfterTranscription.language is now string | null; new setAsrLanguageMode / setAsrScriptPref messages.
  • Docs: rewrote asr-proxy-endpoint.md and AUTH_SERVER_ASR_IMPLEMENTATION.md for the HTTP POST contract; added an Action Items section for the auth-proxy team; old summary marked historical.

Follow-ups if we want to... my guess is we don't

  • Auth-proxy upstream migration mms-zeroshot-asr → codex-asr.
  • Re-enable language badge once the proxy migration lands.
  • Decommission mms-zeroshot-asr after a release cycle.

Testing Checklist

Editor — Transcribe / Re-transcribe

  • Transcribe greys out while running, then label changes to Re-transcribe.
  • Re-transcribe re-runs against the current audio and replaces the saved result.
  • Insert Transcription still works.

Editor — Gear popover

  • Project mode sends ?lang=<resolved code>.
  • Auto-detect sends no ?lang=.
  • Script → Latin on a non-Latin language (e.g. Urdu) sends urd_Latn.
  • Script → Other reveals the input immediately; Apply enables only on a valid 4-letter tag (ISO 15924); Enter also commits.
  • Language badge is not rendered on transcribed cells (intentional).

codex-asr smoke

  • POST /transcribe (no lang) returns lang + lid_s.
  • POST /transcribe?lang=eng_Latn echoes eng_Latn, no lid_s.
  • ?lang=foo_Bar → 400 with supported-langs hint.

Regression

  • Previously-transcribed cells still display their text on load.

Screenshots

image

Adds four new files in sharedUtils/, all importable from both the
extension host and the webview bundles:

  - omniAsrSupportedLangs.ts: 1672 supported {iso639_3}_{Script} codes,
    snapshotted from the live GET /languages endpoint.
  - omniAsrDefaultScripts.ts: per-base "best guess" script for the 19
    multi-script bases (urd→Arab, cmn→Hans, uig→Arab, yue→Hant, ...).
    All others have exactly one supported script so no entry needed.
  - omniAsrFriendlyNames.ts: 1650 base→Ref_Name map, for rendering the
    transcription badge.
  - asrLanguageUtils.ts: pure helpers — resolveOmniAsrCode(meta,
    scriptPref) and labelForTranscriptionLanguage(serverLang, sentCode,
    projectLanguageName).
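
A minimal sketch of how a resolver like resolveOmniAsrCode might pair a base language with a script. The metadata shape, the tiny lookup tables, and the fallback used when a requested script is unsupported are all assumptions for illustration; the real data files hold the full snapshots and the real helper may behave differently.

```typescript
// Hypothetical metadata shape; the real project metadata may differ.
interface ProjectLanguageMeta {
    iso639_3?: string; // assumed already normalised to an ISO 639-3 base, e.g. "urd"
}

// Illustrative excerpts only, not the real snapshots in sharedUtils/.
const SUPPORTED_CODES: ReadonlySet<string> = new Set<string>([
    "eng_Latn",
    "swh_Latn",
    "urd_Arab",
    "urd_Latn",
]);
const DEFAULT_SCRIPTS: ReadonlyMap<string, string> = new Map<string, string>([
    ["urd", "Arab"],
]);

function resolveOmniAsrCode(
    meta: ProjectLanguageMeta,
    scriptPref: string // "auto" | "latin" | 4-letter ISO 15924 tag
): string | undefined {
    const base = meta.iso639_3?.toLowerCase();
    if (!base) return undefined;

    // Every supported code for this base, e.g. ["urd_Arab", "urd_Latn"].
    const candidates = Array.from(SUPPORTED_CODES).filter(
        (code) => code.split("_")[0] === base
    );
    if (candidates.length === 0) return undefined;

    // An explicit preference wins when that pairing is supported.
    if (scriptPref !== "auto") {
        const script = scriptPref === "latin" ? "Latn" : scriptPref;
        const explicit = `${base}_${script}`;
        if (SUPPORTED_CODES.has(explicit)) return explicit;
        // Assumption: an unsupported preference falls back to the default.
    }

    // Multi-script bases use the default table; single-script bases have
    // exactly one candidate.
    const defaultScript = DEFAULT_SCRIPTS.get(base);
    const viaDefault = defaultScript ? `${base}_${defaultScript}` : undefined;
    return viaDefault && SUPPORTED_CODES.has(viaDefault) ? viaDefault : candidates[0];
}
```

With these sample tables, Urdu with scriptPref "auto" resolves to urd_Arab and with "latin" to urd_Latn, while a base missing from the supported set yields undefined so the caller can omit ?lang=.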

Nothing is wired up yet; that comes in subsequent commits. Each file
has a header explaining how to regenerate after a model/endpoint
change.

OmniASR doesn't support IPA output (only its now-deprecated MMS
predecessor's ESPeak companion did), so the phonetic flag was never
doing anything. Removes:

  - `codex-editor-extension.asrPhonetic` workspace setting
  - `phonetic` field from the asrConfig payload and the inline types
    in TextCellEditor + CodexCellEditor's batch transcription path
  - `phonetic` read/write in the Copilot + MainMenu settings panels
    (settings panel UI itself is unchanged — the field just stops
    being wired)

Also nudges the stale default endpoints / provider / model strings
toward OmniASR-correct values (the endpoint default is unused in
production — the live endpoint comes from getAsrEndpoint() — but the
old default leaked the deprecated WebSocket URL).

Extension-host side. The webview now receives `lang` (resolved OmniASR
code), `languageMode` ("project" | "auto"), and `projectLanguageName`
in the `asrConfig` message; updateCellAfterTranscription stops
defaulting `language` to "unknown" — the field now carries the actual
OmniASR code the server echoed (or `null` when auto-detect gave us
nothing to report).

New workspace settings persisting user choices from the gear menu:
  - `codex-editor-extension.asrLanguageMode` ("project" | "auto")
  - `codex-editor-extension.asrScriptPref` ("auto" | "latin" | 4-letter
    ISO 15924 tag)

New message commands the webview calls when the user toggles these:
  - `setAsrLanguageMode`
  - `setAsrScriptPref`
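
One way the host might handle these two commands and produce the `asrConfig` payload to send back, sketched as pure functions so it runs without the VS Code API. The payload field names (`mode`, `scriptPref`) and the `type: "asrConfig"` envelope are assumptions, not the extension's actual message shapes.

```typescript
type AsrLanguageMode = "project" | "auto";

interface AsrSettings {
    languageMode: AsrLanguageMode;
    scriptPref: string; // "auto" | "latin" | 4-letter ISO 15924 tag
}

// Hypothetical payloads for the two webview-to-host commands.
type AsrSettingsMessage =
    | { command: "setAsrLanguageMode"; mode: AsrLanguageMode }
    | { command: "setAsrScriptPref"; scriptPref: string };

// Hypothetical envelope for the config the host sends to the webview.
interface AsrConfigMessage {
    type: "asrConfig";
    content: AsrSettings & {
        lang: string | undefined;
        projectLanguageName: string | undefined;
    };
}

// Apply one command to the persisted settings without mutating the input.
function applyAsrSettingsMessage(
    current: AsrSettings,
    message: AsrSettingsMessage
): AsrSettings {
    switch (message.command) {
        case "setAsrLanguageMode":
            return { ...current, languageMode: message.mode };
        case "setAsrScriptPref":
            return { ...current, scriptPref: message.scriptPref };
    }
}

// Build the payload to send back once the settings have been saved.
function buildAsrConfig(
    settings: AsrSettings,
    lang: string | undefined,
    projectLanguageName: string | undefined
): AsrConfigMessage {
    return { type: "asrConfig", content: { ...settings, lang, projectLanguageName } };
}
```

Keeping the update step pure makes it easy to persist the returned settings and then post the rebuilt config in one place.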

Both rebroadcast `asrConfig` so the live webview state stays in sync
without a reload.

WhisperTranscriptionClient
  - transcribe() now takes { lang?, timeoutMs? } and forwards lang as
    ?lang= when provided
  - parses back result.lang (or result.language, for back-compat with
    the Frontier proxy's earlier field name) and returns it alongside
    the text
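
The two client-side steps above can be sketched as small helpers: one appends ?lang= only when a code is provided, the other reads `lang` and falls back to `language`. The helper names and the `text` response field are hypothetical; only the `lang` / `language` field names and the ?lang= parameter come from the description above.

```typescript
interface TranscriptionResult {
    text: string;
    language: string | null; // null when the server reported no language
}

// Append ?lang= only when a code is provided; auto-detect omits it entirely.
function buildTranscribeUrl(endpoint: string, lang?: string): string {
    if (!lang) return endpoint;
    const separator = endpoint.indexOf("?") === -1 ? "?" : "&";
    return `${endpoint}${separator}lang=${encodeURIComponent(lang)}`;
}

// Prefer `lang`, then the older `language` field, else report null.
function parseTranscriptionResponse(body: unknown): TranscriptionResult {
    const record: Record<string, unknown> =
        typeof body === "object" && body !== null
            ? (body as Record<string, unknown>)
            : {};
    const rawText = record["text"]; // assumed field name for the transcript
    const rawLang = record["lang"];
    const rawLanguage = record["language"];
    const language =
        typeof rawLang === "string"
            ? rawLang
            : typeof rawLanguage === "string"
              ? rawLanguage
              : null;
    return { text: typeof rawText === "string" ? rawText : "", language };
}
```

Returning null instead of a placeholder string keeps the "no language reported" case distinguishable downstream.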

CodexCellEditor / TextCellEditor
  - both transcription paths (per-cell button + batch run) now send
    the resolved OmniASR code in project mode, omit it in auto-detect
    mode, and persist whatever the server echoes (or what they sent)
    via updateCellAfterTranscription
  - the badge label is now computed via labelForTranscriptionLanguage:
    server echo → sent code → project name → "Auto Detect" (only when
    that's the user's chosen mode); falls through to nothing when in
    project mode and we have no signal — never lies about the language
  - deletes the dead toIso3() lookup table; the resolver handles
    macrolang/ISO-1→3 mapping now
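
The fallback chain for the badge label can be sketched as follows. The fourth `languageMode` parameter and the two-entry name table are assumptions added so the example is self-contained; the real helper is described as taking (serverLang, sentCode, projectLanguageName).

```typescript
// Illustrative excerpt only, not the real base-to-name map.
const FRIENDLY_NAMES: ReadonlyMap<string, string> = new Map<string, string>([
    ["eng", "English"],
    ["urd", "Urdu"],
]);

// "urd_Arab" -> "Urdu"; unknown bases fall back to the raw code.
function friendlyName(code: string): string {
    const underscore = code.indexOf("_");
    const base = underscore === -1 ? code : code.slice(0, underscore);
    return FRIENDLY_NAMES.get(base) ?? code;
}

function labelForTranscriptionLanguage(
    serverLang: string | null,
    sentCode: string | null,
    projectLanguageName: string | null,
    languageMode: "project" | "auto" // hypothetical extra parameter
): string | null {
    if (serverLang) return friendlyName(serverLang); // 1. server echo
    if (sentCode) return friendlyName(sentCode); // 2. code we sent
    if (projectLanguageName) return projectLanguageName; // 3. project name
    // 4. "Auto Detect" only when the user chose that mode; otherwise no label.
    return languageMode === "auto" ? "Auto Detect" : null;
}
```

In project mode with no signal at all the function returns null, so the caller renders no badge rather than guessing a language.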

AudioWaveformWithTranscription
  - Transcribe button is always visible (no longer hidden once a
    transcription exists); flips label to "Re-transcribe" and stays
    disabled while transcribing — mirrors the Re-record button
  - new gear-icon popover next to it surfaces two advanced settings:
      Language: Project (default) / Auto-detect
      Script:   Best guess (default) / Latin / Custom (ISO 15924 tag)
    Hidden on source editors where transcription policy isn't user-
    driven. Selections post back to the host (setAsrLanguageMode /
    setAsrScriptPref) which persists them to workspace settings and
    rebroadcasts asrConfig so the live state stays in sync.
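
For the Custom script input, a shape check along these lines could gate the commit action. The function name is hypothetical, and it only verifies the four-letter form and normalises the casing; it does not check the tag against the ISO 15924 registry or against OmniASR's supported set.

```typescript
// Returns the tag in ISO 15924 casing ("Arab"), or null when the input is
// not exactly four ASCII letters after trimming whitespace.
function parseCustomScriptTag(input: string): string | null {
    const trimmed = input.trim();
    if (!/^[A-Za-z]{4}$/.test(trimmed)) return null;
    return trimmed.charAt(0).toUpperCase() + trimmed.slice(1).toLowerCase();
}
```

A null result would keep the commit action disabled; a non-null result is the value to post via setAsrScriptPref.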

Types
  - asrConfig content gains lang, languageMode, scriptPref,
    projectLanguageName
  - updateCellAfterTranscription.content.language is now `string | null`
    (was always the hardcoded "unknown")
  - new EditorPostMessages: setAsrLanguageMode, setAsrScriptPref

The two specs (asr-proxy-endpoint.md, AUTH_SERVER_ASR_IMPLEMENTATION.md)
were stuck describing the WebSocket / MMS / phonetic era and were
actively misleading. Replaces them with the current contract:

  - Multipart HTTP POST, not WebSocket.
  - OmniASR upstream, not MMS — and explicitly future-proofed via the
    model-agnostic Modal app name (codex-asr), with a note on the
    historical mms-zeroshot-asr URL.
  - Language is sent as ?lang={iso639_3}_{Script} (e.g. swh_Latn) and
    omitted for auto-detect. The proxy passes it through verbatim.
  - Reference FastAPI implementation updated to match.

The third doc (asr-auth-proxy-implementation-summary.md) is left intact
as a changelog of the original WebSocket-era work, with a header
pointing readers to the current spec.

Also tightens the webview CSP — drops the dead
`wss://ryderwishart--...` allow entry and adds `https://*.modal.run` so
the new HTTPS endpoint works under the policy.

The Modal app source was living only on the deployed instance; this
commit makes it discoverable and reviewable in-repo at
`docs/asr/codex_asr_modal.py`, with a README describing the deploy
workflow and migration plan.

Substantive changes vs. the currently-deployed source:

  - `modal.App("mms-zeroshot-asr")` → `modal.App("codex-asr")` so the
    Modal URL stops hard-coding the (long-since-replaced) model
    family. Deploying this file creates `genesis-ai-dev--codex-asr-
    serve.modal.run`. The old `mms-zeroshot-asr` deployment stays
    warm during the rollout — both serve identical responses.
  - Module docstring spells out the naming rationale, migration
    plan, and the auto-detect LID gap.
  - Comments in transcribe_audio() clarify that the absent `lang`
    field on auto-detect responses is intentional (no built-in LID),
    not a bug.

Functional contract is unchanged — same `/`, `/health`, `/languages`,
`/transcribe` endpoints, same response shape.

Deployment is a follow-up step (requires `modal token new` and a
deploy). The Frontier auth proxy + the client's default endpoint must
be updated to the new URL once deployed — see the handoff note
(separate commit).

The unpinned image was pulling torch wheels built against CUDA 13,
which fails to load on Modal's debian_slim (libcudart.so.13 missing).
omnilingual-asr 0.2.0 also exposes the omniASR_LLM_1B_v2 model card
the legacy mms-zeroshot-asr deployment was using.

Add a small _ensure_gang_context() shim around the inference path:
fairseq2 0.6 stores the current-gangs stack on a threading.local()
that is only initialised on the importing thread, so FastAPI worker
threads otherwise blow up with AttributeError on every request.

- package.json: point default asrEndpoint at the new codex-asr Modal
  app, default asrProvider to "omniasr" (was "mms"), and asrModel to
  omniASR_LLM_1B_v2. Drop the "ASR WebSocket Endpoint" framing — the
  endpoint is HTTPS multipart.
- docs/AUTH_SERVER_ASR_IMPLEMENTATION.md: add an action-items section
  for the Frontier auth-proxy team (new upstream URL, forward ?lang=
  verbatim, drop legacy fields) and link the in-repo Modal source.

When the client omits ?lang= we now run facebook/mms-lid-2048 first
to detect the ISO 639-3 base, pair it with the default script for
that base, and feed the resolved {iso639_3}_{Script} code into
OmniASR. The same code is echoed back in the response so the client
can render a real "detected language" badge.

LID adds ~70-130 ms when warm, ~12 s on the first call after a
cold-start. If LID fails for any reason (silence, unrecognised
language, base not in OmniASR's set) we fall through to
unconditioned transcription and omit `lang` in the response.

The script-default table is a Python mirror of
sharedUtils/omniAsrDefaultScripts.ts; keep both in sync.

Now that the Modal endpoint runs MMS-LID in auto-detect mode, the
language badge on transcribed text is informative again (it reflects
the server's resolved code, which is the LID result in auto-detect
mode and the user-supplied code in project mode).

Relabel the Script dropdown to use plainer language:
- "Best guess (default)"  → "Default"
- "Custom (ISO 15924 tag)" → "Other (ISO 15924 tag)"

Two issues stacking made auto-detect render the project language
("Arabic") even when the user spoke clear English:

1. Badge labeller bug: in auto-detect mode we still passed
   `projectLanguageName` to `labelForTranscriptionLanguage`, so when
   the server didn't echo a `lang` (e.g. the legacy endpoint without
   LID) the labeller fell through to its project-name last-ditch
   fallback. Now we pass null for both `sentLang` and `projectName`
   in auto mode, so the only label source is the server's echo, and
   the explicit "Auto Detect" branch handles the missing-echo case.

2. The Frontier auth proxy still points its ASR upstream at the
   legacy `mms-zeroshot-asr` Modal app, which doesn't run LID. When
   the proxy hands us that URL we now detect the legacy host and
   fall back to the configured `asrEndpoint` (defaulted to
   `codex-asr`, which does run LID). The bypass becomes a no-op once
   the proxy migrates its upstream — see
   docs/AUTH_SERVER_ASR_IMPLEMENTATION.md.

Revert the auth-proxy bypass added in 2760a8b; this PR should
behave like main and route ASR through the Frontier auth-proxy,
which still forwards to the legacy `mms-zeroshot-asr` Modal app.
Changing that routing is a separate decision that needs sign-off.

Since the legacy upstream doesn't run LID and never echoes `lang`,
the transcription language badge would either silently say
"Auto Detect" or fall back to the project language — neither of
which is honest. Hide the badge with a TODO comment pointing at
the auth-proxy migration; the `transcriptionLanguageLabel` prop
and all the plumbing through it stay wired so re-enabling is a
one-line change once the proxy upstream moves to `codex-asr`.

The `codex-asr` Modal app (with MMS-LID baked in) stays deployed
and ready — see docs/asr/codex_asr_modal.py and
docs/AUTH_SERVER_ASR_IMPLEMENTATION.md.

TimRl self-requested a review June 5, 2026 18:24

Development

Successfully merging this pull request may close these issues.

When transcribing audio there is an 'Unknown' tag

1 participant