Add `assembly dub` command for video/audio dubbing by alexkroman · Pull Request #135 · AssemblyAI/cli

alexkroman · 2026-06-12T22:10:29Z

Implements the assembly dub command, a complete end-to-end dubbing pipeline that transcribes media with speaker diarization, translates utterances via LLM Gateway, synthesizes translations with streaming TTS (one voice per speaker), and muxes the dubbed audio back into the original media using ffmpeg.

Key Changes

New module aai_cli/dub_exec.py: Core dubbing pipeline orchestration
- Transcription with speaker labels (or reuse of existing transcript via --transcript-id)
- Per-utterance translation to target language with LLM Gateway
- Streaming TTS synthesis with voice rotation per speaker
- Timeline assembly: places synthesized segments at original utterance timestamps with silence padding
- ffmpeg mux: swaps dubbed audio track while preserving video stream
- Language code resolution (ISO-639-1 to language names) with fallthrough for unlisted languages
- Comprehensive validation: sandbox-only check, media file existence, ffmpeg availability, output path safety
New command module aai_cli/commands/dub.py: CLI interface
- Required flags: MEDIA (local file), --lang (target language)
- Optional flags: --transcript-id (reuse transcript), --voice (voice assignment), --model/--max-tokens (LLM config), --out (output path)
- JSON output mode support
- Integrated help text with usage examples
Comprehensive test suite:
- tests/test_dub_exec.py: Pure helpers (language resolution, output naming, timeline assembly, utterance extraction) and validation order
- tests/test_dub_pipeline.py: End-to-end faked pipeline runs with mocked transcription, translation, TTS, and ffmpeg
- tests/test_dub_command.py: Argv parsing and flag mapping
- tests/_dub_helpers.py: Shared test fixtures and fake boundary recorders
Integration updates:
- Registered dub command in main app and help group ordering
- Updated import-linter architecture contracts (.importlinter)
- Updated help snapshot tests
- Updated README with dub example
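
The language resolution and default output naming listed above can be sketched roughly as follows. This is a minimal illustration, not the code in aai_cli/dub_exec.py: the table contents, the function names `resolve_language` and `default_output_path`, and the slug rule are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical subset of an ISO-639-1 table; the real table is not shown in this PR text.
_LANGUAGE_NAMES = {"de": "German", "es": "Spanish", "fr": "French", "ja": "Japanese"}


def resolve_language(lang: str) -> str:
    """Map a known ISO-639-1 code to a language name; pass anything else through."""
    cleaned = lang.strip()
    return _LANGUAGE_NAMES.get(cleaned.lower(), cleaned)


def default_output_path(media: Path, lang: str) -> Path:
    """Build <name>.dub.<lang><ext> next to the input file.

    Assumed slug rule: lowercase ASCII letters and digits, other runs become "-".
    """
    slug = re.sub(r"[^a-z0-9]+", "-", lang.strip().lower()).strip("-")
    if not slug:
        # A language such as "中文" slugs to nothing, so ask for an explicit --out.
        raise ValueError(f"cannot derive an output name from {lang!r}; pass --out")
    return media.with_name(f"{media.stem}.dub.{slug}{media.suffix}")
```

Under these assumptions, `resolve_language("de")` gives the name handed to the translation prompt, while an unlisted value such as a full language name falls through unchanged.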

Implementation Details

Sandbox-only: Streaming TTS is only available in the sandbox environment; the command enforces this with a clear error message and suggestion
Voice assignment: Reuses the existing dialogue module logic for voice rotation and speaker-to-voice mapping (bare --voice applies to all speakers; SPEAKER=VOICE pins individuals)
Timeline precision: Uses sample-rate-aware byte math to place segments at millisecond-accurate start times, with overlap handling (segments append immediately if predecessor overran)
Validation order: Cheap local checks (language, file existence, ffmpeg availability) run before any credential resolution or network calls
Error handling: Distinct error types for each failure mode (missing file, unsupported environment, API errors, ffmpeg failures) with actionable suggestions
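
To make the timeline rule concrete, here is a small sketch of the byte math described under "Timeline precision". It assumes raw little-endian PCM and segments already sorted by start time; the function name `assemble_timeline` and its parameters are hypothetical and may not match the actual helper.

```python
from typing import Iterable, Tuple


def assemble_timeline(
    segments: Iterable[Tuple[int, bytes]],
    sample_rate: int = 24000,  # assumed TTS output rate
    sample_width: int = 2,     # assumed 16-bit samples
    channels: int = 1,         # assumed mono
) -> bytes:
    """Place (start_ms, pcm_bytes) segments on a silence-padded timeline."""
    frame_bytes = sample_width * channels
    timeline = bytearray()
    for start_ms, pcm in segments:
        # Milliseconds -> frame index -> byte offset, so offsets stay frame-aligned.
        start_byte = (start_ms * sample_rate // 1000) * frame_bytes
        if start_byte > len(timeline):
            # Pad with silence up to the utterance's original start time.
            timeline.extend(b"\x00" * (start_byte - len(timeline)))
        # If the previous segment overran this start time, append immediately.
        timeline.extend(pcm)
    return bytes(timeline)
```

Computing the offset in whole frames before multiplying by the frame size keeps a segment from starting in the middle of a sample, which would otherwise produce noise.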

https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX

One command that runs the whole platform end to end: the media is transcribed with diarized utterance timestamps, each utterance is translated to the target language by an LLM Gateway model, each translation is synthesized with streaming TTS (rotation voice per speaker, --voice/SPEAKER=VOICE overrides like speak), the segments are laid on a silence timeline at their original start times, and ffmpeg swaps the new track over the original media with the video stream copied untouched (-map 0:v? -c:v copy, so audio-only input works too).

Usage: assembly --sandbox dub talk.mp4 --lang de
- --lang takes an ISO code (mapped to a language name) or a name as-is
- --transcript-id reuses an existing diarized transcript
- default output <name>.dub.<lang><ext>; --out overrides (and refuses to overwrite the input)
- sandbox-only, like speak: streaming TTS has no production host yet

Follows the options/run split (commands/dub.py parses argv into a frozen DubOptions; dub_exec.run_dub does the work), with the LLM, TTS, and ffmpeg boundaries seamed for hermetic tests.

https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX
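
The mux step described in this commit message might build its ffmpeg argument list along these lines. Only `-map 0:v?` and `-c:v copy` come from the text above; the `-map 1:a` selection, the `-y` flag, and the function name `build_mux_argv` are assumptions for illustration.

```python
from pathlib import Path
from typing import List


def build_mux_argv(media: Path, dubbed_audio: Path, out: Path) -> List[str]:
    """Return an ffmpeg argv that swaps in the dubbed track and copies the video."""
    return [
        "ffmpeg",
        "-y",                      # assumed: overwrite the already validated output path
        "-i", str(media),          # input 0: original media
        "-i", str(dubbed_audio),   # input 1: assembled dubbed timeline
        "-map", "0:v?",            # optional video stream, so audio-only input still works
        "-map", "1:a",             # assumed: take audio from the dubbed track only
        "-c:v", "copy",            # copy the video stream without re-encoding
        str(out),
    ]
```

Returning the argv as a list rather than running it keeps the ffmpeg boundary easy to fake in tests, in the spirit of the seams mentioned above.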

Union resolutions: dictate + dub both land in the importlinter contracts, the run help-group, and the README features list; the run-group help snapshots were regenerated (not hand-merged) on top of main's copy. https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX

CI renders CliRunner output with color, so style codes interleave inside flag names ("--lang") and the human summary line, breaking substring asserts that pass locally without color. Strip SGR sequences first via a shared plain() helper, the same convention test_help_rendering and the clip suite use. https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX
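
A helper of the kind this commit describes could look like the sketch below. The name `plain` comes from the commit message; the regular expression and body are an assumption about how such a helper is typically written, not the repository's actual code.

```python
import re

# SGR (Select Graphic Rendition) sequences: ESC [ <digits and semicolons> m
_SGR = re.compile(r"\x1b\[[0-9;]*m")


def plain(text: str) -> str:
    """Strip ANSI color/style codes so substring asserts see the bare text."""
    return _SGR.sub("", text)
```

A test would then assert on `plain(result.output)` instead of `result.output`, so a flag rendered as several styled fragments still matches `"--lang"`.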

All confirmed findings from the dub (#135) code review:
- Self-overwrite guard now also catches the same file under another spelling (samefile when --out exists): on case-insensitive filesystems (macOS APFS) `--out TALK.MP4` against talk.mp4 passed the path comparison and ffmpeg corrupted the input.
- Fresh transcriptions auto-detect the source language (dub input is typically not English, which is the API default); a new --source-lang flag pins it instead.
- --out viability is validated before the billed pipeline: existing directory, missing parent directory, and missing file extension (ffmpeg picks the container from it) now fail upfront, and a language that slugs to nothing (e.g. 中文) asks for an explicit --out instead of colliding every such dub onto "<stem>.dub..<ext>".
- --voice is parsed before any billed work, and SPEAKER=VOICE pins for speakers absent from the diarized transcript warn instead of being dropped silently (mirrors assembly speak).
- A --transcript-id that is queued/processing/errored is rejected with the real reason (shared resolve_diarized_transcript, so clip gets the same fix) instead of a misleading "no utterances" error.
- Translations truncated at max_tokens (finish_reason length/max_tokens) raise instead of dubbing speech that stops mid-sentence.
- The success line escapes user-controlled --lang/--voice text (an embedded "[/]" crashed with MarkupError after the dub succeeded).
- URLs are rejected with the URL echoed intact (Path() collapsed "s3://…" to "s3:/…") and a download hint.
- ffmpeg output paths starting with "-" are passed as "./-…" so they can't be parsed as ffmpeg options (clip's cut destinations too).

https://claude.ai/code/session_018TuAQTvp9PVy5EdhsDWo2h
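
Two of the review fixes above, the self-overwrite guard and the leading-dash output path, can be illustrated with the following sketch. The function names `is_same_media` and `ffmpeg_safe_path` are hypothetical, and the sketch assumes the input file's existence has already been validated.

```python
import os
from pathlib import Path


def is_same_media(media: Path, out: Path) -> bool:
    """True if --out would overwrite the input, even under another spelling."""
    if media.resolve() == out.resolve():
        return True
    # samefile compares the underlying file, so TALK.MP4 vs talk.mp4 on a
    # case-insensitive filesystem is caught; it requires both paths to exist.
    return out.exists() and os.path.samefile(media, out)


def ffmpeg_safe_path(path: Path) -> str:
    """Prefix "./" so a path starting with "-" is not parsed as an ffmpeg option."""
    text = str(path)
    # Path("./-x.mp4") normalizes back to "-x.mp4", so the prefix is added to the string.
    return f"./{text}" if text.startswith("-") else text
```

The guard would run before any billed work, and the path helper would be applied only at the point where the argv for ffmpeg is built.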

alexkroman enabled auto-merge June 12, 2026 22:10

claude and others added 3 commits June 12, 2026 22:19

Merge branch 'main' into claude/gifted-euler-jj1a1r

eca87fc

alexkroman added this pull request to the merge queue Jun 12, 2026

Merged via the queue into main with commit dcb96e2 Jun 12, 2026
15 checks passed

alexkroman deleted the claude/gifted-euler-jj1a1r branch June 12, 2026 22:38

alexkroman mentioned this pull request Jun 12, 2026

Deduplicate clip/dub media scaffolding into shared mediafile module #137

Merged

alexkroman merged 4 commits into main from claude/gifted-euler-jj1a1r

2 participants
