Add assembly dub command for video/audio dubbing #135

Merged
alexkroman merged 4 commits into main from claude/gifted-euler-jj1a1r
Jun 12, 2026
@alexkroman

Collaborator

Implements the assembly dub command, a complete end-to-end dubbing pipeline that transcribes media with speaker diarization, translates utterances via LLM Gateway, synthesizes translations with streaming TTS (one voice per speaker), and muxes the dubbed audio back into the original media using ffmpeg.

Key Changes

  • New module aai_cli/dub_exec.py: Core dubbing pipeline orchestration

    • Transcription with speaker labels (or reuse of existing transcript via --transcript-id)
    • Per-utterance translation to target language with LLM Gateway
    • Streaming TTS synthesis with voice rotation per speaker
    • Timeline assembly: places synthesized segments at original utterance timestamps with silence padding
    • ffmpeg mux: swaps dubbed audio track while preserving video stream
    • Language code resolution (ISO-639-1 to language names) with fallthrough for unlisted languages
    • Comprehensive validation: sandbox-only check, media file existence, ffmpeg availability, output path safety
  • New command module aai_cli/commands/dub.py: CLI interface

    • Required flags: MEDIA (local file), --lang (target language)
    • Optional flags: --transcript-id (reuse transcript), --voice (voice assignment), --model/--max-tokens (LLM config), --out (output path)
    • JSON output mode support
    • Integrated help text with usage examples
  • Comprehensive test suite:

    • tests/test_dub_exec.py: Pure helpers (language resolution, output naming, timeline assembly, utterance extraction) and validation order
    • tests/test_dub_pipeline.py: End-to-end faked pipeline runs with mocked transcription, translation, TTS, and ffmpeg
    • tests/test_dub_command.py: Argv parsing and flag mapping
    • tests/_dub_helpers.py: Shared test fixtures and fake boundary recorders
  • Integration updates:

    • Registered dub command in main app and help group ordering
    • Updated import-linter architecture contracts (.importlinter)
    • Updated help snapshot tests
    • Updated README with dub example
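
The language code resolution listed under dub_exec.py can be sketched roughly as below. This is an illustration only: the table contents and the name resolve_language are assumptions, since the PR text does not show the real mapping or function name.

```python
# Hypothetical lookup table; the real module's mapping is not shown in this PR.
_LANGUAGE_NAMES = {
    "de": "German",
    "es": "Spanish",
    "fr": "French",
    "ja": "Japanese",
}


def resolve_language(lang: str) -> str:
    """Map an ISO-639-1 code to a language name; pass anything else through."""
    cleaned = lang.strip()
    if not cleaned:
        raise ValueError("--lang must not be empty")
    # Known codes resolve to a name; unlisted values fall through unchanged,
    # so a caller can also pass a language name directly.
    return _LANGUAGE_NAMES.get(cleaned.lower(), cleaned)
```

With this sketch, resolve_language("de") yields "German", while an unlisted value such as "Klingon" is handed to the translation step as-is.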

Implementation Details

  • Sandbox-only: Streaming TTS is only available in the sandbox environment; the command enforces this with a clear error message and suggestion
  • Voice assignment: Reuses the existing dialogue module logic for voice rotation and speaker-to-voice mapping (bare --voice applies to all speakers; SPEAKER=VOICE pins individuals)
  • Timeline precision: Uses sample-rate-aware byte math to place segments at millisecond-accurate start times, with overlap handling (a segment is appended immediately if its predecessor overran its slot)
  • Validation order: Cheap local checks (language, file existence, ffmpeg availability) run before any credential resolution or network calls
  • Error handling: Distinct error types for each failure mode (missing file, unsupported environment, API errors, ffmpeg failures) with actionable suggestions
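
The timeline placement described above might look something like the following sketch. It assumes raw little-endian PCM, segments given as (start_ms, pcm_bytes) pairs, and segment lengths that are whole frames; the function name and defaults are hypothetical, not taken from dub_exec.py.

```python
from typing import Iterable, Tuple


def assemble_timeline(
    segments: Iterable[Tuple[int, bytes]],
    sample_rate: int = 24000,  # assumed default, not confirmed by the PR
    sample_width: int = 2,     # bytes per sample (16-bit PCM)
    channels: int = 1,
) -> bytes:
    """Lay PCM segments on a silence timeline at their start times (ms)."""
    frame_bytes = sample_width * channels
    timeline = bytearray()
    for start_ms, pcm in sorted(segments, key=lambda seg: seg[0]):
        # Convert milliseconds to a frame index first, then to bytes, so the
        # offset always lands on a frame boundary.
        start_byte = (start_ms * sample_rate // 1000) * frame_bytes
        if start_byte > len(timeline):
            # Pad the gap before this segment with silence (zero samples).
            timeline.extend(b"\x00" * (start_byte - len(timeline)))
        # If the previous segment overran this start time, no padding is
        # added and the segment is appended immediately after it.
        timeline.extend(pcm)
    return bytes(timeline)
```

Computing the frame index before multiplying by the frame size keeps offsets frame-aligned, which avoids splitting a sample across the padding boundary.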

https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX

One command that runs the whole platform end to end: the media is
transcribed with diarized utterance timestamps, each utterance is
translated to the target language by an LLM Gateway model, each
translation is synthesized with streaming TTS (one rotating voice per
speaker, with --voice/SPEAKER=VOICE overrides as in speak), the segments
are laid on a silence timeline at their original start times, and ffmpeg
swaps the new track over the original media with the video stream
copied untouched (-map 0:v? -c:v copy, so audio-only input works too).
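
As a rough illustration of the mux step, the argv could be built as below. Only -map 0:v? and -c:v copy come from the message above; the second input, the -map 1:a selection, the -y flag, and the function name are assumptions.

```python
from pathlib import Path
from typing import List


def build_mux_argv(media: Path, dubbed_audio: Path, out: Path) -> List[str]:
    """Build an ffmpeg argv that swaps in the dubbed audio track."""
    return [
        "ffmpeg",
        "-y",                     # assumption: overwrite the output without prompting
        "-i", str(media),         # input 0: original media
        "-i", str(dubbed_audio),  # input 1: assembled dub track
        "-map", "0:v?",           # take video from input 0 if any ("?" makes it optional)
        "-map", "1:a",            # take audio from the dub track
        "-c:v", "copy",           # copy the video stream without re-encoding
        str(out),
    ]
```

The trailing "?" on the stream specifier is what lets audio-only input through: without it, ffmpeg fails when input 0 has no video stream to map.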

Usage: assembly --sandbox dub talk.mp4 --lang de

- --lang takes an ISO code (mapped to a language name) or a name as-is
- --transcript-id reuses an existing diarized transcript
- default output <name>.dub.<lang><ext>; --out overrides (and refuses
  to overwrite the input)
- sandbox-only, like speak: streaming TTS has no production host yet
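
The default output naming and the overwrite refusal can be sketched as follows. The slug rule and both function names are hypothetical; only the <name>.dub.<lang><ext> pattern and the refusal to overwrite the input come from the message above.

```python
import re
from pathlib import Path
from typing import Optional


def slug(lang: str) -> str:
    """Reduce a language string to a filename-safe ASCII slug (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", lang.lower()).strip("-")


def resolve_output(media: Path, lang: str, out: Optional[Path] = None) -> Path:
    """Pick <name>.dub.<lang><ext> unless --out is given; never the input."""
    if out is None:
        lang_slug = slug(lang)
        if not lang_slug:
            # A language that slugs to nothing cannot produce a usable name.
            raise ValueError("cannot derive an output name; pass --out")
        out = media.with_name(f"{media.stem}.dub.{lang_slug}{media.suffix}")
    if out.resolve() == media.resolve():
        raise ValueError("refusing to overwrite the input file")
    return out
```

For example, resolve_output(Path("talk.mp4"), "de") gives talk.dub.de.mp4 next to the input.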

Follows the options/run split (commands/dub.py parses argv into a
frozen DubOptions; dub_exec.run_dub does the work), with the LLM,
TTS, and ffmpeg boundaries seamed for hermetic tests.
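
A minimal sketch of that split, under stated assumptions: DubOptions and run_dub are named in the message, but the fields, the signature, and the injected translate/synthesize callables shown here are hypothetical stand-ins for the real seams.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass(frozen=True)
class DubOptions:
    """Parsed argv; frozen so the pipeline cannot mutate it."""
    media: Path
    lang: str
    transcript_id: Optional[str] = None  # assumed field
    out: Optional[Path] = None           # assumed field


def run_dub(
    opts: DubOptions,
    utterances: List[str],
    *,
    translate: Callable[[str, str], str],  # hypothetical LLM seam
    synthesize: Callable[[str], bytes],    # hypothetical TTS seam
) -> List[bytes]:
    """Translate then synthesize each utterance through injected boundaries."""
    return [synthesize(translate(text, opts.lang)) for text in utterances]
```

Because the boundaries are passed in, a test can hand run_dub plain lambdas and never touch the network or ffmpeg.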

https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX
@alexkroman alexkroman enabled auto-merge June 12, 2026 22:10
claude and others added 3 commits June 12, 2026 22:19
Union resolutions: dictate + dub both land in the importlinter
contracts, the run help-group, and the README features list; the
run-group help snapshots were regenerated (not hand-merged) on top of
main's copy.

https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX
CI renders CliRunner output with color, so style codes interleave inside
flag names ("--lang") and the human summary line, breaking substring
asserts that pass locally without color. Strip SGR sequences first via a
shared plain() helper, the same convention test_help_rendering and the
clip suite use.
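
A helper along these lines would do the stripping. This is a sketch, not the repository's actual plain(): the regex is an assumption and covers only SGR (color/style) sequences.

```python
import re

# Matches SGR sequences such as "\x1b[1m" or "\x1b[38;5;12m".
_SGR = re.compile(r"\x1b\[[0-9;]*m")


def plain(text: str) -> str:
    """Remove ANSI color/style codes so substring asserts see bare text."""
    return _SGR.sub("", text)
```

A colored rendering like "\x1b[1m--\x1b[0m\x1b[36mlang\x1b[0m" then reduces to "--lang", so the same assert passes with or without color.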

https://claude.ai/code/session_01Mcran5xqMHcrt4RUxSHrkX
@alexkroman alexkroman added this pull request to the merge queue Jun 12, 2026
Merged via the queue into main with commit dcb96e2 Jun 12, 2026
15 checks passed
@alexkroman alexkroman deleted the claude/gifted-euler-jj1a1r branch June 12, 2026 22:38
alexkroman pushed a commit that referenced this pull request Jun 12, 2026
All confirmed findings from the dub (#135) code review:

- Self-overwrite guard now also catches the same file under another
  spelling (samefile when --out exists): on case-insensitive
  filesystems (macOS APFS) `--out TALK.MP4` against talk.mp4 passed the
  path comparison and ffmpeg corrupted the input.
- Fresh transcriptions auto-detect the source language (dub input is
  typically not English, which is the API default); a new --source-lang
  flag pins it instead.
- --out viability is validated before the billed pipeline: existing
  directory, missing parent directory, and missing file extension
  (ffmpeg picks the container from it) now fail upfront, and a language
  that slugs to nothing (e.g. 中文) asks for an explicit --out instead
  of colliding every such dub onto "<stem>.dub..<ext>".
- --voice is parsed before any billed work, and SPEAKER=VOICE pins for
  speakers absent from the diarized transcript warn instead of being
  dropped silently (mirrors assembly speak).
- A --transcript-id that is queued/processing/errored is rejected with
  the real reason (shared resolve_diarized_transcript, so clip gets the
  same fix) instead of a misleading "no utterances" error.
- Translations truncated at max_tokens (finish_reason length/max_tokens)
  raise instead of dubbing speech that stops mid-sentence.
- The success line escapes user-controlled --lang/--voice text (an
  embedded "[/]" crashed with MarkupError after the dub succeeded).
- URLs are rejected with the URL echoed intact (Path() collapsed
  "s3://…" to "s3:/…") and a download hint.
- ffmpeg output paths starting with "-" are passed as "./-…" so they
  can't be parsed as ffmpeg options (clip's cut destinations too).

https://claude.ai/code/session_018TuAQTvp9PVy5EdhsDWo2h
2 participants