feat: visual description mode for videos without speech #11

Merged
Mapleeeeeeeeeee merged 5 commits into main from feat/visual-description-mode on Apr 28, 2026

@Mapleeeeeeeeeee
Owner

Summary

  • Add visual description mode that uses Gemini 3.1 Flash Lite Preview to analyze video frames and generate timestamped description subtitles for speechless videos
  • Frontend toggle between "Speech Subtitles" and "Visual Description" modes with full pipeline support
  • Switch modes in either direction (visual→subtitle or subtitle→visual) at any phase; visual→subtitle falls back to on-the-fly audio extraction

Changes

  • Backend: visual_describer.py core module, pipeline branching in pipeline.py, ProcessingMode enum, upload validation reordering
  • Frontend: mode toggle in UrlInput, ProcessingMode type, DownloadLinks filtering (hide ASS/Audio in visual mode), ProgressTracker source badge, useJob state tracking, startSubtitle mode forwarding
  • Config: GEMINI_API_KEY + VISUAL_DESCRIPTION_MODEL env vars
  • Docs: design doc + architecture doc
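
As a rough illustration of the `ProcessingMode` enum and the pipeline branching listed above, the sketch below parses a mode string and picks a step list from it. The member values (`"subtitle"`, `"visual_description"`), the helper names `parse_mode` and `plan_steps`, and the step names are assumptions made for illustration; none of them are taken from the PR. The `str, Enum` mixin stands in for `StrEnum` so the snippet also runs on Python versions before 3.11.

```python
from enum import Enum


class ProcessingMode(str, Enum):
    """Processing modes; the string values are assumed, not confirmed by the PR."""

    SUBTITLE = "subtitle"
    VISUAL_DESCRIPTION = "visual_description"


def parse_mode(raw: str) -> ProcessingMode:
    """Turn user input into a ProcessingMode, raising ValueError if unknown."""
    try:
        return ProcessingMode(raw)
    except ValueError:
        allowed = ", ".join(m.value for m in ProcessingMode)
        raise ValueError(
            f"invalid processing_mode {raw!r}; expected one of: {allowed}"
        ) from None


def plan_steps(mode: ProcessingMode) -> list[str]:
    """Hypothetical pipeline plan: visual mode skips audio extraction."""
    if mode is ProcessingMode.VISUAL_DESCRIPTION:
        return ["download", "describe_frames", "translate", "serialize"]
    return ["download", "extract_audio", "transcribe", "translate", "serialize"]
```

Rejecting unknown strings at the boundary with a `ValueError` lets a route map bad input to a client error instead of a 500.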

Test plan

  • 166 unit tests pass, no regression
  • ruff, mypy, TypeScript, prettier all clean
  • curl E2E: create job (both modes), upload with mode, invalid mode rejection
  • curl E2E: visual→subtitle mode switch (C-B2 fix verified — audio extracted on-the-fly)
  • curl E2E: subtitle→visual mode switch
  • curl E2E: SRT download with bilingual content verified
  • 4 rounds of code review (correctness, convention, simplicity, cleanliness, efficiency)
  • QA verification of all bug fixes

🤖 Generated with Claude Code

Mapleeeeeeeeeee and others added 5 commits April 24, 2026 14:12
New pipeline mode that uses Gemini to analyze video frames and generate
translated subtitles from visual content (on-screen text, UI elements,
scene descriptions). Users toggle between speech subtitles and visual
description via a new UI switch.

Backend: core/visual_describer.py (Gemini File API), pipeline branch on
processing_mode, configurable model via VISUAL_DESCRIPTION_MODEL env var.
Frontend: Toggle in UrlInput, i18n keys, processing_mode in request types.
Tests: 7 unit tests + 3 integration tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ciples

Restructured DESCRIBE_PROMPT using XML sections with why-what-how flow:
pacing (3-8s segments), on_screen_text (quote actual text for translation),
ui_actions (narrate purpose not labels), skip (omit logo cards).
Also fixed file processing wait loop and mypy type issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
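
The prompt restructuring described in the commit above could look roughly like the sketch below. Only the four section names and their stated intent come from the commit message; the wording inside each section, the `DESCRIBE_PROMPT_TEMPLATE` name, and the `build_describe_prompt` helper are hypothetical, since the actual `DESCRIBE_PROMPT` text is not shown here. Interpolating `source_lang` follows the later commit note about using it in the prompt.

```python
# Hypothetical wording: the real DESCRIBE_PROMPT is not reproduced in this PR page.
DESCRIBE_PROMPT_TEMPLATE = """\
<pacing>
Split the video into segments of roughly 3-8 seconds each.
</pacing>
<on_screen_text>
Quote on-screen text exactly as shown so it can be translated.
The source language is {source_lang}.
</on_screen_text>
<ui_actions>
Describe the purpose of each UI action rather than reading out labels.
</ui_actions>
<skip>
Omit logo cards.
</skip>
"""


def build_describe_prompt(source_lang: str) -> str:
    """Fill the assumed template with the job's source language."""
    return DESCRIBE_PROMPT_TEMPLATE.format(source_lang=source_lang)
```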
- Add ProcessingMode StrEnum to eliminate magic strings
- Add polling timeout (600s) and FAILED state handling for Gemini file upload
- Clean up uploaded files from Gemini after use (try/finally)
- Fix error handling in _run_visual_description_subtitle (no re-raise)
- Fix check order: _genai import before API key validation
- Add processing_mode to upload route
- Extract _require_api_key helper (Rule of Three)
- Use source_lang in prompt instead of ignoring it
- Pass work_dir as parameter to _serialize_translated_only
- Tighten test assertions (remove assertion roulette)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
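
The polling timeout, `FAILED` handling, and `try/finally` cleanup from the commit above can be sketched without depending on any particular SDK. In this minimal version the state strings, the callables (`get_state`, `upload`, `describe`, `delete`), and both function names are assumptions; the real `_wait_for_active` talks to the Gemini File API, which is not modeled here.

```python
import time
from typing import Callable


def wait_for_active(
    get_state: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll until the upload is ACTIVE; fail fast on FAILED or on timeout."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "ACTIVE":
            return
        if state == "FAILED":
            raise RuntimeError("file processing failed")
        if clock() >= deadline:
            raise TimeoutError(f"file not active after {timeout_s:.0f}s")
        sleep(poll_interval_s)


def describe_with_cleanup(upload, describe, delete):
    """Run describe() on an uploaded file and always delete the upload afterwards."""
    handle = None
    try:
        handle = upload()
        return describe(handle)
    finally:
        # Guard: upload() may have raised before handle was assigned.
        if handle is not None:
            delete(handle)
```

Injecting `sleep` and `clock` keeps the timeout path testable without waiting ten minutes.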
Code review fixes:
- Remove duplicate [dependency-groups] section from pyproject.toml
- Validate processing_mode in upload route (prevent 500 on invalid input)
- Use ProcessingMode.SUBTITLE as Form default instead of magic string
- Change _genai-missing check to raise ValueError for correct error routing
- Guard client in finally block to prevent potential UnboundLocalError
- Refactor _require_api_key to accept value directly (type-safe, no getattr)
- Skip audio extraction for visual description mode (saves 30-60s)
- Align pre-commit mypy (v1.10→v1.19.1) and add google-genai to its
  additional_dependencies so both environments resolve the same types

Test review fixes:
- Add MM:SS and HH:MM:SS timestamp parsing tests
- Add start>=end boundary value test
- Add FAILED state and timeout path tests for _wait_for_active
- Verify describe→translate causal chain in IT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
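
The timestamp cases named in the test review fixes above (`MM:SS`, `HH:MM:SS`, and `start >= end`) can be illustrated with a small parser. This is a sketch under assumptions: the function names are hypothetical, and dropping segments whose start is not before their end is one plausible policy, not necessarily what `visual_describer.py` does.

```python
def parse_timestamp(text: str) -> float:
    """Convert 'MM:SS' or 'HH:MM:SS' into seconds."""
    parts = text.strip().split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"unsupported timestamp: {text!r}")
    seconds = 0.0
    for part in parts:
        # Each further field is worth 1/60 of the previous one.
        seconds = seconds * 60 + float(part)
    return seconds


def keep_valid_segments(segments):
    """Drop (start, end, text) segments whose start is not strictly before end."""
    return [(start, end, text) for (start, end, text) in segments if start < end]
```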
Round-3: early-fail on missing file name, audio extraction guard for
mode switch, validation before file upload, error message fixes,
frontend ProcessingMode type alias, upload FormData forwards mode.

Round-4: ProgressTracker visual description label, DownloadLinks hides
ASS/Audio in visual mode, startSubtitle forwards processingMode,
consolidate get_settings calls, update docs to Gemini 3.1 Flash Lite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
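
The audio extraction guard for mode switches mentioned in Round-3 might be shaped like the sketch below: a job that started in visual mode has no audio file yet, so switching to subtitle mode extracts it on demand. The `ensure_audio` name, the `audio.wav` filename, and the `extract` callable are assumptions for illustration only.

```python
from pathlib import Path
from typing import Callable


def ensure_audio(
    work_dir: Path,
    extract: Callable[[Path], None],
    name: str = "audio.wav",
) -> Path:
    """Return the job's audio file, extracting it first if it is missing."""
    audio_path = work_dir / name
    if not audio_path.exists():
        # Visual mode skipped extraction earlier, so do it now.
        extract(audio_path)
    return audio_path
```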
@Mapleeeeeeeeeee Mapleeeeeeeeeee merged commit dede41b into main Apr 28, 2026
2 of 3 checks passed
@Mapleeeeeeeeeee Mapleeeeeeeeeee deleted the feat/visual-description-mode branch April 28, 2026 03:38
