Skip to content

feat: support any yt-dlp site + readable download filenames#9

Merged
Mapleeeeeeeeeee merged 5 commits into
mainfrom
feat/readable-filename-multiplatform
Apr 20, 2026
Merged

feat: support any yt-dlp site + readable download filenames#9
Mapleeeeeeeeeee merged 5 commits into
mainfrom
feat/readable-filename-multiplatform

Conversation

@Mapleeeeeeeeeee
Copy link
Copy Markdown
Owner

Summary

  • Rename youtube_urlsource_url across all layers (schema, jobs, pipeline, routes, frontend)
  • Replace YouTube-only URL validation with yt-dlp extractor-based _is_supported_url (supports X/Twitter, TikTok, Bilibili, etc.)
  • Add _sanitize_filename + _build_download_filename for human-readable download filenames (e.g., Video Title (en_to_zh-TW).srt)
  • Prefer yt-dlp info_dict title over ffprobe MP4 tags (fixes missing titles on non-YouTube platforms)
  • Add @lru_cache to _is_supported_url for per-request performance (1,862 extractors)
  • Delete 103-line run_pipeline dead code, clean up bare except Exception, extract constants
  • Add API contract IT + live-download E2E tests; fix test quality issues (remove Mockery/Implementation-Detail anti-patterns)

Test plan

  • Unit tests: 126 passed
  • Ruff / mypy / vulture / prettier all pass
  • Manual test: X/Twitter video download + full pipeline (transcribe + translate) verified
  • CI: integration + E2E tests

🤖 Generated with Claude Code

Mapleeeeeeeeeee and others added 5 commits April 18, 2026 09:50
- Rename youtube_url -> source_url across backend schemas, Job, routes,
  pipeline and all tests, plus frontend types, component and i18n copy.
  Reflects that the tool now supports any platform yt-dlp can handle
  (X, Twitter, TikTok, etc.), not just YouTube.
- Replace regex-based _is_youtube_url() with _is_supported_url(), which
  delegates to yt-dlp extractors (excluding the generic one) so URL
  validation always matches yt-dlp's actual capabilities.
- Download filenames now use the video title plus a language-intent
  suffix: "<title> (original).srt" for source-only outputs and
  "<title> (en_to_zh-TW).srt" for translated ones, with illegal chars
  stripped and length capped at 120 chars.
- Frontend copy and URL validation relaxed to accept any https URL.
Correctness / Cleanliness:
- Rename download_youtube_video -> download_video (leaked abstraction);
  update all call sites in pipeline, core/__init__ and tests.
- Swap Chinese error string for English ("URL not supported by
  downloader (yt-dlp): ...") to match the rest of the file and avoid
  leaking a UI concern into core.
- Strip trailing " ." after 120-char truncation in _sanitize_filename
  to avoid Windows-invalid names ending in space.
- Update module docstring in downloader.py to drop "YouTube".

Efficiency (Blocker):
- Cache yt-dlp extractor classes at module load; _is_supported_url
  now iterates the cached list and calls suitable() on classes
  (~11ms/call -> ~0ms). Removes the deferred import + PLC0415 noqa.

Simplicity / Convention:
- Merge _EXTENSIONS and _MEDIA_TYPES into a single _FILE_META dict
  keyed by FileType (NamedTuple of ext + media_type).
- Extract magic strings: _DEFAULT_FILENAME, _SUFFIX_ORIGINAL,
  _LANG_SEPARATOR.

Frontend:
- Tighten URL pattern from /^https?:\/\/.+/ to /^https?:\/\/\S+$/
  so whitespace-only suffixes are rejected client-side.

API defensiveness (QA F2):
- JobCreateRequest now uses extra="forbid"; clients that send
  youtube_url (old field) alongside source_url get 422 instead of a
  silent success.

Tests:
- Strengthen two assertion-roulette cases to assert exact strings /
  legal content preservation, so a mutation that returns the fallback
  "video" would now fail the tests.
download_video was populating VideoMetadata.title from ffprobe's MP4
container tags, which for a freshly-downloaded video is empty — ffprobe
falls back to the filename stem `video`. The yt-dlp info_dict carries
the upstream platform title (YouTube / X / TikTok…), matching the
pre-existing treatment of `description`.

Previously unobservable because the download filename was hardcoded to
`bilingualsub.{ext}`. After the rename to use `video_title` in the
Content-Disposition filename, every downloaded file showed up as
`video (original).{ext}`. E2E run with "Me at the zoo" (YouTube's
first video) now produces `Me at the zoo (original).mp4` as expected.
tests/integration/test_api_contract.py (19 tests, @integration):
- JobCreateRequest schema: source_url accepted; youtube_url (old field),
  both fields simultaneously, non-URL strings, and empty bodies all get
  422 via HttpUrl + extra="forbid".
- HTTP-layer acceptance of any valid HttpUrl (YouTube, youtu.be, X,
  Twitter, TikTok, Vimeo) — the test is intentionally scoped to schema
  layer only; extractor-level coverage is in the unit tests and in the
  async failure test below.
- Async unsupported-URL path: example.com creates a job then fails via
  the pipeline with error.code=invalid_input and a yt-dlp-specific
  detail. 10-second polling window to survive CI thread contention.
- Content-Disposition filename format: parametrized over title,
  language pair, and file type, covering translated outputs, same-lang
  original suffix, empty-title fallback, and illegal-char sanitisation.
- Both fixtures use `with TestClient(app) as c: yield c` so Starlette
  actually runs the lifespan and populates app.state.job_manager.

tests/e2e/test_filename_e2e.py (2 tests, @e2e, gated by ENABLE_LIVE_DOWNLOAD=1):
- Real yt-dlp download of "Me at the zoo" through FastAPI, then asserts
  Content-Disposition filename == "Me at the zoo (original).{mp4|mp3}".
- Specifically exists to catch regression of the bug where video_title
  was read from ffprobe MP4 container tags (empty) instead of yt-dlp
  info_dict.title. If the live download fails we pytest.fail — silent
  skip would hide the regression the test is built to detect.

tests/unit/core/test_downloader.py:
- Rename test_ffprobe_missing_title_uses_filename -> test_info_dict_
  title_overrides_ffprobe. The old expectation (fall back to filename
  stem) tested behaviour that was itself the bug; the new assertion
  matches the documented contract that info_dict.title always wins.
Production code:
- Delete run_pipeline dead code (103 lines)
- Replace list[Any] with list[type] for extractor classes
- Add @lru_cache to _is_supported_url (1862 extractors)
- Replace bare except Exception with specific exceptions
- Add _FILENAME_BAD_CHARS comment, extract _MAX_UPLOAD_BYTES constant
- Remove WHAT comments, remove dead url_placeholder i18n key

Test code:
- Delete 3 Implementation-Detail/Mockery tests in test_pipeline
- Migrate all tests from run_pipeline to run_download/run_subtitle
- Remove mock internal assertions in test_downloader
- Use exact value assertion in test_routes
- Rename misleading IT to clarify PlayRes fixed design
- Add coverage: extract_audio failure + sanitize edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Mapleeeeeeeeeee Mapleeeeeeeeeee merged commit 767a2b8 into main Apr 20, 2026
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant