Add podcast RSS/Atom feed expansion to transcribe command #195
`assembly transcribe <feed-url>` now expands a podcast RSS/Atom feed into its episode enclosure URLs and runs them through the existing batch path — one `.aai.json` sidecar per episode, resumable, concurrent, and compatible with `--llm`/`--llm-reduce`. The enclosures are direct media URLs the API fetches itself, so no per-episode yt-dlp download is needed (unlike a podcast *page*). Detection is deliberately narrow to avoid surprise fetches: only an http(s) URL whose path is feed-shaped (extensionless or `.xml`/`.rss`/`.atom`) and that no dedicated yt-dlp extractor already claims is probed, the response body is bounded to 10 MB, binary media content types are skipped, and only content that actually parses as a feed with at least one enclosure is treated as one. `--show-code` skips the probe entirely so it never touches the network. Docs (README, transcribe help/docstring, aai-cli skill reference) updated to list RSS feeds alongside files, URLs, and YouTube/podcast pages. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
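The narrow detection rules in this commit message can be sketched with standard-library pieces. This is a minimal illustration, not the code in `feed.py`: the function names, the `MEDIA_PREFIXES` definition of "binary media", and the choice to return `None` when the cap is exceeded are all assumptions, and the yt-dlp extractor check is left out.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional
from urllib.parse import urlsplit

MAX_FEED_BYTES = 10 * 1024 * 1024  # the 10 MB cap named in the description
FEED_SUFFIXES = {"", ".xml", ".rss", ".atom"}  # "" means an extensionless path
MEDIA_PREFIXES = ("audio/", "video/", "image/")  # assumed meaning of "binary media"


def looks_like_feed_url(url: str) -> bool:
    """True for an http(s) URL whose path is extensionless or .xml/.rss/.atom."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    return PurePosixPath(parts.path).suffix.lower() in FEED_SUFFIXES


def is_binary_media(content_type: str) -> bool:
    """True when a Content-Type header value names media rather than markup."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith(MEDIA_PREFIXES)


def read_bounded(chunks: Iterable[bytes], limit: int = MAX_FEED_BYTES) -> Optional[bytes]:
    """Join streamed chunks; return None once the body grows past `limit` (assumed policy)."""
    body = bytearray()
    for chunk in chunks:
        body += chunk
        if len(body) > limit:
            return None
    return bytes(body)
```

In a real fetch the `chunks` argument would come from a streaming HTTP response, so an oversized body is abandoned before it is fully downloaded.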
Swap the regex-based RSS/Atom enclosure extraction in feed.py for `feedparser`, the de-facto standard feed parser — it handles the namespace, encoding, and malformed-markup edge cases a regex never will. The bounded, content-type-guarded httpx fetch stays the only network path: feedparser is handed the already-fetched bytes (never a URL), so it never fetches on its own. feedparser's result is untyped, so it's validated through a small pydantic model (the project pattern for untyped third-party returns — cf. core/wer.py), keeping feed.py strict-clean under mypy and pyright. Adds feedparser as a runtime dependency. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
    class _Entry(BaseModel):
        enclosures: list[_Enclosure] = []
_Entry.enclosures uses a mutable list default (=[]). Use a default factory (e.g., Field(default_factory=list)) so each instance gets its own list instead of sharing one across requests.
✨ AI Reasoning
A model field is defined with a mutable list literal as its default. Mutable defaults in module-level class definitions can be shared across instances, causing data from one request or parse to appear in another. This change introduces shared mutable state where each parsed feed/entry should have its own fresh list.
🔧 How do I fix it?
Avoid storing request-specific data in module-level variables. Use request-scoped variables or explicitly mark shared caches as intentional.
Replace the `= []` field defaults on the pydantic feed models with `Field(default_factory=list[...])`. pydantic v2 already deep-copies mutable defaults per instance, so `= []` was not actually shared — but the explicit typed factory makes per-instance isolation obvious to readers and static analysis, while keeping the field's element type known under pyright strict (a bare `default_factory=list` infers `list[Unknown]`). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
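A small sketch of the pattern this commit describes, assuming pydantic is installed. The `href` field on `_Enclosure` is hypothetical; only the `enclosures` field and the typed factory come from the review thread.

```python
from pydantic import BaseModel, Field


class _Enclosure(BaseModel):
    href: str = ""  # hypothetical field, for illustration only


class _Entry(BaseModel):
    # list[_Enclosure] is callable and returns a fresh empty list, so it works
    # as a factory while keeping the element type visible to type checkers.
    enclosures: list[_Enclosure] = Field(default_factory=list[_Enclosure])


first, second = _Entry(), _Entry()
first.enclosures.append(_Enclosure(href="https://example.com/ep1.mp3"))
# Each instance owns its list, so `second` is unaffected by the append above.
```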
Adds support for transcribing entire podcast feeds by expanding feed URLs into batch mode, with one resumable sidecar per episode enclosure.
Summary
When a user passes a podcast RSS or Atom feed URL to `assembly transcribe`, the CLI now detects it, fetches the feed, extracts all episode enclosure URLs, and transcribes them as a batch. Each episode is transcribed independently with its own `.aai.json` sidecar, and re-runs skip already-transcribed episodes. Direct media URLs and non-feed pages continue to work as before (single-source path).
Key Changes
- New module `aai_cli/app/transcribe/feed.py`: Implements feed detection, fetching, and parsing
  - `feed_episode_urls()`: Main entry point that gates on URL shape, checks for yt-dlp pages, fetches the feed, and parses enclosures
  - `_looks_like_feed_url()`: Detects feed-shaped URLs (extensionless or `.xml`/`.rss`/`.atom` extensions)
  - `_episode_urls()`: Parses RSS and Atom feeds, validates the root element, extracts enclosure URLs, dedupes while preserving order, and HTML-unescapes URLs
  - `_fetch()`: Bounded HTTP fetch (10 MB cap) that skips binary media content types and handles network errors gracefully
- Updated `aai_cli/app/transcribe/sources.py`: Integrates feed detection into the batch-vs-single-source routing
  - `detect_feeds` parameter to `expand_sources()` (defaults to `True`)
  - `feed.feed_episode_urls()` for batch expansion
  - `_local_sources()` helper
  - `--show-code` passes `detect_feeds=False` to skip network probes
- Updated `aai_cli/commands/transcribe.py`: Documentation and help text
- Updated help snapshots and reference docs: Reflect feed support in command descriptions
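The steps listed for `_episode_urls()` (check the root element, collect enclosure URLs, dedupe in order) can be sketched as follows. This uses the standard library's `xml.etree.ElementTree` as a stand-in, not the parser the PR ships. The function name and the empty-list return for non-feeds are assumptions, and the sketch is not hardened against hostile XML.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def episode_urls(body: bytes) -> list[str]:
    """Return enclosure URLs from an RSS or Atom document, first-seen order, no repeats."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return []  # not well-formed XML, so not a feed
    if root.tag == "rss":
        hrefs = [item.get("url") for item in root.iter("enclosure")]
    elif root.tag == f"{ATOM_NS}feed":
        hrefs = [
            link.get("href")
            for link in root.iter(f"{ATOM_NS}link")
            if link.get("rel") == "enclosure"
        ]
    else:
        return []  # e.g. an HTML page that merely mentions "enclosure"
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(h for h in hrefs if h))
```

An XML parser decodes entities such as `&amp;` in attribute values on its own, so no separate unescape step appears in this sketch.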
Implementation Details
- […] `.mp3`, etc.) and ordinary web pages are never fetched, avoiding double-fetches and unnecessary network calls
- […] `<rss>` or `<feed>` root element before enclosures are trusted, so stray HTML pages containing the word "enclosure" are never mistaken for feeds
- […] `None`, leaving the URL on the single-source path untouched
Testing
Comprehensive test suite in `tests/test_transcribe_feed.py` (311 lines) covers:
- […] `&amp;` → `&`)
- […] `.xml`, `.rss`, `.atom`)
https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq