Add podcast RSS/Atom feed expansion to transcribe command #195

Merged: alexkroman merged 4 commits into main from claude/laughing-bell-5a2nj1 on Jun 16, 2026
Conversation

@alexkroman

Collaborator

Adds support for transcribing entire podcast feeds by expanding feed URLs into batch mode, with one resumable sidecar per episode enclosure.

Summary

When a user passes a podcast RSS or Atom feed URL to assembly transcribe, the CLI now detects it, fetches the feed, extracts all episode enclosure URLs, and transcribes them as a batch. Each episode is transcribed independently with its own .aai.json sidecar, and re-runs skip already-transcribed episodes. Direct media URLs and non-feed pages continue to work as before (single-source path).

Key Changes

  • New module aai_cli/app/transcribe/feed.py: Implements feed detection, fetching, and parsing

    • feed_episode_urls(): Main entry point that gates on URL shape, checks for yt-dlp pages, fetches the feed, and parses enclosures
    • _looks_like_feed_url(): Detects feed-shaped URLs (extensionless, or with a .xml/.rss/.atom extension)
    • _episode_urls(): Parses RSS and Atom feeds, validates the root element, extracts enclosure URLs, dedupes while preserving order, and HTML-unescapes URLs
    • _fetch(): Bounded HTTP fetch (10 MB cap) that skips binary media content types and handles network errors gracefully
  • Updated aai_cli/app/transcribe/sources.py: Integrates feed detection into the batch-vs-single-source routing

    • Added detect_feeds parameter to expand_sources() (defaults to True)
    • Routes feed URLs to feed.feed_episode_urls() for batch expansion
    • Extracted local-path logic into _local_sources() helper
    • --show-code passes detect_feeds=False to skip network probes
  • Updated aai_cli/commands/transcribe.py: Documentation and help text

    • Updated argument help to mention "podcast RSS feed"
    • Added example: "Transcribe a whole podcast feed"
    • Updated docstring to explain feed expansion in batch mode
  • Updated help snapshots and reference docs: Reflect feed support in command descriptions
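
The URL-shape gate described for _looks_like_feed_url() could look roughly like the sketch below. This is not the code from the PR: the function name, the FEED_SUFFIXES constant, and the exact rules are assumptions reconstructed from the description (http(s) only; extensionless or .xml/.rss/.atom paths).

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed suffix set, taken from the PR description; "" means extensionless.
FEED_SUFFIXES = {"", ".xml", ".rss", ".atom"}


def looks_like_feed_url(url: str) -> bool:
    """Return True when the URL is http(s) and its path is feed-shaped."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.netloc:
        return False  # local paths and other schemes are never probed
    # Only the path is inspected, so query strings do not affect the result.
    suffix = PurePosixPath(parts.path).suffix.lower()
    return suffix in FEED_SUFFIXES


print(looks_like_feed_url("https://example.com/podcast/feed"))   # feed-shaped
print(looks_like_feed_url("https://example.com/episode-1.mp3"))  # direct media
```

Because the check is purely syntactic, it costs no network traffic; a URL that passes still has to be fetched and parsed as a feed before it is expanded.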

Implementation Details

  • Detection is deliberately narrow: Only http(s) URLs with feed-shaped paths are probed; direct media URLs (.mp3, etc.) and ordinary web pages are never fetched, avoiding double-fetches and unnecessary network calls
  • Feed validation: A response must have an <rss> or <feed> root element before enclosures are trusted, so stray HTML pages containing the word "enclosure" are never mistaken for feeds
  • Deduplication: Episode URLs are deduped while preserving feed order (newest first), so duplicate enclosures in a feed don't create duplicate batch jobs
  • Bounded fetch: Capped at 10 MB to prevent hostile or huge feeds from exhausting memory; 10 MB already holds thousands of episodes
  • Graceful fallthrough: Network errors, non-feed responses, and feeds without enclosures all return None, leaving the URL on the single-source path untouched
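
The validation, deduplication, and fallthrough rules above can be illustrated with a standard-library sketch. The PR itself started with regex extraction and later switched to feedparser, so the ElementTree approach here is a swapped-in technique for illustration only; the function name episode_urls is hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def episode_urls(body: bytes) -> list[str] | None:
    """Return deduped enclosure URLs in feed order, or None if not a feed."""
    # Note: a production parser for untrusted XML should be hardened;
    # this sketch only demonstrates the control flow.
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # not XML: leave the URL on the single-source path

    urls: list[str] = []
    if root.tag == "rss":
        # RSS 2.0: <item><enclosure url="..."/></item>
        for enclosure in root.iter("enclosure"):
            urls.append(enclosure.get("url", ""))
    elif root.tag == f"{ATOM_NS}feed":
        # Atom: <entry><link rel="enclosure" href="..."/></entry>
        for link in root.iter(f"{ATOM_NS}link"):
            if link.get("rel") == "enclosure":
                urls.append(link.get("href", ""))
    else:
        return None  # e.g. an HTML page that merely mentions "enclosure"

    # dict.fromkeys dedupes while keeping first-seen (feed) order.
    # The XML parser has already decoded entities such as &amp;.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    return unique or None  # a feed without enclosures also falls through
```

Returning None for every non-feed outcome keeps the caller simple: anything other than a non-empty list means "treat the URL as a single source".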

Testing

Comprehensive test suite in tests/test_transcribe_feed.py (311 lines) covers:

  • RSS 2.0 and Atom parsing with various attribute orders
  • HTML entity unescaping (&amp; → &)
  • Deduplication while preserving order
  • Feed-shaped URL detection (extensionless, .xml, .rss, .atom)
  • HTTP error handling, binary media skipping, byte-cap truncation
  • Network error resilience
  • End-to-end CLI runs with faked network (socket-blocked test suite)
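
The binary-media skip, byte-cap truncation, and network-error cases listed above can be exercised without a network if the fetch logic is written against an iterator of chunks. The sketch below takes that approach; the names is_binary_media, read_bounded, and fetch_body, and the choice of audio/ and video/ as the "binary media" prefixes, are assumptions rather than the PR's actual code (which uses httpx).

```python
from __future__ import annotations

from collections.abc import Iterable

MAX_FEED_BYTES = 10 * 1024 * 1024  # the 10 MB cap from the PR description
MEDIA_PREFIXES = ("audio/", "video/")  # assumed definition of "binary media"


def is_binary_media(content_type: str) -> bool:
    """True for media content types, ignoring parameters and letter case."""
    main_type = content_type.split(";", 1)[0].strip().lower()
    return main_type.startswith(MEDIA_PREFIXES)


def read_bounded(chunks: Iterable[bytes], cap: int = MAX_FEED_BYTES) -> bytes:
    """Concatenate chunks, truncating once `cap` bytes have been collected."""
    buf = bytearray()
    for chunk in chunks:
        buf += chunk[: cap - len(buf)]
        if len(buf) >= cap:
            break  # stop pulling from the stream once the cap is reached
    return bytes(buf)


def fetch_body(
    content_type: str, chunks: Iterable[bytes], cap: int = MAX_FEED_BYTES
) -> bytes | None:
    """Return the bounded body, or None when the response should be skipped."""
    if is_binary_media(content_type):
        return None  # a direct media response is never parsed as a feed
    try:
        return read_bounded(chunks, cap)
    except OSError:
        return None  # network failure: fall through to the single-source path
```

In a test, `chunks` can be a plain list or a generator that raises partway through, which mirrors how a faked, socket-blocked suite might drive the same code paths.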

https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq

claude added 3 commits June 16, 2026 21:27
`assembly transcribe <feed-url>` now expands a podcast RSS/Atom feed into its
episode enclosure URLs and runs them through the existing batch path — one
`.aai.json` sidecar per episode, resumable, concurrent, and compatible with
`--llm`/`--llm-reduce`. The enclosures are direct media URLs the API fetches
itself, so no per-episode yt-dlp download is needed (unlike a podcast *page*).

Detection is deliberately narrow to avoid surprise fetches: only an http(s) URL
whose path is feed-shaped (extensionless or `.xml`/`.rss`/`.atom`) and that no
dedicated yt-dlp extractor already claims is probed, the response body is bounded
to 10 MB, binary media content types are skipped, and only content that actually
parses as a feed with at least one enclosure is treated as one. `--show-code`
skips the probe entirely so it never touches the network.

Docs (README, transcribe help/docstring, aai-cli skill reference) updated to
list RSS feeds alongside files, URLs, and YouTube/podcast pages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
Swap the regex-based RSS/Atom enclosure extraction in feed.py for `feedparser`,
the de-facto standard feed parser — it handles the namespace, encoding, and
malformed-markup edge cases a regex never will. The bounded, content-type-guarded
httpx fetch stays the only network path: feedparser is handed the already-fetched
bytes (never a URL), so it never fetches on its own.

feedparser's result is untyped, so it's validated through a small pydantic model
(the project pattern for untyped third-party returns — cf. core/wer.py), keeping
feed.py strict-clean under mypy and pyright.

Adds feedparser as a runtime dependency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
Comment thread aai_cli/app/transcribe/feed.py Outdated


class _Entry(BaseModel):
    enclosures: list[_Enclosure] = []


_Entry.enclosures uses a mutable list default (=[]). Use a default factory (e.g., Field(default_factory=list)) so each instance gets its own list instead of sharing one across requests.

Details

✨ AI Reasoning
A model field is defined with a mutable list literal as its default. Mutable defaults in module-level class definitions can be shared across instances, causing data from one request or parse to appear in another. This change introduces shared mutable state where each parsed feed/entry should have its own fresh list.

🔧 How do I fix it?
Avoid storing request-specific data in module-level variables. Use request-scoped variables or explicitly mark shared caches as intentional.


Replace the `= []` field defaults on the pydantic feed models with
`Field(default_factory=list[...])`. pydantic v2 already deep-copies mutable
defaults per instance, so `= []` was not actually shared — but the explicit
typed factory makes per-instance isolation obvious to readers and static
analysis, while keeping the field's element type known under pyright strict
(a bare `default_factory=list` infers `list[Unknown]`).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
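
The point made in this commit message can be checked with a small experiment. The sketch below assumes pydantic v2 is installed; the model names and the href field are hypothetical stand-ins for the PR's private _Entry and _Enclosure models, whose real fields are not shown here.

```python
from pydantic import BaseModel, Field


class Enclosure(BaseModel):
    href: str  # hypothetical field, for illustration only


class LiteralDefaultEntry(BaseModel):
    # The original form: pydantic copies this default for each instance.
    enclosures: list[Enclosure] = []


class FactoryEntry(BaseModel):
    # list[Enclosure] is callable and returns a fresh empty list, so it
    # works as a factory while keeping the element type visible to checkers.
    enclosures: list[Enclosure] = Field(default_factory=list[Enclosure])


for model in (LiteralDefaultEntry, FactoryEntry):
    first, second = model(), model()
    first.enclosures.append(Enclosure(href="https://cdn.example.com/a.mp3"))
    # In both variants the second instance keeps its own, still empty, list.
    print(model.__name__, len(first.enclosures), len(second.enclosures))
```

Both variants behave the same at runtime, which matches the commit message: the change is about readability and static typing rather than a behavioral fix.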
@alexkroman alexkroman enabled auto-merge June 16, 2026 22:22
@alexkroman alexkroman added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit bacfac5 Jun 16, 2026
19 checks passed
@alexkroman alexkroman deleted the claude/laughing-bell-5a2nj1 branch June 16, 2026 22:26