Add podcast RSS/Atom feed expansion to transcribe command by alexkroman · Pull Request #195 · AssemblyAI/cli

alexkroman · 2026-06-16T21:45:45Z

Adds support for transcribing entire podcast feeds by expanding feed URLs into batch mode, with one resumable sidecar per episode enclosure.

Summary

When a user passes a podcast RSS or Atom feed URL to assembly transcribe, the CLI now detects it, fetches the feed, extracts all episode enclosure URLs, and transcribes them as a batch. Each episode is transcribed independently with its own .aai.json sidecar, and re-runs skip already-transcribed episodes. Direct media URLs and non-feed pages continue to work as before (single-source path).

Key Changes

New module aai_cli/app/transcribe/feed.py: Implements feed detection, fetching, and parsing
- feed_episode_urls(): Main entry point that gates on URL shape, checks for yt-dlp pages, fetches the feed, and parses enclosures
- _looks_like_feed_url(): Detects feed-shaped URLs (extensionless or .xml/.rss/.atom extensions)
- _episode_urls(): Parses RSS and Atom feeds, validates the root element, extracts enclosure URLs, dedupes while preserving order, and HTML-unescapes URLs
- _fetch(): Bounded HTTP fetch (10 MB cap) that skips binary media content types and handles network errors gracefully
Updated aai_cli/app/transcribe/sources.py: Integrates feed detection into the batch-vs-single-source routing
- Added detect_feeds parameter to expand_sources() (defaults to True)
- Routes feed URLs to feed.feed_episode_urls() for batch expansion
- Extracted local-path logic into _local_sources() helper
- --show-code passes detect_feeds=False to skip network probes
Updated aai_cli/commands/transcribe.py: Documentation and help text
- Updated argument help to mention "podcast RSS feed"
- Added example: "Transcribe a whole podcast feed"
- Updated docstring to explain feed expansion in batch mode
Updated help snapshots and reference docs: Reflect feed support in command descriptions
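
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the real expand_sources() signature is only known to take a detect_feeds parameter, the feed_lookup argument is a hypothetical stand-in for feed.feed_episode_urls(), and the yt-dlp check and local-path handling are omitted.

```python
from __future__ import annotations

from collections.abc import Callable
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extensions treated as feed-shaped; "" covers extensionless paths.
FEED_SUFFIXES = {"", ".xml", ".rss", ".atom"}


def looks_like_feed_url(url: str) -> bool:
    """Return True only for http(s) URLs whose path looks like a feed."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    suffix = PurePosixPath(parts.path).suffix.lower()
    return suffix in FEED_SUFFIXES


def expand_sources(
    sources: list[str],
    *,
    detect_feeds: bool = True,
    # Hypothetical injection point standing in for feed.feed_episode_urls().
    feed_lookup: Callable[[str], list[str] | None] = lambda url: None,
) -> list[str]:
    """Expand feed URLs into episode URLs; leave every other source as-is."""
    expanded: list[str] = []
    for source in sources:
        episodes = None
        if detect_feeds and looks_like_feed_url(source):
            episodes = feed_lookup(source)
        # A None or empty result keeps the source on the single-source path.
        expanded.extend(episodes if episodes else [source])
    return expanded
```

Passing detect_feeds=False, as --show-code does, means feed_lookup is never called, so no network probe can happen on that path.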

Implementation Details

Detection is deliberately narrow: Only http(s) URLs with feed-shaped paths are probed; direct media URLs (.mp3, etc.) and ordinary web pages are never fetched, avoiding double-fetches and unnecessary network calls
Feed validation: A response must have an <rss> or <feed> root element before enclosures are trusted, so stray HTML pages containing the word "enclosure" are never mistaken for feeds
Deduplication: Episode URLs are deduped while preserving feed order (newest first), so duplicate enclosures in a feed don't create duplicate batch jobs
Bounded fetch: Capped at 10 MB to prevent hostile or huge feeds from exhausting memory; 10 MB already holds thousands of episodes
Graceful fallthrough: Network errors, non-feed responses, and feeds without enclosures all return None, leaving the URL on the single-source path untouched
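
To make the validation, deduplication, and fallthrough rules concrete, here is a small sketch using the standard library's xml.etree.ElementTree. This is a swapped-in technique for illustration only: the PR first used a regex and later moved to feedparser, and the function name episode_urls is hypothetical. ElementTree is not hardened against maliciously constructed XML, so treat this as a demonstration of the rules rather than a production parser.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def episode_urls(body: bytes) -> list[str] | None:
    """Return deduped enclosure URLs in feed order, or None if not a feed."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # not XML at all: fall through to the single-source path

    if root.tag == "rss":
        # RSS 2.0: <enclosure url="..."/> inside each <item>.
        urls = [e.get("url") for e in root.iter("enclosure")]
    elif root.tag == f"{ATOM_NS}feed":
        # Atom: <link rel="enclosure" href="..."/> inside each <entry>.
        urls = [
            link.get("href")
            for link in root.iter(f"{ATOM_NS}link")
            if link.get("rel") == "enclosure"
        ]
    else:
        return None  # e.g. an HTML page that merely mentions "enclosure"

    # dict.fromkeys drops duplicates while keeping first-seen order.
    deduped = list(dict.fromkeys(u for u in urls if u))
    return deduped or None  # a feed without enclosures also falls through
```

The XML parser already decodes entities such as &amp; in attribute values, so no separate unescape step is needed in this variant.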

Testing

Comprehensive test suite in tests/test_transcribe_feed.py (311 lines) covers:

RSS 2.0 and Atom parsing with various attribute orders
HTML entity unescaping (&amp;amp; → &amp;)
Deduplication while preserving order
Feed-shaped URL detection (extensionless, .xml, .rss, .atom)
HTTP error handling, binary media skipping, byte-cap truncation
Network error resilience
End-to-end CLI runs with faked network (socket-blocked test suite)

https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq

`assembly transcribe <feed-url>` now expands a podcast RSS/Atom feed into its episode enclosure URLs and runs them through the existing batch path: one `.aai.json` sidecar per episode, resumable, concurrent, and compatible with `--llm`/`--llm-reduce`. The enclosures are direct media URLs the API fetches itself, so no per-episode yt-dlp download is needed (unlike a podcast *page*).

Detection is deliberately narrow to avoid surprise fetches. A URL is probed only if it is http(s), its path is feed-shaped (extensionless or `.xml`/`.rss`/`.atom`), and no dedicated yt-dlp extractor already claims it. The response body is bounded to 10 MB, binary media content types are skipped, and only content that actually parses as a feed with at least one enclosure is treated as one. `--show-code` skips the probe entirely so it never touches the network.

Docs (README, transcribe help/docstring, aai-cli skill reference) updated to list RSS feeds alongside files, URLs, and YouTube/podcast pages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
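
The bounded read and content-type guard mentioned in this commit can be illustrated without any network code. The sketch below assumes the body arrives as an iterable of byte chunks (as a streaming HTTP client would provide); the names read_capped and is_binary_media, the exact cap constant, and the audio/ and video/ prefixes are all assumptions for illustration, not taken from the PR.

```python
from collections.abc import Iterable

# 10 MB cap as described in the PR; the exact constant is an assumption.
MAX_FEED_BYTES = 10 * 1024 * 1024

# Assumed set of "binary media" content-type prefixes.
BINARY_PREFIXES = ("audio/", "video/")


def is_binary_media(content_type: str) -> bool:
    """True when the Content-Type header names audio or video media."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(BINARY_PREFIXES)


def read_capped(chunks: Iterable[bytes], cap: int = MAX_FEED_BYTES) -> bytes:
    """Accumulate chunks, truncating at cap bytes and stopping early."""
    buf = bytearray()
    for chunk in chunks:
        # Take only as many bytes as still fit under the cap.
        buf += chunk[: cap - len(buf)]
        if len(buf) >= cap:
            break  # stop consuming the stream once the cap is reached
    return bytes(buf)
```

Truncating instead of raising keeps the caller simple: an oversized or hostile body just yields a prefix that either still parses as a feed or falls through to the single-source path.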


Swap the regex-based RSS/Atom enclosure extraction in feed.py for `feedparser`, the de-facto standard feed parser, which handles the namespace, encoding, and malformed-markup edge cases a regex never will. The bounded, content-type-guarded httpx fetch stays the only network path: feedparser is handed the already-fetched bytes (never a URL), so it never fetches on its own. feedparser's result is untyped, so it's validated through a small pydantic model (the project pattern for untyped third-party returns; cf. core/wer.py), keeping feed.py strict-clean under mypy and pyright. Adds feedparser as a runtime dependency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq

aikido-pr-checks · 2026-06-16T22:13:07Z

+
+
+class _Entry(BaseModel):
+    enclosures: list[_Enclosure] = []


_Entry.enclosures uses a mutable list default (=[]). Use a default factory (e.g., Field(default_factory=list)) so each instance gets its own list instead of sharing one across requests.

Details

✨ AI Reasoning
A model field is defined with a mutable list literal as its default. Mutable defaults in module-level class definitions can be shared across instances, causing data from one request or parse to appear in another. This change introduces shared mutable state where each parsed feed/entry should have its own fresh list.

🔧 How do I fix it?
Avoid storing request-specific data in module-level variables. Use request-scoped variables or explicitly mark shared caches as intentional.


Replace the `= []` field defaults on the pydantic feed models with `Field(default_factory=list[...])`. pydantic v2 already deep-copies mutable defaults per instance, so `= []` was not actually shared. The explicit typed factory nonetheless makes per-instance isolation obvious to readers and static analysis, while keeping the field's element type known under pyright strict (a bare `default_factory=list` infers `list[Unknown]`).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq
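
A minimal sketch of the typed-factory pattern this commit describes, assuming pydantic v2 is installed. The model names (Enclosure, Entry, Feed) and the href field are hypothetical stand-ins for the PR's private models; only the enclosures field name appears in the reviewed diff.

```python
from pydantic import BaseModel, Field


class Enclosure(BaseModel):
    # Assumed field name for the enclosure's media URL.
    href: str = ""


class Entry(BaseModel):
    # list[Enclosure] is callable and returns a fresh empty list, so it
    # works as a factory while keeping the element type visible to checkers.
    enclosures: list[Enclosure] = Field(default_factory=list[Enclosure])


class Feed(BaseModel):
    entries: list[Entry] = Field(default_factory=list[Entry])


def enclosure_urls(parsed: dict) -> list[str]:
    """Validate an untyped parser result and collect non-empty enclosure URLs."""
    feed = Feed.model_validate(parsed)
    return [enc.href for entry in feed.entries for enc in entry.enclosures if enc.href]
```

Each Entry() gets its own list, so appending to one instance's enclosures never affects another, and an entry with no enclosures key simply validates to an empty list.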

claude added 3 commits June 16, 2026 21:27

Merge remote-tracking branch 'origin/main' into claude/laughing-bell-5a2nj1

a7af25b

aikido-pr-checks Bot reviewed Jun 16, 2026

alexkroman enabled auto-merge June 16, 2026 22:22

alexkroman added this pull request to the merge queue Jun 16, 2026

Merged via the queue into main with commit bacfac5 Jun 16, 2026
19 checks passed

alexkroman deleted the claude/laughing-bell-5a2nj1 branch June 16, 2026 22:26

alexkroman merged 4 commits into main from claude/laughing-bell-5a2nj1
