Add podcast RSS/Atom feed expansion to transcribe command (#195)

alexkroman · claude · web-flow · commit bacfac52cd95 · 2026-06-16T22:22:38.000Z
Adds support for transcribing entire podcast feeds by expanding feed URLs into batch mode, with one resumable sidecar per episode enclosure.

## Summary

When a user passes a podcast RSS or Atom feed URL to `assembly transcribe`, the CLI now detects it, fetches the feed, extracts all episode enclosure URLs, and transcribes them as a batch. Each episode is transcribed independently with its own `.aai.json` sidecar, and re-runs skip already-transcribed episodes. Direct media URLs and non-feed pages continue to work as before (single-source path).

## Key Changes

- **New module `aai_cli/app/transcribe/feed.py`**: Implements feed detection, fetching, and parsing
  - `feed_episode_urls()`: Main entry point that gates on URL shape, checks for yt-dlp pages, fetches the feed, and parses enclosures
  - `_looks_like_feed_url()`: Detects feed-shaped URLs (extensionless or `.xml`/`.rss`/`.atom` extensions)
  - `_episode_urls()`: Parses the fetched body with `feedparser`, requires a recognized RSS or Atom feed, extracts enclosure URLs, and dedupes while preserving order
  - `_fetch()`: Bounded HTTP fetch (10 MB cap) that skips binary media content types and handles network errors gracefully
- **Updated `aai_cli/app/transcribe/sources.py`**: Integrates feed detection into the batch-vs-single-source routing
  - Added `detect_feeds` parameter to `expand_sources()` (defaults to `True`)
  - Routes feed URLs to `feed.feed_episode_urls()` for batch expansion
  - Extracted local-path logic into `_local_sources()` helper
  - `--show-code` passes `detect_feeds=False` to skip network probes
- **Updated `aai_cli/commands/transcribe.py`**: Documentation and help text
  - Updated argument help to mention "podcast RSS feed"
  - Added example: "Transcribe a whole podcast feed"
  - Updated docstring to explain feed expansion in batch mode
- **Updated help snapshots and reference docs**: Reflect feed support in command descriptions

## Implementation Details

- **Detection is deliberately narrow**: Only http(s) URLs with feed-shaped paths are probed; direct media URLs (`.mp3`, etc.) are never fetched, and URLs that a yt-dlp extractor already claims are left to the existing download path, avoiding double-fetches and unnecessary network calls
- **Feed validation**: `feedparser` must recognize the body as an RSS or Atom feed (a non-empty `version`) before enclosures are trusted, so stray HTML pages containing the word "enclosure" are never mistaken for feeds
- **Deduplication**: Episode URLs are deduped while preserving feed order (newest first), so duplicate enclosures in a feed don't create duplicate batch jobs
- **Bounded fetch**: Capped at 10 MB to prevent hostile or huge feeds from exhausting memory; 10 MB already holds thousands of episodes
- **Graceful fallthrough**: Network errors, non-feed responses, and feeds without enclosures all return `None`, leaving the URL on the single-source path untouched

## Testing

Comprehensive test suite in `tests/test_transcribe_feed.py` (311 lines) covers:

- RSS 2.0 and Atom parsing with various attribute orders
- HTML entity unescaping (`&amp;` → `&`)
- Deduplication while preserving order
- Feed-shaped URL detection (extensionless, `.xml`, `.rss`, `.atom`)
- HTTP error handling, binary media skipping, byte-cap truncation
- Network error resilience
- End-to-end CLI runs with faked network (socket-blocked test suite)

https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq

---------

Co-authored-by: Claude <noreply@anthropic.com>
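
A minimal, standalone sketch of the URL-shape gate and the order-preserving dedupe described above. It is an illustration, not the code in `feed.py`: the names `FEED_SUFFIXES`, `looks_like_feed_url`, and `dedupe_in_order` are hypothetical, and the explicit scheme check is an assumption (in the diff, the caller already filters on URL prefixes before the feed probe runs).

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical constant for illustration: extensionless paths or feed documents.
FEED_SUFFIXES = frozenset({"", ".xml", ".rss", ".atom"})


def looks_like_feed_url(url: str) -> bool:
    """True when `url` is http(s) and its path has a feed-shaped suffix."""
    parts = urlsplit(url)
    # Assumption: the scheme is checked here; the real module relies on its caller.
    if parts.scheme not in ("http", "https"):
        return False
    return PurePosixPath(parts.path).suffix.lower() in FEED_SUFFIXES


def dedupe_in_order(urls: list[str]) -> list[str]:
    """Drop repeated URLs while keeping the first occurrence's position."""
    # dict preserves insertion order, so the keys come back in feed order.
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    print(looks_like_feed_url("https://feeds.simplecast.com/54nAGcIl"))
    print(looks_like_feed_url("https://example.com/episode.mp3"))
    print(dedupe_in_order(["https://a/1.mp3", "https://a/2.mp3", "https://a/1.mp3"]))
```

Gating on the suffix before any network call is what keeps a direct `.mp3` URL from being fetched twice, and `dict.fromkeys` keeps the newest-first order that a sort-based dedupe would lose.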
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 [![License](https://img.shields.io/badge/license-MIT-D6402E)](https://github.com/AssemblyAI/cli/blob/main/LICENSE)
 [![Docs](https://img.shields.io/badge/docs-assemblyai-D6402E)](https://www.assemblyai.com/docs)
 
-The AssemblyAI CLI (`assembly`) brings speech AI directly into your terminal: transcribe files, URLs, and YouTube/podcast pages, stream live audio, talk to a two-way voice agent, prompt the LLM Gateway, benchmark speech models, and scaffold ready-to-deploy starter apps.
+The AssemblyAI CLI (`assembly`) brings speech AI directly into your terminal: transcribe files, URLs, YouTube/podcast pages, and whole podcast RSS feeds, stream live audio, talk to a two-way voice agent, prompt the LLM Gateway, benchmark speech models, and scaffold ready-to-deploy starter apps.
 
 <p align="center">
   <img src="assets/welcome.png" alt="The assembly CLI welcome screen, listing command groups for transcription, streaming, voice agents, app scaffolding, and account management" width="820">
@@ -44,7 +44,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 
 | Command | What it does |
 | :--- | :--- |
-| `assembly transcribe` | Transcribe files, URLs, YouTube/podcast pages, directories, globs, or bucket storage (`s3://`, `gs://`, `az://`) — with speaker labels, PII redaction, summarization, SRT/VTT captions, and resumable batch runs |
+| `assembly transcribe` | Transcribe files, URLs, YouTube/podcast pages, podcast RSS feeds, directories, globs, or bucket storage (`s3://`, `gs://`, `az://`) — with speaker labels, PII redaction, summarization, SRT/VTT captions, and resumable batch runs |
 | `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
 | `assembly dictate` | Push-to-talk dictation: press Enter to record, Enter again for instant text (Sync STT API, up to 120 s per utterance) |
 | `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
@@ -285,11 +285,13 @@ assembly transcribe video.mp4 -o srt   # captions
 assembly transcribe call.mp3 --speaker-labels --summarization --json
 ```
 
-Transcribe in batches — a directory, a glob, or a piped list, resumable on re-run:
+Transcribe in batches — a directory, a glob, a piped list, or a whole podcast
+RSS feed (every episode becomes one source), resumable on re-run:
 
 ```sh
 assembly transcribe ./recordings
 assembly transcribe "s3://bucket/calls/*.mp3"   # needs: pip install s3fs
+assembly transcribe "https://feeds.simplecast.com/54nAGcIl"   # every episode in the feed
 find . -name "*.wav" | assembly transcribe --from-stdin
 ```
 
diff --git a/aai_cli/app/transcribe/feed.py b/aai_cli/app/transcribe/feed.py
@@ -0,0 +1,123 @@
+"""Podcast RSS/Atom feed expansion for ``assembly transcribe``.
+
+A feed URL names a whole show, so transcribing it means transcribing every
+episode. ``feed_episode_urls`` fetches the URL and, when ``feedparser`` recognizes
+it as an RSS or Atom feed, returns its episode enclosure URLs (in feed order —
+newest first) for the batch path to transcribe, one resumable sidecar per episode.
+The enclosures are direct media URLs the API fetches itself, so — unlike a YouTube
+or podcast *page*, which yt-dlp downloads first — no local download step is needed.
+
+Detection is deliberately narrow so a direct media URL or ordinary web page still
+falls through to the single-source path untouched (and is never fetched twice):
+only an http(s) URL whose path is feed-shaped — no extension, or one of
+``.xml``/``.rss``/``.atom`` — and that no dedicated yt-dlp extractor already claims
+is sniffed, the response body is bounded, and only content ``feedparser`` parses as
+a real feed with at least one enclosure is treated as a feed. We hand ``feedparser``
+the already-fetched bytes (never the URL) so our bounded, safe fetch below stays the
+only network path.
+"""
+
+from __future__ import annotations
+
+from pathlib import PurePosixPath
+from urllib.parse import urlsplit
+
+from pydantic import BaseModel, Field
+
+from aai_cli.core import youtube
+
+# A feed lives at an extensionless URL (e.g. feeds.simplecast.com/<id>) or a feed
+# document (.xml/.rss/.atom). Every other path — .mp3, .txt, .pdf — is never a feed,
+# so it is left for the single-source path and never fetched here.
+_FEED_URL_SUFFIXES = frozenset({"", ".xml", ".rss", ".atom"})
+
+# Bound the download so a hostile or huge URL can't exhaust memory; 10 MB of feed
+# already holds thousands of episodes, far past any realistic batch.
+_MAX_FEED_BYTES = 10 * 1024 * 1024  # pragma: no mutate -- tuning knob, not behavior
+_FETCH_TIMEOUT_SECONDS = 15.0  # pragma: no mutate -- tuning knob, not behavior
+
+
+class _Enclosure(BaseModel):
+    """One ``<enclosure>`` / Atom enclosure link; ``href`` is the media URL."""
+
+    href: str = ""
+
+
+class _Entry(BaseModel):
+    # default_factory (not a shared `= []`) so each entry gets its own list, and the
+    # typed factory keeps the field's element type known under pyright strict.
+    enclosures: list[_Enclosure] = Field(default_factory=list[_Enclosure])
+
+
+class _ParsedFeed(BaseModel):
+    """The slice of feedparser's untyped result we use, validated into a real type
+    (the project pattern for untyped third-party returns — cf. core/wer.py)."""
+
+    # feedparser sets ``version`` to a non-empty id ("rss20", "atom10", …) for a
+    # recognized feed and to "" for anything it doesn't recognize as one.
+    version: str = ""
+    entries: list[_Entry] = Field(default_factory=list[_Entry])
+
+
+def feed_episode_urls(url: str) -> list[str] | None:
+    """The episode media URLs if `url` is a podcast feed, else ``None``.
+
+    Returns ``None`` (stay single-source) for a direct-media URL, a yt-dlp page,
+    an unreachable URL, or any content that isn't a feed carrying enclosures.
+    """
+    if not _looks_like_feed_url(url) or youtube.is_downloadable_url(url):
+        return None
+    body = _fetch(url)
+    if body is None:
+        return None
+    return _episode_urls(body)
+
+
+def _looks_like_feed_url(url: str) -> bool:
+    """True when the URL path is feed-shaped: extensionless or a feed document."""
+    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
+    return suffix in _FEED_URL_SUFFIXES
+
+
+def _episode_urls(body: str) -> list[str] | None:
+    """The enclosure URLs in a feed body, deduped in document order; ``None`` when
+    feedparser doesn't recognize it as a feed or it carries no enclosures."""
+    import feedparser
+
+    # feedparser ships only partial inline types (its parse signature is Unknown),
+    # so the result is validated through _ParsedFeed below; mirror remotefs.py's
+    # fsspec shim in ignoring the unavoidable unknown-member report on the call.
+    raw = feedparser.parse(body)  # pyright: ignore[reportUnknownMemberType]
+    parsed = _ParsedFeed.model_validate(raw)
+    if not parsed.version:
+        return None
+    urls = [enc.href for entry in parsed.entries for enc in entry.enclosures if enc.href]
+    deduped = list(dict.fromkeys(urls))
+    return deduped or None
+
+
+def _fetch(url: str) -> str | None:
+    """Up to ``_MAX_FEED_BYTES`` of `url` decoded as text, or ``None`` on any failure
+    or when the response is obviously binary media (audio/video/image)."""
+    import httpx2 as httpx
+
+    chunks: list[bytes] = []
+    try:
+        with (
+            httpx.Client(timeout=_FETCH_TIMEOUT_SECONDS, follow_redirects=True) as client,
+            client.stream("GET", url) as response,
+        ):
+            if not response.is_success:
+                return None
+            content_type = response.headers.get("content-type", "").lower()
+            if content_type.startswith(("audio/", "video/", "image/")):
+                return None
+            total = 0
+            for chunk in response.iter_bytes():
+                chunks.append(chunk)
+                total += len(chunk)
+                if total >= _MAX_FEED_BYTES:
+                    break
+    except (httpx.HTTPError, OSError):
+        return None
+    return b"".join(chunks).decode("utf-8", "replace")
diff --git a/aai_cli/app/transcribe/run.py b/aai_cli/app/transcribe/run.py
@@ -356,7 +356,12 @@ def run_transcribe(opts: TranscribeOptions, state: AppState, *, json_mode: bool)
     transcribe_validate.validate_speakers_expected(merged)
 
     sources = transcribe_sources.expand_sources(
-        opts.source, from_stdin=opts.from_stdin, sample=opts.sample
+        opts.source,
+        from_stdin=opts.from_stdin,
+        sample=opts.sample,
+        # --show-code must never touch the network; skip the feed probe and treat a
+        # URL as a single source for code generation.
+        detect_feeds=not opts.show_code,
     )
     if sources is not None:
         transcribe_sources.reject_single_source_flags(
diff --git a/aai_cli/app/transcribe/sources.py b/aai_cli/app/transcribe/sources.py
@@ -49,24 +49,41 @@
 _GLOB_CHARS = frozenset("*?[")
 
 
-def expand_sources(source: str | None, *, from_stdin: bool, sample: bool) -> list[str] | None:
+def expand_sources(
+    source: str | None, *, from_stdin: bool, sample: bool, detect_feeds: bool = True
+) -> list[str] | None:
     """The batch source list, or ``None`` when this is a single-source invocation.
 
     Batch mode triggers on ``--from-stdin``, a directory (scanned recursively for
-    audio files), a glob pattern that names no existing file, or a bucket URL
-    that is a glob or trailing-slash folder. A plain file, URL, ``-`` (audio
-    piped on stdin), or ``--sample`` stays on the single-source path.
+    audio files), a glob pattern that names no existing file, a bucket URL that is
+    a glob or trailing-slash folder, or an http(s) URL that turns out to be a
+    podcast RSS/Atom feed (each episode becomes one batch source). A plain file,
+    direct media URL, ``-`` (audio piped on stdin), or ``--sample`` stays on the
+    single-source path. ``detect_feeds=False`` skips the feed probe (and its
+    network fetch) for paths that must not touch the network, e.g. ``--show-code``.
     """
     if from_stdin:
         return _stdin_sources(source, sample=sample)
     # `not source` (rather than `is None`) also catches the empty string — e.g. an
     # unset shell variable in `assembly transcribe "$FILE"`. `Path("")` is `Path(".")`,
     # so it would otherwise fall into the directory branch and batch-transcribe the
     # whole working directory; instead it stays single-source and fails validation.
-    if not source or sample or source == "-" or source.startswith(URL_PREFIXES):
+    if not source or sample or source == "-":
         return None
+    if source.startswith(URL_PREFIXES):
+        # A podcast feed URL expands into its episode enclosure URLs (batch mode);
+        # a direct media URL or ordinary page returns None and stays single-source.
+        from aai_cli.app.transcribe import feed
+
+        return feed.feed_episode_urls(source) if detect_feeds else None
     if remotefs.is_remote_url(source):
         return _remote_sources(source)
+    return _local_sources(source)
+
+
+def _local_sources(source: str) -> list[str] | None:
+    """Batch sources for a local path: a directory's audio files or a glob's matches,
+    else ``None`` (a single file, which the single-source path handles)."""
     path = Path(source)
     if path.is_dir():
         return _directory_sources(path)
diff --git a/aai_cli/commands/transcribe.py b/aai_cli/commands/transcribe.py
@@ -31,6 +31,10 @@
             ("Try it with the hosted sample", "assembly transcribe --sample"),
             ("Transcribe a YouTube video", "assembly transcribe https://youtu.be/dtp6b76pMak"),
             ("Transcribe a podcast page", 'assembly transcribe "https://podcasts.apple.com/…"'),
+            (
+                "Transcribe a whole podcast feed",
+                'assembly transcribe "https://feeds.simplecast.com/…"',
+            ),
             ("Label who said what", "assembly transcribe call.mp3 --speaker-labels"),
             ("Redact PII for compliance", "assembly transcribe call.mp3 --redact-pii"),
             ("Summarize a recording", "assembly transcribe call.mp3 --summarization"),
@@ -43,8 +47,8 @@ def transcribe(
     ctx: typer.Context,
     source: str | None = typer.Argument(
         None,
-        help="Audio file, URL, YouTube/podcast URL, bucket URL (s3://, gs://, …), or a "
-        "directory/glob (batch mode)",
+        help="Audio file, URL, YouTube/podcast URL, podcast RSS feed, bucket URL "
+        "(s3://, gs://, …), or a directory/glob (batch mode)",
     ),
     sample: bool = typer.Option(False, "--sample", help="Use the hosted wildfires.mp3 sample"),
     # batch mode
@@ -362,10 +366,11 @@ def transcribe(
     URLs (any page yt-dlp can extract) are downloaded first, then transcribed.
 
     Batch mode: pass a directory or glob (or pipe a list with --from-stdin) to
-    transcribe many sources concurrently. Each source gets a .aai.json sidecar
-    with the full result (including any --llm responses), and a re-run skips
-    sources already transcribed — with changed --llm prompts it replays just
-    the LLM step, never a second transcription.
+    transcribe many sources concurrently. A podcast RSS/Atom feed URL also expands
+    to batch mode — every episode enclosure becomes one source. Each source gets a
+    .aai.json sidecar with the full result (including any --llm responses), and a
+    re-run skips sources already transcribed — with changed --llm prompts it
+    replays just the LLM step, never a second transcription.
 
     Bucket URLs (s3://, gs://, az://, sftp://, …) work for single files and for
     batches (a glob, or a folder ending in /); install the matching fsspec
diff --git a/aai_cli/skills/aai-cli/references/transcription.md b/aai_cli/skills/aai-cli/references/transcription.md
@@ -5,12 +5,14 @@ Five commands. All accept `--json` (auto-enabled when piped); `transcribe`,
 `transcribe`, `stream`, and `agent` accept `--show-code` to print equivalent
 Python SDK code without calling the API.
 
-## `assembly transcribe [SOURCE]` — file / URL / YouTube / podcast page
+## `assembly transcribe [SOURCE]` — file / URL / YouTube / podcast page / RSS feed
 
 `SOURCE` is a local file path, public URL, or a media-page URL yt-dlp can extract
 (YouTube, Apple Podcasts, Spreaker, SoundCloud, …) — those are downloaded first.
-Use `--sample` for the hosted `wildfires.mp3`. Analysis results (summary,
-chapters, sentiment, …) render automatically in human mode.
+A podcast RSS/Atom feed URL expands into a resumable batch run over every episode
+enclosure (one `.aai.json` sidecar apiece). Use `--sample` for the hosted
+`wildfires.mp3`. Analysis results (summary, chapters, sentiment, …) render
+automatically in human mode.
 
 High-value flags (run `assembly transcribe --help` for the full set):
 
@@ -37,6 +39,7 @@ assembly transcribe --sample
 assembly transcribe call.mp3 --speaker-labels --speakers-expected 2 --redact-pii
 assembly transcribe call.mp3 -o text
 assembly transcribe call.mp3 --show-code
+assembly transcribe "https://feeds.simplecast.com/54nAGcIl"   # every episode in the feed
 ```
 
 ## `assembly stream [SOURCE]` — live real-time transcription
diff --git a/pyproject.toml b/pyproject.toml
@@ -56,6 +56,11 @@ dependencies = [
     # imported lazily). fsspec core only — each protocol's backend (s3fs, gcsfs, adlfs,
     # …) stays a user-installed extra surfaced via a clean install hint.
     "fsspec>=2026.4.0",
+    # Podcast RSS/Atom feed parsing for `assembly transcribe <feed-url>` (feed.py,
+    # imported lazily). The de-facto standard feed parser; pure-Python, no compiled
+    # deps. We hand it already-fetched bytes (never a URL) so our bounded, safe
+    # httpx fetch stays the only network path.
+    "feedparser>=6.0.11",
 ]
 
 [project.urls]
diff --git a/tests/__snapshots__/test_snapshots_help_run.ambr b/tests/__snapshots__/test_snapshots_help_run.ambr
diff --git a/tests/test_transcribe_feed.py b/tests/test_transcribe_feed.py
diff --git a/uv.lock b/uv.lock