Commit bacfac5

alexkroman and claude authored
Add podcast RSS/Atom feed expansion to transcribe command (#195)
Adds support for transcribing entire podcast feeds by expanding feed URLs into batch mode, with one resumable sidecar per episode enclosure.

## Summary

When a user passes a podcast RSS or Atom feed URL to `assembly transcribe`, the CLI now detects it, fetches the feed, extracts all episode enclosure URLs, and transcribes them as a batch. Each episode is transcribed independently with its own `.aai.json` sidecar, and re-runs skip already-transcribed episodes. Direct media URLs and non-feed pages continue to work as before (single-source path).

## Key Changes

- **New module `aai_cli/app/transcribe/feed.py`**: Implements feed detection, fetching, and parsing
  - `feed_episode_urls()`: Main entry point that gates on URL shape, checks for yt-dlp pages, fetches the feed, and parses enclosures
  - `_looks_like_feed_url()`: Detects feed-shaped URLs (extensionless or `.xml`/`.rss`/`.atom` extensions)
  - `_episode_urls()`: Parses RSS and Atom feeds, validates the root element, extracts enclosure URLs, dedupes while preserving order, and HTML-unescapes URLs
  - `_fetch()`: Bounded HTTP fetch (10 MB cap) that skips binary media content types and handles network errors gracefully
- **Updated `aai_cli/app/transcribe/sources.py`**: Integrates feed detection into the batch-vs-single-source routing
  - Added `detect_feeds` parameter to `expand_sources()` (defaults to `True`)
  - Routes feed URLs to `feed.feed_episode_urls()` for batch expansion
  - Extracted local-path logic into `_local_sources()` helper
  - `--show-code` passes `detect_feeds=False` to skip network probes
- **Updated `aai_cli/commands/transcribe.py`**: Documentation and help text
  - Updated argument help to mention "podcast RSS feed"
  - Added example: "Transcribe a whole podcast feed"
  - Updated docstring to explain feed expansion in batch mode
- **Updated help snapshots and reference docs**: Reflect feed support in command descriptions

## Implementation Details

- **Detection is deliberately narrow**: Only http(s) URLs with feed-shaped paths are probed; direct media URLs (`.mp3`, etc.) and ordinary web pages are never fetched, avoiding double-fetches and unnecessary network calls
- **Feed validation**: A response must have an `<rss>` or `<feed>` root element before enclosures are trusted, so stray HTML pages containing the word "enclosure" are never mistaken for feeds
- **Deduplication**: Episode URLs are deduped while preserving feed order (newest first), so duplicate enclosures in a feed don't create duplicate batch jobs
- **Bounded fetch**: Capped at 10 MB to prevent hostile or huge feeds from exhausting memory; 10 MB already holds thousands of episodes
- **Graceful fallthrough**: Network errors, non-feed responses, and feeds without enclosures all return `None`, leaving the URL on the single-source path untouched

## Testing

Comprehensive test suite in `tests/test_transcribe_feed.py` (311 lines) covers:

- RSS 2.0 and Atom parsing with various attribute orders
- HTML entity unescaping (`&amp;` → `&`)
- Deduplication while preserving order
- Feed-shaped URL detection (extensionless, `.xml`, `.rss`, `.atom`)
- HTTP error handling, binary media skipping, byte-cap truncation
- Network error resilience
- End-to-end CLI runs with faked network (socket-blocked test suite)

https://claude.ai/code/session_01VwZxsDGG57kDQU4J39u3oq

---------

Co-authored-by: Claude <noreply@anthropic.com>
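
The validation and deduplication rules listed under "Implementation Details" can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not the commit's code path: the diff delegates parsing to `feedparser`, while the function name `enclosure_urls` and the `cdn.example.com` URLs here are hypothetical. `xml.etree.ElementTree` is also not hardened against malicious XML, so treat this as a way to see the rules in action rather than as a drop-in parser.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Atom elements live in this namespace; RSS 2.0 elements are un-namespaced.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def enclosure_urls(body: str) -> list[str] | None:
    """Hypothetical helper: enclosure URLs of an RSS/Atom body, else None."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # not well-formed XML, so not a feed
    if root.tag == "rss":
        # RSS 2.0: <enclosure url="..."/> inside each <item>.
        urls = [enc.get("url", "") for enc in root.iter("enclosure")]
    elif root.tag == f"{ATOM_NS}feed":
        # Atom: <link rel="enclosure" href="..."/> inside each <entry>.
        urls = [
            link.get("href", "")
            for link in root.iter(f"{ATOM_NS}link")
            if link.get("rel") == "enclosure"
        ]
    else:
        return None  # e.g. an HTML page that merely mentions "enclosure"
    # dict.fromkeys drops duplicates while keeping first-seen (feed) order.
    deduped = list(dict.fromkeys(url for url in urls if url))
    return deduped or None
```

The XML parser decodes `&amp;` in attribute values on its own, which is why no separate unescape step appears in this sketch.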
1 parent 1b90c98 commit bacfac5

10 files changed

Lines changed: 537 additions & 30 deletions

README.md

Lines changed: 5 additions & 3 deletions
@@ -4,7 +4,7 @@
 [![License](https://img.shields.io/badge/license-MIT-D6402E)](https://github.com/AssemblyAI/cli/blob/main/LICENSE)
 [![Docs](https://img.shields.io/badge/docs-assemblyai-D6402E)](https://www.assemblyai.com/docs)
 
-The AssemblyAI CLI (`assembly`) brings speech AI directly into your terminal: transcribe files, URLs, and YouTube/podcast pages, stream live audio, talk to a two-way voice agent, prompt the LLM Gateway, benchmark speech models, and scaffold ready-to-deploy starter apps.
+The AssemblyAI CLI (`assembly`) brings speech AI directly into your terminal: transcribe files, URLs, YouTube/podcast pages, and whole podcast RSS feeds, stream live audio, talk to a two-way voice agent, prompt the LLM Gateway, benchmark speech models, and scaffold ready-to-deploy starter apps.
 
 <p align="center">
 <img src="assets/welcome.png" alt="The assembly CLI welcome screen, listing command groups for transcription, streaming, voice agents, app scaffolding, and account management" width="820">
@@ -44,7 +44,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 
 | Command | What it does |
 | :--- | :--- |
-| `assembly transcribe` | Transcribe files, URLs, YouTube/podcast pages, directories, globs, or bucket storage (`s3://`, `gs://`, `az://`) — with speaker labels, PII redaction, summarization, SRT/VTT captions, and resumable batch runs |
+| `assembly transcribe` | Transcribe files, URLs, YouTube/podcast pages, podcast RSS feeds, directories, globs, or bucket storage (`s3://`, `gs://`, `az://`) — with speaker labels, PII redaction, summarization, SRT/VTT captions, and resumable batch runs |
 | `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
 | `assembly dictate` | Push-to-talk dictation: press Enter to record, Enter again for instant text (Sync STT API, up to 120 s per utterance) |
 | `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
@@ -285,11 +285,13 @@ assembly transcribe video.mp4 -o srt # captions
 assembly transcribe call.mp3 --speaker-labels --summarization --json
 ```
 
-Transcribe in batches — a directory, a glob, or a piped list, resumable on re-run:
+Transcribe in batches — a directory, a glob, a piped list, or a whole podcast
+RSS feed (every episode becomes one source), resumable on re-run:
 
 ```sh
 assembly transcribe ./recordings
 assembly transcribe "s3://bucket/calls/*.mp3" # needs: pip install s3fs
+assembly transcribe "https://feeds.simplecast.com/54nAGcIl" # every episode in the feed
 find . -name "*.wav" | assembly transcribe --from-stdin
 ```

aai_cli/app/transcribe/feed.py

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+"""Podcast RSS/Atom feed expansion for ``assembly transcribe``.
+
+A feed URL names a whole show, so transcribing it means transcribing every
+episode. ``feed_episode_urls`` fetches the URL and, when ``feedparser`` recognizes
+it as an RSS or Atom feed, returns its episode enclosure URLs (in feed order —
+newest first) for the batch path to transcribe, one resumable sidecar per episode.
+The enclosures are direct media URLs the API fetches itself, so — unlike a YouTube
+or podcast *page*, which yt-dlp downloads first — no local download step is needed.
+
+Detection is deliberately narrow so a direct media URL or ordinary web page still
+falls through to the single-source path untouched (and is never fetched twice):
+only an http(s) URL whose path is feed-shaped — no extension, or one of
+``.xml``/``.rss``/``.atom`` — and that no dedicated yt-dlp extractor already claims
+is sniffed, the response body is bounded, and only content ``feedparser`` parses as
+a real feed with at least one enclosure is treated as a feed. We hand ``feedparser``
+the already-fetched bytes (never the URL) so our bounded, safe fetch below stays the
+only network path.
+"""
+
+from __future__ import annotations
+
+from pathlib import PurePosixPath
+from urllib.parse import urlsplit
+
+from pydantic import BaseModel, Field
+
+from aai_cli.core import youtube
+
+# A feed lives at an extensionless URL (e.g. feeds.simplecast.com/<id>) or a feed
+# document (.xml/.rss/.atom). Every other path — .mp3, .txt, .pdf — is never a feed,
+# so it is left for the single-source path and never fetched here.
+_FEED_URL_SUFFIXES = frozenset({"", ".xml", ".rss", ".atom"})
+
+# Bound the download so a hostile or huge URL can't exhaust memory; 10 MB of feed
+# already holds thousands of episodes, far past any realistic batch.
+_MAX_FEED_BYTES = 10 * 1024 * 1024  # pragma: no mutate -- tuning knob, not behavior
+_FETCH_TIMEOUT_SECONDS = 15.0  # pragma: no mutate -- tuning knob, not behavior
+
+
+class _Enclosure(BaseModel):
+    """One ``<enclosure>`` / Atom enclosure link; ``href`` is the media URL."""
+
+    href: str = ""
+
+
+class _Entry(BaseModel):
+    # default_factory (not a shared `= []`) so each entry gets its own list, and the
+    # typed factory keeps the field's element type known under pyright strict.
+    enclosures: list[_Enclosure] = Field(default_factory=list[_Enclosure])
+
+
+class _ParsedFeed(BaseModel):
+    """The slice of feedparser's untyped result we use, validated into a real type
+    (the project pattern for untyped third-party returns — cf. core/wer.py)."""
+
+    # feedparser sets ``version`` to a non-empty id ("rss20", "atom10", …) for a
+    # recognized feed and to "" for anything it doesn't recognize as one.
+    version: str = ""
+    entries: list[_Entry] = Field(default_factory=list[_Entry])
+
+
+def feed_episode_urls(url: str) -> list[str] | None:
+    """The episode media URLs if `url` is a podcast feed, else ``None``.
+
+    Returns ``None`` (stay single-source) for a direct-media URL, a yt-dlp page,
+    an unreachable URL, or any content that isn't a feed carrying enclosures.
+    """
+    if not _looks_like_feed_url(url) or youtube.is_downloadable_url(url):
+        return None
+    body = _fetch(url)
+    if body is None:
+        return None
+    return _episode_urls(body)
+
+
+def _looks_like_feed_url(url: str) -> bool:
+    """True when the URL path is feed-shaped: extensionless or a feed document."""
+    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
+    return suffix in _FEED_URL_SUFFIXES
+
+
+def _episode_urls(body: str) -> list[str] | None:
+    """The enclosure URLs in a feed body, deduped in document order; ``None`` when
+    feedparser doesn't recognize it as a feed or it carries no enclosures."""
+    import feedparser
+
+    # feedparser ships only partial inline types (its parse signature is Unknown),
+    # so the result is validated through _ParsedFeed below; mirror remotefs.py's
+    # fsspec shim in ignoring the unavoidable unknown-member report on the call.
+    raw = feedparser.parse(body)  # pyright: ignore[reportUnknownMemberType]
+    parsed = _ParsedFeed.model_validate(raw)
+    if not parsed.version:
+        return None
+    urls = [enc.href for entry in parsed.entries for enc in entry.enclosures if enc.href]
+    deduped = list(dict.fromkeys(urls))
+    return deduped or None
+
+
+def _fetch(url: str) -> str | None:
+    """Up to ``_MAX_FEED_BYTES`` of `url` decoded as text, or ``None`` on any failure
+    or when the response is obviously binary media (audio/video/image)."""
+    import httpx2 as httpx
+
+    chunks: list[bytes] = []
+    try:
+        with (
+            httpx.Client(timeout=_FETCH_TIMEOUT_SECONDS, follow_redirects=True) as client,
+            client.stream("GET", url) as response,
+        ):
+            if not response.is_success:
+                return None
+            content_type = response.headers.get("content-type", "").lower()
+            if content_type.startswith(("audio/", "video/", "image/")):
+                return None
+            total = 0
+            for chunk in response.iter_bytes():
+                chunks.append(chunk)
+                total += len(chunk)
+                if total >= _MAX_FEED_BYTES:
+                    break
+    except (httpx.HTTPError, OSError):
+        return None
+    return b"".join(chunks).decode("utf-8", "replace")
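
The URL-shape gate and the bounded read in `feed.py` can be tried without any network access. The sketch below mirrors those two pieces using only the standard library; the names `looks_like_feed_url` and `read_capped`, and the `example.com` URLs, are hypothetical stand-ins for illustration and are not part of the commit.

```python
from __future__ import annotations

from collections.abc import Iterable
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Same suffix set as the diff: "" means the last path segment has no extension.
FEED_URL_SUFFIXES = frozenset({"", ".xml", ".rss", ".atom"})


def looks_like_feed_url(url: str) -> bool:
    """Path-suffix test: extensionless or a feed document (case-insensitive)."""
    # urlsplit isolates the path, so a query string never affects the suffix.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return suffix in FEED_URL_SUFFIXES


def read_capped(chunks: Iterable[bytes], cap: int) -> bytes:
    """Collect chunks and stop once at least `cap` bytes have been read."""
    collected: list[bytes] = []
    total = 0
    for chunk in chunks:
        collected.append(chunk)
        total += len(chunk)
        if total >= cap:  # checked after appending, as in the diff's loop
            break
    return b"".join(collected)
```

Because the cap is checked after each chunk is appended, the result can exceed `cap` by up to one chunk; with 4-byte chunks and a cap of 6, `read_capped` returns 8 bytes. The loop in `_fetch` is written the same way, so the 10 MB figure behaves as a soft bound rather than an exact one.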

aai_cli/app/transcribe/run.py

Lines changed: 6 additions & 1 deletion
@@ -356,7 +356,12 @@ def run_transcribe(opts: TranscribeOptions, state: AppState, *, json_mode: bool)
     transcribe_validate.validate_speakers_expected(merged)
 
     sources = transcribe_sources.expand_sources(
-        opts.source, from_stdin=opts.from_stdin, sample=opts.sample
+        opts.source,
+        from_stdin=opts.from_stdin,
+        sample=opts.sample,
+        # --show-code must never touch the network; skip the feed probe and treat a
+        # URL as a single source for code generation.
+        detect_feeds=not opts.show_code,
     )
     if sources is not None:
         transcribe_sources.reject_single_source_flags(

aai_cli/app/transcribe/sources.py

Lines changed: 22 additions & 5 deletions
@@ -49,24 +49,41 @@
 _GLOB_CHARS = frozenset("*?[")
 
 
-def expand_sources(source: str | None, *, from_stdin: bool, sample: bool) -> list[str] | None:
+def expand_sources(
+    source: str | None, *, from_stdin: bool, sample: bool, detect_feeds: bool = True
+) -> list[str] | None:
     """The batch source list, or ``None`` when this is a single-source invocation.
 
     Batch mode triggers on ``--from-stdin``, a directory (scanned recursively for
-    audio files), a glob pattern that names no existing file, or a bucket URL
-    that is a glob or trailing-slash folder. A plain file, URL, ``-`` (audio
-    piped on stdin), or ``--sample`` stays on the single-source path.
+    audio files), a glob pattern that names no existing file, a bucket URL that is
+    a glob or trailing-slash folder, or an http(s) URL that turns out to be a
+    podcast RSS/Atom feed (each episode becomes one batch source). A plain file,
+    direct media URL, ``-`` (audio piped on stdin), or ``--sample`` stays on the
+    single-source path. ``detect_feeds=False`` skips the feed probe (and its
+    network fetch) for paths that must not touch the network, e.g. ``--show-code``.
     """
     if from_stdin:
         return _stdin_sources(source, sample=sample)
     # `not source` (rather than `is None`) also catches the empty string — e.g. an
     # unset shell variable in `assembly transcribe "$FILE"`. `Path("")` is `Path(".")`,
     # so it would otherwise fall into the directory branch and batch-transcribe the
     # whole working directory; instead it stays single-source and fails validation.
-    if not source or sample or source == "-" or source.startswith(URL_PREFIXES):
+    if not source or sample or source == "-":
         return None
+    if source.startswith(URL_PREFIXES):
+        # A podcast feed URL expands into its episode enclosure URLs (batch mode);
+        # a direct media URL or ordinary page returns None and stays single-source.
+        from aai_cli.app.transcribe import feed
+
+        return feed.feed_episode_urls(source) if detect_feeds else None
     if remotefs.is_remote_url(source):
         return _remote_sources(source)
+    return _local_sources(source)
+
+
+def _local_sources(source: str) -> list[str] | None:
+    """Batch sources for a local path: a directory's audio files or a glob's matches,
+    else ``None`` (a single file, which the single-source path handles)."""
     path = Path(source)
     if path.is_dir():
         return _directory_sources(path)
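
The effect of `detect_feeds` on URL routing can be checked in isolation with a fake probe. This is a hedged sketch, not the project's function: `route_url` and `probe` are hypothetical names, the value of `URL_PREFIXES` is an assumption (the real constant is not shown in the diff), and the local-path and bucket branches are left out.

```python
from __future__ import annotations

from collections.abc import Callable

# Assumed value; the real URL_PREFIXES constant is not visible in the diff.
URL_PREFIXES = ("http://", "https://")


def route_url(
    source: str,
    *,
    detect_feeds: bool,
    probe: Callable[[str], list[str] | None],
) -> list[str] | None:
    """Batch sources for a feed URL, or None to stay on the single-source path."""
    if not source.startswith(URL_PREFIXES):
        return None  # local paths and bucket URLs are out of scope for this sketch
    # With detect_feeds=False the probe is never called, so nothing is fetched.
    return probe(source) if detect_feeds else None
```

Injecting the probe as a parameter is only a convenience of the sketch: it makes it easy to assert that the `--show-code` path (`detect_feeds=False`) performs no lookup at all.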

aai_cli/commands/transcribe.py

Lines changed: 11 additions & 6 deletions
@@ -31,6 +31,10 @@
     ("Try it with the hosted sample", "assembly transcribe --sample"),
     ("Transcribe a YouTube video", "assembly transcribe https://youtu.be/dtp6b76pMak"),
     ("Transcribe a podcast page", 'assembly transcribe "https://podcasts.apple.com/…"'),
+    (
+        "Transcribe a whole podcast feed",
+        'assembly transcribe "https://feeds.simplecast.com/…"',
+    ),
     ("Label who said what", "assembly transcribe call.mp3 --speaker-labels"),
     ("Redact PII for compliance", "assembly transcribe call.mp3 --redact-pii"),
     ("Summarize a recording", "assembly transcribe call.mp3 --summarization"),
@@ -43,8 +47,8 @@ def transcribe(
     ctx: typer.Context,
     source: str | None = typer.Argument(
         None,
-        help="Audio file, URL, YouTube/podcast URL, bucket URL (s3://, gs://, …), or a "
-        "directory/glob (batch mode)",
+        help="Audio file, URL, YouTube/podcast URL, podcast RSS feed, bucket URL "
+        "(s3://, gs://, …), or a directory/glob (batch mode)",
     ),
     sample: bool = typer.Option(False, "--sample", help="Use the hosted wildfires.mp3 sample"),
     # batch mode
@@ -362,10 +366,11 @@ def transcribe(
     URLs (any page yt-dlp can extract) are downloaded first, then transcribed.
 
     Batch mode: pass a directory or glob (or pipe a list with --from-stdin) to
-    transcribe many sources concurrently. Each source gets a .aai.json sidecar
-    with the full result (including any --llm responses), and a re-run skips
-    sources already transcribed — with changed --llm prompts it replays just
-    the LLM step, never a second transcription.
+    transcribe many sources concurrently. A podcast RSS/Atom feed URL also expands
+    to batch mode — every episode enclosure becomes one source. Each source gets a
+    .aai.json sidecar with the full result (including any --llm responses), and a
+    re-run skips sources already transcribed — with changed --llm prompts it
+    replays just the LLM step, never a second transcription.
 
     Bucket URLs (s3://, gs://, az://, sftp://, …) work for single files and for
     batches (a glob, or a folder ending in /); install the matching fsspec

aai_cli/skills/aai-cli/references/transcription.md

Lines changed: 6 additions & 3 deletions
@@ -5,12 +5,14 @@ Five commands. All accept `--json` (auto-enabled when piped); `transcribe`,
 `transcribe`, `stream`, and `agent` accept `--show-code` to print equivalent
 Python SDK code without calling the API.
 
-## `assembly transcribe [SOURCE]` — file / URL / YouTube / podcast page
+## `assembly transcribe [SOURCE]` — file / URL / YouTube / podcast page / RSS feed
 
 `SOURCE` is a local file path, public URL, or a media-page URL yt-dlp can extract
 (YouTube, Apple Podcasts, Spreaker, SoundCloud, …) — those are downloaded first.
-Use `--sample` for the hosted `wildfires.mp3`. Analysis results (summary,
-chapters, sentiment, …) render automatically in human mode.
+A podcast RSS/Atom feed URL expands into a resumable batch run over every episode
+enclosure (one `.aai.json` sidecar apiece). Use `--sample` for the hosted
+`wildfires.mp3`. Analysis results (summary, chapters, sentiment, …) render
+automatically in human mode.
 
 High-value flags (run `assembly transcribe --help` for the full set):
 
@@ -37,6 +39,7 @@ assembly transcribe --sample
 assembly transcribe call.mp3 --speaker-labels --speakers-expected 2 --redact-pii
 assembly transcribe call.mp3 -o text
 assembly transcribe call.mp3 --show-code
+assembly transcribe "https://feeds.simplecast.com/54nAGcIl" # every episode in the feed
 ```
 
 ## `assembly stream [SOURCE]` — live real-time transcription

pyproject.toml

Lines changed: 5 additions & 0 deletions
@@ -56,6 +56,11 @@ dependencies = [
     # imported lazily). fsspec core only — each protocol's backend (s3fs, gcsfs, adlfs,
     # …) stays a user-installed extra surfaced via a clean install hint.
     "fsspec>=2026.4.0",
+    # Podcast RSS/Atom feed parsing for `assembly transcribe <feed-url>` (feed.py,
+    # imported lazily). The de-facto standard feed parser; pure-Python, no compiled
+    # deps. We hand it already-fetched bytes (never a URL) so our bounded, safe
+    # httpx fetch stays the only network path.
+    "feedparser>=6.0.11",
 ]
 
 [project.urls]
