Add PDF support to `assembly speak --url` by alexkroman · Pull Request #220 · AssemblyAI/cli

alexkroman · 2026-06-17T15:19:03Z

Extends the --url option to fetch and extract text from PDF documents in addition to HTML web pages. PDFs are detected by Content-Type header or the %PDF- magic bytes, then processed with pypdf to extract the text layer and metadata.
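A minimal sketch of the two-signal detection described above, not the PR's actual `_is_pdf()` code. The function name `is_pdf`, its signature, and the choice to require the magic bytes at offset 0 are assumptions for illustration.

```python
from typing import Optional

PDF_MAGIC = b"%PDF-"


def is_pdf(content_type: Optional[str], body: bytes) -> bool:
    """Return True if the response looks like a PDF (hypothetical helper)."""
    # Signal 1: the declared media type, ignoring parameters such as charset.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Signal 2: the magic bytes, which catch servers that mislabel a PDF
    # (for example as application/octet-stream). Assumes the header sits
    # at the very start of the body.
    return body.startswith(PDF_MAGIC)
```

Checking the header first means a correctly labeled response is classified without inspecting the body; the magic-byte check is the fallback for mislabeled servers.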

Changes

Renamed _fetch_html() to _fetch() and changed it to return the full httpx.Response object instead of just the text, allowing callers to inspect headers and access raw bytes for PDF detection.
Added PDF detection and extraction:
- New _is_pdf() function detects PDFs by Content-Type or magic bytes (handles mislabeled servers)
- New _extract_pdf() function uses pypdf to pull text and title from PDF documents
- Gracefully handles corrupt/invalid PDFs with a UsageError suggesting the file may be encrypted or malformed
Updated fetch_article() dispatch logic to route HTML and PDF resources to their respective extraction paths, with context-specific error hints (e.g., "scanned or image-only" for PDFs without text layers vs. "paywalled or JavaScript-rendered" for HTML).
Added pypdf dependency (>=5.1.0) to pyproject.toml with a note that it's pure-Python and adds no compilation step to Homebrew bottling.
Updated documentation and help text throughout to reflect PDF support (module docstring, function docstrings, CLI help text, and snapshot goldens).
Added test coverage for PDF extraction, including:
- PDF detection by Content-Type and magic bytes
- Text and title extraction from valid PDFs
- Handling of scanned/image-only PDFs (no text layer)
- Handling of corrupt PDFs
- A helper _make_pdf() function that builds minimal but valid PDFs for testing
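The dispatch with format-specific error hints can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `fetch_article()`: the `UsageError` class is a local stand-in for the CLI's real error type, the extractors are passed in as plain callables, and the hint wording is only modeled on the phrases quoted in the description.

```python
from typing import Callable, Optional


class UsageError(Exception):
    """Stand-in for the CLI's real UsageError type (assumed)."""


# Hint text modeled on the PR description; exact messages are assumptions.
PDF_HINT = "the PDF may be scanned or image-only (no text layer)"
HTML_HINT = "the page may be paywalled or JavaScript-rendered"

Extractor = Callable[[bytes], Optional[str]]


def extract_article(
    body: bytes,
    is_pdf: bool,
    pdf_extractor: Extractor,
    html_extractor: Extractor,
) -> str:
    """Route the body to one extractor; raise a format-specific hint if empty."""
    if is_pdf:
        extractor, hint = pdf_extractor, PDF_HINT
    else:
        extractor, hint = html_extractor, HTML_HINT
    # Treat None and whitespace-only results the same way: no usable text.
    text = (extractor(body) or "").strip()
    if not text:
        raise UsageError(f"No readable text found: {hint}.")
    return text
```

Pairing each extractor with its own hint keeps the empty-result check in one place while still telling the user the likely cause for that format.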

The implementation defers both trafilatura and pypdf imports to call time, keeping them off the CLI's startup path.
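The call-time import pattern looks roughly like this. To keep the sketch runnable without third-party packages, it defers a standard-library module (`xml.dom.minidom`) in place of trafilatura or pypdf; the function names are hypothetical.

```python
import sys


def parse_title(xml_text: str) -> str:
    """Extract the first <title> text, importing the parser only when called."""
    # Deferred import: the module is loaded on the first call, not when this
    # file is imported, so commands that never call this function skip the cost.
    from xml.dom import minidom

    doc = minidom.parseString(xml_text)
    return doc.getElementsByTagName("title")[0].firstChild.data


def is_loaded(module_name: str) -> bool:
    """Report whether a module has already been imported in this process."""
    return module_name in sys.modules
```

After the first call the module stays cached in `sys.modules`, so later calls pay only a dictionary lookup.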

https://claude.ai/code/session_015Z1o33Ezt9aznmePd4Jc9R

assembly speak --url now handles PDF URLs in addition to HTML. The fetch reads the full response and dispatches on content type: PDFs (detected by Content-Type or the %PDF- magic bytes, so a mislabeled octet-stream still routes correctly) go through pypdf's text-layer extraction; HTML keeps the trafilatura boilerplate-stripping path. A scanned/image-only PDF (no text layer) and an unparseable PDF both surface a clear UsageError.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015Z1o33Ezt9aznmePd4Jc9R

alexkroman force-pushed the claude/confident-carson-eka270 branch from 3093ad7 to 5f642fe · June 17, 2026 17:12