Add PDF support to assembly speak --url #220

Merged
alexkroman merged 3 commits into main from claude/confident-carson-eka270
Jun 17, 2026

@alexkroman

Collaborator

Extends the --url option to fetch and extract text from PDF documents in addition to HTML web pages. PDFs are detected by Content-Type header or the %PDF- magic bytes, then processed with pypdf to extract the text layer and metadata.
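
For illustration, the detection rule described above can be sketched as follows. This is a minimal sketch, not the PR's actual `_is_pdf()`: the name `looks_like_pdf` and its plain `(content_type, body)` signature are assumptions, chosen so the example runs without httpx.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_type: str, body: bytes) -> bool:
    """Hypothetical helper: decide whether a fetched response is a PDF."""
    # Drop parameters such as "; charset=binary" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fall back to the magic bytes so a mislabeled response
    # (e.g. application/octet-stream) is still recognized.
    return body.startswith(PDF_MAGIC)
```

Checking the magic bytes second means the header check stays cheap for well-behaved servers, while mislabeled downloads are still routed to the PDF path.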

Changes

  • Renamed _fetch_html() to _fetch() and changed it to return the full httpx.Response object instead of just the text, allowing callers to inspect headers and access raw bytes for PDF detection.

  • Added PDF detection and extraction:

    • New _is_pdf() function detects PDFs by Content-Type or magic bytes (handles mislabeled servers)
    • New _extract_pdf() function uses pypdf to pull text and title from PDF documents
    • Gracefully handles corrupt/invalid PDFs with a UsageError suggesting the file may be encrypted or malformed
  • Updated fetch_article() dispatch logic to route HTML and PDF resources to their respective extraction paths, with context-specific error hints (e.g., "scanned or image-only" for PDFs without text layers vs. "paywalled or JavaScript-rendered" for HTML).

  • Added pypdf dependency (>=5.1.0) to pyproject.toml with a note that it's pure-Python and adds no compilation step to Homebrew bottling.

  • Updated documentation and help text throughout to reflect PDF support (module docstring, function docstrings, CLI help text, and snapshot goldens).

  • Added tests for PDF extraction, covering:

    • PDF detection by Content-Type and magic bytes
    • Text and title extraction from valid PDFs
    • Handling of scanned/image-only PDFs (no text layer)
    • Handling of corrupt PDFs
    • A helper _make_pdf() function that builds minimal but valid PDFs for testing
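
The dispatch and error-hint behavior listed above might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in this PR: `dispatch`, the `Extractor` type, and the local `UsageError` class are hypothetical stand-ins, and the extractors are passed in as callables so the example needs no third-party packages.

```python
from typing import Callable, Optional, Tuple


class UsageError(Exception):
    """Stand-in for the CLI's usage error type (assumption)."""


# An extractor takes the raw body and returns (title, text).
Extractor = Callable[[bytes], Tuple[Optional[str], str]]


def dispatch(
    body: bytes,
    is_pdf: bool,
    extract_pdf: Extractor,
    extract_html: Extractor,
) -> Tuple[Optional[str], str]:
    """Route the body to the matching extractor and explain empty results."""
    extractor = extract_pdf if is_pdf else extract_html
    title, text = extractor(body)
    if not text.strip():
        # The hint depends on the resource type, mirroring the PR description.
        hint = (
            "the PDF may be scanned or image-only"
            if is_pdf
            else "the page may be paywalled or JavaScript-rendered"
        )
        raise UsageError(f"No readable text found; {hint}.")
    return title, text
```

Keeping the hint next to the routing decision means each failure message names the most likely cause for that resource type.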

The implementation defers both trafilatura and pypdf imports to call time, keeping them off the CLI's startup path.
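
A call-time import of this kind can be sketched as below. The function name `extract_pdf_text` and the use of `ValueError` in place of the CLI's `UsageError` are assumptions; the actual `_extract_pdf()` is not shown on this page and may differ. Only `PdfReader`, `pages`, `extract_text()`, `metadata.title`, and `PdfReadError` are taken from pypdf itself.

```python
import io
from typing import Optional, Tuple


def extract_pdf_text(data: bytes) -> Tuple[Optional[str], str]:
    """Hypothetical helper: return (title, text) from a PDF's text layer."""
    # Imported inside the function, not at module top, so merely importing
    # this module does not load pypdf.
    from pypdf import PdfReader
    from pypdf.errors import PdfReadError

    try:
        reader = PdfReader(io.BytesIO(data))
        # Pages without a text layer contribute an empty string.
        text = "\n\n".join((page.extract_text() or "") for page in reader.pages)
        title = reader.metadata.title if reader.metadata else None
    except PdfReadError as exc:
        # ValueError is a stand-in for the CLI's UsageError (assumption).
        raise ValueError(
            "could not parse PDF; it may be encrypted or malformed"
        ) from exc
    return title, text.strip()
```

Defining the function costs nothing at startup; pypdf is loaded only on the first call, which is the point of deferring the import.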

https://claude.ai/code/session_015Z1o33Ezt9aznmePd4Jc9R

@alexkroman force-pushed the claude/confident-carson-eka270 branch from 3093ad7 to 5f642fe June 17, 2026 17:12
@alexkroman added this pull request to the merge queue Jun 17, 2026
@github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks Jun 17, 2026
@alexkroman added this pull request to the merge queue Jun 17, 2026
@github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks Jun 17, 2026
@alexkroman added this pull request to the merge queue Jun 17, 2026
@alexkroman removed this pull request from the merge queue due to a manual request Jun 17, 2026
assembly speak --url now handles PDF URLs in addition to HTML. The fetch
reads the full response and dispatches on content type: PDFs (detected by
Content-Type or the %PDF- magic bytes, so a mislabeled octet-stream still
routes correctly) go through pypdf's text-layer extraction; HTML keeps the
trafilatura boilerplate-stripping path. A scanned/image-only PDF (no text
layer) and an unparseable PDF both surface a clear UsageError.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015Z1o33Ezt9aznmePd4Jc9R
@alexkroman force-pushed the claude/confident-carson-eka270 branch from 5f642fe to 3803309 June 17, 2026 19:46
@alexkroman added this pull request to the merge queue Jun 17, 2026
@github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks Jun 17, 2026
@alexkroman enabled auto-merge June 17, 2026 20:56
@alexkroman added this pull request to the merge queue Jun 17, 2026
@github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks Jun 17, 2026
@alexkroman added this pull request to the merge queue Jun 17, 2026
Merged via the queue into main with commit 3b52ef5 Jun 17, 2026
19 checks passed
@alexkroman deleted the claude/confident-carson-eka270 branch June 17, 2026 21:52