Add PDF support to assembly speak --url #220
Merged
assembly speak --url now handles PDF URLs in addition to HTML. The fetch reads the full response and dispatches on content type: PDFs (detected by Content-Type or the %PDF- magic bytes, so a mislabeled octet-stream still routes correctly) go through pypdf's text-layer extraction; HTML keeps the trafilatura boilerplate-stripping path. A scanned/image-only PDF (no text layer) and an unparseable PDF both surface a clear UsageError.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015Z1o33Ezt9aznmePd4Jc9R
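The detection rule described in the commit message can be sketched as follows. This is a minimal illustration using only the standard library, not the PR's actual `_is_pdf()`: the name `looks_like_pdf` and its signature (a header string plus the raw body bytes) are assumptions made for the example.

```python
from typing import Optional

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_type: Optional[str], body: bytes) -> bool:
    """Return True if a fetched response appears to be a PDF.

    Hypothetical helper: the real implementation may take a response
    object rather than a header string and raw bytes.
    """
    # Content-Type may carry parameters ("application/pdf; charset=binary"),
    # so compare only the media type, case-insensitively.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fall back to the magic bytes so a PDF served with a wrong label,
    # such as application/octet-stream, still routes to the PDF path.
    return body.startswith(PDF_MAGIC)
```

Checking the header first keeps the common case cheap, while the magic-byte fallback covers servers that label every download as a generic binary.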
Extends the `--url` option to fetch and extract text from PDF documents in addition to HTML web pages. PDFs are detected by the Content-Type header or the `%PDF-` magic bytes, then processed with pypdf to extract the text layer and metadata.

Changes

- Renamed `_fetch_html()` to `_fetch()` and changed it to return the full `httpx.Response` object instead of just the text, allowing callers to inspect headers and access raw bytes for PDF detection.
- Added PDF detection and extraction:
  - `_is_pdf()` function detects PDFs by Content-Type or magic bytes (handles servers that mislabel the Content-Type)
  - `_extract_pdf()` function uses pypdf to pull text and title from PDF documents
  - Unparseable PDFs surface a `UsageError` suggesting the file may be encrypted or malformed
- Updated `fetch_article()` dispatch logic to route HTML and PDF resources to their respective extraction paths, with context-specific error hints (e.g., "scanned or image-only" for PDFs without text layers vs. "paywalled or JavaScript-rendered" for HTML).
- Added pypdf dependency (`>=5.1.0`) to `pyproject.toml` with a note that it's pure-Python and adds no compilation step to Homebrew bottling.
- Updated documentation and help text throughout to reflect PDF support (module docstring, function docstrings, CLI help text, and snapshot goldens).
- Added comprehensive test coverage for PDF extraction, including:
  - `_make_pdf()` function that builds minimal but valid PDFs for testing

The implementation defers both trafilatura and pypdf imports to call time, keeping them off the CLI's startup path.
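For readers curious what a "minimal but valid" test PDF involves, here is one way such a fixture could be built with only the standard library. It is a sketch under assumptions, not the PR's `_make_pdf()`: the name `make_minimal_pdf`, the single-page layout, and the Helvetica font are all choices made for this illustration.

```python
def make_minimal_pdf(text: str = "Hello PDF") -> bytes:
    """Build a one-page PDF with a text layer (hypothetical test fixture)."""
    # Escape characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: select font F1 at 12pt, move to (72, 720), draw the text.
    # latin-1 keeps the sketch simple; other characters would raise here.
    stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        # Record each object's byte offset for the cross-reference table.
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    # Each xref entry is exactly 20 bytes, including its two-byte line ending.
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Omitting the content stream (or leaving it empty) would give a page with no text layer, which is one plausible way to exercise the "scanned or image-only" error path in a test.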
https://claude.ai/code/session_015Z1o33Ezt9aznmePd4Jc9R