Add --url flag to speak command for reading web pages aloud #201
Bundle trafilatura as a content adapter for `speak`: --url fetches a web page (httpx, the project's pinned client) and trafilatura strips the boilerplate (nav, sidebars, footers, comment threads) down to the readable article body, so text-to-speech narrates the piece rather than the page chrome.

- core/webpage.py: fetch_article(url) -> Article (text/title/url), with a lazy trafilatura import to keep it off CLI startup; non-http URLs and pages with no extractable text raise UsageError, fetch failures APIError.
- speak: new --url option, mutually exclusive with the text argument and stdin; resolves to the extracted text before synthesis.

trafilatura ships prebuilt wheels (lxml included), so it adds no source-compile step to Homebrew bottling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KHf2ttdfNjEwMHvZSMi2HU
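The validation step and the `Article` shape described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `core/webpage.py`: `require_web_url` is a hypothetical helper name, `UsageError` is a local stand-in for the project's exception, and the field types on `Article` are assumed from the "text/title/url" summary.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


class UsageError(Exception):
    """Local stand-in for the project's UsageError (real definition not shown)."""


@dataclass(frozen=True)
class Article:
    """Extracted page content; frozen so fields cannot be reassigned later."""

    text: str
    title: str  # assumed type; the real field may be optional
    url: str


def require_web_url(url: str) -> str:
    """Return url unchanged if it is an http(s) URL with a host, else raise."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme; a bare "example.com/x" has an empty scheme.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UsageError("--url expects an http:// or https:// address")
    return url
```

Checking the scheme before any network call means inputs such as `file:///etc/passwd` or a bare hostname are rejected as usage mistakes rather than surfacing later as fetch failures.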
        response.raise_for_status()
        return response.text
    except httpx.HTTPError as exc:
        raise APIError(f"Couldn't fetch {url}: {exc}") from exc
APIError includes the raw user-provided URL and exception in its message (f"Couldn't fetch {url}: {exc}"). Avoid embedding unsanitized URLs in error text; sanitize or redact before including in messages.
✨ AI Reasoning
The exception handler constructs an APIError embedding the requested URL and the HTTP exception (f"Couldn't fetch {url}: {exc}"). If these errors are logged or displayed, the raw URL (and possibly sensitive query strings) will be exposed and may allow log injection via crafted input.
🔧 How do I fix it?
Keep sensitive data such as emails, passwords, and tokens out of logs. When logging values tied to a user, prefer a safe identifier like a user ID over the raw input, and strip line breaks from any user-provided text you do log.
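One way to act on this advice is to reduce the URL to its scheme, host, port, and path before it reaches an error message. The sketch below is an assumption about how that could look, not the fix adopted in this PR; `redact_url` is a hypothetical helper name and it relies only on `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return url as scheme://host[:port]/path with control characters removed."""
    # Drop non-printable characters first so a crafted value cannot
    # start a new line in a log file.
    cleaned = "".join(ch for ch in url if ch.isprintable())
    try:
        parts = urlsplit(cleaned)
        port = parts.port
    except ValueError:
        # Malformed netloc or port: echo nothing from the input.
        return "<unparseable URL>"
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals need their brackets restored.
        host = f"[{host}]"
    netloc = f"{host}:{port}" if port is not None else host
    # Userinfo, query string, and fragment are intentionally discarded.
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))
```

With a helper like this, the handler could build its message as `f"Couldn't fetch {redact_url(url)}: {type(exc).__name__}"`, which also avoids echoing the URL a second time through the exception text.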
…des-vv0vr6

# Conflicts:
#   aai_cli/commands/speak/_exec.py
#   pyproject.toml
#   tests/test_speak.py
Adds support for fetching and narrating web pages via a new `--url` flag on the `assembly speak` command. The main article text is extracted from the page with boilerplate (nav, footers, comments, sidebars) stripped before being passed to text-to-speech.

Changes

- New `webpage` module (`aai_cli/core/webpage.py`): Fetches HTML via httpx and extracts readable article text using trafilatura. Handles URL validation, network errors, and extraction failures with appropriate error types.
  - `fetch_article(url)`: main entry point; validates http(s) URLs and raises `UsageError` for non-web URLs or pages with no readable text
  - `_fetch_html(url)`: fetches with a browser-like User-Agent and redirect following; maps network/HTTP errors to `APIError`
  - `_extract(html)`: strips boilerplate and extracts the title using trafilatura (imported lazily to keep CLI startup fast)
  - `Article` dataclass: frozen to prevent accidental mutation of fetched content
- Updated `speak` command (`aai_cli/commands/speak/__init__.py`):
  - `--url` option with help text and example
- Updated `speak` execution (`aai_cli/commands/speak/_exec.py`):
  - Adds a `url` field to `SpeakOptions`
  - `_resolve_input()` function enforces mutual exclusivity between `--url` and the text argument/stdin using the `mutually_exclusive()` helper
  - Calls `webpage.fetch_article()` when `--url` is provided
- Comprehensive test suite (`tests/test_webpage.py`):
  - `Article` dataclass
  - `httpx.MockTransport` for hermetic testing without real network calls
- Integration tests (`tests/test_speak.py`):
  - `--url` fetches and narrates extracted article text
  - Mutual exclusivity between `--url` and the text argument
- Dependencies: Added `trafilatura>=2.1.0` to `pyproject.toml`

Implementation notes

- trafilatura is imported lazily inside `_extract()` to avoid slowing CLI startup
- Network/HTTP errors are mapped to `APIError`; extraction failures and invalid URLs are mapped to `UsageError` with helpful suggestions
- Mutual exclusivity uses the `mutually_exclusive()` helper from the errors module

https://claude.ai/code/session_01KHf2ttdfNjEwMHvZSMi2HU
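The input-resolution rule described above (at most one of `--url` or the text argument/stdin, with the URL resolved to article text before synthesis) can be sketched as follows. This is a simplified illustration under stated assumptions: it does not use the project's `mutually_exclusive()` helper, whose signature is not shown here, and `resolve_input`, its parameters, and the local `UsageError` class are hypothetical stand-ins.

```python
from typing import Callable, Optional


class UsageError(Exception):
    """Local stand-in for the project's UsageError (real definition not shown)."""


def resolve_input(
    text: Optional[str],
    stdin_text: Optional[str],
    url: Optional[str],
    fetch: Callable[[str], str],
) -> str:
    """Pick the single text source to synthesize, rejecting ambiguous combinations."""
    if url is not None:
        # --url may not be combined with any other text source.
        if text is not None or stdin_text is not None:
            raise UsageError("--url cannot be combined with a text argument or stdin")
        # fetch stands in for something like webpage.fetch_article(url).text
        return fetch(url)
    if text is not None:
        return text
    if stdin_text is not None:
        return stdin_text
    raise UsageError("no input: pass text, pipe stdin, or use --url")
```

Passing the fetcher in as a parameter keeps the rule testable without network access, in the same spirit as the `httpx.MockTransport` tests mentioned above.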