Add --url flag to speak command for reading web pages aloud #201

Merged: alexkroman merged 2 commits into main from claude/dreamy-archimedes-vv0vr6 on Jun 16, 2026

@alexkroman

Collaborator

Adds support for fetching and narrating web pages via a new --url flag on the assembly speak command. The main article text is extracted from the page with boilerplate (nav, footers, comments, sidebars) stripped before being passed to text-to-speech.

Changes

  • New webpage module (aai_cli/core/webpage.py): Fetches HTML via httpx and extracts readable article text using trafilatura. Handles URL validation, network errors, and extraction failures with appropriate error types.

    • fetch_article(url) — main entry point; validates http(s) URLs and raises UsageError for non-web URLs or pages with no readable text
    • _fetch_html(url) — fetches with browser-like User-Agent and redirect following; maps network/HTTP errors to APIError
    • _extract(html) — strips boilerplate and extracts title using trafilatura (imported lazily to keep CLI startup fast)
    • Article dataclass — frozen to prevent accidental mutation of fetched content
  • Updated speak command (aai_cli/commands/speak/__init__.py):

    • Added --url option with help text and example
    • Updated docstring to document the new web page input mode
    • Added a --url example to the examples section
  • Updated speak execution (aai_cli/commands/speak/_exec.py):

    • Added url field to SpeakOptions
    • New _resolve_input() function enforces mutual exclusivity between --url and the text argument/stdin using the mutually_exclusive() helper
    • Calls webpage.fetch_article() when --url is provided
  • Comprehensive test suite (tests/test_webpage.py):

    • Tests immutability of Article dataclass
    • Tests HTML fetching with browser UA, redirect following, and error handling (404, connection errors)
    • Tests boilerplate extraction (nav, footers, comments stripped; title extracted)
    • Tests URL validation and readable text validation
    • Uses httpx.MockTransport for hermetic testing without real network calls
  • Integration tests (tests/test_speak.py):

    • Tests --url fetches and narrates extracted article text
    • Tests mutual exclusivity of --url and text argument
  • Dependencies: Added trafilatura>=2.1.0 to pyproject.toml
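
As a rough illustration of the validation and immutability described in the list above, here is a minimal stdlib-only sketch. It is not the code from aai_cli/core/webpage.py: the `UsageError` class is a local stand-in for the project's own error type, `validate_web_url` is a hypothetical helper name, and the field types on `Article` are assumptions (the commit message only names text, title, and url).

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


class UsageError(Exception):
    """Local stand-in for the CLI's UsageError (the real class lives in the project)."""


@dataclass(frozen=True)
class Article:
    """Fetched article; frozen so callers cannot mutate it by accident."""

    text: str
    title: Optional[str]  # assumed optional: a page may have no extractable title
    url: str


def validate_web_url(url: str) -> str:
    """Return the URL unchanged if it is http(s) with a host, else raise UsageError.

    Hypothetical helper: the PR performs this check inside fetch_article().
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise UsageError(f"Not a web URL: {url!r}")
    return url
```

Assigning to a field of a frozen dataclass raises `dataclasses.FrozenInstanceError` (a subclass of `AttributeError`), which is presumably the behavior the immutability test in tests/test_webpage.py checks.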

Implementation notes

  • trafilatura is imported lazily in _extract() to avoid slowing CLI startup
  • Network timeouts capped at 30 seconds to prevent TTS runs from hanging
  • Browser-like User-Agent sent to avoid stub/block pages from sites that reject unknown clients
  • All network/HTTP errors mapped to APIError; extraction failures and invalid URLs mapped to UsageError with helpful suggestions
  • Mutual exclusivity validation reuses the existing mutually_exclusive() helper from the errors module
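
The mutual-exclusivity rule for --url and the text argument could look roughly like the sketch below. Everything here is an assumption for illustration: the real `mutually_exclusive()` helper's signature is not shown in this PR, so a hypothetical keyword-based version is defined locally, `resolve_input` only approximates `_resolve_input()`, and stdin handling is left out.

```python
from typing import Callable, Optional


class UsageError(Exception):
    """Local stand-in for the CLI's UsageError."""


def mutually_exclusive(**options: object) -> None:
    """Hypothetical stand-in: raise if more than one named option is set."""
    given = [name for name, value in options.items() if value is not None]
    if len(given) > 1:
        raise UsageError(f"Use only one of: {', '.join(given)}")


def resolve_input(
    text: Optional[str],
    url: Optional[str],
    fetch: Callable[[str], str],
) -> str:
    """Return the text to synthesize from exactly one input source.

    `fetch` plays the role of webpage.fetch_article() and is injected here
    so the sketch runs without network access.
    """
    mutually_exclusive(text=text, url=url)
    if url is not None:
        return fetch(url)
    if text is None:
        raise UsageError("Provide text or --url")
    return text
```

Injecting the fetcher is a choice made for this sketch only; it keeps the example hermetic in the same spirit as the httpx.MockTransport tests mentioned above.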

https://claude.ai/code/session_01KHf2ttdfNjEwMHvZSMi2HU

Bundle trafilatura as a content adapter for `speak`: --url fetches a web
page (httpx, the project's pinned client) and trafilatura strips the
boilerplate — nav, sidebars, footers, comment threads — down to the
readable article body, so text-to-speech narrates the piece rather than
the page chrome.

- core/webpage.py: fetch_article(url) -> Article (text/title/url), with a
  lazy trafilatura import to keep it off CLI startup; non-http URLs and
  pages with no extractable text raise UsageError, fetch failures APIError.
- speak: new --url option, mutually exclusive with the text argument and
  stdin; resolves to the extracted text before synthesis.

trafilatura ships prebuilt wheels (lxml included), so it adds no
source-compile step to Homebrew bottling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KHf2ttdfNjEwMHvZSMi2HU
@alexkroman alexkroman enabled auto-merge June 16, 2026 23:05
Comment thread aai_cli/core/webpage.py
        response.raise_for_status()
        return response.text
    except httpx.HTTPError as exc:
        raise APIError(f"Couldn't fetch {url}: {exc}") from exc

APIError includes the raw user-provided URL and exception in its message (f"Couldn't fetch {url}: {exc}"). Avoid embedding unsanitized URLs in error text; sanitize or redact before including in messages.

Details

✨ AI Reasoning
The exception handler constructs an APIError embedding the requested URL and the HTTP exception (f"Couldn't fetch {url}: {exc}"). If these errors are logged or displayed, the raw URL (and possibly sensitive query strings) will be exposed and may allow log injection via crafted input.

🔧 How do I fix it?
Keep sensitive data such as emails, passwords, and tokens out of logs. When logging values tied to a user, prefer a safe identifier like a user ID over the raw input, and strip line breaks from any user-provided text you do log.
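
One way to act on this advice is to redact the URL before it reaches the error message. The sketch below uses only the standard library; `redact_url` is a hypothetical helper that does not appear in this PR, and whether the merged code adopted anything like it is not stated in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return a form of `url` that is safer to show in error text.

    Drops userinfo (credentials), query string, and fragment, and removes
    non-printable characters such as CR/LF that could forge log lines.
    """
    cleaned = "".join(ch for ch in url if ch.isprintable())
    try:
        parts = urlsplit(cleaned)
        port = parts.port  # accessing .port raises ValueError on a malformed port
    except ValueError:
        return "<unparseable URL>"
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    if port is not None:
        host = f"{host}:{port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))
```

With such a helper the handler could raise `APIError(f"Couldn't fetch {redact_url(url)}: ...")`. Note that the exception text itself may also contain the URL, so including `{exc}` unmodified could reintroduce what was redacted.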


…des-vv0vr6

# Conflicts:
#	aai_cli/commands/speak/_exec.py
#	pyproject.toml
#	tests/test_speak.py
@alexkroman alexkroman added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit e53dcbf Jun 16, 2026
19 checks passed
@alexkroman alexkroman deleted the claude/dreamy-archimedes-vv0vr6 branch June 16, 2026 23:27