Add --url flag to speak command for reading web pages aloud #201
Merged
2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| """Fetch a web page and extract its main article text. | ||
| | ||
| Backs ``assembly speak --url``: httpx2 (the project's pinned client) fetches the | ||
| HTML and trafilatura strips the boilerplate — nav, sidebars, cookie banners, | ||
| footers, comment threads — down to the readable article body, so text-to-speech | ||
| narrates the piece rather than the page chrome. trafilatura (and its lxml | ||
| backend) is the heavy import, so it is deferred to call time to stay off the | ||
| CLI's startup path. | ||
| """ | ||
| | ||
| from __future__ import annotations | ||
| | ||
| from dataclasses import dataclass | ||
| | ||
| import httpx2 as httpx | ||
| | ||
| from aai_cli.core.errors import APIError, UsageError | ||
| | ||
| # A page fetch shouldn't hang a TTS run; cap it. | ||
| _TIMEOUT = 30.0 # pragma: no mutate -- request timeout; nothing observable to assert | ||
| # Browser-like UA: some sites serve a stub or block page to unknown clients. | ||
| _USER_AGENT = "Mozilla/5.0 (compatible; assembly-cli; +https://www.assemblyai.com)" | ||
| | ||
| | ||
| @dataclass(frozen=True) | ||
| class Article: | ||
| """The readable content extracted from a web page.""" | ||
| | ||
| text: str | ||
| title: str | None | ||
| url: str | ||
| | ||
| | ||
| def fetch_article(url: str) -> Article: | ||
| """Fetch ``url`` and return its main article text with boilerplate removed. | ||
| | ||
| Raises a :class:`UsageError` when ``url`` isn't an http(s) address or the | ||
| page yields no readable text, and an :class:`APIError` when the fetch itself | ||
| fails (DNS, timeout, non-2xx). | ||
| """ | ||
| if not url.startswith(("http://", "https://")): | ||
| raise UsageError( | ||
| f"Not a web page URL: {url}", | ||
| suggestion="Pass an http(s) URL, e.g. assembly speak --url https://example.com/post.", | ||
| ) | ||
| text, title = _extract(_fetch_html(url)) | ||
| if not text: | ||
| raise UsageError( | ||
| f"Couldn't find readable text at {url}.", | ||
| suggestion="The page may be paywalled, JavaScript-rendered, or not an article.", | ||
| ) | ||
| return Article(text=text, title=title, url=url) | ||
| | ||
| | ||
| def _fetch_html(url: str) -> str: | ||
| """GET the raw HTML for ``url``, mapping any network/HTTP failure to APIError.""" | ||
| try: | ||
| with httpx.Client( | ||
| timeout=_TIMEOUT, | ||
| follow_redirects=True, | ||
| headers={"User-Agent": _USER_AGENT}, | ||
| ) as client: | ||
| response = client.get(url) | ||
| response.raise_for_status() | ||
| return response.text | ||
| except httpx.HTTPError as exc: | ||
| raise APIError(f"Couldn't fetch {url}: {exc}") from exc | ||
| | ||
| | ||
| def _extract(html: str) -> tuple[str | None, str | None]: | ||
| """Pull the main text and title out of ``html`` (trafilatura, imported lazily).""" | ||
| import trafilatura | ||
| | ||
| text = trafilatura.extract( | ||
| html, | ||
| output_format="txt", | ||
| # Don't narrate the comment thread. trafilatura's comment classifier keys | ||
| # off real-world markup (Disqus, microformats), which synthetic test | ||
| # fixtures can't reproduce, so this flag stays unasserted by the suite. | ||
| include_comments=False, # pragma: no mutate | ||
| ) | ||
| title = getattr(trafilatura.extract_metadata(html), "title", None) | ||
| return text, title | ||
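The scheme guard in `fetch_article` is a case-sensitive prefix check. As a minimal sketch of a stricter alternative, using only the standard library, the helper below parses the URL instead, so an uppercase scheme is accepted and a URL with no host is rejected. The name `is_web_url` is hypothetical and not part of this PR.

```python
from urllib.parse import urlsplit


def is_web_url(url: str) -> bool:
    """Return True for http(s) URLs that also name a host.

    Hypothetical helper for illustration; the PR itself uses
    url.startswith(("http://", "https://")).
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs,
        # for example an unterminated IPv6 literal.
        return False
    # urlsplit lowercases the scheme, so "HTTPS://..." passes here.
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

With the prefix check as merged, `HTTPS://Example.com/post` raises the "Not a web page URL" `UsageError`, while `https://` passes the guard and is left for the HTTP client to reject. Whether either case matters for the CLI is a judgment call.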
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,121 @@ | ||
| from __future__ import annotations | ||
| | ||
| import dataclasses | ||
| | ||
| import httpx2 as httpx | ||
| import pytest | ||
| | ||
| from aai_cli.core import webpage | ||
| from aai_cli.core.errors import APIError, UsageError | ||
| | ||
| # An article wrapped in the usual page chrome: nav, a comment thread, and a | ||
| # <title>. The extractor should keep the body and drop the rest. | ||
| ARTICLE_HTML = """<!DOCTYPE html><html><head><title>The Real Headline</title></head> | ||
| <body> | ||
| <nav>Home | About | SubscribeNavBoilerplate</nav> | ||
| <article> | ||
| <h1>The Real Headline</h1> | ||
| <p>This is the first real paragraph of the article body that we care about.</p> | ||
| <p>Here is a second substantive paragraph with more content to extract.</p> | ||
| </article> | ||
| <section class="comments"><p>UserCommentText that appears in the discussion thread below.</p></section> | ||
| <footer>FooterBoilerplate copyright 2026.</footer> | ||
| </body></html>""" | ||
| | ||
| | ||
| def _client_returning(monkeypatch, handler): | ||
| """Patch webpage.httpx.Client to route requests through a MockTransport handler | ||
| (the test_eval_data_hf.py pattern) so no real socket is opened.""" | ||
| real_client = httpx.Client | ||
| | ||
| def fake_client(*args, **kwargs): | ||
| kwargs["transport"] = httpx.MockTransport(handler) | ||
| return real_client(*args, **kwargs) | ||
| | ||
| monkeypatch.setattr(webpage.httpx, "Client", fake_client) | ||
| | ||
| | ||
| def test_article_is_immutable(): | ||
| # frozen=True: a fetched Article can't be mutated out from under a caller. | ||
| article = webpage.Article(text="body", title="T", url="https://example.com/p") | ||
| # A dynamic field name (not a literal) keeps pyright from resolving the | ||
| # assignment to the read-only attribute — see test_command_options_seam.py. | ||
| field_name = dataclasses.fields(article)[0].name | ||
| with pytest.raises(dataclasses.FrozenInstanceError): | ||
| setattr(article, field_name, "tampered") | ||
| | ||
| | ||
| def test_fetch_html_returns_body_and_sends_browser_user_agent(monkeypatch): | ||
| seen: dict[str, str] = {} | ||
| | ||
| def handler(request: httpx.Request) -> httpx.Response: | ||
| seen["ua"] = request.headers["user-agent"] | ||
| return httpx.Response(200, text="<html>ok</html>") | ||
| | ||
| _client_returning(monkeypatch, handler) | ||
| assert webpage._fetch_html("https://example.com/post") == "<html>ok</html>" | ||
| # The browser-like UA is sent so sites don't serve a stub/block page. | ||
| assert "assembly-cli" in seen["ua"] | ||
| | ||
| | ||
| def test_fetch_html_follows_redirects(monkeypatch): | ||
| # A 301 must be followed to the final 200; without follow_redirects the | ||
| # client would return the empty 301 body instead of the article. | ||
| def handler(request: httpx.Request) -> httpx.Response: | ||
| if request.url.path == "/start": | ||
| return httpx.Response(301, headers={"Location": "https://example.com/final"}) | ||
| return httpx.Response(200, text="final body") | ||
| | ||
| _client_returning(monkeypatch, handler) | ||
| assert webpage._fetch_html("https://example.com/start") == "final body" | ||
| | ||
| | ||
| def test_fetch_html_non_2xx_becomes_api_error(monkeypatch): | ||
| _client_returning(monkeypatch, lambda request: httpx.Response(404, text="nope")) | ||
| with pytest.raises(APIError) as exc: | ||
| webpage._fetch_html("https://example.com/missing") | ||
| assert "https://example.com/missing" in exc.value.message | ||
| | ||
| | ||
| def test_fetch_html_connect_error_becomes_api_error(monkeypatch): | ||
| def handler(request: httpx.Request) -> httpx.Response: | ||
| raise httpx.ConnectError("boom") | ||
| | ||
| _client_returning(monkeypatch, handler) | ||
| with pytest.raises(APIError): | ||
| webpage._fetch_html("https://example.com/post") | ||
| | ||
| | ||
| def test_extract_strips_boilerplate_and_comments_and_reads_title(): | ||
| text, title = webpage._extract(ARTICLE_HTML) | ||
| assert text is not None | ||
| # The article body survives... | ||
| assert "first real paragraph of the article body" in text | ||
| # ...while the nav and footer chrome are dropped. | ||
| assert "NavBoilerplate" not in text | ||
| assert "FooterBoilerplate" not in text | ||
| # The <title> drives the extracted title. | ||
| assert title == "The Real Headline" | ||
| | ||
| | ||
| def test_fetch_article_rejects_non_http_url(): | ||
| with pytest.raises(UsageError) as exc: | ||
| webpage.fetch_article("ftp://example.com/file") | ||
| assert "Not a web page URL" in exc.value.message | ||
| assert "http" in (exc.value.suggestion or "") | ||
| | ||
| | ||
| def test_fetch_article_returns_extracted_text_and_title(monkeypatch): | ||
| monkeypatch.setattr(webpage, "_fetch_html", lambda url: ARTICLE_HTML) | ||
| article = webpage.fetch_article("https://example.com/post") | ||
| assert "first real paragraph of the article body" in article.text | ||
| assert article.title == "The Real Headline" | ||
| assert article.url == "https://example.com/post" | ||
| | ||
| | ||
| def test_fetch_article_without_readable_text_is_a_usage_error(monkeypatch): | ||
| # A page trafilatura can't extract an article from yields no text -> usage error. | ||
| monkeypatch.setattr(webpage, "_fetch_html", lambda url: "<html><body></body></html>") | ||
| with pytest.raises(UsageError) as exc: | ||
| webpage.fetch_article("https://example.com/empty") | ||
| assert "Couldn't find readable text" in exc.value.message |
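The `_client_returning` helper in the test file works by wrapping the real client constructor and injecting one keyword argument. A rough sketch of that pattern in isolation follows, using stand-in objects rather than httpx2. The `Client` class, `patch_client`, and `production_code` are all hypothetical names for illustration.

```python
from types import SimpleNamespace


class Client:
    """Stand-in for an HTTP client class; hypothetical, not httpx2's real API."""

    def __init__(self, timeout=None, transport=None):
        self.timeout = timeout
        self.transport = transport


def patch_client(namespace, transport):
    """Swap namespace.Client for a factory that always injects `transport`."""
    # Capture the real constructor before patching; calling namespace.Client
    # inside the factory would recurse into the factory itself.
    real_client = namespace.Client

    def fake_client(*args, **kwargs):
        kwargs["transport"] = transport
        return real_client(*args, **kwargs)

    namespace.Client = fake_client


def production_code(namespace):
    # The attribute is looked up at call time, so the patched factory runs
    # even though this function never mentions a transport.
    return namespace.Client(timeout=30.0)


namespace = SimpleNamespace(Client=Client)
patch_client(namespace, transport="fake-transport")
client = production_code(namespace)
```

In the actual test the swap goes through pytest's `monkeypatch.setattr`, which also restores the original attribute when the test ends. The sketch skips that cleanup.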
APIError includes the raw user-provided URL and exception in its message (f"Couldn't fetch {url}: {exc}"). Avoid embedding unsanitized URLs in error text; sanitize or redact before including in messages.
Details
✨ AI Reasoning
The exception handler constructs an APIError embedding the requested URL and the HTTP exception (f"Couldn't fetch {url}: {exc}"). If these errors are logged or displayed, the raw URL (and possibly sensitive query strings) will be exposed and may allow log injection via crafted input.
🔧 How do I fix it?
Keep sensitive data such as emails, passwords, and tokens out of logs. When logging values tied to a user, prefer a safe identifier like a user ID over the raw input, and strip line breaks from any user-provided text you do log.
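One way to act on this advice is sketched below, assuming only the standard library. The helper strips credentials, the query string, the fragment, and control characters from a URL before it is placed in an error message. The name `redact_url` and the placeholder string it returns for unparseable input are hypothetical and not part of the PR.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return a log-safe form of `url`: scheme, host, optional port, and path only.

    Hypothetical helper for illustration; not part of the PR.
    """
    # Drop non-printable characters (including CR and LF) first, so the
    # value cannot start a new line in a log.
    cleaned = "".join(ch for ch in url if ch.isprintable())
    try:
        parts = urlsplit(cleaned)
        host = parts.hostname or ""
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return "<unparseable URL>"
    if ":" in host:
        # IPv6 literal: .hostname drops the brackets, so put them back.
        host = f"[{host}]"
    if port is not None:
        host = f"{host}:{port}"
    # Rebuild without userinfo, query, or fragment.
    return urlunsplit((parts.scheme, host, parts.path, "", ""))
```

For example, `redact_url("https://user:pw@example.com/post?token=abc#frag")` returns `"https://example.com/post"`. The `{exc}` text in the same message may repeat the requested URL, so it would likely need the same treatment.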