fix: guard trafilatura import to prevent cascading tool load failure on Python 3.13 #1597
he-yufeng wants to merge 2 commits into MoonshotAI:main
Conversation
fix: guard trafilatura import to prevent cascading tool load failure on Python 3.13

On Python 3.13, charset-normalizer ships mypyc-compiled binaries that are incompatible, causing trafilatura to fail at import time. Since web/__init__.py unconditionally imports FetchURL (which imports trafilatura at module level), the entire web package fails to load, taking SearchWeb down with it even though SearchWeb has no trafilatura dependency. Wrap the trafilatura import in try/except and fall back to returning raw page content when it's unavailable. This keeps both tools loadable regardless of the trafilatura situation. Fixes MoonshotAI#1572
Force-pushed from 6f04ab1 to d78db0c
Rebased and verified.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d78db0cc04
```python
if not _has_trafilatura:
    # trafilatura unavailable (e.g. charset-normalizer binary
    # incompatible with current Python), return raw HTML trimmed
    builder.write(resp_text[:50000])
    return builder.ok("trafilatura is not available; returning raw page content.")
```
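In context, the surrounding fetch logic might look like the following sketch. The standalone function `extract_or_raw` and its signature are illustrative assumptions; the PR's actual code lives inside `fetch_with_http_get`:

```python
def extract_or_raw(resp_text, has_trafilatura):
    """Return extracted main content when trafilatura is importable,
    otherwise fall back to the raw body trimmed to 50,000 characters."""
    if not has_trafilatura:
        # Fallback branch added by this PR: raw page content, trimmed.
        return resp_text[:50000]
    import trafilatura
    extracted = trafilatura.extract(resp_text)
    # trafilatura.extract returns None when it finds no main content,
    # so fall back to the trimmed raw body in that case as well.
    return extracted if extracted else resp_text[:50000]
```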
Restrict raw-content fallback to HTML responses
When trafilatura is unavailable, this branch returns ok for every non-empty response body, regardless of Content-Type. In fetch_with_http_get, that means binary/media endpoints (for example application/pdf or images) now get treated as successful page fetches and can return garbled decoded bytes, whereas the previous behavior would fail extraction and surface an error. This is a regression for agents that rely on is_error to decide whether to retry with another tool/path; consider gating this fallback to HTML/text-like types and preserving an error for non-text content.
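One way to gate the fallback as the review suggests is to check the response's Content-Type header before returning raw text. The helper name and the accepted MIME types below are illustrative, not from the PR:

```python
# Hypothetical helper: decide whether the raw-content fallback should
# treat a response as text-like, based on its Content-Type header.
TEXT_LIKE_PREFIXES = ("text/",)
TEXT_LIKE_TYPES = {"application/xhtml+xml", "application/xml", "application/json"}

def is_text_like(content_type: str) -> bool:
    # Strip parameters such as "; charset=utf-8" and normalize case
    # before comparing against the allowed prefixes and exact types.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith(TEXT_LIKE_PREFIXES) or mime in TEXT_LIKE_TYPES
```

With a check like this, `application/pdf` or image responses would still surface an error (preserving `is_error` semantics for retrying agents), while HTML and other text-like bodies would use the raw-content fallback.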
Summary
On Python 3.13, charset-normalizer ships mypyc-compiled .so binaries that are incompatible with the interpreter, causing trafilatura to fail at import time. Since web/__init__.py unconditionally does from .fetch import FetchURL (which has a bare import trafilatura at module level), the entire web package fails to load, taking SearchWeb down with it even though SearchWeb has zero trafilatura dependency.

Changes:
- Wrap the trafilatura import in try/except and set a _has_trafilatura flag
- fetch_with_http_get falls back to returning raw page content (trimmed to 50k chars) instead of crashing
- The service fetch path (_fetch_with_service) is completely unaffected
- SearchWeb now loads normally regardless of the trafilatura situation

Root Cause
Test Plan
- SearchWeb loads on Python 3.13 without the charset-normalizer workaround
- FetchURL loads and returns raw content for HTML pages when trafilatura is unavailable
- FetchURL behavior unchanged when trafilatura is available
- pytest tests/tools/test_fetch_url.py

Fixes #1572