fix: guard trafilatura import to prevent cascading tool load failure on Python 3.13 #1597

Open
he-yufeng wants to merge 2 commits into MoonshotAI:main from he-yufeng:fix/trafilatura-import-guard

he-yufeng commented Mar 27, 2026

Summary

On Python 3.13, charset-normalizer ships mypyc-compiled .so binaries that are incompatible with the interpreter, causing trafilatura to fail at import time. Since web/__init__.py unconditionally does from .fetch import FetchURL (which has a bare import trafilatura at module level), the entire web package fails to load — taking SearchWeb down with it even though SearchWeb has zero trafilatura dependency.

Changes:

  • Wrap the trafilatura import in try/except, set a _has_trafilatura flag
  • When trafilatura is unavailable, fetch_with_http_get falls back to returning raw page content (trimmed to 50k chars) instead of crashing
  • Service-based fetch path (_fetch_with_service) is completely unaffected
  • SearchWeb now loads normally regardless of the trafilatura situation
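A minimal sketch of the guard and fallback described in the list above. The names `page_content` and `MAX_RAW_CHARS` are hypothetical, and the broad `except Exception` is an assumption, since the exact exception caught in the PR is not shown here.

```python
from __future__ import annotations

try:
    import trafilatura  # may fail at import time, e.g. on Python 3.13 as described above

    _has_trafilatura = True
except Exception:
    # Assumption: catch broadly, because a broken compiled dependency may not
    # surface as a plain ImportError. The PR's actual except clause may differ.
    trafilatura = None
    _has_trafilatura = False

MAX_RAW_CHARS = 50_000  # hypothetical constant for the 50k-character trim


def page_content(resp_text: str, has_extractor: bool = _has_trafilatura) -> tuple[str, str]:
    """Hypothetical helper: return (content, note) for a fetched page body."""
    if not has_extractor:
        # Fallback path: hand back the raw body, trimmed, instead of crashing.
        return (
            resp_text[:MAX_RAW_CHARS],
            "trafilatura is not available; returning raw page content.",
        )
    extracted = trafilatura.extract(resp_text)  # returns None when nothing is extracted
    return extracted or "", "extracted with trafilatura"
```

Because the import failure is absorbed at module level, importing this module always succeeds, so sibling tools imported by the same package are no longer affected.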

Root Cause

charset-normalizer (mypyc .so) incompatible with Python 3.13
  → trafilatura import fails
    → fetch.py fails to load
      → web/__init__.py fails to load
        → both FetchURL AND SearchWeb become "Invalid tools"
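
The chain above can be reproduced with a toy package. This is only an illustration: `demo_web` and `demo_missing_dependency` are hypothetical names standing in for the `web` package and the failing trafilatura import, and the real failure may raise a different exception type.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a toy package whose __init__ eagerly imports both tools."""
    pkg = root / "demo_web"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(
        "from .fetch import FetchURL\nfrom .search import SearchWeb\n"
    )
    # fetch.py has a module-level import that cannot be satisfied.
    (pkg / "fetch.py").write_text(
        "import demo_missing_dependency\n\n\nclass FetchURL:\n    pass\n"
    )
    # search.py has no dependency on the missing module.
    (pkg / "search.py").write_text("class SearchWeb:\n    pass\n")


def try_import(name: str) -> str:
    try:
        importlib.import_module(name)
    except ImportError as exc:
        return f"failed: {type(exc).__name__}"
    return "ok"


def run_demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # Importing the submodule runs demo_web/__init__.py first,
            # which pulls in fetch.py and its failing import.
            return try_import("demo_web.search")
        finally:
            sys.path.remove(tmp)


result = run_demo()
print(result)
```

Even though `search.py` never touches the missing dependency, importing it fails, because Python must execute the package's `__init__.py` before any submodule.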

Test Plan

  • Verify SearchWeb loads on Python 3.13 without the charset-normalizer workaround
  • Verify FetchURL loads and returns raw content for HTML pages when trafilatura is unavailable
  • Verify existing FetchURL behavior unchanged when trafilatura is available
  • Existing tests pass (pytest tests/tools/test_fetch_url.py)
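
One way to exercise the "trafilatura is unavailable" cases without a Python 3.13 environment is to block the import in a test. The sketch below is an assumption about how such a test could be set up, not part of this PR; `module_unavailable` and `import_or_none` are hypothetical helpers. It relies on the documented behavior that a `None` entry in `sys.modules` makes the import raise `ImportError`.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def module_unavailable(name: str):
    """Temporarily make `import name` fail, then restore the previous state."""
    missing = object()
    saved = sys.modules.get(name, missing)
    sys.modules[name] = None  # a None entry halts the import with ImportError
    try:
        yield
    finally:
        if saved is missing:
            del sys.modules[name]
        else:
            sys.modules[name] = saved


def import_or_none(name: str):
    """Return the module, or None if it cannot be imported (mirrors an import guard)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


with module_unavailable("trafilatura"):
    blocked = import_or_none("trafilatura")
print(blocked)
```

A real test would reload the module under test inside the `with` block so that its guarded import runs while trafilatura is blocked.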

Fixes #1572


Contributor

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

@RealKai42
Collaborator

You can run make check before you create the PR.

…on Python 3.13

On Python 3.13, charset-normalizer ships mypyc-compiled binaries that
are incompatible, causing trafilatura to fail at import time. Since
web/__init__.py unconditionally imports FetchURL (which imports
trafilatura at module level), the entire web package fails to load —
taking SearchWeb down with it even though SearchWeb has no trafilatura
dependency.

Wrap the trafilatura import in try/except and fall back to returning
raw page content when it's unavailable. This keeps both tools loadable
regardless of the trafilatura situation.

Fixes MoonshotAI#1572
he-yufeng force-pushed the fix/trafilatura-import-guard branch from 6f04ab1 to d78db0c on April 1, 2026 15:13
@he-yufeng
Author

Rebased and verified — ruff check and ruff format --check both pass. Sorry about that, will run make check before future PRs.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d78db0cc04

Comment on lines +101 to +105

if not _has_trafilatura:
    # trafilatura unavailable (e.g. charset-normalizer binary
    # incompatible with current Python), return raw HTML trimmed
    builder.write(resp_text[:50000])
    return builder.ok("trafilatura is not available; returning raw page content.")

P2: Restrict raw-content fallback to HTML responses

When trafilatura is unavailable, this branch returns ok for every non-empty response body, regardless of Content-Type. In fetch_with_http_get, that means binary/media endpoints (for example application/pdf or images) now get treated as successful page fetches and can return garbled decoded bytes, whereas the previous behavior would fail extraction and surface an error. This is a regression for agents that rely on is_error to decide whether to retry with another tool/path; consider gating this fallback to HTML/text-like types and preserving an error for non-text content.
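
A rough sketch of the gating suggested in this comment, assuming the response's Content-Type header is available at the fallback point. `is_text_like`, `raw_fallback`, and the set of accepted media types are hypothetical choices for illustration, not code from this PR.

```python
from __future__ import annotations

# Assumption: non-"text/*" media types that are still safe to return as text.
TEXT_LIKE_TYPES = {"application/xhtml+xml", "application/xml", "application/json"}


def is_text_like(content_type: str | None) -> bool:
    """Return True for HTML/text-like media types, ignoring parameters such as charset."""
    if not content_type:
        return False  # assumption: a missing header is treated as non-text
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/") or media_type in TEXT_LIKE_TYPES


def raw_fallback(resp_text: str, content_type: str | None, limit: int = 50_000) -> tuple[bool, str]:
    """Return (ok, payload): trimmed raw text for text-like bodies, an error message otherwise."""
    if not is_text_like(content_type):
        return False, f"cannot return raw content for {content_type!r} without trafilatura"
    return True, resp_text[:limit]
```

With a check like this, a PDF or image response would still surface as an error, so callers that inspect the error state can retry with another tool.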

Development

Successfully merging this pull request may close these issues.

Invalid tools: ['kimi_cli.tools.web:SearchWeb', 'kimi_cli.tools.web:FetchURL']
