feat: add curl_cffi fallback to bypass WAF/TLS-fingerprint blocks #1471

Open
marcusosterberg wants to merge 2 commits into main from feat/curl-cffi-waf-fallback

@marcusosterberg
Contributor

Resolves #1470

Depends on #1469 — please merge that one first.

What this changes

  • New file: helpers/http_helper.py exposing http_get_with_fallback()
  • tests/utils.py: two-line edit in get_http_content (one import + one call-site swap)
  • requirements.txt: adds curl-cffi>=0.13.0

How it works

  1. requests.get is tried first — fast, identical to existing behaviour
  2. On ConnectionError, retry once via curl_cffi with impersonate="chrome131"
  3. If curl_cffi is unavailable or also fails, the original requests.exceptions.ConnectionError is re-raised so the calling except block in get_http_content behaves exactly as before

Existing exception handling for SSL errors, redirects, timeouts, etc. is untouched because the helper only handles ConnectionError.
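
The three steps above can be sketched roughly as follows. This is not the code from helpers/http_helper.py, which is not shown here; it is a minimal illustration that takes the two clients as plain callables so it runs without any third-party package. The names get_with_fallback, primary_get, fallback_get and retry_on are hypothetical, and the default retry_on uses Python's built-in ConnectionError purely as a stand-in. In the PR the primary would be requests.get, the fallback would be curl_cffi's get with impersonate="chrome131", and the exception to catch would be requests.exceptions.ConnectionError.

```python
import os
import sys


def debug(message):
    """Print fallback activity to stderr when WEBPERF_HTTP_HELPER_DEBUG=1."""
    if os.environ.get("WEBPERF_HTTP_HELPER_DEBUG") == "1":
        print(f"[http_helper] {message}", file=sys.stderr)


def get_with_fallback(url, primary_get, fallback_get=None,
                      retry_on=(ConnectionError,)):
    """Try primary_get first; retry once via fallback_get on a connection error.

    Hypothetical sketch: primary_get and fallback_get are callables that take
    a URL and return a response-like object. fallback_get=None models the
    case where the fallback library is not installed.
    """
    try:
        # Step 1: the fast path, identical to the existing behaviour.
        return primary_get(url)
    except retry_on as primary_error:
        debug(f"primary fetch failed for {url}: {type(primary_error).__name__}")
        if fallback_get is None:
            # Fallback unavailable: let the original error propagate.
            raise
        try:
            # Step 2: retry exactly once through the fallback client.
            response = fallback_get(url)
        except Exception as fallback_error:
            debug(f"fallback failed for {url}: {type(fallback_error).__name__}")
            # Step 3: re-raise the ORIGINAL error so callers see no change.
            raise primary_error from None
        debug(f"fallback succeeded for {url}")
        return response
```

Because only the exception types listed in retry_on are caught, any other error raised by the primary client (a timeout, for instance) passes straight through to the caller, which is the property the paragraph above relies on.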

Testing

Verified locally against https://bolagsverket.se:

  • Before: four Connection error! messages, empty result file
  • After: robots.txt and security.txt fetched successfully via fallback; standard-files test produces a real rating

Set WEBPERF_HTTP_HELPER_DEBUG=1 to see fallback activity:

[http_helper] primary requests.get failed for https://bolagsverket.se/robots.txt: ConnectionError
[http_helper] falling back to curl_cffi with impersonate=chrome131
[http_helper] curl_cffi succeeded for https://bolagsverket.se/robots.txt: status=200

Backward compatibility

  • curl-cffi is wrapped in try/except ImportError — if it is not installed, the helper degrades to plain requests.get
  • The public API of get_http_content is unchanged
  • curl_cffi's response object is drop-in compatible with requests.Response for the attributes actually used by webperf_core (.text, .content, .status_code, .headers)
  • CI does not need any changes; curl-cffi ships pre-built wheels for all platforms the project supports
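
The optional-dependency behaviour in the first bullet can be illustrated with a small sketch. The helper name load_optional is hypothetical and is not taken from the PR; it only shows the general try/except ImportError pattern, using importlib from the standard library so the example runs whether or not curl-cffi is installed.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed.

    Hypothetical helper: callers check for None and skip the fallback
    path instead of failing at import time.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# None when curl-cffi is missing, the module object otherwise.
curl_requests = load_optional("curl_cffi.requests")
FALLBACK_AVAILABLE = curl_requests is not None
```

With this shape, code that wants the fallback only needs to test FALLBACK_AVAILABLE (an assumed name) before using it, and an environment without curl-cffi behaves like the plain requests.get path.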

Out of scope

  • The requests.get call in get_url_headers() (line 650 in tests/utils.py): that one is for HEAD requests and was not observed to fail on bolagsverket.se. It can be addressed in a follow-up if needed.
  • Configurable impersonation profile — chrome131 default is sufficient for currently observed blocks. A future PR could expose it via settings if multiple profiles are needed.

Standards note

The files this helper is most often used to fetch — /robots.txt (RFC 9309) and /.well-known/security.txt (RFC 9116) — are by definition intended to be machine-readable by automated tools. Sites that WAF-block them are arguably in violation of the relevant RFCs. This PR is a pragmatic workaround on the consumer side.

…path

addIssue() was defined with three positional arguments but called with
four in the failure branch of run_test(). The extra argument was a text
string carrying site-unavailable context that was never actually stored
on the sub-issue.

This commit adds `text` as an optional fourth parameter on addIssue and
stores it on the sub-issue when provided. The call site is updated to
pass arguments in the correct order: (result_dict, rule_id, url, text).
All ~25 existing 3-argument call sites continue to work unchanged.
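
A sketch of the signature change, for illustration only. The real body of addIssue is not shown in this commit message, so the function name add_issue, the dictionary layout, and the "subIssues" and "text" keys below are all assumptions; only the argument order (result_dict, rule_id, url, text) and the optional fourth parameter come from the description above.

```python
def add_issue(result_dict, rule_id, url, text=None):
    """Record a sub-issue for rule_id; text is optional (assumed layout)."""
    # Assumed storage shape: one entry per rule holding a list of sub-issues.
    issue = result_dict.setdefault(rule_id, {"rule": rule_id, "subIssues": []})
    sub_issue = {"url": url}
    if text is not None:
        # Only stored when the caller supplies it, so existing
        # three-argument call sites produce the same data as before.
        sub_issue["text"] = text
    issue["subIssues"].append(sub_issue)
    return sub_issue
```

Giving the new parameter a default of None is what keeps the existing three-argument call sites valid without modification.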

Triggered by any site where the initial HTTP request raises
ConnectionError, e.g. WAF-protected sites that drop python-requests at
the TLS handshake (bolagsverket.se and other Swedish government sites
confirmed).

Some enterprise WAF appliances (Akamai, Imperva, F5 ASM) drop the TCP
connection from python-requests at the TLS ClientHello stage. The user
gets requests.exceptions.ConnectionError instead of an HTTP response,
even though a real browser to the same URL succeeds. This blocks
webperf_core from fetching standard files (robots.txt, security.txt,
etc.) on WAF-protected sites including several Swedish government
agencies.

This commit:
- Adds helpers/http_helper.py exposing http_get_with_fallback(), which
  tries plain requests.get() first and falls back to curl_cffi (with
  impersonate="chrome131") on ConnectionError. If curl_cffi is not
  installed or also fails, the original error is re-raised so existing
  exception handling continues unchanged.
- Updates tests/utils.py:get_http_content() to route through the helper
  (one import + one call-site swap; surrounding code unchanged).
- Adds curl-cffi>=0.13.0 to requirements.txt.

Debug logging available via WEBPERF_HTTP_HELPER_DEBUG=1 (prints fallback
activity to stderr).

Verified against https://bolagsverket.se — previously failed with four
ConnectionError messages and an empty result; now fetches robots.txt
and security.txt successfully via fallback.
@7h3Rabbit
Collaborator

@marcusosterberg I agree with the problem description, but I highly recommend a different solution.
We should use sitespeed for everything if the normal Python way is causing problems here as well.
We started using sitespeed for everything else to solve problems like this, where a normal browser is better.

Using sitespeed instead of the approach suggested in this PR has the benefit of supporting more browsers (read: Firefox).

Collaborator

@7h3Rabbit 7h3Rabbit left a comment

This solution is not recommended. Please use sitespeed for this instead (read: use a real browser instead of again trying to lie about the fact that we are not a real browser).

Using sitespeed has the benefit of using the same logic across webperf-core.
