feat: add curl_cffi fallback to bypass WAF/TLS-fingerprint blocks #1471
Open
marcusosterberg wants to merge 2 commits into
…path

addIssue() was defined with three positional arguments but called with four in the failure branch of run_test(). The extra argument was a text string carrying site-unavailable context that was never actually stored on the sub-issue.

This commit adds `text` as an optional fourth parameter on addIssue and stores it on the sub-issue when provided. The call site is updated to pass arguments in the correct order: (result_dict, rule_id, url, text). All ~25 existing 3-argument call sites continue to work unchanged.

Triggered by any site where the initial HTTP request raises ConnectionError, e.g. WAF-protected sites that drop python-requests at the TLS handshake (bolagsverket.se and other Swedish government sites confirmed).
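A minimal sketch of the signature change described above. The real `addIssue()` body is not shown on this page, so the snake_case name `add_issue`, the dictionary layout (`subIssues`, `url`, `text` keys), and the rule id `no-network` are assumptions made purely for illustration. Only the argument order `(result_dict, rule_id, url, text)` and the optional `text` come from the commit message.

```python
def add_issue(result_dict, rule_id, url, text=None):
    """Record a sub-issue under rule_id; `text` is optional.

    Hypothetical data layout, not webperf_core's actual structure.
    """
    sub_issue = {"url": url}
    # Store the extra context only when the caller provides it, so the
    # existing three-argument call sites behave exactly as before.
    if text is not None:
        sub_issue["text"] = text
    issue = result_dict.setdefault(rule_id, {"subIssues": []})
    issue["subIssues"].append(sub_issue)
    return sub_issue


results = {}
# Old-style call with three arguments keeps working.
add_issue(results, "no-network", "https://example.org/")
# New-style call passes the site-unavailable context as the fourth argument.
add_issue(results, "no-network", "https://example.org/", "site unavailable")
```

Defaulting `text` to `None` rather than an empty string lets the function tell "not provided" apart from "provided but empty".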
Some enterprise WAF appliances (Akamai, Imperva, F5 ASM) drop the TCP connection from python-requests at the TLS ClientHello stage. The user gets requests.exceptions.ConnectionError instead of an HTTP response, even though a real browser requesting the same URL succeeds. This blocks webperf_core from fetching standard files (robots.txt, security.txt, etc.) on WAF-protected sites, including several Swedish government agencies.

This commit:
- Adds helpers/http_helper.py exposing http_get_with_fallback(), which tries plain requests.get() first and falls back to curl_cffi (with impersonate="chrome131") on ConnectionError. If curl_cffi is not installed or also fails, the original error is re-raised so existing exception handling continues unchanged.
- Updates tests/utils.py:get_http_content() to route through the helper (one import + one call-site swap; surrounding code unchanged).
- Adds curl-cffi>=0.13.0 to requirements.txt.

Debug logging is available via WEBPERF_HTTP_HELPER_DEBUG=1 (prints fallback activity to stderr).

Verified against https://bolagsverket.se: it previously failed with four ConnectionError messages and an empty result; it now fetches robots.txt and security.txt successfully via the fallback.
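The fallback order described in this commit message can be sketched roughly as follows. This is an illustration, not the code in helpers/http_helper.py: the names `get_with_fallback`, `http_get_sketch`, and `_debug` are hypothetical, and the fetchers are passed in as callables so the control flow can be followed without network access. It also assumes that the keyword arguments being forwarded (for example `timeout` or `headers`) are accepted by both `requests.get` and `curl_cffi.requests.get`.

```python
import os
import sys


def _debug(message):
    """Print fallback activity to stderr when WEBPERF_HTTP_HELPER_DEBUG=1."""
    if os.environ.get("WEBPERF_HTTP_HELPER_DEBUG") == "1":
        print(f"[http_helper] {message}", file=sys.stderr)


def get_with_fallback(url, primary, fallback=None,
                      retry_on=(ConnectionError,), **kwargs):
    """Try primary(url); on a `retry_on` error, try fallback(url) once.

    If there is no fallback, or the fallback fails too, the ORIGINAL
    error is raised so the caller's existing except blocks still match.
    """
    try:
        return primary(url, **kwargs)
    except retry_on as original_error:
        if fallback is None:
            _debug(f"no fallback available for {url}")
            raise
        _debug(f"primary request failed for {url}, trying fallback")
        try:
            return fallback(url, **kwargs)
        except Exception as fallback_error:
            _debug(f"fallback also failed: {fallback_error!r}")
            raise original_error


def http_get_sketch(url, **kwargs):
    """Wire in the real libraries (hypothetical wrapper, lazy imports)."""
    import requests

    try:
        from curl_cffi import requests as curl_requests
    except ImportError:
        fallback = None  # curl_cffi missing: degrade to plain requests.get
    else:
        def fallback(target, **kw):
            # Present a Chrome-like TLS fingerprint to the server.
            return curl_requests.get(target, impersonate="chrome131", **kw)

    return get_with_fallback(
        url, requests.get, fallback,
        retry_on=(requests.exceptions.ConnectionError,), **kwargs)
```

Only the exception types listed in `retry_on` trigger the fallback, so SSL errors, timeouts, and redirect errors raised by the primary fetcher propagate untouched, which mirrors the scope the commit message describes.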
Collaborator
@marcusosterberg I agree with the problem description, but I highly recommend a different solution. Using sitespeed instead of the approach suggested in this PR has the benefit of supporting more browsers (read: Firefox).
7h3Rabbit
requested changes
May 18, 2026
Collaborator
7h3Rabbit
left a comment
This solution is not recommended. Please use sitespeed for this instead (read: use a real browser instead of once again trying to lie about the fact that we are not a real browser).
Using sitespeed has the benefit of using the same logic across webperf-core.
Resolves #1470
Depends on #1469 — please merge that one first.
What this changes

- `helpers/http_helper.py` exposing `http_get_with_fallback()`
- `tests/utils.py`: two-line edit in `get_http_content` (one import + one call-site swap)
- `requirements.txt`: adds `curl-cffi>=0.13.0`

How it works

- `requests.get` is tried first: fast, identical to existing behaviour
- On `ConnectionError`, retry once via `curl_cffi` with `impersonate="chrome131"`
- If `curl_cffi` is unavailable or also fails, the original `requests.exceptions.ConnectionError` is re-raised so the calling `except` block in `get_http_content` behaves exactly as before

Existing exception handling for SSL errors, redirects, timeouts, etc. is untouched because the helper only handles `ConnectionError`.

Testing

Verified locally against https://bolagsverket.se:

- Before: `Connection error!` messages, empty result file
- After: `robots.txt` and `security.txt` fetched successfully via fallback; standard-files test produces a real rating

Set `WEBPERF_HTTP_HELPER_DEBUG=1` to see fallback activity (printed to stderr).

Backward compatibility

- `curl-cffi` is wrapped in `try/except ImportError`: if it is not installed, the helper degrades to plain `requests.get`
- Apart from the import and the call-site swap, `get_http_content` is unchanged
- `curl_cffi`'s response object is drop-in compatible with `requests.Response` for the attributes actually used by webperf_core (`.text`, `.content`, `.status_code`, `.headers`)
- `curl-cffi` ships pre-built wheels for all platforms the project supports

Out of scope

- The `requests.get` call in `get_url_headers()` (line 650 in `tests/utils.py`): that one is for HEAD requests and was not observed to fail on bolagsverket. Can be addressed in a follow-up if needed.
- The `chrome131` default is sufficient for currently observed blocks. A future PR could expose it via settings if multiple profiles are needed.

Standards note

The files this helper is most often used to fetch, `/robots.txt` (RFC 9309) and `/.well-known/security.txt` (RFC 9116), are by definition intended to be machine-readable by automated tools. Sites that WAF-block them are arguably in violation of the relevant RFCs. This PR is a pragmatic workaround on the consumer side.