Skip to content

_curl_debug: decode pycurl bytes as latin-1 before native_str#3659

Open
HrachShah wants to merge 1 commit into
tornadoweb:masterfrom
HrachShah:fix/curl-debug-non-utf8-bytes
Open

_curl_debug: decode pycurl bytes as latin-1 before native_str#3659
HrachShah wants to merge 1 commit into
tornadoweb:masterfrom
HrachShah:fix/curl-debug-non-utf8-bytes

Conversation

@HrachShah

Copy link
Copy Markdown

Fixes #3183

What

CurlAsyncHTTPClient._curl_debug decodes pycurl's DEBUGFUNCTION
debug_msg argument via native_str (which is to_unicode --
UTF-8). pycurl passes raw bytes, not str, and those bytes are not
guaranteed to be valid UTF-8. A non-UTF-8 sequence (e.g. a proxy that
echoes back binary bytes from the upstream response) raised
UnicodeDecodeError and killed the request.

Latin-1 round-trips every byte 0x00-0xFF to the matching U+00xx code
point without raising, so a non-UTF-8 debug message is preserved as
its byte sequence (rendered as the latin-1 characters) instead of
crashing. Valid UTF-8 still decodes losslessly through native_str.
errors="replace" was avoided -- it would silently turn a
non-UTF-8 byte into a replacement character and make proxy-debug
investigations harder.

Implementation

  • New module-level helper _curl_debug_to_unicode keeps the
    byte-vs-str handling in one place and documents the latin-1
    rationale; the two callers in _curl_debug (the
    debug_type == 0 info branch and the debug_type in (1, 2)
    header/data branch) are the only call sites.
  • The debug_type == 4 branch is left alone -- it %r-formats
    debug_msg and passes it to curl_log.debug, which can accept
    bytes directly (the logging module formats bytes via repr
    before interpolation).

Test

CurlDebugTest.test_curl_debug_handles_non_utf8_bytes in
tornado/test/curl_httpclient_test.py exercises debug_type 0,
1, 2 with b"hello \xff world" (and friends). Fails on pre-fix
code with UnicodeDecodeError; passes with the fix. Verified:

python3 -m pytest tornado/test/curl_httpclient_test.py -p no:aiohttp -k 'not strip_headers_on_redirect'
`

48 passed, 1 deselected


(The deselected `test_strip_headers_on_redirect` is an unrelated
pre-existing failure that reproduces on master without this change.)

pycurl's DEBUGFUNCTION passes raw bytes, not str. The pre-fix code
called native_str(debug_msg), where native_str is to_unicode -- which
decodes as UTF-8. A non-UTF-8 byte sequence (issue tornadoweb#3183: a proxy
echoing back binary bytes from the upstream response) raised
UnicodeDecodeError and killed the request.

Latin-1 round-trips every byte 0x00-0xFF to the matching U+00xx code
point without raising, so a non-UTF-8 debug message is preserved as
its byte sequence (rendered as the latin-1 characters) instead of
crashing. Valid UTF-8 still decodes losslessly through native_str.

Added CurlDebugTest.test_curl_debug_handles_non_utf8_bytes in the
existing curl_httpclient_test module. The test fails on the pre-fix
code with UnicodeDecodeError and passes with the fix.

Refs: tornadoweb#3183
@bdarnell

bdarnell commented Jul 4, 2026

Copy link
Copy Markdown
Member

The type annotations don't add up here. Should _curl_debug be typed as str|bytes? native_str is a no-op on type str, so the second call in _curl_debug is unnecessary. The comment here is more verbose than I'd like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UnicodeDecodeError in curl_httpclient's _curl_debug()

2 participants