_curl_debug: decode pycurl bytes as latin-1 before native_str by HrachShah · Pull Request #3659 · tornadoweb/tornado

HrachShah · 2026-07-01T07:39:29Z

What

CurlAsyncHTTPClient._curl_debug decodes pycurl's DEBUGFUNCTION
debug_msg argument via native_str (which is to_unicode --
UTF-8). pycurl passes raw bytes, not str, and those bytes are not
guaranteed to be valid UTF-8. A non-UTF-8 sequence (e.g. a proxy that
echoes back binary bytes from the upstream response) raised
UnicodeDecodeError and killed the request.

Latin-1 round-trips every byte 0x00-0xFF to the matching U+00xx code
point without raising, so a non-UTF-8 debug message is preserved as
its byte sequence (rendered as the latin-1 characters) instead of
crashing. Valid UTF-8 still decodes losslessly through native_str.
errors="replace" was avoided -- it would silently turn a
non-UTF-8 byte into a replacement character and make proxy-debug
investigations harder.

Implementation

New module-level helper _curl_debug_to_unicode keeps the
byte-vs-str handling in one place and documents the latin-1
rationale; the two callers in _curl_debug (the
debug_type == 0 info branch and the debug_type in (1, 2)
header/data branch) are the only call sites.
The debug_type == 4 branch is left alone -- it %r-formats
debug_msg and passes it to curl_log.debug, which can accept
bytes directly (the logging module formats bytes via repr
before interpolation).

Test

CurlDebugTest.test_curl_debug_handles_non_utf8_bytes in
tornado/test/curl_httpclient_test.py exercises debug_type 0,
1, 2 with b"hello \xff world" (and friends). Fails on pre-fix
code with UnicodeDecodeError; passes with the fix. Verified:

python3 -m pytest tornado/test/curl_httpclient_test.py -p no:aiohttp -k 'not strip_headers_on_redirect'
`

48 passed, 1 deselected


(The deselected `test_strip_headers_on_redirect` is an unrelated
pre-existing failure that reproduces on master without this change.)

pycurl's DEBUGFUNCTION passes raw bytes, not str. The pre-fix code called native_str(debug_msg), where native_str is to_unicode -- which decodes as UTF-8. A non-UTF-8 byte sequence (issue tornadoweb#3183: a proxy echoing back binary bytes from the upstream response) raised UnicodeDecodeError and killed the request. Latin-1 round-trips every byte 0x00-0xFF to the matching U+00xx code point without raising, so a non-UTF-8 debug message is preserved as its byte sequence (rendered as the latin-1 characters) instead of crashing. Valid UTF-8 still decodes losslessly through native_str. Added CurlDebugTest.test_curl_debug_handles_non_utf8_bytes in the existing curl_httpclient_test module. The test fails on the pre-fix code with UnicodeDecodeError and passes with the fix. Refs: tornadoweb#3183

bdarnell · 2026-07-04T20:24:51Z

The type annotations don't add up here. Should _curl_debug be typed as str|bytes? native_str is a no-op on type str, so the second call in _curl_debug is unnecessary. The comment here is more verbose than I'd like.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

_curl_debug: decode pycurl bytes as latin-1 before native_str#3659

_curl_debug: decode pycurl bytes as latin-1 before native_str#3659
HrachShah wants to merge 1 commit into
tornadoweb:masterfrom
HrachShah:fix/curl-debug-non-utf8-bytes

HrachShah commented Jul 1, 2026

Uh oh!

bdarnell commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

HrachShah commented Jul 1, 2026

What

Implementation

Test

Uh oh!

bdarnell commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants