Bug report
Bug description:
When codecs.encode is called with a utf-* encoding and a custom errors handler that returns a str,
passing characters that cannot be encoded (e.g. lone surrogates) just raises UnicodeEncodeError.
The expected (and documented) behavior is that the returned str is encoded and appended to the output.
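For comparison, here is a minimal sketch of the documented behavior, using the ascii codec, where a str replacement is accepted. The handler name and function name are hypothetical and chosen only for this illustration.

```python
import codecs

# Hypothetical handler name, used only for this illustration.
HANDLER_NAME = "demo-str-replacement"

def replace_with_question_mark(exc):
    # Return the replacement str and the position where encoding resumes.
    return ("?", exc.end)

codecs.register_error(HANDLER_NAME, replace_with_question_mark)

# "龍" cannot be encoded as ASCII, so the handler is called once and
# its str replacement is encoded and spliced into the output.
result = codecs.encode("a龍b", "ascii", HANDLER_NAME)
print(result)  # b'a?b'
```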
import codecs

ERRORS_NAME = "returning non-ascii"

# something that is not encodable via `utf-*`
BAD_UTF = "\uD800"  # the first high surrogate character

def register_repl_error(repl: str):
    def error_handle(exc: UnicodeEncodeError) -> tuple[str, int]:
        # replace the offending span with `repl` and resume after it
        return (repl, exc.end)
    codecs.register_error(ERRORS_NAME, error_handle)

def encode_surrogate(encoding: str, repl: str):
    register_repl_error(repl)
    max_enc_len = 9
    pre = f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) "
    try:
        res = codecs.encode(BAD_UTF, encoding, ERRORS_NAME)
    except UnicodeEncodeError as err:
        reason = err.reason
        print(pre + f"raises with {reason=}")
    else:
        print(pre + f"returns {res}")

NON_ASCII = "龍"  # loong in Chinese

## utf-*
for i in ('8', '16', '32', '16-le', '16-be'):
    encode_surrogate("utf-" + i, NON_ASCII)
print('-'*3)

# The following are some non-utf-* encodings, which work fine
## cjk
### zh
for enc in ("gbk", "big5"):
    encode_surrogate(enc, NON_ASCII)
### jp
for enc in ("Shift_JIS", "EUC-JP"):
    encode_surrogate(enc, NON_ASCII)
Output:
codecs.encode('\ud800', encoding=utf-8    ) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16   ) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-32   ) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-le) raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-be) raises with reason='surrogates not allowed'
---
codecs.encode('\ud800', encoding=gbk      ) returns b'\xfd\x88'
codecs.encode('\ud800', encoding=big5     ) returns b'\xc0s'
codecs.encode('\ud800', encoding=Shift_JIS) returns b'\x97\xb4'
codecs.encode('\ud800', encoding=EUC-JP   ) returns b'\xce\xb6'
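A possible workaround, sketched here for utf-8 only, is to have the handler return bytes instead of str, so that the encoder copies them into the output without re-encoding. The handler name and function name below are hypothetical. The bytes are codec-specific, so a handler written this way would need to produce different bytes for utf-16 or utf-32.

```python
import codecs

# Hypothetical handler name, used only for this illustration.
HANDLER_NAME = "demo-bytes-replacement"
REPL = "龍"

def replace_with_utf8_bytes(exc):
    # Pre-encode the replacement; this assumes the target codec is utf-8.
    return (REPL.encode("utf-8"), exc.end)

codecs.register_error(HANDLER_NAME, replace_with_utf8_bytes)

# The lone surrogate is replaced by the pre-encoded bytes.
result = codecs.encode("a\ud800b", "utf-8", HANDLER_NAME)
print(result)
```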
CPython versions tested on:
3.9, 3.11, 3.12, 3.13, 3.14
Operating systems tested on:
Linux, Windows
Linked PRs
register_error's error_handler #139554