Skip to content

Comments

handle whatwg encoding standard overrides#6091

Open
anonrig wants to merge 3 commits intomainfrom
yagiz/fix-textdecoder
Open

handle whatwg encoding standard overrides#6091
anonrig wants to merge 3 commits intomainfrom
yagiz/fix-textdecoder

Conversation

@anonrig
Copy link
Member

@anonrig anonrig commented Feb 17, 2026

Fixes #6038
Fixes #5381
Fixes #5380

The solution became a lot more complicated then I wanted it to be, but unfortunately couldn't find a "more maintainable" version of the solution.

Open to suggestions!


TLDR: ICU implementation for these codec are not unsafe and these codecs not used at all on ICU by Chromium: https://chromium.googlesource.com/chromium/src/third_party/+/refs/heads/main/blink/renderer/platform/wtf/text/text_codec_cjk.cc

@anonrig anonrig requested a review from jasnell February 17, 2026 16:48
@anonrig anonrig requested review from a team as code owners February 17, 2026 16:48
Copy link

@ChALkeR ChALkeR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this pass on full wpt ranges now?
If not, this is likely too complicated for a half-fix

@anonrig
Copy link
Member Author

anonrig commented Feb 17, 2026

Does this pass on full wpt ranges now?

It should pass now. We have a bug in our test harness that stops us from loading big5 table that WPT expects. Fixing it with this pull-request.

@ChALkeR
Copy link

ChALkeR commented Feb 17, 2026

@anonrig most of WPT tests for legacy multibyte are in html form.
You can use this function to run them without complex logic: https://github.com/ExodusOSS/bytes/blob/main/tests/wpt/loader.cjs#L137

It's a stand-alone function basically (remove encode test)

You want to run these tests from WPT:

encoding/legacy-mb-tchinese/big5/big5_chars-csbig5.html
encoding/legacy-mb-tchinese/big5/big5_chars_extra.html
encoding/legacy-mb-tchinese/big5/big5_chars-big5-hkscs.html
encoding/legacy-mb-tchinese/big5/big5_errors.html
encoding/legacy-mb-tchinese/big5/big5_chars-x-x-big5.html
encoding/legacy-mb-tchinese/big5/big5_chars.html
encoding/legacy-mb-tchinese/big5/big5_chars-cn-big5.html
encoding/legacy-mb-japanese/shift_jis/sjis_errors.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-ms932.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-shift-jis.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-csshiftjis.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-sjis.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-ms_kanji.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-windows-31j.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-x-sjis.html
encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_errors.html
encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_chars-csiso2022jp.html
encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_chars.html
encoding/legacy-mb-japanese/euc-jp/eucjp_errors.html
encoding/legacy-mb-japanese/euc-jp/eucjp_chars-cseucpkdfmtjapanese.html
encoding/legacy-mb-japanese/euc-jp/eucjp_chars-x-euc-jp.html
encoding/legacy-mb-japanese/euc-jp/eucjp_chars.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-csksc56011987.html
encoding/legacy-mb-korean/euc-kr/euckr_chars.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ks_c_5601-1989.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-korean.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-iso-ir-149.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-windows-949.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-cseuckr.html
encoding/legacy-mb-korean/euc-kr/euckr_errors.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ksc_5601.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ks_c_5601-1987.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ksc5601.html

This won't cover streaming, but would cover character ranges

@codspeed-hq

This comment was marked as off-topic.

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from 6525e3e to a54fab8 Compare February 17, 2026 18:29
@ChALkeR
Copy link

ChALkeR commented Feb 17, 2026

@anonrig I'm unsure if the solution shouldn't be "drop ICU-based impl and copy over the code from @exodus/bytes instead"

I'm going to test that on Node.js shortly

That depends on whether the approach in this PR works (including on streaming) and the perf of ICU here though.

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from ee14ac5 to c2927c4 Compare February 18, 2026 19:33
@github-actions
Copy link

github-actions bot commented Feb 18, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from 23e69fb to 12792c2 Compare February 18, 2026 20:37
@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch 3 times, most recently from 272a120 to 3c0ee73 Compare February 19, 2026 22:39
@anonrig anonrig requested a review from jasnell February 19, 2026 22:39
@anonrig anonrig requested a review from mikea February 19, 2026 22:39
@anonrig
Copy link
Member Author

anonrig commented Feb 19, 2026

My research shows that Chromium and other projects are migrating to Rust equivalent crate: encoding_rs https://source.chromium.org/search?q=big5&ss=chromium

The new approach moves custom decoders to encoding_rs library (which was already a dependency of workerd due to lolhtml crate)

@anonrig
Copy link
Member Author

anonrig commented Feb 19, 2026

One missing thing from this pull-request is, we have to put everything behind a compat flag. Right now we use ICUDecoder for those legacy encodings. If we keep using them and only use the new implementation behind a compat flag, we won't have any breaking changes.

@ChALkeR
Copy link

ChALkeR commented Feb 20, 2026

@anonrig encoding_rs is pretty good according to my tests!
(Deno and Servo use encoding_rs, and they are mostly green)

And replacing ICU with it is a decent solution

It has some less significant bugs though

Specifically, you have to short-cut empty streaming input to noop, encoding_rs misbehaves on that (hsivonen/encoding_rs#126).

There might something else but the rest is likely on fatal streaming side which honestly is just spec oversight imo and should never be used.

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch 2 times, most recently from fc566eb to b311527 Compare February 20, 2026 19:55
@anonrig anonrig requested review from guybedford and mikea February 20, 2026 19:55
@jasnell
Copy link
Collaborator

jasnell commented Feb 20, 2026

C++ and JS parts look good. Looked over the rust parts but feel more comfortable deferring to someone like @mikea on signoff of the rust bits.

@jasnell
Copy link
Collaborator

jasnell commented Feb 21, 2026

I'm trying out a PR review agent locally... expect some code review comments from it shortly

Copy link
Collaborator

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architectural Review (AI-generated)

Overall this is a solid direction — replacing ICU with encoding_rs for WHATWG-compliant CJK decoding is well-motivated and the CXX bridge follows established patterns in the codebase. The test coverage is thorough and the WPT infrastructure fixes are valuable.

Inline comments below cover the main concerns. The two highest-priority items are: (1) an unconditional performance regression for windows-1252/x-user-defined, and (2) an inconsistent reset contract between decode() and decodePtr() that could cause issues for TextDecoderStream with LegacyDecoder.

@anonrig anonrig requested a review from jasnell February 23, 2026 15:50
@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from 0734371 to b1efc7a Compare February 23, 2026 15:50
@anonrig
Copy link
Member Author

anonrig commented Feb 23, 2026

cc @mikea @jasnell can you take a look? this is ready to merge.

Copy link
Collaborator

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the C++ bits. Will leave rust bits for @mikea

@anonrig
Copy link
Member Author

anonrig commented Feb 23, 2026

/bonk review this change.

@ask-bonk
Copy link

ask-bonk bot commented Feb 23, 2026

I read all 18 changed files in the PR (Rust bridge code, C++ decoder implementations, shared headers, encoding.c++, encoding.h, test files, compat flag definitions, BUILD files, and the TextDecoderStream usage in streams/encoding.c++) and cross-referenced the review comments to understand which concerns had been addressed, then analyzed the compat flag gating logic, reset/flush contracts between IcuDecoder and LegacyDecoder, and the encoding routing in the constructor to produce a structured review identifying one high-priority issue (ungated windows-1252/x-user-defined path change), two medium issues, and several nits.

github run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

legacy multi-byte encodings in TextDecoder are incorrect Figure out errors relating to GBK encoding Figure out errors relating to GB18030 decoding

4 participants