Conversation
ChALkeR
left a comment
There was a problem hiding this comment.
Does this pass on full wpt ranges now?
If not, this is likely too complicated for a half-fix
It should pass now. We have a bug in our test harness that stops us from loading big5 table that WPT expects. Fixing it with this pull-request. |
|
@anonrig most of WPT tests for legacy multibyte are in html form. It's a stand-alone function basically (remove You want to run these tests from WPT: This won't cover streaming, but would cover character ranges |
This comment was marked as off-topic.
This comment was marked as off-topic.
6525e3e to
a54fab8
Compare
|
@anonrig I'm unsure if the solution shouldn't be "drop ICU-based impl and copy over the code from @exodus/bytes instead" I'm going to test that on Node.js shortly That depends on whether the approach in this PR works (including on streaming) and the perf of ICU here though. |
ee14ac5 to
c2927c4
Compare
|
The generated output of |
23e69fb to
12792c2
Compare
272a120 to
3c0ee73
Compare
|
My research shows that Chromium and other projects are migrating to Rust equivalent crate: encoding_rs https://source.chromium.org/search?q=big5&ss=chromium The new approach moves custom decoders to encoding_rs library (which was already a dependency of workerd due to lolhtml crate) |
|
One missing thing from this pull-request is, we have to put everything behind a compat flag. Right now we use ICUDecoder for those legacy encodings. If we keep using them and only use the new implementation behind a compat flag, we won't have any breaking changes. |
|
@anonrig encoding_rs is pretty good according to my tests! And replacing ICU with it is a decent solution It has some less significant bugs though Specifically, you have to short-cut empty streaming input to noop, encoding_rs misbehaves on that (hsivonen/encoding_rs#126). There might something else but the rest is likely on fatal streaming side which honestly is just spec oversight imo and should never be used. |
fc566eb to
b311527
Compare
|
C++ and JS parts look good. Looked over the rust parts but feel more comfortable deferring to someone like @mikea on signoff of the rust bits. |
|
I'm trying out a PR review agent locally... expect some code review comments from it shortly |
jasnell
left a comment
There was a problem hiding this comment.
Architectural Review (AI-generated)
Overall this is a solid direction — replacing ICU with encoding_rs for WHATWG-compliant CJK decoding is well-motivated and the CXX bridge follows established patterns in the codebase. The test coverage is thorough and the WPT infrastructure fixes are valuable.
Inline comments below cover the main concerns. The two highest-priority items are: (1) an unconditional performance regression for windows-1252/x-user-defined, and (2) an inconsistent reset contract between decode() and decodePtr() that could cause issues for TextDecoderStream with LegacyDecoder.
0cc8d2d to
0734371
Compare
0734371 to
b1efc7a
Compare
|
/bonk review this change. |
|
I read all 18 changed files in the PR (Rust bridge code, C++ decoder implementations, shared headers, encoding.c++, encoding.h, test files, compat flag definitions, BUILD files, and the TextDecoderStream usage in streams/encoding.c++) and cross-referenced the review comments to understand which concerns had been addressed, then analyzed the compat flag gating logic, reset/flush contracts between IcuDecoder and LegacyDecoder, and the encoding routing in the constructor to produce a structured review identifying one high-priority issue (ungated windows-1252/x-user-defined path change), two medium issues, and several nits. |
Fixes #6038
Fixes #5381
Fixes #5380
The solution became a lot more complicated then I wanted it to be, but unfortunately couldn't find a "more maintainable" version of the solution.
Open to suggestions!
TLDR: ICU implementation for these codec are not unsafe and these codecs not used at all on ICU by Chromium: https://chromium.googlesource.com/chromium/src/third_party/+/refs/heads/main/blink/renderer/platform/wtf/text/text_codec_cjk.cc