Fix modified UTF-8 string encoding/decoding#42
Open
PhilGlass wants to merge 2 commits intomadisp:mainfrom
Open
Fix modified UTF-8 string encoding/decoding#42PhilGlass wants to merge 2 commits intomadisp:mainfrom
PhilGlass wants to merge 2 commits intomadisp:mainfrom
Conversation
1ee51e6 to
22d54d3
Compare
PhilGlass
commented
Feb 26, 2026
Comment on lines
-34
to
+33
| UTF8(UTF_8), | ||
| UTF16(UTF_16LE); | ||
|
|
||
| private final Charset charset; | ||
|
|
||
| Type(Charset charset) { | ||
| this.charset = charset; | ||
| } | ||
|
|
||
| public Charset charset() { | ||
| return charset; | ||
| } | ||
| UTF8, UTF16 |
Author
There was a problem hiding this comment.
This is an API change, so I can revert it (+ deprecate?) if you'd rather maintain strict compatibility. But there's no standard MUTF-8 charset I could use to make charset() return something correct, and I'd be surprised if it were used outside this library.
Comment on lines
-52
to
+41
| * 0-based byte offset from the start of the buffer where the string resides. This should be the | ||
| * location in memory where the string's character count, followed by its byte count, and then | ||
| * followed by the actual string is located. | ||
| * 0-based byte offset from the start of the buffer where the string resides. How this data is | ||
| * interpreted depends on the string's {@code type}. |
Author
There was a problem hiding this comment.
These comments were only correct for UTF-8 strings.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
UTF-8 strings in
.arscfiles are encoded as modified UTF-8, butResourceStringtreats them as regular UTF-8. So if you try to encode/decode a string containing supplementary characters (code points above U+FFFF, outside the BMP) you get garbage. Some examples:"Last night's dinner 🍕""Double-struck: 𝔸𝔹ℂ𝔻𝔼"Testing: This repo doesn't have any automated tests and they aren't trivial to set up - we'd presumably want to run them on an Android device, or at least have the Android build tools create an APK/
.arscfile (that we could then read on a regular JVM) rather than committing binaries. And I don't think it's even possible to create an.arscfile containing UTF-16 strings usingaapt2, you have to useaapt. 😅I ran our internal test suite that grabs the strings from an
.arscfile against this branch (viamavenLocal()) and they all passed, including new ones for the cases above that failed previously.But they only cover decoding, we have no use case for writing out a modified
.arscfile. I got Claude to create some throwaway tests for the encoding path (that just round trip through encode -> decode and check the value is the same, based on our existing test suite), but I have less confidence that part is correct.