Fix modified UTF-8 string encoding/decoding #42

Open
PhilGlass wants to merge 2 commits into madisp:main from PhilGlass:fix_string_decoding

PhilGlass commented Feb 26, 2026

UTF-8 strings in .arsc files are encoded as modified UTF-8, but ResourceString treats them as regular UTF-8. As a result, encoding or decoding a string that contains supplementary characters (code points above U+FFFF, outside the BMP) produces garbage. Some examples:

  • "Last night's dinner 🍕"
  • "Double-struck: 𝔸𝔹ℂ𝔻𝔼"
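
To make the mismatch concrete, here is a minimal JDK-only sketch. It assumes that `DataOutputStream.writeUTF`, which the JDK documents as writing modified UTF-8, is a fair stand-in for the encoding discussed here; it is not code from this PR, and `MutfVsUtf8Demo` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MutfVsUtf8Demo {
  /** Encodes with the JDK's modified UTF-8 writer and drops its 2-byte length prefix. */
  static byte[] modifiedUtf8(String s) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      new DataOutputStream(out).writeUTF(s);
      byte[] withPrefix = out.toByteArray();
      return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Decodes bytes the way a reader that assumes regular UTF-8 would. */
  static String decodeAsStandardUtf8(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // U+1F355 (the pizza emoji from the first example), built without a source literal.
    String pizza = new StringBuilder().appendCodePoint(0x1F355).toString();

    byte[] standard = pizza.getBytes(StandardCharsets.UTF_8);
    byte[] modified = modifiedUtf8(pizza);

    System.out.println("standard UTF-8 bytes: " + standard.length);
    System.out.println("modified UTF-8 bytes: " + modified.length);
    System.out.println("decoded as regular UTF-8 matches original: "
        + decodeAsStandardUtf8(modified).equals(pizza));
  }
}
```

Regular UTF-8 encodes a supplementary code point as one 4-byte sequence, while modified UTF-8 encodes each half of the surrogate pair as its own 3-byte sequence, so a regular UTF-8 decoder does not recover the original string.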

Testing: This repo doesn't have any automated tests, and they aren't trivial to set up. We'd presumably want to run them on an Android device, or at least have the Android build tools create an APK/.arsc file (that we could then read on a regular JVM) rather than committing binaries. And I don't think it's even possible to create an .arsc file containing UTF-16 strings using aapt2; you have to use aapt. 😅

I ran our internal test suite, which reads the strings from an .arsc file, against this branch (via mavenLocal()). All tests passed, including new ones for the cases above that previously failed.

But they only cover decoding; we have no use case for writing out a modified .arsc file. I got Claude to create some throwaway tests for the encoding path (based on our existing test suite, they just round-trip through encode -> decode and check that the value is unchanged), but I have less confidence that part is correct.
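
For reference, a round-trip check of that shape could look roughly like the sketch below. The `encode` and `decode` helpers are hypothetical, hand-written stand-ins that follow the JDK's definition of modified UTF-8; they are not the methods added in this PR, and they skip bounds and continuation-byte validation for brevity.

```java
import java.io.ByteArrayOutputStream;

public class Mutf8RoundTripSketch {
  /** Encodes each UTF-16 code unit on its own, so a surrogate pair becomes 3 + 3 bytes. */
  static byte[] encode(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c != 0 && c < 0x80) {
        out.write(c);
      } else if (c < 0x800) { // also covers U+0000, written as C0 80
        out.write(0xC0 | (c >> 6));
        out.write(0x80 | (c & 0x3F));
      } else {
        out.write(0xE0 | (c >> 12));
        out.write(0x80 | ((c >> 6) & 0x3F));
        out.write(0x80 | (c & 0x3F));
      }
    }
    return out.toByteArray();
  }

  /** Decodes 1-, 2- and 3-byte sequences back into UTF-16 code units. */
  static String decode(byte[] b) {
    StringBuilder sb = new StringBuilder();
    int i = 0;
    while (i < b.length) {
      int b0 = b[i] & 0xFF;
      if (b0 < 0x80) {
        sb.append((char) b0);
        i += 1;
      } else if ((b0 & 0xE0) == 0xC0) {
        sb.append((char) (((b0 & 0x1F) << 6) | (b[i + 1] & 0x3F)));
        i += 2;
      } else if ((b0 & 0xF0) == 0xE0) {
        sb.append((char) (((b0 & 0x0F) << 12) | ((b[i + 1] & 0x3F) << 6) | (b[i + 2] & 0x3F)));
        i += 3;
      } else {
        // Simplification: 4-byte sequences are rejected rather than tolerated.
        throw new IllegalArgumentException("Not modified UTF-8 at byte " + i);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String pizza = "Last night's dinner " + new StringBuilder().appendCodePoint(0x1F355);
    String struck = "Double-struck: " + new StringBuilder().appendCodePoint(0x1D538);
    for (String s : new String[] {pizza, struck, "plain ASCII"}) {
      System.out.println(decode(encode(s)).equals(s));
    }
  }
}
```

A round trip like this only shows that the encoder and decoder agree with each other, which is why it gives less confidence than decoding real .arsc data.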

Comment on lines -34 to +33
-    UTF8(UTF_8),
-    UTF16(UTF_16LE);
-
-    private final Charset charset;
-
-    Type(Charset charset) {
-      this.charset = charset;
-    }
-
-    public Charset charset() {
-      return charset;
-    }
+    UTF8, UTF16
PhilGlass commented Feb 26, 2026

This is an API change, so I can revert it (+ deprecate?) if you'd rather maintain strict compatibility. But there's no standard MUTF-8 charset I could use to make charset() return something correct, and I'd be surprised if it were used outside this library.
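
If strict compatibility is preferred, the revert-and-deprecate option might look something like the sketch below. This is purely illustrative: `TypeCompatSketch` is a hypothetical wrapper, and the Javadoc wording is an assumption, not text from this PR.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TypeCompatSketch {
  public enum Type {
    UTF8, UTF16;

    /**
     * Returns the charset that earlier versions associated with this type.
     *
     * @deprecated UTF8 strings are actually modified UTF-8, which has no standard
     *     {@link Charset}, so the value returned for UTF8 is only an approximation.
     */
    @Deprecated
    public Charset charset() {
      return this == UTF8 ? StandardCharsets.UTF_8 : StandardCharsets.UTF_16LE;
    }
  }

  public static void main(String[] args) {
    // Existing callers keep compiling, with a deprecation warning.
    System.out.println(Type.UTF16.charset());
  }
}
```

The trade-off is that `charset()` would keep returning a value that is not strictly correct for UTF8 strings.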

Comment on lines -52 to +41
-   * 0-based byte offset from the start of the buffer where the string resides. This should be the
-   * location in memory where the string's character count, followed by its byte count, and then
-   * followed by the actual string is located.
+   * 0-based byte offset from the start of the buffer where the string resides. How this data is
+   * interpreted depends on the string's {@code type}.

These comments were only correct for UTF-8 strings.
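
As an illustration of why the layout depends on the type, here is a rough sketch of reading the length prefixes. It reflects my understanding of the Android string pool layout (UTF-8 entries carry a character count and then a byte count, each 1 or 2 bytes; UTF-16 entries carry a single character count of 1 or 2 little-endian 16-bit words). The class and method names are hypothetical and not taken from this library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StringLengthPrefixSketch {
  /** UTF-8 entries: the high bit of the first byte marks a 2-byte length. */
  static int readUtf8Length(ByteBuffer buf) {
    int first = buf.get() & 0xFF;
    if ((first & 0x80) == 0) {
      return first;
    }
    return ((first & 0x7F) << 8) | (buf.get() & 0xFF);
  }

  /** UTF-16 entries: the high bit of the first 16-bit word marks a 2-word length. */
  static int readUtf16Length(ByteBuffer buf) {
    int first = buf.getShort() & 0xFFFF;
    if ((first & 0x8000) == 0) {
      return first;
    }
    return ((first & 0x7FFF) << 16) | (buf.getShort() & 0xFFFF);
  }

  public static void main(String[] args) {
    // Assumed UTF-8 entry header: character count, then byte count.
    ByteBuffer utf8 = ByteBuffer.wrap(new byte[] {0x02, 0x06}).order(ByteOrder.LITTLE_ENDIAN);
    int charCount = readUtf8Length(utf8);
    int byteCount = readUtf8Length(utf8);
    System.out.println("UTF-8 entry: " + charCount + " chars, " + byteCount + " bytes");

    // Assumed UTF-16 entry header: character count only.
    ByteBuffer utf16 = ByteBuffer.wrap(new byte[] {0x05, 0x00}).order(ByteOrder.LITTLE_ENDIAN);
    System.out.println("UTF-16 entry: " + readUtf16Length(utf16) + " chars");
  }
}
```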

@PhilGlass PhilGlass marked this pull request as ready for review February 26, 2026 08:46