Fix modified UTF-8 string encoding/decoding by PhilGlass · Pull Request #42 · madisp/android-chunk-utils

PhilGlass · 2026-02-26T08:24:12Z

UTF-8 strings in .arsc files are encoded as modified UTF-8, but ResourceString treats them as regular UTF-8. So if you try to encode/decode a string containing supplementary characters (code points above U+FFFF, outside the BMP) you get garbage. Some examples:

"Last night's dinner 🍕"
"Double-struck: 𝔸𝔹ℂ𝔻𝔼"

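To make the difference concrete, here is a minimal sketch that compares the two encodings of the pizza emoji. It uses the JDK's `DataOutputStream.writeUTF` as a stand-in modified UTF-8 encoder; that is an assumption for illustration only, not the code path this PR touches. The class name `MutfVsUtf8` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of android-chunk-utils.
final class MutfVsUtf8 {
  /** Standard UTF-8: one supplementary code point becomes a single 4-byte sequence. */
  static byte[] standardUtf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Modified UTF-8 as written by the JDK: each surrogate is encoded separately in 3 bytes. */
  static byte[] modifiedUtf8(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      new DataOutputStream(out).writeUTF(s);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    byte[] framed = out.toByteArray();
    // writeUTF prepends its own 2-byte length; drop it to keep only the payload.
    return Arrays.copyOfRange(framed, 2, framed.length);
  }

  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) sb.append(String.format("%02X ", b));
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    String pizza = "\uD83C\uDF55"; // U+1F355, the emoji from the first example
    System.out.println(hex(standardUtf8(pizza))); // F0 9F 8D 95
    System.out.println(hex(modifiedUtf8(pizza))); // ED A0 BC ED BD 95
    // Decoding the modified bytes as regular UTF-8 does not give the emoji back.
    System.out.println(pizza.equals(new String(modifiedUtf8(pizza), StandardCharsets.UTF_8))); // false
  }
}
```

The six-byte form is just the two UTF-16 surrogates (U+D83C, U+DF55) encoded as three bytes each, which a regular UTF-8 decoder rejects as malformed.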
Testing: This repo doesn't have any automated tests and they aren't trivial to set up - we'd presumably want to run them on an Android device, or at least have the Android build tools create an APK/.arsc file (that we could then read on a regular JVM) rather than committing binaries. And I don't think it's even possible to create an .arsc file containing UTF-16 strings using aapt2, you have to use aapt. 😅

I ran our internal test suite that grabs the strings from an .arsc file against this branch (via mavenLocal()) and they all passed, including new ones for the cases above that failed previously.

But they only cover decoding; we have no use case for writing out a modified .arsc file. I got Claude to create some throwaway tests for the encoding path (they just round-trip through encode -> decode and check that the value is unchanged, based on our existing test suite), but I have less confidence that this part is correct.
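For reference, the shape of such a round-trip check can be sketched as below. This is not the throwaway test code mentioned above and not the library's implementation: `MutfRoundTrip`, `encode`, `decode` and `roundTrips` are hypothetical names, and the encoder follows the JDK's modified UTF-8 conventions (including U+0000 as `C0 80`), which is an assumption as far as .arsc files are concerned.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of an encode -> decode round trip for modified UTF-8.
final class MutfRoundTrip {
  /** Encodes each UTF-16 code unit on its own, so a surrogate pair becomes 3 + 3 bytes. */
  static byte[] encode(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c != 0 && c < 0x80) {
        out.write(c);
      } else if (c < 0x800) {
        // Also covers U+0000, written as C0 80 (JDK convention, assumed here).
        out.write(0xC0 | (c >> 6));
        out.write(0x80 | (c & 0x3F));
      } else {
        out.write(0xE0 | (c >> 12));
        out.write(0x80 | ((c >> 6) & 0x3F));
        out.write(0x80 | (c & 0x3F));
      }
    }
    return out.toByteArray();
  }

  /** Decodes 1- to 3-byte sequences back into UTF-16 code units; continuation bytes are not validated. */
  static String decode(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    int i = 0;
    while (i < bytes.length) {
      int b0 = bytes[i++] & 0xFF;
      if (b0 < 0x80) {
        sb.append((char) b0);
      } else if ((b0 & 0xE0) == 0xC0) {
        int b1 = bytes[i++] & 0x3F;
        sb.append((char) (((b0 & 0x1F) << 6) | b1));
      } else if ((b0 & 0xF0) == 0xE0) {
        int b1 = bytes[i++] & 0x3F;
        int b2 = bytes[i++] & 0x3F;
        sb.append((char) (((b0 & 0x0F) << 12) | (b1 << 6) | b2));
      } else {
        throw new IllegalArgumentException("unexpected lead byte at offset " + (i - 1));
      }
    }
    return sb.toString();
  }

  static boolean roundTrips(String s) {
    return decode(encode(s)).equals(s);
  }

  public static void main(String[] args) {
    // The two failing cases from the description, written with escapes.
    System.out.println(roundTrips("Last night's dinner \uD83C\uDF55"));
    System.out.println(roundTrips("Double-struck: \uD835\uDD38\u2102"));
  }
}
```

A round trip only shows that the encoder and decoder agree with each other, so it cannot catch a mistake both sides share. That matches the lower confidence stated above for the encoding path.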

PhilGlass · 2026-02-26T08:29:43Z

src/main/java/pink/madis/apk/arsc/ResourceString.java

-    UTF8(UTF_8),
-    UTF16(UTF_16LE);
-
-    private final Charset charset;
-
-    Type(Charset charset) {
-      this.charset = charset;
-    }
-
-    public Charset charset() {
-      return charset;
-    }
+    UTF8, UTF16


This is an API change, so I can revert it (+ deprecate?) if you'd rather maintain strict compatibility. But there's no standard MUTF-8 charset I could use to make charset() return something correct, and I'd be surprised if it were used outside this library.
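If strict compatibility turns out to matter, the revert-and-deprecate option could look roughly like this. It is only a sketch: `StringType` is a hypothetical top-level stand-in for the nested `Type` enum, and the deprecated accessor still returns the regular UTF-8 charset for `UTF8`, which is exactly the inaccuracy described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for ResourceString.Type; not the code in this PR.
enum StringType {
  UTF8, UTF16;

  /**
   * @deprecated UTF8 strings are modified UTF-8, which no standard Charset represents;
   *     the value returned for UTF8 is only an approximation kept for source compatibility.
   */
  @Deprecated
  public Charset charset() {
    return this == UTF8 ? StandardCharsets.UTF_8 : StandardCharsets.UTF_16LE;
  }
}
```

Existing callers of `charset()` would keep compiling and get a deprecation warning, at the cost of keeping a method whose answer is wrong for supplementary characters.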

PhilGlass · 2026-02-26T08:31:23Z

src/main/java/pink/madis/apk/arsc/ResourceString.java

-   * 0-based byte offset from the start of the buffer where the string resides. This should be the
-   * location in memory where the string's character count, followed by its byte count, and then
-   * followed by the actual string is located.
+   * 0-based byte offset from the start of the buffer where the string resides. How this data is
+   * interpreted depends on the string's {@code type}.


These comments were only correct for UTF-8 strings.
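As an illustration of why the layout depends on the type, here is a sketch of reading the length prefix for each kind of string. It assumes the string pool layout used by AOSP's `ResStringPool` (UTF-8: character count then byte count, each 1 or 2 bytes; UTF-16: a single count of 16-bit units in 2 or 4 bytes). The class and method names are hypothetical and this is not the library's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the two length-prefix layouts; assumes the AOSP string pool format.
final class StringHeaderSketch {
  /** UTF-8 entry: returns {charCount, byteCount, offsetOfStringData}. */
  static int[] utf8Lengths(ByteBuffer buf, int offset) {
    int[] pos = {offset};
    int chars = utf8Length(buf, pos);
    int bytes = utf8Length(buf, pos);
    return new int[] {chars, bytes, pos[0]};
  }

  /** One UTF-8 length: 1 byte, or 2 bytes when the high bit of the first byte is set. */
  private static int utf8Length(ByteBuffer buf, int[] pos) {
    int b0 = buf.get(pos[0]++) & 0xFF;
    if ((b0 & 0x80) == 0) return b0;
    int b1 = buf.get(pos[0]++) & 0xFF;
    return ((b0 & 0x7F) << 8) | b1;
  }

  /** UTF-16 entry: returns {unitCount, offsetOfStringData}; there is no separate byte count. */
  static int[] utf16Length(ByteBuffer buf, int offset) {
    // duplicate() resets the byte order, so set little-endian explicitly.
    ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    int u0 = le.getShort(offset) & 0xFFFF;
    if ((u0 & 0x8000) == 0) return new int[] {u0, offset + 2};
    int u1 = le.getShort(offset + 2) & 0xFFFF;
    return new int[] {((u0 & 0x7FFF) << 16) | u1, offset + 4};
  }

  public static void main(String[] args) {
    int[] utf8 = utf8Lengths(ByteBuffer.wrap(new byte[] {5, 5}), 0);
    System.out.println(utf8[0] + " chars, " + utf8[1] + " bytes, data at " + utf8[2]);
    int[] utf16 = utf16Length(ByteBuffer.wrap(new byte[] {3, 0}), 0);
    System.out.println(utf16[0] + " units, data at " + utf16[1]);
  }
}
```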

Fix modified UTF-8 string encoding/decoding

22d54d3

PhilGlass force-pushed the fix_string_decoding branch from 1ee51e6 to 22d54d3 on February 26, 2026 08:39

PhilGlass marked this pull request as ready for review February 26, 2026 08:46

Update README

ce9d518

PhilGlass wants to merge 2 commits into madisp:main from PhilGlass:fix_string_decoding
