Skip to content

fix(pack): byte-counted slash construct must not re-encode as UTF-8#631

Merged
fglock merged 1 commit intomasterfrom
fix/pack-slash-bytestring-20260429-223151
Apr 30, 2026
Merged

fix(pack): byte-counted slash construct must not re-encode as UTF-8#631
fglock merged 1 commit intomasterfrom
fix/pack-slash-bytestring-20260429-223151

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 29, 2026

Summary

Fixes a bug in pack's "len/string" slash construct where the length prefix was computed via str.getBytes(UTF-8) while the payload was written via PackWriter.writeString() using ISO-8859-1 (one byte per Latin-1 char). For a byte string of high-Latin-1 bytes (e.g. the output of utf8::encode), this produced a doubled length prefix and zero-padded payload.

For a 12-byte string, the bug produced:

19000000 c3a9...c3a9 00 0000000000000000000000

instead of the correct:

0d000000 c3a9...c3a9 00

Repro before the fix

./jperl -e '
  my $v = "\xc3\xa9" x 6;
  print unpack("H*", pack("V/Z*", $v)), "\n";
'
# 19000000c3a9c3a9c3a9c3a9c3a9c3a900000000000000000000000000   (broken)
# 0d000000c3a9c3a9c3a9c3a9c3a9c3a900                           (perl 5.42)

Why it matters

BSON::PP round-trips every non-ASCII string through pack("V/Z*", $utf8), so this bug broke essentially every BSON corpus and mapping test for any string containing 2- or 3-byte UTF-8. ./jcpan -t BSON goes from 71/72 failing test files to 12/72.

Fix

Match the byte/char count writeString() actually uses: in byte mode use the ISO-8859-1 byte count, otherwise use the Java char count (which matches writeString's per-char output for plain Latin-1 and its per-codepoint output for high-Unicode strings).

Test plan

  • Added src/test/resources/unit/pack/slash_string.t covering V/Z*, V/a*, V/A*, n/Z*, round-trip, and the exact BSON wire-format shape (7 subtests, all pass).
  • make (full unit test suite) — passes.
  • Manually re-ran ./jcpan -t BSON: failing test files dropped from 71 to 12. Remaining failures are unrelated upstream issues (dualvar / B::svref_2object flag bits, Math::BigFloat::as_number('zero') deprecated API on the 32-bit-int fallback path, NaN/Inf encoding edge cases).

Generated with Devin

@fglock fglock force-pushed the fix/pack-slash-bytestring-20260429-223151 branch from d40b5a6 to 4d33be1 Compare April 30, 2026 07:53
pack("V/Z*", $bytes) and the other "len/string" constructs computed
the length prefix from str.getBytes(UTF-8), while the actual payload
was emitted by PackWriter.writeString() using ISO-8859-1 (one byte
per Latin-1 char). For a string of high-Latin-1 bytes (e.g. UTF-8
output of utf8::encode), the length prefix came out doubled and the
payload was followed by zero padding to match.

Concretely, for a 12-byte string this produced:

    19000000 c3a9...c3a9 00 0000000000000000000000

instead of the correct:

    0d000000 c3a9...c3a9 00

Match the byte/char count writeString actually uses: in byte mode use
the ISO-8859-1 byte count, otherwise use the Java char count (which
matches writeString's per-char output for plain Latin-1 and its
per-codepoint output for high-Unicode strings).

This is exactly what BSON::PP triggers via pack("V/Z*", $utf8_encoded)
when round-tripping any non-ASCII string through BSON. The fix takes
`./jcpan -t BSON` from 71/72 failing test files down to 12/72.

Adds src/test/resources/unit/pack/slash_string.t covering V/Z*, V/a*,
V/A*, n/Z*, round-trip, and the exact BSON wire-format shape.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock force-pushed the fix/pack-slash-bytestring-20260429-223151 branch from 4d33be1 to 613e45a Compare April 30, 2026 08:12
@fglock fglock merged commit d9fc84f into master Apr 30, 2026
2 checks passed
@fglock fglock deleted the fix/pack-slash-bytestring-20260429-223151 branch April 30, 2026 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant