fix(pack): byte-counted slash construct must not re-encode as UTF-8#631
Merged
fix(pack): byte-counted slash construct must not re-encode as UTF-8#631
Conversation
d40b5a6 to
4d33be1
Compare
pack("V/Z*", $bytes) and the other "len/string" constructs computed
the length prefix from str.getBytes(UTF-8), while the actual payload
was emitted by PackWriter.writeString() using ISO-8859-1 (one byte
per Latin-1 char). For a string of high-Latin-1 bytes (e.g. UTF-8
output of utf8::encode), the length prefix came out doubled and the
payload was followed by zero padding to match.
Concretely, for a 12-byte string this produced:
19000000 c3a9...c3a9 00 0000000000000000000000
instead of the correct:
0d000000 c3a9...c3a9 00
Match the byte/char count writeString actually uses: in byte mode use
the ISO-8859-1 byte count, otherwise use the Java char count (which
matches writeString's per-char output for plain Latin-1 and its
per-codepoint output for high-Unicode strings).
This is exactly what BSON::PP triggers via pack("V/Z*", $utf8_encoded)
when round-tripping any non-ASCII string through BSON. The fix takes
`./jcpan -t BSON` from 71/72 failing test files down to 12/72.
Adds src/test/resources/unit/pack/slash_string.t covering V/Z*, V/a*,
V/A*, n/Z*, round-trip, and the exact BSON wire-format shape.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
4d33be1 to
613e45a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a bug in pack's "len/string" slash construct where the length prefix was computed via
str.getBytes(UTF-8)while the payload was written viaPackWriter.writeString()using ISO-8859-1 (one byte per Latin-1 char). For a byte string of high-Latin-1 bytes (e.g. the output ofutf8::encode), this produced a doubled length prefix and zero-padded payload.For a 12-byte string, the bug produced:
instead of the correct:
Repro before the fix
Why it matters
BSON::PPround-trips every non-ASCII string throughpack("V/Z*", $utf8), so this bug broke essentially every BSON corpus and mapping test for any string containing 2- or 3-byte UTF-8../jcpan -t BSONgoes from 71/72 failing test files to 12/72.Fix
Match the byte/char count
writeString()actually uses: in byte mode use the ISO-8859-1 byte count, otherwise use the Java char count (which matcheswriteString's per-char output for plain Latin-1 and its per-codepoint output for high-Unicode strings).Test plan
src/test/resources/unit/pack/slash_string.tcoveringV/Z*,V/a*,V/A*,n/Z*, round-trip, and the exact BSON wire-format shape (7 subtests, all pass).make(full unit test suite) — passes../jcpan -t BSON: failing test files dropped from 71 to 12. Remaining failures are unrelated upstream issues (dualvar /B::svref_2objectflag bits,Math::BigFloat::as_number('zero')deprecated API on the 32-bit-int fallback path, NaN/Inf encoding edge cases).Generated with Devin