Tighter and broader VS16 handling#24
Merged
Merged
Conversation
only look for the VS16 cases in the variation sequences. still need to look for those with later FE0F.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes over-eager VS16 widening in the display-width calculator. Previously any non-wide grapheme followed by U+FE0F was promoted to width 2; now only bases that appear in Unicode's emoji-variation-sequences.txt (as <base> FE0F) are eligible, and the check also handles ZWJ sequences where FE0F appears later inside a grapheme cluster. To support multiple traits on a single code point, the generated character property is changed from an enum to a bitmap.
Changes:
- Convert generated
propertyfrom enum to bitmap and add a new_VS16_Eligibleflag fed fromemoji-variation-sequences.txt(Unicode 17). - Rewrite
graphemeWidthto use bitmap checks and a newhasEligibleVS16Pairhelper that locates later<base, FE0F>pairs inside a cluster without rune decoding. - Add a new
C4/C5conformance test and a fuzz target (FuzzHasEligibleVS16Pair) comparing the byte-level scanner to autf8.DecodeRune-based oracle.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| width.go | Switches width logic to bitmap properties, adds hasEligibleVS16Pair/is helpers, restricts VS16 widening to eligible bases. |
| width_test.go | Renames C3 test, adds C4 (VS16 ignored for invalid bases) and C5 (later VS16 in ZWJ sequences). |
| internal/gen/unicode.go | Adds VS16Eligible map, parses emoji-variation-sequences.txt, switches property constants to 1 << iota, makes buildPropertyBitmap OR flags. |
| internal/gen/trie.go | Emits property constants as a bitmap and updates header comment. |
| internal/gen/data/17.0.0/emoji-variation-sequences.txt | Vendored Unicode 17 source data for VS16 eligibility. |
| fuzz_test.go | Adds FuzzHasEligibleVS16Pair plus a UTF-8-decoding reference implementation as oracle. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Skip directly to each 0xEF candidate instead of scanning byte-by-byte. Loops past non-FE0F candidates (FE0E, fullwidth forms, etc.) so we still find a valid pair later in the cluster. Add C6 conformance test and fuzz seeds for the FE0E/FE20 patterns that motivated the change. ~1% improvement on String_Mixed/EastAsian/Emoji benchmarks.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Addresses issues pointed out by @gammons in #23, with thanks. I decided to research and do my implementation.
This PR tightens VS16 width handling based on what we learned from Unicode TR51 and Unicode 17 data files.
Problem we are solving
Current behavior can misclassify grapheme width when VS16 appears later in a grapheme cluster (especially in some ZWJ sequences), and it could not represent overlapping properties on one code point.
That matters because some code points are both:
AinEastAsianWidth.txt), andemoji-variation-sequences.txt).A single-property model cannot represent that overlap correctly.
What we learned about VS16 and Unicode data
emoji-variation-sequences.txt(base + FE0F / FE0E).Implementation changes
_East_Asian_Ambiguousand_VS16_Eligible).emoji-variation-sequences.txtin generation.FuzzHasEligibleVS16Pairwith a UTF-8-decoding reference oracle to validate the byte-level implementation.Test plan
go generate ./...go test ./...go test ./... -run 'TestTR51Conformance/C5'go test . -run '^$' -fuzz FuzzHasEligibleVS16Pair -fuzztime=10s