Identifier_Type data for new Unicode 18.0 characters #1325
josh-hadley wants to merge 8 commits into main
Mostly remove items contributing to `Not_NFKC` generated Identifier_Type, per discussion in the meeting.
|
@roozbehp I removed the items from |
@@ -59509,13 +72331,22 @@ AC00..D7A3 ; Allowed ; Recommended # [11172] (가..힣) HANGUL SYLLABLE GA..
11301 ; Allowed ; Recommended # (𑌁) GRANTHA SIGN CANDRABINDU
11303 ; Allowed ; Recommended # (𑌃) GRANTHA SIGN VISARGA
1133C ; Allowed ; Recommended # (𑌼) GRANTHA SIGN NUKTA
11B0A ; Allowed ; Recommended # () DEVANAGARI LETTER ALTERNATE DDDA
1D250 ; Allowed ; Recommended # () MUSICAL SYMBOL COMBINING FLAG-6
The musical symbols here should also be marked Technical. We probably need to list them explicitly in the source data.
updated removals.txt and regenerated; see 2e1e703
11303 ; output # (𑌃) GRANTHA SIGN VISARGA
1133C ; output # (𑌼) GRANTHA SIGN NUKTA
11B0A ; output # () DEVANAGARI LETTER ALTERNATE DDDA
1D250..1D252 ; output # [3] (..) MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8
Same here. The musical symbols should not show up here either.
updated removals.txt and regenerated; see 2e1e703
11301 ; nonstarting # (𑌁) GRANTHA SIGN CANDRABINDU
11303 ; nonstarting # (𑌃) GRANTHA SIGN VISARGA
1133C ; nonstarting # (𑌼) GRANTHA SIGN NUKTA
1D250..1D252 ; nonstarting # [3] (..) MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8
These may or may not be desired. I don't understand IDN enough to say. @markusicu @macchiati @asmusf?
All combining marks are forbidden by RFC 5891 from starting a label. If that's something we track in one of our files, then this value might be correct.
It should not be an excuse to mark them Recommended.
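For reference, the "no leading combining mark" rule mentioned above can be sketched in a few lines of Python. This is only an illustration, not the project's tooling: the helper name `starts_with_combining_mark` is hypothetical, and the check relies on the General_Category data in Python's `unicodedata`, which may predate Unicode 18.0 (so new characters such as U+1D250 may be reported as unassigned rather than as marks).

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """Return True if the first code point of `label` has General_Category M* (Mn, Mc, Me).

    Hypothetical helper for illustration; it only looks at the first code point
    and does not implement any other IDNA label rule.
    """
    if not label:
        return False
    return unicodedata.category(label[0]).startswith("M")


if __name__ == "__main__":
    # U+0301 COMBINING ACUTE ACCENT is a nonspacing mark (Mn).
    print(starts_with_combining_mark("\u0301a"))
    print(starts_with_combining_mark("a\u0301"))
```

A `nonstarting` entry and an Identifier_Type of Recommended are independent questions: the sketch says nothing about whether a mark should be Recommended.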
1D200..1D241 ; Obsolete Not_XID # 4.1 [66] GREEK VOCAL NOTATION SYMBOL-1..GREEK INSTRUMENTAL NOTATION SYMBOL-54
1D242..1D244 ; Technical Obsolete # 4.1 [3] COMBINING GREEK MUSICAL TRISEME..COMBINING GREEK MUSICAL PENTASEME
1D245 ; Obsolete Not_XID # 4.1 GREEK MUSICAL LEIMMA
1D250..1D252 ; Recommended # 18.0 [3] MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8
Same here with the musical symbols. None of them should become Recommended.
updated removals.txt and regenerated; see 2e1e703
|
@josh-hadley I left some comments. The real blocker is some musical symbols unintentionally having become Recommended. (The issue of some of them having become nonstarting in idnchars.txt may or may not be desirable. That's beyond my expertise.) |
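A quick way to catch this kind of regression after regenerating is to scan the generated lines for musical symbols typed as Recommended. The sketch below is an assumption-laden illustration rather than part of the actual generator: `parse_line` and `recommended_musical` are hypothetical names, the line format is inferred from the hunks quoted in this review (`code[..code] ; types # comment`), and "musical symbol" is detected only by the character name in the trailing comment.

```python
import re

# Assumed line shape: "1D250..1D252 ; Recommended # 18.0 [3] NAME..NAME"
LINE_RE = re.compile(
    r"^\s*([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*([^#]*?)\s*(?:#\s*(.*))?$"
)


def parse_line(line):
    """Return (start, end, types, comment), or None for comment/blank lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    start = int(m.group(1), 16)
    end = int(m.group(2), 16) if m.group(2) else start
    types = m.group(3).split()
    comment = m.group(4) or ""
    return start, end, types, comment


def recommended_musical(lines):
    """Collect (start, end) ranges that are Recommended and named MUSICAL SYMBOL."""
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        start, end, types, comment = parsed
        if "Recommended" in types and "MUSICAL SYMBOL" in comment:
            hits.append((start, end))
    return hits


if __name__ == "__main__":
    sample = [
        "1D245 ; Obsolete Not_XID # 4.1 GREEK MUSICAL LEIMMA",
        "1D250..1D252 ; Recommended # 18.0 [3] MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8",
    ]
    for start, end in recommended_musical(sample):
        print(f"{start:04X}..{end:04X} is Recommended")
```

An empty result on the regenerated file would indicate that no character with MUSICAL SYMBOL in its name is still Recommended.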
Explicitly list new musical symbols as `Technical` per review
@roozbehp thanks very much for your review and comments. I've made some updates. I'm not sure what, if anything, to do about the IDN stuff (I don't have a good grasp on this either). If @asmusf or @markusicu can suggest anything specifically actionable, I'm happy to make further changes. |
@@ -1,6 +1,6 @@
# intentional.txt
This file is also irrelevant to this pull request. I don't know what's causing it and if it's desirable or not, but it's best to just remove it from this pull request.
Did you just revert this output?
If so, then next time we generate the files, we get these same changes again, right?
I think we need to make input+code changes so that the output is good.
FYI: While most of the files touched by the tools are not published, intentional.txt is one of the few that are: https://www.unicode.org/Public/17.0.0/security/
Its documentation says: “Intentional Confusable Mappings: A selection of characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design.”
@markusicu correct; I just reverted, and you're right that doesn't actually do anything useful. If we want content changes in the generated files, we have to modify the code that generates them. As I mentioned separately, I won't be able to look into that until next week at the earliest.
|
The IDN stuff seems to have resolved itself with your other changes. My only remaining concern is the changes to intentional.txt. |
|
OK, been busy with something else. Where are we with the musical notation combining marks? We had an issue in the past where all of these "leaked" into PVALID because they were combining. Short of mucking with IDNA by baking in an exception, that's going to happen, because IDNA is a simple derived property (even though the derivation as defined in the RFC does not go through UTS46, but directly from the "gc", but never mind). Therefore, we do need to make sure they are not Recommended, so that IDNA registries can use that info to subset the PVALID set. Can someone confirm that this has been addressed?
|
Folks, I'm traveling today and won't have access to run the tooling with any further changes that might be needed to get this satisfiable, to wit:
|
Yes, they are UTS46-valid: etc.
In this PR, we currently have etc. As noted in https://github.com/unicode-org/properties/issues/530, the only new character that is becoming ID_Type=Recommended is U+11B0A DEVANAGARI LETTER ALTERNATE DDDA. |
Ok. I don't want people to get stressed out. This PR is not necessarily urgent. We said we wanted it finished before wrapping up https://github.com/unicode-org/properties/issues/530 and getting that included in the PAG report. I will compare the data here with that issue to check for adjustments. We should be able to progress the issue without dotting the i's on the PR. |
A7DD ; Technical Obsolete
A7E2 ; Obsolete
AB6C..AB6D ; Technical
107BB..107BF ; Technical
This is the only range left in the input that comes out as Not_NFKC. I suggest we remove this one, like removing the other Not_NFKC ranges.
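As a rough way to see which input entries would come out as Not_NFKC, one can test whether a character changes under NFKC normalization. This is a sketch only: `changes_under_nfkc` is a hypothetical helper, the test approximates the UTS #39 notion of characters that cannot occur in NFKC-normalized strings, and Python's `unicodedata` may not yet know characters newer than its bundled Unicode version, in which case they are reported as unchanged.

```python
import unicodedata


def changes_under_nfkc(ch: str) -> bool:
    """Return True if NFKC normalization rewrites `ch` to something else.

    Approximation for illustration; characters unknown to the interpreter's
    Unicode database are left untouched by normalize() and so return False.
    """
    return unicodedata.normalize("NFKC", ch) != ch


if __name__ == "__main__":
    # U+02B0 MODIFIER LETTER SMALL H has a compatibility decomposition to "h".
    print(changes_under_nfkc("\u02b0"))
    # A plain ASCII letter is already in NFKC form.
    print(changes_under_nfkc("a"))
```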
05C5 ; Uncommon_Use Obsolete # 4.1 HEBREW MARK LOWER DOT
05C6 ; Obsolete Not_XID # 4.1 HEBREW PUNCTUATION NUN HAFUKHA
05C7 ; Uncommon_Use Technical # 4.1 HEBREW POINT QAMATS QATAN
05C8..05C9 ; Uncommon_Use Technical # 18.0 [2] HEBREW POINT SHEVA NA MUDGASH..HEBREW POINT DAGESH HAZAQ MUDGASH
10EC5..10EC6 ; Technical # 17.0 [2] ARABIC SMALL YEH BARREE WITH TWO DOTS BELOW..ARABIC LETTER THIN NOON
10EC7 ; Uncommon_Use # 17.0 ARABIC LETTER YEH WITH FOUR DOTS BELOW
10EC9..10ECA ; Technical Not_XID # 18.0 [2] ARABIC SMALL BASELINE FATHA..ARABIC SMALL BASELINE DOTLESS HEAD OF KHAH
10ECB..10ECF ; Uncommon_Use Technical # 18.0 [5] ARABIC NORTHEAST POINTING ARROWHEAD ABOVE..ARABIC LARGE CIRCLE ABOVE
10EF0..10EF8 ; Uncommon_Use Technical # 18.0 [9] ARABIC SMALL LOW UPRIGHT RECTANGULAR ZERO..ARABIC SMALL HIGH WORD KABBIR
10EF9 ; Uncommon_Use Obsolete # 18.0 ARABIC MARK CROWN
1810..1819 ; Exclusion # 3.0 [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
1820..1877 ; Exclusion # 3.0 [88] MONGOLIAN LETTER A..MONGOLIAN LETTER MANCHU ZHA
1878 ; Exclusion # 11.0 MONGOLIAN LETTER CHA WITH TWO DOTS
1879 ; Exclusion # 18.0 MONGOLIAN LETTER ALTERNATE UE
- input is only Obsolete
- output is only Exclusion
- why is the output not Exclusion Obsolete?
- need to decide whether we want Obsolete as well as Exclusion
From Roozbeh Pournader, PAG:
The following are the proposed Identifier_Type values for the new letters in Unicode 18.0. Only one character, U+11B0A, is "Recommended" for identifiers, based on evidence that it is used in modern Sindhi orthography. The rest are classified according to their usage based on proposals leading to their encoding.