Identifier_Type data for new Unicode 18.0 characters#1325

Open
josh-hadley wants to merge 8 commits into main from jh-uni18-identifier_type

Conversation

@josh-hadley
Collaborator

From Roozbeh Pournader, PAG:

The following are the proposed Identifier_Type values for the new letters in Unicode 18.0. Only one character, U+11B0A, is "Recommended" for identifiers, based on evidence that it is used in modern Sindhi orthography. The rest are classified according to their usage based on proposals leading to their encoding.

  • Approver: Feel free to merge on my behalf
    • rebase & merge one or more commits
    • squash & merge multiple commits into one

@markusicu markusicu requested a review from roozbehp April 9, 2026 20:58
Mostly remove items contributing to `Not_NFKC` generated Identifier_Type, per discussion in the meeting.
@josh-hadley
Collaborator Author

@roozbehp I removed the items from removals.txt that were Not_NFKC, per our meeting discussion yesterday, and regenerated. The result only changes review.txt (expected, I guess?). Please review and LMK if further changes are needed.

@josh-hadley josh-hadley marked this pull request as ready for review April 10, 2026 16:17
Comment thread unicodetools/data/security/dev/data/source/formatted-cjk.txt Outdated
@@ -59509,13 +72331,22 @@ AC00..D7A3 ; Allowed ; Recommended # [11172] (가..힣) HANGUL SYLLABLE GA..
11301 ; Allowed ; Recommended # (𑌁) GRANTHA SIGN CANDRABINDU
11303 ; Allowed ; Recommended # (𑌃) GRANTHA SIGN VISARGA
1133C ; Allowed ; Recommended # (𑌼) GRANTHA SIGN NUKTA
11B0A ; Allowed ; Recommended # (𑬊) DEVANAGARI LETTER ALTERNATE DDDA
1D250 ; Allowed ; Recommended # (𝉐) MUSICAL SYMBOL COMBINING FLAG-6
Contributor

The musical symbols here should also be marked Technical. We probably need to list them explicitly in the source data.

Collaborator Author

updated removals.txt and regenerated; see 2e1e703

11303 ; output # (𑌃) GRANTHA SIGN VISARGA
1133C ; output # (𑌼) GRANTHA SIGN NUKTA
11B0A ; output # (𑬊) DEVANAGARI LETTER ALTERNATE DDDA
1D250..1D252 ; output # [3] (𝉐..𝉒) MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8
Contributor

Same here. The musical symbols should not show up here either.

Collaborator Author

updated removals.txt and regenerated; see 2e1e703

11301 ; nonstarting # (𑌁) GRANTHA SIGN CANDRABINDU
11303 ; nonstarting # (𑌃) GRANTHA SIGN VISARGA
1133C ; nonstarting # (𑌼) GRANTHA SIGN NUKTA
1D250..1D252 ; nonstarting # [3] (𝉐..𝉒) MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8
Contributor

These may or may not be desired. I don't understand IDN enough to say. @markusicu @macchiati @asmusf?


All combining marks are forbidden by RFC 5891 from starting a label. If that's something we track in one of our files, then this value might be correct.

That should not be an excuse to mark them Recommended.
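
For reference, the leading-combining-mark rule mentioned above can be sketched in a few lines of Python. This is an illustration only, not how the unicodetools code derives `nonstarting`; the function name `starts_with_combining_mark` is hypothetical, and the result depends on the Unicode version bundled with the Python runtime, which will not yet know the 18.0 characters.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """Return True if the first code point has General_Category M* (Mn, Mc, Me).

    Hypothetical helper for illustration; it is not part of unicodetools.
    """
    if not label:
        return False
    return unicodedata.category(label[0]).startswith("M")


# U+0301 COMBINING ACUTE ACCENT is a nonspacing mark (Mn).
print(starts_with_combining_mark("\u0301abc"))  # True
print(starts_with_combining_mark("abc"))        # False
```

A check like this says only that a character cannot start a label; it carries no information about whether the character deserves Recommended status.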

Comment thread unicodetools/data/security/dev/IdentifierType.txt
1D200..1D241 ; Obsolete Not_XID # 4.1 [66] GREEK VOCAL NOTATION SYMBOL-1..GREEK INSTRUMENTAL NOTATION SYMBOL-54
1D242..1D244 ; Technical Obsolete # 4.1 [3] COMBINING GREEK MUSICAL TRISEME..COMBINING GREEK MUSICAL PENTASEME
1D245 ; Obsolete Not_XID # 4.1 GREEK MUSICAL LEIMMA
1D250..1D252 ; Recommended # 18.0 [3] MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8
Contributor

Same here with the musical symbols. None of them should become Recommended.

Collaborator Author

updated removals.txt and regenerated; see 2e1e703

@roozbehp
Contributor

@josh-hadley I left some comments. The real blocker is some musical symbols unintentionally having become Recommended. (The issue of some of them having become nonstarting in idnchars.txt may or may not be desirable. That's beyond my expertise.)

Explicitly list new musical symbols as `Technical` per review
@josh-hadley
Collaborator Author

> @josh-hadley I left some comments. The real blocker is some musical symbols unintentionally having become Recommended. (The issue of some of them having become nonstarting in idnchars.txt may or may not be desirable. That's beyond my expertise.)

@roozbehp thanks very much for your review and comments. I've made some updates. I'm not sure what, if anything, to do about the IDN stuff (I don't have a good grasp on this either). If @asmusf or @markusicu can suggest anything specifically actionable, I'm happy to make further changes.

@josh-hadley josh-hadley requested review from asmusf and roozbehp April 13, 2026 17:20
@@ -1,6 +1,6 @@
# intentional.txt
Contributor

This file is also irrelevant to this pull request. I don't know what's causing it and if it's desirable or not, but it's best to just remove it from this pull request.

Collaborator Author

Reverted in 65aa809

Member

Did you just revert this output?
If so, then next time we generate the files, we get these same changes again, right?
I think we need to make input+code changes so that the output is good.

Member

FYI: While most of the files touched by the tools are not published, intentional.txt is one of the few that are: https://www.unicode.org/Public/17.0.0/security/

Its documentation says: “Intentional Confusable Mappings: A selection of characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design.”

Collaborator Author

@markusicu correct; I just reverted, and you're right that doesn't actually do anything useful. If we want content changes in the generated files, we have to modify the code that generates them. As I mentioned separately, I won't be able to look into that until next week at the earliest.

@roozbehp
Contributor

The IDN stuff seems to have resolved itself with your other changes. My only remaining concern is the changes to intentional.txt.

@josh-hadley josh-hadley requested a review from roozbehp April 13, 2026 21:34
Contributor

@roozbehp roozbehp left a comment


Thank you. LGTM.

@asmusf

asmusf commented Apr 14, 2026

OK, been busy with something else: where are we with the Musical notation combining marks? We had an issue that in the past all of these "leaked" into PVALID b/c they were combining. Short of mucking with IDNA by baking in an exception, that's going to happen b/c IDNA is a simple derived property (even though the derivation as defined in the RFC does not go through UTS46, but directly from the "gc", but never mind).

Therefore, we do need to make sure they are not Recommended, so that IDNA registries can use that info to subset the PVALID set.

Can someone confirm that this has been addressed?

@josh-hadley
Collaborator Author

Folks, I'm traveling today and won't have access to run the tooling with any further changes that might be needed to get this satisfiable, to wit:

  • need to regenerate and include the output files I reverted since removing them from the PR does not remove them from the generated files
  • ⬆️ is conditional on accepting what the generated data is; if it's not acceptable, we need to do code changes. Soonest I'll be able to do that is next week, so if it's needed sooner, someone else will need to take it on.
  • ensure @asmusf's concerns around IDN are satisfied in whatever the generated output is

@markusicu
Member

> where are we with the Musical notation combining marks? We had an issue that in the past all of these "leaked" into PVALID b/c they were combining.

Yes, they are UTS46-valid:
https://www.unicode.org/Public/draft/idna/IdnaMappingTable.txt

1D127..1D128  ; valid      ;      ; NV8    # 18.0 MUSICAL SYMBOL COMBINING STRESS..MUSICAL SYMBOL COMBINING UNSTRESS
1D1EB..1D1FF  ; valid      ;      ; NV8    # 18.0 MUSICAL SYMBOL HALF SHARP..MUSICAL SYMBOL LONGA REST

etc.

> Therefore, we do need to make sure they are not Recommended, so that IDNA registries can use that info to subset the PVALID set.

In this PR, we currently have
unicodetools/data/security/dev/IdentifierType.txt

1D127..1D128  ; Technical                      # 18.0   [2] MUSICAL SYMBOL COMBINING STRESS..MUSICAL SYMBOL COMBINING UNSTRESS
1D1EB..1D1FF  ; Technical Not_XID              # 18.0  [21] MUSICAL SYMBOL HALF SHARP..MUSICAL SYMBOL LONGA REST

etc.

As noted in https://github.com/unicode-org/properties/issues/530, the only new character that is becoming ID_Type=Recommended is U+11B0A DEVANAGARI LETTER ALTERNATE DDDA.
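
A check of this kind can be sketched as a small script. This is a rough illustration under stated assumptions, not part of the unicodetools generator: the names `parse_identifier_types` and `unexpected_recommended` are hypothetical, the regular expression only covers the line shape seen in the excerpts in this thread, and the third sample line is the pre-fix line that was flagged earlier in review.

```python
import re

# Lines copied from IdentifierType.txt excerpts quoted in this thread.
# The last one is the earlier (pre-fix) line that reviewers flagged.
SAMPLE = """\
1D127..1D128  ; Technical                      # 18.0   [2] MUSICAL SYMBOL COMBINING STRESS..MUSICAL SYMBOL COMBINING UNSTRESS
1D1EB..1D1FF  ; Technical Not_XID              # 18.0  [21] MUSICAL SYMBOL HALF SHARP..MUSICAL SYMBOL LONGA REST
1D250..1D252 ; Recommended # 18.0 [3] MUSICAL SYMBOL COMBINING FLAG-6..MUSICAL SYMBOL COMBINING FLAG-8
"""

# code point or range ; space-separated types # age ...
LINE_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*([^#]*?)\s*#\s*(\d+\.\d+)"
)


def parse_identifier_types(text):
    """Yield (first, last, types, age) for each data line; skip everything else."""
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # comment, blank line, or a shape this sketch does not handle
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        yield first, last, set(m.group(3).split()), m.group(4)


def unexpected_recommended(text, age, allowed):
    """List code points of the given age that are Recommended but not in `allowed`."""
    bad = []
    for first, last, types, line_age in parse_identifier_types(text):
        if line_age == age and "Recommended" in types:
            bad.extend(cp for cp in range(first, last + 1) if cp not in allowed)
    return bad


# Only U+11B0A is expected to be Recommended among the 18.0 additions.
flagged = unexpected_recommended(SAMPLE, "18.0", {0x11B0A})
print([f"{cp:04X}" for cp in flagged])  # ['1D250', '1D251', '1D252']
```

Run against the regenerated file, an empty result would be consistent with the statement that U+11B0A is the only new Recommended character.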

@markusicu
Member

> Folks, I'm traveling today and won't have access to run the tooling with any further changes that might be needed to get this satisfiable

Ok. I don't want people to get stressed out. This PR is not necessarily urgent. We said we wanted it finished before wrapping up https://github.com/unicode-org/properties/issues/530 and getting that included in the PAG report. I will compare the data here with that issue to check for adjustments. We should be able to progress the issue without dotting the i's on the PR.

A7DD ; Technical Obsolete
A7E2 ; Obsolete
AB6C..AB6D ; Technical
107BB..107BF ; Technical
Member

This is the only range left in the input that comes out as Not_NFKC. I suggest we remove this one, like removing the other Not_NFKC ranges.
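
As a rough way to spot input entries that will come out as Not_NFKC, one can test whether a character changes under NFKC normalization. This is only an approximation of the UTS #39 Not_NFKC type, not the tool's own derivation; the helper name `is_not_nfkc` is hypothetical, and a Python runtime with older Unicode data will not know the 18.0 range discussed here, so the example uses a long-standing character instead.

```python
import unicodedata


def is_not_nfkc(ch: str) -> bool:
    """Approximate Not_NFKC: the character does not survive NFKC normalization.

    Hypothetical helper for illustration; not the unicodetools derivation.
    """
    return unicodedata.normalize("NFKC", ch) != ch


# U+FB01 LATIN SMALL LIGATURE FI normalizes to "fi" under NFKC.
print(is_not_nfkc("\uFB01"))  # True
print(is_not_nfkc("a"))       # False
```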

05C5 ; Uncommon_Use Obsolete # 4.1 HEBREW MARK LOWER DOT
05C6 ; Obsolete Not_XID # 4.1 HEBREW PUNCTUATION NUN HAFUKHA
05C7 ; Uncommon_Use Technical # 4.1 HEBREW POINT QAMATS QATAN
05C8..05C9 ; Uncommon_Use Technical # 18.0 [2] HEBREW POINT SHEVA NA MUDGASH..HEBREW POINT DAGESH HAZAQ MUDGASH
Member

tool adds Uncommon_Use

10EC5..10EC6 ; Technical # 17.0 [2] ARABIC SMALL YEH BARREE WITH TWO DOTS BELOW..ARABIC LETTER THIN NOON
10EC7 ; Uncommon_Use # 17.0 ARABIC LETTER YEH WITH FOUR DOTS BELOW
10EC9..10ECA ; Technical Not_XID # 18.0 [2] ARABIC SMALL BASELINE FATHA..ARABIC SMALL BASELINE DOTLESS HEAD OF KHAH
10ECB..10ECF ; Uncommon_Use Technical # 18.0 [5] ARABIC NORTHEAST POINTING ARROWHEAD ABOVE..ARABIC LARGE CIRCLE ABOVE
Member

tool adds Uncommon_Use

Comment on lines +4353 to +4354
10EF0..10EF8 ; Uncommon_Use Technical # 18.0 [9] ARABIC SMALL LOW UPRIGHT RECTANGULAR ZERO..ARABIC SMALL HIGH WORD KABBIR
10EF9 ; Uncommon_Use Obsolete # 18.0 ARABIC MARK CROWN
Member

tool adds Uncommon_Use

1810..1819 ; Exclusion # 3.0 [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
1820..1877 ; Exclusion # 3.0 [88] MONGOLIAN LETTER A..MONGOLIAN LETTER MANCHU ZHA
1878 ; Exclusion # 11.0 MONGOLIAN LETTER CHA WITH TWO DOTS
1879 ; Exclusion # 18.0 MONGOLIAN LETTER ALTERNATE UE
Member

  • input is only Obsolete
  • output is only Exclusion
  • why is the output not Exclusion Obsolete?
  • need to decide whether we want Obsolete as well as Exclusion

4 participants