Add support for ordinal formatting (and update CLDR) #52

Merged
jeffijoe merged 18 commits into jeffijoe:master from sfuqua:sfuqua/ordinals
Feb 18, 2026
@sfuqua
Contributor

@sfuqua sfuqua commented Feb 16, 2026

Fixes #51

This change adds support for the selectordinal function/formatter. Though I don't believe selectordinal is documented in the original ICU libraries, it seems to have ecosystem support and is a first-class function in MF2, so it seems a good fit.
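
For readers unfamiliar with ordinal selection, here is a minimal sketch of what a selectordinal formatter chooses between. It is written in Python purely for illustration and is not the library's C# code; the function names are hypothetical. The rule itself is the English ordinal rule from CLDR, restricted to non-negative integers.

```python
def english_ordinal_category(n: int) -> str:
    """Return the CLDR ordinal plural category for a non-negative integer in English."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"    # 1st, 21st, 101st (but not 11th)
    if n % 10 == 2 and n % 100 != 12:
        return "two"    # 2nd, 22nd (but not 12th)
    if n % 10 == 3 and n % 100 != 13:
        return "few"    # 3rd, 23rd (but not 13th)
    return "other"      # 4th, 11th, 12th, 13th, ...


# Sub-messages that a pattern such as
# {n, selectordinal, one {#st} two {#nd} few {#rd} other {#th}}
# would supply for each category.
SUFFIXES = {"one": "st", "two": "nd", "few": "rd", "other": "th"}


def format_ordinal(n: int) -> str:
    """Pick the sub-message for n's ordinal category and substitute the number."""
    return f"{n}{SUFFIXES[english_ordinal_category(n)]}"


print([format_ordinal(n) for n in (1, 2, 3, 4, 11, 21)])
```

Cardinal selection for English only distinguishes "one" from "other", which is why ordinals need their own rule data.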

Thankfully the CLDR ordinals.xml has the same schema as plurals.xml so the parsing logic could be reused, though I had to update the data structures and generated code to support multiple plural "types".

  1. Add ordinals.xml from CLDR v48.1 (and update plurals.xml simultaneously)
  2. Add README to the data folder describing pedigree of the XML files
  3. Add documentation to some of the codegen types and PluralContext to serve as touchstones
  4. Update parsing infra to support two XML files and two plural "types" throughout the stack:
    a. Collections of PluralRule are replaced with a new data structure PluralRuleSet that serves as an index of locales + plural type
    b. There are now two different TryGetByLocale functions in the generated code, one for Cardinal (current behavior) and one for Ordinal (new/added behavior)
  5. Per discussion in this thread, removed default English pluralizers, which fixes #53 ("Built-in English pluralizer is inconsistent w/CLDR")
  6. Update existing tests to explicitly exercise cardinal formatting
  7. Added some new ordinal test cases
  8. Added rudimentary support for LDML "inheritance" rules, such that, e.g., "en-CA" will match CLDR "en", and unknown or unmapped locales will always fall back to "root"; removed the library's default mapping to "en" as a result (because "root" exists)
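
The "index of locales + plural type" idea from item 4 can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the generated C# code: `PluralType`, `PluralRuleSet`, `try_get_cardinal`, and `try_get_ordinal` are hypothetical names that only mirror the shape described above, and the English rules are simplified to integers.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class PluralType(Enum):
    CARDINAL = "cardinal"
    ORDINAL = "ordinal"


Rule = Callable[[int], str]  # maps a number to a plural category name


def en_cardinal(n: int) -> str:
    return "one" if n == 1 else "other"


def en_ordinal(n: int) -> str:
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if n % 10 == 2 and n % 100 != 12:
        return "two"
    if n % 10 == 3 and n % 100 != 13:
        return "few"
    return "other"


class PluralRuleSet:
    """Hypothetical index of rules keyed by (locale, plural type)."""

    def __init__(self) -> None:
        self._rules: Dict[Tuple[str, PluralType], Rule] = {}

    def add(self, locale: str, plural_type: PluralType, rule: Rule) -> None:
        self._rules[(locale, plural_type)] = rule

    def try_get_cardinal(self, locale: str) -> Optional[Rule]:
        return self._rules.get((locale, PluralType.CARDINAL))

    def try_get_ordinal(self, locale: str) -> Optional[Rule]:
        return self._rules.get((locale, PluralType.ORDINAL))


rule_set = PluralRuleSet()
rule_set.add("en", PluralType.CARDINAL, en_cardinal)
rule_set.add("en", PluralType.ORDINAL, en_ordinal)
```

The same number can land in different categories depending on the type: for 2, the cardinal lookup yields "other" while the ordinal lookup yields "two".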

@sfuqua
Contributor Author

sfuqua commented Feb 16, 2026

@jeffijoe I'm happy to tackle #53 (and update the README) in the same change. I figured as-is, this change could be a semver minor version bump for a backcompat new feature, but if you're happy to call it a major version change I'll go ahead and make the 'zero' change as well (which would just leave the open question about the codegen change). I can also do it in a separate PR - lmk if you have a preference.

I can also update the codegen to be backwards compatible by keeping the _LANG helpers and adding a new overload of TryGetRuleByLocale for ordinals, but I was pretty sure the generator project and what it spits out isn't part of the versioned API surface.

@jeffijoe
Owner

The codegen is not part of the public API, so if everything works without the extra generated code, then we should leave it out.

Happy to make it a major bump now - I'd want to also tackle #41 in the same release since that is also a breaking change. Basically all that is for is making the default culture InvariantCulture as well as allowing passing a CultureInfo per invocation.

@sfuqua
Contributor Author

sfuqua commented Feb 16, 2026

I was cleaning up PluralRuleKey a bit and realized I need to check some behavior about how input is matched against supported locales to ensure that e.g. "fr-FR" matches "fr".

Going to double check master with some new tests to confirm I didn't inadvertently change any behavior around fallback, and also look into how straightforward it is to canonicalize an input locale to a Unicode language ID and use the LDML matching algorithm to pick a rule.

@sfuqua
Contributor Author

sfuqua commented Feb 17, 2026

Okay, while reading https://www.unicode.org/reports/tr35/tr35.html#Locale_Inheritance I discovered a few other things that I'm interested in contributing but that may be outside the scope of this PR:

  1. There's no language inheritance in the lib today: "fr-FR" should match "fr", but does not (this can be confirmed in the existing tests by changing "ru" to "ru-RU")
  2. The CLDR plural data has two examples of subtags ("pt_PT" and "kok_Latn") - per the spec, CLDR currently always uses "_" as the separator
  3. Per the spec, -/_ are equivalent, so an input to the library of "pt-PT" should match "pt_PT"

The "-" and "_" separators are treated as equivalent, although "-" is preferred.

  4. The lookup of locales should be case-insensitive (alternatively, the input should be canonicalized)
  5. The LDML inheritance/matching algo should be implemented, such that:

Given a particular locale id "en_US_someVariant", the default search chain for a particular resource is the following.
en_US_someVariant
en_US
en
root
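
The quoted search chain can be produced by simple truncation. The sketch below is a Python illustration only; `search_chain` is a hypothetical helper, and it deliberately ignores the parts of LDML inheritance that go beyond truncation (for example explicit parent-locale data).

```python
from typing import List


def search_chain(locale_id: str) -> List[str]:
    """Build the default truncation chain for a locale id, ending in "root"."""
    # "-" and "_" are equivalent separators; use "_" here to match the CLDR data.
    parts = locale_id.replace("-", "_").split("_")
    # Drop one trailing subtag at a time: en_US_someVariant -> en_US -> en.
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


print(search_chain("en_US_someVariant"))
# ['en_US_someVariant', 'en_US', 'en', 'root']
```

A lookup would then try each entry in order and stop at the first one that has plural data.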

There is nuance in how to do search chaining for non-canonical names. However, given that almost all the plural data uses "base" language tags, I'm not sure there's value in going that far.

Notably, implementing support for root means every input will always map to a default Pluralizer based on CLDR rules, so we won't have to fall back to "en" anymore for missing languages.

I think my plan for this change is to -

  1. Update the codegen to normalize the CLDR "_" separator to the BCP 47 "-" in the Dictionary keys, which will hopefully make "pt-PT" work more often by default, and maintain a separate index of tags that include underscores to facilitate conversion in step 4.
  2. Restructure the codegen to use Dictionary<string, Dictionary<string, ContextPluralizer>> keyed on CLDR-locale, and then on type (or vice versa), with case-insensitive comparison for the locale Dictionary. I'll keep using a record struct for the user-facing API, but not as a Dictionary key.
  3. Update documentation to clarify the semantics of the locale string in a couple places.
  4. Call TryGetRuleByLocale successively up to the root language before failing. If we detect an underscore, we can reference the index from step 1 to remap to a supported locale (e.g., "pt_PT" -> "pt-PT" -> rule).
  5. Document the edge cases in README
  6. New tests
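
Steps 1 and 2 of the plan (separator normalization plus a case-insensitive, locale-first index) might look roughly like this. It is a Python sketch rather than the C# codegen output; `normalize_tag`, `build_index`, and `lookup` are hypothetical names, and the rule values are placeholder labels standing in for real pluralizer delegates.

```python
from typing import Dict, Optional


def normalize_tag(tag: str) -> str:
    """Map CLDR-style "_" to BCP 47 "-" and fold case for comparison."""
    return tag.replace("_", "-").casefold()


# Placeholder data: locale ids as they appear in CLDR, then plural type.
# The string values are labels only, not real rules.
CLDR_RULES: Dict[str, Dict[str, str]] = {
    "pt": {"cardinal": "pt cardinal rule", "ordinal": "pt ordinal rule"},
    "pt_PT": {"cardinal": "pt_PT cardinal rule"},
    "kok_Latn": {"cardinal": "kok_Latn cardinal rule"},
}


def build_index(cldr_rules: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Key the outer dictionary on the normalized locale, then on plural type."""
    return {normalize_tag(locale): dict(by_type) for locale, by_type in cldr_rules.items()}


INDEX = build_index(CLDR_RULES)


def lookup(tag: str, plural_type: str, index: Dict[str, Dict[str, str]] = INDEX) -> Optional[str]:
    """Exact-match lookup only; no inheritance is attempted here."""
    return index.get(normalize_tag(tag), {}).get(plural_type)


print(lookup("pt-PT", "cardinal"))  # pt_PT cardinal rule
print(lookup("PT_pt", "cardinal"))  # pt_PT cardinal rule
```

In this sketch the input and the data keys go through the same normalization, so "pt-PT", "pt_PT", and "PT-pt" all resolve to the same entry.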

I think step 2 (the Dictionary change) should happen regardless, as it's directly related to the feature in this PR, but the rest of the fallback work could wait until a separate change. I'll probably start on it in parallel; let me know if you have a preference on logistics (or if this should be tracked with a new issue).

@jeffijoe
Owner

Don't want to bog you down with logistics, but if you feel like it would be simpler for both of us to merge this PR first, we can do that, let me know when it's ready and I'll merge it. Appreciate the help!

@sfuqua sfuqua marked this pull request as draft February 17, 2026 01:35
@sfuqua sfuqua marked this pull request as ready for review February 17, 2026 06:52
@sfuqua
Contributor Author

sfuqua commented Feb 17, 2026

Okay, this ended up a little silly, but I landed on a compromise for locale inheritance in a new LocaleHelper.cs.

We try the original locale as-is (now case insensitive!), then we try stripping all subtags (so en-US-Foo will try "en"), then we finally try "root".

This means -

  1. No more fallback to English (instead relying on "root" CLDR rules) for unknown languages - this is technically another "breaking" change, though depending on this behavior would've been odd
  2. Most valid BCP 47 tags or CultureInfo names passed into this library should now properly match a CLDR rule when they would've hit English before
  3. Built-in English pluralizers have been removed (can be added back in manually), and tests have been updated

This should work very well for almost all pluralization cases except for "pt-PT" with additional subtags (which would fall back to "pt", which is equivalent to "pt-BR").

We can get away with this because almost all of the plural CLDR locales are base languages.
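
The three-step lookup described above (the tag as-is, then the base language, then "root") can be sketched as follows. This is a Python illustration and not the contents of LocaleHelper.cs; `resolve_locale` and the `SUPPORTED` set are hypothetical, and treating "_" as "-" during normalization is an assumption of this sketch.

```python
from typing import Set

# Hypothetical set of locales that have plural data, stored lower-cased.
SUPPORTED: Set[str] = {"en", "pt", "pt-pt", "root"}


def resolve_locale(tag: str, supported: Set[str] = SUPPORTED) -> str:
    """Return the supported locale whose rules should be used for tag."""
    # Assumption: separators are unified and case is folded before matching.
    normalized = tag.replace("_", "-").casefold()
    # 1. Try the tag as given (case-insensitively).
    if normalized in supported:
        return normalized
    # 2. Strip all subtags and try the base language.
    base = normalized.split("-", 1)[0]
    if base in supported:
        return base
    # 3. Otherwise use the CLDR root rules.
    return "root"


print(resolve_locale("en-US-Foo"))    # en
print(resolve_locale("pt-PT"))        # pt-pt
print(resolve_locale("pt-PT-x-foo"))  # pt
print(resolve_locale("xx"))           # root
```

The third call shows the edge case mentioned above: because step 2 jumps straight to the base language, "pt-PT" with extra subtags resolves to "pt" rather than "pt-PT".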

I did end up substantially refactoring my previous codegen to key on locale first, and made Cardinal/Ordinal strongly typed instead of passing "cardinal" and "ordinal" strings everywhere.

Owner

@jeffijoe jeffijoe left a comment

Since we're already doing a major for the next release, we can rename Pluralizers to CardinalPluralizers for consistency.

@jeffijoe jeffijoe merged commit 32b446c into jeffijoe:master Feb 18, 2026
1 check passed
@jeffijoe
Owner

Merged, thank you so much! Was there anything else you wanted to address before I cut the next major release?

@sfuqua
Contributor Author

sfuqua commented Feb 18, 2026

Merged, thank you so much! Was there anything else you wanted to address before I cut the next major release?

Nothing blocking on my end! Please feel free to tag me if anything strange surfaces with the changes.

Longer term, I'm interested in a C#/.NET solution for MessageFormat 2, which got ratified this year. It'd be a bit of a project, but the package might actually be set up well for a dual parsing pipeline in the future, leveraging all the same generated CLDR bindings.

The lib I believe you originally used as a reference is now one of the reference implementations of the new syntax 😁

@jeffijoe
Owner

jeffijoe commented Feb 18, 2026

Yes, it's been a long time, I didn't realize there was a new spec. 😅

I'll probably take a stab at #41 before I cut a new release, but I can't promise when that will be.

EDIT: Took a quick look at MF2, that is definitely a big undertaking! But I like the syntax!

EDIT2: this is nuts

.input {$pronoun :string}
.input {$count :number}
.match $pronoun $count
he one   {{He has {$count} notification.}}
he *     {{He has {$count} notifications.}}
she one  {{She has {$count} notification.}}
she *    {{She has {$count} notifications.}}
* one    {{They have {$count} notification.}}
* *      {{They have {$count} notifications.}}

@jeffijoe
Owner

This has now been released! Thanks again!

Development

Successfully merging this pull request may close these issues.

Built-in English pluralizer is inconsistent w/CLDR
selectordinal support missing

3 participants