reland and fix RUST-147622 by Kmeakin · Pull Request #151277 · rust-lang/rust

Kmeakin · 2026-01-18T02:31:10Z

Responding to reviews in #148438

The `ucd_parse` crate offers a function to get the Unicode version from the readme, so we don't need to reimplement it ourselves.

Instead of `include_str!()`ing `range_search.rs`, just make it a normal module under `core::unicode`. This means the same source code doesn't have to be checked in twice, and it plays nicer with IDEs. Also rename it to `rt` because it will also include runtime functions for case foldings in the next commit.

Instead of writing the body of `to_lower()` and `to_upper()` in a string literal and pasting it into the final `unicode_data.rs`, extract the common logic into a helper function in the `rt` module and then call the helper from the generated code. Same motivation as the previous commit (better IDE support, less duplicate code checked into git).
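A minimal sketch of that split, under simplifying assumptions: the helper name `case_conversion`, the table layout, and the single-`char` return type are hypothetical and chosen for illustration, not taken from the PR. It only shows the shape of the change: shared logic in `rt`, while the generated side shrinks to data plus a one-line call.

```rust
/// Hypothetical stand-in for the shared `rt` module.
mod rt {
    /// Looks `c` up in a table sorted by source character.
    /// Returns `c` unchanged when the table has no entry for it.
    pub fn case_conversion(c: char, table: &[(char, char)]) -> char {
        match table.binary_search_by_key(&c, |&(from, _)| from) {
            Ok(i) => table[i].1,
            Err(_) => c,
        }
    }
}

/// Hypothetical generated data: (uppercase, lowercase) pairs, sorted by the first field.
static LOWERCASE_TABLE: &[(char, char)] = &[
    ('\u{c0}', '\u{e0}'),   // À -> à
    ('\u{416}', '\u{436}'), // Ж -> ж
];

/// What the generated function could reduce to: a call into the helper.
pub fn to_lower(c: char) -> char {
    rt::case_conversion(c, LOWERCASE_TABLE)
}
```

With this layout the lookup logic is ordinary source code that an IDE can analyze, and the generator only has to emit the table and the call.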

Remove `#[rustfmt::skip]` from all the generated modules in `unicode_data.rs`. This means we won't have to worry so much about getting indentation and formatting right when generating code. The case folding tables are exempted for now, because they would be too long when formatted by `rustfmt`.

This check was made redundant (it will always be true) when we removed all ASCII characters from the tables (a8c6694).

To make the final output code easier to see:

* Get rid of the unnecessary line-noise of `.unwrap()`ing calls to `write!()` by moving the `.unwrap()` into a macro.
* Join consecutive `write!()` calls using a single multiline format string.
* Replace `.push()` and `.push_str(format!())` with `write!()`.
* If, after doing all of the above, there is only a single `write!()` call in the function, just construct the string directly with `format!()`.
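The first and third points could look roughly like the sketch below. The macro name `emit!` and the function `render_table` are hypothetical; the PR's actual macro and generator functions may be named and shaped differently.

```rust
use std::fmt::Write;

/// Hypothetical helper macro: forwards to `write!` and unwraps the result.
/// Writing into a `String` does not fail, so the `unwrap` never panics here.
macro_rules! emit {
    ($dst:expr, $($arg:tt)*) => {
        write!($dst, $($arg)*).unwrap()
    };
}

/// Renders a small table declaration, using `emit!` instead of
/// `push_str(&format!(..))` or `write!(..).unwrap()` at every call site.
fn render_table(name: &str, values: &[u8]) -> String {
    let mut out = String::new();
    emit!(out, "pub static {}: [u8; {}] = [\n", name, values.len());
    for v in values {
        emit!(out, "    {:#04x},\n", v);
    }
    emit!(out, "];\n");
    out
}
```

Keeping the `.unwrap()` inside the macro leaves each call site as just the format string and its arguments, which is the part a reader of the generator cares about.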

In the `case_mapping` tables, print the data in hexadecimal. This makes the relationship between each character and the character it is mapped to more obvious. Do the same for `cascading_map`: because we are inspecting the high byte of the input character, it makes more sense to compare against a hexadecimal literal.
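To illustrate why hexadecimal helps, here is a small sketch. Both function names and the exact output format are assumptions made for this example, not the generator's real code.

```rust
/// Formats one hypothetical case-mapping entry with hexadecimal code points.
fn format_mapping(from: char, to: char) -> String {
    format!("({:#06x}, {:#06x})", from as u32, to as u32)
}

/// High byte of a code point (bits 8 and up), the kind of value a
/// cascading map would compare against a hexadecimal literal.
fn high_byte(c: char) -> u32 {
    (c as u32) >> 8
}
```

For U+00C0 and U+00E0 this yields `(0x00c0, 0x00e0)`, where the `0x20` offset between the two is visible at a glance; in decimal (192 and 224) it is not. Likewise, the high byte of U+0416 reads directly off the literal as `0x04`.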

The preferred way to run `assert`s at compile time is to put them in an unnamed constant (`const _: () = { ... };`). This avoids the asserts being evaluated on every call by tools like Miri.
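A minimal sketch of the pattern, with a made-up table (`OFFSETS`) and lookup function standing in for the generated ones:

```rust
/// Hypothetical lookup table standing in for one of the generated tables.
const OFFSETS: [u8; 4] = [1, 2, 4, 8];

// Checked once at compile time: a failing condition is a compile error,
// and nothing here is evaluated when `lookup` runs.
const _: () = {
    assert!(OFFSETS.len() <= u8::MAX as usize);
    assert!(OFFSETS.len().is_power_of_two());
};

/// Uses the table; the invariants above are not re-checked on each call.
fn lookup(index: u8) -> u8 {
    OFFSETS[index as usize % OFFSETS.len()]
}
```

Had the same `assert!`s been written inside the body of `lookup`, an interpreter would evaluate them on every call unless they were optimized away.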

Instead of generating a standalone executable to test `unicode_data`, generate normal tests in `coretests`. This ensures tests are always generated and are run as part of the normal test suite. Also change the generated tests to loop over lookup tables, rather than generating a separate `assert_eq!()` statement for every codepoint. The old approach produced a massive file (more than 20,000 lines) which took minutes to compile!
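The table-driven style could be sketched as follows. The table contents and the helper `first_mismatch` are hypothetical; the real generated tests in `coretests` will differ in names and data.

```rust
/// Hypothetical expected-value table; real tests would use generated data.
const LOWERCASE_CASES: &[(char, char)] = &[
    ('\u{c0}', '\u{e0}'),   // À -> à
    ('\u{3a3}', '\u{3c3}'), // Σ -> σ
    ('\u{416}', '\u{436}'), // Ж -> ж
];

/// Returns the first entry whose lowercase mapping disagrees with the table, if any.
fn first_mismatch(cases: &[(char, char)]) -> Option<(char, char)> {
    cases
        .iter()
        .copied()
        .find(|&(input, expected)| !input.to_lowercase().eq(std::iter::once(expected)))
}
```

A test then needs a single assertion such as `assert_eq!(first_mismatch(LOWERCASE_CASES), None)`, so the compiled code stays the same size no matter how many entries the table holds.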

Workaround for issue rust-lang#148387

Kmeakin and others added 10 commits January 18, 2026 02:32

Use ucd_parse to get UNICODE_VERSION

5ef57e7

The `ucd_parse` crate offers a function to get the Unicode version from the readme, so we don't need to reimplement it ourselves.

Remove check that first_code_point is non-ascii

6ed351c

This check was made redundant (it will always be true) when we removed all ASCII characters from the tables (a8c6694).

Move const asserts out of function bodies

cadf4cd

The preferred way to run `assert`s at compile time is to put them in an unnamed constant (`const _: () = { ... };`). This avoids the asserts being evaluated on every call by tools like Miri.

Rename N to Numeric

c063bce

Workaround for issue rust-lang#148387

Kmeakin force-pushed the unicode_data_fixes branch from 1ebfc8c to c063bce Compare January 18, 2026 02:36

Kmeakin mentioned this pull request Jan 18, 2026

reland and fix RUST-147622 #148438

Open

Kmeakin wants to merge 10 commits into rust-lang:main from Kmeakin:unicode_data_fixes

3 participants
