Add em dash and Unicode dash variants to the char map #210

Open
naveentehrpariya wants to merge 1 commit into simov:master from naveentehrpariya:fix-unicode-dash-charmap

@naveentehrpariya


Only the en dash (U+2013) was mapped to -. Visually similar dashes were dropped as unknown characters, and for an unspaced dash this silently merged adjacent words:

slugify('rock–paper') // en dash → 'rock-paper'
slugify('rock—paper') // em dash → 'rockpaper'   ← bug (word boundary lost)

The result depended on which near-identical dash the input happened to use. Mapped the rest of the dash family to - for consistency with the en dash: U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash, U+2014 em dash, U+2015 horizontal bar. Added a test covering all variants.
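
The effect of the mapping can be illustrated with a minimal, self-contained sketch. This is not the library's actual code or char map: `toySlugify`, `DASH_MAP`, and `EN_DASH_ONLY` are hypothetical names, and the "drop anything unrecognized" step is a simplified assumption about how unmapped characters get removed.

```javascript
// Hypothetical, simplified char map; NOT the library's real table.
const DASH_MAP = new Map([
  ['\u2010', '-'], // hyphen
  ['\u2011', '-'], // non-breaking hyphen
  ['\u2012', '-'], // figure dash
  ['\u2013', '-'], // en dash (the only one mapped before this change)
  ['\u2014', '-'], // em dash
  ['\u2015', '-'], // horizontal bar
]);

// The "before" state: only the en dash is mapped.
const EN_DASH_ONLY = new Map([['\u2013', '-']]);

// Illustrative stand-in for slugify(): replace mapped characters, drop
// anything that is not an ASCII word character, whitespace, or "-",
// then join whitespace-separated words with "-".
function toySlugify(input, charMap = DASH_MAP) {
  return Array.from(input)
    .map((ch) => (charMap.has(ch) ? charMap.get(ch) : ch))
    .join('')
    .replace(/[^\w\s-]+/g, '')
    .trim()
    .replace(/\s+/g, '-');
}

console.log(toySlugify('rock\u2014paper', EN_DASH_ONLY)); // rockpaper
console.log(toySlugify('rock\u2014paper'));               // rock-paper
```

With only the en dash mapped, the em dash falls through to the removal step and the two words merge; with the whole family mapped, every variant yields the same slug.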

Only the en dash (U+2013) was mapped to "-", so visually similar
dashes were dropped as unknown characters. For an unspaced dash this
silently merged adjacent words (e.g. `rock—paper` -> `rockpaper`),
and the result depended on which near-identical dash the input used.

Map the rest of the dash family to "-" for consistency with the en
dash: U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash,
U+2014 em dash, U+2015 horizontal bar.
@Trott

Trott commented Jun 29, 2026

Collaborator

To me, the bug is the preservation of the supposed word boundary in the first instance. But regardless, slugify() should be consistent, and it is not.

console.log(slugify('rock–paper')) // rock-paper
console.log(slugify('rock—paper')) // rockpaper
console.log(slugify('rock;paper')) // rockpaper
console.log(slugify('rock/paper')) // rockpaper
console.log(slugify('rock.paper')) // rock.paper
console.log(slugify('rock,paper')) // rockpaper
console.log(slugify('rock:paper')) // rock:paper

FWIW, the slug module converts all of these to rockpaper, which seems correct to me. And if it seems incorrect to you or anyone else, well, I guess at least it's consistent.

The alternative is to identify every possible punctuation mark and map it as a word boundary. That should be done once in a comprehensive way rather than with a mark here and a mark there. But it requires some decisions. Do we include all 112 Unicode general punctuation marks? And the 94 supplemental marks as well? That's do-able, but I personally have a strong preference for "remove anything you don't recognize". I suppose "change anything you don't recognize into a word boundary" would be a valid approach too. Either one would probably be best to release as a semver-major because it is likely to do something surprising/unexpected to someone's existing code somewhere.
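
The two policies can be compared side by side in a rough sketch. Neither function is slugify's implementation: `dropPunctuation` and `punctuationAsBoundary` are hypothetical helpers, and the `\p{P}` class (any Unicode punctuation category) is an assumption standing in for whichever set of marks would actually be chosen.

```javascript
// Hypothetical helpers; neither is slugify's actual implementation.
// \p{P} matches any character in a Unicode punctuation category
// (property escapes require the `u` flag).
const PUNCTUATION = /\p{P}+/gu;

// Policy A: "remove anything you don't recognize".
function dropPunctuation(input) {
  return input.replace(PUNCTUATION, '').trim().replace(/\s+/g, '-');
}

// Policy B: "change anything you don't recognize into a word boundary".
function punctuationAsBoundary(input) {
  return input.replace(PUNCTUATION, ' ').trim().replace(/\s+/g, '-');
}

const samples = [
  'rock\u2013paper', // en dash
  'rock\u2014paper', // em dash
  'rock;paper',
  'rock/paper',
  'rock.paper',
  'rock,paper',
  'rock:paper',
];

for (const s of samples) {
  console.log(dropPunctuation(s), punctuationAsBoundary(s));
}
// every sample prints: rockpaper rock-paper
```

Either policy is internally consistent across all seven inputs; they differ only in whether the two words stay separated, which is the decision that would need to be made.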

@naveentehrpariya

Author

Agreed — consistency is the real goal here. This PR takes the conservative route: the en dash was already mapped to -, so I just brought the rest of the visually-equivalent dash family (em dash, figure dash, etc.) in line with it. That removes the surprise where two near-identical glyphs slugify differently. Happy to adjust the target mapping if you'd prefer a different direction.

@Trott

Trott commented Jun 29, 2026

Collaborator

> Agreed — consistency is the real goal here. This PR takes the conservative route: the en dash was already mapped to -, so I just brought the rest of the visually-equivalent dash family (em dash, figure dash, etc.) in line with it. That removes the surprise where two near-identical glyphs slugify differently. Happy to adjust the target mapping if you'd prefer a different direction.

A narrow fix for one particularly problematic glyph is fine by me (if it's fine by @simov), and then maybe we open a separate issue for the wider problem. I'd want to know what @simov thinks the right route is. A case could be made for "do nothing" on the more comprehensive fix because it might break things for a lot of people and they might not even realize it. But even if we do decide to go for consistency, the question remains as to whether punctuation marks should generally be mapped to a word boundary, silently dropped, or something else.

2 participants