Add em dash and Unicode dash variants to the char map by naveentehrpariya · Pull Request #210 · simov/slugify

naveentehrpariya · 2026-06-28T20:23:24Z

Only the en dash (U+2013) was mapped to -. Visually similar dashes were dropped as unknown characters, and for an unspaced dash this silently merged adjacent words:

slugify('rock–paper') // en dash → 'rock-paper'
slugify('rock—paper') // em dash → 'rockpaper'   ← bug (word boundary lost)

The result depended on which near-identical dash the input happened to use. Mapped the rest of the dash family to - for consistency with the en dash: U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash, U+2014 em dash, U+2015 horizontal bar. Added a test covering all variants.
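As a rough illustration of the before/after behavior described above, here is a minimal sketch. It assumes a simplified char-map model: `toySlug` and `DASH_MAP` are hypothetical names made up for this example, and the code is not the library's implementation or the actual diff in this PR.

```javascript
// Hypothetical, simplified stand-in for a slugify-style char map.
// Not the library's actual code.
const DASH_MAP = {
  '\u2010': '-', // hyphen
  '\u2011': '-', // non-breaking hyphen
  '\u2012': '-', // figure dash
  '\u2013': '-', // en dash (the only one mapped before this change)
  '\u2014': '-', // em dash
  '\u2015': '-' // horizontal bar
}

// Map known characters, drop anything that is not an ASCII letter, digit,
// whitespace, or hyphen, then collapse whitespace/hyphen runs into one hyphen.
function toySlug (input, charMap = DASH_MAP) {
  return Array.from(input)
    .map((ch) => (ch in charMap ? charMap[ch] : ch))
    .join('')
    .replace(/[^A-Za-z0-9\s-]/g, '')
    .trim()
    .replace(/[\s-]+/g, '-')
}

// With only the en dash mapped, the em dash is dropped and the words merge.
const enDashOnly = { '\u2013': '-' }
console.log(toySlug('rock\u2014paper', enDashOnly)) // rockpaper
// With the whole dash family mapped, the word boundary survives.
console.log(toySlug('rock\u2014paper')) // rock-paper
```

The point of the sketch is only that the outcome depends entirely on whether a given code point has an entry in the map; everything else in the pipeline is identical for both calls.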

Trott · 2026-06-29T01:58:38Z

To me, the bug is the preservation of the supposed word boundary in the first instance. But regardless, slugify() should be consistent, and it is not.

console.log(slugify('rock–paper')) // rock-paper
console.log(slugify('rock—paper')) // rockpaper
console.log(slugify('rock;paper')) // rockpaper
console.log(slugify('rock/paper')) // rockpaper
console.log(slugify('rock.paper')) // rock.paper
console.log(slugify('rock,paper')) // rockpaper
console.log(slugify('rock:paper')) // rock:paper

FWIW, the slug module converts all of these to rockpaper, which seems correct to me. And if it seems incorrect to you or anyone else, well, at least it's consistent.

The alternative is to identify every possible punctuation mark and map it as a word boundary. That should be done once in a comprehensive way rather than with a mark here and a mark there. But it requires some decisions. Do we include all 112 Unicode general punctuation marks? And the 94 supplemental marks as well? That's doable, but I personally have a strong preference for "remove anything you don't recognize". I suppose "change anything you don't recognize into a word boundary" would be a valid approach too. Either one would probably be best released as a semver-major because it is likely to do something surprising or unexpected to someone's existing code somewhere.
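To make the two policies concrete, here is a minimal sketch. It assumes "punctuation" means the Unicode `\p{P}` general category, which is one possible choice and not something decided in this thread. `dropPunctuation` and `punctuationAsBoundary` are hypothetical helpers for illustration, not functions from slugify or slug.

```javascript
// Hypothetical helpers illustrating the two policies.
// Neither is the actual implementation of slugify or slug.

// Policy A: "remove anything you don't recognize".
// Punctuation is deleted outright, so adjacent words merge.
function dropPunctuation (input) {
  return input
    .replace(/\p{P}/gu, '')
    .trim()
    .replace(/\s+/g, '-')
}

// Policy B: "change anything you don't recognize into a word boundary".
// Runs of punctuation and whitespace become a single hyphen.
function punctuationAsBoundary (input) {
  return input
    .replace(/[\p{P}\s]+/gu, '-')
    .replace(/^-+|-+$/g, '')
}

const samples = [
  'rock\u2013paper', // en dash
  'rock\u2014paper', // em dash
  'rock;paper',
  'rock/paper',
  'rock.paper',
  'rock,paper',
  'rock:paper'
]

for (const s of samples) {
  // Policy A prints rockpaper and policy B prints rock-paper for every sample.
  console.log(dropPunctuation(s), punctuationAsBoundary(s))
}
```

Either policy is internally consistent across all seven inputs. Both differ from the current mixed behavior listed above, which is why either would likely need a major version bump.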

naveentehrpariya · 2026-06-29T07:18:54Z

Agreed — consistency is the real goal here. This PR takes the conservative route: the en dash was already mapped to -, so I just brought the rest of the visually-equivalent dash family (em dash, figure dash, etc.) in line with it. That removes the surprise where two near-identical glyphs slugify differently. Happy to adjust the target mapping if you'd prefer a different direction.

Trott · 2026-06-29T13:45:53Z

A narrow fix for one particularly problematic glyph is fine by me (if it's fine by @simov), and then maybe we open a separate issue for the wider problem. I'd want to know what @simov thinks the right route is. A case could be made for "do nothing" on the more comprehensive fix because it might break things for a lot of people and they might not even realize it. But even if we do decide to go for consistency, the question remains as to whether punctuation marks should generally be mapped to a word boundary, silently dropped, or something else.

naveentehrpariya wants to merge 1 commit into simov:master from naveentehrpariya:fix-unicode-dash-charmap
