Add em dash and Unicode dash variants to the char map #210
naveentehrpariya wants to merge 1 commit into
Only the en dash (U+2013) was mapped to "-", so visually similar dashes were dropped as unknown characters. For an unspaced dash this silently merged adjacent words (e.g. `rock—paper` -> `rockpaper`), and the result depended on which near-identical dash the input used. Map the rest of the dash family to "-" for consistency with the en dash: U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash, U+2014 em dash, U+2015 horizontal bar.
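A minimal sketch of the effect, for illustration only. `DASH_MAP` and `toySlug` are hypothetical stand-ins, not the library's actual char map or implementation; the toy simply maps known characters and drops everything else that is not an ASCII letter, digit, space, or hyphen.

```javascript
// Hypothetical stand-in for the character map; not the library's real source.
const DASH_MAP = {
  '\u2010': '-', // hyphen
  '\u2011': '-', // non-breaking hyphen
  '\u2012': '-', // figure dash
  '\u2013': '-', // en dash (the only one mapped before this change)
  '\u2014': '-', // em dash
  '\u2015': '-', // horizontal bar
};

// Toy slug function: apply the map, drop unknown characters,
// then collapse runs of whitespace and hyphens into a single hyphen.
function toySlug(input, charMap = DASH_MAP) {
  return input
    .split('')
    .map((ch) => (ch in charMap ? charMap[ch] : ch))
    .join('')
    .replace(/[^A-Za-z0-9\s-]/g, '')
    .trim()
    .replace(/[\s-]+/g, '-')
    .toLowerCase();
}

// With the full dash family mapped, the word boundary survives.
console.log(toySlug('rock\u2014paper')); // rock-paper

// With only the en dash mapped, the em dash is dropped and the words merge.
console.log(toySlug('rock\u2014paper', { '\u2013': '-' })); // rockpaper
```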
To me, the bug is the preservation of the supposed word boundary in the first instance. But regardless, the current behavior is inconsistent:

```js
console.log(slugify('rock–paper')) // rock-paper
console.log(slugify('rock—paper')) // rockpaper
console.log(slugify('rock;paper')) // rockpaper
console.log(slugify('rock/paper')) // rockpaper
console.log(slugify('rock.paper')) // rock.paper
console.log(slugify('rock,paper')) // rockpaper
console.log(slugify('rock:paper')) // rock:paper
```

FWIW, the alternative is to identify every possible punctuation mark and map it as a word boundary. That should be done once in a comprehensive way rather than with a mark here and a mark there. But it requires some decisions. Do we include all 112 Unicode general punctuation marks? And the 94 supplemental marks as well? That's doable, but I personally have a strong preference for "remove anything you don't recognize". I suppose "change anything you don't recognize into a word boundary" would be a valid approach too. Either one would probably be best released as a semver-major because it is likely to do something surprising or unexpected to someone's existing code somewhere.
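To make the two policies concrete, here is a rough sketch under stated assumptions. `slugWithPolicy` and the policy names `'remove'` and `'boundary'` are hypothetical and are not options the library exposes; the set of recognized characters is deliberately reduced to ASCII letters, digits, and the hyphen.

```javascript
// Characters this sketch treats as "recognized": ASCII letters, digits, hyphen.
const KEEP = /[A-Za-z0-9-]/;

// policy 'remove':   drop anything unrecognized
// policy 'boundary': turn anything unrecognized into a word boundary
function slugWithPolicy(input, policy) {
  const out = [];
  for (const ch of input) {
    if (KEEP.test(ch)) {
      out.push(ch);
    } else if (/\s/.test(ch)) {
      out.push('-'); // whitespace is always a word boundary
    } else if (policy === 'boundary') {
      out.push('-');
    }
    // under 'remove', unrecognized characters are simply skipped
  }
  return out
    .join('')
    .replace(/-+/g, '-') // collapse repeated boundaries
    .replace(/^-|-$/g, '') // trim a leading or trailing boundary
    .toLowerCase();
}

console.log(slugWithPolicy('rock;paper', 'remove')); // rockpaper
console.log(slugWithPolicy('rock;paper', 'boundary')); // rock-paper
console.log(slugWithPolicy('rock\u2014paper', 'boundary')); // rock-paper
```

Either policy gives the same answer for every punctuation mark, which is the consistency being asked for; they differ only in whether `rock;paper` stays one word or becomes two.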
Agreed — consistency is the real goal here. This PR takes the conservative route: the en dash was already mapped to |
A narrow fix for one particularly problematic glyph is fine by me (if it's fine by @simov), and then maybe we open a separate issue for the wider problem. I'd want to know what @simov thinks the right route is. A case could be made for "do nothing" on the more comprehensive fix because it might break things for a lot of people and they might not even realize it. But even if we do decide to go for consistency, the question remains as to whether punctuation marks should generally be mapped to a word boundary, silently dropped, or something else.
Only the en dash (U+2013) was mapped to `-`. Visually similar dashes were dropped as unknown characters, and for an unspaced dash this silently merged adjacent words (e.g. `rock—paper` -> `rockpaper`). The result depended on which near-identical dash the input happened to use.

Mapped the rest of the dash family to `-` for consistency with the en dash: U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash, U+2014 em dash, U+2015 horizontal bar. Added a test covering all variants.