CommonVoice FR - Preprocessing stripped every lowercase character in the model#2085
Kizyow wants to merge 1 commit into
No actionable comments were generated in the recent review.
Walkthrough
The PR contains two small edits to CommonVoice ASR preprocessing: reordering French text normalization so that uppercasing happens before regex filtering, and clarifying the validated partition warning message to note that it includes both train and dev data.
Code Review
This pull request updates the French text normalization in preprocess_commonvoice.py to convert input text to uppercase before applying the regular expression filter, ensuring accented characters are not incorrectly stripped. It also removes trailing whitespace in a warning message. The review feedback points out that the French uppercase letter 'Ÿ' is missing from the regular expression and suggests adding it to ensure complete coverage of French diacritics.
      elif language == "fr":
    -     return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt).upper()
    +     utt = utt.upper()
    +     return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)
The French uppercase letter Ÿ (corresponding to lowercase ÿ) is missing from the allowed characters list in the regular expression. Although rare, it is used in French proper nouns (e.g., L'Haÿ-les-Roses, Moÿ-de-l'Aisne). Without it, any occurrence of ÿ or Ÿ will be silently stripped after conversion to uppercase.
Adding Ÿ to the character class ensures complete coverage of French alphabet diacritics.
Suggested change:

    - return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)
    + return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ' ]", "", utt)
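To illustrate the effect of the suggestion, here is a minimal sketch comparing the two character classes on one of the place names mentioned above. The helper `normalize_fr` and the constant names are hypothetical and are not part of `preprocess_commonvoice.py`; the sketch also assumes the input is in composed (NFC) form, so that `ÿ` is a single code point.

```python
import re

# Character classes copied from the diff: the merged one and the suggested one.
WITHOUT_Y = r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]"
WITH_Y = r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ' ]"


def normalize_fr(utt: str, pattern: str) -> str:
    """Hypothetical helper: uppercase first, then drop disallowed characters."""
    return re.sub(pattern, "", utt.upper())


name = "L'Haÿ-les-Roses"
print(normalize_fr(name, WITHOUT_Y))  # L'HALESROSES  (Ÿ is dropped)
print(normalize_fr(name, WITH_Y))     # L'HAŸLESROSES (Ÿ is kept)
```

In both cases the hyphens are removed as well, because `-` is not in the allowed set; that behavior is unchanged by the suggestion.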
Description
While preparing and training a French ASR model using the Common Voice dataset, I noticed that the normalized text transcripts were almost completely blank or heavily truncated (e.g., only keeping the leading uppercase letters, spaces, and apostrophes).
After investigation, I found a logic bug in the `normalize_text` function for French: the regex filters out all characters that are not uppercase before the `.upper()` method is applied. As a result, all lowercase letters (which represent ~95% of the text in Common Voice) are silently deleted.

For example, "L'éléphant" is currently reduced to "L'" because the lowercase letters are stripped before they have a chance to be capitalized.

Solution
This PR fixes the execution order in the `fr` block: the utterance is now converted to uppercase before running the regex filter.
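The ordering issue described above can be reproduced in isolation. The following is a sketch, not the actual contents of `preprocess_commonvoice.py`: the function names `normalize_before_fix` and `normalize_after_fix` are hypothetical, and only the regex and the order of operations are taken from the diff.

```python
import re

# Allowed characters for French, copied from the diff.
FR_PATTERN = r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]"


def normalize_before_fix(utt: str) -> str:
    """Old order: filter first (drops every lowercase letter), then uppercase."""
    return re.sub(FR_PATTERN, "", utt).upper()


def normalize_after_fix(utt: str) -> str:
    """New order: uppercase first, then filter out disallowed characters."""
    utt = utt.upper()
    return re.sub(FR_PATTERN, "", utt)


sample = "L'éléphant"
print(normalize_before_fix(sample))  # L'
print(normalize_after_fix(sample))   # L'ÉLÉPHANT
```

Because the character class only lists uppercase letters, the filter is only meaningful once the text has been uppercased, which is why the order of the two steps matters.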