Skip to content

CommonVoice FR - Preprocessing stripped every lowercase character in the model#2085

Open
Kizyow wants to merge 1 commit into
k2-fsa:masterfrom
Kizyow:master
Open

CommonVoice FR - Preprocessing stripped every lowercase character in the model#2085
Kizyow wants to merge 1 commit into
k2-fsa:masterfrom
Kizyow:master

Conversation

@Kizyow

@Kizyow Kizyow commented Jun 2, 2026

Copy link
Copy Markdown

Description

While preparing and training a French ASR model using the Common Voice dataset, I noticed that the normalized text transcripts were almost completely blank or heavily truncated (e.g., only keeping the leading uppercase letters, spaces, and apostrophes).

After investigation, I found a logical bug in the normalize_text function for French: the regex filters out all characters that are not uppercase before the .upper() method is applied. As a result, all lowercase letters (which represent ~95% of the text in Common Voice) are silently deleted.

For example, "L'éléphant" is currently reduced to "L'" because the lowercase letters are stripped before they have a chance to be capitalized.

Solution

This PR fixes the execution order in the fr block: the utterance is now converted to uppercase before running the regex filter.

elif language == "fr":
    utt = utt.upper()
    return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Updated warning message for the validated partition to clearly communicate that it includes both training and development data subsets.
  • Refactor

    • Streamlined text normalization workflow for English language data preprocessing by reorganizing character filtering and text normalization operations.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee8f2d4c-d0cc-4a64-926b-cd75734b8c10

📥 Commits

Reviewing files that changed from the base of the PR and between bf478d3 and fe04bf4.

📒 Files selected for processing (1)
  • egs/commonvoice/ASR/local/preprocess_commonvoice.py

📝 Walkthrough

Walkthrough

The PR contains two small edits to CommonVoice ASR preprocessing: reordering English text normalization to uppercase before regex filtering, and clarifying the validated partition warning message to note it includes both train and dev data.

Changes

CommonVoice preprocessing improvements

Layer / File(s) Summary
English text normalization step reordering
egs/commonvoice/ASR/local/preprocess_commonvoice.py
The normalize_text function for English is refactored to uppercase the input before the character-filter regex runs, changing the order of operations from regex-then-uppercase to uppercase-then-regex.
Validated partition warning message clarification
egs/commonvoice/ASR/local/preprocess_commonvoice.py
The logged warning for the "validated" partition is updated with additional text describing that "validated" contains both 'train' and 'dev' data.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A hop through preprocessing, so neat and so fine,
Text uppercased first, then regex defines,
Warnings now clearer, the splits all align,
Train and dev dancing in "validated" line! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title mentions 'CommonVoice FR' and a preprocessing issue, but the actual changeset involves English normalization reordering (uppercasing before regex filtering), not French-specific changes. Update the title to accurately reflect the main change: normalize text before applying character filters in the English path, or clarify if the French fix is the primary intent.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the French text normalization in preprocess_commonvoice.py to convert input text to uppercase before applying the regular expression filter, ensuring accented characters are not incorrectly stripped. It also removes trailing whitespace in a warning message. The review feedback points out that the French uppercase letter 'Ÿ' is missing from the regular expression and suggests adding it to ensure complete coverage of French diacritics.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

elif language == "fr":
return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt).upper()
utt = utt.upper()
return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The French uppercase letter Ÿ (corresponding to lowercase ÿ) is missing from the allowed characters list in the regular expression. Although rare, it is used in French proper nouns (e.g., L'Haÿ-les-Roses, Moÿ-de-l'Aisne). Without it, any occurrence of ÿ or Ÿ will be silently stripped after conversion to uppercase.

Adding Ÿ to the character class ensures complete coverage of French alphabet diacritics.

Suggested change
return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)
return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ' ]", "", utt)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant