CommonVoice FR - Preprocessing stripped every lowercase character in the model by Kizyow · Pull Request #2085 · k2-fsa/icefall

Kizyow · 2026-06-02T18:47:18Z

Description

While preparing and training a French ASR model using the Common Voice dataset, I noticed that the normalized text transcripts were almost completely blank or heavily truncated (e.g., only keeping the leading uppercase letters, spaces, and apostrophes).

After investigation, I found a logical bug in the normalize_text function for French: the regex filters out all characters that are not uppercase before the .upper() method is applied. As a result, all lowercase letters (which represent ~95% of the text in Common Voice) are silently deleted.

For example, "L'éléphant" is currently reduced to "L'" because the lowercase letters are stripped before they have a chance to be capitalized.

Solution

This PR fixes the execution order in the fr block: the utterance is now converted to uppercase before running the regex filter.

elif language == "fr":
    utt = utt.upper()
    return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)

Summary by CodeRabbit

Release Notes

Bug Fixes
- Updated warning message for the validated partition to clearly communicate that it includes both training and development data subsets.
Refactor
- Streamlined text normalization workflow for English language data preprocessing by reorganizing character filtering and text normalization operations.

…ice, fixed

coderabbitai · 2026-06-02T18:47:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee8f2d4c-d0cc-4a64-926b-cd75734b8c10

📥 Commits

Reviewing files that changed from the base of the PR and between bf478d3 and fe04bf4.

📒 Files selected for processing (1)

egs/commonvoice/ASR/local/preprocess_commonvoice.py

📝 Walkthrough

Walkthrough

The PR contains two small edits to CommonVoice ASR preprocessing: reordering English text normalization to uppercase before regex filtering, and clarifying the validated partition warning message to note it includes both train and dev data.

Changes

CommonVoice preprocessing improvements

Layer / File(s)	Summary
English text normalization step reordering `egs/commonvoice/ASR/local/preprocess_commonvoice.py`	The `normalize_text` function for English is refactored to uppercase the input before the character-filter regex runs, changing the order of operations from regex-then-uppercase to uppercase-then-regex.
Validated partition warning message clarification `egs/commonvoice/ASR/local/preprocess_commonvoice.py`	The logged warning for the "validated" partition is updated with additional text describing that "validated" contains both `'train'` and `'dev'` data.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A hop through preprocessing, so neat and so fine,
Text uppercased first, then regex defines,
Warnings now clearer, the splits all align,
Train and dev dancing in "validated" line! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title mentions 'CommonVoice FR' and a preprocessing issue, but the actual changeset involves English normalization reordering (uppercasing before regex filtering), not French-specific changes.	Update the title to accurately reflect the main change: normalize text before applying character filters in the English path, or clarify if the French fix is the primary intent.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request updates the French text normalization in preprocess_commonvoice.py to convert input text to uppercase before applying the regular expression filter, ensuring accented characters are not incorrectly stripped. It also removes trailing whitespace in a warning message. The review feedback points out that the French uppercase letter 'Ÿ' is missing from the regular expression and suggests adding it to ensure complete coverage of French diacritics.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-02T18:48:09Z

    elif language == "fr":
-        return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt).upper()
+        utt = utt.upper()
+        return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)


The French uppercase letter Ÿ (corresponding to lowercase ÿ) is missing from the allowed characters list in the regular expression. Although rare, it is used in French proper nouns (e.g., L'Haÿ-les-Roses, Moÿ-de-l'Aisne). Without it, any occurrence of ÿ or Ÿ will be silently stripped after conversion to uppercase.

Adding Ÿ to the character class ensures complete coverage of French alphabet diacritics.

Suggested change

return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)

return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ' ]", "", utt)

fix: regex stripped every lower char in french preprocessing-commonvo…

fe04bf4

…ice, fixed

gemini-code-assist Bot reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CommonVoice FR - Preprocessing stripped every lowercase character in the model#2085

CommonVoice FR - Preprocessing stripped every lowercase character in the model#2085
Kizyow wants to merge 1 commit into
k2-fsa:masterfrom
Kizyow:master

Kizyow commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt)
	return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ' ]", "", utt)

Conversation

Kizyow commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Solution

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Kizyow commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading