Tune ML worker OCR subject normalization by SahilKumar75 · Pull Request #104 · SahilKumar75/sentri

SahilKumar75 · 2026-05-12T16:48:53Z

Summary

- Expand ML worker default tuning aliases for common OCR-confused timetable subject names.
- Reuse default subject, faculty, and location aliases in the default parser path.
- Add parser tests for noisy subject, faculty, and lab-location normalization.
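
As a rough illustration of the alias approach summarized above, the sketch below maps OCR-confused subject names to canonical ones. The noisy keys are the variants quoted on this page; the canonical targets for the last two entries and the `normalize_subject` helper are assumptions for illustration, not the actual contents of `tuning.py`.

```python
# Hypothetical sketch; the real alias table lives in
# ml-worker/src/sentri_worker/tuning.py and may be shaped differently.
DEFAULT_SUBJECT_ALIASES = {
    "MACH1NE LEARN1NG": "MACHINE LEARNING",
    "MACH1NE LEARNING": "MACHINE LEARNING",
    "0PERATING SYSTEMS": "OPERATING SYSTEMS",  # assumed canonical form
    "C0MPUTER NETWORKS": "COMPUTER NETWORKS",  # assumed canonical form
}


def normalize_subject(raw: str) -> str:
    """Return the canonical subject for a noisy name, or the cleaned input."""
    # Uppercase and collapse runs of whitespace before the lookup.
    key = " ".join(raw.upper().split())
    return DEFAULT_SUBJECT_ALIASES.get(key, key)
```

A plain dictionary keeps the normalization deterministic: an unknown subject passes through unchanged (apart from case and whitespace cleanup) rather than being guessed at.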

Testing

cd ml-worker && pytest tests/test_parser.py
cd ml-worker && pytest
git diff --check

Closes #103

Summary by CodeRabbit

Release Notes

Improvements

- Enhanced parser to automatically normalize OCR errors and typos in subject names and faculty/location codes (e.g., "MACH1NE LEARN1NG" → "MACHINE LEARNING", "V1" → "VI").
- Parser now applies default normalization mappings when no custom tuning profile is configured.

Tests

- Added tests to verify OCR and typo normalization across multiple variants.

coderabbitai · 2026-05-12T16:49:16Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The PR extends the ML worker parser to apply default alias normalization when no custom tuning profile is present. It expands default subject aliases with OCR/typo variants, integrates default normalization into the parser fallback logic, and adds tests validating the normalization behavior.

Changes

Default Alias Normalization

| Layer / File(s) | Summary |
| --- | --- |
| OCR/typo alias expansion `ml-worker/src/sentri_worker/tuning.py` | `DEFAULT_SUBJECT_ALIASES` is expanded with digit-substitution and typo variants (e.g., `MACH1NE LEARNING`, `0PERATING SYSTEMS`, `C0MPUTER NETWORKS`) mapping to canonical subject vocabulary entries. |
| Default normalization in parser `ml-worker/src/sentri_worker/parser.py` | Parser imports default alias dictionaries (`DEFAULT_FACULTY_ALIASES`, `DEFAULT_LOCATION_ALIASES`, `DEFAULT_SUBJECT_ALIASES`) from tuning. `SUBJECT_ALIASES` is set from `DEFAULT_SUBJECT_ALIASES`. When `tuning_profile` is `None`, parser normalizes `faculty_code` via `DEFAULT_FACULTY_ALIASES` and `location` via `DEFAULT_LOCATION_ALIASES` using both direct and whitespace-compacted keys. Custom tuning profile behavior is unchanged. |
| Default normalization tests `ml-worker/tests/test_parser.py` | Two new test methods verify that `parse_cell_text` normalizes OCR-confused subject, faculty, and location strings (e.g., "MACH1NE LEARN1NG" → "MACHINE LEARNING", "LABIII" → "LAB-III") to canonical forms when no tuning profile is provided. |
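
The "direct and whitespace-compacted keys" fallback described in the table could look roughly like the sketch below. The `apply_alias` and `normalize_fields` helpers are hypothetical names, the alias entries are limited to the two examples quoted on this page, and placing "V1" under faculty aliases is a guess; the real parser code is not shown here.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical entries: only "LABIII" -> "LAB-III" and "V1" -> "VI" appear
# on this page, and which table "V1" belongs to is an assumption.
DEFAULT_FACULTY_ALIASES = {"V1": "VI"}
DEFAULT_LOCATION_ALIASES = {"LABIII": "LAB-III"}


def apply_alias(value: str, aliases: Mapping[str, str]) -> str:
    """Try the cleaned value first, then the same value with whitespace removed."""
    direct = value.strip().upper()
    if direct in aliases:
        return aliases[direct]
    compact = "".join(direct.split())  # "LAB III" -> "LABIII"
    return aliases.get(compact, value)


def normalize_fields(
    faculty_code: str, location: str, tuning_profile: Optional[dict] = None
) -> Tuple[str, str]:
    """Apply default aliases only when no custom tuning profile is given."""
    if tuning_profile is not None:
        # The custom-profile path is out of scope for this sketch.
        return faculty_code, location
    return (
        apply_alias(faculty_code, DEFAULT_FACULTY_ALIASES),
        apply_alias(location, DEFAULT_LOCATION_ALIASES),
    )
```

Checking the compacted key second means an OCR result such as "LAB III" still resolves without adding a separate alias for every spacing variant.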

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

SahilKumar75/sentri#68: Both PRs modify `ml-worker/src/sentri_worker/parser.py`, specifically the `parse_cell_text` / `_parse_cell_text_impl` flow. The related PR adds caching, whitespace, and dedup logic; this PR changes the default alias normalization used in that parsing path.

Poem

🐰 OCR confuses digits with letters true,
But aliases now know what each typo's due.
No tuning profile? No problem at all,
Default aliases catch them before the fall! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly reflects the main objective of adding OCR subject normalization tuning to the ML worker parser. |
| Description check | ✅ Passed | The description covers the key changes (alias expansion, reuse in parser, tests) and includes testing commands and linked issue reference, though it deviates from the provided template structure. |
| Linked Issues check | ✅ Passed | All acceptance criteria from issue `#103` are met: noisy subject OCR variants (MACH1NE LEARN1NG, etc.) normalize correctly [`#103`], tests are added and pytest passes [`#103`], and changes are documented in PR description [`#103`]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue `#103` objectives: parser tuning alias expansion, default alias reuse in parser logic, and deterministic parser tests for OCR normalization. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


SahilKumar75

ML review: this keeps the change scoped to deterministic parser tuning. The expanded default aliases cover OCR digit/letter confusions for subject names, and the parser now applies default faculty/location aliases in the no-custom-profile path. Tests cover the noisy subject, faculty, and lab-location cases.
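
A test for the noisy cases mentioned in this review might be structured as in the sketch below. It uses a stand-in `parse_cell_text_stub`, because the real `parse_cell_text` signature and return type are not shown on this page; the class and method names are also hypothetical.

```python
import unittest

# Stand-in for the real parser. The actual parse_cell_text in
# ml-worker/src/sentri_worker/parser.py likely returns a richer structure.
_ALIASES = {"MACH1NE LEARN1NG": "MACHINE LEARNING", "LABIII": "LAB-III"}


def parse_cell_text_stub(text: str) -> str:
    """Look up the cleaned text in the alias table; pass unknown text through."""
    return _ALIASES.get(text.strip().upper(), text)


class DefaultNormalizationTest(unittest.TestCase):
    def test_noisy_inputs_normalize_without_profile(self):
        cases = {
            "MACH1NE LEARN1NG": "MACHINE LEARNING",
            " labiii ": "LAB-III",
            "MACHINE LEARNING": "MACHINE LEARNING",  # clean input is unchanged
        }
        for noisy, expected in cases.items():
            # subTest reports each failing variant separately.
            with self.subTest(noisy=noisy):
                self.assertEqual(parse_cell_text_stub(noisy), expected)
```

Since pytest collects `unittest.TestCase` classes, a test written this way would also run under the `pytest tests/test_parser.py` command listed in the Testing section.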

fix: tune ml worker ocr subject normalization

47fbabf

SahilKumar75 commented May 12, 2026


SahilKumar75 merged commit 7e1203d into main May 12, 2026
2 of 3 checks passed

SahilKumar75 deleted the codex/ml-ocr-subject-tuning branch May 12, 2026 16:51
