Tune ML worker OCR subject normalization #104

Merged
SahilKumar75 merged 1 commit into main from codex/ml-ocr-subject-tuning
May 12, 2026

Conversation

@SahilKumar75 SahilKumar75 commented May 12, 2026

Summary

  • Expand ML worker default tuning aliases for common OCR-confused timetable subject names.
  • Reuse default subject, faculty, and location aliases in the default parser path.
  • Add parser tests for noisy subject, faculty, and lab-location normalization.

Testing

  • cd ml-worker && pytest tests/test_parser.py
  • cd ml-worker && pytest
  • git diff --check

Closes #103

Summary by CodeRabbit

Release Notes

  • Improvements

    • Enhanced parser to automatically normalize OCR errors and typos in subject names and faculty/location codes (e.g., "MACH1NE LEARN1NG" → "MACHINE LEARNING", "V1" → "VI").
    • Parser now applies default normalization mappings when no custom tuning profile is configured.
  • Tests

    • Added tests to verify OCR and typo normalization across multiple variants.

coderabbitai Bot commented May 12, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The PR extends the ML worker parser to apply default alias normalization when no custom tuning profile is present. It expands default subject aliases with OCR/typo variants, integrates default normalization into the parser fallback logic, and adds tests validating the normalization behavior.
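The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in parser.py: the alias entries are a small subset taken from the examples quoted in this PR, and the `TuningProfile` class and `normalize_subject` function are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Illustrative subset only; the real DEFAULT_SUBJECT_ALIASES lives in tuning.py.
DEFAULT_SUBJECT_ALIASES: Dict[str, str] = {
    "MACH1NE LEARN1NG": "MACHINE LEARNING",
    "MACH1NE LEARNING": "MACHINE LEARNING",
    "0PERATING SYSTEMS": "OPERATING SYSTEMS",
    "C0MPUTER NETWORKS": "COMPUTER NETWORKS",
}


@dataclass
class TuningProfile:
    """Hypothetical stand-in for a custom tuning profile."""
    subject_aliases: Dict[str, str] = field(default_factory=dict)


def normalize_subject(raw: str, tuning_profile: Optional[TuningProfile] = None) -> str:
    """Map an OCR-confused subject string to its canonical form (hypothetical helper)."""
    # Upper-case and collapse runs of whitespace so lookups are stable.
    key = " ".join(raw.upper().split())
    # With no custom profile, fall back to the default alias table.
    if tuning_profile is None:
        aliases = DEFAULT_SUBJECT_ALIASES
    else:
        aliases = tuning_profile.subject_aliases
    # Unknown strings pass through unchanged (apart from the cleanup above).
    return aliases.get(key, key)
```

Under these assumptions, `normalize_subject("MACH1NE LEARN1NG")` resolves through the default table, while a caller that passes its own profile bypasses the defaults entirely.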

Changes

Default Alias Normalization

| Layer / File(s) | Summary |
| --- | --- |
| OCR/typo alias expansion (ml-worker/src/sentri_worker/tuning.py) | DEFAULT_SUBJECT_ALIASES is expanded with digit-substitution and typo variants (e.g., MACH1NE LEARNING, 0PERATING SYSTEMS, C0MPUTER NETWORKS) mapping to canonical subject vocabulary entries. |
| Default normalization in parser (ml-worker/src/sentri_worker/parser.py) | Parser imports default alias dictionaries (DEFAULT_FACULTY_ALIASES, DEFAULT_LOCATION_ALIASES, DEFAULT_SUBJECT_ALIASES) from tuning. SUBJECT_ALIASES is set from DEFAULT_SUBJECT_ALIASES. When tuning_profile is None, parser normalizes faculty_code via DEFAULT_FACULTY_ALIASES and location via DEFAULT_LOCATION_ALIASES using both direct and whitespace-compacted keys. Custom tuning profile behavior is unchanged. |
| Default normalization tests (ml-worker/tests/test_parser.py) | Two new test methods verify that parse_cell_text normalizes OCR-confused subject, faculty, and location strings (e.g., "MACH1NE LEARN1NG" → "MACHINE LEARNING", "LABIII" → "LAB-III") to canonical forms when no tuning profile is provided. |
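The "direct and whitespace-compacted keys" lookup mentioned in the parser row could look something like the sketch below. It is an assumption-laden illustration: the `lookup_alias` helper is hypothetical, the dictionary contents are guessed from the examples quoted on this page ("LABIII" → "LAB-III", "V1" → "VI"), and treating "V1" as a faculty code is itself an assumption. The real entries in tuning.py may differ.

```python
import re
from typing import Dict

# Hypothetical entries modeled on the examples quoted in this PR.
DEFAULT_LOCATION_ALIASES: Dict[str, str] = {"LABIII": "LAB-III"}
DEFAULT_FACULTY_ALIASES: Dict[str, str] = {"V1": "VI"}


def lookup_alias(value: str, aliases: Dict[str, str]) -> str:
    """Try the cleaned string first, then the same string with all whitespace removed."""
    direct = " ".join(value.upper().split())   # e.g. "lab  iii" -> "LAB III"
    compact = re.sub(r"\s+", "", direct)        # e.g. "LAB III"  -> "LABIII"
    for key in (direct, compact):
        if key in aliases:
            return aliases[key]
    # No alias matched: return the cleaned input unchanged.
    return direct


def test_default_location_and_faculty_normalization() -> None:
    # pytest-style check in the spirit of the tests described above.
    assert lookup_alias("LABIII", DEFAULT_LOCATION_ALIASES) == "LAB-III"
    assert lookup_alias("lab iii", DEFAULT_LOCATION_ALIASES) == "LAB-III"
    assert lookup_alias("V1", DEFAULT_FACULTY_ALIASES) == "VI"
```

Checking the compacted key second lets one alias entry cover OCR output that inserts or drops spaces, without listing every spacing variant in the table.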

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • SahilKumar75/sentri#68: Related — both PRs modify ml-worker/src/sentri_worker/parser.py, specifically the parse_cell_text / _parse_cell_text_impl flow (related PR adds caching/whitespace/dedup logic while this PR changes default alias normalization used in that parsing path).

Poem

🐰 OCR confuses digits with letters true,
But aliases now know what each typo's due.
No tuning profile? No problem at all,
Default aliases catch them before the fall!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly reflects the main objective of adding OCR subject normalization tuning to the ML worker parser. |
| Description check | ✅ Passed | The description covers the key changes (alias expansion, reuse in parser, tests) and includes testing commands and linked issue reference, though it deviates from the provided template structure. |
| Linked Issues check | ✅ Passed | All acceptance criteria from issue #103 are met: noisy subject OCR variants (MACH1NE LEARN1NG, etc.) normalize correctly [#103], tests are added and pytest passes [#103], and changes are documented in PR description [#103]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue #103 objectives: parser tuning alias expansion, default alias reuse in parser logic, and deterministic parser tests for OCR normalization. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@SahilKumar75 SahilKumar75 left a comment

ML review: this keeps the change scoped to deterministic parser tuning. The expanded default aliases cover OCR digit/letter confusions for subject names, and the parser now applies default faculty/location aliases in the no-custom-profile path. Tests cover the noisy subject, faculty, and lab-location cases.

@SahilKumar75 SahilKumar75 merged commit 7e1203d into main May 12, 2026
2 of 3 checks passed
@SahilKumar75 SahilKumar75 deleted the codex/ml-ocr-subject-tuning branch May 12, 2026 16:51

Development

Successfully merging this pull request may close these issues.

Tune ML worker subject aliases for OCR-confused timetable text