vietcan/toc-normalization-case-study-sach100

Real-World AI System Design Case Study

Solving Messy OCR Data with Deterministic Architecture

This repository is not a library.
It is a real production case study showing how system architecture determines whether AI can solve complex problems reliably.


Problem

OCR output from book Tables of Contents is highly inconsistent:

  • Page numbers appear before or after titles
  • Mixed casing styles
  • Numbered vs non-numbered headings
  • Multilingual text fragments
  • Nested structures without clear delimiters

Raw OCR text is not usable directly.
Manual cleanup is slow and error-prone.
Naive AI prompts fail on edge cases.
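
To make the first inconsistency concrete, here is a minimal Python sketch of one deterministic rule that moves a page number into a fixed field whether it appears before or after the title. It is not taken from the repository: the names `TocEntry` and `parse_toc_line` are hypothetical, and the rule assumes a page number is set off by leader dots or by at least two spaces. Anything ambiguous, such as "Chapter 1" with a single space, is kept as a title with no page rather than guessed at.

```python
import re
from typing import NamedTuple, Optional


class TocEntry(NamedTuple):
    """Hypothetical output record: a title and an optional page number."""
    title: str
    page: Optional[int]


# Assumed separator: a run of two or more leader dots, or two or more spaces.
_SEP = r"(?:\s*(?:[.·…]\s*){2,}|\s{2,})"
# "Chapter 1 ..... 12" -> the page number follows the title.
_TRAILING_PAGE = re.compile(rf"^(?P<title>.*?\S){_SEP}(?P<page>\d{{1,4}})$")
# "12  Chapter One" -> the page number precedes the title.
_LEADING_PAGE = re.compile(rf"^(?P<page>\d{{1,4}}){_SEP}(?P<title>\S.*)$")


def parse_toc_line(line: str) -> TocEntry:
    """Return a TocEntry; never raises on unrecognized input."""
    text = line.strip()
    match = _TRAILING_PAGE.match(text) or _LEADING_PAGE.match(text)
    if match is None:
        # Error-tolerant fallback: keep the text, leave the page unknown.
        return TocEntry(title=" ".join(text.split()), page=None)
    title = " ".join(match.group("title").split())
    return TocEntry(title=title, page=int(match.group("page")))


print(parse_toc_line("Chapter 1 ..... 12"))
print(parse_toc_line("12  Chapter One"))
print(parse_toc_line("Chapter 1"))
```

Leaving the ambiguous line unparsed is a deliberate choice in this sketch: a later rule, or a human, can resolve it with more context, while a wrong guess here would silently corrupt the output.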


Goal

Design a deterministic processing architecture where:

  • AI is controlled, not trusted blindly
  • Rules scale without conflict
  • Edge cases do not break the pipeline
  • Output format is stable across books
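
One way to read "controlled, not trusted blindly" is as a validation gate around the AI step. The Python sketch below is an illustration under assumptions, not the repository's implementation: `Entry`, `MAX_LEVEL`, `violates_contract`, and `apply_ai_step` are hypothetical names, the AI is stood in for by any callable, and the contract (same entry count, unchanged page numbers, non-empty titles, bounded nesting level) is an example of what such a contract might check.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_LEVEL = 3  # assumed maximum nesting depth, for illustration only


@dataclass(frozen=True)
class Entry:
    title: str
    page: int
    level: int


def violates_contract(before: List[Entry], after: List[Entry]) -> bool:
    """Check a proposed result against the (assumed) output contract."""
    if len(before) != len(after):
        return True  # the AI step may not add or drop entries
    for old, new in zip(before, after):
        if new.page != old.page:
            return True  # page numbers are owned by deterministic rules
        if not new.title.strip():
            return True  # titles must not be emptied
        if not 1 <= new.level <= MAX_LEVEL:
            return True  # nesting level must stay in range
    return False


def apply_ai_step(
    entries: List[Entry],
    ai_step: Callable[[List[Entry]], List[Entry]],
) -> List[Entry]:
    """Run the AI step, but keep its output only if it honors the contract."""
    try:
        proposed = ai_step(list(entries))
    except Exception:
        return list(entries)  # a failing AI call must not break the pipeline
    well_typed = isinstance(proposed, list) and all(
        isinstance(e, Entry) for e in proposed
    )
    if not well_typed or violates_contract(entries, proposed):
        return list(entries)  # fall back to the deterministic result
    return proposed
```

With this shape, the output format stays stable across books because the worst case of a misbehaving AI step is that its suggestion is discarded.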

Key Insight

Most AI failures are not model failures.
They are architecture failures.

The solution was not "better prompting"
but a layered system with strict contracts.


System Principles

  • Layered processing pipeline
  • One function = one responsibility
  • Strict input/output contracts
  • Error-tolerant parsing
  • Deterministic normalization before AI reasoning
  • Manual override allowed for rare cases
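
The principles above can be sketched as a small pipeline runner in Python. This is a hypothetical illustration, not the layer design documented in ARCHITECTURE.md: the layer functions, `run_pipeline`, and the `overrides` mapping are assumed names. Each layer does one thing, the runner enforces the same list-of-strings contract between layers, and manual overrides are applied last, keyed by the normalized line.

```python
from typing import Callable, Dict, List, Optional

Layer = Callable[[List[str]], List[str]]


def strip_noise(lines: List[str]) -> List[str]:
    """Drop blank lines and surrounding whitespace."""
    return [ln.strip() for ln in lines if ln.strip()]


def collapse_whitespace(lines: List[str]) -> List[str]:
    """Replace internal runs of whitespace with a single space."""
    return [" ".join(ln.split()) for ln in lines]


def normalize_case(lines: List[str]) -> List[str]:
    """Title-case lines that are entirely upper case; leave others alone."""
    return [ln.title() if ln.isupper() else ln for ln in lines]


def run_pipeline(
    lines: List[str],
    layers: List[Layer],
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Apply layers in order, checking the contract after each one."""
    current = list(lines)
    for layer in layers:
        result = layer(current)
        if not isinstance(result, list) or not all(
            isinstance(item, str) for item in result
        ):
            raise TypeError(f"{layer.__name__} broke the list[str] contract")
        current = result
    # Manual overrides for rare cases, applied after all rules have run.
    overrides = overrides or {}
    return [overrides.get(ln, ln) for ln in current]


layers = [strip_noise, collapse_whitespace, normalize_case]
print(run_pipeline(["  CHAPTER   ONE  ", "", "Preface"], layers))
```

Because every layer has the same input and output type, a new rule can be added as one more function in the list without touching the existing ones, which is one plausible reading of rules that "scale without conflict".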

Result

After architecture redesign:

  • Rule count increased safely without conflict
  • Edge cases stopped breaking the pipeline
  • AI outputs became predictable
  • Processing time dropped dramatically

Why This Repo Exists

This is a demonstration that:

Correct architecture enables AI to solve hard problems reliably.

Not every system needs a bigger model.
Many need a better structure.


Repository Contents

File               Purpose
ARCHITECTURE.md    System layer design
RULES.md           Rule design philosophy
AI_GOVERNANCE.md   How AI is constrained
before_after.md    Real input/output examples

Intended Audience

CTOs, system architects, technical founders.

This repo is about thinking, not tooling.


License

Shared for learning and discussion.

About

Production-style rule engine for transforming noisy OCR output of book tables of contents into clean structured data
