📄 Intelligent OCR & Document Processing Pipeline

Version: 1.0.0 (Stable)
Tech Stack: Python 3.11, FastAPI, PaddleOCR, EasyOCR, PyMuPDF
Status: MVP (Minimum Viable Product)

📌 Project Overview

This project is a high-performance Document Management System backend designed to extract text from mixed-language documents (Arabic & English). It features a Smart Routing System that automatically detects the document language and selects the optimal OCR engine:

PaddleOCR: Used for high-speed processing of English text, numbers, and tables.
EasyOCR: Used for high-accuracy processing of Arabic text with correct right-to-left sentence reconstruction.

Additionally, the pipeline acts as a Dataset Builder, archiving original files, extracted text, and metadata logs for future model training.
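The Dataset Builder role can be illustrated with a minimal sketch. The function name `archive_sample`, the one-folder-per-document layout, and the file names `target.txt` and `metadata.json` are assumptions made for illustration; the project's actual layout may differ.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_sample(source: Path, text: str, metadata: dict, root: Path) -> Path:
    """Store the original file, its extracted text, and a metadata log together.

    Hypothetical helper: the folder layout and file names are illustrative only.
    """
    sample_dir = root / source.stem
    sample_dir.mkdir(parents=True, exist_ok=True)

    # Keep the untouched original next to its OCR output.
    shutil.copy2(source, sample_dir / f"source{source.suffix}")
    (sample_dir / "target.txt").write_text(text, encoding="utf-8")

    record = {"archived_at": datetime.now(timezone.utc).isoformat(), **metadata}
    # ensure_ascii=False keeps Arabic text human-readable in the JSON file.
    (sample_dir / "metadata.json").write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return sample_dir
```

Keeping the source, target text, and metadata in one folder per document makes each sample self-describing, which tends to simplify building training pairs later.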

⚙️ System Architecture & Logic

The system follows a "Hybrid & Smart" approach to balance speed and accuracy:

Input: Accepts Images (JPG, PNG) or PDFs (Scanned or Digital).
PDF Parsing: Uses PyMuPDF to attempt direct text extraction first. If a page has no usable text layer, it is rendered to an image and sent through OCR.
The "Smart Router":
- Step 1: Runs PaddleOCR (Fast pass).
- Step 2: Checks extracted text for Arabic characters (Regex).
- Step 3 (Decision):
  - If English/Numbers: Returns PaddleOCR result immediately.
  - If Arabic: Switches to EasyOCR with paragraph=True to fix cursive connectivity and reading order.
Output: Saves the Source File, Target Text File, and JSON Metadata.
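The Smart Router steps above can be sketched as follows. This is a simplified illustration, not the project's actual code: the OCR engines are passed in as plain callables (hypothetical `fast_engine` and `arabic_engine`) so the routing logic stays independent of any library, and the exact Unicode ranges in the regex are an assumption.

```python
import re
from typing import Callable, NamedTuple

# Assumed ranges: Arabic, Arabic Supplement, and Arabic Presentation Forms A/B.
ARABIC_RE = re.compile(r"[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFC]")

# An engine takes raw image bytes and returns the recognized text.
OcrEngine = Callable[[bytes], str]


class RoutedResult(NamedTuple):
    text: str
    engine: str


def contains_arabic(text: str) -> bool:
    """Step 2: regex check for any Arabic character in the extracted text."""
    return ARABIC_RE.search(text) is not None


def route(image: bytes, fast_engine: OcrEngine, arabic_engine: OcrEngine) -> RoutedResult:
    """Run the fast pass first and fall back to the Arabic engine only if needed."""
    first_pass = fast_engine(image)  # Step 1: fast pass (PaddleOCR in this project)
    if not contains_arabic(first_pass):
        # Step 3a: English/numbers only, so return the fast result immediately.
        return RoutedResult(first_pass, "paddleocr")
    # Step 3b: Arabic detected, so re-run with the Arabic-capable engine (EasyOCR).
    return RoutedResult(arabic_engine(image), "easyocr")
```

In a real deployment, `arabic_engine` could wrap something like `easyocr.Reader(["ar", "en"]).readtext(image, detail=0, paragraph=True)` and join the returned strings. Injecting the engines as callables also makes the router easy to unit-test with stubs.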

🚀 Key Features

⚡ Auto-Routing Engine: No need to manually select the language; the API decides per file.
🧠 Hybrid PDF Processing: Handles digital PDFs (text layer) and scanned PDFs (OCR layer) in the same pipeline, choosing the method per page.
🇸🇦 Optimized for Arabic: Solves the common "disjointed letters" and "wrong reading order" issues in Arabic OCR.
📂 Dataset Generation: Automatically organizes processed files into a structured dataset for future ML tasks.
📊 Detailed Logging: Generates a JSON report for every batch containing confidence scores, models used, and processing methods per page.
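As a rough sketch of how hybrid PDF processing and per-page logging might fit together, the example below tries the text layer first and records which method was used for each page. The names `extract_page`, `process_pdf`, and `render_and_ocr`, along with the `min_chars` threshold, are assumptions for illustration and not taken from the project.

```python
from typing import Callable, Protocol


class PageLike(Protocol):
    """Anything with a PyMuPDF-style get_text() method."""

    def get_text(self) -> str: ...


def extract_page(
    page: PageLike,
    render_and_ocr: Callable[[PageLike], str],
    min_chars: int = 20,  # assumed threshold for "has a usable text layer"
) -> dict:
    """Return the page text plus the method used, for the per-page log."""
    text = page.get_text().strip()
    if len(text) >= min_chars:
        return {"method": "text_layer", "text": text}
    # No usable text layer: hand the page to the OCR path.
    return {"method": "ocr", "text": render_and_ocr(page)}


def process_pdf(
    path: str,
    render_and_ocr: Callable[[PageLike], str],
    min_chars: int = 20,
) -> list[dict]:
    """Process every page of a PDF and return one log entry per page."""
    import fitz  # PyMuPDF; imported lazily so extract_page works without it

    with fitz.open(path) as doc:
        return [
            {"page": number, **extract_page(page, render_and_ocr, min_chars)}
            for number, page in enumerate(doc, start=1)
        ]
```

A `render_and_ocr` callable would typically render the page with `page.get_pixmap(dpi=200).tobytes("png")` and pass the bytes to the OCR router. The list returned by `process_pdf` can be serialized with `json.dumps` to form part of the batch report.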
