ocrguru is a lightweight, extensible CLI tool that wraps the powerful Docling OCR pipeline.
Process scanned PDFs or images in one command, choose your OCR engine, and export clean text in Markdown, JSON, or hOCR—all without the usual setup fuss.
- ✨ Features
- 🚀 Installation
- ⚡ Quick Start
- 🛠️ CLI Reference
- 🎯 Examples
- 📂 Project Structure
- 🤝 Contributing
- ✅ Testing
- 📜 License
- Multiple Engines
  - RapidOCR (default, ONNX-based)
  - EasyOCR (PyTorch-powered)
  - Tesseract (Python wrapper or CLI)
- Input Formats
  - .pdf, .png, .jpg/.jpeg, .tiff, .bmp
- Output Formats
  - Markdown (.md) – human-friendly
  - JSON (.json) – full coordinates & metadata
  - hOCR (.html) – preserve layout & styling
Zero-Config Defaults
- RapidOCR models auto-download from Hugging Face & cache locally
-
Cross-Platform
- Works on Windows, macOS, and Linux
-
Extensible Codebase
- Core logic lives in
core.py - CLI interface in
cli.py - Easily add new engines or pipeline options
- Core logic lives in
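One common way to keep engines easy to add is a small name-to-factory registry. The sketch below is illustrative only: `ENGINES`, `register_engine`, `get_engine_options`, the `my_engine` entry, and the option dictionaries are all hypothetical names and values, not ocrguru's actual API.

```python
from typing import Callable, Dict

# Hypothetical registry: maps an engine name (as passed to --engine) to a
# factory returning engine-specific options. Not ocrguru's real internals.
ENGINES: Dict[str, Callable[[], dict]] = {}


def register_engine(name: str):
    """Decorator that records an engine factory under `name`."""
    def decorator(factory: Callable[[], dict]) -> Callable[[], dict]:
        if name in ENGINES:
            raise ValueError(f"engine {name!r} is already registered")
        ENGINES[name] = factory
        return factory
    return decorator


@register_engine("rapidocr")
def rapidocr_options() -> dict:
    # Placeholder settings; real options would come from the OCR pipeline.
    return {"backend": "onnx"}


@register_engine("my_engine")
def my_engine_options() -> dict:
    # A made-up engine showing how a new entry would be added.
    return {"backend": "custom"}


def get_engine_options(name: str) -> dict:
    """Look up an engine by name, failing with a readable message."""
    try:
        factory = ENGINES[name]
    except KeyError:
        known = ", ".join(sorted(ENGINES))
        raise ValueError(f"unknown engine {name!r}; choose from: {known}") from None
    return factory()
```

With a layout like this, adding an engine means writing one decorated function; the lookup code and the CLI choices can both be derived from the registry keys.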
From PyPI:
pip install ocrguru

From your GitHub clone:
git clone https://github.com/yourusername/ocrguru.git
cd ocrguru
pip install .

Note:
- docling and huggingface-hub will install automatically.
- For GPU-accelerated EasyOCR, install PyTorch with CUDA support:
  pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118
Perform OCR on a PDF using the default RapidOCR engine and export to Markdown:
docling-ocr --engine rapidocr --input ./scanned_document.pdf --format md --output ./scanned_document.md

No extra flags needed! Because rapidocr and md are the defaults, --engine and --format can also be omitted.
Usage: docling-ocr [OPTIONS]

Options:
  -e, --engine [easyocr|tesseract_py|tesseract_cli|rapidocr]
                                 OCR engine (default: rapidocr)
  -i, --input PATH               Input file path (.pdf, image)
  -f, --format [md|json|html]    Output format (default: md)
  -o, --output PATH              Output file path
  -h, --help                     Show this message and exit
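To make the option table concrete, here is a minimal sketch of a parser with the same flags and defaults, written with the standard library's `argparse`. It is an assumption for illustration: the real cli.py may use a different framework, and whether `--input` and `--output` are mandatory is a guess made here, not something the help text states.

```python
import argparse

# Choices and defaults copied from the help text above.
ENGINE_CHOICES = ["easyocr", "tesseract_py", "tesseract_cli", "rapidocr"]
FORMAT_CHOICES = ["md", "json", "html"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented docling-ocr options."""
    parser = argparse.ArgumentParser(prog="docling-ocr")
    parser.add_argument("-e", "--engine", choices=ENGINE_CHOICES,
                        default="rapidocr", help="OCR engine (default: rapidocr)")
    # Assumption: input and output paths are treated as required here.
    parser.add_argument("-i", "--input", required=True,
                        help="Input file path (.pdf, image)")
    parser.add_argument("-f", "--format", choices=FORMAT_CHOICES,
                        default="md", help="Output format (default: md)")
    parser.add_argument("-o", "--output", required=True,
                        help="Output file path")
    # argparse adds -h/--help automatically.
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "invoice.jpg", "-o", "invoice.md"])
    print(args.engine, args.format)
```

Passing only `--input` and `--output` leaves the engine and format at their documented defaults, which is the behavior the batch example relies on.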
docling-ocr --engine easyocr --input invoice.jpg --format json --output invoice.json

docling-ocr --engine tesseract_cli --input contract.pdf --format html --output contract.hocr.html

for pdf in reports/*.pdf; do
  out="${pdf%.pdf}.md"
  docling-ocr --input "$pdf" --output "$out"
done

ocrguru/
├── src/
│   └── ocrguru/
│       ├── cli.py          # CLI entry point
│       └── core.py         # OCR conversion logic
├── tests/                  # pytest test suite
├── pyproject.toml          # build & metadata
└── README.md               # this file
We welcome your ideas and pull requests!
- Fork the repo & create a feature branch
- Install dev dependencies (the quotes keep shells such as zsh from expanding the brackets):
  pip install -e ".[test]"
- Write tests in tests/ and implement your feature in src/ocrguru/
- Run the test suite:
  pytest
- Open a pull request against main
Please adhere to PEP 8 and write clear commit messages.
We use pytest for automated testing. Coverage reporting is encouraged (the --cov option comes from the pytest-cov plugin):
pytest --cov=ocrguru

Ensure new features include corresponding tests.
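As a rough idea of what a test accompanying a new feature might look like, here is a small self-contained sketch. The helper `default_output_path` and the `EXTENSIONS` table are hypothetical, defined inline only so the tests have something to check; they are not part of ocrguru. The tests use plain `assert` statements, which pytest collects from any function named `test_*`.

```python
from pathlib import Path

# Hypothetical mapping from --format values to the file extensions
# listed under Output Formats.
EXTENSIONS = {"md": ".md", "json": ".json", "html": ".html"}


def default_output_path(input_path: str, fmt: str = "md") -> Path:
    """Derive an output path by swapping the input file's extension."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Path(input_path).with_suffix(EXTENSIONS[fmt])


def test_default_format_is_markdown():
    assert default_output_path("reports/q1.pdf") == Path("reports/q1.md")


def test_json_format_changes_extension():
    assert default_output_path("invoice.jpg", "json") == Path("invoice.json")


def test_unknown_format_is_rejected():
    try:
        default_output_path("scan.png", "docx")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an unsupported format")
```

In a real contribution the helper would live under src/ocrguru/ and the test functions in a file under tests/, so that `pytest --cov=ocrguru` reports coverage for the new code.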
Released under the MIT License. See LICENSE for full text.
❤️ Happy OCR’ing with ocrguru! ❤️