DocProcessor

A lightweight, three-stage pipeline for extracting and categorizing documents (PDFs and images) using PyMuPDF, Tesseract OCR, and the Claude AI API.

Input (PDF / Image)
      │
      ▼
 [ Extractor ]        ← PyMuPDF + Tesseract OCR
      │
      ▼
 [ AI Categorizer ]   ← Claude API + Pydantic validation
      │
      ▼
 [ JSON / JSONL / CSV Output ]

Features

Native PDF text extraction via PyMuPDF — 10× faster than PyPDF2
Automatic OCR fallback for scanned pages (per-page detection)
Direct image input: PNG, JPG, TIFF, BMP, WebP
AI categorization and structured key-field extraction via Claude API
Pydantic-validated output schema — no silent data corruption
Exponential-backoff retry on API failures
Batch processing: files, globs, entire directories
Output formats: JSON, JSONL, CSV
Per-file output directory mode
Confidence threshold filtering (--min-confidence)
Rich progress bar and summary table
Fully configurable via environment variables
Bilingual OCR: English + Hungarian (eng+hun)

Requirements

Python ≥ 3.11
uv (recommended) or pip
Tesseract OCR installed system-wide
Anthropic API key

Install Tesseract

Ubuntu/Debian:

sudo apt install tesseract-ocr tesseract-ocr-hun

macOS:

brew install tesseract tesseract-lang

Windows: Download from UB Mannheim.

Installation

git clone https://github.com/youruser/doc-processor.git
cd doc-processor
uv venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
uv pip install -e .

Configuration

cp .env.example .env

ANTHROPIC_API_KEY=your_api_key_here

Additional environment variables:

Variable	Default	Description
`ANTHROPIC_API_KEY`	—	Required. Anthropic API key
`DOC_MODEL`	`claude-sonnet-4-6`	Claude model to use
`DOC_MAX_INPUT_CHARS`	`8000`	Max chars sent to Claude
`DOC_OCR_LANGS`	`eng+hun`	Tesseract language string
`DOC_MIN_CONFIDENCE`	`0.0`	Global confidence threshold
`DOC_API_RETRIES`	`3`	API retry attempts
`LOG_LEVEL`	`WARNING`	Python log level

Usage

# Single file → stdout
doc-processor invoice.pdf

# Image file
doc-processor scan.png

# Batch — all PDFs in a directory
doc-processor docs/ --output-dir results/

# Multiple files, JSONL output
doc-processor *.pdf --format jsonl --output results.jsonl

# CSV export, skip low-confidence results
doc-processor docs/ --format csv --min-confidence 0.75 --output summary.csv

# Debug logging
doc-processor invoice.pdf --verbose

Example Output

{
  "category": "invoice",
  "language": "Hungarian",
  "summary": "Invoice from Példa Kft. for IT services in March 2026. Total amount is 450,000 HUF.",
  "key_fields": {
    "invoice_number": "2026/03/042",
    "date": "2026-03-15",
    "total": "450000 HUF",
    "vendor": "Példa Kft."
  },
  "confidence": 0.94,
  "source_file": "szamla_marc.pdf",
  "page_count": 2,
  "char_count": 3471
}

Supported Categories

Category	Description
`invoice`	Bills, receipts, tax invoices
`contract`	Agreements, legal documents
`report`	Reports, analyses, summaries
`form`	Filled or blank forms
`other`	Anything not matching above

Project Structure

doc_processor/
├── main.py           # CLI entry point, batch orchestrator
├── extractor.py      # Text extraction (PyMuPDF + Tesseract)
├── categorizer.py    # Claude API call, retry logic
├── models.py         # Pydantic output schema
├── exceptions.py     # Custom exception hierarchy
├── config.py         # Config dataclass, env var loading
└── utils.py          # File validation, directory collection
tests/
├── conftest.py
├── test_models.py    # Pydantic validation tests
├── test_utils.py     # File helper tests
├── test_categorizer.py  # Live API tests (requires API key)
└── fixtures/         # Put test PDFs/images here
pyproject.toml
requirements.txt

Running Tests

# Fast tests (no API key needed)
pytest tests/test_models.py tests/test_utils.py

# All tests including live API calls
ANTHROPIC_API_KEY=your_key pytest

# With coverage
pytest --cov=doc_processor

Design Decisions

Decision	Reason
PyMuPDF over PyPDF2	10× faster, handles scanned PDFs and multi-column layouts natively
Claude API for categorization	No custom ML model to train or maintain
Pydantic output validation	Catches malformed API responses before they corrupt downstream data
Exponential backoff on retries	Handles transient API rate limits gracefully
8000-char input guard	Prevents token overflow on large documents
`lang="eng+hun"`	Covers both English and Hungarian documents
`rich` for CLI output	Progress bar + summary table — no printf debugging

Contributing

See CONTRIBUTING.md.

Security

See SECURITY.md.

License

GNU General Public License v3 with additional restrictions. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DocProcessor

Features

Requirements

Install Tesseract

Installation

Configuration

Usage

Example Output

Supported Categories

Project Structure

Running Tests

Design Decisions

Contributing

Security

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
doc_processor		doc_processor
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

DocProcessor

Features

Requirements

Install Tesseract

Installation

Configuration

Usage

Example Output

Supported Categories

Project Structure

Running Tests

Design Decisions

Contributing

Security

License

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages