OCR Router is a Python CLI that processes PDF documents in batches, extracts metadata, classifies document types, and routes files into structured folders.
It is configuration-driven: categories, issuers, naming, and folder routing are defined in YAML so each user can tailor behavior without changing source code.
OCR Router is designed for high-volume document organization where PDFs arrive mixed (bank statements, credit cards, bills, tax forms, notices, receipts, paystubs). The pipeline performs OCR when needed, extracts key fields (date, amount, account hints, issuer), classifies by keyword scoring, proposes standardized filenames, and routes files into deterministic folder structures.
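The keyword-scoring step can be pictured with a minimal sketch. The category names, keyword lists, and the `classify` helper below are hypothetical illustrations, not OCR Router's actual rules or API; in the real tool the rules come from the YAML config.

```python
# Hypothetical keyword lists; the real ones are defined in the YAML config.
CATEGORY_KEYWORDS = {
    "bank_statement": ["statement period", "beginning balance", "ending balance"],
    "utility_bill": ["amount due", "kwh", "service address"],
    "paystub": ["net pay", "gross pay", "ytd"],
}


def classify(text, keywords=CATEGORY_KEYWORDS):
    """Return (category, score), or ("unknown", 0) when no keyword matches."""
    lowered = text.lower()
    # Score each category by counting occurrences of its keywords.
    scores = {
        category: sum(lowered.count(word) for word in words)
        for category, words in keywords.items()
    }
    # max() keeps the first category on ties, so the result is deterministic.
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else ("unknown", 0)
```

Because the score is a plain count over fixed keyword lists, the same text always yields the same category, which is what makes this kind of routing reproducible.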
The tool emphasizes safety and reviewability:
- Interactive review before moves/renames
- Dry-run mode for zero-risk previews
- CSV/JSONL manifests for traceability
- Sanitization checks to prevent accidental personal-data publication
- Interactive review before file operations
- OCR support for scanned/image-only PDFs via PDF24
- JPEG/JPG ingestion with conversion to searchable PDF (OCR applied)
- Metadata extraction (date, amount, issuer, account)
- Rule-based classification and routing templates
- Deterministic filename normalization
- Density-aware folder resolver
- Manifest output in CSV and JSONL
- Privacy guardrails for public repositories
- Optional local LLM second-opinion classifier (Ollama, no cloud, no API keys) — see Section 6
- Feedback log + embedding store — pipeline learns from your past corrections
- Eval mode — measure classifier accuracy against your organized tree
- Python 3.10+
- Windows PowerShell
- Optional OCR engine: PDF24 Creator (recommended)
- Optional local LLM stack: Ollama (recommended — see Section 6)
Install PDF24 (optional but recommended):
| Option | Command / Link |
|---|---|
| Winget | winget install SoftwareAG.PDF24Creator |
| Chocolatey | choco install pdf24creator |
| Download | https://tools.pdf24.org/en/creator |
If you just want to use the tool, install it as a standalone CLI with pipx:
```powershell
# One-time setup (install pipx)
python -m pip install --user pipx
python -m pipx ensurepath

# Install ocr-router from GitHub (or a local clone)
pipx install git+https://github.com/oscarzamora/ocr-docs.git

# Verify
ocr-router --help
```

This puts `ocr-router` on your PATH so you can run it from any terminal in any folder.
To upgrade later: `pipx upgrade ocr-router`. To uninstall cleanly: `pipx uninstall ocr-router`.
```powershell
git clone https://github.com/oscarzamora/ocr-docs.git
cd ocr-docs
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -e .[dev]
```

```powershell
# Inside the cloned repo:
Copy-Item config/routing-config.yaml config/routing-config.local.yaml

# Or anywhere outside the repo if installed via pipx:
ocr-router --help   # shows OCR_CONFIG_PATH env var usage
```

Edit your local config with your own categories, issuers, and route templates.
Main command (interactive mode):
```powershell
python -m ocr_router.cli process `
  --input "C:\path\to\input" `
  --output "C:\path\to\output" `
  --config config/routing-config.local.yaml
```

Interactive flow (PDF + JPEG inputs):
- Analyze all input files
- Convert JPEG/JPG files to searchable PDFs through OCR preprocessing
- Show proposal table (category, issuer, filename, route)
- Ask move vs rename-in-place
- Ask file selection (all, include-list, skip-list)
- Execute and write history/manifest
The pipeline is deterministic by default (keyword scoring). The local LLM (llama3.2:3b
via Ollama) is an opt-in second opinion. There are two distinct ways to use it:
| Mode | What you run | Where the LLM helps | When to pick this |
|---|---|---|---|
| (A) Standalone CLI with `--llm` | `ocr-router process ... --llm` in any terminal | Classifies each doc, parses your free-form replies at the confirm prompt | You like the terminal, want a single command, scripting, cron jobs |
| (B) `@OCR Router` agent in VS Code Chat | Type `@OCR Router process ...` in Copilot chat | Same as above, plus the agent translates natural-language goals into the right CLI commands and confirms each step in chat | You want conversational HITL, you're already in VS Code, multi-step asks |
Both modes share the exact same pipeline, same feedback log, same embedding store. Mode B is a thin chat wrapper over Mode A — nothing magical, just a friendlier surface.
Sections 6 (standalone) and 7 (agent mode) below give the full setup and examples for each.
Use this in GitHub repository description/about field:
OCR Router is a privacy-aware Python CLI for OCR, metadata extraction, deterministic document classification, smart naming, and folder routing with review-first workflows and manifest traceability.
Suggested Topics (GitHub tags):
`python`, `ocr`, `pdf`, `document-management`, `cli`, `automation`, `privacy`, `yaml-config`
```powershell
git tag -a v0.2.0 -m "v0.2.0: README overhaul, sanitize gate, release documentation"
git push origin v0.2.0
```

List tags:

```powershell
git tag --list
```

Show tag details:

```powershell
git show v0.2.0
```

Interactive run:

```powershell
python -m ocr_router.cli process `
  --input "C:\Docs\Incoming" `
  --output "C:\Docs\Sorted" `
  --config config/routing-config.local.yaml
```

Dry run (preview only, nothing moves):

```powershell
python -m ocr_router.cli process `
  --input "C:\Docs\Incoming" `
  --output "C:\Docs\Sorted" `
  --config config/routing-config.local.yaml `
  --dry-run
```

Non-interactive run:

```powershell
python -m ocr_router.cli process `
  --input "C:\Docs\Incoming" `
  --output "C:\Docs\Sorted" `
  --config config/routing-config.local.yaml `
  --no-interactive
```

Review a manifest:

```powershell
python -m ocr_router.cli review --manifest "C:\Docs\Sorted\manifest.jsonl"
```

Run the sanitization check:

```powershell
python scripts/sanitize_check.py
```

Run with the local LLM second opinion:

```powershell
python -m ocr_router process `
  --input "C:\Users\<user>\Documents\__downloads__" `
  --output "C:\Users\<user>\Documents" `
  --config config/routing-config.local.yaml `
  --llm
```

Measure classifier accuracy (eval mode):

```powershell
python -m ocr_router eval `
  --root "C:\Users\<user>\Documents" `
  --sample 200 --llm
```

OCR Router ships with an opt-in local-first LLM stack that runs entirely on your machine. No cloud, no API keys, no document data leaves your computer.
With it enabled the pipeline:
- Runs the keyword router (default behavior, fully deterministic).
- Asks `llama3.2:3b` via Ollama for a second opinion, with the `k` most-similar past confirmed decisions injected as few-shot exemplars.
- Applies a simple decision rule: agreement → confident, disagreement → flag for HITL, low LLM confidence → keep keyword + show hint.
- At the confirm prompt, parses your free-form English ("skip 2 because I haven't paid") into structured actions (`park_some [2], note "haven't paid"`) with a transparent `Understood: …` recap.
- Logs every decision (and every correction you make) to a JSONL feedback log so the classifier learns from your taxonomy over time.
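The decision rule in the third bullet can be sketched as a small function. This is an illustration under stated assumptions, not the project's code: the `Decision` type, the `combine` name, the status strings, and the order in which the checks run are all hypothetical. The `0.6` default mirrors the `confidence_threshold` value shown in the config example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    category: str                 # category the pipeline will propose
    status: str                   # "confident", "needs_review", or "keyword_with_hint"
    hint: Optional[str] = None    # the LLM's differing or low-confidence suggestion


def combine(keyword_cat, llm_cat, llm_conf, threshold=0.6):
    """Merge the keyword result with the LLM's second opinion."""
    # Assumption: low confidence is checked first, so a shaky LLM answer
    # never overrides or flags the deterministic keyword result.
    if llm_conf < threshold:
        return Decision(keyword_cat, "keyword_with_hint", hint=llm_cat)
    if keyword_cat == llm_cat:
        return Decision(keyword_cat, "confident")
    # Disagreement with a confident LLM: flag for human review.
    return Decision(keyword_cat, "needs_review", hint=llm_cat)
```

Keeping the keyword category as the proposal in every branch is also an assumption; it reflects the statement that the pipeline stays deterministic by default and the LLM acts only as a second opinion.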
```powershell
# 1. Install Ollama (https://ollama.com), then pull the two models
ollama pull llama3.2:3b        # ~2 GB, chat model
ollama pull nomic-embed-text   # ~270 MB, embeddings for few-shot

# 2. Enable LLM in your local config
#    Add to config/routing-config.local.yaml:
#      llm:
#        enabled: true
#        confidence_threshold: 0.6
#        fewshot_k: 5

# 3. (One time) Bootstrap the feedback log from your existing organized tree
python -m ocr_router feedback bootstrap-tree --root "C:\Users\<user>\Documents"

# 4. (One time) Embed all bootstrapped records into the local SQLite vector store
python -m ocr_router feedback embed

# 5. Verify the stack is healthy
python -m ocr_router llm doctor
```

```powershell
# Always dry-run first: see what would happen, nothing moves yet
python -m ocr_router process `
  --input "C:\Users\<user>\Documents\__downloads__" `
  --output "C:\Users\<user>\Documents" `
  --config config\routing-config.local.yaml `
  --llm --dry-run

# When the proposal table looks right, run for real (no --dry-run)
python -m ocr_router process `
  --input "C:\Users\<user>\Documents\__downloads__" `
  --output "C:\Users\<user>\Documents" `
  --config config\routing-config.local.yaml `
  --llm
```

At the confirm prompt you can type either deterministic syntax OR natural language:

```text
# Deterministic (always works, no LLM required):
Enter               move ALL files
1,3,5               move ONLY those numbers
skip 2,4            move all EXCEPT those numbers
park 7              keep those files in place permanently (never re-propose)
park 7 note: <r>    same as park, capture the reason verbatim
q                   quit without moving anything

# Natural language (requires --llm; uses LLM to parse intent):
skip 2 because I haven't paid yet         → park 2 + note (per unpaid convention)
park the FPL one, it's a duplicate        → asks for the file number if ambiguous
move 1 3 5, the others are for Luciana    → moves 1,3,5; skipped get rule prompt
4 is actually FPL not AT&T                → adds issuer rule to local YAML
nevermind / cancel                        → quit
```
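The deterministic half of that syntax is simple enough to parse without a model. The following is a minimal sketch of such a parser, assuming the grammar is exactly what the table above shows; `Action`, `parse_reply`, and the `kind` strings are hypothetical names, not the tool's internal API.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    kind: str                                  # "move_all", "move_only", "skip", "park", "quit"
    numbers: List[int] = field(default_factory=list)
    note: Optional[str] = None


def _numbers(text):
    """Extract every integer from a comma/space separated list."""
    return [int(n) for n in re.findall(r"\d+", text)]


def parse_reply(reply):
    reply = reply.strip()
    if reply == "":                            # bare Enter
        return Action("move_all")
    if reply.lower() == "q":
        return Action("quit")
    # "park 7" or "park 7 note: <reason>"; the reason is kept verbatim.
    m = re.fullmatch(r"park\s+([\d,\s]+?)(?:\s+note:\s*(.+))?", reply, flags=re.IGNORECASE)
    if m:
        return Action("park", _numbers(m.group(1)), m.group(2))
    m = re.fullmatch(r"skip\s+([\d,\s]+)", reply, flags=re.IGNORECASE)
    if m:
        return Action("skip", _numbers(m.group(1)))
    if re.fullmatch(r"[\d,\s]+", reply) and _numbers(reply):
        return Action("move_only", _numbers(reply))
    # Anything else would need the LLM intent parser (--llm).
    raise ValueError(f"unrecognized reply: {reply!r}")
```

For example, `parse_reply("park 7 note: haven't paid")` yields a `park` action for file 7 with the note preserved, while a free-form sentence raises `ValueError`, which is the point where an LLM-backed parser would take over.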
```powershell
python -m ocr_router feedback stats                 # counts by event/category/backend
python -m ocr_router feedback show --limit 20       # most recent records (with Note column)
python -m ocr_router feedback search "AMEX credit card statement"
python -m ocr_router feedback parked list           # files marked "keep in place"
python -m ocr_router eval --root "C:\Users\<user>\Documents" --sample 200 --llm
```

| Layer | Default location | Purpose | Built by |
|---|---|---|---|
| Feedback log | `data/_feedback/corrections.jsonl` (project-local) | Audit trail of every classify / skip / park / correction | `process`, `feedback bootstrap*` |
| Embedding store | `data/_feedback/examples.sqlite` (project-local) | Vector index of past confirmed decisions | `feedback embed` |
| Eval audit log | `data/_feedback/eval-<ts>.jsonl` (project-local) | Per-file accuracy record from one eval run | `eval` |
All three live inside the project folder (in data/_feedback/, which is
gitignored). They never touch the Documents tree you point --output at —
the Documents tree holds only your filed PDFs.
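To make the embedding store's role concrete, here is a rough sketch of how the `k` most-similar past decisions could be selected for few-shot prompting. It assumes cosine similarity over in-memory vectors; the README does not say which similarity measure or query mechanism the SQLite store uses, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, examples, k=5):
    """Return the labels of the k examples most similar to the query vector.

    `examples` is a list of (label, vector) pairs, standing in for rows
    read from the embedding store.
    """
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[1]), reverse=True)
    return [label for label, _ in ranked[:k]]
```

The default `k=5` matches the `fewshot_k: 5` setting shown in the config example.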
Override locations (when you want them elsewhere):
- Env vars: `OCR_FEEDBACK_DIR`, `OCR_FEEDBACK_LOG`, `OCR_EMBEDDINGS_DB`
- Config keys: `feedback.path`, `feedback.embeddings_db`
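One plausible way these overrides could combine is sketched below for the feedback log. The precedence shown (file-level env var, then directory-level env var, then config key, then default) is an assumption; the README does not state the actual order, and `resolve_feedback_log` is a hypothetical helper.

```python
import os
from pathlib import Path

DEFAULT_LOG = Path("data/_feedback/corrections.jsonl")


def resolve_feedback_log(config, env=os.environ):
    """Pick the feedback log path from env vars, config, or the default."""
    # Assumed precedence: OCR_FEEDBACK_LOG > OCR_FEEDBACK_DIR > feedback.path > default.
    if env.get("OCR_FEEDBACK_LOG"):
        return Path(env["OCR_FEEDBACK_LOG"])
    if env.get("OCR_FEEDBACK_DIR"):
        return Path(env["OCR_FEEDBACK_DIR"]) / DEFAULT_LOG.name
    configured = config.get("feedback", {}).get("path")
    return Path(configured) if configured else DEFAULT_LOG
```

Passing `env` as a parameter keeps the function easy to test without touching the real process environment.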
- Document text never leaves your machine — Ollama runs locally; the codebase has no cloud fallback by design.
- The feedback log stores a configurable text excerpt per record (default 2000 chars). It is gitignored.
- The embedding store contains those same excerpts plus their 768-dim vectors — same privacy posture as the log, same gitignore.
- The sanitize gate (`scripts/sanitize_check.py`) blocks any commit that contains personal names, real Windows user paths, or OneDrive references. The same gate runs in CI.
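The kind of check the sanitize gate performs can be approximated in a few lines. This sketch is not the contents of `scripts/sanitize_check.py`, which this README does not show; the patterns, the `<user>` placeholder exemption, and the `find_violations` name are assumptions for illustration.

```python
import re

# Hypothetical patterns. A path is flagged unless the user segment is the
# literal placeholder "<user>" used throughout this README.
USER_PATH = re.compile(r"[A-Za-z]:\\Users\\(?!<user>)[^\\\s]+")
ONEDRIVE = re.compile(r"onedrive", re.IGNORECASE)


def find_violations(text, names=()):
    """Return a list of human-readable problems found in `text`."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if USER_PATH.search(line):
            problems.append(f"line {lineno}: Windows user path")
        if ONEDRIVE.search(line):
            problems.append(f"line {lineno}: OneDrive reference")
        for name in names:
            if name.lower() in line.lower():
                problems.append(f"line {lineno}: personal name")
    return problems
```

A pre-commit hook or CI job built on this idea would exit non-zero whenever the returned list is non-empty.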
Three independent ways to disable the LLM stack:
```powershell
# Per-run override (keeps config as-is)
python -m ocr_router process ... --no-llm

# Disable in config
#   llm:
#     enabled: false

# Full revert to pre-L4 keyword-only baseline (preserved as an annotated git tag)
git checkout pre-l4-baseline
```

Same pipeline as Section 6, but driven through chat instead of a terminal. The repo ships with a workspace agent definition at `.github/agents/ocr-router.agent.md`.
| Capability | Standalone CLI (`--llm`) | `@OCR Router` agent mode |
|---|---|---|
| Same pipeline, same feedback log, same SQLite store | ✓ | ✓ |
| Local LLM (`llama3.2:3b` via Ollama) | ✓ | ✓ |
| Natural-language confirm replies | ✓ (intent parser) | ✓ (intent parser + chat agent translates the broader request) |
| Multi-step asks (*scan and run an eval after, show me parked*) | ❌ separate commands | ✓ agent stitches them |
| Renders proposal tables as clean Markdown in chat | ❌ terminal box-drawing | ✓ |
| Per-session memory of input/output folders | ❌ | ✓ (asks once, remembers) |
| Honors the routing conventions in the agent playbook (owner-namespaced files, unpaid-statements stay parked, tax forms → Tax Returns) | partial (config only) | ✓ (agent applies the conventions even when keyword/LLM disagree) |
| Works from any chat client (Cursor, Claude Desktop, Windsurf, …) | n/a | only VS Code Copilot today; MCP server is the natural next step |
Use CLI for scripts, cron jobs, terminal-only workflows. Use agent mode for conversational HITL and when you want the agent to apply conventions that don't fit neatly into the YAML config.
- Open this workspace in VS Code with the GitHub Copilot extension installed.
- Restart the chat window once so the agent gets picked up.
- In the chat input mode picker (bottom-left of the chat panel), pick Agent.
- Click the agent dropdown → OCR Router.
You: Process my downloads with LLM
@OCR Router: Which folder should I scan and which folder is your organized documents root?
You: C:\Users\me\Documents\__downloads__ → C:\Users\me\Documents
@OCR Router: [runs `process --llm --dry-run`, posts a Markdown table of proposals
with the Backend column — agree ✓ / LLM ✱ / kw / llm err]
You: park 2 because I haven't paid, the rest go
@OCR Router: Understood: park #2 — "I haven't paid", move the rest.
[runs without --dry-run, applies the selection, writes feedback log
with the note attached, appends Notes block to PROCESSED_PDFS.md]
Moved 3, parked 1, skipped 0. ✓
The agent reuses the same ocr-router CLI under the hood, so every move is
logged to corrections.jsonl and feeds future runs through the embedding store.
Your personalization stays local — the agent never edits the playbook file
or sends data anywhere outside your machine.
Copy the agent file once to your VS Code user profile so @OCR Router works in
any project you open:
```powershell
# VS Code user prompts folder (Windows)
$dst = "$env:APPDATA\Code\User\prompts"
New-Item -ItemType Directory -Path $dst -Force | Out-Null
Copy-Item .github\agents\ocr-router.agent.md $dst\
```

After the copy, the agent is discoverable everywhere, even in workspaces that don't contain this repo. Make sure `ocr-router` is on your PATH (install via pipx as shown in Section 1).
- No silent moves: every run starts with `--dry-run` and waits for your confirmation
- No cloud calls: the pipeline is local-only, by design (no cloud backend exists)
- No code edits: the agent only invokes the CLI, never writes Python
- No bypass of `park`: files you parked stay parked until you `unpark` them
- Never modifies its own playbook (`.github/agents/ocr-router.agent.md`): that file is static. Learnings go to `corrections.jsonl`, `routing-config.local.yaml`, or `PROCESSED_PDFS.md`, all of which are gitignored / your own
- No personal info in the repo: the sanitize gate blocks any commit that contains names or real Windows user paths; same gate runs in CI
| Document type | Format |
|---|---|
| Monthly statement | YYYY.MM - Issuer DocType.pdf |
| Monthly with account | YYYY.MM - Issuer DocType - (Last4 XXXX) - $Amount.pdf |
| Dated transaction / receipt | YYYY.MM.DD - Issuer DocType - $Amount.pdf |
| Paystub | YYYY.MM.DD - Issuer Paycheck - $NetPay.pdf |
| Reference / policy form | YYYY - Issuer DocType.pdf |
Missing metadata fields are omitted instead of using placeholder tokens.
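The naming table and the omit-when-missing rule can be illustrated with a short sketch. `build_filename` is a hypothetical helper, and the two-decimal amount format is an assumption, since the table only shows `$Amount`.

```python
from typing import Optional


def build_filename(issuer: str, doc_type: str, year: int,
                   month: Optional[int] = None, day: Optional[int] = None,
                   last4: Optional[str] = None,
                   amount: Optional[float] = None) -> str:
    """Assemble a filename, leaving out any segment whose data is missing."""
    date_part = f"{year:04d}"
    if month is not None:
        date_part += f".{month:02d}"
        if day is not None:            # a day only makes sense with a month
            date_part += f".{day:02d}"
    parts = [date_part, f"{issuer} {doc_type}"]
    if last4:
        parts.append(f"(Last4 {last4})")
    if amount is not None:
        parts.append(f"${amount:.2f}")  # assumed format: two decimals
    return " - ".join(parts) + ".pdf"
```

For instance, calling it with only a year, issuer, and document type produces the "Reference / policy form" shape, while supplying `last4` and `amount` produces the "Monthly with account" shape, with no placeholder tokens in either case.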
```text
ocr-docs/
  src/ocr_router/
    cli.py
    ocr_engine.py
    extractor.py
    router.py
    folder_resolver.py
    manifest.py
    config.py
    feedback/          # L1-L3: feedback log, bootstrap, embeddings
      log.py
      bootstrap.py
      store.py
    llm/               # L4: local LLM classifier
      schema.py
      backends.py
      prompts.py
      classifier.py
    eval/              # L6: accuracy harness
      runner.py
  config/
    routing-config.yaml         # tracked default template (no PII)
    routing-config.local.yaml   # local-only, gitignored
  scripts/
    dry_run.py
    sanitize_check.py
  tests/
```
Create .env locally (never commit):
```ini
PDF24_PATH=C:\Program Files\PDF24\pdf24-Ocr.exe
OCR_CONFIG_PATH=config/routing-config.local.yaml
LOG_LEVEL=INFO
DEBUG=false
```

- Keep generic templates in git:
  - `config/routing-config.yaml`
  - `.env.example`
- Keep personal files local and ignored:
  - `config/routing-config.local.yaml`
  - `.env`
  - manifests and local logs
- Run the sanitization gate before push:

```powershell
python scripts/sanitize_check.py
```

The same check runs in GitHub Actions on pull requests and pushes.