OCR Router

OCR Router is a Python CLI that processes PDF documents in batches, extracts metadata, classifies document types, and routes files into structured folders.

It is configuration-driven: categories, issuers, naming, and folder routing are defined in YAML so each user can tailor behavior without changing source code.

About (Verbose)

OCR Router is designed for high-volume document organization where PDFs arrive mixed (bank statements, credit cards, bills, tax forms, notices, receipts, paystubs). The pipeline performs OCR when needed, extracts key fields (date, amount, account hints, issuer), classifies by keyword scoring, proposes standardized filenames, and routes files into deterministic folder structures.
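
As a rough illustration of the keyword-scoring step, here is a minimal sketch. The category names, keywords, and weights are hypothetical placeholders; the real rules live in your YAML config, and the actual scoring code may differ.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rules: category -> {keyword: weight}. Real rules come from the YAML config.
RULES: Dict[str, Dict[str, int]] = {
    "bank_statement": {"statement period": 3, "ending balance": 2, "checking": 1},
    "paystub": {"net pay": 3, "earnings": 2, "ytd": 1},
}


def classify(text: str, rules: Dict[str, Dict[str, int]] = RULES) -> Tuple[Optional[str], int]:
    """Return (best_category, score), or (None, 0) when no keyword matches."""
    lowered = text.lower()
    best: Tuple[Optional[str], int] = (None, 0)
    for category, keywords in rules.items():
        # Sum the weights of every keyword that appears in the extracted text.
        score = sum(weight for kw, weight in keywords.items() if kw in lowered)
        if score > best[1]:
            best = (category, score)
    return best
```

Because the score is a plain sum over fixed rules, the same text always yields the same category, which is what makes the default pipeline deterministic.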

The tool emphasizes safety and reviewability:

Interactive review before moves/renames
Dry-run mode for zero-risk previews
CSV/JSONL manifests for traceability
Sanitization checks to prevent accidental personal-data publication

Features

Interactive review before file operations
OCR support for scanned/image-only PDFs via PDF24
JPEG/JPG ingestion with conversion to searchable PDF (OCR applied)
Metadata extraction (date, amount, issuer, account)
Rule-based classification and routing templates
Deterministic filename normalization
Density-aware folder resolver
Manifest output in CSV and JSONL
Privacy guardrails for public repositories
Optional local LLM second-opinion classifier (Ollama, no cloud, no API keys) — see Section 6
Feedback log + embedding store — pipeline learns from your past corrections
Eval mode — measure classifier accuracy against your organized tree

1. Installation

Prerequisites

Python 3.10+
Windows PowerShell
Optional OCR engine: PDF24 Creator (recommended)
Optional local LLM stack: Ollama (recommended — see Section 6)

Install PDF24 (optional but recommended):

Option	Command / Link
Winget	`winget install SoftwareAG.PDF24Creator`
Chocolatey	`choco install pdf24creator`
Download	https://tools.pdf24.org/en/creator

One-line install (recommended for end users)

If you just want to use the tool, install it as a standalone CLI with pipx:

# One-time setup (install pipx)
python -m pip install --user pipx
python -m pipx ensurepath

# Install ocr-router from GitHub (or a local clone)
pipx install git+https://github.com/oscarzamora/ocr-docs.git

# Verify
ocr-router --help

This puts ocr-router on your PATH so you can run it from any terminal in any folder. To upgrade later: pipx upgrade ocr-router. To uninstall cleanly: pipx uninstall ocr-router.

Clone and install (for development / customizing)

git clone https://github.com/oscarzamora/ocr-docs.git
cd ocr-docs

python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -e .[dev]

Create your local private config

# Inside the cloned repo:
Copy-Item config/routing-config.yaml config/routing-config.local.yaml
# Or anywhere outside the repo if installed via pipx:
ocr-router --help    # shows OCR_CONFIG_PATH env var usage

Edit your local config with your own categories, issuers, and route templates.

2. How To Use

Main command (interactive mode):

python -m ocr_router.cli process `
  --input  "C:\path\to\input" `
  --output "C:\path\to\output" `
  --config config/routing-config.local.yaml

Interactive flow (PDF + JPEG inputs):

Analyze all input files
Convert JPEG/JPG files to searchable PDFs through OCR preprocessing
Show proposal table (category, issuer, filename, route)
Ask move vs rename-in-place
Ask file selection (all, include-list, skip-list)
Execute and write history/manifest
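
The final step, writing a manifest, could look roughly like the sketch below. The column names and the `manifest.csv` filename are assumptions for illustration (only `manifest.jsonl` appears in this README); the project's manifest.py may use a different schema.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical manifest columns, chosen for illustration only.
FIELDS = ["source", "destination", "category", "issuer", "action"]


def write_manifests(records: List[Dict[str, str]], out_dir: Path) -> None:
    """Write the same records as manifest.jsonl (one object per line) and manifest.csv."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "manifest.jsonl", "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    with open(out_dir / "manifest.csv", "w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

JSONL is convenient for appending and for line-by-line tooling, while CSV opens directly in a spreadsheet for manual review.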

3. Two Ways to Use the LLM (TL;DR)

The pipeline is deterministic by default (keyword scoring). The local LLM (llama3.2:3b via Ollama) is an opt-in second opinion. There are two distinct ways to use it:

Mode	What you run	Where the LLM helps	When to pick this
(A) Standalone CLI with `--llm`	`ocr-router process ... --llm` in any terminal	Classifies each doc, parses your free-form replies at the confirm prompt	You prefer the terminal or need a single command for scripts and cron jobs
(B) `@OCR Router` agent in VS Code Chat	Type `@OCR Router process ...` in Copilot chat	Same as above, plus the agent translates natural-language goals into the right CLI commands and confirms each step in chat	You want conversational human-in-the-loop (HITL) review, you're already in VS Code, or you have multi-step asks

Both modes share the exact same pipeline, same feedback log, same embedding store. Mode B is a thin chat wrapper over Mode A — nothing magical, just a friendlier surface.

Sections 6 (standalone) and 7 (agent mode) below give the full setup and examples for each.

4. Git: About, Tags, and Release Flow

Verbose About text (for GitHub repo "About")

Use this text in the GitHub repository's description (About) field:

OCR Router is a privacy-aware Python CLI for OCR, metadata extraction, deterministic document classification, smart naming, and folder routing with review-first workflows and manifest traceability.

Version tags (annotated)

git tag -a v0.2.0 -m "v0.2.0: README overhaul, sanitize gate, release documentation"
git push origin v0.2.0

List tags:

git tag --list

Show tag details:

git show v0.2.0

5. Run Examples

Example A: Interactive processing

python -m ocr_router.cli process `
  --input  "C:\Docs\Incoming" `
  --output "C:\Docs\Sorted" `
  --config config/routing-config.local.yaml

Example B: Dry-run (no file changes)

python -m ocr_router.cli process `
  --input "C:\Docs\Incoming" `
  --output "C:\Docs\Sorted" `
  --config config/routing-config.local.yaml `
  --dry-run
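
Conceptually, a dry run builds the same move plan as a real run but stops short of touching the filesystem. The sketch below shows that pattern; `apply_moves` is a hypothetical helper, not the project's actual function.

```python
import shutil
from pathlib import Path
from typing import List, Tuple


def apply_moves(plan: List[Tuple[Path, Path]], dry_run: bool = True) -> List[str]:
    """Return a log of the plan; touch the filesystem only when dry_run is False."""
    log = []
    for src, dst in plan:
        log.append(f"{'WOULD MOVE' if dry_run else 'MOVED'}: {src} -> {dst}")
        if not dry_run:
            # Create the destination folder on demand, then move the file.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
    return log
```

Defaulting `dry_run` to `True` in a helper like this means a forgotten argument previews rather than moves.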

Example C: Batch mode (non-interactive)

python -m ocr_router.cli process `
  --input "C:\Docs\Incoming" `
  --output "C:\Docs\Sorted" `
  --config config/routing-config.local.yaml `
  --no-interactive

Example D: Review existing manifest

python -m ocr_router.cli review --manifest "C:\Docs\Sorted\manifest.jsonl"

Example E: Validate privacy before push

python scripts/sanitize_check.py

Example F: Process with LLM second opinion (after Section 6 setup)

python -m ocr_router process `
  --input  "C:\Users\<user>\Documents\__downloads__" `
  --output "C:\Users\<user>\Documents" `
  --config config/routing-config.local.yaml `
  --llm

Example G: Measure accuracy (eval mode, read-only)

python -m ocr_router eval `
  --root "C:\Users\<user>\Documents" `
  --sample 200 --llm
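
To make "accuracy" concrete: one plausible reading is that the category implied by a file's existing folder is the expected label and the classifier's output is the prediction. The sketch below computes a summary under that assumption; the report fields are hypothetical and need not match what `eval` prints.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def accuracy_report(pairs: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """pairs holds (expected_category, predicted_category) for each sampled file."""
    total = 0
    correct = 0
    misses: Counter = Counter()
    for expected, predicted in pairs:
        total += 1
        if expected == predicted:
            correct += 1
        else:
            misses[(expected, predicted)] += 1
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,
        # The most frequent (expected, predicted) mismatches show where rules need work.
        "top_confusions": misses.most_common(3),
    }
```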

6. Local LLM + Feedback Loop (Optional)

OCR Router ships with an opt-in local-first LLM stack that runs entirely on your machine. No cloud, no API keys, no document data leaves your computer.

With it enabled the pipeline:

Runs the keyword router (default behavior — fully deterministic).
Asks llama3.2:3b via Ollama for a second opinion, with the k most-similar past confirmed decisions injected as few-shot exemplars.
Applies a simple decision rule: agreement → confident, disagreement → flag for human-in-the-loop (HITL) review, low LLM confidence → keep the keyword result and show a hint.
At the confirm prompt, parses your free-form English ("skip 2 because I haven't paid") into structured actions (park_some [2], note "haven't paid") with a transparent Understood: … recap.
Logs every decision (and every correction you make) to a JSONL feedback log so the classifier learns from your taxonomy over time.
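
The decision rule in the list above can be sketched as a small function. The status names and the order of the checks (agreement first, then the confidence threshold) are assumptions; the default threshold of 0.6 mirrors the `confidence_threshold` value shown in the setup below.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    category: str
    status: str  # "confident", "keyword_with_hint", or "needs_review"
    hint: str = ""


def combine(keyword_cat: str, llm_cat: str, llm_conf: float, threshold: float = 0.6) -> Decision:
    """Merge the keyword result with the LLM's second opinion."""
    if llm_cat == keyword_cat:
        return Decision(keyword_cat, "confident")
    hint = f"LLM suggested {llm_cat} ({llm_conf:.2f})"
    if llm_conf < threshold:
        # Low-confidence disagreement: trust the keyword router, surface the hint.
        return Decision(keyword_cat, "keyword_with_hint", hint)
    # Confident disagreement: leave the call to the human reviewer.
    return Decision(keyword_cat, "needs_review", hint)
```

In every branch the keyword category remains the proposal, so turning the LLM on can add review flags and hints without silently changing where a file goes.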

One-time setup

# 1. Install Ollama (https://ollama.com), then pull the two models
ollama pull llama3.2:3b           # ~2 GB — chat model
ollama pull nomic-embed-text       # ~270 MB — embeddings for few-shot

# 2. Enable LLM in your local config
#    Add to config/routing-config.local.yaml:
#       llm:
#         enabled: true
#         confidence_threshold: 0.6
#         fewshot_k: 5

# 3. (One time) Bootstrap the feedback log from your existing organized tree
python -m ocr_router feedback bootstrap-tree --root "C:\Users\<user>\Documents"

# 4. (One time) Embed all bootstrapped records into the local SQLite vector store
python -m ocr_router feedback embed

# 5. Verify the stack is healthy
python -m ocr_router llm doctor
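
If you want to check the Ollama side by hand, the sketch below queries Ollama's local `/api/tags` endpoint (default port 11434) and reports which of the two models are missing. It is an independent illustration and not a description of what `llm doctor` does internally; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, List

REQUIRED = ("llama3.2:3b", "nomic-embed-text")


def missing_models(tags_payload: dict, required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required models that are absent from an Ollama /api/tags payload."""
    installed = {m.get("name", "") for m in tags_payload.get("models", [])}
    # Ollama lists names with a tag (e.g. "nomic-embed-text:latest"), so accept a prefix match.
    return [r for r in required
            if not any(name == r or name.startswith(r + ":") for name in installed)]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags())` returns an empty list when both models are pulled; a connection error from `fetch_tags` usually means the Ollama service is not running.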

Daily workflow (standalone CLI)

# Always dry-run first — see what would happen, nothing moves yet
python -m ocr_router process `
  --input  "C:\Users\<user>\Documents\__downloads__" `
  --output "C:\Users\<user>\Documents" `
  --config config\routing-config.local.yaml `
  --llm --dry-run

# When the proposal table looks right, run for real (no --dry-run)
python -m ocr_router process `
  --input  "C:\Users\<user>\Documents\__downloads__" `
  --output "C:\Users\<user>\Documents" `
  --config config\routing-config.local.yaml `
  --llm

At the confirm prompt you can type either deterministic syntax OR natural language:

# Deterministic (always works, no LLM required):
Enter             move ALL files
1,3,5             move ONLY those numbers
skip 2,4          move all EXCEPT those numbers
park 7            keep those files in place permanently (never re-propose)
park 7 note: <r>  same as park, capture the reason verbatim
q                 quit without moving anything

# Natural language (requires --llm; uses LLM to parse intent):
skip 2 because I haven't paid yet              → park 2 + note (per unpaid convention)
park the FPL one, it's a duplicate             → asks for the file number if ambiguous
move 1 3 5, the others are for Luciana         → moves 1,3,5; skipped get rule prompt
4 is actually FPL not AT&T                     → adds issuer rule to local YAML
nevermind / cancel                             → quit
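
The deterministic syntax is small enough to parse without an LLM. The sketch below is one way to do it; the action names (`move_all`, `move_only`, and so on) and the tuple shape are hypothetical, and the project's own parser may accept more forms.

```python
import re
from typing import List, Optional, Tuple

Action = Tuple[str, List[int], Optional[str]]  # (verb, file numbers, note)


def _numbers(text: str) -> List[int]:
    return [int(n) for n in re.findall(r"\d+", text)]


def parse_reply(reply: str) -> Action:
    """Parse the deterministic confirm-prompt syntax; raise ValueError on anything else."""
    text = reply.strip()
    if text == "":
        return ("move_all", [], None)  # plain Enter
    if text.lower() == "q":
        return ("quit", [], None)
    match = re.fullmatch(r"(skip|park)\s+([\d,\s]+?)(?:\s+note:\s*(.*))?", text, flags=re.IGNORECASE)
    if match:
        verb, nums, note = match.group(1).lower(), _numbers(match.group(2)), match.group(3)
        # The documented syntax attaches notes to park only.
        if nums and (verb == "park" or note is None):
            return (verb, nums, note)
    if re.fullmatch(r"[\d,\s]+", text) and _numbers(text):
        return ("move_only", _numbers(text), None)
    raise ValueError(f"unrecognized reply: {reply!r}")
```

Anything that raises `ValueError` here is the kind of reply that would need the natural-language path, which is why that path requires `--llm`.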

Inspect what the pipeline has learned

python -m ocr_router feedback stats                # counts by event/category/backend
python -m ocr_router feedback show --limit 20      # most recent records (with Note column)
python -m ocr_router feedback search "AMEX credit card statement"
python -m ocr_router feedback parked list          # files marked "keep in place"
python -m ocr_router eval --root "C:\Users\<user>\Documents" --sample 200 --llm

How the data layers fit together

Layer	Default location	Purpose	Built by
Feedback log	`data/_feedback/corrections.jsonl` (project-local)	Audit trail of every classify / skip / park / correction	`process`, `feedback bootstrap*`
Embedding store	`data/_feedback/examples.sqlite` (project-local)	Vector index of past confirmed decisions	`feedback embed`
Eval audit log	`data/_feedback/eval-<ts>.jsonl` (project-local)	Per-file accuracy record from one eval run	`eval`

All three live inside the project folder (in data/_feedback/, which is gitignored). They never touch the Documents tree you point --output at — the Documents tree holds only your filed PDFs.
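
To show how the embedding store feeds few-shot examples, here is a minimal top-k retrieval sketch using cosine similarity over in-memory vectors. Cosine similarity and the in-memory list are assumptions made for illustration; the real store is SQLite-backed and its similarity metric is not stated in this README.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[Tuple[str, float]]:
    """Return the k stored labels whose vectors are most similar to the query."""
    scored = [(label, cosine(query, vec)) for label, vec in store]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

The default `k=5` matches the `fewshot_k: 5` setting from the one-time setup.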

Override locations (when you want them elsewhere):

Env vars: OCR_FEEDBACK_DIR, OCR_FEEDBACK_LOG, OCR_EMBEDDINGS_DB
Config keys: feedback.path, feedback.embeddings_db
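
A typical way to combine these overrides is "environment variable first, then config key, then default". The sketch below assumes that precedence and assumes `feedback.path` refers to the feedback log; neither is confirmed by this README, so check the behavior before relying on it.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_LOG = Path("data/_feedback/corrections.jsonl")


def resolve_feedback_log(config: Mapping[str, Mapping[str, str]],
                         env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the feedback log path: env var, then config key, then the default."""
    env = os.environ if env is None else env
    if env.get("OCR_FEEDBACK_LOG"):
        return Path(env["OCR_FEEDBACK_LOG"])
    configured = config.get("feedback", {}).get("path")
    return Path(configured) if configured else DEFAULT_LOG
```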

Privacy

Document text never leaves your machine — Ollama runs locally; the codebase has no cloud fallback by design.
The feedback log stores a configurable text excerpt per record (default 2000 chars). It is gitignored.
The embedding store contains those same excerpts plus their 768-dim vectors — same privacy posture as the log, same gitignore.
The sanitize gate (scripts/sanitize_check.py) blocks any commit that contains personal names, real Windows user paths, or OneDrive references. The same gate runs in CI.
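
The excerpt limit is the main privacy knob in the feedback log. As a minimal sketch, a record could be appended like this; the field names (`ts`, `event`, `category`, `excerpt`) are hypothetical, and only the 2000-character default comes from the text above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_feedback(log_path: Path, event: str, category: str, text: str,
                    max_chars: int = 2000) -> dict:
    """Append one JSONL record, storing only the first max_chars of the document text."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "category": category,
        "excerpt": text[:max_chars],
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Lowering `max_chars` reduces how much document text sits on disk, at the cost of giving the few-shot examples less context.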

Rollback

Three independent ways to disable the LLM stack:

# Per-run override (keeps config as-is)
python -m ocr_router process ... --no-llm

# Disable in config
#   llm:
#     enabled: false

# Full revert to the keyword-only baseline from before the LLM layer (L4) was added (preserved as an annotated git tag)
git checkout pre-l4-baseline

7. Agent Mode (`@OCR Router` in VS Code Copilot Chat)

Same pipeline as Section 6, but driven through chat instead of a terminal. The repo ships with a workspace agent definition at .github/agents/ocr-router.agent.md.

Why agent mode (vs the standalone CLI)

Capability	Standalone CLI (`--llm`)	`@OCR Router` agent mode
Same pipeline, same feedback log, same SQLite store	✓	✓
Local LLM (`llama3.2:3b` via Ollama)	✓	✓
Natural-language confirm replies	✓ (intent parser)	✓ (intent parser + chat agent translates the broader request)
Multi-step asks (`scan and run an eval after, show me parked`)	❌ separate commands	✓ agent stitches them
Renders proposal tables as clean Markdown in chat	❌ terminal box-drawing	✓
Per-session memory of input/output folders	❌	✓ (asks once, remembers)
Honors the routing conventions in the agent playbook (owner-namespaced files, unpaid-statements stay parked, tax forms → Tax Returns)	partial (config only)	✓ (agent applies the conventions even when keyword/LLM disagree)
Works from any chat client (Cursor, Claude Desktop, Windsurf, …)	n/a	only VS Code Copilot today; MCP server is the natural next step

Use the CLI for scripts, cron jobs, and terminal-only workflows. Use agent mode for conversational human-in-the-loop (HITL) review and when you want the agent to apply conventions that don't fit neatly into the YAML config.

Setup (assumes Section 6 is already done)

Open this workspace in VS Code with the GitHub Copilot extension installed.
Restart the chat window once so the agent gets picked up.
In the chat input mode picker (bottom-left of the chat panel), pick Agent.
Click the agent dropdown → OCR Router.

Use it

You:         Process my downloads with LLM

@OCR Router: Which folder should I scan and which folder is your organized documents root?

You:         C:\Users\me\Documents\__downloads__ → C:\Users\me\Documents

@OCR Router: [runs `process --llm --dry-run`, posts a Markdown table of proposals
              with the Backend column — agree ✓ / LLM ✱ / kw / llm err]

You:         park 2 because I haven't paid, the rest go

@OCR Router: Understood: park #2 — "I haven't paid", move the rest.
             [runs without --dry-run, applies the selection, writes feedback log
              with the note attached, appends Notes block to PROCESSED_PDFS.md]
             Moved 3, parked 1, skipped 0. ✓

The agent reuses the same ocr-router CLI under the hood, so every move is logged to corrections.jsonl and feeds future runs through the embedding store. Your personalization stays local — the agent never edits the playbook file or sends data anywhere outside your machine.

Make it available in every workspace

Copy the agent file once to your VS Code user profile so @OCR Router works in any project you open:

# VS Code user prompts folder (Windows)
$dst = "$env:APPDATA\Code\User\prompts"
New-Item -ItemType Directory -Path $dst -Force | Out-Null
Copy-Item .github\agents\ocr-router.agent.md $dst\

After the copy, the agent is discoverable everywhere — even workspaces that don't contain this repo. Make sure ocr-router is on your PATH (install via pipx as shown in Section 1).

What the agent does NOT do

No silent moves — every run starts with --dry-run and waits for your confirmation
No cloud calls — the pipeline is local-only, by design (no cloud backend exists)
No code edits — the agent only invokes the CLI, never writes Python
No bypass of park — files you parked stay parked until you unpark them
No playbook edits: the agent never modifies its own playbook (.github/agents/ocr-router.agent.md), which is static. Learnings go to corrections.jsonl, routing-config.local.yaml, or PROCESSED_PDFS.md, all of which are either gitignored or your own local files
No personal info in the repo — the sanitize gate blocks any commit that contains names or real Windows user paths; same gate runs in CI

Naming Convention

Document type	Format
Monthly statement	`YYYY.MM - Issuer DocType.pdf`
Monthly with account	`YYYY.MM - Issuer DocType - (Last4 XXXX) - $Amount.pdf`
Dated transaction / receipt	`YYYY.MM.DD - Issuer DocType - $Amount.pdf`
Paystub	`YYYY.MM.DD - Issuer Paycheck - $NetPay.pdf`
Reference / policy form	`YYYY - Issuer DocType.pdf`

Missing metadata fields are omitted instead of using placeholder tokens.
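
The "omit missing fields" rule can be sketched as a small builder that joins whatever parts are available. The function name and the amount formatting (two decimals, no thousands separator) are assumptions for illustration.

```python
from typing import Optional


def build_filename(date: str, issuer: str, doc_type: str,
                   last4: Optional[str] = None, amount: Optional[float] = None) -> str:
    """Join the available parts with ' - ', skipping any field that is missing."""
    parts = [date, f"{issuer} {doc_type}".strip()]
    if last4:
        parts.append(f"(Last4 {last4})")
    if amount is not None:
        parts.append(f"${amount:.2f}")
    return " - ".join(p for p in parts if p) + ".pdf"
```

For example, `build_filename("2024.03", "Chase", "Statement")` yields `2024.03 - Chase Statement.pdf`, with no empty placeholder where the account and amount would go.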

Project Structure

ocr-docs/
  src/ocr_router/
    cli.py
    ocr_engine.py
    extractor.py
    router.py
    folder_resolver.py
    manifest.py
    config.py
    feedback/                  # L1-L3: feedback log, bootstrap, embeddings
      log.py
      bootstrap.py
      store.py
    llm/                       # L4: local LLM classifier
      schema.py
      backends.py
      prompts.py
      classifier.py
    eval/                      # L6: accuracy harness
      runner.py
  config/
    routing-config.yaml        # tracked default template (no PII)
    routing-config.local.yaml  # local-only, gitignored
  scripts/
    dry_run.py
    sanitize_check.py
  tests/

Environment Variables

Create .env locally (never commit):

PDF24_PATH=C:\Program Files\PDF24\pdf24-Ocr.exe
OCR_CONFIG_PATH=config/routing-config.local.yaml
LOG_LEVEL=INFO
DEBUG=false

Public vs Private Data

Keep generic templates in git:
- config/routing-config.yaml
- .env.example

Keep personal files local and ignored:
- config/routing-config.local.yaml
- .env
- manifests and local logs

Run the sanitization gate before push:

python scripts/sanitize_check.py

The same check runs in GitHub Actions on pull requests and pushes.
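
For a sense of what such a gate looks for, here is a minimal sketch that flags real-looking Windows user paths and OneDrive references. The patterns and the allow-list of placeholder user names are hypothetical; the actual rules in scripts/sanitize_check.py (including its personal-name checks) are not shown in this README.

```python
import re
from typing import List

# Placeholder user names that are safe to publish (hypothetical allow-list).
ALLOWED_USERS = {"<user>", "me", "public"}
USER_PATH = re.compile(r"[A-Za-z]:\\Users\\([^\\\s]+)")
ONEDRIVE = re.compile(r"onedrive", re.IGNORECASE)


def find_violations(text: str) -> List[str]:
    """Return one message per suspicious line fragment; an empty list means clean."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in USER_PATH.finditer(line):
            if match.group(1).lower() not in ALLOWED_USERS:
                problems.append(f"line {lineno}: real-looking user path {match.group(0)!r}")
        if ONEDRIVE.search(line):
            problems.append(f"line {lineno}: OneDrive reference")
    return problems
```

A gate built this way would exit non-zero whenever the list is non-empty, which is what lets it block a commit or fail a CI job.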
