
📜 Breton-French OCR Pipeline

🌐 Showcase site — interactive overview of the project, corpus, and metrics.

A multi-stage pipeline for extracting bilingual Breton-French parallel corpora from scanned historical books (1860s–1940s). Produces JSONL files of aligned {"breton": "...", "français": "..."} pairs suitable for training translation models, building dictionaries, or linguistic research.
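
Each line of an output file is one JSON object with exactly these two keys. A minimal sketch of consuming the format (the `load_pairs` helper is illustrative, not part of the pipeline):

```python
import json

def load_pairs(jsonl_text):
    """Parse JSONL corpus text into a list of {breton, français} dicts."""
    pairs = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        pairs.append(json.loads(line))
    return pairs

sample = '{"breton": "demat", "français": "bonjour"}\n{"breton": "mor", "français": "mer"}'
pairs = load_pairs(sample)
```

Any JSONL-aware tool (pandas, Hugging Face `datasets`, jq) can read the files the same way.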

Corpus

The corpus spans 800+ pages across 9 historical Breton-language books:

| Book | Period | Type | Pages | Description |
|---|---|---|---|---|
| toullec_lexique_1865 | 1865 | Lexicon | 87 | Bilingual French-Breton vocabulary by theme, with a parallel preface |
| colloque_lourec_1884 | 1884 | Phrasebook | 72 | 4-column vocabulary lists by profession + conversational dialogues |
| normant_lexique_1902 | 1902 | Dictionary | 71 | Breton→French dictionary with conjugation tables (KAOUT, BEZA) |
| le_gonidec_vocabulaire_1919 | 1919 | Vocabulary | 313 | French-Breton vocabulary |
| geriadur_lexique_1927 | 1927 | Medical lexicon | 22 | French→Breton anatomical/medical terminology with sub-entries |
| roparz_cours_elementaire_1930 | 1930 | Course | 31 | Elementary Breton course with vocabulary, dialogues (DIVIZ), and exercises |
| bozec_methode_1933 | 1933 | Method | 78 | Breton method with facing-page bilingual lessons and illustrations |
| yez_hon_tadou_1940 | 1940 | Course | 96 | Breton course with GERIADUR word lists, word families, and conjugation |
| daniel_ker_vreiz_1944 | 1944 | Course | 37 | Breton course with vocabulary, grammar examples, and verb tables |

Droplist

Not every page contains bilingual content — covers, blank pages, tables of contents, appendices, and illustration-only pages have no translation pairs to extract. Each book can have a droplist (droplist/<book>/drop_pages.json) to exclude these pages from the OCR stage, reducing API costs and processing time.

python pipeline.py ignore pages_enhanced/my_book/05.png              # Ignore one page
python pipeline.py ignore pages_enhanced/my_book/01.png pages/my_book/02.png  # Multiple pages
python pipeline.py ignore pages/my_book/84.png                       # Works with pages/ too

Pages already in the droplist are skipped (idempotent). The JSON file is created automatically if it doesn't exist.
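
The idempotent update can be sketched as follows (the `ignore_pages` helper is hypothetical; the real logic lives in pipeline.py and may differ):

```python
import json
from pathlib import Path

def ignore_pages(droplist_dir, book, page_numbers):
    """Add page numbers to droplist/<book>/drop_pages.json, skipping duplicates.

    Creates the JSON file (and parent folders) if missing; re-running with
    the same pages is a no-op, matching the documented idempotent behaviour.
    """
    path = Path(droplist_dir) / book / "drop_pages.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    existing = json.loads(path.read_text()) if path.exists() else []
    merged = sorted(set(existing) | set(page_numbers))
    path.write_text(json.dumps(merged))
    return merged
```

Storing a sorted, deduplicated array keeps the file diff-friendly under version control.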


Setup

Requirements

  • Python 3.11+
  • NVIDIA GPU with CUDA support (for DocRes enhancement; not required for OCR)
  • API keys: OPENAI_API_KEY, ANTHROPIC_API_KEY, and/or GEMINI_API_KEY for the OCR stage

Installation

Base setup (venv + core Python deps + API key check):

./setup.sh
source .venv/bin/activate

With image enhancement tools (PyTorch+CUDA, DocRes, ResShift — requires NVIDIA GPU):

./setup.sh --with-enhance
source .venv/bin/activate

Pipeline Overview

pdfs/ → extract → pages/ → enhance → pages_enhanced/ → ocr → ocr/<book>/<model>/ → review → review/<book>/ → [human correction] → corpus → corpus/<book>.jsonl
| # | Stage | Input | Output | Script | Description |
|---|---|---|---|---|---|
| 1 | extract | pdfs/ | pages/<book>/ | src/extract.py | Render PDF pages as 300 DPI PNGs |
| 2 | enhance | pages/<book>/ | pages_enhanced/<book>/ | src/enhance.py | Copy pages (no-op by default) or apply opt-in enhancements |
| 3 | ocr | pages_enhanced/<book>/ | ocr/<book>/<model>/<run>/ | src/ocr/ | VLM-based bilingual text extraction |
| 4 | review | ocr/<book>/<model>/<run>/extracted/ | review/<book>/ + reports/<book>/ | src/review.py | Copy JSONL to review folder + quality assurance |
| 5 | diff | ocr/, review/ | stdout | src/diff.py | Compare two JSONL directories/files to see human corrections |
| 6 | evaluate | error_rates/<book>/ | stdout | src/evaluate.py | Compute WER & CER against human reference |
| 7 | corpus | review/<book>/ | corpus/<book>.jsonl | src/corpus.py | Deduplicate and merge reviewed JSONL into one file per book |

Pipeline Usage

# Full pipeline — all PDFs (classical enhancement only)
python pipeline.py run

# Full pipeline — one specific PDF
python pipeline.py run pdfs/my_book.pdf

# Full pipeline — with AI enhancement (requires --with-enhance setup)
python pipeline.py run --docres --prepocr

# Individual stages
python pipeline.py extract
python pipeline.py enhance
python pipeline.py ocr

Extract

Extract Overview

Renders each page of a PDF as a 300 DPI PNG image using PyMuPDF. Each PDF's pages are saved to pages/<pdf_stem>/.
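
The core of the stage can be sketched with PyMuPDF's standard API (the `extract_pdf` helper below is illustrative; src/extract.py may name things differently):

```python
from pathlib import Path

def zoom_for_dpi(dpi):
    """PDF user space is 72 dpi, so the render matrix scales by dpi/72."""
    return dpi / 72

def extract_pdf(pdf_path, out_root="pages", dpi=300):
    """Render each page of a PDF to <out_root>/<pdf_stem>/NN.png."""
    import fitz  # PyMuPDF

    pdf_path = Path(pdf_path)
    out_dir = Path(out_root) / pdf_path.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc, start=1):
            pix = page.get_pixmap(matrix=matrix)
            pix.save(out_dir / f"{i:02d}.png")
```

At 300 DPI the zoom factor is 300/72 ≈ 4.17, which is why the PNGs are much larger than the PDF's nominal page size.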

Extract Usage

python pipeline.py extract                           # All PDFs in pdfs/
python pipeline.py extract pdfs/my_book.pdf           # One specific PDF
python pipeline.py extract pdfs/a.pdf pdfs/b.pdf      # Multiple PDFs

# Direct script usage with extra options
python src/extract.py --dpi 400 pdfs/my_book.pdf

Enhance

Enhance Overview

By default, enhance is a no-op: pages are copied from pages/ to pages_enhanced/ without any processing. All enhancements are explicit opt-in flags. Three reasons:

  • GPU required for AI models (DocRes, PreP-OCR) — not available in all environments
  • Heavy dependencies — PyTorch, model weights (~1 GB) only installed with ./setup.sh --with-enhance
  • Frontier VLMs work better on originals — modern models like gpt-5.2 handle degraded scans well enough that preprocessing can hurt more than it helps

Important

--docres and --prepocr require an NVIDIA GPU and heavy dependencies. Run ./setup.sh --with-enhance before using them. --classical works with the base install.

Available flags (applied in this order):

| Flag | What it does | Requires |
|---|---|---|
| --classical | Grayscale + CLAHE contrast equalization | Base install |
| --docres | DocRes AI: deshadowing → deblurring → appearance (CVPR 2024) | --with-enhance + GPU |
| --prepocr | PreP-OCR ResShift diffusion deblurring, 256×256 tiles | --with-enhance + GPU |
| --denoise | Bilateral denoising (smooths paper grain) | Base install |
| --binarize | Adaptive Gaussian binarization (B&W output) | Base install |
| --upscale | 2× Lanczos upscale | Base install |
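
The `--classical` step corresponds to a standard OpenCV recipe; a minimal sketch (the `classical_enhance` helper is hypothetical, and src/enhance.py may use different CLAHE parameters):

```python
import numpy as np

def to_grayscale(rgb):
    """Luma-weighted grayscale conversion (the same 0.299/0.587/0.114 weights
    OpenCV uses), for environments without cv2 installed."""
    return (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]).astype(np.uint8)

def classical_enhance(img_bgr):
    """Grayscale + CLAHE contrast equalization, mirroring --classical.

    Requires OpenCV; clipLimit and tileGridSize here are common defaults,
    not necessarily what the pipeline uses.
    """
    import cv2
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)
```

CLAHE equalizes contrast per tile rather than globally, which helps with the uneven lighting typical of book scans.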

Enhance Usage

python pipeline.py enhance                                        # No-op: plain copy to pages_enhanced/
python pipeline.py enhance --classical                            # Grayscale + CLAHE
python pipeline.py enhance --docres                               # DocRes AI only (requires --with-enhance)
python pipeline.py enhance --docres --classical                   # DocRes + CLAHE
python pipeline.py enhance --prepocr --classical                  # PreP-OCR + CLAHE
python pipeline.py enhance --docres --prepocr --classical         # Full: DocRes + PreP-OCR + CLAHE
python pipeline.py enhance --docres --docres-tasks appearance     # Only one DocRes task
python pipeline.py enhance my_book                                # One book
python pipeline.py enhance pages/my_book/05.png                   # Single image

# Direct script usage
python src/enhance.py --classical --binarize --upscale

Compare

Compare Overview

Generates a comparison matrix of all permutations of the three enhancement stages (DocRes, PreP-OCR, Classical) for a single page. Useful for evaluating which ordering produces the best results for OCR.

Outputs 19 images to compare/<book>/<page>/:

  • original.png — unmodified input
  • 3 individual DocRes sub-steps (docres_deshadowing, docres_deblurring, docres_appearance)
  • 3 individual steps (docres_pipeline, prepocr, classical)
  • 6 two-step permutations (e.g. docres_pipeline-prepocr, classical-docres_pipeline)
  • 6 three-step permutations (e.g. docres_pipeline-prepocr-classical)

Two-step and three-step outputs reuse cached intermediate results to avoid redundant model passes.
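
The matrix of orderings is just the permutations of the three stages taken 1, 2, and 3 at a time; a sketch (the `build_comparison_plan` helper is illustrative):

```python
from itertools import permutations

STEPS = ["docres_pipeline", "prepocr", "classical"]

def build_comparison_plan(steps=STEPS):
    """Every 1-, 2-, and 3-step ordering of the enhancement stages.

    Because a 3-step ordering extends a 2-step one, caching results keyed
    by the prefix tuple avoids re-running the earlier model passes.
    """
    plan = []
    for r in range(1, len(steps) + 1):
        plan.extend(permutations(steps, r))
    return plan

plan = build_comparison_plan()
```

That yields 3 + 6 + 6 = 15 combined outputs, which together with original.png and the 3 DocRes sub-steps accounts for the 19 images listed above.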

Compare Usage

python pipeline.py compare pages/my_book/17.png

OCR

OCR Overview

Sends each page image to a Vision Language Model (VLM) and parses structured bilingual output. Uses a two-layer prompt system:

  1. Base prompt (prompts/extract_bilingual_corpus.md) — defines the general extraction workflow, JSONL output format, quality rules, and exclusion criteria.
  2. Book-specific prompt (prompts/<book_name>.md) — appended automatically based on folder name. Contains page layout descriptions, extraction rules, examples, and edge cases.

The model returns structured JSONL pairs and a quality report (status, score, remarks) for each page. Processing is resumable — pages with existing .jsonl files are skipped.
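
Both behaviours can be sketched in a few lines (these helpers are illustrative; src/ocr/ may structure this differently):

```python
from pathlib import Path

def assemble_prompt(prompts_dir, book, base_name="extract_bilingual_corpus.md"):
    """Two-layer prompt: base prompt plus an optional book-specific prompt,
    selected purely by the book's folder name."""
    prompts_dir = Path(prompts_dir)
    prompt = (prompts_dir / base_name).read_text()
    book_prompt = prompts_dir / f"{book}.md"
    if book_prompt.exists():
        prompt += "\n\n" + book_prompt.read_text()
    return prompt

def pages_to_process(pages_dir, out_dir):
    """Resumability: skip any page whose .jsonl output already exists."""
    out_dir = Path(out_dir)
    return [p for p in sorted(Path(pages_dir).glob("*.png"))
            if not (out_dir / f"{p.stem}.jsonl").exists()]
```

Because the book prompt is keyed on the folder name, adding a new book only requires dropping a matching .md file into prompts/.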

Supported providers:

| Provider | Models | API key env var |
|---|---|---|
| OpenAI | gpt-5.2 (default), gpt-4.1-mini, o3, etc. | OPENAI_API_KEY |
| Anthropic | claude-sonnet-4-6, claude-haiku-4.5, claude-opus-4, etc. | ANTHROPIC_API_KEY |
| Google | gemini-3.1-pro, gemini-3-flash, gemini-2.5-pro, etc. | GEMINI_API_KEY |

OCR Usage

python pipeline.py ocr                                  # All books, default model
python pipeline.py ocr my_book                          # One book
python pipeline.py ocr --model gpt-4.1-mini             # Cheaper OpenAI model
python pipeline.py ocr --model claude-sonnet-4-6        # Anthropic
python pipeline.py ocr --model gemini-3.1-pro           # Google Gemini
python pipeline.py ocr pages/my_book/05.png             # Single image
python pipeline.py ocr --debug pages/my_book/05.png     # Show full prompts & response
python pipeline.py ocr --limit 5 my_book                # Random sample of 5 pages
python pipeline.py ocr --book-prompt prompts/my_book-next.md my_book   # Test a new book prompt
python pipeline.py ocr --main-prompt prompts/extract_bilingual_corpus-next.md  # Test a new base prompt
python pipeline.py ocr --thinking high my_book              # Deep reasoning (Gemini only)

Cost Estimation

| Model | Provider | Avg cost/page | Avg time/page | Accuracy | Full run (570 pages) |
|---|---|---|---|---|---|
| gpt-4.1-mini | OpenAI | ~$0.004 | ~16s | Good; occasional column misalignment and OCR typos | ~$2.30 / ~2.5h |
| gemini-3.1-pro | Google | ~$0.010 | Not yet benchmarked | Not yet benchmarked | ~$5.70 |
| claude-sonnet-4-6 | Anthropic | ~$0.015 | Not yet benchmarked | Not yet benchmarked | ~$8.50 |
| gpt-5.2 | OpenAI | ~$0.021 | ~10s | Better precision, correct alignment, cleaner OCR | ~$12 / ~1.5h |

Recommendation: Use gpt-5.2 (default) for production runs — the higher accuracy justifies the ~5× cost. Use gpt-4.1-mini for rapid iteration and prompt testing. Cost estimates for Gemini and Anthropic are based on token pricing and are not yet benchmarked on this corpus.
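
The full-run figures in the table are simply per-page cost times the 570-page run, e.g. for gpt-4.1-mini:

```python
def full_run_cost(cost_per_page, pages=570):
    """Back-of-envelope full-run cost, as in the table above."""
    return cost_per_page * pages

# gpt-4.1-mini: 570 pages at ~$0.004/page is about $2.28 (listed as ~$2.30)
mini_cost = full_run_cost(0.004)
```

Actual spend varies with page density and retries, so treat these as estimates.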

Prompt System

Each book has a dedicated prompt file in prompts/ that teaches the LLM how to extract bilingual pairs from that specific book's layout:

| Prompt file | Key rules |
|---|---|
| extract_bilingual_corpus.md | Base prompt: JSONL format, quality rules, exclusion criteria |
| toullec_lexique_1865.md | 4-column layout, parallel preface, gender suffix stripping |
| colloque_1890.md | 4-column verb lists, conversational dialogues |
| colloque_lourec_1884.md | Profession-based sections, disambiguating parentheses |
| normant_lexique_1902.md | Breton→French direction, conjugation tables, cross-references |
| geriadur_lexique_1927.md | French→Breton direction, sub-entry expansion, abbreviations |
| roparz_cours_elementaire_1930.md | DIVIZ dialogues, mutation tables, RÉSUMÉ pages |
| bozec_methode_1933.md | Facing-page alignment, illustration captions |
| yez_hon_tadou_1940.md | GERIADUR word lists, word families, bilingual conjugation |
| daniel_ker_vreiz_1944.md | Vocabulary with pronunciation, LENNADENN exclusions |

Adding a new book: Create prompts/<book_folder_name>.md following the same structure. The OCR stage picks it up automatically based on folder name.

Testing prompt changes: To iterate on a prompt without losing existing run data, copy it to a -next.md variant (e.g. prompts/my_book-next.md) and pass it via --book-prompt or --main-prompt. The prompt hash changes automatically, so a new run folder is created.
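
The "prompt hash changes, new run folder" mechanism amounts to keying runs on a content hash of the assembled prompt. A sketch (the `prompt_hash` helper is hypothetical; the actual run-folder logic lives in src/ocr/core.py):

```python
import hashlib

def prompt_hash(prompt_text):
    """Short, stable content hash of the assembled prompt text.

    Any edit to the prompt changes the hash, so a run started with an
    edited prompt lands in a fresh run folder instead of mixing output
    with results produced under the old prompt.
    """
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]
```

Truncating the SHA-256 digest keeps folder names readable while remaining effectively collision-free at this scale.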


Review

Review Overview

Copies JSONL files from ocr/<book>/<model>/<run>/extracted/ to review/<book>/ (flat, no model subfolder), then scans them for common errors and quality issues. Requires --run to specify which run folder to copy from. If review/<book>/ already exists, a confirmation prompt is shown before erasing its content.

After the review step, humans can inspect and correct the JSONL files in review/<book>/ before building the final corpus.

Checks include: missing/extra keys, empty values, invalid characters, length imbalance (≥ 3×), single-character entries, extremely long entries (> 256 chars), digit-only entries, truncated entries, identical pairs.

Generates a Markdown report at reports/<book>/review.md.
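
A few of those checks can be sketched per pair (the `check_pair` helper is illustrative; src/review.py implements the full list):

```python
def check_pair(pair, max_len=256, imbalance=3):
    """Flag common quality issues in one {breton, français} pair."""
    issues = []
    if set(pair) != {"breton", "français"}:
        issues.append("missing/extra keys")
        return issues
    br, fr = pair["breton"].strip(), pair["français"].strip()
    if not br or not fr:
        issues.append("empty value")
    elif br == fr:
        issues.append("identical pair")
    if max(len(br), len(fr)) > max_len:
        issues.append("extremely long entry")
    if br and fr and max(len(br), len(fr)) >= imbalance * min(len(br), len(fr)):
        issues.append("length imbalance")
    return issues
```

Flagged pairs are surfaced in the report rather than dropped, leaving the final call to the human reviewer.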

Review Usage

python pipeline.py review --run 0001-20260314-1624                    # All books, specific run
python pipeline.py review --run 0001-20260314-1624 my_book           # One book
python pipeline.py review --run 0001-20260314-1624 --model gpt-5.2   # Specific model
python pipeline.py review --yes --run 0001-20260314-1624             # Skip confirmation prompts

Diff

Diff Overview

Compares two directories (or files) of JSONL data and reports added, removed, and modified entries line-by-line. The primary use case is comparing the original ocr/ output with the human-corrected review/ folder to see what was changed during manual review.

By default, it compares the ocr and review directories at the project root.
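
The comparison reduces to set-style differencing of parsed lines; a sketch for a single file pair (the `diff_jsonl` helper is illustrative and ignores ordering, unlike the real line-by-line report):

```python
import json

def diff_jsonl(before_text, after_text):
    """Report removed and added pairs between two JSONL snapshots.

    A modified line shows up as one removal plus one addition, which is
    how a corrected OCR pair appears when diffing ocr/ against review/.
    """
    def load(text):
        return [json.loads(l) for l in text.splitlines() if l.strip()]

    before, after = load(before_text), load(after_text)
    removed = [p for p in before if p not in after]
    added = [p for p in after if p not in before]
    return removed, added
```

Diffing parsed objects rather than raw strings keeps cosmetic differences (key order, whitespace) out of the report.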

Diff Usage

python pipeline.py diff                                  # Default: compares ocr/ against review/
python pipeline.py diff ocr/my_book review/my_book       # Compare specific book directories
python pipeline.py diff -v                               # Verbose: show exactly which lines changed
python pipeline.py diff ocr/my_book/01.jsonl review/my_book/01.jsonl # Compare specific files

Evaluate

Evaluate Overview

Computes Word Error Rate (WER) and Character Error Rate (CER) for OCR outputs against manually corrected human references. Useful for benchmarking model and prompt improvements.

Each book to evaluate needs:

  • error_rates/<book>/human_reference/ — gold-standard JSONL files (manually corrected)
  • error_rates/<book>/jsonl/ — hypothesis JSONL files (OCR output to evaluate)

Metrics are computed separately for the Breton and French sides of each pair.
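
Both metrics are Levenshtein edit distance normalized by reference length, over words for WER and characters for CER. A self-contained sketch (src/evaluate.py may normalize or tokenize differently):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (tokens for WER, chars for CER)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def error_rate(reference, hypothesis, unit="word"):
    """WER (unit='word') or CER (unit='char') against a human reference."""
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

For example, `error_rate("demat deoc'h", "demat deoch")` gives a WER of 0.5: one of the two reference words was mis-transcribed.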

Evaluate Usage

python pipeline.py evaluate                    # All books in error_rates/
python pipeline.py evaluate my_book            # One specific book

Corpus

Corpus Overview

Reads per-page JSONL from review/<book>/ (after human correction), removes exact {breton, français} duplicate pairs, and writes a single corpus/<book>.jsonl per book. Empty files are skipped.
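
The dedup step can be sketched as a first-occurrence filter keyed on the exact pair (the `dedupe_pairs` helper is illustrative; src/corpus.py also handles the per-page merge):

```python
import json

def dedupe_pairs(lines):
    """Drop exact {breton, français} duplicates, keeping first-seen order."""
    seen = set()
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pair = json.loads(line)
        key = (pair.get("breton"), pair.get("français"))
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        out.append(pair)
    return out
```

Only exact duplicates are removed; near-duplicates (e.g. differing punctuation) are left for human review upstream.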

Corpus Usage

python pipeline.py corpus                                    # All books
python pipeline.py corpus geriadur_lexique_1927              # One book
python pipeline.py corpus -o /tmp/my_corpus                  # Custom output directory

Project Structure

├── pipeline.py              # Unified CLI entry point
├── setup.sh                 # Environment setup script
├── requirements.txt         # Core Python dependencies
├── requirements-enhance.txt # Enhancement deps (installed with --with-enhance)
├── src/
│   ├── utils.py             # Shared helpers (types, parsing, target discovery)
│   ├── extract.py           # PDF → PNG extraction
│   ├── enhance.py           # Image enhancement (DocRes + PreP-OCR + CLAHE)
│   ├── ocr/                 # VLM-based OCR (package)
│   │   ├── core.py          # Constants, types, cost estimation, run-folder management
│   │   ├── providers.py     # VLM client creation, retry logic, API wrappers
│   │   ├── reports.py       # Report generation and parsing
│   │   ├── sync.py          # Synchronous page-by-page processing
│   │   └── batch.py         # Gemini Batch API: submit, poll, collect
│   ├── review.py            # JSONL quality assurance
│   ├── evaluate.py          # WER/CER evaluation against human reference
│   └── corpus.py            # Deduplicate and merge into corpus/<book>.jsonl
├── prompts/
│   ├── extract_bilingual_corpus.md  # Base VLM extraction prompt
│   └── <book_name>.md              # Book-specific prompts (×9)
├── pdfs/                    # Source PDFs (not tracked)
│   └── <book>.pdf
├── pages/                   # Extracted page PNGs (not tracked)
│   └── <book>/
│       ├── 01.png
│       ├── 02.png
│       └── ...
├── pages_enhanced/          # Enhanced images (not tracked)
│   └── <book>/
│       ├── 01.png
│       ├── 02.png
│       └── ...
├── ocr/                     # Raw OCR output (per book, per model, per run)
│   └── <book>/
│       └── <model>/
│           └── <NNNN>-<YYYYMMDD>-<HHMM>/
│               ├── prompt.md
│               ├── run_state.json
│               ├── extracted/
│               │   ├── 01.jsonl
│               │   └── ...
│               └── reports/extraction/
├── review/                  # Staging area for human correction
│   └── <book>/
│       ├── 01.jsonl
│       ├── 02.jsonl
│       └── ...
├── corpus/                  # Final deduplicated corpus
│   ├── <book>.jsonl
│   └── ...
├── gold/                    # Gold-standard evaluation data
│   └── <book>/
│       ├── human_reference/ # Gold-standard JSONL (manually corrected)
│       ├── jsonl/           # Hypothesis JSONL (OCR output to evaluate)
│       └── report.md        # CER/WER + silence/noise analysis
├── docs/                    # Showcase landing page (GitHub Pages)
│   ├── index.html
│   ├── style.css
│   ├── script.js
│   └── hero_illustration.png
├── droplist/                # Per-book page exclusion lists
│   └── <book>/
│       └── drop_pages.json  # JSON array of page numbers to skip
└── reports/                 # Auto-generated quality reports
    └── <book>/
        └── <model>/
            └── review.md

License

This pipeline is intended for personal/research use in corpus linguistics.
