A refined, privacy-first desktop and CLI tool that converts PDFs — including scanned documents — into clean, structured Markdown. Built for researchers, professionals, and creators who demand accuracy, speed, and absolute data privacy.
Fast. Local. Intelligent. Fully offline.
Many PDF converters silently upload documents to remote servers. This tool does not.
- No uploads: Your files never leave your machine
- No telemetry: No usage tracking or analytics
- No cloud processing: All computation happens locally
- No background requests: Completely offline operation
Every step — extraction, OCR, reconstruction, and rendering — happens locally on your machine.
Intentionally designed for environments where confidentiality is non-negotiable:
- 🏥 Medical: Clinical notes, diagnostic reports, patient records
- ⚖️ Legal: Case files, evidence bundles, attorney-client communications
- 🏛️ Government: Policy drafts, restricted documents, classified materials
- 🎓 Academic Research: Paywalled journals, unpublished materials, grant proposals
- 💼 Corporate: Financial reports, IP-sensitive designs, strategic plans
Full support for encrypted PDFs with security-first design:
✅ Passwords never logged or saved — Memory-only processing
✅ No command-line exposure — Prevents process monitoring attacks
✅ Auto-cleanup — Temporary files deleted immediately
✅ Interactive prompts — Hidden input in GUI and CLI
GUI: Modal password dialog with masked input (*****)
CLI: getpass hidden terminal input
Supports all PDF encryption standards: 40-bit RC4, 128-bit RC4, 128/256-bit AES.
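The CLI behaviour described above (hidden input, nothing on the command line) can be approximated with the standard-library `getpass` module. This is a minimal sketch, not pdfmd's actual code: `prompt_for_password`, `check`, and `max_attempts` are hypothetical names, and `check` stands in for whatever call verifies the password against the PDF.

```python
import getpass
from typing import Callable, Optional

PROMPT = "PDF is password protected. Enter password (input will be hidden): "


def prompt_for_password(
    check: Callable[[str], bool],
    prompt: Callable[[str], str] = getpass.getpass,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for a password until check() accepts it or attempts run out.

    The password only lives in local variables: it is never read from
    sys.argv, written to a file, or logged.
    """
    for _ in range(max_attempts):
        password = prompt(PROMPT)
        if check(password):
            return password
    return None
```

Because the password is read interactively rather than passed as an argument, it does not appear in shell history or process listings.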
- Smart paragraph reconstruction — Joins wrapped lines intelligently
- Heading inference — Uses font metrics to detect document structure
- Bullet & numbered list detection — Recognises various formats (•, ○, -, 1., a., etc.)
- Hyphenation repair — Automatically unwraps "hy-\nphen" patterns
- URL auto-linking — Converts plain URLs into clickable Markdown links
- Inline formatting — Preserves bold and italic styling
- Header/footer removal — Detects and strips repeating page elements
- Multi-column awareness — Reduces cross-column text mixing
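Two of the steps above, hyphenation repair and joining wrapped lines, can be sketched with regular expressions. This is an illustrative approximation under simple assumptions (lowercase letters on both sides of the break, blank lines separating paragraphs); the function names are hypothetical and pdfmd's own heuristics may differ.

```python
import re


def repair_hyphenation(text: str) -> str:
    """Join words split as 'hy-' + newline + 'phen' into 'hyphen'."""
    # Only join when a lowercase letter sits on both sides of the break.
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def join_wrapped_lines(text: str) -> str:
    """Collapse hard-wrapped lines; blank lines still separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text)
    joined = (
        " ".join(line.strip() for line in para.splitlines() if line.strip())
        for para in paragraphs
    )
    return "\n\n".join(joined)
```

A known limitation of this simple rule: a genuine compound that happens to break at its hyphen (for example "well-" / "known") loses the hyphen.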
Your PDFs often contain tables split across blocks, columns, and various layout quirks. The robust table engine handles:
- Column-aligned tables — Detects 2+ space separated columns
- Bordered tables: Recognises explicit `|` and `¦` delimiters
- Tab-separated blocks: Handles tab-delimited data
- Multi-block vertical tables — Stitches tables split across PyMuPDF blocks
- Full Markdown rendering — Generates proper pipe tables with alignment
- Header row detection — Automatically identifies table headers
- Conservative heuristics — Avoids false positives on prose and lists
Perfect for academic papers, financial documents, and structured reports.
Detection Strategies (priority order):
- Bordered tables (highest confidence)
- Vertical multi-block tables
- ASCII whitespace-separated tables
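The whitespace-separated strategy can be illustrated as follows. This is a simplified sketch, not the actual `tables.py` logic: it treats runs of two or more spaces (or a tab) as column breaks and refuses to build a table unless every row has the same number of cells, a stand-in for the conservative heuristics mentioned above. The function names are hypothetical.

```python
import re
from typing import List, Optional


def split_columns(line: str) -> List[str]:
    """Split a line on tabs or runs of 2+ whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}|\t", line.strip()) if cell]


def to_pipe_table(lines: List[str]) -> Optional[str]:
    """Return a Markdown pipe table, or None if the lines do not look tabular."""
    rows = [split_columns(line) for line in lines if line.strip()]
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    if any(len(row) != len(rows[0]) for row in rows):
        return None  # ragged rows: more likely prose than a table
    header, body = rows[0], rows[1:]
    out = ["| " + " | ".join(header) + " |"]
    out.append("|" + "|".join("---" for _ in header) + "|")
    out.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(out)
```

Single spaces stay inside a cell, so a value such as "New York" is not split into two columns.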
Scientific documents finally convert cleanly. The Math Engine automatically:
- Detects inline & display math regions — Distinguishes equations from prose
- Converts Unicode math to LaTeX: `α → \alpha`, `√x → \sqrt{x}`
- Handles superscripts/subscripts: `x² → x^{2}`, `x₁₀ → x_{10}`
- Preserves existing LaTeX: Keeps `$...$` and `$$...$$` intact
- Avoids Markdown escaping: Math content bypasses normal escaping
- Maintains equation integrity — Keeps equations intact across line breaks
Ideal for scientific PDFs in physics, mathematics, engineering, and chemistry.
Examples:
- `E = mc²` → `E = mc^{2}`
- `α + β³` → `\alpha + \beta^{3}`
- `∫₀^∞ e^(-x²) dx` → `\int_{0}^{\infty} e^{-x^{2}} dx`
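The superscript, subscript, and symbol conversions can be sketched with translation tables. This is a hedged illustration only: the symbol map covers four characters, radicals and integral limits are not handled, and none of the names below come from `equations.py`.

```python
import re

SUPERSCRIPT_CHARS = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻"
SUBSCRIPT_CHARS = "₀₁₂₃₄₅₆₇₈₉₊₋"
PLAIN_CHARS = "0123456789+-"
SUP_TABLE = str.maketrans(SUPERSCRIPT_CHARS, PLAIN_CHARS)
SUB_TABLE = str.maketrans(SUBSCRIPT_CHARS, PLAIN_CHARS)
# Deliberately tiny symbol map; a real one would be much larger.
SYMBOLS = {"α": r"\alpha", "β": r"\beta", "π": r"\pi", "∞": r"\infty"}


def unicode_math_to_latex(text: str) -> str:
    """Rewrite Unicode super/subscripts and a few symbols as LaTeX."""
    text = re.sub(
        f"[{SUPERSCRIPT_CHARS}]+",
        lambda m: "^{" + m.group().translate(SUP_TABLE) + "}",
        text,
    )
    text = re.sub(
        f"[{SUBSCRIPT_CHARS}]+",
        lambda m: "_{" + m.group().translate(SUB_TABLE) + "}",
        text,
    )
    for char, command in SYMBOLS.items():
        # A command directly followed by a letter needs a separating space.
        text = re.sub(char + r"(?=[A-Za-z])", lambda m, c=command: c + " ", text)
        text = text.replace(char, command)
    return text
```

Consecutive superscript or subscript characters are grouped into one braced argument, which is why `x₁₀` becomes `x_{10}` rather than `x_{1}_{0}`.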
- Tesseract OCR — Lightweight, accurate, works on all major platforms
- OCRmyPDF — High-fidelity layout preservation
- Auto-detection — Automatically identifies scanned pages
- Language selection: Choose from 17+ Tesseract language packs or combine them (e.g. `eng+fra`)
- Configurable quality: Balance between speed and accuracy
- Mixed-mode support — Handles PDFs with both digital text and scanned pages
Auto-Detection Heuristics:
- Text density analysis (< 50 chars/page = likely scanned)
- Image coverage detection (>30% page area)
- Combined signals trigger OCR automatically
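The two signals above can be combined into a single per-page check. The sketch below assumes both conditions must hold (low text and high image coverage) and uses the thresholds stated in the list; the function name and parameters are hypothetical and do not come from pdfmd's source.

```python
def looks_scanned(
    char_count: int,
    image_area: float,
    page_area: float,
    min_chars: int = 50,
    min_image_ratio: float = 0.30,
) -> bool:
    """Return True when a page has little text and is mostly covered by images."""
    if page_area <= 0:
        return False
    low_text = char_count < min_chars
    image_heavy = (image_area / page_area) > min_image_ratio
    return low_text and image_heavy
```

Requiring both signals avoids sending a text-rich page with one large figure through OCR.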
- Dark/Light themes — Obsidian-style dark mode (default) with instant toggle
- Multi-file batch conversion — Select and convert multiple PDFs in one go
- Safe output naming: Automatic `_1`, `_2` suffixes prevent accidental overwrites
- OCR language selector: Choose Tesseract language from a dropdown or type a custom code
- Live progress tracking — Determinate progress bar with per-file status in batch mode
- Real-time console — View extraction and conversion logs as they happen
- Quick access — "Open Output Folder" link to finished Markdown
- Non-blocking conversion — Cancel long-running jobs anytime with Esc
- Keyboard shortcuts — Power-user workflow (Ctrl+Enter to convert)
- Persistent settings — Theme, paths, options, and profiles saved between sessions
- Conversion profiles — Built-in and custom presets for different document types
Obsidian-inspired dark theme with purple accents, grouped options with visual separators, and an integrated progress and log panel.
Toggle between themes instantly — your preference is saved between sessions.
A modular pipeline ensures clarity, stability, and extensibility.
PDF Input
↓
┌─────────────────┐
│ 1. EXTRACT │ ← Native PyMuPDF or OCR (Tesseract/OCRmyPDF)
└─────────────────┘ with configurable language (--lang)
↓
┌─────────────────┐
│ 2. TRANSFORM │ ← Clean text, remove headers/footers, detect structure
└─────────────────┘
↓
┌─────────────────┐
│ 3. RENDER │ ← Generate Markdown with headings, lists, links
└─────────────────┘
↓
┌─────────────────┐
│ 4. EXPORT │ ← Write .md file + optional image assets
└─────────────────┘
↓
Markdown Output
Each module maintains a single responsibility, ensuring the system remains clean, testable, and easy to extend.
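As a rough illustration of the staged design, the sketch below chains independent stage functions and reports each one through a log callback. The stage functions here are toy stand-ins operating on lists of strings; they are not the real `extract.py`, `transform.py`, or `render.py` interfaces, and `run_pipeline` is a hypothetical name.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[List[str]], List[str]]


def run_pipeline(
    lines: List[str],
    stages: Sequence[Tuple[str, Stage]],
    log: Callable[[str], None] = print,
) -> List[str]:
    """Feed the output of each stage into the next, logging as we go."""
    for name, stage in stages:
        log(f"[pipeline] {name}...")
        lines = stage(lines)
    return lines


def transform(lines: List[str]) -> List[str]:
    """Toy transform: drop blank lines and trim whitespace."""
    return [line.strip() for line in lines if line.strip()]


def render(lines: List[str]) -> List[str]:
    """Toy render: join paragraphs with blank lines."""
    return ["\n\n".join(lines)]
```

Because each stage only sees the previous stage's output, a new stage can be added to the sequence without touching the others.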
| Module | Purpose |
|---|---|
| extract.py | PDF text extraction, OCR orchestration (with language support), structural block formation, encrypted-PDF support |
| tables.py | Advanced table detection and Markdown table reconstruction (cell grouping, alignment rows, safety handling) |
| equations.py | Math detection heuristics and conversion to inline/display LaTeX-compatible Markdown |
| transform.py | Text cleanup, header/footer removal, block classification, integration of table/math structures into the document flow |
| render.py | Final Markdown generation with headings, lists, links, images, tables, and math rendering |
| pipeline.py | End-to-end orchestration: extract → structure → transform → tables → equations → render |
| models.py | Typed data structures: PageText, Block, Line, Span, Options (including ocr_lang) |
| utils.py | Platform helpers, OCR detection utilities, file handling, temp-file safety, logging tools |
| app_gui.py | Tkinter GUI: multi-file batch conversion, profiles, theming, progress tracking, encrypted-PDF dialogs |
| cli.py | Command-line interface for batch automation, scripting, language selection, and secured password prompts |
⭐ Single Responsibility per Module
Each component focuses on doing one thing well:
- extraction
- structure analysis
- tables
- equations
- transformation
- rendering
- user workflow (GUI/CLI)
This eliminates cross-contamination and makes features reliable and testable.
PDF → extract.py (with ocr_lang for Tesseract/OCRmyPDF)
↓
Raw blocks (text, spans, geometry)
↓
transform.py
↓
Structured blocks (paragraphs, lists, headings)
↓
tables.py
↓
Table blocks (aligned cells, rows, Markdown pipe tables)
↓
equations.py
↓
Equation blocks ($...$ / $$...$$)
↓
render.py
↓
Final Markdown output
This modular pipeline allows tables and equations to slot into the flow cleanly, without affecting the behaviour of unrelated modules.
- Researchers get reliable table conversion
- Academics get inline and display math suitable for Obsidian, Jupyter, pandoc, and mkdocs
- Developers get an extensible pipeline where new block types can be added without breaking existing components
- Users get clearer, more accurate Markdown output without extra configuration
With tables and equations now modularised, future upgrades can be added easily:
- Better table spanning (row/column spans)
- Layout-aware table detection using bounding-box coordinates
- Math rendering modes (strict, permissive)
- Multi-line header/footer detection
- Charts detection
- Diagram extraction
- Semantic tagging for AI/LLM workflows
This architecture forms a scalable base for long-term evolution of pdfmd.
# Clone repository
git clone https://github.com/M1ck4/pdfmd.git
cd pdfmd
# Install dependencies manually
pip install pymupdf pillow pytesseract ocrmypdf
# Launch GUI
python -m pdfmd.app_gui
# Clone and install
git clone https://github.com/M1ck4/pdfmd.git
cd pdfmd
# Minimal install (native text extraction only)
pip install -e .
# OR: Full install with OCR support (recommended)
pip install -e .[full]
# Use the CLI
pdfmd input.pdf
- Install Tesseract OCR:
  - Download: https://github.com/UB-Mannheim/tesseract/wiki
  - Run installer and check "Add to PATH"
  - For non-English documents, select additional language packs during installation
- Install Python packages (if running without the package installer):
  pip install pymupdf pillow pytesseract ocrmypdf
- Verify installation:
  tesseract --version
  tesseract --list-langs
# Install Tesseract (with additional languages if needed)
brew install tesseract
brew install tesseract-lang # all language packs
# Install OCRmyPDF (recommended)
brew install ocrmypdf
# Install Python dependencies manually
pip install pymupdf pillow pytesseract ocrmypdf
# System dependencies
sudo apt-get update
sudo apt-get install tesseract-ocr ocrmypdf
# Additional language packs (examples)
sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-jpn
# Python dependencies
pip install pymupdf pillow pytesseract ocrmypdf
Download the latest .exe from Releases; no Python required.
Note: Tesseract must still be installed separately for OCR functionality.
The graphical interface can be started in several ways:
# If installed as a package:
python -m pdfmd.app_gui
# Direct execution (from package directory):
python app_gui.py
Basic Conversion in 7 Steps:
1. 📂 Select Input PDF(s)
   - Click Browse... next to "Input PDF(s)"
   - Select one or multiple PDF files (multi-select supported)
   - For multiple files, the entry shows the count and first filename
   - Paths are remembered between sessions
2. 💾 Choose Output Location
   - Single file: Output path is auto-suggested as `input.md`
   - Multiple files: Output defaults to the input folder; Browse opens a folder picker
   - Click Browse... to change location
   - Or manually edit the path
3. ⚙️ Select Profile
   - Choose from built-in profiles:
     - Default: Balanced settings for most documents
     - Academic article: Optimised for papers with equations
     - Slides / handouts: Image export + page breaks
     - Scan-heavy / OCR-first: Force OCR on all pages
   - Or use your custom saved profiles
4. 🔧 Configure Options

   OCR Settings:
   - OCR mode:
     - `off`: Native text extraction (fastest)
     - `auto`: Detect scanned pages automatically ✨ recommended
     - `tesseract`: Force OCR on all pages
     - `ocrmypdf`: High-quality OCR preprocessing
   - Language: Select from 17 common Tesseract language codes or type any valid code. Combine with `+` for multi-language documents (e.g. `eng+fra`).

   Output Options:
   - ☑️ Export images: Save images to `_assets/` folder
   - ☑️ Insert page breaks: Add `---` between pages
   - ☑️ Preview first 3 pages: Quick test before full conversion

   Text Processing:
   - ☑️ Remove repeating header / footer: Auto-detect and strip
   - ☑️ Promote CAPS to headings: Treat ALL CAPS as section titles
   - ☑️ Defragment short orphans: Merge isolated short lines

   Fine-Tuning:
   - Heading size ratio (1.0–2.5): Font size threshold for headings
   - Orphan max length (10–120): Character limit for line merging
5. ▶️ Convert
   - Click ▶ Convert
   - Or press Ctrl+Enter (keyboard shortcut)
   - The conversion runs in the background
6. 📊 Monitor Progress
   - Watch the progress bar for completion status
   - For batch conversions, see `[1/5]`, `[2/5]` etc. in the log
   - View live logs in the console panel
   - See current status in the status line
   - Press Stop or Esc to cancel if needed
7. ✅ Open Output
   - When complete, click the Open folder link
   - Opens the output directory in your file manager
   - Your Markdown file(s) are ready to use
The GUI supports converting multiple PDFs in a single run:
- Click Browse and select multiple PDF files (Ctrl+click or Shift+click)
- The input field shows the file count: "52 files selected, first: report.pdf, +51 more"
- Output automatically switches to a folder picker
- Each PDF produces its own `<filename>.md` in the output folder
- No accidental overwrites: if `report.md` already exists, the output is saved as `report_1.md`, `report_2.md`, etc.
- Progress scales across all files with per-file log headers
- If one file fails, the rest continue; a summary shows successes and failures
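The overwrite protection described above amounts to probing for the first free suffix. A minimal sketch, assuming suffixes start at `_1` and increase by one (the function name is hypothetical):

```python
from pathlib import Path


def safe_output_path(path: Path) -> Path:
    """Return path itself if it is free, else the first free 'stem_N.suffix'."""
    if not path.exists():
        return path
    n = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{n}{path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1
```

Note that this check-then-write pattern is not atomic; it is adequate for a single sequential batch but could race if several processes write to the same folder.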
Built-in Profiles:
- Default: Balanced settings for general documents, auto-detect headers/footers, smart heading detection
- Academic article: Optimised for research papers, higher orphan threshold (60 chars), tighter heading ratio (1.10), OCR mode: `auto`
- Slides / handouts: Export images automatically, insert page breaks between slides, disabled header/footer removal, OCR mode: `auto`
- Scan-heavy / OCR-first: Force Tesseract OCR on all pages, no CAPS-to-heading conversion, best for old scanned documents
Custom Profiles:
- Adjust settings to your preference
- Click Save...
- Enter a profile name
- Profile is saved and available for future use (including OCR language)
To delete: Select a custom profile, click Delete, confirm. (Built-in profiles cannot be deleted.)
| Shortcut | Action |
|---|---|
| Ctrl+O | Browse for input PDF(s) |
| Ctrl+Shift+O | Browse for output location |
| Ctrl+Enter | Start conversion |
| Esc | Stop/cancel conversion |
🎨 Themes
Toggle between Dark and Light themes. Theme preference is saved between sessions.
- Dark — Obsidian-inspired dark mode with layered panels and purple accents
- Light — Clean light mode with white cards on grey
🌐 OCR Language
The language selector next to the OCR mode dropdown lets you choose the Tesseract language for OCR. It comes pre-loaded with 17 common languages:
eng, deu, fra, spa, ita, por, nld, pol, rus, chi_sim, chi_tra, jpn, kor, ara, hin, tur, vie
The field is editable — type any Tesseract language code directly, or combine languages with + (e.g. eng+deu). The selected language is saved with your config and profiles.
Note: The corresponding Tesseract language pack must be installed on your system for OCR to work in that language.
🔒 Password Protection
For encrypted PDFs:
- Start conversion as normal
- Password dialog appears automatically
- Enter password (input is hidden)
- Click OK or press Enter
- Conversion proceeds with decrypted content
The password is held in memory only; it is never logged, saved to disk, or passed to external processes.
Stop a long-running conversion by clicking Stop or pressing Esc. The current step completes, then the conversion stops gracefully. In batch mode, remaining files are skipped.
📝 Live Logging
The console panel shows real-time progress:
[pipeline] Extracting text...
[pipeline] Transforming pages...
[profile] Applied profile: Academic article
[pipeline] Removed repeating edges → header='Chapter 1', footer='- - 1'
[pipeline] Rendering Markdown...
[pipeline] Saved → /path/to/output.md
In batch mode, each file is logged with a numbered header:
============================================================
[1/5] report.pdf
============================================================
Input: C:\Documents\report.pdf
Output: C:\Documents\output\report.md
...
💾 Persistent Settings
Automatically saved between sessions:
- Last input/output paths
- Current options and settings (including OCR language)
- Custom profiles
- Theme preference
Configuration stored at: ~/.pdfmd_gui.json
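Persisting settings to a JSON file can be sketched as below. The keys in `DEFAULTS` are illustrative assumptions; the actual schema of `~/.pdfmd_gui.json` is not documented here. The sketch falls back to defaults when the file is missing or unreadable.

```python
import json
from pathlib import Path

# Hypothetical keys, chosen only for this example.
DEFAULTS = {"theme": "dark", "ocr_mode": "off", "ocr_lang": "eng"}


def load_settings(path) -> dict:
    """Load settings, falling back to defaults on a missing or corrupt file."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # ValueError covers invalid JSON
        return dict(DEFAULTS)
    if not isinstance(data, dict):
        return dict(DEFAULTS)
    return {**DEFAULTS, **data}


def save_settings(path, settings: dict) -> None:
    """Write settings as indented JSON so the file stays hand-editable."""
    Path(path).write_text(json.dumps(settings, indent=2), encoding="utf-8")
```

Merging the loaded values over the defaults means an older config file without a newer key still loads cleanly.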
Quick Preview:
- Select your PDF
- Check Preview first 3 pages
- Click ▶ Convert
- Review output to verify settings
- Uncheck preview and run full conversion
Batch Processing:
- Click Browse and select all PDFs to convert
- Set output folder (or use the auto-suggested input directory)
- Choose your profile and options
- Click ▶ Convert — all files are processed sequentially
- Review the batch summary in the log
Scanned Documents:
- Select scanned PDF
- Set OCR mode to auto or tesseract
- Select the correct Language for the document
- Consider enabling Export images
- Click ▶ Convert
- Monitor OCR progress in logs (may take several minutes)
Non-English Documents:
- Select your PDF
- Set OCR mode to auto or tesseract
- Choose the appropriate Language from the dropdown (e.g. `deu` for German, `jpn` for Japanese)
- For mixed-language documents, type a combined code: `eng+fra`
- Click ▶ Convert
Academic Papers:
- Select Academic article profile
- Verify settings (OCR: auto, heading ratio: 1.10)
- Click ▶ Convert
- Tables and equations are automatically detected and formatted
The CLI can be invoked in several ways:
# If installed as a package (recommended):
pdfmd input.pdf
# Using Python module syntax (from project root):
python -m pdfmd.cli input.pdf
# Basic conversion (writes input.md next to the PDF)
pdfmd report.pdf
# Specify output file
pdfmd report.pdf -o notes.md
# Auto-detect scanned pages and OCR as needed
pdfmd scan.pdf --ocr auto
# OCR a German document
pdfmd vertrag.pdf --ocr auto --lang deu
# OCR a mixed English/French document
pdfmd bilingual.pdf --ocr tesseract --lang eng+fra
# Batch convert multiple PDFs
pdfmd *.pdf --ocr auto -o converted_md/
📄 Standard Documents
# Clean, text-based PDFs (articles, reports, books)
pdfmd document.pdf
# With statistics summary
pdfmd document.pdf --stats
🔍 Scanned Documents
# Auto-detect and OCR scanned pages only
pdfmd scan.pdf --ocr auto
# Force Tesseract OCR on all pages
pdfmd scan.pdf --ocr tesseract
# Use OCRmyPDF for high-quality layout preservation
pdfmd scan.pdf --ocr ocrmypdf
# OCR with a specific language
pdfmd japanese_doc.pdf --ocr tesseract --lang jpn
🖼️ Documents with Images
# Extract images to _assets/ folder with references
pdfmd presentation.pdf --export-images
# OCR + images for scanned slides
pdfmd slides.pdf --ocr auto --export-images
📋 Quick Preview
# Process only first 3 pages (fast inspection)
pdfmd long_paper.pdf --preview-only
# Preview with stats
pdfmd long_paper.pdf --preview-only --stats
🔒 Password-Protected PDFs
# Interactive password prompt (secure, no command-line exposure)
pdfmd encrypted.pdf
# The CLI will detect encryption and prompt for password
# Password is never logged or shown in process listings
🔇 Scripting & Automation
# Quiet mode (errors only, good for scripts)
pdfmd batch/*.pdf --ocr auto --quiet --no-progress
# Non-interactive mode (fails if password needed)
pdfmd document.pdf --no-progress -q
🔬 Debug & Verbose Output
# Basic verbose output
pdfmd document.pdf -v
# Debug-level detail (includes pipeline stages)
pdfmd document.pdf -vv
# Without coloured output (for logs)
pdfmd document.pdf -v --no-color
usage: pdfmd [-h] [-o OUTPUT] [--ocr {off,auto,tesseract,ocrmypdf}]
[--lang LANG] [--export-images] [--page-breaks] [--preview-only]
[--no-progress] [-q] [-v] [--stats] [--no-color] [--version]
INPUT_PDF [INPUT_PDF ...]
Convert PDF files to clean, Obsidian-ready Markdown with table and
math-aware conversion. Runs fully offline: no uploads, no telemetry,
no cloud dependencies.
positional arguments:
INPUT_PDF Path(s) to input PDF file(s). Multiple files supported.
options:
-h, --help Show this help message and exit
-o OUTPUT, --output OUTPUT
Output path. For single input: .md file path.
For multiple inputs: directory (created if needed).
Default: writes input.md next to each PDF.
--ocr {off,auto,tesseract,ocrmypdf}
OCR mode (default: off):
off — use native text extraction only
auto — detect scanned pages, OCR as needed
tesseract — force page-by-page Tesseract OCR
ocrmypdf — pre-process with OCRmyPDF for high-fidelity layout
--lang LANG Tesseract language code(s) for OCR (default: eng).
Use a Tesseract language code, e.g. 'deu' for German,
'fra' for French, 'jpn' for Japanese.
Combine with '+' for multiple: 'eng+fra'.
Only used when --ocr is not 'off'.
--export-images Export images to _assets/ folder next to output file,
with Markdown image references appended to document.
--page-breaks Insert '---' horizontal rule between pages in output.
--preview-only Only process first 3 pages (useful for quick inspection
of large documents or testing settings).
--no-progress Disable terminal progress bar (useful for logging).
-q, --quiet Suppress non-error messages. Only show errors.
-v, --verbose Increase verbosity:
-v — show conversion stages and logs
-vv — debug-level detail with full pipeline info
--stats Print document statistics after conversion:
word count, headings, tables, lists.
--no-color Disable coloured terminal output (for log files).
--version Print version and exit.
Batch Processing:
# Convert all PDFs in current directory
pdfmd *.pdf --ocr auto -o markdown_output/
# Convert with consistent settings
for pdf in papers/*.pdf; do
pdfmd "$pdf" --ocr auto --lang eng --stats
done
Tables and Math:
# The CLI automatically detects and converts:
# • Text tables → GitHub-flavoured Markdown tables
# • Unicode math (E = mc², x₁₀², α + β³) → LaTeX-style equations
# • Existing LaTeX math is preserved
pdfmd academic_paper.pdf --stats
Integration with Other Tools:
# Pipeline with other markdown tools
pdfmd input.pdf -o - | pandoc -f markdown -o output.docx
# Generate and preview
pdfmd paper.pdf && code paper.md
# Conversion + commit
pdfmd updated.pdf && git add updated.md && git commit -m "Update notes"
Single PDF:
pdfmd input.pdf
# Creates: input.md (same directory as input.pdf)
pdfmd input.pdf -o notes.md
# Creates: notes.md (current directory)
pdfmd input.pdf -o ~/Documents/notes.md
# Creates: ~/Documents/notes.md
Multiple PDFs:
pdfmd file1.pdf file2.pdf file3.pdf
# Creates: file1.md, file2.md, file3.md (next to originals)
pdfmd *.pdf -o converted/
# Creates: converted/file1.md, converted/file2.md, ...
# Directory is created if it doesn't exist
Image Export:
pdfmd slides.pdf --export-images
# Creates:
# slides.md
# slides_assets/
# ├── img_001_01.png
# ├── img_001_02.png
# └── ...
# Images referenced at end of slides.md
Missing Dependencies:
$ pdfmd scan.pdf --ocr tesseract
Error: OCR mode 'tesseract' selected but Tesseract binary is not available.
Install Tesseract from: https://github.com/UB-Mannheim/tesseract/wiki
Then run: pip install pytesseract pillow
Password-Protected Files:
$ pdfmd encrypted.pdf
PDF is password protected. Enter password (input will be hidden):
[password entry is hidden]
Converting encrypted.pdf → encrypted.md
Invalid Files:
$ pdfmd missing.pdf
Error: input file not found: missing.pdf
$ pdfmd document.txt
Error: The input file must have a .pdf extension.
Password Handling:
- Interactive prompts only — passwords never passed via command-line arguments
- No process exposure: passwords not visible in `ps` or process listings
- Memory-only: passwords never logged, cached, or persisted to disk
- No network — all processing is local, passwords never transmitted
Privacy:
- 100% offline — no uploads, no telemetry, no external API calls
- No cloud dependencies — all OCR and processing happens on your machine
- Output is unencrypted: protect `.md` files according to your environment's security requirements
Large Documents:
# Preview first to check settings (fast)
pdfmd large_book.pdf --preview-only --stats
# Then convert full document
pdfmd large_book.pdf --ocr auto
# Disable progress bar for slight speed improvement
pdfmd large_book.pdf --no-progress
OCR Performance:
# Fastest: only OCR scanned pages
pdfmd mixed.pdf --ocr auto
# Medium: page-by-page Tesseract (more accurate for scans)
pdfmd scan.pdf --ocr tesseract
# Slowest but best quality: OCRmyPDF preprocessing
pdfmd scan.pdf --ocr ocrmypdf
Batch Optimisation:
# Process in parallel (Unix/Linux/macOS); NUL-delimited so filenames with spaces survive:
find . -maxdepth 1 -name "*.pdf" -print0 | xargs -0 -n 1 -P 4 pdfmd --ocr auto --quiet
# PowerShell parallel (ForEach-Object -Parallel requires PowerShell 7+):
Get-ChildItem *.pdf | ForEach-Object -Parallel {
  pdfmd $_.FullName --ocr auto --quiet
} -ThrottleLimit 4
Exit codes:
- 0: Success (all files converted)
- 1: Error (one or more files failed)
# Use in scripts:
if pdfmd document.pdf --quiet; then
echo "Conversion successful"
else
echo "Conversion failed"
exit 1
fi
For developers who want to integrate pdfmd into their own Python code, a full API reference is available in doc/API.md.
This document covers:
- Programmatic use of `pdf_to_markdown`
- All `Options` fields and behaviours (including `ocr_lang`)
- Progress & logging callbacks
- Advanced / lower-level pipeline access
- Integration examples (scripts, pandoc, Jupyter)
OCR Language (default eng)
- Tesseract language code for OCR
- Common codes: `eng` (English), `deu` (German), `fra` (French), `jpn` (Japanese), `chi_sim` (Chinese Simplified)
- Combine with `+` for multi-language: `eng+fra`
- Requires the corresponding Tesseract language pack to be installed
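Since a combined code such as `eng+fra` only works when every pack is installed, it can help to check the parts individually. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the `installed` set would typically be collected from the output of `tesseract --list-langs`.

```python
from typing import Iterable, List


def split_lang_codes(lang: str) -> List[str]:
    """Split a Tesseract language string such as 'eng+fra' into its codes."""
    return [code for code in lang.split("+") if code]


def missing_language_packs(lang: str, installed: Iterable[str]) -> List[str]:
    """Return the requested codes that are not in the installed set."""
    available = set(installed)
    return [code for code in split_lang_codes(lang) if code not in available]
```

An empty result means every requested pack is present; otherwise the list names exactly which packs still need installing.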
Heading Size Ratio (1.0 to 2.5, default 1.15)
- Font size multiplier for heading detection
- Lower = more headings, Higher = fewer headings
- Example: Body text 11pt → headings must be ≥12.65pt
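The ratio test can be written out directly. A minimal sketch, assuming the comparison is "greater than or equal to body size times ratio" as the example above implies; `is_heading` is a hypothetical name.

```python
def is_heading(font_size: float, body_size: float, ratio: float = 1.15) -> bool:
    """Return True when a span is large enough, relative to body text, to be a heading."""
    return font_size >= body_size * ratio
```

With 11pt body text and the default ratio, a 13pt span qualifies and a 12pt span does not. Lowering the ratio to 1.10 (the Academic article profile) drops the threshold to about 12.1pt, so a 12.5pt span would then count as a heading.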
Orphan Max Length (10 to 120, default 45)
- Maximum characters for orphan line merging
- Short isolated lines get merged into previous paragraph
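Orphan merging can be sketched as a single pass over paragraph strings. This is a simplified illustration: the function name is hypothetical, and the rule that lines starting with `-`, `#`, or `|` are left alone is an assumption added here so list items, headings, and table rows are not swallowed.

```python
from typing import List


def merge_orphans(paragraphs: List[str], max_len: int = 45) -> List[str]:
    """Append short isolated lines to the paragraph before them."""
    merged: List[str] = []
    for para in paragraphs:
        is_structural = para.startswith(("-", "#", "|"))
        if merged and len(para) <= max_len and not is_structural:
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged
```

Raising `max_len` (for example to 60, as in the Academic article profile) merges longer fragments, at the cost of occasionally absorbing a genuine short paragraph.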
CAPS to Headings (default: True)
- Treats ALL-CAPS or MOSTLY-CAPS lines as headings
Remove Headers/Footers (default: True)
- Detects repeating text across 3+ pages
- Removes "Page N", "- - 1", footer patterns
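Detecting repeated page edges can be approximated by counting the first and last line of each page. The sketch below is not the actual `transform.py` logic: it only inspects one line at each edge, and it replaces digit runs with `#` (an assumption made here) so that "Page 1" and "Page 2" count as the same footer.

```python
import re
from collections import Counter
from typing import List


def normalise(line: str) -> str:
    """Collapse digit runs so numbered headers/footers compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeating_edges(pages: List[List[str]], min_pages: int = 3) -> List[List[str]]:
    """Remove first/last lines that repeat on at least min_pages pages."""
    first = Counter(normalise(p[0]) for p in pages if p)
    last = Counter(normalise(p[-1]) for p in pages if p)
    headers = {text for text, n in first.items() if n >= min_pages}
    footers = {text for text, n in last.items() if n >= min_pages}
    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and normalise(lines[0]) in headers:
            lines = lines[1:]
        if lines and normalise(lines[-1]) in footers:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned
```

The `min_pages` threshold mirrors the "3+ pages" rule above, so a two-page document is left untouched.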
Defragment Short Lines (default: True)
- Merges short orphan lines into paragraphs
- Improves reading flow
Settings saved to: ~/.pdfmd_gui.json
The GUI persists your last-used options to this config file (including OCR language). The CLI currently uses its own defaults and command-line flags.
Safe to edit manually for advanced customisation.
To reset GUI settings:
rm ~/.pdfmd_gui.json
Before (PDF):
INTRODUCTION
This is a para-
graph with hyph-
enation.
• Bullet one
• Bullet two
Page 1
After (Markdown):
# Introduction
This is a paragraph with hyphenation.
- Bullet one
- Bullet two
Improvements:
- ✅ Hyphenation repaired (`para-graph` → `paragraph`)
- ✅ Extra spaces normalised
- ✅ Bullets converted to Markdown
- ✅ Page numbers removed
- ✅ Heading properly formatted
Before (PDF):
Name Age City
Alice 30 New York
Bob 25 London
Carol 35 Tokyo
After (Markdown):
| Name | Age | City |
|:------|----:|:---------|
| Alice | 30 | New York |
| Bob | 25 | London |
| Carol | 35 | Tokyo |
Before (PDF):
The equation E = mc² shows mass-energy equivalence.
For integrals: ∫₀^∞ e^(-x²) dx = √π/2
After (Markdown):
The equation $E = mc^{2}$ shows mass-energy equivalence.
For integrals: $\int_{0}^{\infty} e^{-x^{2}} dx = \sqrt{\pi}/2$
- Test with preview mode first:
  pdfmd large.pdf --preview-only --ocr auto
- Disable OCR if not needed:
  pdfmd text-only.pdf --ocr off
- Only export images when necessary: Each image adds processing time
- Use Tesseract instead of OCRmyPDF: Faster but less accurate
- Close other applications: OCR is CPU-intensive
- Process in batches: Split large PDFs first
# Process 4 PDFs simultaneously (Unix, requires GNU parallel)
find . -name "*.pdf" | parallel -j 4 pdfmd {} --ocr auto
In the GUI, batch conversion processes files sequentially but with scaled progress tracking and per-file logging.
Auto-Detection & Engine Selection:
| Platform | Primary OCR | Fallback | Notes |
|---|---|---|---|
| Windows | Tesseract | Native PyMuPDF | Fast, lightweight |
| macOS | OCRmyPDF | Tesseract | Best layout preservation |
| Linux | OCRmyPDF | Tesseract | Ideal for servers |
Scanned PDF Detection:
The auto mode analyses the first 3 pages for:
- Text density (< 50 chars/page = likely scanned)
- Large images covering >30% of page area
- Combined low text + high image coverage triggers OCR
pip install pymupdf
Windows: Reinstall Tesseract and check "Add to PATH" during installation
macOS: brew install tesseract
Linux: sudo apt-get install tesseract-ocr
Verify installation:
tesseract --version
pip install ocrmypdf
Or on macOS:
brew install ocrmypdf
Fixed in v1.6.0. This occurred when a PDF contained images in CMYK or other non-RGB colourspaces. The image export pipeline now automatically converts all colourspaces to RGB before saving as PNG, and skips any images that still fail rather than aborting the conversion.
- Check original scan quality — Blurry scans won't improve
- Ensure the correct language is selected: use `--lang` (CLI) or the Language dropdown (GUI)
- Try different OCR mode:
  pdfmd scan.pdf --ocr ocrmypdf --lang deu # Better than tesseract
- Ensure Tesseract language data is installed:
  tesseract --list-langs
- For very poor scans, consider rescanning at higher DPI
- Ensure PyMuPDF is up to date:
  pip install --upgrade pymupdf
- Check that PDF actually requires a password (not just restricted)
- Try running from command line to see error messages
# Check if tkinter is installed (comes with Python on most systems)
python -c "import tkinter"
# On Linux, you may need to install:
sudo apt-get install python3-tk
If installed as a package but command not found:
# Ensure pip install directory is in PATH, or use:
python -m pdfmd.cli input.pdf
Fixed in v1.6.0. This occurred when a saved config file existed from a previous session and triggered theme application before the UI was fully built. The fix adds guards to `_apply_theme()` and `_set_status()` so they safely defer if called before the UI is ready.
To clear a corrupted config:
rm ~/.pdfmd_gui.json
Problem: Progress bar stuck, no log updates
Solution:
- Press Esc or click Stop to cancel
- Try with Preview first 3 pages to diagnose
- Check if PDF is corrupted or extremely large
- Try different OCR mode
Problem: Password dialog keeps appearing
Solution:
- Verify password is correct
- Check if PDF has user vs. owner password restrictions
- Try opening PDF in another viewer to test password
Problem: "Open folder" link doesn't open file manager
Solution:
- Manually navigate to output file location
- Check file was actually created (look in logs)
- On Linux, ensure `xdg-open` is available
Problem: OCR taking too long (>5 minutes for 50 pages)
Expected Behaviour:
- Tesseract: ~1 page/second at 300 DPI
- OCRmyPDF: ~2-3 seconds/page (includes pre-processing)
Solutions:
- Use preview mode to test settings first
- Consider `--ocr auto` instead of forcing OCR on all pages
- Disable image export if not needed
- Close resource-heavy applications
Problem: Application using excessive RAM
Causes:
- Large PDFs (>100 pages)
- High-resolution images
- OCR processing
Solutions:
- Process in preview mode first
- Split large PDFs into smaller chunks
- Disable image export
- Increase system swap space
Contributions welcome! You can help by:
- Testing with difficult PDFs (scanned, multi-column, handwritten)
- Improving OCR heuristics and accuracy
- Enhancing Markdown formatting logic
- Expanding profile presets
- Adding unit tests
- Improving documentation
MIT License. Free for personal and commercial use.
See LICENSE file for details.
Built with:
- PyMuPDF — Fast PDF rendering and text extraction
- Tesseract OCR — Google's open-source OCR engine
- OCRmyPDF — High-quality OCR layer addition
- Pillow — Image processing
- pytesseract — Python Tesseract wrapper
- The PyMuPDF team for excellent PDF handling capabilities
- The Tesseract OCR community for continuous improvements
- All contributors and testers who help improve pdfmd
- Repository: https://github.com/M1ck4/pdfmd
- Issues: https://github.com/M1ck4/pdfmd/issues
- Releases: https://github.com/M1ck4/pdfmd/releases
- API Reference: doc/API.md
- Check Documentation: Read this README thoroughly
- Search Issues: Check if your problem is already reported
- Ask Questions: Open a GitHub issue with the `question` label
- Report Bugs: Provide detailed information (see Contributing section)
We welcome feature requests! Please open an issue with:
- Clear description of the proposed feature
- Use cases and benefits
- Any implementation ideas (optional)
- Use Academic article profile for papers
- Enable `--stats` to verify table/equation extraction
- Preview mode helps dial in heading detection
- Save custom profiles for different journal formats
- For non-English papers, set the correct `--lang` for OCR
- Always verify password security (in-memory only)
- Use `--quiet` mode for scripting document workflows
- Batch processing for discovery documents (GUI multi-select or CLI wildcards)
- Consider splitting very large files first
- Study the modular architecture for extending features
- Each module has clear input/output contracts
- Add custom profiles via JSON config
- Hook into pipeline stages for custom processing
- The `ocr_lang` field on `Options` is passed through to both Tesseract and OCRmyPDF
- Start with default settings and iterate
- Use preview mode to find optimal settings
- Save profiles once you find settings you like
- Keyboard shortcuts speed up workflow significantly
- Use multi-file selection to batch convert an entire folder in one go
Free. Open. Useful. Private. Always.
