DocMeld: lightweight document → agent-ready knowledge pipeline
Quick Start • Architecture • Python API • CLI • Configuration • Contributing
DocMeld converts PDF documents into structured, agent-consumable formats through a three-stage pipeline, without requiring expensive OCR, VLM, or multimodal models. Built for the age of AI agents, it bridges the gap between static documents and the structured knowledge that LLMs need.
Most tools stop at format conversion. DocMeld goes further: Document → Structured Elements → Page Knowledge → AI-Enriched Metadata, producing outputs ready for RAG pipelines, agent systems, and downstream AI workflows.
| Feature | DocMeld | MinerU | Docling | Marker | MarkItDown |
|---|---|---|---|---|---|
| No ML models required | ✅ | ❌ | ❌ | ❌ | ✅ |
| Runs fully offline (core) | ✅ | ❌ | ✅ | ✅ | ✅ |
| Agent-ready outputs | ✅ | ❌ | ❌ | ❌ | ❌ |
| AI metadata enrichment | ✅ | ❌ | ❌ | ❌ | ❌ |
| Lightweight install | ✅ | ❌ | ❌ | ❌ | ✅ |
| MIT license | ✅ | ❌ (AGPL) | ✅ | ❌ (GPL) | ✅ |
| Swappable backends | ✅ | ❌ | N/A | ❌ | ❌ |
```bash
pip install docmeld
```

With the optional Docling backend:

```bash
pip install docmeld[docling]
```

```python
from docmeld import DocMeldParser

parser = DocMeldParser("research_paper.pdf")
result = parser.process_all()
print(f"Processed {result.successful}/{result.total_files} files in {result.processing_time_seconds}s")
```

Or from the command line:

```bash
docmeld process research_paper.pdf
```

That's it. Your PDF is now structured JSON, page-by-page JSONL, and (optionally) AI-enriched metadata.
DocMeld uses a three-stage medallion architecture. Each stage is independently runnable and idempotent — re-running skips already-processed files.
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   BRONZE    │      │   SILVER    │      │    GOLD     │
│             │      │             │      │             │
│ PDF → JSON  │─────▶│ JSON → JSONL│─────▶│ JSONL → AI  │
│  elements   │      │    pages    │      │  metadata   │
│             │      │             │      │             │
│  PyMuPDF /  │      │    Title    │      │  DeepSeek   │
│   Docling   │      │  hierarchy  │      │ enrichment  │
└─────────────┘      └─────────────┘      └─────────────┘
    offline              offline         requires API key
```
Extracts document elements (titles, text, tables, images) into a unified JSON format with element IDs and parent-child hierarchy.
```json
[
  {
    "type": "title",
    "level": 0,
    "content": "Executive Summary",
    "page_no": 1,
    "element_id": "e_001",
    "parent_id": ""
  },
  {
    "type": "text",
    "content": "The company reported strong Q2 results...",
    "page_no": 1,
    "element_id": "e_002",
    "parent_id": "e_001"
  },
  {
    "type": "table",
    "content": "| Metric | Q1 | Q2 |\n|---|---|---|\n| Revenue | 10M | 15M |",
    "summary": "Items: Revenue",
    "page_no": 2,
    "element_id": "e_003",
    "parent_id": "e_001",
    "table_data": {
      "headers": ["Metric", "Q1", "Q2"],
      "rows": [["Revenue", "10M", "15M"]],
      "num_rows": 1,
      "num_cols": 3
    }
  }
]
```

Supported element types:
| Type | Fields | Description |
|---|---|---|
| `title` | `level`, `content` | Headings with hierarchy (0–5) |
| `text` | `content` | Paragraph content |
| `table` | `content`, `summary`, `table_data` | Markdown tables with structured data |
| `image` | `image_name`, `image`, `bbox`, `image_id` | Base64-encoded images with metadata |
All elements include `page_no`, `element_id`, and `parent_id` for cross-referencing.
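Because every element carries an `element_id` and `parent_id`, the flat bronze list can be rebuilt into a document tree. A minimal sketch using the sample elements above (plain dicts stand in for the parsed JSON):

```python
from collections import defaultdict

# Sample bronze elements (same shape as the JSON above, trimmed to the
# fields needed for hierarchy navigation)
elements = [
    {"type": "title", "content": "Executive Summary", "element_id": "e_001", "parent_id": ""},
    {"type": "text", "content": "The company reported strong Q2 results...", "element_id": "e_002", "parent_id": "e_001"},
    {"type": "table", "content": "| Metric | Q1 | Q2 | ...", "element_id": "e_003", "parent_id": "e_001"},
]

# Index children by their parent's element_id
children = defaultdict(list)
for el in elements:
    children[el["parent_id"]].append(el)

def walk(parent_id="", depth=0):
    """Depth-first walk of the element tree, printing an indented outline."""
    for el in children[parent_id]:
        print("  " * depth + f"{el['type']}: {el['content'][:40]}")
        walk(el["element_id"], depth + 1)

walk()
```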
Transforms flat element lists into self-contained page documents with title hierarchy tracking, markdown rendering, and global table numbering.
```json
{
  "metadata": {
    "uuid": "a1b2c3d4-...",
    "source": "research_paper.pdf",
    "page_no": "page1",
    "session_title": "# Executive Summary\n"
  },
  "page_content": "# Executive Summary\n\nThe company reported strong Q2 results...\n\n[[Table1]]\n| Metric | Q1 | Q2 |\n|---|---|---|\n| Revenue | 10M | 15M |\n[/Table1]"
}
```

Each page carries its full title context, so pages are independently meaningful — ideal for chunked retrieval in RAG systems.
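Since each silver page is one self-contained JSON object per line, feeding pages into a retrieval index is plain line-by-line parsing. A sketch with inline sample data (a real pipeline would read the `.jsonl` output file and hand each page to a vector store):

```python
import json

# Two silver pages exactly as they would appear in the .jsonl file
# (inline sample data for illustration)
jsonl_text = (
    '{"metadata": {"page_no": "page1", "session_title": "# Executive Summary\\n"}, '
    '"page_content": "# Executive Summary\\n\\nStrong Q2 results..."}\n'
    '{"metadata": {"page_no": "page2", "session_title": "# Financials\\n"}, '
    '"page_content": "# Financials\\n\\nRevenue grew 50%..."}\n'
)

pages = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Each page already carries its title context, so it can be chunked/embedded as-is
for page in pages:
    print(page["metadata"]["page_no"], "->", len(page["page_content"]), "chars")
```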
Adds semantic descriptions and keywords to each page using DeepSeek-chat, with exponential backoff retry and per-page error resilience.
```json
{
  "metadata": {
    "uuid": "a1b2c3d4-...",
    "source": "research_paper.pdf",
    "page_no": "page1",
    "session_title": "# Executive Summary\n",
    "description": "Company reports strong Q2 results with 50% revenue growth",
    "keywords": ["revenue", "quarterly results", "growth", "financial performance"]
  },
  "page_content": "..."
}
```

The gold stage is optional — bronze and silver run fully offline with zero API calls.
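The enriched keywords enable cheap pre-filtering before any embedding search. A sketch with made-up sample pages:

```python
import json

# Sample gold pages (metadata fields as shown above)
gold_lines = [
    '{"metadata": {"page_no": "page1", "keywords": ["revenue", "growth"]}, "page_content": "..."}',
    '{"metadata": {"page_no": "page2", "keywords": ["methodology"]}, "page_content": "..."}',
]
pages = [json.loads(line) for line in gold_lines]

def pages_with_keyword(pages, keyword):
    """Return pages whose AI-extracted keywords include `keyword`."""
    return [p for p in pages if keyword in p["metadata"].get("keywords", [])]

print([p["metadata"]["page_no"] for p in pages_with_keyword(pages, "revenue")])
```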
After processing research_paper.pdf:
```
research_paper.pdf                     # Original (untouched)
research_paper_a3f5c2/                 # Output folder (name + MD5 suffix)
├── research_paper_a3f5c2.json         # Bronze: structured elements
├── research_paper_a3f5c2.jsonl        # Silver: page-by-page documents
└── research_paper_a3f5c2_gold.jsonl   # Gold: AI-enriched (optional)
```
Output folder names are sanitized and include an MD5 hash suffix for uniqueness, ensuring safe cross-platform filenames even for PDFs with unicode or special characters.
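One plausible way such a name could be derived (illustrative only; DocMeld's actual sanitization rules, hash input, and suffix length may differ):

```python
import hashlib
import re

def safe_output_name(filename: str, data: bytes) -> str:
    """Sanitize a file stem and append a short MD5 suffix.
    Illustrative sketch, not DocMeld's exact scheme."""
    stem = filename.rsplit(".", 1)[0]
    # Collapse runs of non-filesystem-safe characters into underscores
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    # Short content hash keeps names unique across files with identical stems
    suffix = hashlib.md5(data).hexdigest()[:6]
    return f"{stem}_{suffix}"

print(safe_output_name("résumé (final).pdf", b"%PDF-1.7 ..."))
```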
```python
from docmeld import DocMeldParser

# Single file — all three stages
parser = DocMeldParser("paper.pdf")
result = parser.process_all()

# Batch — process every PDF in a folder
parser = DocMeldParser("/path/to/papers/")
result = parser.process_all()
print(f"{result.successful}/{result.total_files} files, {result.processing_time_seconds}s")
```

```python
from docmeld import DocMeldParser

parser = DocMeldParser("paper.pdf")

# Bronze only
bronze = parser.process_bronze()
print(f"{bronze.element_count} elements across {bronze.page_count} pages")
print(f"Output: {bronze.output_path}")

# Silver (requires bronze output)
silver = parser.process_silver(bronze.output_path)
print(f"{silver.page_count} pages → {silver.output_path}")

# Gold (requires silver output + API key)
gold = parser.process_gold(silver.output_path)
print(f"{gold.pages_enriched} enriched, {gold.pages_failed} failed")
```

DocMeld supports multiple PDF parsing backends through a pluggable architecture:
```python
# Default: PyMuPDF (lightweight, fast)
parser = DocMeldParser("paper.pdf", backend="pymupdf")

# Alternative: Docling (IBM's ML-powered parser, better for complex layouts)
parser = DocMeldParser("paper.pdf", backend="docling")
```

```python
import json

# Load bronze output
with open("paper_a3f5c2/paper_a3f5c2.json") as f:
    elements = json.load(f)

# Filter by type
titles = [e for e in elements if e["type"] == "title"]
tables = [e for e in elements if e["type"] == "table"]

# Navigate hierarchy via parent_id
for elem in elements:
    if elem["parent_id"] == "e_001":
        print(f"  Child of first title: {elem['content'][:50]}")

# Access structured table data
for table in tables:
    headers = table["table_data"]["headers"]
    rows = table["table_data"]["rows"]
    print(f"Table: {len(rows)} rows × {len(headers)} cols")
```

All pipeline stages return typed Pydantic models:
```
BronzeResult(output_path, output_dir, element_count, page_count, skipped)
SilverResult(output_path, page_count, skipped)
GoldResult(output_path, pages_enriched, pages_failed, skipped)
ProcessingResult(total_files, successful, failed, failures, processing_time_seconds, ...)
```

```bash
# Full pipeline (bronze → silver → gold)
docmeld process paper.pdf
docmeld process /path/to/papers/

# Individual stages
docmeld bronze paper.pdf                        # PDF → JSON
docmeld silver paper_a3f5c2/paper_a3f5c2.json   # JSON → JSONL
docmeld gold paper_a3f5c2/paper_a3f5c2.jsonl    # JSONL → enriched JSONL

# Choose parsing backend
docmeld bronze paper.pdf --backend docling
docmeld process paper.pdf --backend pymupdf     # default
```

Create a `.env.local` file in your working directory:
```bash
DEEPSEEK_API_KEY=your_key_here

# Optional: custom API endpoint
# DEEPSEEK_API_ENDPOINT=https://api.deepseek.com
```

The gold stage is entirely optional. Bronze and silver stages run offline with no API keys, no network calls, and no model downloads.
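If you want to decide at runtime whether the gold stage is available, a trivial guard works (illustrative; DocMeld itself loads `.env.local` through its own env loader):

```python
import os

def gold_enabled() -> bool:
    """Illustrative check: gold needs a DeepSeek key; bronze/silver do not."""
    return bool(os.getenv("DEEPSEEK_API_KEY"))

print("gold stage:", "enabled" if gold_enabled() else "skipped (bronze + silver only)")
```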
DocMeld writes timestamped log files (`docmeld_YYYYMMDD_HHMMSS.log`) to the working directory. Console output shows INFO-level messages; log files capture full DEBUG output.
DocMeld enforces a strict element schema via Pydantic models. This contract guarantees downstream consumers always get a predictable structure.
```python
from docmeld.bronze.element_types import (
    TitleElement,  # type, level, content, page_no, element_id, parent_id
    TextElement,   # type, content, page_no, element_id, parent_id
    TableElement,  # type, content, summary, page_no, element_id, parent_id, table_data
    ImageElement,  # type, image_name, content, image, image_id, bbox, page_no, element_id, parent_id
)
```

Element types are validated at creation time. New types may be added in minor versions, but existing types will never change shape in minor/patch releases.
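For downstream consumers that receive raw JSON rather than model instances, the contract can be spot-checked in a few lines (illustrative; the authoritative validation is the Pydantic models above):

```python
# Fields shared by every element type, per the schema above
BASE_FIELDS = {"type", "page_no", "element_id", "parent_id"}

def check_element(el: dict) -> bool:
    """Raise ValueError if a raw element dict is missing shared base fields."""
    missing = BASE_FIELDS - el.keys()
    if missing:
        raise ValueError(f"element {el.get('element_id', '?')} missing {sorted(missing)}")
    return True

check_element({"type": "text", "content": "ok", "page_no": 1,
               "element_id": "e_002", "parent_id": "e_001"})
```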
- Bronze → Silver → Gold pipeline
- CLI interface with subcommands
- Swappable backends (PyMuPDF + Docling)
- Element hierarchy (`element_id`/`parent_id`)
- Structured table data extraction
- Idempotent processing
- Batch folder processing
- Research paper batch categorization
- Paper-to-PRD generation
- Paper-to-workflow extraction
- Book-to-Claude-Skills generation
- DOCX / PPTX support
- OCR for scanned PDFs (`pip install docmeld[ocr]`)
- Agent prompt generation
- LangChain / LlamaIndex integration
```bash
git clone https://github.com/[username]/docmeld.git
cd docmeld
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
```

```bash
pytest tests/ -v --cov=docmeld   # 144 tests, 81% coverage
ruff check docmeld/              # Linting
black --check docmeld/           # Formatting
mypy docmeld/                    # Strict type checking
```

```
docmeld/
├── docmeld/
│   ├── __init__.py                  # Public API (DocMeldParser, __version__)
│   ├── parser.py                    # Pipeline orchestrator
│   ├── cli.py                       # CLI entry point (argparse)
│   ├── bronze/
│   │   ├── backends/
│   │   │   ├── pymupdf_backend.py   # PyMuPDF + pymupdf4llm
│   │   │   └── docling_backend.py   # Docling (optional)
│   │   ├── element_extractor.py     # Extraction + post-processing
│   │   ├── element_types.py         # Pydantic element models
│   │   ├── filename_sanitizer.py    # Safe filenames + MD5 hashing
│   │   └── processor.py             # Bronze orchestrator
│   ├── silver/
│   │   ├── page_aggregator.py       # Group elements by page
│   │   ├── page_models.py           # Result models (Pydantic)
│   │   ├── markdown_renderer.py     # Elements → markdown
│   │   ├── title_tracker.py         # Title hierarchy state
│   │   └── processor.py             # Silver orchestrator
│   ├── gold/
│   │   ├── deepseek_client.py       # API client + retry logic
│   │   ├── metadata_extractor.py    # Content → description + keywords
│   │   └── processor.py             # Gold orchestrator
│   └── utils/
│       ├── env_loader.py            # .env.local loading
│       ├── logging.py               # Timestamped log setup
│       └── progress.py              # Progress indicators
├── tests/                           # Unit, integration, contract tests
├── pyproject.toml
├── CONTRIBUTING.md
├── CHANGELOG.md
└── LICENSE                          # MIT
```
We welcome contributions. See CONTRIBUTING.md for the full guide. The short version:
- Fork and clone
- Write tests first (TDD is non-negotiable)
- Run all quality gates before pushing
- Open a PR with a clear description
MIT License — see LICENSE for details.
```bibtex
@software{docmeld2026,
  title   = {DocMeld: Lightweight PDF to Agent-Ready Knowledge Pipeline},
  year    = {2026},
  license = {MIT},
  url     = {https://github.com/[username]/docmeld}
}
```