DocMeld

Lightweight Doc to agent-ready knowledge pipeline

Quick Start • Architecture • Python API • CLI • Configuration • Contributing

DocMeld converts PDF, Word documents into structured, agent-consumable formats through a three-stage pipeline — without requiring expensive OCR, VLM, or multimodal models. Built for the age of AI agents, it bridges the gap between static documents and the structured knowledge that LLMs need.

Most tools stop at format conversion. DocMeld goes further: Document → Structured Elements → Page Knowledge → AI-Enriched Metadata, producing outputs ready for RAG pipelines, agent systems, and downstream AI workflows.

Why DocMeld?

	DocMeld	MinerU	Docling	Marker	MarkItDown
No ML models required	✅	❌	❌	❌	✅
Runs fully offline (core)	✅	❌	✅	✅	✅
Agent-ready outputs	✅	❌	❌	❌	❌
AI metadata enrichment	✅	❌	❌	❌	❌
Lightweight install	✅	❌	❌	❌	✅
MIT license	✅	❌ (AGPL)	✅	❌ (GPL)	✅
Swappable backends	✅	❌	N/A	❌	❌

Quick Start

Installation

pip install docmeld

With optional Docling backend:

pip install docmeld[docling]

Process your first PDF

from docmeld import DocMeldParser

parser = DocMeldParser("research_paper.pdf")
result = parser.process_all()
print(f"Processed {result.successful}/{result.total_files} files in {result.processing_time_seconds}s")

Or from the command line:

docmeld process research_paper.pdf

That's it. Your PDF is now structured JSON, page-by-page JSONL, and (optionally) AI-enriched metadata.

Pipeline Architecture

DocMeld uses a three-stage medallion architecture. Each stage is independently runnable and idempotent — re-running skips already-processed files.

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   BRONZE    │      │   SILVER    │      │    GOLD     │
│             │      │             │      │             │
│  PDF → JSON │─────▶│ JSON → JSONL│─────▶│ JSONL → AI  │
│  elements   │      │  pages      │      │  metadata   │
│             │      │             │      │             │
│  PyMuPDF /  │      │  Title      │      │  DeepSeek   │
│  Docling    │      │  hierarchy  │      │  enrichment │
└─────────────┘      └─────────────┘      └─────────────┘
   offline              offline            requires API key

Bronze: PDF → Structured JSON

Extracts document elements (titles, text, tables, images) into a unified JSON format with element IDs and parent-child hierarchy.

[
  {
    "type": "title",
    "level": 0,
    "content": "Executive Summary",
    "page_no": 1,
    "element_id": "e_001",
    "parent_id": ""
  },
  {
    "type": "text",
    "content": "The company reported strong Q2 results...",
    "page_no": 1,
    "element_id": "e_002",
    "parent_id": "e_001"
  },
  {
    "type": "table",
    "content": "| Metric | Q1 | Q2 |\n|---|---|---|\n| Revenue | 10M | 15M |",
    "summary": "Items: Revenue",
    "page_no": 2,
    "element_id": "e_003",
    "parent_id": "e_001",
    "table_data": {
      "headers": ["Metric", "Q1", "Q2"],
      "rows": [["Revenue", "10M", "15M"]],
      "num_rows": 1,
      "num_cols": 3
    }
  }
]

Supported element types:

Type	Fields	Description
`title`	`level`, `content`	Headings with hierarchy (0–5)
`text`	`content`	Paragraph content
`table`	`content`, `summary`, `table_data`	Markdown tables with structured data
`image`	`image_name`, `image`, `bbox`, `image_id`	Base64-encoded images with metadata

All elements include page_no, element_id, and parent_id for cross-referencing.

Silver: JSON → Page-by-Page JSONL

Transforms flat element lists into self-contained page documents with title hierarchy tracking, markdown rendering, and global table numbering.

{
  "metadata": {
    "uuid": "a1b2c3d4-...",
    "source": "research_paper.pdf",
    "page_no": "page1",
    "session_title": "# Executive Summary\n"
  },
  "page_content": "# Executive Summary\n\nThe company reported strong Q2 results...\n\n[[Table1]]\n| Metric | Q1 | Q2 |\n|---|---|---|\n| Revenue | 10M | 15M |\n[/Table1]"
}

Each page carries its full title context, so pages are independently meaningful — ideal for chunked retrieval in RAG systems.

Gold: JSONL → AI-Enriched Metadata

Adds semantic descriptions and keywords to each page using DeepSeek-chat, with exponential backoff retry and per-page error resilience.

{
  "metadata": {
    "uuid": "a1b2c3d4-...",
    "source": "research_paper.pdf",
    "page_no": "page1",
    "session_title": "# Executive Summary\n",
    "description": "Company reports strong Q2 results with 50% revenue growth",
    "keywords": ["revenue", "quarterly results", "growth", "financial performance"]
  },
  "page_content": "..."
}

The gold stage is optional — bronze and silver run fully offline with zero API calls.

Output Structure

After processing research_paper.pdf:

research_paper.pdf                              # Original (untouched)
research_paper_a3f5c2/                          # Output folder (name + MD5 suffix)
├── research_paper_a3f5c2.json                  # Bronze: structured elements
├── research_paper_a3f5c2.jsonl                 # Silver: page-by-page documents
└── research_paper_a3f5c2_gold.jsonl            # Gold: AI-enriched (optional)

Output folder names are sanitized and include an MD5 hash suffix for uniqueness, ensuring safe cross-platform filenames even for PDFs with unicode or special characters.

Python API

Full Pipeline

from docmeld import DocMeldParser

# Single file — all three stages
parser = DocMeldParser("paper.pdf")
result = parser.process_all()

# Batch — process every PDF in a folder
parser = DocMeldParser("/path/to/papers/")
result = parser.process_all()
print(f"{result.successful}/{result.total_files} files, {result.processing_time_seconds}s")

Individual Stages

from docmeld import DocMeldParser

parser = DocMeldParser("paper.pdf")

# Bronze only
bronze = parser.process_bronze()
print(f"{bronze.element_count} elements across {bronze.page_count} pages")
print(f"Output: {bronze.output_path}")

# Silver (requires bronze output)
silver = parser.process_silver(bronze.output_path)
print(f"{silver.page_count} pages → {silver.output_path}")

# Gold (requires silver output + API key)
gold = parser.process_gold(silver.output_path)
print(f"{gold.pages_enriched} enriched, {gold.pages_failed} failed")

Swappable Backends

DocMeld supports multiple PDF parsing backends through a pluggable architecture:

# Default: PyMuPDF (lightweight, fast)
parser = DocMeldParser("paper.pdf", backend="pymupdf")

# Alternative: Docling (IBM's ML-powered parser, better for complex layouts)
parser = DocMeldParser("paper.pdf", backend="docling")

Working with Elements

import json

# Load bronze output
with open("paper_a3f5c2/paper_a3f5c2.json") as f:
    elements = json.load(f)

# Filter by type
titles = [e for e in elements if e["type"] == "title"]
tables = [e for e in elements if e["type"] == "table"]

# Navigate hierarchy via parent_id
for elem in elements:
    if elem["parent_id"] == "e_001":
        print(f"  Child of first title: {elem['content'][:50]}")

# Access structured table data
for table in tables:
    headers = table["table_data"]["headers"]
    rows = table["table_data"]["rows"]
    print(f"Table: {len(rows)} rows × {len(headers)} cols")

Result Models

All pipeline stages return typed Pydantic models:

BronzeResult(output_path, output_dir, element_count, page_count, skipped)
SilverResult(output_path, page_count, skipped)
GoldResult(output_path, pages_enriched, pages_failed, skipped)
ProcessingResult(total_files, successful, failed, failures, processing_time_seconds, ...)

CLI Reference

# Full pipeline (bronze → silver → gold)
docmeld process paper.pdf
docmeld process /path/to/papers/

# Individual stages
docmeld bronze paper.pdf                    # PDF → JSON
docmeld silver paper_a3f5c2/paper_a3f5c2.json    # JSON → JSONL
docmeld gold paper_a3f5c2/paper_a3f5c2.jsonl     # JSONL → enriched JSONL

# Choose parsing backend
docmeld bronze paper.pdf --backend docling
docmeld process paper.pdf --backend pymupdf       # default

Configuration

Gold Stage (AI Enrichment)

Create a .env.local file in your working directory:

DEEPSEEK_API_KEY=your_key_here

# Optional: custom API endpoint
# DEEPSEEK_API_ENDPOINT=https://api.deepseek.com

The gold stage is entirely optional. Bronze and silver stages run offline with no API keys, no network calls, and no model downloads.

Logging

DocMeld writes timestamped log files (docmeld_YYYYMMDD_HHMMSS.log) to the working directory. Console output shows INFO-level messages; log files capture full DEBUG output.

Unified Element Schema

DocMeld enforces a strict element schema via Pydantic models. This contract guarantees downstream consumers always get a predictable structure.

from docmeld.bronze.element_types import (
    TitleElement,    # type, level, content, page_no, element_id, parent_id
    TextElement,     # type, content, page_no, element_id, parent_id
    TableElement,    # type, content, summary, page_no, element_id, parent_id, table_data
    ImageElement,    # type, image_name, content, image, image_id, bbox, page_no, element_id, parent_id
)

Element types are validated at creation time. New types may be added in minor versions, but existing types will never change shape in minor/patch releases.

Roadmap

Development

Setup

git clone https://github.com/[username]/docmeld.git
cd docmeld
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

Quality Gates

pytest tests/ -v --cov=docmeld       # 144 tests, 81% coverage
ruff check docmeld/                   # Linting
black --check docmeld/                # Formatting
mypy docmeld/                         # Strict type checking

Project Structure

docmeld/
├── docmeld/
│   ├── __init__.py              # Public API (DocMeldParser, __version__)
│   ├── parser.py                # Pipeline orchestrator
│   ├── cli.py                   # CLI entry point (argparse)
│   ├── bronze/
│   │   ├── backends/
│   │   │   ├── pymupdf_backend.py   # PyMuPDF + pymupdf4llm
│   │   │   └── docling_backend.py   # Docling (optional)
│   │   ├── element_extractor.py     # Extraction + post-processing
│   │   ├── element_types.py         # Pydantic element models
│   │   ├── filename_sanitizer.py    # Safe filenames + MD5 hashing
│   │   └── processor.py            # Bronze orchestrator
│   ├── silver/
│   │   ├── page_aggregator.py       # Group elements by page
│   │   ├── page_models.py          # Result models (Pydantic)
│   │   ├── markdown_renderer.py     # Elements → markdown
│   │   ├── title_tracker.py         # Title hierarchy state
│   │   └── processor.py            # Silver orchestrator
│   ├── gold/
│   │   ├── deepseek_client.py       # API client + retry logic
│   │   ├── metadata_extractor.py    # Content → description + keywords
│   │   └── processor.py            # Gold orchestrator
│   └── utils/
│       ├── env_loader.py            # .env.local loading
│       ├── logging.py               # Timestamped log setup
│       └── progress.py              # Progress indicators
├── tests/                       # Unit, integration, contract tests
├── pyproject.toml
├── CONTRIBUTING.md
├── CHANGELOG.md
└── LICENSE                      # MIT

Contributing

We welcome contributions. See CONTRIBUTING.md for the full guide. The short version:

Fork and clone
Write tests first (TDD is non-negotiable)
Run all quality gates before pushing
Open a PR with a clear description

License

MIT License — see LICENSE for details.

Citation

@software{docmeld2026,
  title     = {DocMeld: Lightweight PDF to Agent-Ready Knowledge Pipeline},
  year      = {2026},
  license   = {MIT},
  url       = {https://github.com/[username]/docmeld}
}

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.claude/commands		.claude/commands
.specify		.specify
docmeld		docmeld
references		references
specs		specs
.DS_Store		.DS_Store
.env.local.template		.env.local.template
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

DocMeld

Why DocMeld?

Quick Start

Installation

Process your first PDF

Pipeline Architecture

Bronze: PDF → Structured JSON

Silver: JSON → Page-by-Page JSONL

Gold: JSONL → AI-Enriched Metadata

Output Structure

Python API

Full Pipeline

Individual Stages

Swappable Backends

Working with Elements

Result Models

CLI Reference

Configuration

Gold Stage (AI Enrichment)

Logging

Unified Element Schema

Roadmap

Development

Setup

Quality Gates

Project Structure

Contributing

License

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages