Skip to content

agentii-ai/DocMeld

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DocMeld Banner

DocMeld

Lightweight Doc to agent-ready knowledge pipeline

License: MIT Python 3.9+ Code style: black Ruff Checked with mypy Tests: 144 passed Coverage: 81%

Quick StartArchitecturePython APICLIConfigurationContributing


DocMeld converts PDF, Word documents into structured, agent-consumable formats through a three-stage pipeline — without requiring expensive OCR, VLM, or multimodal models. Built for the age of AI agents, it bridges the gap between static documents and the structured knowledge that LLMs need.

Most tools stop at format conversion. DocMeld goes further: Document → Structured Elements → Page Knowledge → AI-Enriched Metadata, producing outputs ready for RAG pipelines, agent systems, and downstream AI workflows.

Why DocMeld?

DocMeld MinerU Docling Marker MarkItDown
No ML models required
Runs fully offline (core)
Agent-ready outputs
AI metadata enrichment
Lightweight install
MIT license ❌ (AGPL) ❌ (GPL)
Swappable backends N/A

Quick Start

Installation

pip install docmeld

With optional Docling backend:

pip install docmeld[docling]

Process your first PDF

from docmeld import DocMeldParser

parser = DocMeldParser("research_paper.pdf")
result = parser.process_all()
print(f"Processed {result.successful}/{result.total_files} files in {result.processing_time_seconds}s")

Or from the command line:

docmeld process research_paper.pdf

That's it. Your PDF is now structured JSON, page-by-page JSONL, and (optionally) AI-enriched metadata.

Pipeline Architecture

DocMeld uses a three-stage medallion architecture. Each stage is independently runnable and idempotent — re-running skips already-processed files.

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   BRONZE    │      │   SILVER    │      │    GOLD     │
│             │      │             │      │             │
│  PDF → JSON │─────▶│ JSON → JSONL│─────▶│ JSONL → AI  │
│  elements   │      │  pages      │      │  metadata   │
│             │      │             │      │             │
│  PyMuPDF /  │      │  Title      │      │  DeepSeek   │
│  Docling    │      │  hierarchy  │      │  enrichment │
└─────────────┘      └─────────────┘      └─────────────┘
   offline              offline            requires API key

Bronze: PDF → Structured JSON

Extracts document elements (titles, text, tables, images) into a unified JSON format with element IDs and parent-child hierarchy.

[
  {
    "type": "title",
    "level": 0,
    "content": "Executive Summary",
    "page_no": 1,
    "element_id": "e_001",
    "parent_id": ""
  },
  {
    "type": "text",
    "content": "The company reported strong Q2 results...",
    "page_no": 1,
    "element_id": "e_002",
    "parent_id": "e_001"
  },
  {
    "type": "table",
    "content": "| Metric | Q1 | Q2 |\n|---|---|---|\n| Revenue | 10M | 15M |",
    "summary": "Items: Revenue",
    "page_no": 2,
    "element_id": "e_003",
    "parent_id": "e_001",
    "table_data": {
      "headers": ["Metric", "Q1", "Q2"],
      "rows": [["Revenue", "10M", "15M"]],
      "num_rows": 1,
      "num_cols": 3
    }
  }
]

Supported element types:

Type Fields Description
title level, content Headings with hierarchy (0–5)
text content Paragraph content
table content, summary, table_data Markdown tables with structured data
image image_name, image, bbox, image_id Base64-encoded images with metadata

All elements include page_no, element_id, and parent_id for cross-referencing.

Silver: JSON → Page-by-Page JSONL

Transforms flat element lists into self-contained page documents with title hierarchy tracking, markdown rendering, and global table numbering.

{
  "metadata": {
    "uuid": "a1b2c3d4-...",
    "source": "research_paper.pdf",
    "page_no": "page1",
    "session_title": "# Executive Summary\n"
  },
  "page_content": "# Executive Summary\n\nThe company reported strong Q2 results...\n\n[[Table1]]\n| Metric | Q1 | Q2 |\n|---|---|---|\n| Revenue | 10M | 15M |\n[/Table1]"
}

Each page carries its full title context, so pages are independently meaningful — ideal for chunked retrieval in RAG systems.

Gold: JSONL → AI-Enriched Metadata

Adds semantic descriptions and keywords to each page using DeepSeek-chat, with exponential backoff retry and per-page error resilience.

{
  "metadata": {
    "uuid": "a1b2c3d4-...",
    "source": "research_paper.pdf",
    "page_no": "page1",
    "session_title": "# Executive Summary\n",
    "description": "Company reports strong Q2 results with 50% revenue growth",
    "keywords": ["revenue", "quarterly results", "growth", "financial performance"]
  },
  "page_content": "..."
}

The gold stage is optional — bronze and silver run fully offline with zero API calls.

Output Structure

After processing research_paper.pdf:

research_paper.pdf                              # Original (untouched)
research_paper_a3f5c2/                          # Output folder (name + MD5 suffix)
├── research_paper_a3f5c2.json                  # Bronze: structured elements
├── research_paper_a3f5c2.jsonl                 # Silver: page-by-page documents
└── research_paper_a3f5c2_gold.jsonl            # Gold: AI-enriched (optional)

Output folder names are sanitized and include an MD5 hash suffix for uniqueness, ensuring safe cross-platform filenames even for PDFs with unicode or special characters.

Python API

Full Pipeline

from docmeld import DocMeldParser

# Single file — all three stages
parser = DocMeldParser("paper.pdf")
result = parser.process_all()

# Batch — process every PDF in a folder
parser = DocMeldParser("/path/to/papers/")
result = parser.process_all()
print(f"{result.successful}/{result.total_files} files, {result.processing_time_seconds}s")

Individual Stages

from docmeld import DocMeldParser

parser = DocMeldParser("paper.pdf")

# Bronze only
bronze = parser.process_bronze()
print(f"{bronze.element_count} elements across {bronze.page_count} pages")
print(f"Output: {bronze.output_path}")

# Silver (requires bronze output)
silver = parser.process_silver(bronze.output_path)
print(f"{silver.page_count} pages → {silver.output_path}")

# Gold (requires silver output + API key)
gold = parser.process_gold(silver.output_path)
print(f"{gold.pages_enriched} enriched, {gold.pages_failed} failed")

Swappable Backends

DocMeld supports multiple PDF parsing backends through a pluggable architecture:

# Default: PyMuPDF (lightweight, fast)
parser = DocMeldParser("paper.pdf", backend="pymupdf")

# Alternative: Docling (IBM's ML-powered parser, better for complex layouts)
parser = DocMeldParser("paper.pdf", backend="docling")

Working with Elements

import json

# Load bronze output
with open("paper_a3f5c2/paper_a3f5c2.json") as f:
    elements = json.load(f)

# Filter by type
titles = [e for e in elements if e["type"] == "title"]
tables = [e for e in elements if e["type"] == "table"]

# Navigate hierarchy via parent_id
for elem in elements:
    if elem["parent_id"] == "e_001":
        print(f"  Child of first title: {elem['content'][:50]}")

# Access structured table data
for table in tables:
    headers = table["table_data"]["headers"]
    rows = table["table_data"]["rows"]
    print(f"Table: {len(rows)} rows × {len(headers)} cols")

Result Models

All pipeline stages return typed Pydantic models:

BronzeResult(output_path, output_dir, element_count, page_count, skipped)
SilverResult(output_path, page_count, skipped)
GoldResult(output_path, pages_enriched, pages_failed, skipped)
ProcessingResult(total_files, successful, failed, failures, processing_time_seconds, ...)

CLI Reference

# Full pipeline (bronze → silver → gold)
docmeld process paper.pdf
docmeld process /path/to/papers/

# Individual stages
docmeld bronze paper.pdf                    # PDF → JSON
docmeld silver paper_a3f5c2/paper_a3f5c2.json    # JSON → JSONL
docmeld gold paper_a3f5c2/paper_a3f5c2.jsonl     # JSONL → enriched JSONL

# Choose parsing backend
docmeld bronze paper.pdf --backend docling
docmeld process paper.pdf --backend pymupdf       # default

Configuration

Gold Stage (AI Enrichment)

Create a .env.local file in your working directory:

DEEPSEEK_API_KEY=your_key_here

# Optional: custom API endpoint
# DEEPSEEK_API_ENDPOINT=https://api.deepseek.com

The gold stage is entirely optional. Bronze and silver stages run offline with no API keys, no network calls, and no model downloads.

Logging

DocMeld writes timestamped log files (docmeld_YYYYMMDD_HHMMSS.log) to the working directory. Console output shows INFO-level messages; log files capture full DEBUG output.

Unified Element Schema

DocMeld enforces a strict element schema via Pydantic models. This contract guarantees downstream consumers always get a predictable structure.

from docmeld.bronze.element_types import (
    TitleElement,    # type, level, content, page_no, element_id, parent_id
    TextElement,     # type, content, page_no, element_id, parent_id
    TableElement,    # type, content, summary, page_no, element_id, parent_id, table_data
    ImageElement,    # type, image_name, content, image, image_id, bbox, page_no, element_id, parent_id
)

Element types are validated at creation time. New types may be added in minor versions, but existing types will never change shape in minor/patch releases.

Roadmap

  • Bronze → Silver → Gold pipeline
  • CLI interface with subcommands
  • Swappable backends (PyMuPDF + Docling)
  • Element hierarchy (element_id / parent_id)
  • Structured table data extraction
  • Idempotent processing
  • Batch folder processing
  • Research paper batch categorization
  • Paper-to-PRD generation
  • Paper-to-workflow extraction
  • Book-to-Claude-Skills generation
  • DOCX / PPTX support
  • OCR for scanned PDFs (pip install docmeld[ocr])
  • Agent prompt generation
  • LangChain / LlamaIndex integration

Development

Setup

git clone https://github.com/[username]/docmeld.git
cd docmeld
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

Quality Gates

pytest tests/ -v --cov=docmeld       # 144 tests, 81% coverage
ruff check docmeld/                   # Linting
black --check docmeld/                # Formatting
mypy docmeld/                         # Strict type checking

Project Structure

docmeld/
├── docmeld/
│   ├── __init__.py              # Public API (DocMeldParser, __version__)
│   ├── parser.py                # Pipeline orchestrator
│   ├── cli.py                   # CLI entry point (argparse)
│   ├── bronze/
│   │   ├── backends/
│   │   │   ├── pymupdf_backend.py   # PyMuPDF + pymupdf4llm
│   │   │   └── docling_backend.py   # Docling (optional)
│   │   ├── element_extractor.py     # Extraction + post-processing
│   │   ├── element_types.py         # Pydantic element models
│   │   ├── filename_sanitizer.py    # Safe filenames + MD5 hashing
│   │   └── processor.py            # Bronze orchestrator
│   ├── silver/
│   │   ├── page_aggregator.py       # Group elements by page
│   │   ├── page_models.py          # Result models (Pydantic)
│   │   ├── markdown_renderer.py     # Elements → markdown
│   │   ├── title_tracker.py         # Title hierarchy state
│   │   └── processor.py            # Silver orchestrator
│   ├── gold/
│   │   ├── deepseek_client.py       # API client + retry logic
│   │   ├── metadata_extractor.py    # Content → description + keywords
│   │   └── processor.py            # Gold orchestrator
│   └── utils/
│       ├── env_loader.py            # .env.local loading
│       ├── logging.py               # Timestamped log setup
│       └── progress.py              # Progress indicators
├── tests/                       # Unit, integration, contract tests
├── pyproject.toml
├── CONTRIBUTING.md
├── CHANGELOG.md
└── LICENSE                      # MIT

Contributing

We welcome contributions. See CONTRIBUTING.md for the full guide. The short version:

  1. Fork and clone
  2. Write tests first (TDD is non-negotiable)
  3. Run all quality gates before pushing
  4. Open a PR with a clear description

License

MIT License — see LICENSE for details.

Citation

@software{docmeld2026,
  title     = {DocMeld: Lightweight PDF to Agent-Ready Knowledge Pipeline},
  year      = {2026},
  license   = {MIT},
  url       = {https://github.com/[username]/docmeld}
}

About

Lightweight Doc-to-agent-ready knowledge pipeline. Three-stage Bronze→Silver→Gold architecture extracts structured elements, page content, and AI-enriched metadata from research papers and books. Generate PRDs, workflows, topic clusters, and Claude Code skills from PDFs. No OCR required.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors