Document Analysis MCP Server

A Model Context Protocol (MCP) server for PDF document analysis. It uses the Claude API for multi-page extraction, structure analysis, and document classification, and Tesseract for OCR of scanned PDFs.

Features

Full PDF extraction — Iterative LLM analysis across all pages, not just the first chunk
Structure extraction — Tables, table of contents, figures, and headings
OCR support — Scanned PDF handling via Tesseract
Document classification — 17+ content types with extraction strategy recommendations
Hash-based caching — Prevents re-processing identical documents
API usage tracking — Per-session cost breakdown and usage summary

Architecture

The server is built on FastMCP using Streamable HTTP transport (stateless). Incoming tool calls are routed to handlers in tools/, which delegate to processors/ for PDF extraction (pdfplumber with pypdf fallback), LLM analysis (Anthropic API), and OCR (Tesseract). Results are cached by document hash. Configuration is managed via Pydantic settings loaded from environment variables.

MCP client → FastMCP server → tool handlers → processors → Claude API / Tesseract
                                                          → hash-based cache
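
To illustrate the caching step in the flow above, here is a minimal sketch of a hash-keyed cache with time-based expiry. It is not the project's implementation: the `HashCache` class, the `document_key` helper, and the one-JSON-file-per-document layout are assumptions made for illustration, and only the idea of keying results by document hash and expiring them after a number of days comes from this README.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_DAYS = 30  # mirrors the documented CACHE_TTL_DAYS default


def document_key(data: bytes) -> str:
    """Return a SHA-256 content hash; identical bytes always give the same key."""
    return hashlib.sha256(data).hexdigest()


class HashCache:
    """Hypothetical file-backed cache keyed by document hash (illustrative only)."""

    def __init__(self, cache_dir: Path, ttl_days: int = CACHE_TTL_DAYS) -> None:
        self.cache_dir = cache_dir
        self.ttl_seconds = ttl_days * 86400
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.json"

    def get(self, data: bytes) -> Optional[dict]:
        """Return the cached result for these bytes, or None if missing or expired."""
        path = self._path(document_key(data))
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # drop the expired entry
            return None
        return entry["result"]

    def put(self, data: bytes, result: dict) -> None:
        """Store a JSON-serializable result under the hash of the document bytes."""
        entry = {"stored_at": time.time(), "result": result}
        self._path(document_key(data)).write_text(json.dumps(entry))
```

Because the key depends only on the file contents, renaming or moving a PDF would still hit the cache in this sketch, while any change to the bytes would produce a new key.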

MCP Tools

Tool	Purpose
`pdf_extract_full_tool`	Full document extraction with optional LLM analysis (quick / comprehensive / deep)
`pdf_extract_structure_tool`	Extract TOC, tables, and headings
`pdf_classify_tool`	Classify document type and recommend extraction strategy
`pdf_ocr_tool`	Handle scanned PDFs via Tesseract
`pdf_kb_ingest_tool`	One-shot extraction + classification + chunking for knowledge base ingestion
`health_check`	Server health, API key status, cache stats
`cache_stats`	Cache statistics and usage information
`usage_summary`	API usage tracking and cost breakdown
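
For readers curious what a tool invocation looks like on the wire, the sketch below builds the JSON-RPC 2.0 `tools/call` request that the MCP specification defines. The `build_tool_call` helper is hypothetical and only constructs the payload. An MCP client such as Claude Code normally handles the initialization handshake and transport details. Argument names for the PDF tools are not documented in this README, so the example uses `health_check` with no arguments.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

_request_ids = count(1)  # JSON-RPC requests each carry an id


def build_tool_call(name: str, arguments: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an MCP `tools/call` request body for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


if __name__ == "__main__":
    # Print the payload a client would send for the health check tool.
    print(json.dumps(build_tool_call("health_check"), indent=2))
```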

Quick Start

Prerequisites

Python 3.10+
Tesseract OCR
Anthropic API key

# Ubuntu / Debian
sudo apt install tesseract-ocr

# macOS
brew install tesseract

Installation

git clone https://github.com/krisoye/document-analysis-mcp.git
cd document-analysis-mcp
pip install -e ".[dev]"

Configuration

export ANTHROPIC_API_KEY="sk-ant-..."      # Required for LLM analysis
export DOC_ANALYSIS_PORT=8766              # Default port
export DEPLOYMENT_MODE=development         # development | staging | production

Run

doc-analysis-server
# Server listens on http://localhost:8766 by default (MCP endpoint: /mcp)

Register with Claude Code

claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user

After registration, all MCP tools listed above are available in every Claude Code session.

Configuration

Variable	Default	Purpose
`ANTHROPIC_API_KEY`	(required)	Anthropic API key for Claude access
`DOC_ANALYSIS_HOST`	`127.0.0.1`	Server bind address (`0.0.0.0` for external access)
`DOC_ANALYSIS_PORT`	`8766`	Server port
`CACHE_DIR`	`/var/cache/document-analysis-mcp`	Cache directory for extraction results
`CACHE_TTL_DAYS`	`30`	Cache expiration in days
`DEFAULT_MODEL`	`claude-sonnet-4-20250514`	Claude model for extraction and analysis
`CLASSIFICATION_MODEL`	`claude-3-5-haiku-20241022`	Faster model for classification only
`MAX_TOKENS`	`4096`	Maximum tokens for LLM responses
`LOG_LEVEL`	`INFO`	Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`)

See src/document_analysis_mcp/config.py for the complete settings reference.
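
As a rough picture of how the variables in the table map to typed settings, here is a standard-library sketch. The project itself uses Pydantic settings, so this is a stand-in rather than the contents of config.py: the `Settings` dataclass, its field names, and `load_settings` are assumptions. Only the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view of the documented environment variables."""

    anthropic_api_key: Optional[str]
    host: str
    port: int
    cache_dir: str
    cache_ttl_days: int
    default_model: str
    classification_model: str
    max_tokens: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        anthropic_api_key=env.get("ANTHROPIC_API_KEY"),  # None means LLM analysis is unavailable
        host=env.get("DOC_ANALYSIS_HOST", "127.0.0.1"),
        port=int(env.get("DOC_ANALYSIS_PORT", "8766")),
        cache_dir=env.get("CACHE_DIR", "/var/cache/document-analysis-mcp"),
        cache_ttl_days=int(env.get("CACHE_TTL_DAYS", "30")),
        default_model=env.get("DEFAULT_MODEL", "claude-sonnet-4-20250514"),
        classification_model=env.get("CLASSIFICATION_MODEL", "claude-3-5-haiku-20241022"),
        max_tokens=int(env.get("MAX_TOKENS", "4096")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Passing an explicit mapping, for example `load_settings({"DOC_ANALYSIS_PORT": "9000"})`, makes the defaults easy to check without touching the real environment.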

Development

# Run tests
pytest

# Lint
ruff check src/

# Format
ruff format src/

# Type check
mypy src/

Tech Stack

Component	Library
MCP server framework	FastMCP
LLM analysis	Anthropic Python SDK
PDF text extraction	pdfplumber
PDF fallback / metadata	pypdf
OCR	Tesseract via pytesseract
Image processing	Pillow
Configuration / validation	Pydantic + pydantic-settings
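
The pairing of pdfplumber with a pypdf fallback can be sketched as a simple chain of extractors. This is an illustration under assumptions, not the project's processor code: the function names and the rule "fall through when an extractor raises or returns no text" are hypothetical. The library calls used (`pdfplumber.open`, `pypdf.PdfReader`, and `extract_text` on pages) are the public APIs of those packages, imported lazily so the fallback logic can be exercised without them installed.

```python
from typing import Callable, Iterable, List, Optional

Extractor = Callable[[str], List[str]]


def extract_with_pdfplumber(path: str) -> List[str]:
    """Return one text string per page using pdfplumber."""
    import pdfplumber  # lazy import: only needed when this extractor runs

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def extract_with_pypdf(path: str) -> List[str]:
    """Return one text string per page using pypdf."""
    from pypdf import PdfReader  # lazy import, as above

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def extract_pages(path: str, extractors: Optional[Iterable[Extractor]] = None) -> List[str]:
    """Try each extractor in order; return the first result that contains text."""
    if extractors is None:
        extractors = (extract_with_pdfplumber, extract_with_pypdf)
    errors = []
    for extractor in extractors:
        try:
            pages = extractor(path)
        except Exception as exc:  # a failed parser should not stop the chain
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if any(text.strip() for text in pages):
            return pages
        errors.append(f"{extractor.__name__}: no text extracted")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

In a sketch like this, the final `RuntimeError` is a natural point at which a caller might decide the PDF is scanned and hand it to OCR instead, though whether the server does so is not stated here.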

License

MIT — see LICENSE.
