Document Analysis MCP Server

A Model Context Protocol (MCP) server for PDF document analysis. Uses the Claude API for intelligent multi-page extraction, structure analysis, OCR, and document classification.

Features

  • Full PDF extraction — Iterative LLM analysis across all pages, not just the first chunk
  • Structure extraction — Tables, table of contents, figures, and headings
  • OCR support — Scanned PDF handling via Tesseract
  • Document classification — 17+ content types with extraction strategy recommendations
  • Hash-based caching — Prevents re-processing identical documents
  • API usage tracking — Per-session cost breakdown and usage summary
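
The first feature, analysing every page rather than only the first chunk, can be illustrated with a minimal sketch. This is not the server's actual implementation: the function names, the character budget, and the `analyze` callback (a stand-in for a Claude API call) are assumptions made for illustration.

```python
from typing import Callable, Iterator, Sequence


def batch_pages(pages: Sequence[str], max_chars: int) -> Iterator[list[str]]:
    """Group consecutive pages into batches of at most max_chars characters.

    A single page longer than max_chars still gets a batch of its own.
    """
    batch: list[str] = []
    size = 0
    for page in pages:
        # Close the current batch before it would exceed the budget.
        if batch and size + len(page) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(page)
        size += len(page)
    if batch:
        yield batch


def analyze_document(
    pages: Sequence[str],
    analyze: Callable[[str], str],  # hypothetical stand-in for an LLM call
    max_chars: int = 8000,          # hypothetical budget per request
) -> list[str]:
    """Run `analyze` over every batch so that no page is skipped."""
    return [analyze("\n\n".join(batch)) for batch in batch_pages(pages, max_chars)]


if __name__ == "__main__":
    pages = ["a" * 5, "b" * 5, "c" * 5]
    print(analyze_document(pages, lambda text: f"{len(text)} chars", max_chars=10))
    # → ['12 chars', '5 chars']
```

Batching by a size budget keeps each request bounded while still covering the whole document; a real implementation would more likely budget by tokens than by characters.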

Architecture

The server is built on FastMCP using Streamable HTTP transport (stateless). Incoming tool calls are routed to handlers in tools/, which delegate to processors/ for PDF extraction (pdfplumber with pypdf fallback), LLM analysis (Anthropic API), and OCR (Tesseract). Results are cached by document hash. Configuration is managed via Pydantic settings loaded from environment variables.

MCP client → FastMCP server → tool handlers → processors → Claude API / Tesseract
                                                          → hash-based cache
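
To make the "cached by document hash" step concrete, here is a minimal sketch of a content-addressed result cache with a time-to-live. It rests on assumptions: SHA-256 as the hash, one JSON file per document, and the names `document_hash` and `ResultCache` are all hypothetical. The server's real cache layout may differ.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Optional


def document_hash(pdf_bytes: bytes) -> str:
    """Hash the file contents, so identical documents share one cache key."""
    return hashlib.sha256(pdf_bytes).hexdigest()


class ResultCache:
    """Hypothetical on-disk cache: one JSON file per document hash."""

    def __init__(self, cache_dir: Path, ttl_days: int = 30) -> None:
        self.cache_dir = cache_dir
        self.ttl_seconds = ttl_days * 86400
        cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.json"

    def get(self, key: str) -> Optional[dict]:
        """Return the cached result, or None if it is missing or expired."""
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # drop the stale entry
            return None
        return entry["result"]

    def put(self, key: str, result: dict) -> None:
        """Store a result together with the time it was written."""
        entry = {"stored_at": time.time(), "result": result}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")
```

Keying on the file contents rather than the file name means a renamed or re-uploaded copy of the same PDF still hits the cache.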

MCP Tools

Tool                        Purpose
pdf_extract_full_tool       Full document extraction with optional LLM analysis (quick / comprehensive / deep)
pdf_extract_structure_tool  Extract TOC, tables, and headings
pdf_classify_tool           Classify document type and recommend extraction strategy
pdf_ocr_tool                Handle scanned PDFs via Tesseract
pdf_kb_ingest_tool          One-shot extraction + classification + chunking for knowledge base ingestion
health_check                Server health, API key status, cache stats
cache_stats                 Cache statistics and usage information
usage_summary               API usage tracking and cost breakdown
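
MCP clients invoke these tools with JSON-RPC 2.0 `tools/call` requests. The sketch below only builds the request body, to show what a client sends; an MCP client library normally does this for you. The helper name `tool_call_request` and the `file_path` argument are hypothetical, since the tools' parameter names are not documented here.

```python
import json
from itertools import count
from typing import Any, Optional

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def tool_call_request(name: str, arguments: Optional[dict[str, Any]] = None) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }
    return json.dumps(message)


if __name__ == "__main__":
    # "file_path" is an assumed parameter name, used only for illustration.
    print(tool_call_request("pdf_classify_tool", {"file_path": "/tmp/report.pdf"}))
```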

Quick Start

Prerequisites

  • Python 3.10+
  • Tesseract OCR
  • Anthropic API key

# Ubuntu / Debian
sudo apt install tesseract-ocr

# macOS
brew install tesseract

Installation

git clone https://github.com/krisoye/document-analysis-mcp.git
cd document-analysis-mcp
pip install -e ".[dev]"

Configuration

export ANTHROPIC_API_KEY="sk-ant-..."      # Required for LLM analysis
export DOC_ANALYSIS_PORT=8766              # Default port
export DEPLOYMENT_MODE=development         # development | staging | production

Run

doc-analysis-server
# Server starts at http://localhost:8766

Register with Claude Code

claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user

After registration, all MCP tools listed above are available in every Claude Code session.

Configuration

Variable              Default                           Purpose
ANTHROPIC_API_KEY     (required)                        Anthropic API key for Claude access
DOC_ANALYSIS_HOST     127.0.0.1                         Server bind address (0.0.0.0 for external access)
DOC_ANALYSIS_PORT     8766                              Server port
CACHE_DIR             /var/cache/document-analysis-mcp  Cache directory for extraction results
CACHE_TTL_DAYS        30                                Cache expiration in days
DEFAULT_MODEL         claude-sonnet-4-20250514          Claude model for extraction and analysis
CLASSIFICATION_MODEL  claude-3-5-haiku-20241022         Faster model for classification only
MAX_TOKENS            4096                              Maximum tokens for LLM responses
LOG_LEVEL             INFO                              Logging verbosity (DEBUG, INFO, WARNING, ERROR)

See src/document_analysis_mcp/config.py for the complete settings reference.
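
As a rough illustration of how these variables and defaults fit together, the sketch below reads them from the environment. It uses a standard-library dataclass rather than pydantic-settings so that it runs without dependencies; the `Settings` class, its field names, and `load_settings` are hypothetical and do not mirror the real config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are illustrative."""

    anthropic_api_key: Optional[str]
    host: str
    port: int
    cache_dir: str
    cache_ttl_days: int
    default_model: str
    classification_model: str
    max_tokens: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        anthropic_api_key=env.get("ANTHROPIC_API_KEY"),  # required for LLM analysis
        host=env.get("DOC_ANALYSIS_HOST", "127.0.0.1"),
        port=int(env.get("DOC_ANALYSIS_PORT", "8766")),
        cache_dir=env.get("CACHE_DIR", "/var/cache/document-analysis-mcp"),
        cache_ttl_days=int(env.get("CACHE_TTL_DAYS", "30")),
        default_model=env.get("DEFAULT_MODEL", "claude-sonnet-4-20250514"),
        classification_model=env.get("CLASSIFICATION_MODEL", "claude-3-5-haiku-20241022"),
        max_tokens=int(env.get("MAX_TOKENS", "4096")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing the environment as a parameter keeps the loader easy to test, for example `load_settings({"DOC_ANALYSIS_PORT": "9000"})`.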

Development

# Run tests
pytest

# Lint
ruff check src/

# Format
ruff format src/

# Type check
mypy src/

Tech Stack

Component                   Library
MCP server framework        FastMCP
LLM analysis                Anthropic Python SDK
PDF text extraction         pdfplumber
PDF fallback / metadata     pypdf
OCR                         Tesseract via pytesseract
Image processing            Pillow
Configuration / validation  Pydantic + pydantic-settings
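
The pdfplumber-then-pypdf fallback can be sketched as a chain of extractors tried in order. This is an assumption about how the fallback might be wired, not the project's actual code: the function names are hypothetical, and the two libraries are imported lazily so the fallback logic can be exercised without them installed.

```python
from typing import Callable, Iterable


def extract_with_pdfplumber(path: str) -> str:
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf(path: str) -> str:
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(
    path: str,
    extractors: Iterable[Callable[[str], str]] = (
        extract_with_pdfplumber,
        extract_with_pypdf,
    ),
) -> str:
    """Return the first non-empty text produced by the extractors, tried in order.

    Raises RuntimeError only if every extractor raised. Returns "" if at
    least one extractor ran but none found any text.
    """
    errors: list[Exception] = []
    succeeded = False
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(exc)
            continue
        succeeded = True
        if text.strip():
            return text
    if errors and not succeeded:
        raise RuntimeError(f"all extractors failed for {path}") from errors[-1]
    return ""
```

An empty result from every extractor is the kind of case where a scanned PDF would need the OCR path instead.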

License

MIT — see LICENSE.
