A Model Context Protocol (MCP) server for PDF document analysis. Uses the Claude API for intelligent multi-page extraction, structure analysis, OCR, and document classification.
- Full PDF extraction — Iterative LLM analysis across all pages, not just the first chunk
- Structure extraction — Tables, table of contents, figures, and headings
- OCR support — Scanned PDF handling via Tesseract
- Document classification — 17+ content types with extraction strategy recommendations
- Hash-based caching — Prevents re-processing identical documents
- API usage tracking — Per-session cost breakdown and usage summary
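
The "all pages, not just the first chunk" behavior in the feature list can be pictured as a loop over page batches. The following is a minimal sketch, not the server's actual code: `chunk_pages`, `analyze_document`, the `analyze_chunk` callback, and the `max_chars` budget are all hypothetical names chosen for illustration, and the real chunking strategy may differ.

```python
from typing import Callable, Iterator


def chunk_pages(pages: list[str], max_chars: int = 8000) -> Iterator[tuple[int, int, str]]:
    """Yield (first_page, last_page, text) chunks that cover every page in order.

    Page numbers are 1-based. A page longer than max_chars still becomes
    its own chunk instead of being dropped.
    """
    buffer: list[str] = []
    start = 1
    size = 0
    for number, text in enumerate(pages, start=1):
        # Flush the current chunk before it would exceed the budget.
        if buffer and size + len(text) > max_chars:
            yield start, number - 1, "\n".join(buffer)
            buffer, size, start = [], 0, number
        buffer.append(text)
        size += len(text)
    if buffer:
        yield start, len(pages), "\n".join(buffer)


def analyze_document(
    pages: list[str],
    analyze_chunk: Callable[[str], object],
    max_chars: int = 8000,
) -> list[dict]:
    """Call analyze_chunk (for example an LLM request) once per chunk."""
    return [
        {"pages": (first, last), "result": analyze_chunk(text)}
        for first, last, text in chunk_pages(pages, max_chars)
    ]
```

Passing the analysis step in as a callback keeps the batching logic testable without any API calls.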
The server is built on FastMCP using Streamable HTTP transport (stateless). Incoming tool calls are routed to handlers in tools/, which delegate to processors/ for PDF extraction (pdfplumber with pypdf fallback), LLM analysis (Anthropic API), and OCR (Tesseract). Results are cached by document hash. Configuration is managed via Pydantic settings loaded from environment variables.
MCP client → FastMCP server → tool handlers → processors → Claude API / Tesseract
→ hash-based cache
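
To illustrate the caching step in the flow above, here is a small file-backed sketch keyed by a hash of the PDF bytes. It rests on assumptions: the README does not say which hash function or storage layout the server uses, so SHA-256, the one-JSON-file-per-key layout, and the names `document_hash` and `ResultCache` are illustrative only.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def document_hash(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the document (assumed hash function)."""
    return hashlib.sha256(pdf_bytes).hexdigest()


class ResultCache:
    """Hypothetical cache: one JSON file per document hash, with a TTL."""

    def __init__(self, cache_dir: Path, ttl_days: int = 30) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_days * 24 * 60 * 60

    def _path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.json"

    def get(self, key: str):
        """Return the cached result, or None on a miss or an expired entry."""
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # expired: remove the file and report a miss
            return None
        return entry["result"]

    def put(self, key: str, result) -> None:
        """Store a JSON-serializable result under the given key."""
        entry = {"stored_at": time.time(), "result": result}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = ResultCache(Path(tmp))
        key = document_hash(b"%PDF-1.7 example bytes")
        if cache.get(key) is None:           # first request: miss
            cache.put(key, {"pages": 3})     # store the extraction result
        print(cache.get(key))                # second request: served from cache
```

Hashing the bytes rather than the filename means a renamed copy of the same document still hits the cache.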
| Tool | Purpose |
|---|---|
| `pdf_extract_full_tool` | Full document extraction with optional LLM analysis (quick / comprehensive / deep) |
| `pdf_extract_structure_tool` | Extract TOC, tables, and headings |
| `pdf_classify_tool` | Classify document type and recommend extraction strategy |
| `pdf_ocr_tool` | Handle scanned PDFs via Tesseract |
| `pdf_kb_ingest_tool` | One-shot extraction + classification + chunking for knowledge base ingestion |
| `health_check` | Server health, API key status, cache stats |
| `cache_stats` | Cache statistics and usage information |
| `usage_summary` | API usage tracking and cost breakdown |
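
MCP clients invoke these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a request body to show its shape; it does not send anything or perform the MCP initialization handshake. The helper `build_tool_call` is hypothetical, and `file_path` is an assumed argument name, since this README does not document the tools' input schemas.

```python
import json
from itertools import count

# Simple monotonically increasing JSON-RPC request ids.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request body for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# `file_path` is an assumption; consult the tool's published input schema.
request = build_tool_call("pdf_classify_tool", {"file_path": "/tmp/report.pdf"})
print(json.dumps(request, indent=2))
```

In practice an MCP client library handles this framing, so the sketch is mainly useful for reading logs or debugging traffic.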
- Python 3.10+
- Tesseract OCR
- Anthropic API key
# Ubuntu / Debian
sudo apt install tesseract-ocr
# macOS
brew install tesseract

git clone https://github.com/krisoye/document-analysis-mcp.git
cd document-analysis-mcp
pip install -e ".[dev]"

export ANTHROPIC_API_KEY="sk-ant-..." # Required for LLM analysis
export DOC_ANALYSIS_PORT=8766 # Default port
export DEPLOYMENT_MODE=development # development | staging | production

doc-analysis-server
# Server starts at http://localhost:8766

claude mcp add --transport http document-analysis http://localhost:8766/mcp -s user

After registration, all MCP tools listed above are available in every Claude Code session.
| Variable | Default | Purpose |
|---|---|---|
| `ANTHROPIC_API_KEY` | (required) | Anthropic API key for Claude access |
| `DOC_ANALYSIS_HOST` | `127.0.0.1` | Server bind address (`0.0.0.0` for external access) |
| `DOC_ANALYSIS_PORT` | `8766` | Server port |
| `CACHE_DIR` | `/var/cache/document-analysis-mcp` | Cache directory for extraction results |
| `CACHE_TTL_DAYS` | `30` | Cache expiration in days |
| `DEFAULT_MODEL` | `claude-sonnet-4-20250514` | Claude model for extraction and analysis |
| `CLASSIFICATION_MODEL` | `claude-3-5-haiku-20241022` | Faster model for classification only |
| `MAX_TOKENS` | `4096` | Maximum tokens for LLM responses |
| `LOG_LEVEL` | `INFO` | Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
See src/document_analysis_mcp/config.py for the complete settings reference.
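
For readers who want to see how the variables above map onto typed settings, here is a standard-library approximation. It is not the project's `config.py`: the real server uses Pydantic settings, whereas this sketch swaps in a frozen dataclass, and the names `Settings` and `load_settings` are hypothetical. Treating a missing API key as a hard error is also an assumption and may be stricter than the server's behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Defaults mirror the configuration table in this README."""

    anthropic_api_key: str
    host: str = "127.0.0.1"
    port: int = 8766
    cache_dir: str = "/var/cache/document-analysis-mcp"
    cache_ttl_days: int = 30
    default_model: str = "claude-sonnet-4-20250514"
    classification_model: str = "claude-3-5-haiku-20241022"
    max_tokens: int = 4096
    log_level: str = "INFO"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default) and coerce types."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY is required")
    return Settings(
        anthropic_api_key=api_key,
        host=env.get("DOC_ANALYSIS_HOST", Settings.host),
        port=int(env.get("DOC_ANALYSIS_PORT", Settings.port)),
        cache_dir=env.get("CACHE_DIR", Settings.cache_dir),
        cache_ttl_days=int(env.get("CACHE_TTL_DAYS", Settings.cache_ttl_days)),
        default_model=env.get("DEFAULT_MODEL", Settings.default_model),
        classification_model=env.get("CLASSIFICATION_MODEL", Settings.classification_model),
        max_tokens=int(env.get("MAX_TOKENS", Settings.max_tokens)),
        log_level=env.get("LOG_LEVEL", Settings.log_level).upper(),
    )
```

Accepting an explicit mapping makes the loader easy to exercise in tests, for example `load_settings({"ANTHROPIC_API_KEY": "sk-test", "DOC_ANALYSIS_PORT": "9000"})`.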
# Run tests
pytest
# Lint
ruff check src/
# Format
ruff format src/
# Type check
mypy src/

| Component | Library |
|---|---|
| MCP server framework | FastMCP |
| LLM analysis | Anthropic Python SDK |
| PDF text extraction | pdfplumber |
| PDF fallback / metadata | pypdf |
| OCR | Tesseract via pytesseract |
| Image processing | Pillow |
| Configuration / validation | Pydantic + pydantic-settings |
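
The pdfplumber-first, pypdf-fallback arrangement listed in the table can be sketched as a chain of extractors tried in order. This is an illustration under assumptions rather than the project's processor code: the function names are hypothetical, and the rule "fall through when an extractor raises or returns only whitespace" is one plausible policy among several.

```python
from typing import Callable, Optional, Sequence


def extract_with_pdfplumber(path: str) -> str:
    """Primary extractor; pdfplumber is imported lazily."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf(path: str) -> str:
    """Fallback extractor; pypdf is imported lazily."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(
    path: str,
    extractors: Optional[Sequence[Callable[[str], str]]] = None,
) -> str:
    """Try each extractor in order and return the first non-empty text."""
    if extractors is None:
        extractors = (extract_with_pdfplumber, extract_with_pypdf)
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any failure moves on to the next extractor
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text.strip():
            return text
        errors.append(f"{extractor.__name__}: no text extracted")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

When every extractor comes back empty, the document is probably scanned, which is the case the OCR path is meant to handle.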
MIT — see LICENSE.