With DocSlight, precisely parse and extract data from any document, including PDFs, scans, images, and Office files. It is an open-source AI project from ComPDF (KDAN ecosystem).
- If you find DocSlight useful, please consider giving us a ⭐ Star on GitHub. It helps us grow and improve.
- Got questions or ideas? Join the conversation in our Discussions.
Unlike traditional OCR tools, DocSlight combines AI-powered document parsing, OCR for 80+ languages, and structured data extraction in a single open-source platform. You can deploy it locally, or use the Cloud API for higher accuracy.
- Open-source document data extraction engine with no vendor lock-in
- OCR for 80+ languages with multilingual auto-detection
- Structured field extraction with bounding-box traceability
- Markdown / JSON output for downstream processing
- Web UI + CLI + Python SDK
- Local deployment or Cloud API
- Built for RAG, AI Agents, and enterprise document workflows
Common use cases:
- RAG pipelines and knowledge base construction
- Invoice processing and document information extraction
- Contract analysis and clause parsing
- AI copilots and AI agent tool integration
- Enterprise document automation and intelligent document processing (IDP)
Whether you're building a personal RAG project or a large-scale enterprise document automation system, DocSlight provides a scalable foundation for document understanding.
# 1. Install
pip install docslight
# 2. Parse a document
docslight parse invoice.pdf --mode local --output invoice.md
# 3. View the result
cat invoice.md

# 1. Install
pip install docslight
# 2. Set your API key
export COMPDF_API_KEY="your_public_key" # Get one at https://compdf.com
# 3. Parse with the cloud engine
docslight parse invoice.pdf --mode cloud --output json

Get the API Key: Log in to the ComPDF Console. On the API Key page, create or copy your publicKey.
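This README names two environment variables for the cloud key: `COMPDF_API_KEY` in the quick start and `DOCSLIGHT_API_KEY` in the CLI options. The following is a minimal sketch for resolving the key in your own scripts, assuming either variable may hold it. `resolve_api_key` is a hypothetical helper, not part of DocSlight, and the lookup order is an assumption.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from this README; which one the tool reads
# first is an assumption of this sketch.
KEY_VARS = ("COMPDF_API_KEY", "DOCSLIGHT_API_KEY")


def resolve_api_key(explicit: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return an explicit key if given, else the first non-empty env var."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    for name in KEY_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("No API key found; set one of: " + ", ".join(KEY_VARS))
```

Failing early with a clear message is usually easier to debug than letting a cloud request fail on a missing key.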
# Start the web interface
docslight web
python -m docslight.web_app --host 0.0.0.0 --port 8000
docker compose -f docker/docker-compose.yml up
# Open http://localhost:3022 and drag & drop files

All features above come with ComPDF; check them out here.
Need workflow automation, RBAC, audit logs, private deployment, or dedicated support? Explore Enterprise: https://www.compdf.com/ai/docslight
| Feature | DocSlight Lite (Local) | DocSlight Lite (Cloud) | DocSlight Enterprise (SaaS) | DocSlight Enterprise (Self-hosted Deployment) |
|---|---|---|---|---|
| Upload Files from Local | ✅ | ✅ | ✅ | ✅ |
| Upload Files from Cloud | ❌ | ❌ | ✅ | ✅ |
| Upload Files from DMS | ❌ | ❌ | ✅ | ✅ |
| Upload Files from Scanner | ❌ | ❌ | ✅ | ✅ |
| PDF Parsing | ✅ | ✅ | ✅ | ✅ |
| Image Parsing | ✅ | ✅ | ✅ | ✅ |
| Word / PPT / Excel Parsing | ✅ | ✅ | ✅ | ✅ |
| Markdown Output | ✅ | ✅ | ✅ | ✅ |
| JSON Output | ✅ | ✅ | ✅ | ✅ |
| PDF Extraction | Local LLM Required | ✅ | ✅ | ✅ |
| Image Extraction | Local LLM Required | ✅ | ✅ | ✅ |
| Word / PPT / Excel Extraction | Local LLM Required | ✅ | ✅ | ✅ |
| Legacy Office Formats for Parsing and Extraction (.doc/.ppt/.xls) | ❌ | ✅ | ✅ | ✅ |
| Batch Processing | ✅ | ❌ | ✅ | ✅ |
| Auto Classification | ❌ | ❌ | ✅ | ✅ |
| Human Review Workflow | ❌ | ❌ | ✅ | ✅ |
| Complex Layout Analysis | Basic | Advanced | Advanced | Advanced |
| OCR Optimization | Basic | Advanced | Advanced | Advanced |
| Result Traceability | ❌ | ❌ | ✅ | ✅ |
| Result Post-Processing | ❌ | ❌ | ✅ | ✅ |
| Intelligent Result Review | ❌ | ❌ | ✅ | ✅ |
| Custom Rule-Based Alerts | ❌ | ❌ | ✅ | ✅ |
| Webhook Integration | ❌ | ❌ | ✅ | ✅ |
| API Management | ❌ | Limited | ✅ | ✅ |
| Knowledge Base Integration | ❌ | ❌ | ✅ | ✅ |
| Audit Logs | ❌ | ❌ | ✅ | ✅ |
| RBAC | ❌ | ❌ | ✅ | ✅ |
| Tenant Support | ❌ | ❌ | ❌ | ✅ |
| Self-hosted Deployment | Local Only | ❌ | ❌ | ✅ |
| Dedicated GPU | ❌ | ❌ | Optional | ✅ |
- RAG Pipeline — Parse documents -> embed vectors -> query with an LLM
- Invoice Processing — Extract invoice numbers, dates, totals, and line items
- Contract Analysis — Parse clauses, parties, and dates with bounding-box traceability
- Document Digitization — Batch convert scanned archives into searchable text
- AI Agent Integration — Provide MCP-based document reading for Claude / ChatGPT
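Bounding-box traceability means an extracted value can be tied back to the region of the page it came from. The sketch below illustrates the idea on a hypothetical block structure (page number, text, and an assumed `(x0, y0, x1, y1)` box). The real DocSlight output schema may differ, and `Block` and `trace_value` are illustrative names only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical parsed block; the real output schema may differ."""
    page: int
    text: str
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so minor OCR spacing does not matter.
    return " ".join(text.split()).lower()


def trace_value(blocks: List[Block], value: str) -> List[Block]:
    """Return every block whose text contains the extracted value."""
    needle = _normalize(value)
    if not needle:
        return []
    return [b for b in blocks if needle in _normalize(b.text)]


blocks = [
    Block(1, "Invoice No: INV-2024-001", (72.0, 90.0, 300.0, 104.0)),
    Block(1, "Total due: 1,250.00 USD", (72.0, 640.0, 300.0, 654.0)),
]
hits = trace_value(blocks, "inv-2024-001")
for b in hits:
    print(f"page {b.page}, bbox {b.bbox}: {b.text}")
```

A reviewer can then highlight the returned boxes on the page image to confirm an extracted field against its source.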
Runnable example code is available in examples/:
- `cloud_parse.py`
- `cloud_extract.py`
- `local_parse.py`
- `local_extract_ollama.py`
- `local_extract_openai_compatible.py`
- `path_examples.py`
from docslight import Parser
# Local mode — open-source OCR and document parsing
parser = Parser(mode="local")
result = parser.parse("contract.pdf")
print(result.text) # Full Markdown text
print(result.metadata) # Pages, blocks, bounding boxes
# Cloud mode — higher-accuracy PDF parsing
parser = Parser(mode="cloud", api_key="your_key")
result = parser.parse("invoice.pdf")
print(result.text)
print(result.tables) # Structured table data
print(result.blocks[0].bbox)  # Bounding-box traceability

# Parse a PDF to Markdown
docslight parse document.pdf -o md
# Parse an image to JSON with bounding boxes
docslight parse scan.png -o json --bbox
# Field extraction (cloud mode)
docslight extract invoice.pdf --schema '{"fields": ["invoice_no", "date", "total"]}'
# Watch a directory for new files
docslight watch ./incoming/ --cloud

docslight parse [options] <input>
docslight extract [options] <input>

| Option | Description |
|---|---|
| `input` | Required input file path. |
| `--mode {cloud,local}` | Required processing mode. Use `cloud` for ComPDF Cloud, or `local` for offline local processing. |
| `--api-key API_KEY` | Cloud API key. Required in cloud mode unless `DOCSLIGHT_API_KEY` is already set. |
| `--base-url BASE_URL` | Optional custom cloud API base URL. |
| `--output, -o OUTPUT` | Output file path. For text formats, defaults to standard output; for ZIP output, it is required or recommended. |
| `--format {markdown,json,standard-json,zip}` | Parse output format. Defaults to Markdown unless the output path ends with `.zip`. |
| `--local-parser LOCAL_PARSER` | Optional local parser selector for local mode. |
| `--local-llm-provider LOCAL_LLM_PROVIDER` | Local LLM provider setting. Not required for parse-only workflows. |
| `--local-llm-model LOCAL_LLM_MODEL` | Local LLM model name. Not required for parse-only workflows. |
| `--local-llm-base-url LOCAL_LLM_BASE_URL` | Local LLM endpoint base URL for providers that require it. |
| `--local-llm-api-key LOCAL_LLM_API_KEY` | Local LLM API key for providers that require it. |
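When calling the CLI from Python, it can help to assemble the argument list in one place and validate it first. The sketch below uses only options listed in the table above (`--mode`, `--format`, `--output`) and places them before the input, following the `docslight parse [options] <input>` usage line. `build_parse_command` is a hypothetical helper, not something DocSlight ships.

```python
import shlex
from typing import List, Optional

# Allowed values copied from the options table in this README.
MODES = {"cloud", "local"}
FORMATS = {"markdown", "json", "standard-json", "zip"}


def build_parse_command(input_path: str, mode: str,
                        fmt: str = "markdown",
                        output: Optional[str] = None) -> List[str]:
    """Build an argv list for `docslight parse` from validated options."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {sorted(FORMATS)}")
    cmd = ["docslight", "parse", "--mode", mode, "--format", fmt]
    if output is not None:
        cmd += ["--output", output]
    cmd.append(input_path)  # the input path goes last, after the options
    return cmd


cmd = build_parse_command("scans/invoice 01.pdf", mode="local",
                          fmt="json", output="invoice.json")
# shlex.join quotes arguments so the printed line is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell quoting problems with paths that contain spaces.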
pip install "docslight[web]"
docslight web
python -m docslight.web_app --host 0.0.0.0 --port 8000
docker compose -f docker/docker-compose.yml up
# Open http://127.0.0.1:3022

| Capability | DocSlight | MinerU | PDF-Extract-Kit | ExtractThinker |
|---|---|---|---|---|
| PDF Parsing | ✅ | ✅ | ✅ | |
| OCR Support | ✅ | ✅ | ❌ | |
| Data Extraction | ✅ | ❌ | ❌ | ✅ |
| Web UI | ✅ | ❌ | ❌ | ❌ |
| CLI | ✅ | ✅ | ✅ | ❌ |
| Python SDK | ✅ | ✅ | ✅ | |
| Cloud API | ✅ | ❌ | ❌ | ❌ |
| Enterprise Deployment | ✅ | ❌ | ❌ | ❌ |
| Markdown Output | ✅ | ✅ | ✅ | |
| JSON Output | ✅ | ✅ | ✅ | ✅ |
| Multi-language OCR | ✅ | ❌ | | |
| Commercial Support | ✅ | ❌ | ❌ | ❌ |
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DocSlight Open-Source Document Parser & Extractor │
│ (LGPL License | Local + Cloud Dual Mode) │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Access Layer(Entry Points) │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────┐ ┌────────────────────────┐ ┌───────────────────┐ │
│ │ Docker Web UI(Primary) │ │ CLI │ │ Python SDK │ │
│ │ One-click container │ │ Command Line │ │ Native Code │ │
│ │ deployment, ready to │ │ Tool │ │ Integration │ │
│ │ use out of the box │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ docker compose -f │ │docslight parse <file> │ │ from docslight │ │
│ │ docker/compose.yml up │ │ │ │ import Parser │ │
│ │ │ │docslight extract <file>│ │ parser.parse() │ │
│ │ Browser access: │ │ │ │ │ │
│ │ http://localhost:3022 │ │ docslight web │ │ │ │
│ └──────────────────────────┘ └────────────────────────┘ └───────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Core Processing Router │
│ Auto-switch between Local and Cloud Engine via --mode / config │
└─────────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────┴─────────────────┐
│ │
▼ ▼
┌───────────────────────────────────┐ ┌─────────────────────────────────────┐
│ 🖥️ Local Mode(Lite Local) │ │ ☁️ Cloud Mode(Lite Cloud) │
│ (Free, Offline, CPU Support) │ │ (High Accuracy, API Key, GPU) │
├───────────────────────────────────┤ ├─────────────────────────────────────┤
│ • Input Formats: │ │ • Input Formats: │
│ PDF / Images / New Office │ │ + Legacy Office (.doc/.xls etc.)│
│ (.docx/.pptx/.xlsx) │ │ │
│ • Base OCR: PaddleOCR │ │ • High-Accuracy VLM OCR Engine │
│ • Basic Layout Analysis │ │ • Complex Layout (Tables/Formulas/ │
│ • Field Extraction: requires │ │ Multi-column) │
│ local LLM (Ollama/OpenAI │ │ • Built-in AI Field Extraction │
│ compatible) │ │ • Bounding Box (BBox) Traceability│
│ • Output: Markdown / JSON / Text │ │ • Output: Markdown / JSON / Text │
└───────────────────────────────────┘ └─────────────────────────────────────┘
│ │
└─────────────────┬─────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ AI Capability Layer(Engine Modules) │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌─────────────────────────────┐ │
│ │ OCR Engine │ │ Structure │ │ Field Extraction Module │ │
│ │ • Local: │ │ Analyzer │ │ • Template Extraction │ │
│ │ PaddleOCR │ │ • Block │ │ • Custom Fields │ │
│ │ • Cloud: │ │ Classification│ │ • Rules + LLM Combo │ │
│ │ VLM Engine │ │ • Table │ │ • BBox Traceability │ │
│ │ │ │ Detection │ │ │ │
│ │ │ │ • Key-Value │ │ │ │
│ │ │ │ Mapping │ │ │ │
│ │ │ │ • Formula │ │ │ │
│ │ │ │ Recognition │ │ │ │
│ └──────────────────┘ └──────────────────┘ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Output & Ecosystem Layer │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────────────────────┐ │
│ │ Standard │ │ AI Ecosystem │ │ Enterprise Extensions │ │
│ │ Output │ │ Integration │ │ (SaaS / Private Deployment) │ │
│ │ Formats │ │ │ │ │ │
│ │ • Markdown │ │ • LangChain │ │ • Workflow Orchestration │ │
│ │ • JSON │ │ • LlamaIndex │ │ • Knowledge Base / DMS │ │
│ │ • Text │ │ • CrewAI │ │ • RBAC / Audit Logs │ │
│ │ • with BBox │ │ • AutoGen │ │ • Smart Review / Custom Rules│ │
│ │ Coordinates│ │ • Haystack │ │ • Multi-tenancy / Private │ │
│ │ Tracing │ │ │ │ Deployment │ │
│ └───────────────┘ └───────────────┘ └───────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Target Scenarios: RAG Pipelines / AI Agents / Enterprise Document │ │
│ │ Automation / Intelligent Document Processing(IDP) │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
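The routing step in the diagram implies two checks before a file is dispatched: cloud mode needs an API key, and legacy Office formats (`.doc`, `.ppt`, `.xls`) are handled only by the cloud engine. The sketch below shows one way to express those checks in your own code. `route` and the returned engine labels are hypothetical, and the image extensions are an assumption, since the README only says "images".

```python
from pathlib import Path
from typing import Optional

MODERN = {".pdf", ".docx", ".pptx", ".xlsx"}
IMAGES = {".png", ".jpg", ".jpeg"}        # assumed; the README only says "images"
LEGACY_OFFICE = {".doc", ".ppt", ".xls"}  # cloud-only per the editions table


def route(path: str, mode: str, api_key: Optional[str] = None) -> str:
    """Return an engine label for a file, or raise if the request cannot be served."""
    ext = Path(path).suffix.lower()
    if mode not in ("local", "cloud"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "cloud" and not api_key:
        raise ValueError("cloud mode requires an API key")
    if ext in LEGACY_OFFICE and mode == "local":
        raise ValueError(f"{ext} is only supported in cloud mode")
    if ext not in MODERN | IMAGES | LEGACY_OFFICE:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    return f"{mode}-engine"
```

Running these checks up front gives a clear error before any OCR or network work starts.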
DocSlight helps developers build modern AI document workflows with open-source PDF parsing and open-source document data extraction.
- RAG systems
- AI assistants
- Enterprise knowledge bases
- AI agent workflows
- Document search engines
- MCP applications
- Intelligent document processing (IDP) for open-source workflows at any scale
Works with:
- OpenAI
- Claude
- Ollama
- LangChain
- LlamaIndex
- CrewAI
- AutoGen
- Haystack
PDF / Image / Office Document
↓
docslight
↓
Markdown / JSON Output
↓
Vector Database
↓
LLM / AI Agent
↓
Answers & Automation
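Between the Markdown output and the vector database, the text usually has to be split into chunks. The sketch below shows one simple approach, assuming heading-scoped chunks with a size cap. `chunk_markdown` is a hypothetical helper, and embedding and storage are left out because they depend on the stack you choose.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str, max_chars: int = 800) -> List[Dict[str, str]]:
    """Split Markdown into heading-scoped chunks of at most max_chars characters.

    Limitation: lines starting with '#' inside code fences are treated as headings.
    """
    chunks: List[Dict[str, str]] = []
    title = ""
    buf: List[str] = []

    def flush() -> None:
        text = "\n".join(buf).strip()
        # Sections longer than max_chars are cut into fixed-size windows.
        for start in range(0, len(text), max_chars):
            chunks.append({"title": title, "text": text[start:start + max_chars]})
        buf.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                 # close the previous section
            title = match.group(2)  # start a new one under this heading
        else:
            buf.append(line)
    flush()
    return chunks


doc = "# Invoice\nNo: INV-1\n\n## Items\nWidget x2\n"
for chunk in chunk_markdown(doc):
    print(chunk)
```

Keeping the heading with each chunk gives the retriever some context about where a passage came from.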
| Methods | Model Type | Parameters | Overall Score ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Table TEDS-S ↑ | Read Order Edit ↓ |
|---|---|---|---|---|---|---|---|---|
| DocSlight (Cloud) | Specialized VLMs | 0.9B | 96.45 | 0.0321 | 97.76 | 94.80 | 97.02 | 0.131 |
| MinerU2.5-Pro | Specialized VLMs | 1.2B | 95.75 | 0.036 | 97.45 | 93.42 | 95.92 | 0.120 |
| GLM-OCR | Specialized VLMs | 0.9B | 95.22 | 0.044 | 97.18 | 92.83 | 95.39 | 0.133 |
| PaddleOCR-VL-1.5 | Specialized VLMs | 0.9B | 94.93 | 0.038 | 96.89 | 91.67 | 94.37 | 0.130 |
| Ovis2.6-30B-A3B | Specialized VLMs | 30B | 93.70 | 0.035 | 95.17 | 89.44 | 92.40 | 0.135 |
| Logics-Parsing-v2 | Specialized VLMs | 4B | 93.33 | 0.041 | 95.65 | 88.42 | 91.98 | 0.137 |
| HunyuanOCR | Specialized VLMs | 1B | 89.95 | 0.088 | 87.68 | 91.01 | 93.23 | 0.171 |
| Qwen3-VL-235B | General VLMs | 235B | 89.78 | 0.063 | 92.55 | 83.07 | 86.75 | 0.166 |
| Dolphin-v2 | Specialized VLMs | 3B | 89.50 | 0.069 | 91.01 | 84.40 | 87.44 | 0.150 |
| GPT-5.2 | General VLMs | - | 86.59 | 0.114 | 88.21 | 82.95 | 87.93 | 0.193 |
| Mistral OCR | Specialized VLMs | - | 85.66 | 0.097 | 89.91 | 76.78 | 80.93 | 0.171 |
| Nanonets-OCR-s | Specialized VLMs | 3B | 83.61 | 0.108 | 81.46 | 80.18 | 84.51 | 0.213 |
| Marker | Pipeline Tools | - | 78.44 | 0.157 | 85.24 | 65.77 | 73.24 | 0.243 |
Methodology: Based on real human-annotated data and measured with character-level accuracy. The test set covers 500+ enterprise documents, including invoices, contracts, tables, and reports. The dataset is available at benchmarks/dataset.
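The Edit columns in the table are lower-is-better scores. The README does not define them; assuming they denote a normalized character-level edit distance between predicted and reference text, a score of that kind can be computed as in this sketch. The function names are illustrative and this is not the benchmark's own scoring code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(pred), len(truth))
    return 0.0 if longest == 0 else levenshtein(pred, truth) / longest


print(normalized_edit_distance("Total: 1,250.00", "Total: 1,250.80"))
```

Dividing by the longer string keeps the score in the range 0 to 1, so documents of different lengths stay comparable.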
| Package | Description |
|---|---|
| `docslight` | Core CLI + Python SDK |
| `docslight[web]` | Web UI for browser-based drag-and-drop workflows |
| `docslight[cloud]` | Cloud API client with higher accuracy |
| `docslight[all]` | All features |
pip install "docslight[all]"

Have suggestions? Start a discussion.
DocSlight is released as open source under the LGPL.
Commercial / Enterprise licenses with support for GPU self-hosted deployment are available at compdf.com.
Built by the ComPDF team.
Website · Docs · Enterprise Inquiries

