DocSlight - An Open-source Document Parser & Document Data Extraction Engine

DocSlight parses and extracts data from a wide range of documents, including PDFs, scans, images, and Office files. It is an open-source AI project from ComPDF (KDAN ecosystem).

If you find DocSlight useful, please consider giving us a ⭐ Star on GitHub. It helps us grow and improve.

Got questions or ideas? Join the conversation in our Discussions.

Quick Start • Product Editions • Usage • Benchmark • Cloud API → • Documentation

Why DocSlight?

Unlike traditional OCR tools, DocSlight combines AI-powered document parsing, OCR for 80+ languages, and structured data extraction in a single open-source platform. You can run it locally, or call the cloud API for higher accuracy.

Key Advantages

Open-source document data extraction engine with no vendor lock-in
OCR for 80+ languages with multilingual auto-detection
Structured field extraction with bounding-box traceability
Markdown / JSON output for downstream processing
Web UI + CLI + Python SDK
Local deployment or Cloud API
Built for RAG, AI Agents, and enterprise document workflows

Perfect For

RAG pipelines and knowledge base construction
Invoice processing and document information extraction
Contract analysis and clause parsing
AI copilots and AI agent tool integration
Enterprise document automation and intelligent document processing (IDP)

Whether you're building a personal RAG project or a large-scale enterprise document automation system, DocSlight provides a scalable foundation for document understanding.

Quick Start

Local Mode (Free, no registration required)

# 1. Install
pip install docslight

# 2. Parse a document
docslight parse invoice.pdf --mode local --output invoice.md

# 3. View the result
cat invoice.md

Cloud Mode (Higher accuracy, free quota available)

# 1. Install
pip install docslight

# 2. Set your API key
export COMPDF_API_KEY="your_public_key"    # Get one at https://compdf.com

# 3. Parse with the cloud engine
docslight parse invoice.pdf --mode cloud --format json

Get the API Key: Log in to the ComPDF Console. On the API Key page, create or copy your publicKey.

Web UI (Browser)

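# Option 1: start the web interface from the CLI
docslight web

# Option 2: run the web app module directly
python -m docslight.web_app --host 0.0.0.0 --port 8000

# Option 3: run with Docker Compose
docker compose -f docker/docker-compose.yml up
# Open http://localhost:3022 and drag & drop files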

All features above come with ComPDF — check them out here.

Product Editions

Need workflow automation, RBAC, audit logs, private deployment, or dedicated support? Explore Enterprise: https://www.compdf.com/ai/docslight

Feature	DocSlight Lite (Local)	DocSlight Lite (Cloud)	DocSlight Enterprise (SaaS)	DocSlight Enterprise (Self-hosted Deployment)
Upload Files from Local	✅	✅	✅	✅
Upload Files from Cloud	❌	❌	✅	✅
Upload Files from DMS	❌	❌	✅	✅
Upload Files from Scanner	❌	❌	✅	✅
PDF Parsing	✅	✅	✅	✅
Image Parsing	✅	✅	✅	✅
Word / PPT / Excel Parsing	✅	✅	✅	✅
Markdown Output	✅	✅	✅	✅
JSON Output	✅	✅	✅	✅
PDF Extraction	Local LLM Required	✅	✅	✅
Image Extraction	Local LLM Required	✅	✅	✅
Word / PPT / Excel Extraction	Local LLM Required	✅	✅	✅
Legacy Office Formats for Parsing and Extraction (.doc/.ppt/.xls)	❌	✅	✅	✅
Batch Processing	✅	❌	✅	✅
Auto Classification	❌	❌	✅	✅
Human Review Workflow	❌	❌	✅	✅
Complex Layout Analysis	Basic	Advanced	Advanced	Advanced
OCR Optimization	Basic	Advanced	Advanced	Advanced
Result Traceability	❌	❌	✅	✅
Result Post-Processing	❌	❌	✅	✅
Intelligent Result Review	❌	❌	✅	✅
Custom Rule-Based Alerts	❌	❌	✅	✅
Webhook Integration	❌	❌	✅	✅
API Management	❌	Limited	✅	✅
Knowledge Base Integration	❌	❌	✅	✅
Audit Logs	❌	❌	✅	✅
RBAC	❌	❌	✅	✅
Tenant Support	❌	❌	❌	✅
Self-hosted Deployment	Local Only	❌	❌	✅
Dedicated GPU	❌	❌	Optional	✅
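
The table above implies a simple routing rule for the Lite editions: legacy Office formats (.doc/.ppt/.xls) are only handled in cloud mode, while PDFs, images, and current Office formats can be processed locally. The sketch below shows one way a caller might pick a mode per file. The helper name `choose_mode` and the exact list of image extensions are assumptions for illustration and are not part of DocSlight.

```python
from pathlib import Path

# Formats the Lite (Local) edition is listed as parsing.
# The image extensions are an assumption; the README only says "Images".
LOCAL_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".docx", ".pptx", ".xlsx"}

# Legacy Office formats are marked as unsupported in Lite (Local).
LEGACY_OFFICE_EXTENSIONS = {".doc", ".ppt", ".xls"}


def choose_mode(path: str) -> str:
    """Return "local" or "cloud" for a file (hypothetical helper).

    Raises ValueError for extensions that appear in neither set.
    """
    suffix = Path(path).suffix.lower()
    if suffix in LEGACY_OFFICE_EXTENSIONS:
        return "cloud"
    if suffix in LOCAL_EXTENSIONS:
        return "local"
    raise ValueError(f"unsupported file type: {suffix or path}")


for name in ["contract.pdf", "scan.png", "report.DOC"]:
    print(name, "->", choose_mode(name))
```

A caller could pass the returned value as the `--mode` argument or as `Parser(mode=...)`.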

Use Cases

RAG Pipeline — Parse documents -> embed vectors -> query with an LLM
Invoice Processing — Extract invoice numbers, dates, totals, and line items
Contract Analysis — Parse clauses, parties, and dates with bounding-box traceability
Document Digitization — Batch convert scanned archives into searchable text
AI Agent Integration — Provide MCP-based document reading for Claude / ChatGPT

Runnable example code is available in examples/.

Usage

Python SDK

from docslight import Parser

# Local mode — open-source OCR and document parsing
parser = Parser(mode="local")
result = parser.parse("contract.pdf")
print(result.text)                    # Full Markdown text
print(result.metadata)                # Pages, blocks, bounding boxes

# Cloud mode — higher-accuracy PDF parsing
parser = Parser(mode="cloud", api_key="your_key")
result = parser.parse("invoice.pdf")
print(result.text)
print(result.tables)                  # Structured table data
print(result.blocks[0].bbox)          # Bounding-box traceability
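
Bounding-box traceability means an extracted value can be traced back to the region of the page it came from. The sketch below illustrates the idea with a stand-in `Block` dataclass. The field names and the `(x0, y0, x1, y1)` coordinate layout are assumptions for illustration, and the real `result.blocks` objects may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed layout: (x0, y0, x1, y1). Check the SDK docs for the real format.
BBox = Tuple[float, float, float, float]


@dataclass
class Block:
    """Stand-in for a parsed block; not the SDK's actual class."""
    page: int
    text: str
    bbox: BBox


def find_source_block(blocks: List[Block], value: str) -> Optional[Block]:
    """Return the first block whose text contains the extracted value."""
    needle = value.strip().lower()
    if not needle:
        return None
    for block in blocks:
        if needle in block.text.lower():
            return block
    return None


# Illustrative sample data, not real parser output.
blocks = [
    Block(page=1, text="Invoice No: INV-1042", bbox=(72.0, 90.0, 260.0, 108.0)),
    Block(page=1, text="Total: 1,250.00 USD", bbox=(72.0, 640.0, 300.0, 660.0)),
]

hit = find_source_block(blocks, "INV-1042")
if hit is not None:
    print(f"found on page {hit.page} at {hit.bbox}")
```

A review UI could use the returned page and box to highlight the source region next to the extracted field.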

CLI

# Parse a PDF to Markdown
docslight parse document.pdf --mode local --format markdown

# Parse an image to JSON with bounding boxes
docslight parse scan.png --mode local --format json --bbox

# Field extraction (cloud mode)
docslight extract invoice.pdf --mode cloud --schema '{"fields": ["invoice_no", "date", "total"]}'

# Watch a directory for new files
docslight watch ./incoming/ --cloud
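
When the field list comes from code rather than a hand-written string, the `--schema` JSON can be generated and shell-quoted programmatically. The sketch below builds the `{"fields": [...]}` shape shown above and checks a result for missing fields. Both helper names are hypothetical, and the flat dictionary used for the extraction result is an assumption because the README does not document the output shape.

```python
import json
import shlex
from typing import Dict, Iterable, List


def build_schema_argument(fields: Iterable[str]) -> str:
    """Serialize field names into the {"fields": [...]} JSON shape."""
    names = [name.strip() for name in fields if name.strip()]
    if len(set(names)) != len(names):
        raise ValueError("duplicate field names in schema")
    return json.dumps({"fields": names})


def missing_fields(schema_json: str, extracted: Dict[str, object]) -> List[str]:
    """List schema fields that are absent or empty in an extraction result.

    Assumes the result is a flat {field_name: value} mapping.
    """
    wanted = json.loads(schema_json)["fields"]
    return [name for name in wanted if extracted.get(name) in (None, "")]


schema = build_schema_argument(["invoice_no", "date", "total"])

# shlex.quote makes the JSON safe to paste into a POSIX shell command.
print("docslight extract invoice.pdf --mode cloud --schema " + shlex.quote(schema))

# Illustrative result with one absent and one empty field.
print(missing_fields(schema, {"invoice_no": "INV-1042", "total": ""}))
```

Checking for missing fields before downstream processing is a cheap way to route incomplete extractions to manual review.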

Parse | extract CLI options

docslight parse [options] <input>
docslight extract [options] <input>

Option	Description
`input`	Required input file path.
`--mode {cloud,local}`	Required processing mode. Use `cloud` for ComPDF Cloud, or `local` for offline local processing.
`--api-key API_KEY`	Cloud API key. Required in cloud mode unless `DOCSLIGHT_API_KEY` is already set.
`--base-url BASE_URL`	Optional custom cloud API base URL.
`--output, -o OUTPUT`	Output file path. For text formats, defaults to standard output; for ZIP output, it is required or recommended.
`--format {markdown,json,standard-json,zip}`	Parse output format. Defaults to Markdown unless the output path ends with `.zip`.
`--local-parser LOCAL_PARSER`	Optional local parser selector for local mode.
`--local-llm-provider LOCAL_LLM_PROVIDER`	Local LLM provider setting. Not required for parse-only workflows.
`--local-llm-model LOCAL_LLM_MODEL`	Local LLM model name. Not required for parse-only workflows.
`--local-llm-base-url LOCAL_LLM_BASE_URL`	Local LLM endpoint base URL for providers that require it.
`--local-llm-api-key LOCAL_LLM_API_KEY`	Local LLM API key for providers that require it.
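
The options above can be assembled from Python before handing the command to `subprocess.run`. The sketch below is a hypothetical wrapper that only uses flags listed in the table and validates the documented constraints up front. The function name is an assumption, and treating `--output` as mandatory for ZIP is a choice made here because the table says "required or recommended". The environment variable name follows the table; the Quick Start uses `COMPDF_API_KEY`, so check which one your installed version reads.

```python
import os
from typing import List, Mapping, Optional

VALID_MODES = {"cloud", "local"}
VALID_FORMATS = {"markdown", "json", "standard-json", "zip"}


def build_parse_command(
    input_path: str,
    mode: str,
    output_format: str = "markdown",
    output_path: Optional[str] = None,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build the argv list for `docslight parse` (hypothetical helper)."""
    env = os.environ if env is None else env
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"format must be one of {sorted(VALID_FORMATS)}")
    # Cloud mode needs a key from the flag or the environment.
    if mode == "cloud" and not api_key and not env.get("DOCSLIGHT_API_KEY"):
        raise ValueError("cloud mode needs --api-key or DOCSLIGHT_API_KEY")
    # Binary ZIP output should not go to standard output.
    if output_format == "zip" and not output_path:
        raise ValueError("ZIP output needs an explicit --output path")

    command = ["docslight", "parse", "--mode", mode, "--format", output_format]
    if api_key:
        command += ["--api-key", api_key]
    if output_path:
        command += ["--output", output_path]
    command.append(input_path)
    return command


print(build_parse_command("invoice.pdf", "local", output_path="invoice.md", env={}))
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting issues with file names that contain spaces.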

Docker

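# Option 1: install the web extra and start the web UI from the CLI
pip install "docslight[web]"
docslight web

# Option 2: run the web app module directly
python -m docslight.web_app --host 0.0.0.0 --port 8000

# Option 3: run with Docker Compose
docker compose -f docker/docker-compose.yml up
# Open http://127.0.0.1:3022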

Comparison

Capability	DocSlight	MinerU	PDF-Extract-Kit	ExtractThinker
PDF Parsing	✅	✅	✅	⚠️
OCR Support	✅	⚠️	✅	❌
Data Extraction	✅	❌	❌	✅
Web UI	✅	❌	❌	❌
CLI	✅	✅	✅	❌
Python SDK	✅	✅	✅	⚠️
Cloud API	✅	❌	❌	❌
Enterprise Deployment	✅	❌	❌	❌
Markdown Output	✅	✅	✅	⚠️
JSON Output	✅	✅	✅	✅
Multi-language OCR	✅	⚠️	⚠️	❌
Commercial Support	✅	❌	❌	❌

Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                      DocSlight Open-Source Document Parser & Extractor           │
│                        （LGPL License | Local + Cloud Dual Mode）               │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        Access Layer（Entry Points）                            │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  ┌──────────────────────────┐   ┌────────────────────────┐  ┌───────────────────┐ │
│  │  Docker Web UI（Primary） │  │          CLI            │  │    Python SDK     │ │
│  │  One-click container     │  │      Command Line       │  │  Native Code     │ │
│  │  deployment, ready to    │  │        Tool             │  │  Integration     │ │
│  │  use out of the box      │  │                         │  │                  │ │
│  │                           │  │                         │  │                  │ │
│  │  docker compose -f       │  │docslight parse <file>   │  │  from docslight  │ │
│  │  docker/compose.yml up   │  │                         │  │  import Parser   │ │
│  │                           │  │docslight extract <file>│  │  parser.parse()  │ │
│  │  Browser access:         │  │                         │  │                  │ │
│  │  http://localhost:3022   │  │  docslight web          │  │                  │ │
│  └──────────────────────────┘  └────────────────────────┘  └───────────────────┘ │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         Core Processing Router                                  │
│              Auto-switch between Local and Cloud Engine via --mode / config     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────┐
                    │                                   │
                    ▼                                   ▼
┌───────────────────────────────────┐   ┌─────────────────────────────────────┐
│      🖥️ Local Mode（Lite Local）  │   │       ☁️ Cloud Mode（Lite Cloud）   │
│  （Free, Offline, CPU Support）   │   │  （High Accuracy, API Key, GPU）    │
├───────────────────────────────────┤   ├─────────────────────────────────────┤
│  • Input Formats:                 │   │  • Input Formats:                   │
│    PDF / Images / New Office      │   │    + Legacy Office (.doc/.xls etc.)│
│    (.docx/.pptx/.xlsx)           │   │                                     │
│  • Base OCR: PaddleOCR            │   │  • High-Accuracy VLM OCR Engine    │
│  • Basic Layout Analysis          │   │  • Complex Layout (Tables/Formulas/ │
│  • Field Extraction: requires     │   │    Multi-column)                   │
│    local LLM (Ollama/OpenAI      │   │  • Built-in AI Field Extraction     │
│    compatible)                    │   │  • Bounding Box (BBox) Traceability│
│  • Output: Markdown / JSON / Text │   │  • Output: Markdown / JSON / Text  │
└───────────────────────────────────┘   └─────────────────────────────────────┘
                    │                                   │
                    └─────────────────┬─────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                          AI Capability Layer（Engine Modules）                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  ┌──────────────────┐  ┌──────────────────┐  ┌─────────────────────────────┐  │
│  │   OCR Engine     │  │  Structure       │  │  Field Extraction Module   │  │
│  │  • Local:        │  │  Analyzer        │  │  • Template Extraction     │  │
│  │  PaddleOCR       │  │  • Block         │  │  • Custom Fields           │  │
│  │  • Cloud:        │  │    Classification│  │  • Rules + LLM Combo       │  │
│  │  VLM Engine      │  │  • Table         │  │  • BBox Traceability       │  │
│  │                  │  │    Detection     │  │                             │  │
│  │                  │  │  • Key-Value     │  │                             │  │
│  │                  │  │    Mapping       │  │                             │  │
│  │                  │  │  • Formula       │  │                             │  │
│  │                  │  │    Recognition   │  │                             │  │
│  └──────────────────┘  └──────────────────┘  └─────────────────────────────┘  │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         Output & Ecosystem Layer                               │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   ┌───────────────┐   ┌───────────────┐   ┌───────────────────────────────┐  │
│   │ Standard      │   │ AI Ecosystem  │   │ Enterprise Extensions         │  │
│   │ Output        │   │ Integration   │   │ （SaaS / Private Deployment）  │  │
│   │ Formats       │   │               │   │                               │  │
│   │  • Markdown   │   │  • LangChain  │   │  • Workflow Orchestration     │  │
│   │  • JSON       │   │  • LlamaIndex │   │  • Knowledge Base / DMS       │  │
│   │  • Text       │   │  • CrewAI     │   │  • RBAC / Audit Logs          │  │
│   │  • with BBox  │   │  • AutoGen    │   │  • Smart Review / Custom Rules│  │
│   │    Coordinates│   │  • Haystack   │   │  • Multi-tenancy / Private    │  │
│   │    Tracing    │   │               │   │    Deployment                 │  │
│   └───────────────┘   └───────────────┘   └───────────────────────────────┘  │
│                                                                                 │
│   ┌──────────────────────────────────────────────────────────────────────────┐ │
│   │  Target Scenarios: RAG Pipelines / AI Agents / Enterprise Document      │ │
│   │  Automation / Intelligent Document Processing（IDP）                    │ │
│   └──────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘

Built for AI Agents & RAG

DocSlight helps developers build modern AI document workflows with open-source PDF parsing and open-source document data extraction.

Common Applications

RAG systems
AI assistants
Enterprise knowledge bases
AI agent workflows
Document search engines
MCP applications
Intelligent document processing (IDP) for open-source workflows at any scale

Compatible Ecosystem

OpenAI
Claude
Ollama
LangChain
LlamaIndex
CrewAI
AutoGen
Haystack

Typical Workflow

PDF / Image / Office Document
            ↓
        docslight
            ↓
Markdown / JSON Output
            ↓
Vector Database
            ↓
LLM / AI Agent
            ↓
Answers & Automation
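
The middle of the workflow above (Markdown output to vector database to LLM) can be sketched without any external services. The example below splits Markdown at headings and retrieves the best-matching chunk with a bag-of-words cosine similarity. The word-count vectors are a stand-in for a real embedding model and vector database, and the sample Markdown is illustrative rather than actual DocSlight output.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

Chunk = Tuple[str, str]  # (heading, body)


def split_markdown_by_heading(markdown: str) -> List[Chunk]:
    """Split Markdown into (heading, body) chunks at ATX headings."""
    chunks: List[Chunk] = []
    heading, lines = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            if heading or any(item.strip() for item in lines):
                chunks.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if heading or any(item.strip() for item in lines):
        chunks.append((heading, "\n".join(lines).strip()))
    return chunks


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(chunks: List[Chunk], query: str) -> Chunk:
    """Return the chunk most similar to the query."""
    q = bag_of_words(query)
    return max(chunks, key=lambda c: cosine(bag_of_words(c[0] + " " + c[1]), q))


markdown = """# Invoice
Invoice number INV-1042, issued 2024-03-01.

## Payment terms
Payment is due within 30 days of the invoice date.
"""

chunks = split_markdown_by_heading(markdown)
best = retrieve(chunks, "When is payment due?")
print(best[0])  # heading of the chunk that would be passed to the LLM
```

Splitting at headings keeps each chunk aligned with the document structure that the parser recovered, which tends to give more coherent retrieval results than fixed-size windows.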

Benchmark

Method	Model Type	Parameters	Overall Score↑	Text Edit↓	Formula CDM↑	Table TEDS↑	Table TEDS-S↑	Read Order Edit↓
DocSlight (Cloud)	Specialized VLMs	0.9B	96.45	0.0321	97.76	94.80	97.02	0.131
MinerU2.5-Pro	Specialized VLMs	1.2B	95.75	0.036	97.45	93.42	95.92	0.120
GLM-OCR	Specialized VLMs	0.9B	95.22	0.044	97.18	92.83	95.39	0.133
PaddleOCR-VL-1.5	Specialized VLMs	0.9B	94.93	0.038	96.89	91.67	94.37	0.130
Ovis2.6-30B-A3B	Specialized VLMs	30B	93.70	0.035	95.17	89.44	92.40	0.135
Logics-Parsing-v2	Specialized VLMs	4B	93.33	0.041	95.65	88.42	91.98	0.137
HunyuanOCR	Specialized VLMs	1B	89.95	0.088	87.68	91.01	93.23	0.171
Qwen3-VL-235B	General VLMs	235B	89.78	0.063	92.55	83.07	86.75	0.166
Dolphin-v2	Specialized VLMs	3B	89.50	0.069	91.01	84.40	87.44	0.150
GPT-5.2	General VLMs	-	86.59	0.114	88.21	82.95	87.93	0.193
Mistral OCR	Specialized VLMs	-	85.66	0.097	89.91	76.78	80.93	0.171
Nanonets-OCR-s	Specialized VLMs	3B	83.61	0.108	81.46	80.18	84.51	0.213
Marker	Pipeline Tools	-	78.44	0.157	85.24	65.77	73.24	0.243

Methodology: Based on real human-annotated data and measured with character-level accuracy. The test set covers 500+ enterprise documents, including invoices, contracts, tables, and reports. The dataset is available at benchmarks/dataset.
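
The README does not define the Edit columns. A common convention in document parsing benchmarks is a normalized edit distance between predicted and reference text, where lower is better. The sketch below assumes that convention and normalizes by the longer string's length, which is one of several normalizations in use, so it may not match the exact scoring behind the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(predicted), len(reference))
    return levenshtein(predicted, reference) / longest if longest else 0.0


# One substituted character out of ten.
print(normalized_edit_distance("Total 125O", "Total 1250"))
```

Under this convention, a score of 0.03 means roughly three character edits per hundred characters of reference text.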

Package Variants

Package	Description
`docslight`	Core CLI + Python SDK
`docslight[web]`	Web UI for browser-based drag-and-drop workflows
`docslight[cloud]`	Cloud API client with higher accuracy
`docslight[all]`	All features

pip install "docslight[all]"

Support

Have suggestions? Start a discussion.

License

DocSlight is released as open source under the LGPL.

Commercial / Enterprise licenses with support for GPU self-hosted deployment are available at compdf.com.

Built by the ComPDF team.
Website · Docs · Enterprise Inquiries
