ComPDFKit/docslight

English | 繁體中文 | 简体中文

DocSlight - An Open-source Document Parser & Document Data Extraction Engine

DocSlight parses documents and extracts structured data from PDFs, scans, images, and Office files. It is an open-source AI project from ComPDF (KDAN ecosystem).

  • If you find DocSlight useful, please consider giving us a ⭐ Star on GitHub. It helps us grow and improve.
  • Got questions or ideas? Join the conversation in our Discussions.


Quick Start · Product Editions · Usage · Benchmark · Cloud API → · Documentation

Why DocSlight?

Unlike traditional OCR tools, DocSlight combines AI-powered document parsing, OCR for 80+ languages, and structured data extraction in a single open-source platform. You can deploy it locally, or call the Cloud API for higher accuracy.

Key Advantages

  • Open-source document data extraction engine with no vendor lock-in
  • OCR for 80+ languages with multilingual auto-detection
  • Structured field extraction with bounding-box traceability
  • Markdown / JSON output for downstream processing
  • Web UI + CLI + Python SDK
  • Local deployment or Cloud API
  • Built for RAG, AI Agents, and enterprise document workflows

Perfect For

  • RAG pipelines and knowledge base construction
  • Invoice processing and document information extraction
  • Contract analysis and clause parsing
  • AI copilots and AI agent tool integration
  • Enterprise document automation and intelligent document processing (IDP)

Whether you're building a personal RAG project or a large-scale enterprise document automation system, DocSlight provides a scalable foundation for document understanding.



Quick Start

Local Mode (Free, no registration required)

# 1. Install
pip install docslight

# 2. Parse a document
docslight parse invoice.pdf --mode local --output invoice.md

# 3. View the result
cat invoice.md

Cloud Mode (Higher accuracy, free quota available)

# 1. Install
pip install docslight

# 2. Set your API key
export DOCSLIGHT_API_KEY="your_public_key"    # Get one at https://compdf.com

# 3. Parse with the cloud engine
docslight parse invoice.pdf --mode cloud --format json

Get the API Key: Log in to the ComPDF Console. On the API Key page, create or copy your publicKey.


Web UI (Browser)

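# Option 1: start the web interface from the CLI
docslight web

# Option 2: run the web app module directly on a chosen host and port
python -m docslight.web_app --host 0.0.0.0 --port 8000

# Option 3: run with Docker Compose
docker compose -f docker/docker-compose.yml up
# Open http://localhost:3022 and drag & drop files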

All features above come with ComPDF — check them out here.


Product Editions

Need workflow automation, RBAC, audit logs, private deployment, or dedicated support? Explore Enterprise: https://www.compdf.com/ai/docslight

Feature | DocSlight Lite (Local) | DocSlight Lite (Cloud) | DocSlight Enterprise (SaaS) | DocSlight Enterprise (Self-hosted Deployment)
Upload Files from Local
Upload Files from Cloud
Upload Files from DMS
Upload Files from Scanner
PDF Parsing
Image Parsing
Word / PPT / Excel Parsing
Markdown Output
JSON Output
PDF Extraction Local LLM Required
Image Extraction Local LLM Required
Word / PPT / Excel Extraction Local LLM Required
Legacy Office Formats for Parsing and Extraction (.doc/.ppt/.xls)
Batch Processing
Auto Classification
Human Review Workflow
Complex Layout Analysis Basic Advanced Advanced Advanced
OCR Optimization Basic Advanced Advanced Advanced
Result Traceability
Result Post-Processing
Intelligent Result Review
Custom Rule-Based Alerts
Webhook Integration
API Management Limited
Knowledge Base Integration
Audit Logs
RBAC
Tenant Support
Self-hosted Deployment Local Only
Dedicated GPU Optional

Use Cases

  • RAG Pipeline — Parse documents -> embed vectors -> query with an LLM
  • Invoice Processing — Extract invoice numbers, dates, totals, and line items
  • Contract Analysis — Parse clauses, parties, and dates with bounding-box traceability
  • Document Digitization — Batch convert scanned archives into searchable text
  • AI Agent Integration — Provide MCP-based document reading for Claude / ChatGPT
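
As a rough illustration of the Document Digitization use case, the sketch below converts a set of files to Markdown. It assumes only what the Usage section shows: a parser object with a `parse(path)` method whose result has a `.text` attribute. The helper name `batch_to_markdown` and the two `Protocol` classes are hypothetical and exist only for this example.

```python
from pathlib import Path
from typing import Iterable, List, Protocol


class ParseResult(Protocol):
    """Anything with a Markdown `.text` attribute (assumed result shape)."""
    text: str


class DocumentParser(Protocol):
    """Anything with a `parse(path)` method, such as the SDK's Parser."""
    def parse(self, path: str) -> ParseResult: ...


def batch_to_markdown(parser: DocumentParser,
                      inputs: Iterable[Path],
                      out_dir: Path) -> List[Path]:
    """Parse each input file and write `<stem>.md` into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for src in sorted(inputs):
        result = parser.parse(str(src))
        dest = out_dir / (src.stem + ".md")
        dest.write_text(result.text, encoding="utf-8")
        written.append(dest)
    return written
```

With the SDK installed, a call might look like `batch_to_markdown(Parser(mode="local"), Path("scans").glob("*.pdf"), Path("out"))`. Inputs that share a stem (for example `a.pdf` and `a.png`) would overwrite each other in this sketch.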

Runnable example code is available in examples/.


Usage

Python SDK

from docslight import Parser

# Local mode — open-source OCR and document parsing
parser = Parser(mode="local")
result = parser.parse("contract.pdf")
print(result.text)                    # Full Markdown text
print(result.metadata)                # Pages, blocks, bounding boxes

# Cloud mode — higher-accuracy PDF parsing
parser = Parser(mode="cloud", api_key="your_key")
result = parser.parse("invoice.pdf")
print(result.text)
print(result.tables)                  # Structured table data
print(result.blocks[0].bbox)          # Bounding-box traceability
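
Bounding boxes are typically used to highlight the source region on a rendered page. The README does not specify the layout of `bbox`, so the sketch below assumes a tuple `(x0, y0, x1, y1)` in page units with a top-left origin. Both that layout and the helper name `scale_bbox` are assumptions, not part of the documented SDK.

```python
from typing import Tuple

# Assumed layout: (x0, y0, x1, y1) in page units, origin at the top-left.
BBox = Tuple[float, float, float, float]


def scale_bbox(bbox: BBox,
               page_size: Tuple[float, float],
               image_size: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Map a box from page units to pixel coordinates of a rendered page image."""
    x0, y0, x1, y1 = bbox
    page_w, page_h = page_size
    img_w, img_h = image_size
    sx, sy = img_w / page_w, img_h / page_h
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))


# A box on a 612 x 792 page, mapped onto a 1224 x 1584 pixel rendering.
pixel_box = scale_bbox((72, 144, 306, 180), (612, 792), (1224, 1584))
```

If the real coordinates turn out to use a bottom-left origin or normalized values, only the arithmetic inside `scale_bbox` would need to change.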

CLI

# Parse a PDF to Markdown
docslight parse document.pdf --mode local --format markdown -o document.md

# Parse an image to JSON with bounding boxes
docslight parse scan.png --mode local --format json --bbox -o scan.json

# Field extraction (cloud mode)
docslight extract invoice.pdf --mode cloud --schema '{"fields": ["invoice_no", "date", "total"]}'

# Watch a directory for new files
docslight watch ./incoming/ --cloud
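
When the extraction schema is generated by a program, quoting the JSON by hand is error-prone. The sketch below builds the `extract` command as an argument list, so no shell quoting is needed. The `--schema` and `--mode` flags are taken from this README. The helper name `build_extract_command` is hypothetical, and the command is only assembled here, not executed.

```python
import json
import shlex
from typing import List, Sequence


def build_extract_command(input_path: str,
                          fields: Sequence[str],
                          mode: str = "cloud") -> List[str]:
    """Return an argv list for `docslight extract` with a JSON field schema."""
    schema = json.dumps({"fields": list(fields)})
    return ["docslight", "extract", input_path, "--mode", mode, "--schema", schema]


argv = build_extract_command("invoice.pdf", ["invoice_no", "date", "total"])

# shlex.join renders the list as a copy-pasteable shell command for logging.
print(shlex.join(argv))
```

The list can be passed directly to `subprocess.run(argv, check=True)` once the CLI is installed.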

Parse | extract CLI options

docslight parse [options] <input>
docslight extract [options] <input>
| Option | Description |
| --- | --- |
| input | Required input file path. |
| --mode {cloud,local} | Required processing mode. Use cloud for ComPDF Cloud, or local for offline local processing. |
| --api-key API_KEY | Cloud API key. Required in cloud mode unless DOCSLIGHT_API_KEY is already set. |
| --base-url BASE_URL | Optional custom cloud API base URL. |
| --output, -o OUTPUT | Output file path. For text formats, defaults to standard output; for ZIP output, it is required or recommended. |
| --format {markdown,json,standard-json,zip} | Parse output format. Defaults to Markdown unless the output path ends with .zip. |
| --local-parser LOCAL_PARSER | Optional local parser selector for local mode. |
| --local-llm-provider LOCAL_LLM_PROVIDER | Local LLM provider setting. Not required for parse-only workflows. |
| --local-llm-model LOCAL_LLM_MODEL | Local LLM model name. Not required for parse-only workflows. |
| --local-llm-base-url LOCAL_LLM_BASE_URL | Local LLM endpoint base URL for providers that require it. |
| --local-llm-api-key LOCAL_LLM_API_KEY | Local LLM API key for providers that require it. |
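
Two rules in the option descriptions above interact with each other: the API key may come from `--api-key` or from `DOCSLIGHT_API_KEY`, and the format defaults to Markdown unless the output path ends with `.zip`. The sketch below mirrors those rules for scripts that wrap the CLI. It is an illustration of the documented behavior, not the tool's actual implementation: the function names are hypothetical, and giving the explicit flag precedence over the environment variable is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(mode: str,
                    api_key: Optional[str],
                    env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the key to use in cloud mode, or None in local mode."""
    if mode == "local":
        return None
    if mode != "cloud":
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: an explicit --api-key wins over the environment variable.
    key = api_key or env.get("DOCSLIGHT_API_KEY")
    if not key:
        raise ValueError("cloud mode needs --api-key or DOCSLIGHT_API_KEY")
    return key


def resolve_format(fmt: Optional[str], output: Optional[str]) -> str:
    """Pick the output format: explicit value, then .zip suffix, then Markdown."""
    if fmt:
        return fmt
    if output and output.endswith(".zip"):
        return "zip"
    return "markdown"
```

Validating these two values before invoking the CLI lets a wrapper script fail early with a clear message instead of midway through a batch.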

Docker

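# Option 1: install the web extra and start the web interface
pip install "docslight[web]"
docslight web

# Option 2: run the web app module directly on a chosen host and port
python -m docslight.web_app --host 0.0.0.0 --port 8000

# Option 3: run with Docker Compose
docker compose -f docker/docker-compose.yml up
# Open http://127.0.0.1:3022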

Comparison

Capability DocSlight MinerU PDF-Extract-Kit ExtractThinker
PDF Parsing ⚠️
OCR Support ⚠️
Data Extraction
Web UI
CLI
Python SDK ⚠️
Cloud API
Enterprise Deployment
Markdown Output ⚠️
JSON Output
Multi-language OCR ⚠️ ⚠️
Commercial Support

Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                      DocSlight Open-Source Document Parser & Extractor           │
│                        (LGPL License | Local + Cloud Dual Mode)               │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        Access Layer(Entry Points)                            │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  ┌──────────────────────────┐   ┌────────────────────────┐  ┌───────────────────┐ │
│  │  Docker Web UI(Primary) │  │          CLI            │  │    Python SDK     │ │
│  │  One-click container     │  │      Command Line       │  │  Native Code     │ │
│  │  deployment, ready to    │  │        Tool             │  │  Integration     │ │
│  │  use out of the box      │  │                         │  │                  │ │
│  │                           │  │                         │  │                  │ │
│  │  docker compose -f       │  │docslight parse <file>   │  │  from docslight  │ │
│  │  docker/compose.yml up   │  │                         │  │  import Parser   │ │
│  │                           │  │docslight extract <file>│  │  parser.parse()  │ │
│  │  Browser access:         │  │                         │  │                  │ │
│  │  http://localhost:3022   │  │  docslight web          │  │                  │ │
│  └──────────────────────────┘  └────────────────────────┘  └───────────────────┘ │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         Core Processing Router                                  │
│              Auto-switch between Local and Cloud Engine via --mode / config     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────┐
                    │                                   │
                    ▼                                   ▼
┌───────────────────────────────────┐   ┌─────────────────────────────────────┐
│      🖥️ Local Mode(Lite Local)  │   │       ☁️ Cloud Mode(Lite Cloud)   │
│  (Free, Offline, CPU Support)   │   │  (High Accuracy, API Key, GPU)    │
├───────────────────────────────────┤   ├─────────────────────────────────────┤
│  • Input Formats:                 │   │  • Input Formats:                   │
│    PDF / Images / New Office      │   │    + Legacy Office (.doc/.xls etc.)│
│    (.docx/.pptx/.xlsx)           │   │                                     │
│  • Base OCR: PaddleOCR            │   │  • High-Accuracy VLM OCR Engine    │
│  • Basic Layout Analysis          │   │  • Complex Layout (Tables/Formulas/ │
│  • Field Extraction: requires     │   │    Multi-column)                   │
│    local LLM (Ollama/OpenAI      │   │  • Built-in AI Field Extraction     │
│    compatible)                    │   │  • Bounding Box (BBox) Traceability│
│  • Output: Markdown / JSON / Text │   │  • Output: Markdown / JSON / Text  │
└───────────────────────────────────┘   └─────────────────────────────────────┘
                    │                                   │
                    └─────────────────┬─────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                          AI Capability Layer(Engine Modules)                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  ┌──────────────────┐  ┌──────────────────┐  ┌─────────────────────────────┐  │
│  │   OCR Engine     │  │  Structure       │  │  Field Extraction Module   │  │
│  │  • Local:        │  │  Analyzer        │  │  • Template Extraction     │  │
│  │  PaddleOCR       │  │  • Block         │  │  • Custom Fields           │  │
│  │  • Cloud:        │  │    Classification│  │  • Rules + LLM Combo       │  │
│  │  VLM Engine      │  │  • Table         │  │  • BBox Traceability       │  │
│  │                  │  │    Detection     │  │                             │  │
│  │                  │  │  • Key-Value     │  │                             │  │
│  │                  │  │    Mapping       │  │                             │  │
│  │                  │  │  • Formula       │  │                             │  │
│  │                  │  │    Recognition   │  │                             │  │
│  └──────────────────┘  └──────────────────┘  └─────────────────────────────┘  │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         Output & Ecosystem Layer                               │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   ┌───────────────┐   ┌───────────────┐   ┌───────────────────────────────┐  │
│   │ Standard      │   │ AI Ecosystem  │   │ Enterprise Extensions         │  │
│   │ Output        │   │ Integration   │   │ (SaaS / Private Deployment)  │  │
│   │ Formats       │   │               │   │                               │  │
│   │  • Markdown   │   │  • LangChain  │   │  • Workflow Orchestration     │  │
│   │  • JSON       │   │  • LlamaIndex │   │  • Knowledge Base / DMS       │  │
│   │  • Text       │   │  • CrewAI     │   │  • RBAC / Audit Logs          │  │
│   │  • with BBox  │   │  • AutoGen    │   │  • Smart Review / Custom Rules│  │
│   │    Coordinates│   │  • Haystack   │   │  • Multi-tenancy / Private    │  │
│   │    Tracing    │   │               │   │    Deployment                 │  │
│   └───────────────┘   └───────────────┘   └───────────────────────────────┘  │
│                                                                                 │
│   ┌──────────────────────────────────────────────────────────────────────────┐ │
│   │  Target Scenarios: RAG Pipelines / AI Agents / Enterprise Document      │ │
│   │  Automation / Intelligent Document Processing(IDP)                    │ │
│   └──────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘

Built for AI Agents & RAG

DocSlight helps developers build modern AI document workflows with open-source PDF parsing and open-source document data extraction.

Common Applications

  • RAG systems
  • AI assistants
  • Enterprise knowledge bases
  • AI agent workflows
  • Document search engines
  • MCP applications
  • Intelligent document processing (IDP) for open-source workflows at any scale

Compatible Ecosystem

  • OpenAI
  • Claude
  • Ollama
  • LangChain
  • LlamaIndex
  • CrewAI
  • AutoGen
  • Haystack

Typical Workflow

PDF / Image / Office Document
            ↓
        docslight
            ↓
Markdown / JSON Output
            ↓
Vector Database
            ↓
LLM / AI Agent
            ↓
Answers & Automation
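
The step between "Markdown / JSON Output" and "Vector Database" usually involves splitting the Markdown into chunks before embedding. The sketch below shows one simple approach: start a new section at each heading, then pack paragraphs greedily up to a size limit. The function name `chunk_markdown` and the 1000-character default are illustrative assumptions. DocSlight itself may or may not provide chunking.

```python
import re
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split Markdown at headings, then pack paragraphs into chunks.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    # Zero-width split: begin a new section at every heading line.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        current = ""
        for para in re.split(r"\n\s*\n", section.strip()):
            if not para:
                continue
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                # Adding this paragraph would exceed the limit: flush first.
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Each returned chunk can then be embedded and stored with its source file name. Note that this sketch treats any line starting with `#` followed by a space as a heading, so comment lines inside code blocks would also start a new section.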

Benchmark

| Methods | Model Type | Parameters | Overall Score↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Read Order Edit↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DocSlight (Cloud) | Specialized VLMs | 0.9B | 96.45 | 0.0321 | 97.76 | 94.80 | 97.02 | 0.131 |
| MinerU2.5-Pro | Specialized VLMs | 1.2B | 95.75 | 0.036 | 97.45 | 93.42 | 95.92 | 0.120 |
| GLM-OCR | Specialized VLMs | 0.9B | 95.22 | 0.044 | 97.18 | 92.83 | 95.39 | 0.133 |
| PaddleOCR-VL-1.5 | Specialized VLMs | 0.9B | 94.93 | 0.038 | 96.89 | 91.67 | 94.37 | 0.130 |
| Ovis2.6-30B-A3B | Specialized VLMs | 30B | 93.70 | 0.035 | 95.17 | 89.44 | 92.40 | 0.135 |
| Logics-Parsing-v2 | Specialized VLMs | 4B | 93.33 | 0.041 | 95.65 | 88.42 | 91.98 | 0.137 |
| HunyuanOCR | Specialized VLMs | 1B | 89.95 | 0.088 | 87.68 | 91.01 | 93.23 | 0.171 |
| Qwen3-VL-235B | General VLMs | 235B | 89.78 | 0.063 | 92.55 | 83.07 | 86.75 | 0.166 |
| Dolphin-v2 | Specialized VLMs | 3B | 89.50 | 0.069 | 91.01 | 84.40 | 87.44 | 0.150 |
| GPT-5.2 | General VLMs | - | 86.59 | 0.114 | 88.21 | 82.95 | 87.93 | 0.193 |
| Mistral OCR | Specialized VLMs | - | 85.66 | 0.097 | 89.91 | 76.78 | 80.93 | 0.171 |
| Nanonets-OCR-s | Specialized VLMs | 3B | 83.61 | 0.108 | 81.46 | 80.18 | 84.51 | 0.213 |
| Marker | Pipeline Tools | - | 78.44 | 0.157 | 85.24 | 65.77 | 73.24 | 0.243 |

Methodology: Based on real human-annotated data and measured with character-level accuracy. The test set covers 500+ enterprise documents, including invoices, contracts, tables, and reports. The dataset is available at benchmarks/dataset.
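
The README does not state how the Overall Score column is computed. The published rows are consistent, to within rounding of the inputs, with averaging three components: the text edit distance converted to a percentage-style score, the formula CDM, and the table TEDS. The sketch below recomputes the column under that assumption. The formula is an inference from the numbers in the table, not a documented definition.

```python
def overall_score(text_edit: float, formula_cdm: float, table_teds: float) -> float:
    """Assumed formula: mean of (1 - text_edit) * 100, formula CDM, and table TEDS."""
    return ((1.0 - text_edit) * 100.0 + formula_cdm + table_teds) / 3.0


# (method, text_edit, formula_cdm, table_teds, published_overall) from the table
rows = [
    ("DocSlight (Cloud)", 0.0321, 97.76, 94.80, 96.45),
    ("MinerU2.5-Pro", 0.036, 97.45, 93.42, 95.75),
    ("Qwen3-VL-235B", 0.063, 92.55, 83.07, 89.78),
    ("Marker", 0.157, 85.24, 65.77, 78.44),
]

for name, edit, cdm, teds, published in rows:
    recomputed = overall_score(edit, cdm, teds)
    print(f"{name}: recomputed {recomputed:.2f}, published {published:.2f}")
```

Small differences in the second decimal place are expected, because the component values in the table are themselves rounded.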


Package Variants

| Package | Description |
| --- | --- |
| docslight | Core CLI + Python SDK |
| docslight[web] | Web UI for browser-based drag-and-drop workflows |
| docslight[cloud] | Cloud API client with higher accuracy |
| docslight[all] | All features |

pip install "docslight[all]"

Support

Have suggestions? Start a discussion.


License

DocSlight is released as open source under the LGPL.

Commercial / Enterprise licenses with support for GPU self-hosted deployment are available at compdf.com.


Built by the ComPDF team.
Website · Docs · Enterprise Inquiries