Turn Documents into Deliverables
PullData is a high-performance, text-based RAG (Retrieval-Augmented Generation) system for extracting structured data from documents and turning it into polished deliverables. Built to run efficiently on modest hardware (a Tesla P4 with 8GB of VRAM), it transforms PDFs into Excel spreadsheets, PowerPoint presentations, Markdown, JSON, and more.
- Web UI & REST API: Interactive web interface with FastAPI backend for easy document management
- Versatile Output Formats: Excel (.xlsx), PowerPoint (.pptx), Styled PDF, Markdown, JSON, LaTeX
- VP-Ready Styled PDFs: Three professional styles (Executive, Modernist, Academic) with chain-based LLM structuring
- Advanced Table Extraction: Preserves table structure and enables direct Excel export
- Flexible LLM Options: Local models OR OpenAI-compatible APIs (LM Studio, Ollama, OpenAI, Groq, etc.)
- Pluggable Storage: PostgreSQL + pgvector, SQLite (local), or ChromaDB
- Smart Caching: LLM output caching and embedding caching for blazing-fast repeated queries
- Differential Updates: Hash-based change detection avoids re-processing unchanged content
- Multi-Project Isolation: Manage multiple document collections with complete separation
- Hardware Optimized: Runs efficiently on Tesla P4 (8GB VRAM) with INT8 quantization
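Two of the features above, differential updates and query caching, come down to content hashing. A minimal sketch of hash-based change detection (illustrative only; the function names are hypothetical, not PullData internals):

```python
import hashlib

def chunk_hash(text: str) -> str:
    # SHA-256 fingerprint of a chunk's content
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_chunks(chunks: dict, stored: dict) -> list:
    # IDs of chunks whose hash differs from the stored one:
    # only these need re-chunking, re-embedding, and re-indexing.
    return [cid for cid, text in chunks.items()
            if stored.get(cid) != chunk_hash(text)]

# Hashes recorded at first ingest
stored = {"p1": chunk_hash("Q3 revenue was $4.2M."),
          "p2": chunk_hash("Costs fell 3%.")}
# Re-ingest after editing page 2
new = {"p1": "Q3 revenue was $4.2M.", "p2": "Costs fell 5%."}
print(changed_chunks(new, stored))  # → ['p2']
```

The same fingerprint idea serves as a cache key on the query side: identical query + context + model hashes to the same key, so the cached answer is returned without an LLM call.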
# Clone the repository
git clone https://github.com/pulldata/pulldata.git
cd pulldata
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies (includes Web UI components)
pip install -e .
# Verify installation (optional but recommended)
python verify_install.py
# Copy environment file
cp .env.example .env  # On Windows: copy .env.example .env

# Start the server
python run_server.py
# Open browser to http://localhost:8000/ui/
# - Upload documents
# - Query with natural language
# - Generate Styled PDFs with style selector (Executive, Modernist, Academic)
# - Export to Excel, PowerPoint, Markdown, and more
# - Download results instantly

See Web UI Guide for full documentation.
from pulldata import PullData
# Initialize with default local storage (SQLite + FAISS)
rag = PullData(project="financial_reports")
# Ingest documents
rag.ingest("./documents/*.pdf")
# Query and generate Excel output
result = rag.query(
    query="Extract Q3 revenue by region",
    output_format="excel"
)
# Save output
result.save("revenue_report.xlsx")

# Initialize a new project
pulldata init --project financial_reports --backend local
# Ingest documents
pulldata ingest --project financial_reports --path ./documents/
# Query with Excel output
pulldata query \
  --project financial_reports \
  --query "Extract Q3 revenue by region" \
  --output excel \
  --save revenue_report.xlsx
# Query with Markdown output
pulldata query \
  --project financial_reports \
  --query "Summarize key findings" \
  --output markdown \
  --save summary.md

┌─────────────────────────────────────────────┐
│             Document Ingestion              │
├─────────────────────────────────────────────┤
│ PDF → PyMuPDF (text) + pdfplumber (tables)  │
│                      ↓                      │
│ Semantic Chunking (512 tokens)              │
│                      ↓                      │
│ BGE Embeddings (384 dim)                    │
│                      ↓                      │
│ Storage Backend (Postgres/SQLite/Chroma)    │
│    + FAISS Index                            │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│               Query Pipeline                │
├─────────────────────────────────────────────┤
│ User Query → Embedding → FAISS Search       │
│ → Metadata Filtering → LLM Generation       │
│ → Format Synthesis → Output File            │
└─────────────────────────────────────────────┘
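The query pipeline above can be sketched end-to-end in a few lines. This toy version substitutes NumPy dot products for FAISS and a seeded pseudo-embedding for BGE, purely to show the embed → search → filter flow (all names are hypothetical):

```python
import hashlib
import numpy as np

def embed(text, dim=8):
    # Stand-in for real BGE embeddings: a deterministic pseudo-vector per text.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

chunks = [
    {"text": "Q3 revenue grew 12% in EMEA.", "meta": {"doc_type": "financial_report"}},
    {"text": "The office picnic is on Friday.", "meta": {"doc_type": "memo"}},
]
index = np.stack([embed(c["text"]) for c in chunks])  # stands in for the FAISS index

def search(query, top_k=2, doc_type=None):
    scores = index @ embed(query)        # cosine similarity (vectors are unit-norm)
    order = np.argsort(-scores)          # best matches first
    hits = [chunks[i] for i in order[:top_k]]
    if doc_type is not None:             # metadata filtering step
        hits = [h for h in hits if h["meta"]["doc_type"] == doc_type]
    return hits                          # these chunks become the LLM's context

hits = search("revenue by region", doc_type="financial_report")
```

In the real system the top-k hits, after filtering, are handed to the LLM along with the query, and the answer is synthesized into the requested output format.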
| Component | Technology | Size | VRAM |
|---|---|---|---|
| Embedder | BAAI/bge-small-en-v1.5 | 33M | ~0.5GB |
| LLM (Default) | Qwen/Qwen2.5-3B-Instruct | 3B | ~3GB (INT8) |
| LLM (API) | OpenAI-compatible APIs | - | - |
| Vector DB | FAISS + pgvector | - | - |
| Storage | PostgreSQL / SQLite / ChromaDB | - | - |
PullData supports two modes for language models:
Run models directly on your hardware using transformers:
- Qwen 2.5 3B (Default for P4)
- Qwen 2.5 7B (High-end GPUs)
- Llama 3.2 3B
- Phi-2
Use OpenAI-compatible API endpoints:
- LM Studio - Local API server with a UI (recommended if you have no GPU)
- Ollama - Easy local LLM runner
- OpenAI - GPT-3.5, GPT-4
- Groq - Ultra-fast inference
- Together AI - Fast open-source models
- vLLM - Self-hosted inference server
- Text Generation WebUI - Feature-rich local server
Switch with one config change:
# Local model
models:
  llm:
    provider: local

# API endpoint
models:
  llm:
    provider: api
    api:
      base_url: http://localhost:1234/v1  # LM Studio
      model: local-model

See API Configuration Guide for detailed setup.
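Since API mode speaks the OpenAI wire format, any of the servers above are drop-in replacements. As an illustration, this is the shape of the request body an OpenAI-compatible backend receives at POST {base_url}/chat/completions (the prompt contents here are hypothetical, not PullData's actual prompts):

```python
import json

# Request body for POST {base_url}/chat/completions, as an OpenAI-compatible
# server (LM Studio, Ollama, vLLM, ...) would receive it.
payload = {
    "model": "local-model",  # whatever model name the server exposes
    "messages": [
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: Extract Q3 revenue by region."},
    ],
    "temperature": 0.1,  # low temperature suits extraction tasks
}
body = json.dumps(payload)
```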
PullData uses YAML configuration files located in the configs/ directory:
- configs/default.yaml - Main configuration
- configs/models.yaml - Model presets for different hardware
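As a sketch of how the pieces fit together, a minimal configs/default.yaml could combine the model and storage fragments shown elsewhere in this README (illustrative only, not the shipped file):

```yaml
# Illustrative configs/default.yaml; keys mirror the fragments in this README
models:
  llm:
    provider: local        # or: api
storage:
  backend: local           # local | postgres | chromadb
  local:
    sqlite_path: ./data/pulldata.db
    faiss_index_path: ./data/faiss_indexes
```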
Zero-config, single-file storage. Perfect for development and small projects.
storage:
  backend: local
  local:
    sqlite_path: ./data/pulldata.db
    faiss_index_path: ./data/faiss_indexes

Production-ready with multi-user support and advanced querying.
storage:
  backend: postgres
  postgres:
    host: localhost
    port: 5432
    database: pulldata
    user: pulldata_user
    password: ${POSTGRES_PASSWORD}

Standalone vector database with built-in persistence.
storage:
  backend: chromadb
  chromadb:
    persist_directory: ./data/chroma_db

PullData detects unchanged content using SHA-256 hashing and only re-processes modified chunks:
# First ingest
rag.ingest("report.pdf") # Full processing
# Update document (only changed pages re-processed)
rag.ingest("report.pdf")  # Differential update

Repeated queries are served from cache instantly:
# Cache key: hash(query + context_ids + model)
result1 = rag.query("What's Q3 revenue?") # LLM call (2s)
result2 = rag.query("What's Q3 revenue?")  # Cache hit (0.01s)

results = rag.query(
    query="Revenue trends",
    filters={
        "doc_type": "financial_report",
        "date_range": ("2024-01-01", "2024-12-31"),
        "tags": ["quarterly", "audited"],
        "page_number": 5
    }
)

# Create separate projects for isolation
finance_rag = PullData(project="finance")
legal_rag = PullData(project="legal")
# Each has independent storage and indexes
finance_rag.ingest("financial_docs/")
legal_rag.ingest("legal_docs/")

Professional PDF reports with three distinct styles:
| Style | Description | Best For |
|---|---|---|
| Executive | Clean, corporate blue accents, gradient headers | Board presentations, stakeholder reports |
| Modernist | Bold dark theme, high contrast, tech aesthetic | Tech companies, modern brands |
| Academic | Classical serif fonts, scholarly formatting | Research, academic publications |
Features:
- Chain-based LLM structuring (optimized for small models 1.7B-7B)
- Key metrics dashboard with trend indicators
- Executive summary with drop-cap styling
- Actionable recommendations section
- Source references with relevance scores
- A4 format with print-optimized layouts
# Generate styled PDF
result = rag.query(
    query="Analyze Q3 performance",
    output_format="styled_pdf",
    pdf_style="executive"  # or "modernist", "academic"
)
result.save("quarterly_report.pdf")

- Preserves table structure from PDFs
- Automatic styling (headers, filters, freeze panes)
- Support for multiple sheets
- Formula support (coming soon)
- Template-based generation
- Automatic slide layouts
- Table and chart embedding
- Clean, readable format
- Automatic TOC generation
- Code highlighting
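A TOC like the one auto-generated for Markdown output can be derived from headings alone. A minimal sketch (GitHub-style anchor slugs assumed; not PullData's implementation):

```python
import re

def build_toc(markdown: str) -> str:
    # Build a nested bullet TOC from ATX headings (#, ##, ...),
    # linking each entry to a GitHub-style anchor slug.
    toc = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            level, title = len(m.group(1)), m.group(2).strip()
            slug = re.sub(r"[^a-z0-9 -]", "", title.lower()).replace(" ", "-")
            toc.append("  " * (level - 1) + f"- [{title}](#{slug})")
    return "\n".join(toc)

doc = "# Report\n## Q3 Revenue\nNumbers...\n## Key Findings\n"
print(build_toc(doc))
```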
- Structured data extraction
- Configurable schema
- Nested data support
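To make "configurable schema" concrete, here is one way schema-constrained extraction output could be validated; the schema format and field names below are hypothetical, not PullData's API:

```python
import json

# Hypothetical schema: required field -> expected Python type
schema = {"region": str, "revenue_musd": float, "quarter": str}

def validate(record: dict, schema: dict) -> bool:
    # A record passes if every schema field is present with the expected type.
    return all(isinstance(record.get(k), t) for k, t in schema.items())

# LLM output (as a JSON string) is parsed, then filtered against the schema
raw = '[{"region": "EMEA", "revenue_musd": 4.2, "quarter": "Q3"}]'
records = [r for r in json.loads(raw) if validate(r, schema)]
```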
- Academic paper formatting
- Math equation support
- Citation management (coming soon)
| Task | Performance | Hardware |
|---|---|---|
| Ingest (per page) | <5s | Tesla P4 |
| Query latency | <2s | Tesla P4 |
| Cache hit latency | <0.05s | Any |
| Table extraction accuracy | >90% | - |
pulldata/
├── configs/           # Configuration files
├── pulldata/
│   ├── core/          # Core data structures
│   ├── parsing/       # Document parsing
│   ├── embedding/     # Embedding generation
│   ├── storage/       # Storage backends
│   ├── retrieval/     # Vector search & filtering
│   ├── generation/    # LLM generation
│   ├── synthesis/     # Output format synthesis
│   ├── pipeline/      # End-to-end orchestration
│   └── cli/           # CLI interface
├── tests/             # Unit & integration tests
├── benchmarks/        # Performance benchmarks
├── examples/          # Usage examples
└── docs/              # Documentation
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=pulldata --cov-report=html

# Format code
black pulldata/
# Lint code
ruff check pulldata/
# Type checking
mypy pulldata/

- GPU: None (CPU mode supported)
- RAM: 8GB
- Storage: 5GB (models + data)
- GPU: Tesla P4 (8GB VRAM) or equivalent
- RAM: 16GB
- Storage: 20GB
- GPU: RTX 3090 (24GB) or better
- RAM: 32GB
- Storage: 50GB
- Project structure setup
- Core data structures
- Document parsing (PDF + tables)
- Embedding generation (local + API)
- Storage backends (PostgreSQL, SQLite)
- Vector search with FAISS
- LLM generation (local + API)
- Excel, Markdown, JSON, PowerPoint, PDF output
- CLI interface
- FastAPI REST API
- Web UI with file upload
- VP-Ready Styled PDFs (Executive, Modernist, Academic)
- Chain-based LLM structuring for small models
- ChromaDB backend integration
- LaTeX output formatter
- Streaming generation
- Authentication & authorization
- Reranking support
- Table embeddings
- Multi-modal support (images, charts)
- Advanced entity extraction
- Batch processing API
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
@software{pulldata2025,
  title  = {PullData: Turn Documents into Deliverables},
  author = {PullData Team},
  year   = {2025},
  url    = {https://github.com/pulldata/pulldata}
}

📚 Complete Documentation Index - All guides in one place
Quick links:
- Quick Start - Get started in 5 minutes
- Config Quick Start - Change settings (LM Studio, OpenAI, etc.)
- Web UI Guide - Using the Web interface
- Configuration Guide - Complete config reference
- Features Status - What's implemented
- Issues: https://github.com/pulldata/pulldata/issues
- Discussions: https://github.com/pulldata/pulldata/discussions
Status: Alpha - Active Development (~97% Feature Complete)
Last Updated: 2025-12-20