Turn Documents into Deliverables
PullData is a high-performance, text-based RAG (Retrieval-Augmented Generation) system for extracting structured data from documents and turning it into polished deliverables. Built to run efficiently on modest hardware (a Tesla P4 with 8GB of VRAM), it transforms PDFs into Excel spreadsheets, PowerPoint presentations, Markdown, JSON, and more.
- Web UI & REST API: Interactive web interface with FastAPI backend for easy document management
- Versatile Output Formats: Excel (.xlsx), PowerPoint (.pptx), Styled PDF, Markdown, JSON, LaTeX
- VP-Ready Styled PDFs: Three professional styles (Executive, Modernist, Academic) with chain-based LLM structuring
- Advanced Table Extraction: Preserves table structure and enables direct Excel export
- Flexible LLM Options: Local models OR OpenAI-compatible APIs (LM Studio, Ollama, OpenAI, Groq, etc.)
- Pluggable Storage: PostgreSQL + pgvector, SQLite (local), or ChromaDB
- Smart Caching: LLM output caching and embedding caching for blazing-fast repeated queries
- Differential Updates: Hash-based change detection avoids re-processing unchanged content
- Multi-Project Isolation: Manage multiple document collections with complete separation
- Hardware Optimized: Runs efficiently on Tesla P4 (8GB VRAM) with INT8 quantization
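Two of the features above, differential updates and query caching, come down to content hashing. A minimal sketch of hash-based change detection (illustrative only; the function names are hypothetical, not PullData internals):

```python
import hashlib

def chunk_hash(text: str) -> str:
    # SHA-256 fingerprint of a chunk's content
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_chunks(chunks: dict, stored: dict) -> list:
    # IDs of chunks whose hash differs from the stored one:
    # only these need re-chunking, re-embedding, and re-indexing.
    return [cid for cid, text in chunks.items()
            if stored.get(cid) != chunk_hash(text)]

# Hashes recorded at first ingest
stored = {"p1": chunk_hash("Q3 revenue was $4.2M."),
          "p2": chunk_hash("Costs fell 3%.")}
# Re-ingest after editing page 2
new = {"p1": "Q3 revenue was $4.2M.", "p2": "Costs fell 5%."}
print(changed_chunks(new, stored))  # → ['p2']
```

The same fingerprint idea serves as a cache key on the query side: identical query + context + model hashes to the same key, so the cached answer is returned without an LLM call.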
# Clone the repository
git clone https://github.com/pulldata/pulldata.git
cd pulldata
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies (includes Web UI components)
pip install -e .
# Verify installation (optional but recommended)
python verify_install.py
# Copy environment file
cp .env.example .env  # On Windows: copy .env.example .env

# Start the server
python run_server.py
# Open browser to http://localhost:8000/ui/
# - Upload documents
# - Query with natural language
# - Generate Styled PDFs with style selector (Executive, Modernist, Academic)
# - Export to Excel, PowerPoint, Markdown, and more
# - Download results instantly

See Web UI Guide for full documentation.
from pulldata import PullData
# Initialize with default local storage (SQLite + FAISS)
rag = PullData(project="financial_reports")
# Ingest documents
rag.ingest("./documents/*.pdf")
# Query and generate Excel output
result = rag.query(
    query="Extract Q3 revenue by region",
    output_format="excel"
)
# Save output
result.save("revenue_report.xlsx")

# Initialize a new project
pulldata init --project financial_reports --backend local
# Ingest documents
pulldata ingest --project financial_reports --path ./documents/
# Query with Excel output
pulldata query \
  --project financial_reports \
  --query "Extract Q3 revenue by region" \
  --output excel \
  --save revenue_report.xlsx
# Query with Markdown output
pulldata query \
  --project financial_reports \
  --query "Summarize key findings" \
  --output markdown \
  --save summary.md

┌─────────────────────────────────────────────┐
│             Document Ingestion              │
├─────────────────────────────────────────────┤
│ PDF → PyMuPDF (text) + pdfplumber (tables)  │
│                      ↓                      │
│ Semantic Chunking (512 tokens)              │
│                      ↓                      │
│ BGE Embeddings (384 dim)                    │
│                      ↓                      │
│ Storage Backend (Postgres/SQLite/Chroma)    │
│    + FAISS Index                            │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│               Query Pipeline                │
├─────────────────────────────────────────────┤
│ User Query → Embedding → FAISS Search       │
│ → Metadata Filtering → LLM Generation       │
│ → Format Synthesis → Output File            │
└─────────────────────────────────────────────┘
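The query pipeline above can be sketched end-to-end in a few lines. This toy version substitutes NumPy dot products for FAISS and a seeded pseudo-embedding for BGE, purely to show the embed → search → filter flow (all names are hypothetical):

```python
import hashlib
import numpy as np

def embed(text, dim=8):
    # Stand-in for real BGE embeddings: a deterministic pseudo-vector per text.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

chunks = [
    {"text": "Q3 revenue grew 12% in EMEA.", "meta": {"doc_type": "financial_report"}},
    {"text": "The office picnic is on Friday.", "meta": {"doc_type": "memo"}},
]
index = np.stack([embed(c["text"]) for c in chunks])  # stands in for the FAISS index

def search(query, top_k=2, doc_type=None):
    scores = index @ embed(query)        # cosine similarity (vectors are unit-norm)
    order = np.argsort(-scores)          # best matches first
    hits = [chunks[i] for i in order[:top_k]]
    if doc_type is not None:             # metadata filtering step
        hits = [h for h in hits if h["meta"]["doc_type"] == doc_type]
    return hits                          # these chunks become the LLM's context

hits = search("revenue by region", doc_type="financial_report")
```

In the real system the top-k hits, after filtering, are handed to the LLM along with the query, and the answer is synthesized into the requested output format.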
| Component | Technology | Size | VRAM |
|---|---|---|---|
| Embedder | BAAI/bge-small-en-v1.5 | 33M | ~0.5GB |
| LLM (Default) | Qwen/Qwen2.5-3B-Instruct | 3B | ~3GB (INT8) |
| LLM (API) | OpenAI-compatible APIs | - | - |
| Vector DB | FAISS + pgvector | - | - |
| Storage | PostgreSQL / SQLite / ChromaDB | - | - |
PullData supports two modes for language models:
Run models directly on your hardware using transformers:
- Qwen 2.5 3B (Default for P4)
- Qwen 2.5 7B (High-end GPUs)
- Llama 3.2 3B
- Phi-2
Use OpenAI-compatible API endpoints:
- LM Studio - Local API server with a UI (recommended if you have no GPU)
- Ollama - Easy local LLM runner
- OpenAI - GPT-3.5, GPT-4
- Groq - Ultra-fast inference
- Together AI - Fast open-source models
- vLLM - Self-hosted inference server
- Text Generation WebUI - Feature-rich local server
Switch with one config change:
# Local model
models:
  llm:
    provider: local

# API endpoint
models:
  llm:
    provider: api
    api:
      base_url: http://localhost:1234/v1  # LM Studio
      model: local-model

See API Configuration Guide for detailed setup.
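Since API mode speaks the OpenAI wire format, any of the servers above are drop-in replacements. As an illustration, this is the shape of the request body an OpenAI-compatible backend receives at POST {base_url}/chat/completions (the prompt contents here are hypothetical, not PullData's actual prompts):

```python
import json

# Request body for POST {base_url}/chat/completions, as an OpenAI-compatible
# server (LM Studio, Ollama, vLLM, ...) would receive it.
payload = {
    "model": "local-model",  # whatever model name the server exposes
    "messages": [
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: Extract Q3 revenue by region."},
    ],
    "temperature": 0.1,  # low temperature suits extraction tasks
}
body = json.dumps(payload)
```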
PullData uses YAML configuration files located in the configs/ directory:
- configs/default.yaml - Main configuration
- configs/models.yaml - Model presets for different hardware
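As a sketch of how the pieces fit together, a minimal configs/default.yaml could combine the model and storage fragments shown elsewhere in this README (illustrative only, not the shipped file):

```yaml
# Illustrative configs/default.yaml; keys mirror the fragments in this README
models:
  llm:
    provider: local        # or: api
storage:
  backend: local           # local | postgres | chromadb
  local:
    sqlite_path: ./data/pulldata.db
    faiss_index_path: ./data/faiss_indexes
```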
Zero-config, single-file storage. Perfect for development and small projects.
storage:
  backend: local
  local:
    sqlite_path: ./data/pulldata.db
    faiss_index_path: ./data/faiss_indexes

Production-ready with multi-user support and advanced querying.
storage:
  backend: postgres
  postgres:
    host: localhost
    port: 5432
    database: pulldata
    user: pulldata_user
    password: ${POSTGRES_PASSWORD}

Standalone vector database with built-in persistence.
storage:
  backend: chromadb
  chromadb:
    persist_directory: ./data/chroma_db

PullData detects unchanged content using SHA-256 hashing and only re-processes modified chunks:
# First ingest
rag.ingest("report.pdf") # Full processing
# Update document (only changed pages re-processed)
rag.ingest("report.pdf")  # Differential update

Repeated queries are served from cache instantly:
# Cache key: hash(query + context_ids + model)
result1 = rag.query("What's Q3 revenue?") # LLM call (2s)
result2 = rag.query("What's Q3 revenue?")  # Cache hit (0.01s)

results = rag.query(
    query="Revenue trends",
    filters={
        "doc_type": "financial_report",
        "date_range": ("2024-01-01", "2024-12-31"),
        "tags": ["quarterly", "audited"],
        "page_number": 5
    }
)

# Create separate projects for isolation
finance_rag = PullData(project="finance")
legal_rag = PullData(project="legal")
# Each has independent storage and indexes
finance_rag.ingest("financial_docs/")
legal_rag.ingest("legal_docs/")

Professional PDF reports with three distinct styles:
| Style | Description | Best For |
|---|---|---|
| Executive | Clean, corporate blue accents, gradient headers | Board presentations, stakeholder reports |
| Modernist | Bold dark theme, high contrast, tech aesthetic | Tech companies, modern brands |
| Academic | Classical serif fonts, scholarly formatting | Research, academic publications |
Features:
- Chain-based LLM structuring (optimized for small models 1.7B-7B)
- Key metrics dashboard with trend indicators
- Executive summary with drop-cap styling
- Actionable recommendations section
- Source references with relevance scores
- A4 format with print-optimized layouts
# Generate styled PDF
result = rag.query(
    query="Analyze Q3 performance",
    output_format="styled_pdf",
    pdf_style="executive"  # or "modernist", "academic"
)
result.save("quarterly_report.pdf")

- Preserves table structure from PDFs
- Automatic styling (headers, filters, freeze panes)
- Support for multiple sheets
- Formula support (coming soon)
- Template-based generation
- Automatic slide layouts
- Table and chart embedding
- Clean, readable format
- Automatic TOC generation
- Code highlighting
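A TOC like the one auto-generated for Markdown output can be derived from headings alone. A minimal sketch (GitHub-style anchor slugs assumed; not PullData's implementation):

```python
import re

def build_toc(markdown: str) -> str:
    # Build a nested bullet TOC from ATX headings (#, ##, ...),
    # linking each entry to a GitHub-style anchor slug.
    toc = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            level, title = len(m.group(1)), m.group(2).strip()
            slug = re.sub(r"[^a-z0-9 -]", "", title.lower()).replace(" ", "-")
            toc.append("  " * (level - 1) + f"- [{title}](#{slug})")
    return "\n".join(toc)

doc = "# Report\n## Q3 Revenue\nNumbers...\n## Key Findings\n"
print(build_toc(doc))
```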
- Structured data extraction
- Configurable schema
- Nested data support
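To make "configurable schema" concrete, here is one way schema-constrained extraction output could be validated; the schema format and field names below are hypothetical, not PullData's API:

```python
import json

# Hypothetical schema: required field -> expected Python type
schema = {"region": str, "revenue_musd": float, "quarter": str}

def validate(record: dict, schema: dict) -> bool:
    # A record passes if every schema field is present with the expected type.
    return all(isinstance(record.get(k), t) for k, t in schema.items())

# LLM output (as a JSON string) is parsed, then filtered against the schema
raw = '[{"region": "EMEA", "revenue_musd": 4.2, "quarter": "Q3"}]'
records = [r for r in json.loads(raw) if validate(r, schema)]
```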
- Academic paper formatting
- Math equation support
- Citation management (coming soon)
| Task | Performance | Hardware |
|---|---|---|
| Ingest (per page) | <5s | Tesla P4 |
| Query latency | <2s | Tesla P4 |
| Cache hit latency | <0.05s | Any |
| Table extraction accuracy | >90% | - |
pulldata/
├── configs/           # Configuration files
├── pulldata/
│   ├── core/          # Core data structures
│   ├── parsing/       # Document parsing
│   ├── embedding/     # Embedding generation
│   ├── storage/       # Storage backends
│   ├── retrieval/     # Vector search & filtering
│   ├── generation/    # LLM generation
│   ├── synthesis/     # Output format synthesis
│   ├── pipeline/      # End-to-end orchestration
│   └── cli/           # CLI interface
├── tests/             # Unit & integration tests
├── benchmarks/        # Performance benchmarks
├── examples/          # Usage examples
└── docs/              # Documentation
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=pulldata --cov-report=html

# Format code
black pulldata/
# Lint code
ruff check pulldata/
# Type checking
mypy pulldata/

- GPU: None (CPU mode supported)
- RAM: 8GB
- Storage: 5GB (models + data)
- GPU: Tesla P4 (8GB VRAM) or equivalent
- RAM: 16GB
- Storage: 20GB
- GPU: RTX 3090 (24GB) or better
- RAM: 32GB
- Storage: 50GB
- Project structure setup
- Core data structures
- Document parsing (PDF + tables)
- Embedding generation (local + API)
- Storage backends (PostgreSQL, SQLite)
- Vector search with FAISS
- LLM generation (local + API)
- Excel, Markdown, JSON, PowerPoint, PDF output
- CLI interface
- FastAPI REST API
- Web UI with file upload
- VP-Ready Styled PDFs (Executive, Modernist, Academic)
- Chain-based LLM structuring for small models
- ChromaDB backend integration
- LaTeX output formatter
- Streaming generation
- Authentication & authorization
- Reranking support
- Table embeddings
- Multi-modal support (images, charts)
- Advanced entity extraction
- Batch processing API
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
@software{pulldata2025,
  title  = {PullData: Turn Documents into Deliverables},
  author = {PullData Team},
  year   = {2025},
  url    = {https://github.com/pulldata/pulldata}
}

📚 Complete Documentation Index - All guides in one place
Quick links:
- Quick Start - Get started in 5 minutes
- Config Quick Start - Change settings (LM Studio, OpenAI, etc.)
- Web UI Guide - Using the Web interface
- Configuration Guide - Complete config reference
- Features Status - What's implemented
- Issues: https://github.com/pulldata/pulldata/issues
- Discussions: https://github.com/pulldata/pulldata/discussions
Status: Alpha - Active Development (~97% Feature Complete)
Last Updated: 2025-12-20