DocExtract AI

Extract structured data from unstructured documents in seconds -- not hours.

For Hiring Managers

30-second pitch: DocExtract is a production document-extraction RAG system with eval-gated CI, cost-aware routing, citation grounding, and a live demo. It turns messy PDFs into structured data while measuring quality, latency, and per-document cost.

Key proof: 95.5% accepted extraction F1 (combined F1 0.9555) over the 28-case extraction baseline (autoresearch/baseline.json). scripts/eval_offline_replay.py replays that baseline deterministically in CI at zero API cost, so the Eval Gate badge reflects a real, reproducible score. The full 72-case corpus (51 golden + 21 adversarial) is committed and re-measured end-to-end by scripts/benchmark.py. Cost (~$0.03/doc), p95 latency (~4.1s), and straight-through rate (~88%) are modeled from the in-repo pricing table and call distribution (docs/cost-model.md); scripts/benchmark.py reproduces them as metered numbers once an API budget is attached. Test suite: 1,280 collected tests (1,273 passing, 81% coverage). Live demo: docextract-demo.streamlit.app.

Eval rigor: a documented failure-mode taxonomy (what the system is designed to catch, the mitigation, and the next experiment); the offline replay gate (scripts/eval_offline_replay.py) fails CI on a >3-point F1 regression vs the committed baseline.
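
The regression gate described above can be sketched as a single comparison. This is a minimal illustration, not the code in scripts/eval_offline_replay.py: the function name `passes_gate`, the 0-1 F1 scale, and the small float tolerance are assumptions made for the example.

```python
# Minimal sketch of an F1 regression gate (hypothetical names throughout).
MAX_F1_DROP = 0.03   # "3 points" of F1, assuming scores on a 0-1 scale
TOLERANCE = 1e-9     # guards against float noise at the exact boundary


def passes_gate(baseline_f1: float, current_f1: float,
                max_drop: float = MAX_F1_DROP) -> bool:
    """Return True unless the score dropped by more than max_drop."""
    return (baseline_f1 - current_f1) <= max_drop + TOLERANCE


if __name__ == "__main__":
    baseline = 0.9555  # combined F1 quoted in this README
    for current in (0.9555, 0.93, 0.92):
        status = "pass" if passes_gate(baseline, current) else "fail"
        print(f"current={current:.4f} -> {status}")
```

In CI, a failing result would typically be turned into a non-zero exit code so the workflow step fails.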

Engineering signals: FastAPI, pgvector RAG, Claude Sonnet/Haiku routing, Gemini Flash LLM-as-judge, promptfoo CI gate, OpenTelemetry cost attribution, prompt caching, and async worker architecture.

Hiring fit: AI Engineer, LLM Evaluation Engineer, AI Backend Engineer, LLMOps Engineer.

Proof in 30 seconds -- 95.5% F1 (CI-replayed, zero-cost) | ~$0.03/doc (modeled) | ~4.1s p95 (modeled) | 1,280 tests, 81% cov | 72-case corpus | live demo

Metric	Value	Basis
Extraction accuracy (F1)	95.5%	Measured — CI-replayed from committed fixtures, zero API cost
Avg cost per document	~$0.03	Modeled — pricing table x call distribution (cost-model.md)
p95 end-to-end latency	~4.1s	Modeled — pending a metered `scripts/benchmark.py` run
Straight-through rate	~88%	Modeled — pending production traces
Test suite	1,280 collected tests	Measured — 1,273 passing, 81% coverage
Eval framework	LLM-as-judge + promptfoo CI gate + offline replay	Code: `scripts/eval_*.py`

Modeled cost attribution (~$0.03/doc — token pricing x call distribution; reproduce metered numbers with scripts/benchmark.py):

Stage	Model	Avg cost	Notes
Classification	Claude Haiku 4.5	$0.0003-$0.001	Haiku routing saves ~67% vs Sonnet; A/B z-test confirmed <2% quality loss
Extraction	Claude Sonnet 4.6	$0.004-$0.012	Prompt caching (ADR-0015) cuts repeat-call cost ~60% on runs >5 docs
Self-reflection (12% of docs)	Claude Sonnet 4.6	+20% base cost	Triggered by low-confidence threshold; 88% of docs skip this pass
LLM judge	Gemini 2.5 Flash	~$0.001	10% sampling; independent grader removes self-grading bias
Per-document total (avg)		~$0.03

Per-call token usage and cache hits are captured by app/services/llm_tracer.py and exported as OpenTelemetry metrics (prompt_cache_read_tokens_total, prompt_cache_creation_tokens_total, plus LLM latency/token counters) via app/observability.py. Per-request USD cost is then computed from those token counts by app/services/cost_tracker.py against the in-repo pricing table; methodology in docs/cost-model.md.
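
As a rough illustration of the last step, per-request cost can be derived from token counts and a pricing table as follows. The model names, the prices, and the `Usage` and `call_cost` names are placeholders invented for this sketch; the real table and logic live in docs/cost-model.md and app/services/cost_tracker.py and may differ.

```python
from dataclasses import dataclass

# Placeholder USD prices per million tokens. These are NOT real prices.
PRICING = {
    "example-large": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
    "example-small": {"input": 1.00, "output": 5.00, "cache_read": 0.10},
}


@dataclass
class Usage:
    """Token counts recorded for one LLM call (hypothetical shape)."""
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0


def call_cost(usage: Usage) -> float:
    """USD cost of one call: tokens times price, per million tokens."""
    price = PRICING[usage.model]
    total = (usage.input_tokens * price["input"]
             + usage.output_tokens * price["output"]
             + usage.cache_read_tokens * price["cache_read"])
    return total / 1_000_000


def request_cost(calls: list[Usage]) -> float:
    """USD cost of a request made up of several LLM calls."""
    return sum(call_cost(c) for c in calls)
```

Pricing cache reads separately from fresh input tokens is what lets prompt-cache hits show up as a lower per-request cost.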

Key features: instructor typed extraction with auto-retry, LLM-as-judge online quality scoring (10% sampling), hybrid RRF retrieval, vision extraction mode, business metrics API, 15-page Streamlit dashboard
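
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of how two ranked lists, for example one from BM25 and one from vector search, can be merged. The function name `rrf_fuse` and the constant k=60 (a commonly used default) are assumptions; the repository's retrieval code may use different values or weighting.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]
    vector_hits = ["b", "c", "d"]
    print(rrf_fuse([bm25_hits, vector_hits]))
```

Because RRF uses only ranks, it needs no score normalization between the lexical and vector retrievers.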

Best fit -- AI Engineer, AI Backend Engineer

Recent Engineering Decisions

Feature	ADR	Impact
Anthropic Prompt Caching	ADR-0015	~60% eval cost reduction; cache_creation_tokens tracked in OTel
Native Citations API	ADR-0016	Character-level grounding for extracted fields — cite the exact source span
Independent LLM Judge (Gemini)	ADR-0018	Eliminates self-grading bias; Gemini 2.5 Flash primary, Claude Haiku fallback
TF-IDF Reranker	ADR-0019	Replaces no-op stub; combines TF-IDF cosine + retrieval RRF score
Agentic Self-Reflection	ADR-0019	Low-confidence extractions trigger a reflection + revise pass

Detailed Hiring Evidence

If you're evaluating for...	Where to look	Training behind it
AI / ML Engineer	Agentic RAG ReAct loop (`app/services/agentic_rag.py`), RAGAS evaluation pipeline (`app/services/ragas_evaluator.py`), QLoRA fine-tuning pipeline (`scripts/train_qlora.py`) — training infrastructure ready, W&B experiment tracking, golden eval CI gate	IBM GenAI Engineering (144h), IBM RAG & Agentic AI (24h), DeepLearning.AI Deep Learning (120h)
Backend / Platform Engineer	Circuit breaker model fallback (`app/services/circuit_breaker.py`), async ARQ job queue (`worker/`), prompt versioning, eval CI, and sliding-window rate limiter	Microsoft AI & ML Engineering (75h), Google Cloud GenAI Leader (25h)
Full-Stack AI Engineer	15-page Streamlit dashboard (`frontend/`), SSE streaming progress, MCP tool server (`mcp_server.py`), interactive demo sandbox	IBM BI Analyst (141h), Google Data Analytics (181h), Microsoft Data Viz (87h)
MLOps / LLMOps Engineer	Prompt versioning + regression testing (`app/services/prompt_registry.py`), model A/B testing with z-test significance (`app/services/model_ab_test.py`), DeepEval CI gates, cost tracking per request	Duke LLMOps (48h), Google Advanced Data Analytics (200h)
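
The z-test significance check mentioned for model A/B testing could take several forms. One common choice, shown here as an assumption rather than as the method in app/services/model_ab_test.py, is a two-sided two-proportion z-test on pass/fail counts for each model.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference of two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-success or all-failure: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: model A passes 90/100 cases, model B passes 70/100.
    z, p = two_proportion_z_test(90, 100, 70, 100)
    print(f"z={z:.3f}, p={p:.5f}")
```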

→ Supporting background map: docs/certifications.md

Quickstart

git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env  # Add ANTHROPIC_API_KEY + GEMINI_API_KEY
docker compose up -d
open http://localhost:8501  # Streamlit UI

Services: API at :8000 (/docs for Swagger) | Frontend at :8501 | PostgreSQL :5432 | Redis :6379

Demo

Live demo: docextract-demo.streamlit.app. The first visit may take about 30 seconds to wake up. Pre-cached results are available for invoice, contract, and receipt extraction.

Local demo (no API key needed):

DEMO_MODE=true streamlit run frontend/app.py

Architecture

graph LR
  A[Browser / API Client] -->|POST /documents| B[FastAPI]
  B -->|enqueue| C[ARQ Worker]
  C -->|classify| D{Model Router}
  D -->|primary| E[Claude Sonnet]
  D -->|fallback| F[Claude Haiku]
  E -->|Pass 2: extract + correct| G[pgvector HNSW]
  B -->|SSE stream stages| A
  G -->|semantic search| B
  B -->|/metrics| H[Prometheus]
  D --- I[Circuit Breaker]
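
The router, fallback, and circuit breaker edges in the diagram can be illustrated with a small sketch. It assumes a simple consecutive-failure breaker with a timed reset; the class, the thresholds, and the `call_with_fallback` helper are hypothetical and are not taken from app/services/circuit_breaker.py.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock          # injectable for tests
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_fallback(breaker, primary, fallback, *args):
    """Try the primary model unless the breaker is open; otherwise fall back."""
    if breaker.allow():
        try:
            result = primary(*args)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return fallback(*args)
```

While the breaker is open, the primary model is skipped entirely, so a failing upstream does not add its latency to every request.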

Supported Models

Model	Provider	Env Var	Notes
`claude-sonnet-4-6`	Anthropic	`ANTHROPIC_API_KEY`	Default extraction model
`claude-haiku-4-5-20251001`	Anthropic	`ANTHROPIC_API_KEY`	Default classification + circuit breaker fallback
Gemini (embedding)	Google	`GEMINI_API_KEY`	Used for pgvector embeddings only

Screenshots

[Screenshots: Upload & Extraction; Extracted Records & ROI]

SSE Streaming Demo

Real-time progress: PREPROCESSING > EXTRACTING > CLASSIFYING > VALIDATING > EMBEDDING > COMPLETED
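
To show what such a progress stream looks like on the wire, here is a minimal sketch that formats each stage as a Server-Sent Events frame. The event name `stage` and the payload fields are assumptions for illustration; only the stage names come from the line above.

```python
import json
from typing import Iterator

STAGES = ["PREPROCESSING", "EXTRACTING", "CLASSIFYING",
          "VALIDATING", "EMBEDDING", "COMPLETED"]


def sse_frame(event: str, data: dict) -> str:
    """Format one SSE message: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stage_stream(job_id: str) -> Iterator[str]:
    """Yield one SSE frame per pipeline stage (hypothetical payload shape)."""
    for step, stage in enumerate(STAGES, start=1):
        yield sse_frame("stage", {"job_id": job_id, "stage": stage,
                                  "step": step, "of": len(STAGES)})


if __name__ == "__main__":
    for frame in stage_stream("job-123"):
        print(frame, end="")
```

A generator like this can be returned from a FastAPI endpoint through `StreamingResponse` with `media_type="text/event-stream"`.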

Key Capabilities

Extraction: Two-pass Claude pipeline (draft + verify via tool_use), 6 document types, 95.5% accepted extraction F1 baseline with a 72-case eval corpus (51 golden + 21 adversarial)
Search & RAG: pgvector semantic search (768-dim HNSW), hybrid BM25+RRF retrieval, agentic ReAct loop with 5 tools, map-reduce multi-document synthesis, semantic deduplication cache
Reliability: Circuit breaker (Sonnet to Haiku fallback), dead-letter queue, idempotent retries, HMAC-signed webhooks with 4-attempt retry, SHA-256 upload dedup
Observability: OpenTelemetry traces (Jaeger/Tempo), Prometheus metrics, Grafana dashboards, per-request cost tracking, structured logging
Developer Experience: SSE streaming progress, MCP server integration, prompt versioning (semver), model A/B testing (z-test), 19 ADRs, 81.59% latest local coverage with an 80% CI gate
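
The HMAC-signed webhook delivery listed under Reliability can be sketched with the standard library alone. The header name `X-Signature-SHA256`, the `send` callable, and the function names are assumptions for this example, and backoff between attempts is omitted for brevity.

```python
import hashlib
import hmac
from typing import Callable, Optional


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver-side check using a constant-time comparison."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


def deliver(send: Callable[[bytes, dict], bool], body: bytes, secret: bytes,
            max_attempts: int = 4) -> Optional[int]:
    """Try delivery up to max_attempts times.

    `send` is a hypothetical transport callable returning True on success.
    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    headers = {"X-Signature-SHA256": sign_payload(secret, body)}  # assumed header name
    for attempt in range(1, max_attempts + 1):
        if send(body, headers):
            return attempt
    return None
```

A delivery that returns None is the kind of case a dead-letter queue would hold for later inspection.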

Performance

Metric	Value
Document extraction (p50)	~8s (two-pass Claude)
SSE first token (p50)	<500ms
Semantic search (p95)	<100ms
Extraction accuracy (eval gate)	95.5% accepted F1 baseline (`autoresearch/baseline.json`, 28 scored cases)
Eval corpus	72 scored cases: 51 golden + 21 adversarial
Test suite	1,280 collected tests; latest local run: 1,273 passed, 5 skipped, 2 deselected
Coverage	81.59% latest local coverage; 80% CI gate

Evaluation Results

Current eval corpus: 72 scored cases, 51 golden + 21 adversarial (prompt injection, PII leak, hallucination bait). The accepted F1 baseline is stored in autoresearch/baseline.json, and quality checks are CI-enforced on every PR that touches prompts or extraction services via eval-gate.yml. Failure modes and next experiments are tracked in docs/eval-failure-analysis.md.

Document Type	F1 Score
Invoice	97.3%
Purchase Order	97.6%
Bank Statement	95.8%
Medical Record	99.2%
Receipt	91.1%
Identity Document	81.4%
Overall	95.5%

Baseline: autoresearch/baseline.json (28-case baseline: 16 golden + 12 adversarial, legacy runner).

# Full eval suite (Promptfoo + Ragas + LLM-judge, ~$0.44, ~4 min):
make eval

# Fast eval (Promptfoo only, ~$0.02, ~20s):
make eval-fast

For methodology details see docs/eval-methodology.md.

Project Structure

app/
  api/          -- FastAPI route modules (10 routers)
  auth/         -- API key auth + rate limiting middleware
  models/       -- SQLAlchemy models (8 tables)
  schemas/      -- Pydantic request/response schemas
  services/     -- Extraction, classification, embedding, validation
  storage/      -- Pluggable storage backends (local, R2)
  utils/        -- Hashing, MIME detection, token counting
worker/         -- ARQ async job processor
frontend/       -- Streamlit 15-page dashboard
alembic/        -- Database migrations (001-012)
scripts/        -- CLI tools: eval harness, training, seeding, Langfuse sync
tests/          -- Unit, integration, frontend, e2e, and load tests
evals/          -- Golden + adversarial eval corpus (72 scored cases)
prompts/        -- Versioned prompt templates with CHANGELOG

Architecture Decisions

19 Architecture Decision Records (ADRs) document the key design choices: docs/adr/

ADR	Decision
ADR-0001	ARQ over Celery for async job queue
ADR-0002	pgvector over Pinecone/Weaviate
ADR-0003	Two-pass Claude extraction with confidence gating
ADR-0006	Circuit breaker model fallback chain
ADR-0011	API key auth over OAuth/JWT
ADR-0012	Pluggable storage backend (Local/R2)
ADR-0015	Anthropic prompt caching — 60%+ eval cost reduction
ADR-0016	Native Citations API for character-level grounding
ADR-0017	Two-layer semantic cache (L1 exact hash + L2 embedding similarity)
ADR-0018	Gemini 2.5 as independent judge (eliminates self-grading bias)
ADR-0019	TF-IDF reranker + agentic self-reflection loop
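
To make the two-layer cache in ADR-0017 concrete, here is a minimal in-memory sketch: an exact-hash lookup first, then a nearest-neighbor check on embeddings. The class name, the 0.95 similarity threshold, the query normalization, and the injected `embed` callable are all assumptions; the ADR itself defines the actual design.

```python
import hashlib
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoLayerCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # hypothetical callable: str -> list[float]
        self.threshold = threshold  # assumed similarity cutoff for an L2 hit
        self.exact = {}             # L1: sha256(normalized query) -> answer
        self.semantic = []          # L2: list of (embedding, answer)

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def get(self, query: str):
        hit = self.exact.get(self._key(query))
        if hit is not None:
            return hit              # L1 hit: no embedding call needed
        vec = self.embed(query)
        best = max(self.semantic, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]          # L2 hit: a similar query was answered before
        return None

    def put(self, query: str, answer) -> None:
        self.exact[self._key(query)] = answer
        self.semantic.append((self.embed(query), answer))
```

Checking the exact layer first avoids an embedding call for repeated queries. A production version would likely replace the linear scan with a vector index.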

Production Readiness

Runs locally via Docker Compose. Reference Kubernetes and AWS Terraform configs are included for future deployment work, but the clearest production-facing proof here is the live demo, observability stack, and CI-enforced eval gate.

Cloud infrastructure: the reference configs live in deploy/aws/main.tf (AWS Terraform) and deploy/k8s/ (Kubernetes).

Document	Purpose
SLO Targets	Latency, availability, quality, cost targets
Common Failure Runbook	Circuit breaker, Redis, DB, queue, vector index recovery
Security Guide	API keys, webhooks, CORS, data handling
Compliance & Privacy	Privacy controls, PII handling notes, and compliance considerations
Architecture	Full system architecture overview
Case Study	Engineering journey from prototype to production
Demo Walkthrough	Five-minute live/local demo path for reviewers
Portfolio Metrics	Canonical source for hiring-facing metric claims
MCP Integration	Claude Desktop / agent framework setup
Cost Model	Token costs, per-document pricing, volume estimates
Certifications Applied	Supporting background mapped to implementation areas

Deployment

Primary (local / self-host): docker compose up -d — full API + worker + Streamlit frontend + Postgres + Redis stack.

Render (one-click demo): blueprint defined in render.yaml.

Reference Kubernetes (deploy/k8s/, kustomize), AWS Terraform (deploy/aws/), and Fly.io (fly.toml) manifests are committed for infrastructure direction — not the production-facing proof here (that is the live demo, observability stack, and CI eval gate). See deploy/ for full manifests.

Running Tests

pytest tests/ -v                      # Full suite (1,280 collected tests)
pytest tests/ -v --run-eval           # Include golden eval (requires API key)
python scripts/run_eval_ci.py --ci    # Deterministic eval (no API key)

Known Limitations

Tesseract degradation on handwriting: OCR accuracy drops significantly on handwritten documents. Set OCR_ENGINE=vision to route through Claude's vision API instead.
English-only extraction prompts: Non-English documents may extract with lower accuracy.

Contributing

See CONTRIBUTING.md for development setup, testing, and PR guidelines.

License

MIT
