Extract structured data from unstructured documents in seconds -- not hours.
30-second pitch: DocExtract is a production document-extraction RAG system with eval-gated CI, cost-aware routing, citation grounding, and a live demo. It turns messy PDFs into structured data while measuring quality, latency, and per-document cost.
Key proof: 95.5% accepted extraction F1 over the 28-case extraction baseline (autoresearch/baseline.json), replayed deterministically in CI by scripts/eval_offline_replay.py at zero API cost, so the Eval Gate badge reflects a real, reproducible score (combined F1 0.9555). The full 72-case corpus (51 golden + 21 adversarial) is committed and re-measured end-to-end by scripts/benchmark.py. Cost (~$0.03/doc), p95 latency (~4.1s), and straight-through rate (~88%) are modeled from the in-repo pricing table and call distribution (docs/cost-model.md); scripts/benchmark.py reproduces them as metered numbers once an API budget is attached. Test suite: 1,280 collected tests (1,273 passing, 81% coverage). Live demo at docextract-demo.streamlit.app.
Eval rigor: a documented failure-mode taxonomy (what the system is designed to catch, the mitigation, and the next experiment); the offline replay gate (scripts/eval_offline_replay.py) fails CI on a >3-point F1 regression vs the committed baseline.
Engineering signals: FastAPI, pgvector RAG, Claude Sonnet/Haiku routing, Gemini Flash LLM-as-judge, promptfoo CI gate, OpenTelemetry cost attribution, prompt caching, and async worker architecture.
Hiring fit: AI Engineer, LLM Evaluation Engineer, AI Backend Engineer, LLMOps Engineer.
Proof in 30 seconds -- 95.5% F1 (CI-replayed, zero-cost) | ~$0.03/doc (modeled) | ~4.1s p95 (modeled) | 1,280 tests, 81% cov | 72-case corpus | live demo
| Metric | Value | Basis |
|---|---|---|
| Extraction accuracy (F1) | 95.5% | Measured — CI-replayed from committed fixtures, zero API cost |
| Avg cost per document | ~$0.03 | Modeled — pricing table x call distribution (cost-model.md) |
| p95 end-to-end latency | ~4.1s | Modeled — pending a metered scripts/benchmark.py run |
| Straight-through rate | ~88% | Modeled — pending production traces |
| Test suite | 1,280 collected tests | Measured — 1,273 passing, 81% coverage |
| Eval framework | LLM-as-judge + promptfoo CI gate + offline replay | Code: scripts/eval_*.py |
Modeled cost attribution (~$0.03/doc — token pricing x call distribution; reproduce metered numbers with scripts/benchmark.py):
| Stage | Model | Avg cost | Notes |
|---|---|---|---|
| Classification | Claude Haiku 4.5 | $0.0003-$0.001 | Haiku routing saves ~67% vs Sonnet; A/B z-test confirmed <2% quality loss |
| Extraction | Claude Sonnet 4.6 | $0.004-$0.012 | Prompt caching (ADR-0015) cuts repeat-call cost ~60% on runs >5 docs |
| Self-reflection (12% of docs) | Claude Sonnet 4.6 | +20% base cost | Triggered by low-confidence threshold; 88% of docs skip this pass |
| LLM judge | Gemini 2.5 Flash | ~$0.001 | 10% sampling; independent grader removes self-grading bias |
| Per-document total (avg) | | ~$0.03 | |
Per-call token usage and cache hits are captured by app/services/llm_tracer.py and exported as OpenTelemetry metrics (prompt_cache_read_tokens_total, prompt_cache_creation_tokens_total, plus LLM latency/token counters) via app/observability.py. Per-request USD cost is then computed from those token counts by app/services/cost_tracker.py against the in-repo pricing table; methodology in docs/cost-model.md.
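A minimal sketch of how a per-request USD figure can be derived from token counts and a pricing table. The model names, prices, and the `Usage` / `request_cost_usd` names are placeholders for illustration; they are not the contents of the in-repo pricing table or the actual interface of app/services/cost_tracker.py.

```python
from dataclasses import dataclass

# Illustrative USD prices per million tokens. These are placeholders,
# not the repository's pricing table or any provider's published rates.
PRICING_PER_MTOK = {
    "example-large": {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75},
    "example-small": {"input": 1.00, "output": 5.00, "cache_read": 0.10, "cache_write": 1.25},
}


@dataclass(frozen=True)
class Usage:
    """Token counts for one LLM call (hypothetical container)."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_creation_tokens: int = 0


def request_cost_usd(model: str, usage: Usage) -> float:
    """Price each token class separately, then convert from per-million to USD."""
    price = PRICING_PER_MTOK[model]
    total = (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
        + usage.cache_read_tokens * price["cache_read"]
        + usage.cache_creation_tokens * price["cache_write"]
    )
    return total / 1_000_000
```

Pricing cache reads and cache writes as their own token classes is what lets a cache hit show up as a lower per-request cost instead of being folded into ordinary input tokens.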
Key features: instructor typed extraction with auto-retry, LLM-as-judge online quality scoring (10% sampling), hybrid RRF retrieval, vision extraction mode, business metrics API, 15-page Streamlit dashboard
Best fit -- AI Engineer, AI Backend Engineer
| Feature | ADR | Impact |
|---|---|---|
| Anthropic Prompt Caching | ADR-0015 | ~60% eval cost reduction; cache_creation_tokens tracked in OTel |
| Native Citations API | ADR-0016 | Character-level grounding for extracted fields — cite the exact source span |
| Independent LLM Judge (Gemini) | ADR-0018 | Eliminates self-grading bias; Gemini 2.5 Flash primary, Claude Haiku fallback |
| TF-IDF Reranker | ADR-0019 | Replaces no-op stub; combines TF-IDF cosine + retrieval RRF score |
| Agentic Self-Reflection | ADR-0019 | Low-confidence extractions trigger a reflection + revise pass |
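To illustrate the reranker row above, here is a rough sketch of blending a TF-IDF cosine score with an upstream RRF score. The whitespace tokenizer, the smoothed IDF formula, the `alpha` weight, and the max-normalization of RRF scores are all assumptions made for this example; ADR-0019 may specify something different.

```python
import math
from collections import Counter


def _tokenize(text: str) -> list[str]:
    return text.lower().split()


def _tfidf(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def _cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query: str, candidates: list[tuple[str, str, float]], alpha: float = 0.5) -> list[str]:
    """Reorder (doc_id, text, rrf_score) candidates by a blended score.

    blended = alpha * tfidf_cosine + (1 - alpha) * normalized_rrf
    """
    docs = [_tokenize(text) for _, text, _ in candidates]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF computed over the candidate set only.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    query_vec = _tfidf(_tokenize(query), idf)
    max_rrf = max((score for _, _, score in candidates), default=0.0) or 1.0
    scored = []
    for (doc_id, _, rrf_score), tokens in zip(candidates, docs):
        similarity = _cosine(query_vec, _tfidf(tokens, idf))
        scored.append((alpha * similarity + (1 - alpha) * rrf_score / max_rrf, doc_id))
    return [doc_id for _, doc_id in sorted(scored, key=lambda item: item[0], reverse=True)]
```

With `alpha=0` the order falls back to the retrieval ranking, which makes the blend easy to A/B against the previous no-op behavior.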
| If you're evaluating for... | Where to look | Training behind it |
|---|---|---|
| AI / ML Engineer | Agentic RAG ReAct loop (app/services/agentic_rag.py), RAGAS evaluation pipeline (app/services/ragas_evaluator.py), QLoRA fine-tuning pipeline (scripts/train_qlora.py; training infrastructure ready), W&B experiment tracking, golden eval CI gate | IBM GenAI Engineering (144h), IBM RAG & Agentic AI (24h), DeepLearning.AI Deep Learning (120h) |
| Backend / Platform Engineer | Circuit breaker model fallback (app/services/circuit_breaker.py), async ARQ job queue (worker/), prompt versioning, eval CI, and sliding-window rate limiter | Microsoft AI & ML Engineering (75h), Google Cloud GenAI Leader (25h) |
| Full-Stack AI Engineer | 15-page Streamlit dashboard (frontend/), SSE streaming progress, MCP tool server (mcp_server.py), interactive demo sandbox | IBM BI Analyst (141h), Google Data Analytics (181h), Microsoft Data Viz (87h) |
| MLOps / LLMOps Engineer | Prompt versioning + regression testing (app/services/prompt_registry.py), model A/B testing with z-test significance (app/services/model_ab_test.py), DeepEval CI gates, cost tracking per request | Duke LLMOps (48h), Google Advanced Data Analytics (200h) |
→ Supporting background map: docs/certifications.md
git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env # Add ANTHROPIC_API_KEY + GEMINI_API_KEY
docker compose up -d
open http://localhost:8501 # Streamlit UI
Services: API at :8000 (/docs for Swagger) | Frontend at :8501 | PostgreSQL :5432 | Redis :6379
Hosted demo (docextract-demo.streamlit.app): the first visit may take about 30 seconds to wake up. Pre-cached results are available for invoice, contract, and receipt extraction.
Local demo (no API key needed):
DEMO_MODE=true streamlit run frontend/app.py
graph LR
A[Browser / API Client] -->|POST /documents| B[FastAPI]
B -->|enqueue| C[ARQ Worker]
C -->|classify| D{Model Router}
D -->|primary| E[Claude Sonnet]
D -->|fallback| F[Claude Haiku]
E -->|Pass 2: extract + correct| G[pgvector HNSW]
B -->|SSE stream stages| A
G -->|semantic search| B
B -->|/metrics| H[Prometheus]
D --- I[Circuit Breaker]
| Model | Provider | Env Var | Notes |
|---|---|---|---|
| claude-sonnet-4-6 | Anthropic | ANTHROPIC_API_KEY | Default extraction model |
| claude-haiku-4-5-20251001 | Anthropic | ANTHROPIC_API_KEY | Default classification + circuit breaker fallback |
| Gemini (embedding) | | GEMINI_API_KEY | Used for pgvector embeddings only |
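The "circuit breaker fallback" role in the table can be pictured with a small sketch like the one below. The failure threshold, cool-down, and the `CircuitBreaker` / `call_with_fallback` names are assumptions for illustration; the behavior of app/services/circuit_breaker.py may differ (for example, it may use a half-open probe state).

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, then closes again after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the breaker so the primary is tried again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_fallback(breaker: CircuitBreaker,
                       primary: Callable[[str], str],
                       fallback: Callable[[str], str],
                       prompt: str) -> str:
    """Try the primary model unless the breaker is open; otherwise use the fallback."""
    if not breaker.is_open():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)
```

While the breaker is open the primary is not called at all, which avoids paying latency for requests that are likely to fail.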
Screenshots: Upload & Extraction | Extracted Records & ROI
Real-time progress: PREPROCESSING > EXTRACTING > CLASSIFYING > VALIDATING > EMBEDDING > COMPLETED
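As a rough illustration of how those stages could be framed as Server-Sent Events, the sketch below emits one frame per stage. The event name `stage` and the payload fields are hypothetical; the API's actual event schema is not documented here.

```python
import json
from typing import Iterator

# Stage names in the order listed above.
STAGES = ["PREPROCESSING", "EXTRACTING", "CLASSIFYING", "VALIDATING", "EMBEDDING", "COMPLETED"]


def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stage_stream(document_id: str) -> Iterator[str]:
    """Yield one frame per pipeline stage (payload shape is an assumption)."""
    for step, stage in enumerate(STAGES, start=1):
        yield sse_event("stage", {
            "document_id": document_id,
            "stage": stage,
            "step": step,
            "of": len(STAGES),
        })
```

A client can read `step` and `of` to draw a progress bar without hard-coding the stage list.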
- Extraction: Two-pass Claude pipeline (draft + verify via tool_use), 6 document types, 95.5% accepted extraction F1 baseline with a 72-case eval corpus (51 golden + 21 adversarial)
- Search & RAG: pgvector semantic search (768-dim HNSW), hybrid BM25+RRF retrieval, agentic ReAct loop with 5 tools, map-reduce multi-document synthesis, semantic deduplication cache
- Reliability: Circuit breaker (Sonnet to Haiku fallback), dead-letter queue, idempotent retries, HMAC-signed webhooks with 4-attempt retry, SHA-256 upload dedup
- Observability: OpenTelemetry traces (Jaeger/Tempo), Prometheus metrics, Grafana dashboards, per-request cost tracking, structured logging
- Developer Experience: SSE streaming progress, MCP server integration, prompt versioning (semver), model A/B testing (z-test), 19 ADRs, 81.59% latest local coverage with an 80% CI gate
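The z-test mentioned for model A/B testing can be sketched as a standard two-proportion test. This is a generic textbook version, assuming each variant is summarized as successes out of trials; it is not taken from app/services/model_ab_test.py, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both variants share one success rate."""
    rate_a, rate_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # All trials succeeded or all failed: the rates are indistinguishable.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(90, 100, 60, 100)` compares a 90% and a 60% acceptance rate; a small p-value suggests the difference is unlikely to be noise.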
| Metric | Value |
|---|---|
| Document extraction (p50) | ~8s (two-pass Claude) |
| SSE first token (p50) | <500ms |
| Semantic search (p95) | <100ms |
| Extraction accuracy (eval gate) | 95.5% accepted F1 baseline (autoresearch/baseline.json, 28 scored cases) |
| Eval corpus | 72 scored cases: 51 golden + 21 adversarial |
| Test suite | 1,280 collected tests; latest local run: 1,273 passed, 5 skipped, 2 deselected |
| Coverage | 81.59% latest local coverage; 80% CI gate |
Current eval corpus: 72 scored cases, 51 golden + 21 adversarial (prompt injection, PII leak, hallucination bait). The accepted F1 baseline is stored in autoresearch/baseline.json, and quality checks are CI-enforced on every PR that touches prompts or extraction services via eval-gate.yml. Failure modes and next experiments are tracked in docs/eval-failure-analysis.md.
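A minimal sketch of the regression check behind such a gate, using the 3-point threshold stated for the offline replay gate. It assumes F1 is stored on a 0-1 scale and that the baseline exposes it under a key named `combined_f1`; that key is a hypothetical name, and the real schema of autoresearch/baseline.json may differ.

```python
import json


def f1_regressed(baseline_f1: float, current_f1: float, max_drop_points: float = 3.0) -> bool:
    """True when current F1 fell more than max_drop_points below the baseline.

    Scores are on a 0-1 scale, so one point equals 0.01.
    """
    drop_points = (baseline_f1 - current_f1) * 100.0
    return drop_points > max_drop_points


def gate_exit_code(baseline_json: str, current_f1: float) -> int:
    """Map the comparison to a process exit code: 0 passes CI, 1 fails it."""
    baseline = json.loads(baseline_json)
    # "combined_f1" is a hypothetical key used only for this sketch.
    return 1 if f1_regressed(baseline["combined_f1"], current_f1) else 0
```

Improvements never fail the gate, because only drops below the baseline are counted.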
| Document Type | F1 Score |
|---|---|
| Invoice | 97.3% |
| Purchase Order | 97.6% |
| Bank Statement | 95.8% |
| Medical Record | 99.2% |
| Receipt | 91.1% |
| Identity Document | 81.4% |
| Overall | 95.5% |
Baseline: autoresearch/baseline.json (28-case baseline: 16 golden + 12 adversarial, legacy runner).
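For readers unfamiliar with extraction F1, the sketch below shows one common way to compute it: exact-match comparison of extracted fields, micro-averaged over documents. This scoring rule is an assumption for illustration; the eval harness may match fields differently (for example, with normalization or partial credit).

```python
def field_counts(predicted: dict, expected: dict) -> tuple[int, int, int]:
    """Count true positives, false positives, and false negatives for one document."""
    tp = sum(1 for key, value in predicted.items() if key in expected and expected[key] == value)
    fp = len(predicted) - tp   # extracted fields that are wrong or unexpected
    fn = len(expected) - tp    # expected fields that are missing or wrong
    return tp, fp, fn


def micro_f1(pairs: list[tuple[dict, dict]]) -> float:
    """Pool counts across (predicted, expected) pairs, then compute a single F1."""
    tp = fp = fn = 0
    for predicted, expected in pairs:
        doc_tp, doc_fp, doc_fn = field_counts(predicted, expected)
        tp, fp, fn = tp + doc_tp, fp + doc_fp, fn + doc_fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under micro-averaging, document types with more fields or more cases carry more weight, so an overall score need not equal the plain mean of per-type scores.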
# Full eval suite (Promptfoo + Ragas + LLM-judge, ~$0.44, ~4 min):
make eval
# Fast eval (Promptfoo only, ~$0.02, ~20s):
make eval-fast
For methodology details see docs/eval-methodology.md.
app/
  api/ -- FastAPI route modules (10 routers)
  auth/ -- API key auth + rate limiting middleware
  models/ -- SQLAlchemy models (8 tables)
  schemas/ -- Pydantic request/response schemas
  services/ -- Extraction, classification, embedding, validation
  storage/ -- Pluggable storage backends (local, R2)
  utils/ -- Hashing, MIME detection, token counting
worker/ -- ARQ async job processor
frontend/ -- Streamlit 15-page dashboard
alembic/ -- Database migrations (001-012)
scripts/ -- CLI tools: eval harness, training, seeding, Langfuse sync
tests/ -- Unit, integration, frontend, e2e, and load tests
evals/ -- Golden + adversarial eval corpus (72 scored cases)
prompts/ -- Versioned prompt templates with CHANGELOG
19 Architecture Decision Records (ADRs) document the key design choices: docs/adr/
| ADR | Decision |
|---|---|
| ADR-0001 | ARQ over Celery for async job queue |
| ADR-0002 | pgvector over Pinecone/Weaviate |
| ADR-0003 | Two-pass Claude extraction with confidence gating |
| ADR-0006 | Circuit breaker model fallback chain |
| ADR-0011 | API key auth over OAuth/JWT |
| ADR-0012 | Pluggable storage backend (Local/R2) |
| ADR-0015 | Anthropic prompt caching — 60%+ eval cost reduction |
| ADR-0016 | Native Citations API for character-level grounding |
| ADR-0017 | Two-layer semantic cache (L1 exact hash + L2 embedding similarity) |
| ADR-0018 | Gemini 2.5 as independent judge (eliminates self-grading bias) |
| ADR-0019 | TF-IDF reranker + agentic self-reflection loop |
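The two-layer cache in ADR-0017 can be pictured roughly as follows: an exact-hash lookup first, then a nearest-neighbor check over stored embeddings. The class name, the 0.95 similarity threshold, the text normalization, and the linear scan are assumptions for this sketch; the ADR itself defines the real design.

```python
import hashlib
import math
from typing import Callable, Optional


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoLayerCache:
    """L1: exact hash of normalized text. L2: embedding similarity above a threshold."""

    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.95) -> None:
        self._embed = embed
        self._threshold = threshold
        self._exact: dict[str, str] = {}
        self._entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

    def get(self, text: str) -> Optional[str]:
        hit = self._exact.get(self._key(text))
        if hit is not None:
            return hit  # L1 hit: no embedding call needed
        query_vec = self._embed(text)
        best_score, best_value = 0.0, None
        for vec, value in self._entries:  # linear scan; fine for a sketch
            score = _cosine(query_vec, vec)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self._threshold else None

    def put(self, text: str, value: str) -> None:
        self._exact[self._key(text)] = value
        self._entries.append((self._embed(text), value))
```

Checking the hash layer first means repeated identical queries skip the embedding call entirely, and only near-duplicates pay for the similarity search.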
Runs locally via Docker Compose. Reference Kubernetes and AWS Terraform configs are included for future deployment work, but the clearest production-facing proof here is the live demo, observability stack, and CI-enforced eval gate.
Cloud infrastructure (deploy/aws/main.tf, deploy/k8s/): Reference AWS Terraform and Kubernetes configs are included for infrastructure direction, along with Docker Compose for local end-to-end runs.
| Document | Purpose |
|---|---|
| SLO Targets | Latency, availability, quality, cost targets |
| Common Failure Runbook | Circuit breaker, Redis, DB, queue, vector index recovery |
| Security Guide | API keys, webhooks, CORS, data handling |
| Compliance & Privacy | Privacy controls, PII handling notes, and compliance considerations |
| Architecture | Full system architecture overview |
| Case Study | Engineering journey from prototype to production |
| Demo Walkthrough | Five-minute live/local demo path for reviewers |
| Portfolio Metrics | Canonical source for hiring-facing metric claims |
| MCP Integration | Claude Desktop / agent framework setup |
| Cost Model | Token costs, per-document pricing, volume estimates |
| Certifications Applied | Supporting background mapped to implementation areas |
Primary (local / self-host): docker compose up -d — full API + worker + Streamlit frontend + Postgres + Redis stack.
Reference Kubernetes (deploy/k8s/, kustomize), AWS Terraform (deploy/aws/), and Fly.io (fly.toml) manifests are committed for infrastructure direction — not the production-facing proof here (that is the live demo, observability stack, and CI eval gate). See deploy/ for full manifests.
pytest tests/ -v # Full suite (1,280 collected tests)
pytest tests/ -v --run-eval # Include golden eval (requires API key)
python scripts/run_eval_ci.py --ci # Deterministic eval (no API key)
- Tesseract degradation on handwriting: OCR accuracy drops significantly on handwritten documents. Set OCR_ENGINE=vision to route through Claude's vision API instead.
- English-only extraction prompts: Non-English documents may extract with lower accuracy.
See CONTRIBUTING.md for development setup, testing, and PR guidelines.
MIT



