GovIntel is a local-first federal procurement intelligence system. It imports USAspending contract awards, stores them in PostgreSQL, builds a local Chroma retrieval index, and generates citation-grounded market intelligence briefs through a FastAPI service and Streamlit UI.
The default workflow is intentionally practical: public contract data, local storage, hybrid retrieval, SQL analytics, Gemini-backed report generation, and fail-closed citation validation.
Watch the walkthrough on YouTube or open the repository MP4.
The walkthrough shows the Streamlit UI as a user selects filters, enters a DHS cybersecurity market question, generates a grounded brief, and reviews the cited contract evidence.
Use the UI or the `/api/v1/analyze` endpoint for questions like:
| Question | Useful filters | What GovIntel returns |
|---|---|---|
| Who are the top DHS cybersecurity contractors by total award value? | Agency DHS, NAICS 541512, 3 years | Ranked contractors, spend context, cited awards, and citation validation metadata. |
| Who leads DoD artificial intelligence and data platform awards? | Agency DoD, NAICS 541512, 3 years | Mission-data platform brief with leading vendors, spend signals, and cited awards. |
| Which GSA cloud marketplace awards indicate growing demand? | Agency GSA, NAICS 541512, 3 years | Cloud-demand brief with marketplace leaders, award evidence, and trace metadata. |
- Async USAspending.gov award ingestion with pagination and idempotent PostgreSQL upserts.
- Typed Pydantic models for awards, analysis requests, contractor summaries, retrieved evidence, and generated briefs.
- Chroma-backed vector indexing with sentence-transformer embeddings.
- BM25 keyword retrieval, vector retrieval, hybrid merge, and cross-encoder reranking.
- SQL analytics for top contractors, quarterly spend trends, and market concentration.
- Versioned prompt templates and structured JSON generation.
- Citation validation that rejects unsupported citations before returning a brief.
- FastAPI `/api/v1/analyze` endpoint with optional `X-API-Key` protection.
- Streamlit UI for choosing filters, generating briefs, and inspecting cited contract evidence.
- Docker Compose stack for PostgreSQL, the API, and the UI.
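
The fail-closed citation check listed above can be pictured roughly as follows. This is a minimal sketch, not GovIntel's actual implementation: the `Brief` dataclass, `UnsupportedCitationError`, and `validate_citations` are hypothetical names, and the real models are Pydantic-based rather than dataclasses. The idea is that a brief is returned only when every cited award ID appears in the retrieved evidence; anything else raises instead of passing through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Hypothetical brief shape: summary text plus the award IDs it cites."""

    summary: str
    cited_award_ids: list[str]


class UnsupportedCitationError(ValueError):
    """Raised when a brief cites an award that was not in the retrieved evidence."""


def validate_citations(brief: Brief, retrieved_award_ids: set[str]) -> Brief:
    """Return the brief unchanged only if every citation is backed by evidence.

    Fail-closed: a brief with no citations, or with any citation outside the
    retrieved set, is rejected rather than returned with a warning.
    """
    if not brief.cited_award_ids:
        raise UnsupportedCitationError("brief contains no citations")
    unsupported = sorted(set(brief.cited_award_ids) - retrieved_award_ids)
    if unsupported:
        raise UnsupportedCitationError(f"unsupported citations: {unsupported}")
    return brief
```

Raising an exception, rather than silently dropping the bad citation, keeps a partially grounded brief from ever reaching the caller.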
Optional extension hooks are included for Langfuse tracing, Pinecone mirroring, Hugging Face-hosted generation against a separately served model, and offline training/evaluation utilities. They are not required for the core workflow.
See docs/architecture.md for the technical deep dive.

    flowchart LR
        A[USAspending.gov API] --> B[Ingestion CLI]
        B --> C[PostgreSQL contracts]
        C --> D[Indexing CLI]
        D --> E[Chroma vector index]
        C --> F[BM25 corpus]
        E --> G[Hybrid retrieval]
        F --> G
        G --> H[Reranker]
        C --> I[SQL analytics]
        H --> J[Prompt context]
        I --> J
        J --> K[Generation provider]
        K --> L[Structured brief draft]
        L --> M[Citation validation]
        M --> N[FastAPI + Streamlit]

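
The flowchart shows BM25 and vector results merging into a single hybrid ranking before reranking. The merge strategy is not specified here. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as GovIntel's actual method. The function name, the `k=60` constant, and the award IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in, so
    documents ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical award IDs returned by each retriever, best match first.
bm25_hits = ["AWD-A", "AWD-B", "AWD-C"]
vector_hits = ["AWD-B", "AWD-D", "AWD-A"]
merged = reciprocal_rank_fusion([bm25_hits, vector_hits])
# merged == ["AWD-B", "AWD-A", "AWD-D", "AWD-C"]
```

RRF uses only rank positions, so it avoids normalizing BM25 scores against embedding similarities, which live on different scales.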
- Python 3.10+
- FastAPI, Uvicorn, and Streamlit
- PostgreSQL 16, SQLAlchemy asyncio, and asyncpg
- ChromaDB and sentence-transformers
- rank-bm25 and cross-encoder reranking
- Pydantic v2 and pydantic-settings
- Jinja2 and PyYAML for prompt templates
- pytest, pytest-cov, Ruff, and mypy

    docs/                 Public architecture notes
    eval/                 Evaluation query fixtures and gold answers
    prompts/              Versioned prompt templates
    scripts/              Training helper entry points
    src/govintel/
      api/                FastAPI app, routes, and dependencies
      ingestion/          USAspending import, loading, embeddings, and indexing
      retrieval/          BM25, vector search, hybrid retrieval, and reranking
      analysis/           SQL analytics for contractor rankings, trends, and HHI
      generation/         Prompt loading, LLM clients, reports, and citations
      frontend/           Streamlit app and API client helpers
      evaluation/         Offline evaluation runner and metrics
      training/           Offline synthetic-data and QLoRA utilities
      models.py           Shared domain models
    tests/                Unit and integration-style test coverage

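
The `analysis/` package is described as covering market concentration via HHI (the Herfindahl-Hirschman Index). As a rough illustration of the metric itself, and not of the project's SQL, the sketch below computes HHI in plain Python from per-contractor award totals. The function name and the sample figures are hypothetical.

```python
def herfindahl_hirschman_index(award_totals: dict[str, float]) -> float:
    """Compute HHI as the sum of squared market shares, in percentage points.

    The result ranges from near 0 (highly fragmented market) up to
    10,000 (a single contractor holds the whole market).
    """
    total = sum(award_totals.values())
    if total <= 0:
        return 0.0  # No spend recorded, so concentration is undefined; report 0.
    return sum((amount / total * 100) ** 2 for amount in award_totals.values())


# Hypothetical award totals in dollars for three contractors.
sample = {"Vendor A": 50_000_000, "Vendor B": 30_000_000, "Vendor C": 20_000_000}
hhi = round(herfindahl_hirschman_index(sample), 1)
# Shares of 50%, 30%, and 20% give 2500 + 900 + 400, so hhi == 3800.0
```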
- Python 3.10 or newer
- Docker and Docker Compose
- Network access to the public USAspending.gov API
- Gemini API key for live brief generation

    git clone <repository-url>
    cd GovtIntel
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"
    cp .env.example .env

For the normal local workflow, set:

    EXTERNAL_PROVIDERS_ENABLED=true
    GENERATION_PROVIDER=gemini
    GEMINI_API_KEY=<your-gemini-api-key>

The default ingestion scope imports a bounded USAspending award slice for NAICS 541512. Set `APP_API_KEY` if you want `/api/v1/analyze` to require `X-API-Key`.

Start PostgreSQL:

    make db-up

Import contract data and build the retrieval index:

    make db-seed
    make index

Run the API:

    make run

In a second shell, run the Streamlit UI:

    make ui

Open the UI at http://127.0.0.1:8501. The API health check is available at http://127.0.0.1:8000/api/v1/health.

    curl -X POST http://127.0.0.1:8000/api/v1/analyze \
      -H "Content-Type: application/json" \
      -d '{
        "question": "Who are the top DHS cybersecurity contractors?",
        "agency_filter": "DHS",
        "naics_filter": "541512",
        "date_range_years": 3,
        "generation_provider": "gemini"
      }'

When `APP_API_KEY` is set, include `-H "X-API-Key: <key>"`.
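
The same request can be prepared from Python. The sketch below uses only the standard library and mirrors the curl payload field for field. The helper name `build_analyze_request` and its parameters are hypothetical. The response schema is not documented in this section, so the sketch stops at constructing the request.

```python
import json
import urllib.request
from typing import Optional


def build_analyze_request(
    question: str,
    agency: str,
    naics: str,
    years: int,
    base_url: str = "http://127.0.0.1:8000",
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for /api/v1/analyze without sending it."""
    payload = {
        "question": question,
        "agency_filter": agency,
        "naics_filter": naics,
        "date_range_years": years,
        "generation_provider": "gemini",
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with APP_API_KEY set.
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        f"{base_url}/api/v1/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_analyze_request(
    "Who are the top DHS cybersecurity contractors?", "DHS", "541512", 3
)
```

With the API running, passing `request` to `urllib.request.urlopen` and decoding the body with `json.load` should return the generated brief as a dictionary.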
Build and start PostgreSQL, the API, and the Streamlit UI:

    docker compose up --build

Then seed and index contract data through the API image:

    docker compose run --rm api python -m govintel.ingestion.bootstrap
    docker compose run --rm api python -m govintel.ingestion.index

The UI runs at http://127.0.0.1:8501.
Run the offline evaluation harness:

    EXTERNAL_PROVIDERS_ENABLED=false \
    python3 -m govintel.evaluation.run_ablation \
      --output eval/results/latest.json

Run the main quality checks:

    pytest -q --cov=govintel --cov-report=term-missing --cov-fail-under=90
    ruff check src/ tests/
    mypy src/
    docker compose config --no-interpolate

These paths are implemented but not needed for the default workflow:

- Langfuse tracing with conservative prompt/context redaction.
- Pinecone mirroring for managed vector search.
- Hugging Face hosted generation when `HF_MODEL_ID` points to a servable model or endpoint and `HF_INFERENCE_ENABLED=true`.
- Offline synthetic training-data generation and QLoRA launcher utilities.
Generated training data, model artifacts, local eval results, and local notes are ignored by Git.
The primary persisted table is `contracts`, keyed by `award_id`. Each row captures the recipient, awarding agency, award amount, performance dates, NAICS code, description, place of performance, and award type. Inserts use `ON CONFLICT` upserts so repeated ingestion runs refresh existing records without duplicating awards.
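
To make the upsert behavior concrete, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL (both accept the `ON CONFLICT ... DO UPDATE` form shown), and a cut-down table with hypothetical column names `recipient_name` and `award_amount`. The real schema has more columns and is managed through SQLAlchemy.

```python
import sqlite3

# Hypothetical, reduced version of the contracts table for illustration only.
CREATE_SQL = """
CREATE TABLE contracts (
    award_id       TEXT PRIMARY KEY,
    recipient_name TEXT NOT NULL,
    award_amount   REAL NOT NULL
)
"""

# Insert a new award, or refresh the existing row when award_id already exists.
UPSERT_SQL = """
INSERT INTO contracts (award_id, recipient_name, award_amount)
VALUES (:award_id, :recipient_name, :award_amount)
ON CONFLICT (award_id) DO UPDATE SET
    recipient_name = excluded.recipient_name,
    award_amount   = excluded.award_amount
"""


def upsert_awards(conn: sqlite3.Connection, awards: list[dict]) -> None:
    """Apply the upsert for each award inside a single transaction."""
    with conn:
        conn.executemany(UPSERT_SQL, awards)


def demo() -> list[tuple]:
    """Run two overlapping ingestion passes and return the final rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_SQL)
    first_run = [{"award_id": "AWD-1", "recipient_name": "Acme", "award_amount": 100.0}]
    second_run = [
        {"award_id": "AWD-1", "recipient_name": "Acme", "award_amount": 250.0},
        {"award_id": "AWD-2", "recipient_name": "Beta", "award_amount": 75.0},
    ]
    upsert_awards(conn, first_run)
    upsert_awards(conn, second_run)
    rows = conn.execute(
        "SELECT award_id, award_amount FROM contracts ORDER BY award_id"
    ).fetchall()
    conn.close()
    return rows
```

Calling `demo()` yields two rows, not three: `AWD-1` is refreshed to the later amount, and `AWD-2` is inserted once.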
MIT
