GovIntel

GovIntel is a local-first federal procurement intelligence system. It imports USAspending contract awards, stores them in PostgreSQL, builds a local Chroma retrieval index, and generates citation-grounded market intelligence briefs through a FastAPI service and Streamlit UI.

The default workflow is intentionally practical: public contract data, local storage, hybrid retrieval, SQL analytics, Gemini-backed report generation, and fail-closed citation validation.

Demo

Screenshot: GovIntel Streamlit UI showing a generated procurement intelligence brief.

Watch the walkthrough on YouTube or open the repository MP4.

The walkthrough shows the Streamlit UI as a user selects filters, enters a DHS cybersecurity market question, generates a grounded brief, and reviews the cited contract evidence.

Example Analysis Runs

Use the UI or the /api/v1/analyze endpoint for questions like:

| Question | Useful filters | What GovIntel returns |
| --- | --- | --- |
| Who are the top DHS cybersecurity contractors by total award value? | Agency DHS, NAICS 541512, 3 years | Ranked contractors, spend context, cited awards, and citation validation metadata. |
| Who leads DoD artificial intelligence and data platform awards? | Agency DoD, NAICS 541512, 3 years | Mission-data platform brief with leading vendors, spend signals, and cited awards. |
| Which GSA cloud marketplace awards indicate growing demand? | Agency GSA, NAICS 541512, 3 years | Cloud-demand brief with marketplace leaders, award evidence, and trace metadata. |

Capabilities

  • Async USAspending.gov award ingestion with pagination and idempotent PostgreSQL upserts.
  • Typed Pydantic models for awards, analysis requests, contractor summaries, retrieved evidence, and generated briefs.
  • Chroma-backed vector indexing with sentence-transformer embeddings.
  • BM25 keyword retrieval, vector retrieval, hybrid merge, and cross-encoder reranking.
  • SQL analytics for top contractors, quarterly spend trends, and market concentration.
  • Versioned prompt templates and structured JSON generation.
  • Citation validation that rejects unsupported citations before returning a brief.
  • FastAPI /api/v1/analyze endpoint with optional X-API-Key protection.
  • Streamlit UI for choosing filters, generating briefs, and inspecting cited contract evidence.
  • Docker Compose stack for PostgreSQL, the API, and the UI.
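
The fail-closed citation check listed above can be pictured with a minimal sketch. The `Brief` dataclass, `UnsupportedCitationError`, and `validate_citations` are hypothetical names chosen for illustration, not GovIntel's actual classes; the sketch assumes a brief carries a list of cited award IDs and that the retrieval step yields the set of award IDs it actually returned.

```python
from dataclasses import dataclass


class UnsupportedCitationError(ValueError):
    """Raised when a brief cites an award that is not in the retrieved evidence."""


@dataclass(frozen=True)
class Brief:
    # Hypothetical shape: the real brief model likely has more fields.
    summary: str
    cited_award_ids: tuple[str, ...]


def validate_citations(brief: Brief, retrieved_award_ids: set[str]) -> Brief:
    """Return the brief unchanged only if every citation is backed by evidence.

    Fail closed: any citation outside the retrieved set raises instead of
    being silently dropped, so an unsupported brief never reaches the caller.
    """
    unsupported = [
        award_id
        for award_id in brief.cited_award_ids
        if award_id not in retrieved_award_ids
    ]
    if unsupported:
        raise UnsupportedCitationError(
            f"Unsupported citations: {', '.join(unsupported)}"
        )
    return brief
```

Raising rather than filtering is the design point: a caller either gets a brief whose citations all resolve to retrieved awards, or gets an error it has to handle.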

Optional extension hooks are included for Langfuse tracing, Pinecone mirroring, Hugging Face-hosted generation against a separately served model, and offline training/evaluation utilities. They are not required for the core workflow.

Architecture

See docs/architecture.md for the technical deep dive.

flowchart LR
    A[USAspending.gov API] --> B[Ingestion CLI]
    B --> C[PostgreSQL contracts]
    C --> D[Indexing CLI]
    D --> E[Chroma vector index]
    C --> F[BM25 corpus]
    E --> G[Hybrid retrieval]
    F --> G
    G --> H[Reranker]
    C --> I[SQL analytics]
    H --> J[Prompt context]
    I --> J
    J --> K[Generation provider]
    K --> L[Structured brief draft]
    L --> M[Citation validation]
    M --> N[FastAPI + Streamlit]
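
The hybrid retrieval step merges a BM25 ranking with a vector ranking before reranking. This README does not say which merge strategy GovIntel uses, so the sketch below assumes reciprocal rank fusion (RRF), one common choice; the function name and the constant `k = 60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1). Documents found by both retrievers accumulate score from each.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical award IDs returned by each retriever, best match first.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
merged = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF works on ranks rather than raw scores, which avoids having to normalize BM25 scores against embedding similarities.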

Tech Stack

  • Python 3.10+
  • FastAPI, Uvicorn, and Streamlit
  • PostgreSQL 16, SQLAlchemy asyncio, and asyncpg
  • ChromaDB and sentence-transformers
  • rank-bm25 and cross-encoder reranking
  • Pydantic v2 and pydantic-settings
  • Jinja2 and PyYAML for prompt templates
  • pytest, pytest-cov, Ruff, and mypy

Repository Layout

docs/          Public architecture notes
eval/          Evaluation query fixtures and gold answers
prompts/       Versioned prompt templates
scripts/       Training helper entry points
src/govintel/
  api/          FastAPI app, routes, and dependencies
  ingestion/    USAspending import, loading, embeddings, and indexing
  retrieval/    BM25, vector search, hybrid retrieval, and reranking
  analysis/     SQL analytics for contractor rankings, trends, and HHI
  generation/   Prompt loading, LLM clients, reports, and citations
  frontend/     Streamlit app and API client helpers
  evaluation/   Offline evaluation runner and metrics
  training/     Offline synthetic-data and QLoRA utilities
  models.py     Shared domain models
tests/          Unit and integration-style test coverage
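
The analysis/ package reports market concentration as HHI (Herfindahl-Hirschman Index). As a rough illustration of the metric itself, the sketch below computes HHI from per-contractor award totals using the standard definition (sum of squared percentage shares, ranging from near 0 to 10,000). The function name and input shape are assumptions; the project computes this through SQL analytics, which may differ in detail.

```python
def herfindahl_hirschman_index(award_totals: dict[str, float]) -> float:
    """Compute HHI on the 0-10,000 scale from contractor award totals.

    Each contractor's share is its percentage of total award value;
    HHI is the sum of the squared shares.
    """
    total = sum(award_totals.values())
    if total <= 0:
        # No spend in scope: treat concentration as undefined and report 0.
        return 0.0
    return sum((amount / total * 100) ** 2 for amount in award_totals.values())


# Hypothetical totals: shares of 50%, 30%, and 20% give 2500 + 900 + 400.
example_hhi = herfindahl_hirschman_index({"A": 50.0, "B": 30.0, "C": 20.0})
```

A single contractor holding all awards yields the maximum of 10,000, while many small contractors push the value toward 0.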

Quick Start

Prerequisites

  • Python 3.10 or newer
  • Docker and Docker Compose
  • Network access to the public USAspending.gov API
  • Gemini API key for live brief generation

Install

git clone <repository-url>
cd GovtIntel
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Configure

cp .env.example .env

For the normal local workflow, set:

EXTERNAL_PROVIDERS_ENABLED=true
GENERATION_PROVIDER=gemini
GEMINI_API_KEY=<your-gemini-api-key>

The default ingestion scope imports a bounded USAspending award slice for NAICS 541512. Set APP_API_KEY if you want /api/v1/analyze to require an X-API-Key header.

Run Locally

Start PostgreSQL:

make db-up

Import contract data and build the retrieval index:

make db-seed
make index

Run the API:

make run

In a second shell, run the Streamlit UI:

make ui

Open the UI at http://127.0.0.1:8501. The API health check is available at http://127.0.0.1:8000/api/v1/health.

API Example

curl -X POST http://127.0.0.1:8000/api/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Who are the top DHS cybersecurity contractors?",
    "agency_filter": "DHS",
    "naics_filter": "541512",
    "date_range_years": 3,
    "generation_provider": "gemini"
  }'

When APP_API_KEY is set, include -H "X-API-Key: <key>".

Docker Compose

Build and start PostgreSQL, the API, and the Streamlit UI:

docker compose up --build

Then, in a second shell, seed and index contract data through the API image:

docker compose run --rm api python -m govintel.ingestion.bootstrap
docker compose run --rm api python -m govintel.ingestion.index

The UI runs at http://127.0.0.1:8501.

Evaluation And Quality Checks

Run the offline evaluation harness:

EXTERNAL_PROVIDERS_ENABLED=false \
python3 -m govintel.evaluation.run_ablation \
  --output eval/results/latest.json

Run the main quality checks:

pytest -q --cov=govintel --cov-report=term-missing --cov-fail-under=90
ruff check src/ tests/
mypy src/
docker compose config --no-interpolate

Optional Extensions

These paths are implemented but not needed for the default workflow:

  • Langfuse tracing with conservative prompt/context redaction.
  • Pinecone mirroring for managed vector search.
  • Hugging Face-hosted generation when HF_MODEL_ID points to a servable model or endpoint and HF_INFERENCE_ENABLED=true.
  • Offline synthetic training-data generation and QLoRA launcher utilities.

Generated training data, model artifacts, local eval results, and local notes are ignored by Git.

Data Model

The primary persisted table is contracts, keyed by award_id. Each row captures the recipient, awarding agency, award amount, performance dates, NAICS code, description, place of performance, and award type. Inserts use ON CONFLICT upserts so repeated ingestion runs refresh existing records without duplicating awards.
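
To illustrate why ON CONFLICT upserts make ingestion idempotent, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, since both accept this ON CONFLICT ... DO UPDATE form. Only `contracts` and `award_id` come from the description above; the other column names and the helper functions are hypothetical, and the real schema has more columns.

```python
import sqlite3

# Insert a new award, or refresh the existing row with the same award_id.
UPSERT_SQL = """
INSERT INTO contracts (award_id, recipient_name, award_amount)
VALUES (?, ?, ?)
ON CONFLICT (award_id) DO UPDATE SET
    recipient_name = excluded.recipient_name,
    award_amount = excluded.award_amount
"""


def create_demo_table(conn: sqlite3.Connection) -> None:
    """Create a reduced, illustrative version of the contracts table."""
    conn.execute(
        "CREATE TABLE contracts ("
        "award_id TEXT PRIMARY KEY, "
        "recipient_name TEXT NOT NULL, "
        "award_amount REAL NOT NULL)"
    )


def upsert_awards(
    conn: sqlite3.Connection, awards: list[tuple[str, str, float]]
) -> None:
    """Apply the upsert to each (award_id, recipient_name, award_amount) row."""
    conn.executemany(UPSERT_SQL, awards)
    conn.commit()
```

Running `upsert_awards` twice with the same `award_id` leaves one row holding the latest values, which is the behavior repeated ingestion runs rely on.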

License

MIT
