red512/krai

krai — Long-Running Export/Import API

Async export/import API built with FastAPI + React frontend, deployed on GKE with Pub/Sub, GCS, and Firestore.

Directory Structure

krai/
├── .github/workflows/
│   ├── ci.yaml                # Lint (ruff) → Test (pytest) → Grype scan → Slack
│   └── cd.yaml                # Docker build → Artifact Registry → update krai-gitops → Slack
├── main.py                    # FastAPI API server (exports, imports, job status)
├── worker.py                  # Pub/Sub pull subscriber (processes jobs)
├── test_main.py               # Unit + integration tests (TestClient, mock mode)
├── scripts/
│   ├── test-api.sh            # E2E test script
│   └── manage-users.sh        # Firestore email allowlist management
├── requirements.txt           # Production dependencies
├── requirements-dev.txt       # Dev dependencies (pytest, ruff)
├── Dockerfile                 # Python 3.12-slim, non-root krai user
└── media/                     # Screenshots for README

Google Sign-In

Dashboard

Architecture

Runtime Architecture

graph LR
    User([User / Browser])
    Script([Scripts / CI])

    subgraph Auth
        Google[Google OAuth]
    end

    subgraph GKE Cluster
        FE[React Frontend]
        API[FastAPI API Pod]
        Worker[Worker Pod]
    end

    subgraph GCP
        FS[(Firestore)]
        PS[/Pub/Sub/]
        GCS[(Cloud Storage)]
    end

    User -->|Google Sign-In| Google
    Google -->|ID Token| FE
    User -->|Browse| FE
    FE -->|Bearer Token| API
    Script -->|x-api-key| API

    API -->|Create Job| FS
    API -->|Publish| PS
    PS -->|Pull| Worker
    Worker -->|Upload Result| GCS
    Worker -->|Update Status| FS
    API -->|Signed URL| GCS

Platform Components

graph TB
    subgraph GitHub
        BackendRepo[krai-backend]
        FrontendRepo[krai-frontend]
        GitOps[krai-gitops]
    end

    subgraph GKE Cluster
        ArgoCD[ArgoCD]
        ESO[ESO Operator]
        KEDA[KEDA Operator]
        API[API Pods<br/>HPA — CPU]
        Worker[Worker Pods<br/>KEDA — queue depth]
    end

    subgraph GCP
        SM[Secret Manager]
        PS[/Pub/Sub/]
    end

    BackendRepo -->|CI/CD pushes image tag| GitOps
    FrontendRepo -->|CI/CD pushes image tag| GitOps
    GitOps -->|Auto-sync| ArgoCD
    ArgoCD -->|Deploy| API
    ArgoCD -->|Deploy| Worker

    SM -->|API_KEY via Workload Identity| ESO
    ESO -->|K8s Secret| API
    ESO -->|K8s Secret| Worker

    PS -->|Queue depth| KEDA
    KEDA -->|Scale| Worker

Export Flow

sequenceDiagram
    participant UI as React Frontend
    participant API as FastAPI (GKE)
    participant FS as Firestore
    participant PS as Pub/Sub
    participant W as Worker (GKE)
    participant GCS as GCS

    UI->>+API: POST /api/v1/exports (Bearer token or x-api-key)
    API->>FS: Create job (PENDING)
    API->>PS: Publish message
    API-->>-UI: 202 {job_id}

    PS->>W: Pull message
    W->>GCS: Generate & upload data
    W->>FS: Update job (COMPLETED + signed URL)

    UI->>+API: GET /api/v1/jobs/{id}
    API->>FS: Read job
    API-->>-UI: 200 {status, progress}

    UI->>+API: GET /api/v1/jobs/{id}/result
    API->>FS: Read job
    API-->>-UI: 200 {download_url}

    UI->>GCS: Download via signed URL

Import Flow

sequenceDiagram
    participant UI as React Frontend
    participant API as FastAPI (GKE)
    participant FS as Firestore
    participant PS as Pub/Sub
    participant W as Worker (GKE)

    UI->>+API: POST /api/v1/imports (Bearer token or x-api-key)
    API->>FS: Create job (PENDING)
    API->>PS: Publish message
    API-->>-UI: 202 {job_id}

    PS->>W: Pull message
    W->>W: Fetch & process data from source
    W->>FS: Update job (COMPLETED + records_processed)

    UI->>+API: GET /api/v1/jobs/{id}
    API->>FS: Read job
    API-->>-UI: 200 {status, progress}

    UI->>+API: GET /api/v1/jobs/{id}/result
    API->>FS: Read job
    API-->>-UI: 200 {records_processed}
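
The export and import flows above share one lifecycle: the API records a PENDING job and publishes a message, and a worker later marks the job COMPLETED with either a download URL or a record count. The following is a minimal in-memory sketch of that lifecycle, not the code in main.py or worker.py. A `queue.Queue` stands in for Pub/Sub and a dict for Firestore; the function names (`create_job`, `run_worker_once`), the message fields, and the placeholder URL are all hypothetical.

```python
import queue
import uuid

jobs = {}                 # stand-in for the Firestore job documents
messages = queue.Queue()  # stand-in for the Pub/Sub topic and subscription


def create_job(kind):
    """API side: record a PENDING job, publish a message, return the job id."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"type": kind, "status": "PENDING"}
    messages.put({"job_id": job_id, "type": kind})
    return job_id


def run_worker_once():
    """Worker side: pull one message and mark its job COMPLETED."""
    msg = messages.get_nowait()
    job = jobs[msg["job_id"]]
    if msg["type"] == "export":
        # The real worker uploads to GCS; this URL is only a placeholder.
        job["download_url"] = "https://example.invalid/" + msg["job_id"]
    else:
        # The real worker processes source data; this count is made up.
        job["records_processed"] = 0
    job["status"] = "COMPLETED"


export_id = create_job("export")
import_id = create_job("import")
run_worker_once()
run_worker_once()
print(jobs[export_id]["status"], jobs[import_id]["status"])
```

Because the API returns as soon as the message is queued, the caller only ever sees the job id and has to poll for the outcome, which is why both diagrams end with status and result requests.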

Key Design Decisions

| Decision | Why |
|---|---|
| Signed URLs | API server never buffers 1-100MB files |
| Pub/Sub | Decouples API from workers, natural backpressure |
| Firestore | Serverless job tracking, no schema migrations |
| GKE + HPA | Auto-scales API pods on CPU |
| KEDA | Scales worker pods based on Pub/Sub queue depth (messages per worker) instead of CPU |
| Workload Identity | No static credentials (GCP's IRSA equivalent) |
| External Secrets Operator | API key synced from GCP Secret Manager, so no plaintext secrets in Helm values |
| Google OAuth + API Key | Dual auth: browser users sign in with Google, scripts use API key |
| Firestore email allowlist | OAuth users checked against allowed_emails collection, so access is managed without redeployment |
| Separate namespaces | krai-backend and krai-frontend deploy and scale independently |

Firestore Job Data

Worker Autoscaling (KEDA)

API pods scale on CPU via standard HPA — CPU correlates well with HTTP request load. Worker pods use KEDA to scale on Pub/Sub queue depth instead, because a worker could be idle-polling at low CPU while messages pile up.

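KEDA checks the krai-jobs-sub subscription every 15s and calculates: desired workers = ceil(undelivered messages / messagesPerWorker), with messagesPerWorker set to 5. The result is then clamped between minReplicaCount (1) and maxReplicaCount (3).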

| Messages in queue | Workers | Reason |
|---|---|---|
| 0 | 1 | minReplicaCount keeps at least 1 worker running |
| 1–5 | 1 | ≤ 5 / 5 = 1 worker needed |
| 6–10 | 2 | 6 / 5 = 1.2 → rounds up to 2 |
| 11–15 | 3 | 11 / 5 = 2.2 → rounds up to 3 |
| 16+ | 3 | Capped at maxReplicaCount: 3 |
| 0 (after burst) | 1 | Scales down after 60s cooldownPeriod |
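
The arithmetic in the table can be checked with a short sketch. It only reproduces the round-up-and-clamp calculation using the values stated above (5 messages per worker, minimum 1, maximum 3); it does not model polling intervals or cooldown timing, and the function name `desired_workers` is hypothetical.

```python
import math


def desired_workers(undelivered, per_worker=5, min_replicas=1, max_replicas=3):
    """Round the queue depth up to whole workers, then clamp to the replica bounds."""
    wanted = math.ceil(undelivered / per_worker)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 1, 5, 6, 10, 11, 15, 16, 100):
    print(depth, desired_workers(depth))
```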

Quick Start (Local)

# Backend (mock mode — no GCP credentials needed)
cd krai
pip install -r requirements.txt
python main.py

# In another terminal — E2E test
bash scripts/test-api.sh

# Frontend
cd krai-frontend
npm install
npm start

API Reference

All endpoints (except /healthz) require authentication: either x-api-key header or Authorization: Bearer <google-id-token>.

| Method | Path | Description | Response |
|---|---|---|---|
| GET | /healthz | Health check | 200 |
| POST | /api/v1/exports | Create export job | 202 {job_id, status} |
| POST | /api/v1/imports | Create import job | 202 {job_id, status} |
| GET | /api/v1/jobs/{id} | Poll job status | 200 {status, progress} |
| GET | /api/v1/jobs/{id}/result | Get result (when completed) | 200 {download_url} or {records_processed} |
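
A client typically creates a job, polls the status endpoint, and fetches the result once the job is completed. The sketch below shows one way to structure that loop. The helper names (`auth_headers`, `wait_for_result`) and the caller-supplied `get_json` function are assumptions for illustration and are not part of this repository; the header names, paths, and the COMPLETED status come from the sections above.

```python
import time


def auth_headers(api_key=None, id_token=None):
    """Build request headers for one of the two supported auth modes."""
    if api_key:
        return {"x-api-key": api_key}
    if id_token:
        return {"Authorization": "Bearer " + id_token}
    raise ValueError("provide api_key or id_token")


def wait_for_result(get_json, job_id, interval=1.0, max_polls=60):
    """Poll the job status endpoint, then fetch the result once completed.

    get_json(path) is supplied by the caller; it should perform an
    authenticated GET against the API and return the decoded JSON body.
    """
    for _ in range(max_polls):
        job = get_json("/api/v1/jobs/" + job_id)
        if job["status"] == "COMPLETED":
            return get_json("/api/v1/jobs/" + job_id + "/result")
        time.sleep(interval)
    raise TimeoutError("job did not complete in time")
```

Passing the HTTP call in as `get_json` keeps the loop independent of any particular HTTP library and makes it easy to exercise with a stub.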

Deploy to GKE

# 1. Provision infrastructure
cd krai-terraform
terraform init
terraform apply -var="project_id=YOUR_PROJECT" -var="api_key=YOUR_API_KEY"

# 2. ArgoCD auto-syncs all Helm charts from krai-gitops
#    - external-secrets     → external-secrets namespace (ESO operator)
#    - keda                 → keda namespace (KEDA operator)
#    - krai-helm-chart      → krai-backend namespace
#    - krai-frontend-chart  → krai-frontend namespace

ArgoCD Applications

Multi-Repo Layout

| Repo | Purpose |
|---|---|
| krai | Backend API + worker (FastAPI, Python) |
| krai-frontend | React frontend |
| krai-gitops | Helm charts + ArgoCD manifests (backend, frontend, KEDA, ESO) |
| krai-terraform | Terraform IaC (GKE, VPC, IAM, GCS, Pub/Sub, Firestore, Artifact Registry, Secret Manager, GitHub OIDC) |

GKE Namespace Layout

external-secrets namespace: ESO operator + webhook + cert-controller
keda namespace:             KEDA operator + metrics server
krai-backend namespace:     API pods + Worker pods (KEDA-scaled) + LoadBalancer Service + ExternalSecret
krai-frontend namespace:    React pods + LoadBalancer Service

CI/CD Pipeline

Each repo has its own GitHub Actions workflows:

| Repo | CI (ci.yaml) | CD (cd.yaml) |
|---|---|---|
| krai-backend | Lint (ruff) → Test (pytest) → Grype scan → Slack | Docker build → Push to Artifact Registry → Update image tag in krai-gitops → Slack |
| krai-frontend | Build → Grype scan → Slack | Docker build → Push to Artifact Registry → Update image tag in krai-gitops → Slack |
| krai-terraform | Checkov IaC security scan → Slack (checkov.yaml) | |
| krai-gitops | None | ArgoCD auto-syncs on push |

Slack Notifications

Grype CVE Scan

GitHub Actions authenticates to GCP via Workload Identity Federation (OIDC) — no static credentials. Terraform provisions the identity pool, provider, and a dedicated github-actions service account with artifactregistry.writer role only.

GitHub Secrets Required

Set these on krai-backend, krai-frontend, and krai-terraform repos:

| Secret | Repos | Source |
|---|---|---|
| GCP_WORKLOAD_IDENTITY_PROVIDER | backend, frontend | terraform output gcp_workload_identity_provider |
| GCP_SERVICE_ACCOUNT | backend, frontend | terraform output github_actions_service_account |
| GCP_PROJECT_ID | backend, frontend | Your GCP project ID |
| GITOPS_PAT | backend, frontend | GitHub PAT with repo scope for krai-gitops |
| SLACK_WEBHOOK_URL | all three | Slack incoming webhook URL for CI/CD notifications |

Image Tagging Strategy

Each push to main builds a Docker image tagged with the short git SHA and latest. The CD pipeline updates the Helm values in krai-gitops with the SHA tag. ArgoCD detects the commit and deploys the new image. Using SHA (not latest) ensures Kubernetes always pulls the correct version and provides an audit trail for rollbacks.
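
As a rough illustration of the tagging scheme, the sketch below derives the two image tags for a build and the single tag written to the GitOps repo. The 7-character short SHA length, the function names, and the example repository path are assumptions; the actual workflow in cd.yaml may differ.

```python
def image_tags(repository, git_sha, short_len=7):
    """Return the two tags pushed per build: the short SHA and latest."""
    short = git_sha[:short_len]  # short_len=7 is an assumed convention
    return [f"{repository}:{short}", f"{repository}:latest"]


def deploy_tag(git_sha, short_len=7):
    """Return the tag written to the Helm values: the SHA, never latest."""
    return git_sha[:short_len]


print(image_tags("example/krai-backend", "0123456789abcdef0123456789abcdef01234567"))
print(deploy_tag("0123456789abcdef0123456789abcdef01234567"))
```

Pinning the deployment to the SHA tag means every change to the Helm values is a distinct commit, so rolling back is a revert in krai-gitops.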

Testing

# Unit tests + lint
pip install -r requirements-dev.txt
pytest test_main.py -v
ruff check .

# E2E test (requires server running — see Quick Start)
bash scripts/test-api.sh                                          # local (default: localhost:8080)
bash scripts/test-api.sh http://localhost:8081 your-api-key       # against GKE via port-forward

User Management

When using Google OAuth, the backend checks the user's email against a Firestore allowed_emails collection. If the email is not in the collection, the request is rejected with 403. This acts as an allowlist — only explicitly approved users can access the API via OAuth. API key auth bypasses this check.

Manage the allowlist with the provided script (requires gcloud auth and GCP_PROJECT env var):

export GCP_PROJECT=your-project-id

# Add a user
bash scripts/manage-users.sh add user@example.com

# Remove a user
bash scripts/manage-users.sh remove user@example.com

# List all allowed users
bash scripts/manage-users.sh list
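
The access rule described in this section can be summarized as a small decision function. This is a sketch of the logic only: a plain set stands in for the Firestore allowed_emails collection, the function name `authorize` is hypothetical, and the 401 responses for a wrong or missing credential are assumptions (the text above only specifies the 403 case).

```python
def authorize(api_key=None, email=None, *, expected_key, allowed_emails):
    """Return an HTTP status code for the credentials on a request.

    API key auth bypasses the allowlist; OAuth users must be listed in it.
    """
    if api_key is not None:
        # 401 for a wrong key is an assumption, not taken from the source.
        return 200 if api_key == expected_key else 401
    if email is not None:
        # email is the address from an already verified Google ID token.
        return 200 if email in allowed_emails else 403
    return 401  # no credentials at all; status code is an assumption


allowed = {"user@example.com"}
print(authorize(email="user@example.com", expected_key="k", allowed_emails=allowed))
print(authorize(email="other@example.com", expected_key="k", allowed_emails=allowed))
print(authorize(api_key="k", expected_key="k", allowed_emails=set()))
```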

Security

  • Dual authentication: Google OAuth (browser) + API key (scripts) on all endpoints except /healthz
  • Firestore email allowlist: OAuth users checked against allowed_emails collection
  • External Secrets Operator: API key synced from GCP Secret Manager → K8s Secret (no plaintext in Helm values)
  • Rate limiting (100 req/15 min)
  • Signed URLs with 15-min TTL (via IAM signBlob API — compatible with Workload Identity, no SA key needed)
  • Non-root container, read-only filesystem (with /tmp emptyDir for GCS client), drop all capabilities
  • Workload Identity for GKE pods (no static GCP credentials)
  • Workload Identity Federation for CI/CD (GitHub OIDC, no static GCP credentials)
  • Separate service accounts: krai-app (application), krai-eso (ESO), keda-operator (KEDA), github-actions (CI/CD)
  • Private GCS bucket with uniform access control
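
The rate limit in the list above (100 requests per 15 minutes) could be enforced in several ways. The sketch below shows a sliding-window limiter as one possibility. It is not the implementation used in main.py; the class name, the per-client key, and the injectable clock are assumptions made for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key."""

    def __init__(self, limit=100, window=15 * 60, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.hits = {}      # client key -> deque of request timestamps

    def allow(self, key):
        now = self.clock()
        q = self.hits.setdefault(key, deque())
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

State held in process memory like this is per pod, so with several API replicas a shared store would be needed to enforce a single global limit.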

About

KRAI is a cloud-native async job platform for data imports/exports, with a FastAPI backend, React frontend, and GCP infrastructure (GKE, Pub/Sub, Firestore) managed via GitOps and Terraform.
