shiviancodes/mlops-finbert-aws

mlops-finbert-aws

Production-grade MLOps pipeline: FinBERT financial sentiment API deployed on AWS with containerised ONNX Runtime, Terraform-provisioned infrastructure, ECS orchestration, automated CI/CD, and full observability stack.


A FinBERT (ProsusAI/finbert) financial-sentiment inference service. PyTorch weights are exported to ONNX at build time and served through ONNX Runtime inside a hardened container. The model classifies text as positive / negative / neutral with per-label scores, and every prediction is persisted to PostgreSQL for retrieval and pagination. Infrastructure is fully described in Terraform; releases ship through a six-stage GitHub Actions pipeline.
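
The classification step described above boils down to a softmax over three logits followed by an argmax. The following is a minimal, dependency-free sketch of that post-processing, not the repository's actual code. The label order in `LABELS`, the function names, and the sample logits are assumptions for illustration; in practice the order should be read from the model's `id2label` configuration.

```python
import math

# Assumed label order (hypothetical); read id2label from the model config in practice.
LABELS = ("positive", "negative", "neutral")


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels=LABELS):
    """Return the top label, its confidence, and per-label scores rounded to 4 places."""
    probs = softmax(logits)
    scores = {label: round(p, 4) for label, p in zip(labels, probs)}
    label = max(scores, key=scores.get)
    return {"label": label, "scores": scores, "confidence": scores[label]}


# Made-up logits standing in for the ONNX Runtime output of one input text.
result = classify([3.2, -1.1, 0.4])
print(result["label"], result["scores"])
```

Rounding to four decimal places mirrors the precision of the stored score columns, so the API response and the database row can agree digit for digit.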


Architecture

flowchart LR
    Client["HTTPS Client"]

    subgraph AWS["AWS - us-east-1"]
        APIGW["API Gateway<br/>HTTP API · TLS termination"]

        subgraph Task["EC2 · ECS Task - FastAPI + ONNX Runtime"]
            API["FastAPI app<br/>uvicorn :8000"]
            ORT["ONNX Runtime<br/>FinBERT · CPUExecutionProvider"]
            API --> ORT
        end

        subgraph Data["Private Subnets"]
            RDS[("RDS PostgreSQL 16<br/>predictions")]
        end

        subgraph Obs["Observability"]
            CW["CloudWatch<br/>Logs · Metrics · Alarms"]
            SNS["SNS<br/>alert topic"]
            CW --> SNS
        end

        APIGW --> API
        API -->|"SQLAlchemy"| RDS
        API -->|"PutMetricData · JSON logs"| CW
    end

    subgraph CICD["CI/CD"]
        GH["GitHub<br/>push to main"]
        GHA["GitHub Actions<br/>lint · test · build"]
        ECR["Amazon ECR"]
        GH --> GHA --> ECR
    end

    Client -->|"HTTPS"| APIGW
    ECR -->|"image pull"| API
    GHA -->|"force-new-deployment"| API

Local Kubernetes: a parallel k8s/ manifest set runs the same image on minikube (Deployment, Service, HPA, in-cluster Postgres). This is a local orchestration demonstration and is not part of the AWS production path.


Key Engineering Decisions

| Decision | Problem | Choice | Consequence |
| --- | --- | --- | --- |
| ONNX Runtime over PyTorch for inference | PyTorch is a heavy training framework; carrying it into the serving image inflates memory and image size on a small instance. | Export the trained weights to ONNX at build time and serve with ONNX Runtime; torch is installed only to export, then uninstalled. | The runtime image ships zero training dependencies, so the inference footprint stays small enough to run comfortably on a t3-class host. |
| ECS on EC2 over Fargate | FinBERT's load-time memory spike can exceed the container's working set on a small instance. | Run the ECS task with the EC2 launch type on a self-managed host so the host can be configured with 2 GB of swap. | Host-level control over swap and instance sizing keeps the model loadable and cost predictable, at the price of managing the EC2 host. |
| PostgreSQL over DynamoDB | Predictions need ordered pagination, relational integrity, and exact score precision. | Use RDS PostgreSQL with DECIMAL(6,4) score columns and timestamp-based cursor pagination. | Clean keyset pagination and lossless score storage, at the cost of running a managed relational instance. |
| Synchronous inference over an async queue | Inference could be decoupled behind SQS and a worker pool. | Keep prediction a single blocking request/response. | p99 latency stays well under 400 ms and the system has no queue/worker operational surface; revisit only if throughput demands batching. |
| API Gateway over direct EC2 exposure | Exposing the app port directly makes the instance the public contract and leaves TLS termination to the app. | Front the service with an API Gateway HTTP API integrating to the Elastic IP. | Managed HTTPS/TLS termination and a stable public front door, decoupled from the underlying host. |
| SSM Parameter Store over Secrets Manager | The API key and DB URL are static secrets that never rotate. | Store them as SSM SecureString parameters injected into the task as secrets. | Encrypted secret delivery at lower cost, without paying for rotation machinery that is not needed. |
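
To illustrate the "exact score precision" point behind the DECIMAL(6,4) choice, here is a small sketch using Python's `decimal` module. The helper name `to_score` is hypothetical, and the rounding shown is Python's default for `quantize`, which is not necessarily what the database applies on write.

```python
from decimal import Decimal

FOUR_PLACES = Decimal("0.0001")  # scale of 4, matching DECIMAL(6,4)


def to_score(value: float) -> Decimal:
    """Quantize a float probability to four decimal places."""
    # Going through str() keeps binary float noise out of the Decimal.
    return Decimal(str(value)).quantize(FOUR_PLACES)


print(0.1 + 0.2)                       # binary floats accumulate representation error
print(to_score(0.1) + to_score(0.2))   # fixed-scale decimals add exactly
print(to_score(0.95123456))
```

Storing scores at a fixed scale means a value read back from the table compares equal to the value that was written, which is harder to guarantee with floating-point columns.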

CI/CD Pipeline

Defined in .github/workflows/deploy.yml. Triggered on push to main, but only when relevant paths change (app/**, tests/**, migrations/**, Dockerfile, pyproject.toml, alembic.ini, the workflow itself) - documentation and infra edits don't burn a deploy. AWS access uses OIDC role assumption, so no long-lived credentials are stored in GitHub.

flowchart LR
    L["lint<br/>ruff · black"]
    U["unit-tests<br/>pytest tests/unit"]
    I["integration-tests<br/>pytest tests/integration"]
    B["build-and-push<br/>Docker → ECR"]
    D["deploy<br/>ECS force-new-deployment"]
    S["smoke-test<br/>/ready · /predict · /health"]

    L --> U
    L --> I
    U --> B
    I --> B
    B --> D --> S
| Stage | What it does | Why it exists |
| --- | --- | --- |
| lint | `ruff check .` and `black --check .` | Fail fast on style/format before spending compute on tests. |
| unit-tests | `pytest tests/unit/` against a stubbed model. | Validate API contracts and business logic in isolation. |
| integration-tests | Spins up Postgres, runs migrations, exercises the real predict→persist flow. | Catch schema, ORM, and serialisation regressions end to end. |
| build-and-push | Multi-stage Docker build, tags `:latest` and `:<sha>`, pushes to ECR. | Produce the immutable, ONNX-baked image that gets deployed. |
| deploy | `aws ecs update-service --force-new-deployment`, then waits for service stability. | Roll the new image onto the cluster and block until healthy. |
| smoke-test | Polls /ready for model_loaded == true, then hits /predict and /health. | Prove the live deployment actually serves before the run goes green. |
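
The stage dependencies in the flowchart form a small DAG. As a rough illustration of how the ordering falls out of those edges, the sketch below feeds them to the standard-library `graphlib` module and groups stages into waves that could run in parallel. The `NEEDS` mapping and the `waves` helper are illustrative names, not part of the workflow file.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it waits on (edges taken from the flowchart).
NEEDS = {
    "lint": set(),
    "unit-tests": {"lint"},
    "integration-tests": {"lint"},
    "build-and-push": {"unit-tests", "integration-tests"},
    "deploy": {"build-and-push"},
    "smoke-test": {"deploy"},
}


def waves(needs):
    """Group stages into waves; stages in the same wave have no mutual dependency."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for number, stages in enumerate(waves(NEEDS), start=1):
    print(number, stages)
```

The second wave contains both test jobs, which matches the diagram: unit and integration tests fan out from lint and both must pass before the image is built.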

Model-file caching. The ONNX export is expensive, so the integration job caches finbert.onnx, finbert.onnx.data, and tokenizer/ (GitHub Actions cache, key finbert-onnx-v1). On a cache miss it extracts the artefacts directly from the latest ECR image rather than re-exporting.

Readiness polling. The smoke test does not assume the deployment is warm - it polls the /ready endpoint (which reports model_loaded and db_connected) on a fixed interval until the model is actually loaded, then runs assertions.
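
The polling behaviour described above can be sketched as a bounded retry loop. This is an assumption-laden illustration rather than the workflow's actual script: `wait_until_ready`, its attempt count, and its interval are hypothetical, and `fetch_status` stands in for whatever HTTP call returns the parsed /ready body.

```python
import time


def wait_until_ready(fetch_status, attempts=30, interval_s=10.0, sleep=time.sleep):
    """Poll fetch_status() until it reports model_loaded, or give up after `attempts` tries.

    fetch_status is any callable returning the parsed /ready JSON body as a dict,
    or raising OSError while the service is unreachable.
    """
    for attempt in range(1, attempts + 1):
        try:
            status = fetch_status()
        except OSError:
            status = {}  # not reachable yet; treat as not ready
        if status.get("model_loaded") is True:
            return attempt
        if attempt < attempts:
            sleep(interval_s)
    raise TimeoutError(f"service not ready after {attempts} attempts")


# Simulated /ready responses: the model finishes loading on the third poll.
responses = iter([
    {"model_loaded": False, "db_connected": True},
    {"model_loaded": False, "db_connected": True},
    {"model_loaded": True, "db_connected": True},
])
polls = wait_until_ready(lambda: next(responses), interval_s=0.0)
print(f"ready after {polls} polls")
```

Bounding the number of attempts matters in CI: a deployment that never becomes ready should fail the run rather than hang until the job-level timeout.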


API Reference

All /api/v1 routes require an X-API-Key header. Replace the base URL with your deployed stage:

BASE_URL=https://<api-id>.execute-api.us-east-1.amazonaws.com/prod

POST /api/v1/predict

Classify a piece of financial text and persist the result.

curl -X POST "$BASE_URL/api/v1/predict" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{
    "text": "The company beat earnings expectations and raised full-year guidance.",
    "source": "earnings-call"
  }'
{
  "request_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "text": "The company beat earnings expectations and raised full-year guidance.",
  "label": "positive",
  "scores": {
    "positive": 0.9512,
    "negative": 0.0121,
    "neutral": 0.0367
  },
  "confidence": 0.9512,
  "model_version": "ProsusAI/finbert",
  "latency_ms": 38,
  "created_at": "2026-05-29T10:14:22.481Z"
}

text is required (1–2000 chars); source is optional (≤100 chars). If the prediction succeeds but the database write fails, the response is still returned with an X-Storage-Warning: write_failed header and a synthetic request_id.
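
The degraded-write behaviour can be pictured as follows. This is a simplified sketch, not the handler in `app/api/predict.py`: `predict_and_store`, `classify`, and `save` are hypothetical stand-ins, the synthetic id is assumed to be a random UUID, and any retry before giving up is omitted.

```python
import uuid


def predict_and_store(text, classify, save):
    """Run inference, then try to persist; a failed write degrades instead of failing.

    classify(text) returns the prediction fields; save(record) returns the stored
    request_id. Both are hypothetical callables for the model and database layers.
    """
    body = {"text": text, **classify(text)}
    headers = {}
    try:
        body["request_id"] = save(body)
    except Exception:
        # The prediction itself is valid, so return it with a synthetic id and a warning.
        body["request_id"] = str(uuid.uuid4())
        headers["X-Storage-Warning"] = "write_failed"
    return body, headers


def failing_save(record):
    raise ConnectionError("database unavailable")


body, headers = predict_and_store(
    "Margins narrowed on higher input costs.",
    classify=lambda text: {"label": "negative", "confidence": 0.9},
    save=failing_save,
)
print(headers, body["request_id"])
```

A client that needs to fetch the prediction later should check for the X-Storage-Warning header, since a synthetic request_id will not resolve through GET /api/v1/predictions/{request_id}.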

GET /api/v1/predictions/{request_id}

Retrieve a single stored prediction by id.

curl "$BASE_URL/api/v1/predictions/7c9e6679-7425-40de-944b-e07fc1f90ae7" \
  -H "X-API-Key: $API_KEY"

Returns the same shape as POST /api/v1/predict, or a 404 with:

{ "detail": "Prediction not found" }

GET /api/v1/predictions

List stored predictions, newest first, with keyset (cursor) pagination.

curl "$BASE_URL/api/v1/predictions?limit=20&label=negative" \
  -H "X-API-Key: $API_KEY"

Query params: limit (default 20, max 100), cursor (a request_id to page from, typically the next_cursor of the previous page), label (filter by positive / negative / neutral).

{
  "items": [
    {
      "request_id": "9f1b...",
      "text": "Quarterly revenue fell short of analyst estimates.",
      "label": "negative",
      "scores": { "positive": 0.0204, "negative": 0.9433, "neutral": 0.0363 },
      "confidence": 0.9433,
      "model_version": "ProsusAI/finbert",
      "latency_ms": 41,
      "created_at": "2026-05-29T09:58:03.112Z"
    }
  ],
  "next_cursor": "9f1b...",
  "total_count": 1284
}
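
To make the cursor semantics of the listing endpoint concrete, here is an in-memory sketch of keyset pagination. It assumes the cursor row is resolved to its `created_at` value and that ties are broken on `request_id`; the function names and the tie-break are illustrative choices, not confirmed details of the service's query.

```python
def sort_key(row):
    # Tie-break on request_id so rows sharing a timestamp keep a stable order.
    return (row["created_at"], row["request_id"])


def list_predictions(rows, limit=20, cursor=None, label=None):
    """Return one page of rows, newest first, plus the cursor for the next page."""
    limit = max(1, min(limit, 100))  # default 20, capped at 100
    if label is not None:
        rows = [r for r in rows if r["label"] == label]
    ordered = sorted(rows, key=sort_key, reverse=True)
    if cursor is not None:
        anchor = next((r for r in ordered if r["request_id"] == cursor), None)
        if anchor is None:
            raise ValueError(f"unknown cursor: {cursor}")
        # Keyset step: keep only rows strictly older than the cursor row.
        ordered = [r for r in ordered if sort_key(r) < sort_key(anchor)]
    page = ordered[:limit]
    next_cursor = page[-1]["request_id"] if len(ordered) > limit else None
    return {"items": page, "next_cursor": next_cursor, "total_count": len(rows)}


# Five fake rows, one second apart.
ROWS = [
    {"request_id": f"id-{i}", "label": "negative", "created_at": f"2026-05-29T10:00:0{i}Z"}
    for i in range(5)
]

cursor = None
while True:
    page = list_predictions(ROWS, limit=2, cursor=cursor)
    print([r["request_id"] for r in page["items"]], page["next_cursor"])
    cursor = page["next_cursor"]
    if cursor is None:
        break
```

Unlike OFFSET-based paging, each page is defined relative to the last row seen, so new predictions arriving between requests do not shift or duplicate items.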

Operational endpoints

| Endpoint | Purpose |
| --- | --- |
| GET /health | Liveness - returns 200 if the process is up. |
| GET /ready | Readiness - reports model_loaded and db_connected; 503 until both are true. |
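
The readiness rule in the table reduces to a single boolean condition. A minimal sketch, with `readiness` as a hypothetical helper name and the response body limited to the two fields named above:

```python
def readiness(model_loaded: bool, db_connected: bool):
    """Return (status_code, body) for /ready: 200 only when both checks pass."""
    body = {"model_loaded": model_loaded, "db_connected": db_connected}
    status = 200 if model_loaded and db_connected else 503
    return status, body


print(readiness(model_loaded=True, db_connected=False))
print(readiness(model_loaded=True, db_connected=True))
```

Keeping liveness and readiness separate lets an orchestrator keep a slow-loading process alive while still withholding traffic from it.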

Local Development

Prerequisites: Docker and Docker Compose.

cp .env.example .env          # set DATABASE_URL and API_KEY
docker compose up --build

The app container's entrypoint runs alembic upgrade head before starting uvicorn, so the schema is migrated automatically. The API is then available at http://localhost:8000.

Environment variables (see .env.example):

| Variable | Description |
| --- | --- |
| DATABASE_URL | SQLAlchemy/psycopg connection string for PostgreSQL. |
| API_KEY | Value clients must send in the X-API-Key header. |

Compose also accepts ONNX_MODEL_PATH and TOKENIZER_PATH; these default to the paths baked into the image and rarely need overriding.


Kubernetes (Local)

The k8s/ manifests run the same image on minikube for local orchestration. This is a demonstration of Kubernetes patterns - it is not the production deployment (production runs on ECS/EC2).

minikube start
eval $(minikube docker-env)            # build into minikube's daemon
docker build -t mlops-finbert-aws:local .

kubectl apply -f k8s/00-namespace.yaml
kubectl apply -f k8s/configmap.yaml -f k8s/secret.yaml
kubectl apply -f k8s/postgres.yaml
kubectl apply -f k8s/deployment.yaml -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml

The Deployment uses an init container that blocks until Postgres accepts connections, with liveness (/health) and readiness (/ready) probes. The HPA scales 1 → 3 replicas on 70% average CPU.
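
For intuition on the 1 → 3 scaling range, the Horizontal Pod Autoscaler computes its target as `ceil(currentReplicas * currentMetric / targetMetric)` and clamps the result to the configured bounds. The sketch below applies that rule with the values stated above; it ignores the controller's tolerance band and stabilization windows, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70.0,
                     min_replicas=1, max_replicas=3):
    """Core HPA scaling rule, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# Starting from one replica at various average CPU utilisations.
for cpu in (35, 70, 105, 300):
    print(cpu, desired_replicas(current_replicas=1, current_cpu_pct=cpu))
```

CPU utilisation here is relative to the pods' CPU requests, so the Deployment needs resource requests set for the HPA to have a baseline.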


Observability

The application emits structured JSON logs (via structlog) to CloudWatch Logs and publishes custom metrics to the FinbertAPI namespace.

Metrics

| Metric | Unit | Notes |
| --- | --- | --- |
| prediction_count | Count | Dimensioned by label (positive / negative / neutral). |
| prediction_latency_p99 | Milliseconds | Per-request inference latency. |
| db_write_failures | Count | Emitted when a prediction fails to persist after retry. |
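
As a rough picture of what publishing these metrics involves, the sketch below builds a payload in the shape accepted by CloudWatch's PutMetricData. The helper name and the dimension name `label` are assumptions; only the namespace, metric names, and units come from the table above.

```python
def prediction_metrics(label: str, latency_ms: float):
    """Build the MetricData entries for a single prediction."""
    return [
        {
            "MetricName": "prediction_count",
            # Dimension name is an assumption; the source only says "dimensioned by label".
            "Dimensions": [{"Name": "label", "Value": label}],
            "Value": 1,
            "Unit": "Count",
        },
        {
            "MetricName": "prediction_latency_p99",
            "Value": latency_ms,
            "Unit": "Milliseconds",
        },
    ]


payload = {"Namespace": "FinbertAPI", "MetricData": prediction_metrics("positive", 38)}
print(payload)

# With boto3 installed and AWS credentials configured, the payload could be sent as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**payload)
```

Building the payload in a pure function keeps it easy to unit-test without touching AWS.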

Alarms (all routed to an SNS alert topic):

| Alarm | Condition |
| --- | --- |
| High error rate | error_rate > 10% for 5 minutes |
| High latency | prediction_latency_p99 > 3000 ms for 5 minutes |
| EC2 CPU | CPUUtilization > 90% for 10 minutes |
| DB write failures | db_write_failures > 0 |
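
A condition such as "> 3000 ms for 5 minutes" is a threshold that must be breached across consecutive evaluation periods. The sketch below models that idea under the assumption of one datapoint per minute; the actual period and evaluation settings live in the Terraform alarm definitions and are not shown here, and the sample values are made up.

```python
def in_alarm(datapoints, threshold, periods):
    """True when the most recent `periods` datapoints all exceed the threshold."""
    recent = datapoints[-periods:]
    return len(recent) == periods and all(value > threshold for value in recent)


# One hypothetical latency datapoint per minute, oldest first.
sustained = [1200, 3100, 3200, 3050, 3400, 3300]
brief_dip = [3100, 3200, 2900, 3400, 3300]

print(in_alarm(sustained, threshold=3000, periods=5))
print(in_alarm(brief_dip, threshold=3000, periods=5))
print(in_alarm([0, 0, 1], threshold=0, periods=1))  # db_write_failures > 0
```

Requiring several consecutive breaches filters out single slow requests, while the write-failure alarm fires on the first occurrence because any lost prediction is worth investigating.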

A CloudWatch dashboard (mlops-finbert-aws-dashboard) surfaces prediction counts by label, p99 latency, DB write failures, and EC2/RDS CPU on a single pane.


Repository Structure

.
├── app/
│   ├── main.py              # FastAPI app + lifespan (loads model on startup)
│   ├── config.py            # pydantic-settings configuration
│   ├── api/
│   │   ├── predict.py       # POST /predict - inference, persistence, metrics
│   │   ├── predictions.py   # GET single + cursor-paginated list
│   │   └── health.py        # /health (liveness) + /ready (readiness)
│   ├── model/finbert.py     # ONNX Runtime session + tokenizer + softmax
│   ├── db/                  # SQLAlchemy session + Prediction ORM model
│   └── schemas/             # Pydantic request/response models
├── tests/
│   ├── unit/                # API/contract tests with a stubbed model
│   └── integration/         # Full predict → persist flow against Postgres
├── terraform/               # IaC: VPC, EC2, ECS, RDS, API Gateway, IAM,
│                            #      ECR, SSM, CloudWatch, SNS, S3 backend
├── k8s/                     # Local minikube manifests (Deployment, HPA, …)
├── migrations/              # Alembic migrations (predictions table)
├── scripts/export_onnx.py   # PyTorch → ONNX export, run at image build
├── .github/workflows/       # CI/CD pipeline (deploy.yml)
├── Dockerfile               # Multi-stage build; ONNX baked, torch dropped
├── docker-compose.yml       # Local app + Postgres
└── pyproject.toml           # Dependencies, ruff, black, pytest config

License

MIT
