mlops-finbert-aws

Production-grade MLOps pipeline: a FinBERT financial sentiment API deployed on AWS with containerised ONNX Runtime, Terraform-provisioned infrastructure, ECS orchestration, automated CI/CD, and a full observability stack.

A FinBERT (ProsusAI/finbert) financial-sentiment inference service. PyTorch weights are exported to ONNX at build time and served through ONNX Runtime inside a hardened container. The model classifies text as positive / negative / neutral with per-label scores, and every prediction is persisted to PostgreSQL for retrieval and pagination. Infrastructure is fully described in Terraform; releases ship through a six-stage GitHub Actions pipeline.

Architecture

flowchart LR
    Client["HTTPS Client"]

    subgraph AWS["AWS - us-east-1"]
        APIGW["API Gateway<br/>HTTP API · TLS termination"]

        subgraph Task["EC2 · ECS Task - FastAPI + ONNX Runtime"]
            API["FastAPI app<br/>uvicorn :8000"]
            ORT["ONNX Runtime<br/>FinBERT · CPUExecutionProvider"]
            API --> ORT
        end

        subgraph Data["Private Subnets"]
            RDS[("RDS PostgreSQL 16<br/>predictions")]
        end

        subgraph Obs["Observability"]
            CW["CloudWatch<br/>Logs · Metrics · Alarms"]
            SNS["SNS<br/>alert topic"]
            CW --> SNS
        end

        APIGW --> API
        API -->|"SQLAlchemy"| RDS
        API -->|"PutMetricData · JSON logs"| CW
    end

    subgraph CICD["CI/CD"]
        GH["GitHub<br/>push to main"]
        GHA["GitHub Actions<br/>lint · test · build"]
        ECR["Amazon ECR"]
        GH --> GHA --> ECR
    end

    Client -->|"HTTPS"| APIGW
    ECR -->|"image pull"| API
    GHA -->|"force-new-deployment"| API

Local Kubernetes: a parallel k8s/ manifest set runs the same image on minikube (Deployment, Service, HPA, in-cluster Postgres). This is a local orchestration demonstration and is not part of the AWS production path.

Key Engineering Decisions

Decision	Problem	Choice	Consequence
ONNX Runtime over PyTorch for inference	PyTorch is a heavy training framework; carrying it into the serving image inflates memory and image size on a small instance.	Export the trained weights to ONNX at build time and serve with ONNX Runtime; torch is installed only to export, then uninstalled.	The runtime image ships zero training dependencies, so the inference footprint stays small enough to run comfortably on a t3-class host.
ECS on EC2 over Fargate	FinBERT's load-time memory spike can exceed the memory available to the container on a small instance.	Run the ECS task on the EC2 launch type, on a self-managed host configured with 2 GB of swap.	Host-level control over swap and instance sizing keeps the model loadable and cost predictable, at the price of managing the EC2 host.
PostgreSQL over DynamoDB	Predictions need ordered pagination, relational integrity, and exact score precision.	Use RDS PostgreSQL with `DECIMAL(6,4)` score columns and timestamp-based cursor pagination.	Clean keyset pagination and lossless score storage, at the cost of running a managed relational instance.
Synchronous inference over an async queue	Inference could be decoupled behind SQS and a worker pool, but that adds a queue and workers to operate.	Keep prediction a single blocking request/response.	p99 latency stays well under 400 ms and the system has no queue/worker operational surface; revisit only if throughput demands batching.
API Gateway over direct EC2 exposure	Exposing the app port directly makes the instance the public contract and leaves TLS termination to the app.	Front the service with an API Gateway HTTP API integrating to the Elastic IP.	Managed HTTPS/TLS termination and a stable public front door, decoupled from the underlying host.
SSM Parameter Store over Secrets Manager	The API key and DB URL are static secrets that never rotate.	Store them as SSM `SecureString` parameters injected into the task as secrets.	Encrypted secret delivery at lower cost, without paying for rotation machinery that is not needed.
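
The score-precision decision can be illustrated with a short sketch. It assumes the service turns the three logits into probabilities with a softmax and rounds each one to four decimal places, which a `DECIMAL(6,4)` column stores exactly. The function name `scores_from_logits` and the label order are assumptions for illustration, not taken from the repository.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

# Assumed label order; check the exported model's id2label mapping.
LABELS = ("positive", "negative", "neutral")


def scores_from_logits(logits):
    """Softmax the logits, then round each score to four decimal places."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    quantum = Decimal("0.0001")  # four decimal places, as in DECIMAL(6,4)
    return {
        label: Decimal(str(e / total)).quantize(quantum, rounding=ROUND_HALF_UP)
        for label, e in zip(LABELS, exps)
    }


scores = scores_from_logits([4.0, -0.5, 0.7])
label = max(scores, key=scores.get)
print(label, scores[label])
```

Because the values are `Decimal` rather than `float`, what is written to the column is what is read back; the three rounded scores may sum to slightly more or less than 1.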

CI/CD Pipeline

Defined in .github/workflows/deploy.yml. Triggered on push to main, but only when relevant paths change (app/**, tests/**, migrations/**, Dockerfile, pyproject.toml, alembic.ini, the workflow itself) - documentation and infra edits don't burn a deploy. AWS access uses OIDC role assumption, so no long-lived credentials are stored in GitHub.

flowchart LR
    L["lint<br/>ruff · black"]
    U["unit-tests<br/>pytest tests/unit"]
    I["integration-tests<br/>pytest tests/integration"]
    B["build-and-push<br/>Docker → ECR"]
    D["deploy<br/>ECS force-new-deployment"]
    S["smoke-test<br/>/ready · /predict · /health"]

    L --> U
    L --> I
    U --> B
    I --> B
    B --> D --> S

Stage	What it does	Why it exists
lint	`ruff check .` and `black --check .`.	Fail fast on style/format before spending compute on tests.
unit-tests	`pytest tests/unit/` against a stubbed model.	Validate API contracts and business logic in isolation.
integration-tests	Spins up Postgres, runs migrations, exercises the real predict→persist flow.	Catch schema, ORM, and serialisation regressions end to end.
build-and-push	Multi-stage Docker build, tags `:latest` and `:<sha>`, pushes to ECR.	Produce the immutable, ONNX-baked image that gets deployed.
deploy	`aws ecs update-service --force-new-deployment`, then waits for service stability.	Roll the new image onto the cluster and block until healthy.
smoke-test	Polls `/ready` for `model_loaded == true`, then hits `/predict` and `/health`.	Prove the live deployment actually serves before the run goes green.

Model-file caching. The ONNX export is expensive, so the integration job caches finbert.onnx, finbert.onnx.data, and tokenizer/ (GitHub Actions cache, key finbert-onnx-v1). On a cache miss it extracts the artefacts directly from the latest ECR image rather than re-exporting.

Readiness polling. The smoke test does not assume the deployment is warm - it polls the /ready endpoint (which reports model_loaded and db_connected) on a fixed interval until the model is actually loaded, then runs assertions.
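
A minimal sketch of the readiness polling described above, using only the standard library. The helper names (`fetch_ready`, `wait_until_ready`), the 5-second interval, and the 60-attempt limit are assumptions; the actual workflow may implement this step differently, for example in shell.

```python
import json
import time
import urllib.request


def fetch_ready(url):
    """GET the readiness endpoint and return its parsed JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())


def wait_until_ready(fetch, interval_s=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch() on a fixed interval until it reports model_loaded == true.

    Returns the number of attempts used, or raises TimeoutError.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if fetch().get("model_loaded") is True:
                return attempt
        except (OSError, ValueError):
            # Connection refused, a 503 (urllib raises HTTPError, an OSError
            # subclass), or a body that is not valid JSON: keep polling.
            pass
        if attempt < max_attempts:
            sleep(interval_s)
    raise TimeoutError(f"service not ready after {max_attempts} attempts")
```

In a pipeline step this would be called as `wait_until_ready(lambda: fetch_ready(BASE_URL + "/ready"))` before the `/predict` and `/health` assertions run.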

API Reference

All /api/v1 routes require an X-API-Key header. Replace the base URL with your deployed stage:

BASE_URL=https://<api-id>.execute-api.us-east-1.amazonaws.com/prod
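
One way the `X-API-Key` check could be implemented is sketched below. It is an assumption about the approach, not the repository's code: the function name `api_key_is_valid` is hypothetical, and the constant-time comparison is a general precaution for secret comparison rather than something this README states.

```python
import hmac


def api_key_is_valid(headers, expected_key):
    """Compare the supplied X-API-Key header against the configured key."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    supplied = normalised.get("x-api-key", "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```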

`POST /api/v1/predict`

Classify a piece of financial text and persist the result.

curl -X POST "$BASE_URL/api/v1/predict" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{
    "text": "The company beat earnings expectations and raised full-year guidance.",
    "source": "earnings-call"
  }'

{
  "request_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "text": "The company beat earnings expectations and raised full-year guidance.",
  "label": "positive",
  "scores": {
    "positive": 0.9512,
    "negative": 0.0121,
    "neutral": 0.0367
  },
  "confidence": 0.9512,
  "model_version": "ProsusAI/finbert",
  "latency_ms": 38,
  "created_at": "2026-05-29T10:14:22.481Z"
}

`text` is required (1–2000 chars); `source` is optional (≤100 chars). If the prediction succeeds but the database write fails, the response is still returned with an `X-Storage-Warning: write_failed` header and a synthetic `request_id`.
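
The degraded-write behaviour can be sketched as follows. This is an illustration under assumptions: `persist_or_warn` and the `save` callable are hypothetical names, the default of two attempts is a guess at the retry policy, and a real handler would catch its database driver's error type rather than `Exception`.

```python
import uuid


def persist_or_warn(prediction, save, attempts=2):
    """Return (body, headers); still answer if every write attempt fails."""
    headers = {}
    for _ in range(attempts):
        try:
            request_id = save(prediction)  # save() returns the stored row's id
            break
        except Exception:  # assumption: narrow this to the DB error type
            continue
    else:
        # No attempt succeeded: hand back a synthetic id and flag the response.
        request_id = str(uuid.uuid4())
        headers["X-Storage-Warning"] = "write_failed"
    return {"request_id": request_id, **prediction}, headers
```

A client that needs the stored record later should treat the warning header as a signal that `GET /api/v1/predictions/{request_id}` will not find this id.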

`GET /api/v1/predictions/{request_id}`

Retrieve a single stored prediction by id.

curl "$BASE_URL/api/v1/predictions/7c9e6679-7425-40de-944b-e07fc1f90ae7" \
  -H "X-API-Key: $API_KEY"

Returns the same shape as `POST /api/v1/predict`, or a `404` with this body:

{ "detail": "Prediction not found" }

`GET /api/v1/predictions`

List stored predictions, newest first, with keyset (cursor) pagination.

curl "$BASE_URL/api/v1/predictions?limit=20&label=negative" \
  -H "X-API-Key: $API_KEY"

Query params: `limit` (default 20, max 100), `cursor` (a `request_id` to page from, typically the `next_cursor` of the previous page), `label` (filter by positive / negative / neutral).

{
  "items": [
    {
      "request_id": "9f1b...",
      "text": "Quarterly revenue fell short of analyst estimates.",
      "label": "negative",
      "scores": { "positive": 0.0204, "negative": 0.9433, "neutral": 0.0363 },
      "confidence": 0.9433,
      "model_version": "ProsusAI/finbert",
      "latency_ms": 41,
      "created_at": "2026-05-29T09:58:03.112Z"
    }
  ],
  "next_cursor": "9f1b...",
  "total_count": 1284
}
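
The keyset pagination behind this endpoint can be sketched in memory. The sketch assumes rows are ordered by `(created_at, request_id)` descending, that the cursor's row supplies the boundary, and that `total_count` counts every row matching the filter; `list_predictions` is a hypothetical name and the real service queries PostgreSQL instead of a list.

```python
from datetime import datetime, timedelta


def sort_key(row):
    # request_id breaks ties between rows sharing a timestamp.
    return (row["created_at"], row["request_id"])


def list_predictions(rows, limit=20, cursor=None, label=None):
    """Return one page, newest first, strictly after the cursor row."""
    limit = max(1, min(limit, 100))
    matching = [r for r in rows if label is None or r["label"] == label]
    ordered = sorted(matching, key=sort_key, reverse=True)
    if cursor is not None:
        anchor = next((r for r in rows if r["request_id"] == cursor), None)
        if anchor is None:
            raise ValueError("unknown cursor")
        # SQL analogue: WHERE (created_at, request_id) < (:ts, :id)
        ordered = [r for r in ordered if sort_key(r) < sort_key(anchor)]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    return {
        "items": page,
        "next_cursor": page[-1]["request_id"] if has_more else None,
        "total_count": len(matching),
    }
```

Unlike offset pagination, each page is defined relative to a row rather than a position, so rows inserted while a client is paging do not shift or duplicate results.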

Operational endpoints

Endpoint	Purpose
`GET /health`	Liveness - returns `200` if the process is up.
`GET /ready`	Readiness - reports `model_loaded` and `db_connected`; `503` until both are true.
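
The readiness rule in the table reduces to a small pure function. The name `readiness` is hypothetical; the status codes and field names follow the table above.

```python
def readiness(model_loaded, db_connected):
    """Return (status_code, body) for GET /ready."""
    body = {"model_loaded": model_loaded, "db_connected": db_connected}
    # Only report ready once both the model and the database are usable.
    status = 200 if model_loaded and db_connected else 503
    return status, body
```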

Local Development

Prerequisites: Docker and Docker Compose.

cp .env.example .env          # set DATABASE_URL and API_KEY
docker compose up --build

The app container's entrypoint runs alembic upgrade head before starting uvicorn, so the schema is migrated automatically. The API is then available at http://localhost:8000.

Environment variables (see .env.example):

Variable	Description
`DATABASE_URL`	SQLAlchemy/psycopg connection string for PostgreSQL.
`API_KEY`	Value clients must send in the `X-API-Key` header.

Compose also accepts ONNX_MODEL_PATH and TOKENIZER_PATH; these default to the paths baked into the image and rarely need overriding.
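
For illustration, the two required variables can be validated at startup as below. The repository uses pydantic-settings in `app/config.py`; this standard-library stand-in (`Settings`, `load_settings` are hypothetical names) only shows the fail-fast idea and omits the optional model and tokenizer paths, whose defaults are not listed here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    api_key: str


def load_settings(env=os.environ):
    """Read required settings, failing fast if any are missing or empty."""
    missing = [name for name in ("DATABASE_URL", "API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(database_url=env["DATABASE_URL"], api_key=env["API_KEY"])
```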

Kubernetes (Local)

The k8s/ manifests run the same image on minikube for local orchestration. This is a demonstration of Kubernetes patterns - it is not the production deployment (production runs on ECS/EC2).

minikube start
eval $(minikube docker-env)            # build into minikube's daemon
docker build -t mlops-finbert-aws:local .

kubectl apply -f k8s/00-namespace.yaml
kubectl apply -f k8s/configmap.yaml -f k8s/secret.yaml
kubectl apply -f k8s/postgres.yaml
kubectl apply -f k8s/deployment.yaml -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml

The Deployment uses an init container that blocks until Postgres accepts connections; the app container has liveness (/health) and readiness (/ready) probes. The HPA scales 1 → 3 replicas on 70% average CPU.
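
How the HPA arrives at a replica count can be worked through with the scaling formula from the Kubernetes documentation, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. The sketch below uses this README's 1 to 3 range and 70% target, assumes the default 10% tolerance, and ignores stabilisation windows and missing-metric handling; `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(current, cpu_percent, target_percent=70,
                     min_replicas=1, max_replicas=3, tolerance=0.1):
    """Approximate the HPA's replica decision for one CPU reading."""
    ratio = cpu_percent / target_percent
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling action
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


for current, cpu in [(1, 140), (2, 175), (3, 20), (2, 72)]:
    print(current, cpu, desired_replicas(current, cpu))
```

A single replica at 140% of its CPU request doubles to two; two replicas at 175% would want five but are capped at three.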

Observability

The application emits structured JSON logs (via structlog) to CloudWatch Logs and publishes custom metrics to the FinbertAPI namespace.

Metrics

Metric	Type	Notes
`prediction_count`	Count	Dimensioned by `label` (positive / negative / neutral).
`prediction_latency_p99`	Milliseconds	Per-request inference latency.
`db_write_failures`	Count	Emitted when a prediction fails to persist after retry.
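
A sketch of the per-request metric payload, in the shape the CloudWatch `PutMetricData` API accepts. It assumes each request publishes one count sample and one latency sample and that the dimension is named `label`; the helper name `prediction_metrics` is hypothetical, and how the p99 is derived from the samples is not stated in this README.

```python
def prediction_metrics(label, latency_ms):
    """Build the MetricData list for one successful prediction."""
    return [
        {
            "MetricName": "prediction_count",
            "Dimensions": [{"Name": "label", "Value": label}],
            "Unit": "Count",
            "Value": 1.0,
        },
        {
            "MetricName": "prediction_latency_p99",
            "Unit": "Milliseconds",
            "Value": float(latency_ms),
        },
    ]


# Publishing requires boto3 and AWS credentials, so it is left commented out:
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="FinbertAPI", MetricData=prediction_metrics("positive", 38)
# )
```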

Alarms (all routed to an SNS alert topic):

Alarm	Condition
High error rate	`error_rate` > 10% for 5 minutes
High latency	`prediction_latency_p99` > 3000 ms for 5 minutes
EC2 CPU	`CPUUtilization` > 90% for 10 minutes
DB write failures	`db_write_failures` > 0

A CloudWatch dashboard (mlops-finbert-aws-dashboard) surfaces prediction counts by label, p99 latency, DB write failures, and EC2/RDS CPU on a single pane.

Repository Structure

.
├── app/
│   ├── main.py              # FastAPI app + lifespan (loads model on startup)
│   ├── config.py            # pydantic-settings configuration
│   ├── api/
│   │   ├── predict.py       # POST /predict - inference, persistence, metrics
│   │   ├── predictions.py   # GET single + cursor-paginated list
│   │   └── health.py        # /health (liveness) + /ready (readiness)
│   ├── model/finbert.py     # ONNX Runtime session + tokenizer + softmax
│   ├── db/                  # SQLAlchemy session + Prediction ORM model
│   └── schemas/             # Pydantic request/response models
├── tests/
│   ├── unit/                # API/contract tests with a stubbed model
│   └── integration/         # Full predict → persist flow against Postgres
├── terraform/               # IaC: VPC, EC2, ECS, RDS, API Gateway, IAM,
│                            #      ECR, SSM, CloudWatch, SNS, S3 backend
├── k8s/                     # Local minikube manifests (Deployment, HPA, …)
├── migrations/              # Alembic migrations (predictions table)
├── scripts/export_onnx.py   # PyTorch → ONNX export, run at image build
├── .github/workflows/       # CI/CD pipeline (deploy.yml)
├── Dockerfile               # Multi-stage build; ONNX baked, torch dropped
├── docker-compose.yml       # Local app + Postgres
└── pyproject.toml           # Dependencies, ruff, black, pytest config

License

MIT
