Production-grade MLOps pipeline: FinBERT financial sentiment API deployed on AWS with containerised ONNX Runtime, Terraform-provisioned infrastructure, ECS orchestration, automated CI/CD, and full observability stack.
A FinBERT (ProsusAI/finbert) financial-sentiment inference service. PyTorch weights are
exported to ONNX at build time and served through ONNX Runtime inside a hardened container.
The model classifies text as positive / negative / neutral with per-label scores, and
every prediction is persisted to PostgreSQL for retrieval and pagination. Infrastructure is
fully described in Terraform; releases ship through a six-stage GitHub Actions pipeline.
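
As a rough illustration of how per-label scores and a top label could be derived from the model's raw logits, here is a minimal, dependency-free sketch. The label order in `LABELS` is an assumption and should be checked against the exported model's `id2label` mapping; `classify` is a hypothetical helper, not the code in `app/model/finbert.py`.

```python
import math

# Assumed label order; verify against the exported model's config (id2label).
LABELS = ("positive", "negative", "neutral")


def classify(logits):
    """Turn raw logits into per-label scores, a top label, and a confidence."""
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    # Round to four decimals, mirroring the score precision shown in the API examples.
    scores = {label: round(e / total, 4) for label, e in zip(LABELS, exps)}
    label = max(scores, key=scores.get)
    return {"label": label, "scores": scores, "confidence": scores[label]}


result = classify([3.2, -1.1, 0.4])
```

The confidence is simply the score of the winning label, which matches the shape of the sample responses further down.
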
flowchart LR
Client["HTTPS Client"]
subgraph AWS["AWS - us-east-1"]
APIGW["API Gateway<br/>HTTP API · TLS termination"]
subgraph Task["EC2 · ECS Task - FastAPI + ONNX Runtime"]
API["FastAPI app<br/>uvicorn :8000"]
ORT["ONNX Runtime<br/>FinBERT · CPUExecutionProvider"]
API --> ORT
end
subgraph Data["Private Subnets"]
RDS[("RDS PostgreSQL 16<br/>predictions")]
end
subgraph Obs["Observability"]
CW["CloudWatch<br/>Logs · Metrics · Alarms"]
SNS["SNS<br/>alert topic"]
CW --> SNS
end
APIGW --> API
API -->|"SQLAlchemy"| RDS
API -->|"PutMetricData · JSON logs"| CW
end
subgraph CICD["CI/CD"]
GH["GitHub<br/>push to main"]
GHA["GitHub Actions<br/>lint · test · build"]
ECR["Amazon ECR"]
GH --> GHA --> ECR
end
Client -->|"HTTPS"| APIGW
ECR -->|"image pull"| API
GHA -->|"force-new-deployment"| API
Local Kubernetes: a parallel k8s/ manifest set runs the same image on minikube (Deployment, Service, HPA, in-cluster Postgres). This is a local orchestration demonstration and is not part of the AWS production path.
| Decision | Problem | Choice | Consequence |
|---|---|---|---|
| ONNX Runtime over PyTorch for inference | PyTorch is a heavy training framework; carrying it into the serving image inflates memory and image size on a small instance. | Export the trained weights to ONNX at build time and serve with ONNX Runtime; torch is installed only to export, then uninstalled. | The runtime image ships zero training dependencies, so the inference footprint stays small enough to run comfortably on a t3-class host. |
| ECS on EC2 over Fargate | FinBERT's load-time memory spike can exceed the container's working set on a small instance. | Run the ECS task on a self-managed EC2 launch type so the host can be configured with 2 GB of swap. | Host-level control over swap and instance sizing keeps the model loadable and cost predictable, at the price of managing the EC2 host. |
| PostgreSQL over DynamoDB | Predictions need ordered pagination, relational integrity, and exact score precision. | Use RDS PostgreSQL with DECIMAL(6,4) score columns and timestamp-based cursor pagination. | Clean keyset pagination and lossless score storage, at the cost of running a managed relational instance. |
| Synchronous inference over an async queue | Could decouple inference behind SQS and a worker pool. | Keep prediction a single blocking request/response. | p99 latency stays well under 400 ms and the system has no queue/worker operational surface; revisit only if throughput demands batching. |
| API Gateway over direct EC2 exposure | Exposing the app port directly makes the instance the public contract and pushes TLS handling onto the app. | Front the service with an API Gateway HTTP API integrating to the Elastic IP. | Managed HTTPS/TLS termination and a stable public front door, decoupled from the underlying host. |
| SSM Parameter Store over Secrets Manager | The API key and DB URL are static secrets that never rotate. | Store them as SSM SecureString parameters injected into the task as secrets. | Encrypted secret delivery at lower cost, without paying for rotation machinery that is not needed. |
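
To make the keyset-pagination choice in the table concrete, the sketch below pages through a toy `predictions` table, newest first. It uses an in-memory SQLite database so it runs anywhere; the column set, the `fetch_page` helper, and the `(created_at, request_id)` cursor pair are assumptions for illustration and may differ from the real schema and API cursor format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (request_id TEXT PRIMARY KEY, created_at TEXT, label TEXT)"
)
rows = [(f"id-{i:02d}", f"2026-05-29T10:00:{i:02d}Z", "neutral") for i in range(5)]
conn.executemany("INSERT INTO predictions VALUES (?, ?, ?)", rows)


def fetch_page(conn, limit, cursor=None):
    """Return (items, next_cursor), newest first; cursor is (created_at, request_id)."""
    sql = "SELECT request_id, created_at FROM predictions"
    params = []
    if cursor is not None:
        # Seek strictly past the last row of the previous page; request_id breaks ties.
        sql += " WHERE created_at < ? OR (created_at = ? AND request_id < ?)"
        params += [cursor[0], cursor[0], cursor[1]]
    sql += " ORDER BY created_at DESC, request_id DESC LIMIT ?"
    params.append(limit + 1)  # fetch one extra row to detect a further page
    found = conn.execute(sql, params).fetchall()
    items = found[:limit]
    has_more = len(found) > limit
    next_cursor = (items[-1][1], items[-1][0]) if has_more else None
    return items, next_cursor


page1, cur1 = fetch_page(conn, limit=2)
page2, cur2 = fetch_page(conn, limit=2, cursor=cur1)
page3, cur3 = fetch_page(conn, limit=2, cursor=cur2)
```

Unlike OFFSET-based paging, each query seeks directly to the cursor position, so the cost of fetching a page does not grow with how deep the client has paged.
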
Defined in .github/workflows/deploy.yml. Triggered on push to
main, but only when relevant paths change (app/**, tests/**, migrations/**,
Dockerfile, pyproject.toml, alembic.ini, the workflow itself) - documentation and infra
edits don't burn a deploy. AWS access uses OIDC role assumption, so no long-lived
credentials are stored in GitHub.
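
The path filter can be reasoned about with a small approximation. The sketch below uses Python's `fnmatch` to decide whether a set of changed files would trigger the pipeline; GitHub's own glob matching differs in some details, so treat this as a mental model rather than a re-implementation. `triggers_deploy` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns taken from the trigger description above.
DEPLOY_PATHS = [
    "app/**",
    "tests/**",
    "migrations/**",
    "Dockerfile",
    "pyproject.toml",
    "alembic.ini",
    ".github/workflows/deploy.yml",
]


def triggers_deploy(changed_files):
    """True if any changed file matches at least one deploy path pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in DEPLOY_PATHS
    )


docs_only = triggers_deploy(["README.md", "terraform/main.tf"])
app_change = triggers_deploy(["app/api/predict.py"])
```
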
flowchart LR
L["lint<br/>ruff · black"]
U["unit-tests<br/>pytest tests/unit"]
I["integration-tests<br/>pytest tests/integration"]
B["build-and-push<br/>Docker → ECR"]
D["deploy<br/>ECS force-new-deployment"]
S["smoke-test<br/>/ready · /predict · /health"]
L --> U
L --> I
U --> B
I --> B
B --> D --> S
| Stage | What it does | Why it exists |
|---|---|---|
| lint | ruff check . and black --check . | Fail fast on style/format before spending compute on tests. |
| unit-tests | pytest tests/unit/ against a stubbed model. | Validate API contracts and business logic in isolation. |
| integration-tests | Spins up Postgres, runs migrations, exercises the real predict→persist flow. | Catch schema, ORM, and serialisation regressions end to end. |
| build-and-push | Multi-stage Docker build, tags :latest and :<sha>, pushes to ECR. | Produce the immutable, ONNX-baked image that gets deployed. |
| deploy | aws ecs update-service --force-new-deployment, then waits for service stability. | Roll the new image onto the cluster and block until healthy. |
| smoke-test | Polls /ready for model_loaded == true, then hits /predict and /health. | Prove the live deployment actually serves before the run goes green. |
Model-file caching. The ONNX export is expensive, so the integration job caches
finbert.onnx, finbert.onnx.data, and tokenizer/ (GitHub Actions cache, key
finbert-onnx-v1). On a cache miss it extracts the artefacts directly from the latest ECR image
rather than re-exporting.
Readiness polling. The smoke test does not assume the deployment is warm - it polls the
/ready endpoint (which reports model_loaded and db_connected) on a fixed interval until the
model is actually loaded, then runs assertions.
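
The readiness-polling idea can be sketched as a small loop with an injectable fetcher, which keeps it testable without a live deployment. The function name, interval, and timeout are assumptions; the real smoke test may be written differently (for example, as a shell loop in the workflow).

```python
import time


def wait_until_ready(
    fetch_ready,
    interval_s=5.0,
    timeout_s=300.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll fetch_ready() until it reports model_loaded, or raise on timeout."""
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            status = fetch_ready()
        except OSError:
            status = {}  # treat connection and HTTP errors as "not ready yet"
        if status.get("model_loaded") is True:
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"service not ready after {attempts} attempts")
        sleep(interval_s)


# Stubbed usage: the first poll reports a cold model, the second a warm one.
responses = iter(
    [
        {"model_loaded": False, "db_connected": True},
        {"model_loaded": True, "db_connected": True},
    ]
)
attempts = wait_until_ready(lambda: next(responses), interval_s=0, sleep=lambda s: None)
```

Against a real deployment, `fetch_ready` would issue `GET $BASE_URL/ready` and parse the JSON body. With `urllib`, a 503 response raises `urllib.error.HTTPError`, which is an `OSError` subclass, so the loop above would keep polling.
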
All /api/v1 routes require an X-API-Key header. Replace the base URL with your deployed
stage:
BASE_URL=https://<api-id>.execute-api.us-east-1.amazonaws.com/prod
Classify a piece of financial text and persist the result.
curl -X POST "$BASE_URL/api/v1/predict" \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{
"text": "The company beat earnings expectations and raised full-year guidance.",
"source": "earnings-call"
}'

{
"request_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
"text": "The company beat earnings expectations and raised full-year guidance.",
"label": "positive",
"scores": {
"positive": 0.9512,
"negative": 0.0121,
"neutral": 0.0367
},
"confidence": 0.9512,
"model_version": "ProsusAI/finbert",
"latency_ms": 38,
"created_at": "2026-05-29T10:14:22.481Z"
}

text is required (1–2000 chars); source is optional (≤100 chars). If the prediction
succeeds but the database write fails, the response is still returned with an
X-Storage-Warning: write_failed header and a synthetic request_id.
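
The degraded-write behaviour can be pictured as follows. This is a minimal sketch with injected `infer` and `save` callables (both hypothetical), and it omits any retry before giving up; the actual handler in `app/api/predict.py` may be structured differently.

```python
import uuid


def predict_and_store(text, infer, save):
    """Run inference, then try to persist; degrade gracefully if the write fails."""
    body = infer(text)
    headers = {}
    try:
        # save() is assumed to return the stored row's id.
        body["request_id"] = save(body)
    except Exception:
        # The prediction is still valid, so return it with a synthetic id and a warning.
        body["request_id"] = str(uuid.uuid4())
        headers["X-Storage-Warning"] = "write_failed"
    return body, headers


def failing_save(record):
    raise ConnectionError("database unavailable")


body, headers = predict_and_store(
    "Shares fell after the profit warning.",
    infer=lambda text: {"text": text, "label": "negative"},
    save=failing_save,
)
```

A client that sees the warning header should assume the returned `request_id` cannot be looked up later through the retrieval endpoint.
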
Retrieve a single stored prediction by id.
curl "$BASE_URL/api/v1/predictions/7c9e6679-7425-40de-944b-e07fc1f90ae7" \
-H "X-API-Key: $API_KEY"

Returns the same shape as POST /predict, or 404:

{ "detail": "Prediction not found" }

List stored predictions, newest first, with keyset (cursor) pagination.
curl "$BASE_URL/api/v1/predictions?limit=20&label=negative" \
-H "X-API-Key: $API_KEY"

Query params: limit (default 20, max 100), cursor (a request_id to page from), label
(filter by positive / negative / neutral).
{
"items": [
{
"request_id": "9f1b...",
"text": "Quarterly revenue fell short of analyst estimates.",
"label": "negative",
"scores": { "positive": 0.0204, "negative": 0.9433, "neutral": 0.0363 },
"confidence": 0.9433,
"model_version": "ProsusAI/finbert",
"latency_ms": 41,
"created_at": "2026-05-29T09:58:03.112Z"
}
],
"next_cursor": "9f1b...",
"total_count": 1284
}

| Endpoint | Purpose |
|---|---|
| GET /health | Liveness - returns 200 if the process is up. |
| GET /ready | Readiness - reports model_loaded and db_connected; 503 until both are true. |
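
The readiness rule in the table reduces to a single condition. The sketch below shows one way to express it as a pure function returning a status code and body; `readiness` is a hypothetical helper, not the contents of `app/api/health.py`.

```python
def readiness(model_loaded, db_connected):
    """Return (status_code, body) for a /ready-style check."""
    body = {"model_loaded": model_loaded, "db_connected": db_connected}
    # Report 503 until both the model and the database are available.
    status_code = 200 if model_loaded and db_connected else 503
    return status_code, body


cold = readiness(model_loaded=False, db_connected=True)
warm = readiness(model_loaded=True, db_connected=True)
```
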
Prerequisites: Docker and Docker Compose.
cp .env.example .env # set DATABASE_URL and API_KEY
docker compose up --build

The app container's entrypoint runs alembic upgrade head before starting uvicorn, so the
schema is migrated automatically. The API is then available at http://localhost:8000.
Environment variables (see .env.example):
| Variable | Description |
|---|---|
| DATABASE_URL | SQLAlchemy/psycopg connection string for PostgreSQL. |
| API_KEY | Value clients must send in the X-API-Key header. |
Compose also accepts ONNX_MODEL_PATH and TOKENIZER_PATH; these default to the paths baked
into the image and rarely need overriding.
The k8s/ manifests run the same image on minikube for local orchestration. This is a
demonstration of Kubernetes patterns - it is not the production deployment (production runs
on ECS/EC2).
minikube start
eval $(minikube docker-env) # build into minikube's daemon
docker build -t mlops-finbert-aws:local .
kubectl apply -f k8s/00-namespace.yaml
kubectl apply -f k8s/configmap.yaml -f k8s/secret.yaml
kubectl apply -f k8s/postgres.yaml
kubectl apply -f k8s/deployment.yaml -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml

The Deployment uses an init container that blocks until Postgres accepts connections, with
liveness (/health) and readiness (/ready) probes. The
HPA scales 1 → 3 replicas on 70% average CPU.
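
For intuition about the scaling behaviour, the Kubernetes HPA computes the desired replica count as `ceil(currentReplicas * currentMetric / targetMetric)` and clamps it to the configured bounds. The sketch below applies that formula with the 1 to 3 replica range and 70% CPU target described above; it ignores the controller's tolerance band and stabilisation windows, so real scaling decisions can differ near the threshold.

```python
import math


def desired_replicas(
    current_replicas,
    current_cpu_pct,
    target_cpu_pct=70,
    min_replicas=1,
    max_replicas=3,
):
    """Approximate the HPA replica calculation for an average-CPU target."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp to the HPA's configured minimum and maximum.
    return max(min_replicas, min(max_replicas, raw))


scale_up = desired_replicas(current_replicas=1, current_cpu_pct=140)
capped = desired_replicas(current_replicas=2, current_cpu_pct=140)
scale_down = desired_replicas(current_replicas=3, current_cpu_pct=20)
```
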
The application emits structured JSON logs (via structlog) to CloudWatch Logs and publishes
custom metrics to the FinbertAPI namespace.
Metrics
| Metric | Type | Notes |
|---|---|---|
| prediction_count | Count | Dimensioned by label (positive / negative / neutral). |
| prediction_latency_p99 | Milliseconds | Per-request inference latency. |
| db_write_failures | Count | Emitted when a prediction fails to persist after retry. |
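
As a sketch of what publishing these metrics might look like, the helper below builds a `MetricData` payload in the shape CloudWatch's `PutMetricData` expects. The dimension name `label` and the `build_metric_data` helper are assumptions; the block only constructs the payload and does not call AWS.

```python
def build_metric_data(label, latency_ms):
    """Build MetricData entries for one successful prediction."""
    return [
        {
            "MetricName": "prediction_count",
            "Dimensions": [{"Name": "label", "Value": label}],  # assumed dimension name
            "Unit": "Count",
            "Value": 1.0,
        },
        {
            "MetricName": "prediction_latency_p99",
            "Unit": "Milliseconds",
            "Value": float(latency_ms),
        },
    ]


payload = {"Namespace": "FinbertAPI", "MetricData": build_metric_data("positive", 38)}

# With boto3 installed and AWS credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```
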
Alarms (all routed to an SNS alert topic):
| Alarm | Condition |
|---|---|
| High error rate | error_rate > 10% for 5 minutes |
| High latency | prediction_latency_p99 > 3000 ms for 5 minutes |
| EC2 CPU | CPUUtilization > 90% for 10 minutes |
| DB write failures | db_write_failures > 0 |
A CloudWatch dashboard (mlops-finbert-aws-dashboard) surfaces prediction counts by label,
p99 latency, DB write failures, and EC2/RDS CPU on a single pane.
.
├── app/
│ ├── main.py # FastAPI app + lifespan (loads model on startup)
│ ├── config.py # pydantic-settings configuration
│ ├── api/
│ │ ├── predict.py # POST /predict - inference, persistence, metrics
│ │ ├── predictions.py # GET single + cursor-paginated list
│ │ └── health.py # /health (liveness) + /ready (readiness)
│ ├── model/finbert.py # ONNX Runtime session + tokenizer + softmax
│ ├── db/ # SQLAlchemy session + Prediction ORM model
│ └── schemas/ # Pydantic request/response models
├── tests/
│ ├── unit/ # API/contract tests with a stubbed model
│ └── integration/ # Full predict → persist flow against Postgres
├── terraform/ # IaC: VPC, EC2, ECS, RDS, API Gateway, IAM,
│ # ECR, SSM, CloudWatch, SNS, S3 backend
├── k8s/ # Local minikube manifests (Deployment, HPA, …)
├── migrations/ # Alembic migrations (predictions table)
├── scripts/export_onnx.py # PyTorch → ONNX export, run at image build
├── .github/workflows/ # CI/CD pipeline (deploy.yml)
├── Dockerfile # Multi-stage build; ONNX baked, torch dropped
├── docker-compose.yml # Local app + Postgres
└── pyproject.toml # Dependencies, ruff, black, pytest config