Human-first portfolio for entry-level MLOps / Production ML roles · 3 ML services · GKE + EKS evidence · 18 ADRs · 395+ tests
Operational status: Infrastructure (GKE + EKS) is currently offline. The code, manifests, Terraform, and CI/CD are production-tested: the clusters were deployed to during development (v3.6.0, March 2026) and torn down afterwards. See PORTFOLIO_STATUS.md for what is live, what is paused, and how to reactivate in ~1 hour. See ADR-018 for the decision record.
Most ML portfolios show models that score well. This one shows what happens after a model has to become a service: APIs, tests, deployment artifacts, monitoring, cost trade-offs, and documented lessons.
The GitHub Pages site is now the best entry point for recruiters and non-technical reviewers: duqueom.github.io/ML-MLOps-Portfolio. This README remains the deeper technical record.
| Incident | Root Cause | Fix | Outcome | ADR |
|---|---|---|---|---|
| 81% error rate under load | `uvicorn --workers N` on K8s: workers share one CPU budget → thrashing, not parallelism | `asyncio.run_in_executor` + `ThreadPoolExecutor(4)`: sklearn C extensions release the GIL | Errors 81% → 0% · CPU 2000m → 1000m | 014 / 015 |
| SHAP returning all zeros | `TreeExplainer` incompatible with `StackingClassifier`; evaluated 4 alternatives before deciding | `KernelExplainer` in original 10-feature space (interpretable by business, not 38 encoded cols) | Real SHAP values in production | 010 |
| HPA never scaled down | Memory-based HPA + fixed ML footprint: ceil(replicas × usage/target) always ≥ current replicas | CPU-only HPA: CPU correlates with traffic; memory is a constant, not a signal | 3 → 1 pods in 8 minutes | 001 |
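The scale-down arithmetic in the HPA row can be checked by hand. The sketch below applies the replica formula from the Kubernetes HPA documentation, `ceil(currentReplicas × currentMetric / targetMetric)`, to illustrative numbers. The utilization figures (75% memory against an 80% target, 10% idle CPU against a 50% target) are assumptions chosen for the example, not measurements from this portfolio, and the HPA tolerance band and stabilization window are ignored.

```python
import math


def desired_replicas(current_replicas: int, usage: float, target: float) -> int:
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return max(1, math.ceil(current_replicas * usage / target))


# Hypothetical values, expressed as a fraction of the pod's resource request.
# Memory: a loaded model keeps usage near-constant regardless of traffic.
memory_case = desired_replicas(3, usage=0.75, target=0.80)

# CPU: usage falls when traffic falls, so the ratio drops well below 1.
cpu_case = desired_replicas(3, usage=0.10, target=0.50)

print(f"memory-based HPA wants {memory_case} replicas")
print(f"cpu-based HPA wants {cpu_case} replicas")
```

With a fixed memory footprint the ratio never drops far enough for the ceiling to land below the current replica count, which is why only the CPU signal allows a scale-down here.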
This is not a tutorial project. It's an operational record.
The CHANGELOG traces the full incident history from v1.0.0 to v3.6.0. Each entry has a root cause and a resolution.
| I want to understand... | Start here |
|---|---|
| Why decisions were made (not just what) | 18 ADRs → |
| Incidents diagnosed in production | ENGINEERING_HIGHLIGHTS.md → |
| Agentic Development Configuration | AGENTS.md → |
| What was built and how it performs | Projects → |
| How to run it locally in 5 minutes | Quick Start → |
| Multi-cloud deployment evidence | Deployment → |
| What broke and when | CHANGELOG.md → |
The MLOps patterns in this portfolio are available as a reusable, opinionated template:
ML-MLOps-Production-Template · related-projects.md
v1.12.0 highlights (audit Round-3 close + pre-commit hardened as mandatory first filter):
- 32 encoded anti-patterns (D-01 → D-32): runtime, data, EDA, security, closed-loop, lifecycle (warm-up, PDB, PSS), delivery (env gates, API contracts, SBOM, digest pin), placeholder hygiene (D-32: kebab-vs-snake path bug)
- Pre-commit as mandatory first filter: 14 hooks (black, isort, flake8, mypy, bandit, gitleaks, trailing-whitespace, EOF, YAML, merge-conflict, large-files, validate-agentic, ci-autofix-policy-contract, scaffold-smoke); `default_install_hook_types: [pre-commit, pre-push]` so a single `pre-commit install` covers both stages; `make verify-hooks` audits any time; `scripts/dev-setup.sh` bootstraps idempotently and verifies hooks actually landed in `.git/hooks/`
- Closed-loop verification workflow: `golden-path-extended.yml` re-deploys + posts 100 valid + 5 invalid `/predict` requests + asserts the prediction-log counter increments; new `test_closed_loop_workflow_contract.py` parses both schema and workflow and fails LOUD if they drift (R3 HIGH-1 fix)
- GCP ↔ AWS Terraform parity (v1.11.0): secrets / logging / KMS at the live layer + bootstrap split + 14 parity contract tests; cluster defaults (private endpoint opt-in, system/workload pool split with taint, deny-default NetworkPolicy)
- ADR-018 Operational Memory Plane + ADR-019 Agentic CI Self-Healing (Phase 0): policy YAMLs (`templates/config/{ci_autofix_policy,model_routing_policy}.yaml`) + 10-invariant contract test enforcing escalation-only semantics; runtime phases scoped as explicit follow-ons
- OSS package complete (v1.11.0): NOTICE (Apache-2.0 attribution) + DCO.md + `.github/CODEOWNERS` routing for AGENTS.md, ADRs, infra, governance YAMLs
- Two Behavior Protocols: static AUTO/CONSULT/STOP mapping (AGENTS.md) PLUS dynamic risk escalation (ADR-010) based on live signals: `incident_active`, `drift_severe`, `error_budget_exhausted`, `off_hours`, `recent_rollback`
- 6 environment overlays (`gcp-{dev,staging,prod}` + `aws-{dev,staging,prod}`) with PSS-labeled namespaces (baseline for dev/staging, restricted for prod) and tier-scaled resources; closes the silent gap where the deploy workflows referenced names the repo never shipped
- Image digest pinning end-to-end: build job captures `sha256:...`, `deploy-common.yml` runs `kustomize edit set image …@<digest>` BEFORE `kubectl apply`; the Kyverno digest gate finally has compliant manifests to admit
- Cosign + SBOM actually invoked in `deploy-{gcp,aws}.yml` (was a silent gap until v1.10.0); SLSA L2 trust chain end-to-end
- 6-phase EDA pipeline with leakage hard gate + baseline distributions feeding drift detection
- Cloud-native secrets: `common_utils/secrets.py` (AWS Secrets Manager / GCP Secret Manager via IRSA/WI); two bootstrap runbooks (GCP WIF + AWS IRSA) + `/secret-breach` emergency workflow
- Per-environment Terraform remote state: partial backend configs under `templates/infra/terraform/{gcp,aws}/backend-configs/` with the `terraform-state-bootstrap.md` runbook
- Drift + retrain operationalized: cloud-aware GCS/S3 adapters via OIDC, Prometheus Pushgateway integration, MLflow promotion hooks
- Typed inter-agent handoffs: frozen dataclasses validating invariants at construction; `DeploymentRequest` refuses to construct when `env=production` + `audit.passed=False`; `SecurityAuditResult` blocks on any `trivy_high` finding
- Audit trail: every agentic operation appends to `ops/audit.jsonl` with risk signals + base mode; CI calls `scripts/audit_record.py` on every deploy (success AND failure via `if: always()`) and mirrors a markdown summary to the GitHub Actions step summary
- Golden Path E2E workflow: `.github/workflows/golden-path.yml` validates the full chain on every PR: scaffold → build + sign by digest → kind cluster + Kyverno admit + smoke → audit trail. Trust anchor for the audit closure.
- Tri-IDE full parity: Windsurf (15 rules / 16 skills / 12 workflows) · Claude Code (14 rules / 12 commands / 16-skill index) · Cursor (12 rules / 12 commands / 16-skill index)
- Closed-loop monitoring: prediction logger + ground truth ingestion + sliced performance (ADR-007) + Champion/Challenger McNemar + bootstrap ΔAUC gate (ADR-008) + 10-panel Grafana dashboard
- Governed delivery: dev → staging → prod chain with GitHub Environment Protection, 2 reviewers + 15min soak + tag-only for prod (ADR-011); reusable `deploy-common.yml` single source of truth
- DORA metrics: exporter script aggregates deployment_frequency, lead_time_for_changes, change_failure_rate, mttr from GitHub API + `ops/audit.jsonl`
- Incident playbooks: `/rollback` (STOP-class 7-step), `/secret-breach`, `/incident`, `/drift-check`, `/performance-review` slash commands
- 19 ADRs: each records alternatives rejected AND measurable revisit triggers; ADR-015 publishes the productization roadmap (3 phases / 12 PRs); ADR-016 codifies the external-audit R2 remediation backlog; ADR-018/019 ratify the new agent capabilities at policy-only Phase 0
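The typed-handoff idea from the list above (a request object that cannot exist in an invalid state) can be sketched with standard-library dataclasses. Only the class names, `env=production`, `audit.passed`, and `trivy_high` come from the text; the field layout, the `image_digest` field, and the error message are assumptions made for illustration, not the template's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityAuditResult:
    # Hypothetical shape: number of HIGH-severity Trivy findings.
    trivy_high: int

    @property
    def passed(self) -> bool:
        # Any HIGH finding blocks the audit.
        return self.trivy_high == 0


@dataclass(frozen=True)
class DeploymentRequest:
    env: str
    image_digest: str  # hypothetical field, e.g. "sha256:..."
    audit: SecurityAuditResult

    def __post_init__(self) -> None:
        # Invariant checked at construction: no production request
        # can be built on top of a failed audit.
        if self.env == "production" and not self.audit.passed:
            raise ValueError("production deploy refused: security audit did not pass")


clean = SecurityAuditResult(trivy_high=0)
request = DeploymentRequest(env="production", image_digest="sha256:...", audit=clean)
print(request.env, request.audit.passed)
```

Because the check lives in `__post_init__` and the dataclass is frozen, downstream code never has to re-validate a `DeploymentRequest` it receives.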
Not explanations of what was built, but records of what was evaluated, rejected, and why. Written for technical reviewers.
| ADR | Decision | The Harder Choice |
|---|---|---|
| 001 | CPU-only HPA | Proved mathematically that memory HPA cannot scale down ML pods |
| 003 | StackingClassifier | Acknowledged single LightGBM achieves comparable AUC at lower cost |
| 005 | Compatible release pinning | numpy 2.x silently broke serialized models: silent failure, worst category |
| 006 | CronJob over Airflow | Documented why Airflow is over-engineering for a 3-model portfolio |
| 007 | No Feature Store | Designed full Feast architecture for when time-window features are needed |
| 008 | Argo Rollouts canary | Progressive delivery with Prometheus analysis gates, not all-or-nothing rollout |
| 009 | Removed CarVision | MAPE 32.9% not defensible; knowing when not to build is harder |
| 010 | SHAP KernelExplainer | Diagnosed production bug, evaluated 4 alternatives before deciding |
| 014 | Single-worker pods | Found uvicorn --workers anti-pattern under K8s from first principles |
| 015 | Async inference | GIL analysis → ThreadPoolExecutor → 81% errors → 0% |
| 016 | GCP/AWS latency gap | $24/mo vs $145/mo; both meet SLA; chose FinOps over vanity metrics |
| 017 | Custom vs Managed ML | FastAPI+K8s primary, SageMaker/Vertex AI as documented complement |
| 018 | Portfolio Maintenance Mode | $180–220/mo idle cost; documented teardown and reactivation path |
View all 18 ADRs with full context, alternatives considered, and trade-offs →
Those 18 ADRs don't just live in docs; they're encoded as behavioral constraints in the AI development environment itself.
AGENTS.md ← Project identity, critical DO NOT VIOLATE patterns, HPA targets
.windsurf/
├── rules/ ← 7 context-aware rules (glob-triggered per file type)
│   ├── 01-mlops-conventions.md   always_on: core ADR constraints
│   ├── 02-kubernetes.md          k8s/**/*.yaml: HPA 50/60/60%, single-worker
│   ├── 03-terraform.md           **/*.tf: state management, tagging
│   ├── 04-python-ml.md           **/*.py: async patterns, SHAP, pinning
│   ├── 05-github-actions.md      .github/workflows/: CI standards
│   ├── 06-documentation.md       docs/**/*.md: ADR format, content guidelines
│   └── 07-docker.md              Dockerfile*: multi-stage, non-root, no model bake
├── skills/ ← 6 multi-step operational procedures with supplementary data
│   ├── debug-ml-inference/       symptom → root cause → ADR cross-reference
│   ├── deploy-gke/ deploy-aws/   pre/post-deploy checklists + rollback procedures
│   ├── drift-detection/          per-service PSI thresholds + alert integration
│   ├── model-retrain/            validation criteria + acceptance gates per service
│   └── release-checklist/        full multi-cloud release + CHANGELOG template
└── workflows/ ← 6 structured prompt workflows
    /incident · /retrain · /release · /load-test · /new-adr · /drift-check
The agent knows: 50%/60%/60% CPU targets (not 70%), KernelExplainer for SHAP (not TreeExplainer), workers=1 (never N) under K8s. Operational knowledge is encoded as constraints, not just referenced as documentation.
→ AGENTS.md | .windsurf/
| Project | Type | Best Metric | Coverage | Latency p50 | Key Engineering Decision |
|---|---|---|---|---|---|
| 🏦 BankChurn | Classification | AUC 0.87 | 90% | 200ms GCP / 110ms AWS | Async inference via ThreadPoolExecutor · threshold 0.35 (30:1 cost ratio) |
| NLPInsight | NLP Sentiment | Acc 80.6% | 98% | 78ms GCP / 100ms AWS | Upgraded to harder dataset (97% → 80.6%) for honest benchmark |
| ChicagoTaxi | Batch Pipeline | R² 0.96 | 91% | 100ms GCP / 120ms AWS | Data leakage found & fixed · lag features + temporal split |
| Infrastructure | Status | Details |
|---|---|---|
| GCP Deployment | ✅ Verified | GKE 1–5 nodes, 6 pods, 0% error rate under 100 concurrent users |
| AWS Deployment | ✅ Verified | EKS 1–5 nodes, 6 pods, CI/CD via GitHub Actions |
| CI/CD | ✅ Unified | 10-job matrix, security scanning (Trivy/Bandit/Gitleaks), automated deploy to both clouds |
| IaC | ✅ Multi-Cloud | Terraform (GCP + AWS) · `terraform plan` = 0 drift |
| Monitoring | ✅ Full Stack | Prometheus + Grafana (26 panels, 16 alert rules) + MLflow |
| Security | ✅ Automated | Blocking on HIGH · non-root containers · Network Policies · IRSA/Workload Identity |
🏦 1. BankChurn Predictor: Customer Churn Prediction
Production-style churn prediction with StackingClassifier ensemble (RF + GradientBoosting + XGBoost + LightGBM → LogisticRegression meta-learner). ChurnFeatureEngineer with domain-specific ratios, bins, and risk scores. MLflow experiment tracking.
| AUC-ROC | F1 | Precision | Recall | Coverage | In-Pod Latency (GKE) |
|---|---|---|---|---|---|
| 0.87 | 0.62 | 0.73 | 0.54 | 90% | 103ms p50 / 111ms p95 |
Why these metrics: AUC-ROC is the primary metric because the 20.4% churn rate (4:1 imbalance) makes accuracy meaningless. Production threshold: 0.35 (not the default 0.50), since a missed churner costs ~$1,500–$3,000 LTV vs. a ~$50 retention offer (30:1 cost ratio). At 0.35, Recall = 0.78; at 0.50, Recall = 0.54. The precision trade-off is intentional and quantified with business context.
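To make the cost argument concrete, here is a minimal sketch of comparing two thresholds by business cost rather than by accuracy. The labels and scores are invented toy data, and the costs use the lower LTV bound ($1,500) and the $50 offer quoted above; the function name `expected_cost` is hypothetical and this is not the portfolio's evaluation code.

```python
def expected_cost(y_true, scores, threshold, fn_cost=1500.0, fp_cost=50.0):
    """Total cost = missed churners * LTV lost + unnecessary offers * offer cost."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * fn_cost + fp * fp_cost


# Toy data (assumed): 3 churners, 7 non-churners, model scores in [0, 1].
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.80, 0.45, 0.38, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05]

cost_at_050 = expected_cost(y_true, scores, threshold=0.50)
cost_at_035 = expected_cost(y_true, scores, threshold=0.35)

print(f"threshold 0.50 -> ${cost_at_050:,.0f}")
print(f"threshold 0.35 -> ${cost_at_035:,.0f}")
```

On this toy set the lower threshold trades one extra false positive for two recovered churners, which is the same direction of trade-off the 30:1 ratio motivates.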
Key engineering decisions:
- ADR-015: `uvicorn --workers N` under Kubernetes causes CPU thrashing (shared budget). Fixed via `asyncio.run_in_executor` + `ThreadPoolExecutor(4)` exploiting GIL release in sklearn C extensions → 81% error rate → 0%, CPU 2000m → 1000m
- ADR-010: SHAP returning all-zero values in production. `TreeExplainer` incompatible with `StackingClassifier`. Evaluated 4 alternatives → `KernelExplainer` in original 10-feature space for business interpretability
- ADR-003: 7-model comparison (5-fold CV). StackingClassifier AUC 0.87 vs single LightGBM 0.86. Documented that simpler model wins in production under strict latency SLAs
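The single-worker async pattern from ADR-015 can be sketched with the standard library alone. In the code below, `blocking_predict` is a hypothetical stand-in for a model call: it uses `time.sleep`, which releases the GIL, to imitate a C extension that does the same. The real service presumably wraps an sklearn `predict_proba` call inside a FastAPI handler; none of that is reproduced here.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One process, one event loop, a small thread pool for blocking inference.
_executor = ThreadPoolExecutor(max_workers=4)


def blocking_predict(features):
    """Stand-in for a blocking model call that releases the GIL while it works."""
    time.sleep(0.05)
    return sum(features) / len(features)


async def predict(features):
    # Offload the blocking call so the event loop keeps accepting requests.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, blocking_predict, features)


async def main():
    start = time.perf_counter()
    # Four concurrent "requests" share the pool instead of four processes.
    results = await asyncio.gather(*(predict([0.1, 0.2, 0.3]) for _ in range(4)))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(f"{len(results)} predictions in {elapsed:.3f}s")
```

The point of the design is that concurrency comes from threads inside one pod-sized CPU budget, while horizontal scaling is left to Kubernetes replicas.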
Project · Model Card · Video
2. NLPInsight Analyzer: Financial Sentiment Analysis
Financial sentiment analysis on Twitter Financial News: 11,931 real financial tweets with stock tickers, informal language, and noisy text. TF-IDF + LogReg production model (5ms, CPU-only) with optional FinBERT backend for GPU environments.
| Accuracy | F1 (weighted) | F1 (macro) | Labels | Dataset |
|---|---|---|---|---|
| 80.6% | 0.810 | 0.748 | 3 | 11,931 tweets |
Why these metrics: 80.6% on real financial tweets (vs 97% on the easier Financial PhraseBank) is the honest choice. The dataset upgrade, from 4,845 curated sentences to 11,931 noisy real tweets, deliberately lowered the metric to produce a more defensible benchmark. F1-macro (0.748) guards against ignoring the minority negative class.
Key engineering decisions:
- ADR-009: Chose harder dataset over better-looking number: intellectual honesty over portfolio optics
- Dual-backend design: TF-IDF+LogReg for CPU production (5ms p50), FinBERT for GPU environments; same API contract, different serving backend
Project · Model Card · Video
3. ChicagoTaxi Demand Pipeline: Batch Processing at Scale
Data engineering pipeline processing 6.3M taxi trips (2.8 GB CSV) via PySpark ETL into partitioned Parquet, with batch prediction using lag features and temporal split.
| Raw Rows | Clean Rows | ETL Throughput | Model R² | RMSE | MAE | Compression |
|---|---|---|---|---|---|---|
| 6.36M | 5.37M | 3,320 rows/sec | 0.96 | 7.87 | 2.85 | 97% (2.8GB → 95MB) |
Why this project: The R² 0.96 is leak-free. Same-period aggregate features (`avg_fare`, `avg_speed`) were identified as data leakage, removed, and replaced with lag features (1h, 24h, 168h, rolling 24h) and a temporal train/test split. R² improved from 0.905 → 0.965 with honest features. The initial high R² was a signal to investigate, not celebrate.
Key engineering decisions:
- ADR-009 (data leakage): `avg_fare` was computed from the same trips being predicted, so future information leaked into training. Documented, fixed, R² re-measured with honest features only
Project · Model Card
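The leakage fix described for ChicagoTaxi (lag features plus a temporal split) can be illustrated in plain Python. This is a sketch on a synthetic hourly series, not the PySpark pipeline itself; the function names, the 80/20 cut, and the `holdout` naming are assumptions. Every feature for hour `t` is built only from hours strictly before `t`.

```python
def add_lag_features(demand, lags=(1, 24, 168)):
    """Build (t, features, target) rows using only values from before hour t."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(demand)):
        features = {f"lag_{k}h": demand[t - k] for k in lags}
        # Rolling mean over the previous 24 hours, excluding hour t itself.
        features["rolling_24h"] = sum(demand[t - 24:t]) / 24
        rows.append((t, features, demand[t]))
    return rows


def temporal_split(rows, train_fraction=0.8):
    """Split by time order: everything in holdout is later than everything in train."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Synthetic hourly demand with a daily cycle (assumed data, 400 hours).
demand = [float(h % 24) for h in range(400)]
rows = add_lag_features(demand)
train, holdout = temporal_split(rows)

print(len(rows), len(train), len(holdout))
print(rows[0][1])
```

A random shuffle split would mix future hours into training, which is why the split is taken by position in time instead.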
| Category | Technologies |
|---|---|
| ML/DS | Scikit-learn, XGBoost, LightGBM, HuggingFace (FinBERT), PySpark, Dask, Pandas, NumPy, SHAP, Optuna |
| MLOps | MLflow (9 experiments), DVC, Docker, Kubernetes, Terraform, Argo Rollouts |
| API | FastAPI, Pydantic, async inference (ThreadPoolExecutor + asyncio) |
| Cloud & IaC | GCP (GKE, GCS, Artifact Registry, Cloud SQL, Workload Identity), AWS (EKS, S3, ECR, RDS, IRSA), Terraform, Kustomize |
| Monitoring | Prometheus (16 alert rules), Grafana (26-panel dashboard), Locust load testing, Evidently drift detection |
| CI/CD | GitHub Actions (CI + deploy-gcp + deploy-aws + smoke tests), Codecov, pre-commit hooks |
| Security | Gitleaks, Bandit, Trivy, pip-audit, non-root containers, Network Policies, Pod Disruption Budgets |
| Testing | pytest (395+ tests, 90–98% coverage), Pandera data validation, 43 adversarial tests |
| Responsible AI | Fairness audits (disparate impact + equal opportunity), SHAP explainability, drift detection (KS + PSI) |
| Agentic | Windsurf Cascade, AGENTS.md, 7 glob-triggered rules + 6 operational skills + 6 structured workflows |
| Managed ML | AWS SageMaker Endpoints, GCP Vertex AI (ADR-017) |
graph TB
subgraph "CI/CD Pipeline: GitHub Actions"
GH[GitHub Actions] --> LINT[Lint + Security<br/>Bandit · Gitleaks · Trivy]
GH --> TEST[pytest · 395+ tests<br/>90-98% coverage]
GH --> BUILD[Docker Build]
BUILD --> AR[GCP Artifact Registry]
BUILD --> ECR[AWS ECR]
end
subgraph "Training Pipeline"
DATA[Raw Data] --> FE[Feature Engineering]
FE --> TRAIN[Model Training<br/>MLflow Tracking]
TRAIN --> GCS[GCS Models]
TRAIN --> S3[S3 Models]
end
subgraph "GCP: GKE Cluster (us-central1)"
direction TB
GCE_ING[nginx Ingress<br/>LoadBalancer IP] --> BC1[BankChurn<br/>StackingClassifier]
GCE_ING --> NL1[NLPInsight<br/>TF-IDF+LogReg]
GCE_ING --> CT1[ChicagoTaxi<br/>Batch Predictions]
BC1 -.->|Init Container| GCS
NL1 -.->|Init Container| GCS
CT1 -.->|Init Container| GCS
PROM1[Prometheus] --> GRAF1[Grafana]
DRIFT1[Drift CronJob] --> BC1
end
subgraph "AWS: EKS Cluster (us-east-1)"
direction TB
AWS_ING[nginx Ingress<br/>NLB] --> BC2[BankChurn<br/>StackingClassifier]
AWS_ING --> NL2[NLPInsight<br/>TF-IDF+LogReg]
AWS_ING --> CT2[ChicagoTaxi<br/>Batch Predictions]
BC2 -.->|Init Container| S3
NL2 -.->|Init Container| S3
CT2 -.->|Init Container| S3
PROM2[Prometheus] --> GRAF2[Grafana]
DRIFT2[Drift CronJob] --> BC2
end
subgraph "IaC: Terraform + Kustomize"
TF[Terraform<br/>GCP + AWS modules] --> GCE_ING
TF --> AWS_ING
KUST[Kustomize Overlays<br/>base + gcp + aws] --> GCE_ING
KUST --> AWS_ING
end
For detailed architecture docs, see docs/ARCHITECTURE_PORTFOLIO.md.
# 1. Clone and enter
git clone https://github.com/DuqueOM/ML-MLOps-Portfolio.git && cd ML-MLOps-Portfolio
# 2. Generate demo models (first time only, ~2 min)
bash scripts/setup_demo_models.sh
# 3. Start full stack (APIs + MLflow + Dashboard, ~3 min build)
docker compose -f docker-compose.demo.yml up -d --build
# 4. Wait for services and verify health (~60s)
sleep 60 && bash scripts/run_demo_tests.sh
# 5. Access services
# BankChurn API: http://localhost:8001/docs
# NLPInsight API: http://localhost:8003/docs
# ChicagoTaxi API: http://localhost:8004/docs
# MLflow: http://localhost:5000
For API examples, monitoring setup, and troubleshooting, see QUICK_START.md and RUNBOOK.md.
Same ML system deployed cloud-agnostically on both GCP and AWS:
Same 6 services running on GCP (GKE, us-central1) and AWS (EKS, us-east-1), simultaneously deployed and verified
| Component | GCP ✅ | AWS ✅ |
|---|---|---|
| K8s Cluster | GKE 1–5 nodes (us-central1) | EKS 1–5 nodes (us-east-1) |
| Container Registry | Artifact Registry | ECR (3 private repos) |
| Model Storage | GCS (versioned) | S3 (encrypted, versioned) |
| Load Balancer | nginx Ingress (static IP) | nginx Ingress (NLB) |
| IAM for Pods | Workload Identity | IRSA |
| CI/CD | `deploy-gcp.yml` | `deploy-aws.yml` |
| IaC | `infra/terraform/gcp/` | `infra/terraform/aws/` |
| Drift Detection | CronJob (daily 06:00 UTC) | CronJob (daily 06:00 UTC) |
| Monitoring | Prometheus + Grafana + MLflow | Prometheus + Grafana + MLflow |
Cloud-Agnostic Design: Monitoring stack, K8s patterns (HPA, anti-affinity, health probes), and CI/CD structure are identical across clouds. Only the init container SDK and ingress annotations differ. See ADR-013.
💰 FinOps: Infrastructure is provisioned on-demand via Terraform and decommissioned after validation. Re-deployable in <15 minutes with `terraform apply`: reproducibility over always-on cost. GCP ~$51/month · AWS ~$45/month when running. Performance difference documented in ADR-016 and accepted as a cost trade-off, not hidden.
GCP Evidence
AWS Evidence
| Document | Description |
|---|---|
| Engineering Highlights | Start here: incidents diagnosed, decisions made, trade-offs documented |
| ADRs (18) | Every non-trivial architectural decision with context, alternatives, and trade-offs |
| AGENTS.md | Agentic development configuration |
| RUNBOOK.md | Copy-paste commands for common operations |
| Quick Start | 5-minute demo with API examples and health checks |
| Architecture | System design, Mermaid diagrams, infrastructure, CI/CD workflow |
| CHANGELOG | Full incident history from v1.0.0 to v3.6.0 |
| Multi-Cloud Comparison | GCP vs AWS with real measured data |
| Deployment Evidence | Screenshots, load tests, production verification |
| Managed ML Guide | SageMaker + Vertex AI deployment guide (ADR-017) |
Built using Windsurf Cascade for code generation and boilerplate. All architectural decisions, system design, trade-off analysis, and incident resolution are the author's. The .windsurf/ configuration constrains the agent with documented decisions, demonstrating that AI tooling can be governed, not just used.
Duque Ortega Mutis · MLOps / ML Platform Engineer
14 years running operations taught me that systems fail silently when nobody monitors them, nobody documents decisions, and nobody thinks about what happens at 2am. That's the mindset I bring to ML infrastructure: not just deploying models, but building systems you can actually trust in production.
Portfolio Version: 3.6.0 · License: MIT · Status: ✅ Deployed on GCP (GKE) + AWS (EKS)
Building ML systems that work at 2am











