DuqueOM/ML-MLOps-Portfolio

🚀 ML/MLOps Portfolio: Production ML Systems

Human-first portfolio for entry-level MLOps / Production ML roles · 3 ML services · GKE + EKS evidence · 18 ADRs · 395+ tests

Portfolio Site YouTube

Status

⚙️ Operational status: Infrastructure (GKE + EKS) is currently offline. The code, manifests, Terraform, and CI/CD are production-tested: everything was deployed to live clusters during development (v3.6.0, March 2026), and the clusters were torn down afterward. See PORTFOLIO_STATUS.md for what is live, what is paused, and how to reactivate in ~1 hour. See ADR-018 for the decision record.


⚡ Why This Portfolio Is Different

Most ML portfolios show models that score well. This one shows what happens after a model has to become a service: APIs, tests, deployment artifacts, monitoring, cost trade-offs, and documented lessons.

The GitHub Pages site is now the best entry point for recruiters and non-technical reviewers: duqueom.github.io/ML-MLOps-Portfolio. This README remains the deeper technical record.

Three production incidents diagnosed, from root cause to fix, documented with data:

| Incident | Root Cause | Fix | Outcome | ADR |
|---|---|---|---|---|
| 81% error rate under load | uvicorn --workers N on K8s: workers share one CPU budget → thrashing, not parallelism | asyncio's loop.run_in_executor + ThreadPoolExecutor(4); sklearn C extensions release the GIL | Errors 81% → 0% · CPU 2000m → 1000m | 014 / 015 |
| SHAP returning all zeros | TreeExplainer incompatible with StackingClassifier; evaluated 4 alternatives before deciding | KernelExplainer in original 10-feature space (interpretable by business, not 38 encoded cols) | Real SHAP values in production | 010 |
| HPA never scaled down | Memory-based HPA + fixed ML footprint: ceil(replicas × usage/target) always ≥ current replicas | CPU-only HPA: CPU correlates with traffic; memory is a constant, not a signal | 3 → 1 pods in 8 minutes | 001 |
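
The HPA row above rests on the standard Kubernetes scaling rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below illustrates why a roughly constant memory footprint keeps the replica count pinned while a CPU signal lets it fall. The utilization numbers and the 1 to 5 replica bounds are hypothetical, not measurements from this portfolio, and the sketch ignores the HPA tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Core HPA rule: ceil(current * currentMetric / targetMetric), clamped to bounds."""
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# Memory: a loaded model keeps roughly the same footprint at any traffic level.
# Hypothetical: every pod sits at 85% of its memory request, target is 80%.
memory_idle = desired_replicas(3, current_util=85, target_util=80)

# CPU: utilization follows traffic. Hypothetical: 10% when idle, target is 50%.
cpu_idle = desired_replicas(3, current_util=10, target_util=50)

print(memory_idle, cpu_idle)
```

With a memory ratio above 1 the computed count never drops below the current count, whereas the idle CPU ratio lets the same rule shrink the deployment.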

This is not a tutorial project. It's an operational record.

The CHANGELOG traces the full incident history from v1.0.0 to v3.6.0. Each entry has a root cause and a resolution.

Portfolio Demo

🗺️ Quick Navigation

| I want to understand... | Start here |
|---|---|
| Why decisions were made (not just what) | 18 ADRs ↓ |
| Incidents diagnosed in production | ENGINEERING_HIGHLIGHTS.md → |
| Agentic Development Configuration | AGENTS.md ↓ |
| What was built and how it performs | Projects ↓ |
| How to run it locally in 5 minutes | Quick Start ↓ |
| Multi-cloud deployment evidence | Deployment ↓ |
| What broke and when | CHANGELOG.md → |

Template

The MLOps patterns in this portfolio are available as a reusable, opinionated template:

ML-MLOps-Production-Template · related-projects.md

v1.12.0 highlights (audit Round-3 close + pre-commit hardened as mandatory first filter):

  • 32 encoded anti-patterns (D-01 → D-32): runtime, data, EDA, security, closed-loop, lifecycle (warm-up, PDB, PSS), delivery (env gates, API contracts, SBOM, digest pin), placeholder hygiene (D-32: kebab-vs-snake path bug)
  • Pre-commit as mandatory first filter: 14 hooks (black, isort, flake8, mypy, bandit, gitleaks, trailing-whitespace, EOF, YAML, merge-conflict, large-files, validate-agentic, ci-autofix-policy-contract, scaffold-smoke); default_install_hook_types: [pre-commit, pre-push] so a single pre-commit install covers both stages; make verify-hooks audits any time; scripts/dev-setup.sh bootstraps idempotently and verifies hooks actually landed in .git/hooks/
  • Closed-loop verification workflow: golden-path-extended.yml re-deploys + posts 100 valid + 5 invalid /predict requests + asserts the prediction-log counter increments; new test_closed_loop_workflow_contract.py parses both schema and workflow and fails LOUD if they drift (R3 HIGH-1 fix)
  • GCP ↔ AWS Terraform parity (v1.11.0): secrets / logging / KMS at the live layer + bootstrap split + 14 parity contract tests; cluster defaults (private endpoint opt-in, system/workload pool split with taint, deny-default NetworkPolicy)
  • ADR-018 Operational Memory Plane + ADR-019 Agentic CI Self-Healing (Phase 0): policy YAMLs (templates/config/{ci_autofix_policy,model_routing_policy}.yaml) + 10-invariant contract test enforcing escalation-only semantics; runtime phases scoped as explicit follow-ons
  • OSS package complete (v1.11.0): NOTICE (Apache-2.0 attribution) + DCO.md + .github/CODEOWNERS routing for AGENTS.md, ADRs, infra, governance YAMLs
  • Two Behavior Protocols: static AUTO/CONSULT/STOP mapping (AGENTS.md) PLUS dynamic risk escalation (ADR-010) based on live signals: incident_active, drift_severe, error_budget_exhausted, off_hours, recent_rollback
  • 6 environment overlays (gcp-{dev,staging,prod} + aws-{dev,staging,prod}) with PSS-labeled namespaces (baseline for dev/staging, restricted for prod) and tier-scaled resources; closes the silent gap where the deploy workflows referenced names the repo never shipped
  • Image digest pinning end-to-end: build job captures sha256:..., deploy-common.yml runs kustomize edit set image …@<digest> BEFORE kubectl apply; the Kyverno digest gate finally has compliant manifests to admit
  • Cosign + SBOM actually invoked in deploy-{gcp,aws}.yml (was a silent gap until v1.10.0); SLSA L2 trust chain end-to-end
  • 6-phase EDA pipeline with leakage hard gate + baseline distributions feeding drift detection
  • Cloud-native secrets: common_utils/secrets.py (AWS Secrets Manager / GCP Secret Manager via IRSA/WI); two bootstrap runbooks (GCP WIF + AWS IRSA) + /secret-breach emergency workflow
  • Per-environment Terraform remote state: partial backend configs under templates/infra/terraform/{gcp,aws}/backend-configs/ with the terraform-state-bootstrap.md runbook
  • Drift + retrain operationalized: cloud-aware GCS/S3 adapters via OIDC, Prometheus Pushgateway integration, MLflow promotion hooks
  • Typed inter-agent handoffs: frozen dataclasses validating invariants at construction; DeploymentRequest refuses to construct when env=production + audit.passed=False; SecurityAuditResult blocks on any trivy_high finding
  • Audit trail: every agentic operation appends to ops/audit.jsonl with risk signals + base mode; CI calls scripts/audit_record.py on every deploy (success AND failure via if: always()) and mirrors a markdown summary to the GitHub Actions step summary
  • Golden Path E2E workflow: .github/workflows/golden-path.yml validates the full chain on every PR: scaffold → build + sign by digest → kind cluster + Kyverno admit + smoke → audit trail. Trust anchor for the audit closure.
  • Tri-IDE full parity: Windsurf (15 rules / 16 skills / 12 workflows) · Claude Code (14 rules / 12 commands / 16-skill index) · Cursor (12 rules / 12 commands / 16-skill index)
  • Closed-loop monitoring: prediction logger + ground truth ingestion + sliced performance (ADR-007) + Champion/Challenger McNemar + bootstrap ΔAUC gate (ADR-008) + 10-panel Grafana dashboard
  • Governed delivery: dev → staging → prod chain with GitHub Environment Protection, 2 reviewers + 15min soak + tag-only for prod (ADR-011); reusable deploy-common.yml single source of truth
  • DORA metrics: exporter script aggregates deployment_frequency, lead_time_for_changes, change_failure_rate, mttr from GitHub API + ops/audit.jsonl
  • Incident playbooks: /rollback (STOP-class 7-step), /secret-breach, /incident, /drift-check, /performance-review slash commands
  • 19 ADRs: each records alternatives rejected AND measurable revisit triggers; ADR-015 publishes the productization roadmap (3 phases / 12 PRs); ADR-016 codifies the external-audit R2 remediation backlog; ADR-018/019 ratify the new agent capabilities at policy-only Phase 0
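
The typed inter-agent handoffs item in the list above can be pictured with a small sketch. This is not the template's actual code: only the names DeploymentRequest, SecurityAuditResult, env, audit.passed, and trivy_high come from the text; the image_digest field, the derivation of passed from trivy_high, and the error message are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityAuditResult:
    trivy_high: int = 0  # count of HIGH findings (field shape is an assumption)

    @property
    def passed(self) -> bool:
        # Any HIGH finding blocks the audit.
        return self.trivy_high == 0


@dataclass(frozen=True)
class DeploymentRequest:
    env: str
    image_digest: str  # hypothetical field, not named in the source
    audit: SecurityAuditResult

    def __post_init__(self) -> None:
        # Invariant checked at construction: production needs a passing audit.
        if self.env == "production" and not self.audit.passed:
            raise ValueError("refusing production deploy: security audit did not pass")


# A failing audit is tolerated outside production in this sketch...
ok = DeploymentRequest("staging", "sha256:<digest>", SecurityAuditResult(trivy_high=2))

# ...but the same audit makes a production request impossible to construct.
try:
    DeploymentRequest("production", "sha256:<digest>", SecurityAuditResult(trivy_high=1))
    blocked = False
except ValueError:
    blocked = True

print(ok.env, blocked)
```

Because the check lives in __post_init__ and the dataclass is frozen, an invalid request cannot exist in memory, so downstream agents do not need to re-validate it.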

📐 Architectural Decision Records: 18 Documented

These are not explanations of what was built; they are records of what was evaluated, rejected, and why. Written for technical reviewers.

| ADR | Decision | The Harder Choice |
|---|---|---|
| 001 | CPU-only HPA | Proved mathematically that memory HPA cannot scale down ML pods |
| 003 | StackingClassifier | Acknowledged single LightGBM achieves comparable AUC at lower cost |
| 005 | Compatible release pinning | numpy 2.x silently broke serialized models: silent failure, worst category |
| 006 | CronJob over Airflow | Documented why Airflow is over-engineering for a 3-model portfolio |
| 007 | No Feature Store | Designed full Feast architecture for when time-window features are needed |
| 008 | Argo Rollouts canary | Progressive delivery with Prometheus analysis gates, not all-or-nothing rollout |
| 009 | Removed CarVision | MAPE 32.9% not defensible; knowing when not to build is harder |
| 010 | SHAP KernelExplainer | Diagnosed production bug, evaluated 4 alternatives before deciding |
| 014 | Single-worker pods | Found uvicorn --workers anti-pattern under K8s from first principles |
| 015 | Async inference | GIL analysis → ThreadPoolExecutor → 81% errors → 0% |
| 016 | GCP/AWS latency gap | $24/mo vs $145/mo: both meet SLA; chose FinOps over vanity metrics |
| 017 | Custom vs Managed ML | FastAPI+K8s primary, SageMaker/Vertex AI as documented complement |
| 018 | Portfolio Maintenance Mode | $180–220/mo idle cost; documented teardown and reactivation path |

View all 18 ADRs with full context, alternatives considered, and trade-offs →


🤖 Agentic Development Configuration

Those 18 ADRs don't just live in docs; they're encoded as behavioral constraints in the AI development environment itself.

AGENTS.md           - Project identity, critical DO NOT VIOLATE patterns, HPA targets
.windsurf/
├── rules/          - 7 context-aware rules (glob-triggered per file type)
│   ├── 01-mlops-conventions.md     always_on: core ADR constraints
│   ├── 02-kubernetes.md            k8s/**/*.yaml: HPA 50/60/60%, single-worker
│   ├── 03-terraform.md             **/*.tf: state management, tagging
│   ├── 04-python-ml.md             **/*.py: async patterns, SHAP, pinning
│   ├── 05-github-actions.md        .github/workflows/: CI standards
│   ├── 06-documentation.md         docs/**/*.md: ADR format, content guidelines
│   └── 07-docker.md                Dockerfile*: multi-stage, non-root, no model bake
├── skills/         - 6 multi-step operational procedures with supplementary data
│   ├── debug-ml-inference/         symptom → root cause → ADR cross-reference
│   ├── deploy-gke/ deploy-aws/     pre/post-deploy checklists + rollback procedures
│   ├── drift-detection/            per-service PSI thresholds + alert integration
│   ├── model-retrain/              validation criteria + acceptance gates per service
│   └── release-checklist/          full multi-cloud release + CHANGELOG template
└── workflows/      - 6 structured prompt workflows
    /incident · /retrain · /release · /load-test · /new-adr · /drift-check

The agent knows: 50%/60%/60% CPU targets (not 70%), KernelExplainer for SHAP (not TreeExplainer), workers=1 (never N) under K8s. Operational knowledge is encoded as constraints, not just referenced as documentation.

→ AGENTS.md | .windsurf/


📊 Key Metrics

| Project | Type | Best Metric | Coverage | Latency p50 | Key Engineering Decision |
|---|---|---|---|---|---|
| 🏦 BankChurn | Classification | AUC 0.87 | 90% | 200ms GCP / 110ms AWS | Async inference via ThreadPoolExecutor · threshold 0.35 (30:1 cost ratio) |
| 📝 NLPInsight | NLP Sentiment | Acc 80.6% | 98% | 78ms GCP / 100ms AWS | Upgraded to harder dataset (97% → 80.6%) for honest benchmark |
| 🚕 ChicagoTaxi | Batch Pipeline | R² 0.96 | 91% | 100ms GCP / 120ms AWS | Data leakage found & fixed · lag features + temporal split |

| Infrastructure | Status | Details |
|---|---|---|
| GCP Deployment | ✅ Verified | GKE 1–5 nodes, 6 pods, 0% error rate under 100 concurrent users |
| AWS Deployment | ✅ Verified | EKS 1–5 nodes, 6 pods, CI/CD via GitHub Actions |
| CI/CD | ✅ Unified | 10-job matrix, security scanning (Trivy/Bandit/Gitleaks), automated deploy to both clouds |
| IaC | ✅ Multi-Cloud | Terraform (GCP + AWS) · terraform plan = 0 drift |
| Monitoring | ✅ Full Stack | Prometheus + Grafana (26 panels, 16 alert rules) + MLflow |
| Security | ✅ Automated | Blocking on HIGH · non-root containers · Network Policies · IRSA/Workload Identity |

🌟 Production-Style Projects

🏦 1. BankChurn Predictor β€” Customer Churn Prediction

Production-style churn prediction with StackingClassifier ensemble (RF + GradientBoosting + XGBoost + LightGBM β†’ LogisticRegression meta-learner). ChurnFeatureEngineer with domain-specific ratios, bins, and risk scores. MLflow experiment tracking.

AUC-ROC F1 Precision Recall Coverage In-Pod Latency (GKE)
0.87 0.62 0.73 0.54 90% 103ms p50 / 111ms p95

Why these metrics: AUC-ROC is the primary metric β€” 20.4% churn rate (4:1 imbalance) makes accuracy meaningless. Production threshold: 0.35 (not default 0.50) β€” missed churner costs ~$1,500–$3,000 LTV vs. ~$50 retention offer (30:1 cost ratio). At 0.35, Recall = 0.78; at 0.50, Recall = 0.54. The precision trade-off is intentional and quantified with business context.
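
The cost argument above can be made concrete with a short sketch. It compares the total cost of missed churners and wasted retention offers at the two thresholds. The labels and scores are invented for illustration and are not the BankChurn data; the $1,500 and $50 costs are the low end of the figures quoted above.

```python
def total_cost(y_true, scores, threshold, cost_fn=1500.0, cost_fp=50.0):
    """Cost of missed churners (FN) plus wasted retention offers (FP)."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * cost_fn + fp * cost_fp


# Hypothetical labels (1 = churned) and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.45, 0.38, 0.40, 0.30, 0.20, 0.10, 0.05]

cost_default = total_cost(y_true, scores, threshold=0.50)
cost_lowered = total_cost(y_true, scores, threshold=0.35)
print(cost_default, cost_lowered)
```

In this toy sample the lower threshold trades two missed churners for one unnecessary offer, which is the shape of the trade-off the paragraph describes.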

Key engineering decisions:

  • ADR-015: uvicorn --workers N under Kubernetes causes CPU thrashing (shared budget). Fixed via asyncio's loop.run_in_executor + ThreadPoolExecutor(4), exploiting GIL release in sklearn C extensions: error rate 81% → 0%, CPU 2000m → 1000m
  • ADR-010: SHAP returning all-zero values in production. TreeExplainer incompatible with StackingClassifier. Evaluated 4 alternatives → KernelExplainer in original 10-feature space for business interpretability
  • ADR-003: 7-model comparison (5-fold CV). StackingClassifier AUC 0.87 vs single LightGBM 0.86. Documented that simpler model wins in production under strict latency SLAs
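
The ADR-015 pattern in the list above (one event loop per pod, blocking inference pushed onto a small thread pool) might look roughly like the following. This is a standalone sketch rather than the service's code: predict_sync is a hypothetical stand-in for the real model call, and time.sleep only imitates work that releases the GIL.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One small pool per single-worker pod, sized as in the text above.
executor = ThreadPoolExecutor(max_workers=4)


def predict_sync(features: list[float]) -> float:
    """Stand-in for a blocking model call; sleep mimics GIL-releasing native code."""
    time.sleep(0.05)
    return sum(features) / len(features)


async def predict(features: list[float]) -> float:
    # Offload the blocking call so the event loop keeps accepting requests.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, predict_sync, features)


async def main() -> list[float]:
    batches = [[0.1, 0.3], [0.5, 0.7], [0.2, 0.2], [0.9, 0.1]]
    # The four calls overlap on the pool instead of running back to back.
    return await asyncio.gather(*(predict(b) for b in batches))


results = asyncio.run(main())
print(results)
```

In a FastAPI handler the same await would sit inside the endpoint function; threads only help here because the underlying numeric code releases the GIL, as the ADR notes.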

📂 Project · 📄 Model Card · 📺 Video


📝 2. NLPInsight Analyzer: Financial Sentiment Analysis

Financial sentiment analysis on Twitter Financial News: 11,931 real financial tweets with stock tickers, informal language, and noisy text. TF-IDF + LogReg production model (5ms, CPU-only) with optional FinBERT backend for GPU environments.

| Accuracy | F1 (weighted) | F1 (macro) | Labels | Dataset |
|---|---|---|---|---|
| 80.6% | 0.810 | 0.748 | 3 | 11,931 tweets |

Why these metrics: 80.6% on real financial tweets (vs 97% on the easier Financial PhraseBank) is the honest choice. The dataset upgrade, from 4,845 curated sentences to 11,931 noisy real tweets, deliberately lowered the metric to produce a more defensible benchmark. F1-macro (0.748) guards against ignoring the minority negative class.
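
To see why macro F1 exposes a neglected minority class while weighted F1 can hide it, consider the following sketch. The labels and predictions are invented (a classifier that never predicts the minority class) and have nothing to do with the NLPInsight results.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), with 0.0 when the class is never hit."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    per_class = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(per_class.values()) / len(per_class)  # every class counts equally
    weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
    return macro, weighted


# Hypothetical: 8 neutral and 2 negative examples, model always says "neutral".
y_true = ["neutral"] * 8 + ["negative"] * 2
y_pred = ["neutral"] * 10

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))
```

The weighted score stays comfortable because the majority class dominates it, while the macro score is cut in half by the class the model ignores.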

Key engineering decisions:

  • ADR-009: Chose harder dataset over better-looking number; intellectual honesty over portfolio optics
  • Dual-backend design: TF-IDF+LogReg for CPU production (5ms p50), FinBERT for GPU environments; same API contract, different serving backend

📂 Project · 📄 Model Card · 📺 Video


🚕 3. ChicagoTaxi Demand Pipeline: Batch Processing at Scale

Data engineering pipeline processing 6.3M taxi trips (2.8 GB CSV) via PySpark ETL into partitioned Parquet, with batch prediction using lag features and temporal split.

| Raw Rows | Clean Rows | ETL Throughput | Model R² | RMSE | MAE | Compression |
|---|---|---|---|---|---|---|
| 6.36M | 5.37M | 3,320 rows/sec | 0.96 | 7.87 | 2.85 | 97% (2.8GB→95MB) |

Why this project: The R² of 0.96 is leak-free. Same-period aggregate features (avg_fare, avg_speed) were identified as data leakage, removed, and replaced with lag features (1h, 24h, 168h, rolling 24h) and a temporal train/test split. R² improved from 0.905 → 0.965 with honest features. The initial high R² was a signal to investigate, not celebrate.
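
A minimal sketch of the leak-free setup described above: every feature for hour t is built only from hours before t, and the split is by time rather than at random. The synthetic demand series, the feature names, and the 80/20 cut are assumptions for illustration; the actual pipeline runs on PySpark.

```python
def build_lag_rows(demand, lags=(1, 24, 168), window=24):
    """Return (hour_index, features, target) rows using only past observations."""
    start = max(max(lags), window)  # first hour with a full history
    rows = []
    for t in range(start, len(demand)):
        features = {f"lag_{k}h": demand[t - k] for k in lags}
        # Rolling mean over the previous `window` hours, excluding hour t itself.
        features[f"rolling_{window}h_mean"] = sum(demand[t - window:t]) / window
        rows.append((t, features, demand[t]))
    return rows


def temporal_split(rows, train_fraction=0.8):
    """Earlier hours train, later hours test; no shuffling."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Synthetic hourly demand, for illustration only.
demand = list(range(200))
rows = build_lag_rows(demand)
train, test = temporal_split(rows)

print(len(rows), train[-1][0], test[0][0])
```

A same-period aggregate such as avg_fare would need data from hour t itself, which is exactly what the slices `demand[t - k]` and `demand[t - window:t]` rule out.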

Key engineering decisions:

  • ADR-009 (data leakage): avg_fare was computed from the same trips being predicted, so future information leaked into training. Documented, fixed, R² re-measured with honest features only

📂 Project · 📄 Model Card


🛠️ Tech Stack

| Category | Technologies |
|---|---|
| ML/DS | Scikit-learn, XGBoost, LightGBM, HuggingFace (FinBERT), PySpark, Dask, Pandas, NumPy, SHAP, Optuna |
| MLOps | MLflow (9 experiments), DVC, Docker, Kubernetes, Terraform, Argo Rollouts |
| API | FastAPI, Pydantic, async inference (ThreadPoolExecutor + asyncio) |
| Cloud & IaC | GCP (GKE, GCS, Artifact Registry, Cloud SQL, Workload Identity), AWS (EKS, S3, ECR, RDS, IRSA), Terraform, Kustomize |
| Monitoring | Prometheus (16 alert rules), Grafana (26-panel dashboard), Locust load testing, Evidently drift detection |
| CI/CD | GitHub Actions (CI + deploy-gcp + deploy-aws + smoke tests), Codecov, pre-commit hooks |
| Security | Gitleaks, Bandit, Trivy, pip-audit, non-root containers, Network Policies, Pod Disruption Budgets |
| Testing | pytest (395+ tests, 90–98% coverage), Pandera data validation, 43 adversarial tests |
| Responsible AI | Fairness audits (disparate impact + equal opportunity), SHAP explainability, drift detection (KS + PSI) |
| Agentic | Windsurf Cascade, AGENTS.md, 7 glob-triggered rules + 6 operational skills + 6 structured workflows |
| Managed ML | AWS SageMaker Endpoints, GCP Vertex AI (ADR-017) |
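
The PSI half of the drift detection listed above can be sketched in a few lines of plain Python. This is an illustrative implementation of the usual Population Stability Index formula, the sum over bins of (actual − expected) × ln(actual / expected), not the portfolio's Evidently-based job. The bin count, the epsilon floor, and the sample data are assumptions, and the repo's per-service thresholds are not reproduced here.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant baseline

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data: a baseline feature and the same feature shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, shifted)
print(psi_same, round(psi_shifted, 2))
```

A commonly cited rule of thumb treats PSI above roughly 0.2 as a shift worth investigating, though any production threshold should be tuned per feature and per service.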

🏗️ Architecture

graph TB
    subgraph "CI/CD Pipeline - GitHub Actions"
        GH[GitHub Actions] --> LINT[Lint + Security<br/>Bandit · Gitleaks · Trivy]
        GH --> TEST[pytest · 395+ tests<br/>90-98% coverage]
        GH --> BUILD[Docker Build]
        BUILD --> AR[GCP Artifact Registry]
        BUILD --> ECR[AWS ECR]
    end

    subgraph "Training Pipeline"
        DATA[Raw Data] --> FE[Feature Engineering]
        FE --> TRAIN[Model Training<br/>MLflow Tracking]
        TRAIN --> GCS[GCS Models]
        TRAIN --> S3[S3 Models]
    end

    subgraph "GCP - GKE Cluster (us-central1)"
        direction TB
        GCE_ING[nginx Ingress<br/>LoadBalancer IP] --> BC1[BankChurn<br/>StackingClassifier]
        GCE_ING --> NL1[NLPInsight<br/>TF-IDF+LogReg]
        GCE_ING --> CT1[ChicagoTaxi<br/>Batch Predictions]
        BC1 -.->|Init Container| GCS
        NL1 -.->|Init Container| GCS
        CT1 -.->|Init Container| GCS
        PROM1[Prometheus] --> GRAF1[Grafana]
        DRIFT1[Drift CronJob] --> BC1
    end

    subgraph "AWS - EKS Cluster (us-east-1)"
        direction TB
        AWS_ING[nginx Ingress<br/>NLB] --> BC2[BankChurn<br/>StackingClassifier]
        AWS_ING --> NL2[NLPInsight<br/>TF-IDF+LogReg]
        AWS_ING --> CT2[ChicagoTaxi<br/>Batch Predictions]
        BC2 -.->|Init Container| S3
        NL2 -.->|Init Container| S3
        CT2 -.->|Init Container| S3
        PROM2[Prometheus] --> GRAF2[Grafana]
        DRIFT2[Drift CronJob] --> BC2
    end

    subgraph "IaC - Terraform + Kustomize"
        TF[Terraform<br/>GCP + AWS modules] --> GCE_ING
        TF --> AWS_ING
        KUST[Kustomize Overlays<br/>base + gcp + aws] --> GCE_ING
        KUST --> AWS_ING
    end

For detailed architecture docs → docs/ARCHITECTURE_PORTFOLIO.md.


🚀 Quick Start

# 1. Clone and enter
git clone https://github.com/DuqueOM/ML-MLOps-Portfolio.git && cd ML-MLOps-Portfolio

# 2. Generate demo models (first time only, ~2 min)
bash scripts/setup_demo_models.sh

# 3. Start full stack (APIs + MLflow + Dashboard, ~3 min build)
docker compose -f docker-compose.demo.yml up -d --build

# 4. Wait for services and verify health (~60s)
sleep 60 && bash scripts/run_demo_tests.sh

# 5. Access services
#    🏦 BankChurn API:    http://localhost:8001/docs
#    📝 NLPInsight API:   http://localhost:8003/docs
#    🚕 ChicagoTaxi API:  http://localhost:8004/docs
#    📊 MLflow:           http://localhost:5000

For API examples, monitoring setup, and troubleshooting → QUICK_START.md and RUNBOOK.md.


☁️ Multi-Cloud Production Deployment

Same ML system deployed cloud-agnostically on both GCP and AWS:

Multi-Cloud HERO: GKE vs EKS

Same 6 services running on GCP (GKE, us-central1) and AWS (EKS, us-east-1), simultaneously deployed and verified

| Component | GCP ✅ | AWS ✅ |
|---|---|---|
| K8s Cluster | GKE 1–5 nodes (us-central1) | EKS 1–5 nodes (us-east-1) |
| Container Registry | Artifact Registry | ECR (3 private repos) |
| Model Storage | GCS (versioned) | S3 (encrypted, versioned) |
| Load Balancer | nginx Ingress (static IP) | nginx Ingress (NLB) |
| IAM for Pods | Workload Identity | IRSA |
| CI/CD | deploy-gcp.yml | deploy-aws.yml |
| IaC | infra/terraform/gcp/ | infra/terraform/aws/ |
| Drift Detection | CronJob (daily 06:00 UTC) | CronJob (daily 06:00 UTC) |
| Monitoring | Prometheus + Grafana + MLflow | Prometheus + Grafana + MLflow |

Cloud-Agnostic Design: Monitoring stack, K8s patterns (HPA, anti-affinity, health probes), and CI/CD structure are identical across clouds. Only the init container SDK and ingress annotations differ. See ADR-013.

💰 FinOps: Infrastructure is provisioned on-demand via Terraform and decommissioned after validation. Re-deployable in <15 minutes with terraform apply: reproducibility over always-on cost. GCP ~$51/month · AWS ~$45/month when running. Performance difference documented in ADR-016, accepted as a cost trade-off, not hidden.

🎬 Full Demo: YouTube (3:30 min)

📊 GCP Evidence

  • GKE Workloads: 6 services running
  • Grafana ML Dashboard: 26 panels
  • GitHub Actions Pipeline: 10 jobs green
  • BankChurn prediction with SHAP explainability

☁️ AWS Evidence

  • EKS Cluster: Active (us-east-1)
  • EKS Workloads: 6 pods Running
  • ECR: 3 Private Repositories
  • S3: Model Storage (encrypted, versioned)
  • Health Checks via ELB
  • SHAP Prediction on EKS


📚 Documentation

| Document | Description |
|---|---|
| ⭐ Engineering Highlights | Start here: incidents diagnosed, decisions made, trade-offs documented |
| ADRs (18) | Every non-trivial architectural decision with context, alternatives, and trade-offs |
| AGENTS.md | Agentic development configuration |
| RUNBOOK.md | Copy-paste commands for common operations |
| Quick Start | 5-minute demo with API examples and health checks |
| Architecture | System design, Mermaid diagrams, infrastructure, CI/CD workflow |
| CHANGELOG | Full incident history from v1.0.0 to v3.6.0 |
| Multi-Cloud Comparison | GCP vs AWS with real measured data |
| Deployment Evidence | Screenshots, load tests, production verification |
| Managed ML Guide | SageMaker + Vertex AI deployment guide (ADR-017) |

🔧 AI Transparency

Built using Windsurf Cascade for code generation and boilerplate. All architectural decisions, system design, trade-off analysis, and incident resolution are the author's. The .windsurf/ configuration constrains the agent with documented decisions, demonstrating that AI tooling can be governed, not just used.


👤 Author

Duque Ortega Mutis · MLOps / ML Platform Engineer

14 years running operations taught me that systems fail silently when nobody monitors them, nobody documents decisions, and nobody thinks about what happens at 2am. That's the mindset I bring to ML infrastructure β€” not just deploying models, but building systems you can actually trust in production.

LinkedIn GitHub Portfolio Email


Portfolio Version: 3.6.0 · License: MIT · Status: ✅ Deployed on GCP (GKE) + AWS (EKS)

Building ML systems that work at 2am 🌙

About

Production-grade MLOps platform: 3 end-to-end ML projects with CI/CD, Terraform (GCP GKE + AWS EKS), Kubernetes, MLflow, Docker, and 90-96% test coverage
