diff --git a/docs/week-06-notes.md b/docs/week-06-notes.md
new file mode 100644
index 0000000..232f639
--- /dev/null
+++ b/docs/week-06-notes.md
@@ -0,0 +1,33 @@
+# Week 6 Notes — Terraform VPC Foundation
+
+## What we built
+- infra/terraform/main.tf: AWS provider ~> 5.0, ap-south-1 region
+- infra/terraform/vpc.tf: VPC 10.0.0.0/16, IGW, public/private subnets, route tables
+- infra/terraform/security_groups.tf: EKS nodes SG, RDS SG
+- infra/terraform/variables.tf: VPC CIDR, subnets, AZs, environment
+- infra/terraform/outputs.tf: vpc_id, subnet IDs
+- infra/terraform/terraform.tfvars: dev environment, ap-south-1
+
+## Key decisions
+- NAT Gateway excluded from dev — saves $32/month
+- All resources tagged: Project=CRMS, ManagedBy=Terraform
+- ap-south-1 (Mumbai) — closest region to Coimbatore
+
+## Commands used
+```bash
+terraform init
+terraform plan     # always plan before apply
+terraform apply    # creates 14 resources in ~30 seconds
+terraform destroy  # tears down to $0 when not in use
+```
+
+## Milestone
+crms-dev-vpc visible in AWS Console ap-south-1
+14 resources created: VPC, IGW, 4 subnets, 2 route tables,
+4 associations, 2 security groups
+
+## What we learned
+- How Terraform providers and resources work
+- Why IaC matters: infrastructure is disposable, code is permanent
+- How VPC subnets, route tables, and IGW connect together
+- terraform plan shows exactly what will change before touching AWS
\ No newline at end of file
diff --git a/docs/week-07-notes.md b/docs/week-07-notes.md
new file mode 100644
index 0000000..f9d4579
--- /dev/null
+++ b/docs/week-07-notes.md
@@ -0,0 +1,30 @@
+# Week 7 Notes — S3 Remote Backend for Terraform State
+
+## What we built
+- S3 bucket: crms-terraform-state-055237683990
+- Versioning enabled — full state history preserved
+- AES256 encryption — state files encrypted at rest
+- Public access blocked — no accidental exposure
+- main.tf updated with backend "s3" block
+
+## Why remote state matters
+Without S3 backend, terraform.tfstate lives on your laptop.
+If the laptop dies, the state is lost; if two engineers run terraform simultaneously, the state can be corrupted.
+S3 backend: state is shared, versioned, encrypted, team-accessible.
+
+## Commands used
+```bash
+aws s3api create-bucket --bucket crms-terraform-state-055237683990 --region ap-south-1 --create-bucket-configuration LocationConstraint=ap-south-1
+aws s3api put-bucket-versioning --bucket crms-terraform-state-055237683990 --versioning-configuration Status=Enabled
+terraform init -migrate-state  # migrates local state to S3
+aws s3 ls s3://crms-terraform-state-055237683990/dev/
+```
+
+## Milestone
+terraform.tfstate in S3 — team can now collaborate on infra safely
+State never stored in Git — .gitignore covers *.tfstate
+
+## What we learned
+- Why .gitignore must exclude .terraform/, *.tfstate, *.tfstate.backup
+- How terraform backend migration works
+- Remote state enables team collaboration on infrastructure
\ No newline at end of file
diff --git a/docs/week-08-notes.md b/docs/week-08-notes.md
new file mode 100644
index 0000000..c018bc6
--- /dev/null
+++ b/docs/week-08-notes.md
@@ -0,0 +1,35 @@
+# Week 8 Notes — EKS Cluster on AWS
+
+## What we built
+- infra/terraform/eks.tf: EKS cluster + node group
+- aws_iam_role.eks_cluster: control plane IAM role
+- aws_iam_role.eks_nodes: worker node IAM role
+- 4 IAM policy attachments: EKSClusterPolicy, WorkerNode, CNI, ECR
+- EKS cluster: crms-dev, Kubernetes 1.31, ap-south-1
+- Node group: t3.small x1 (min 1, max 2) — cost optimised
+
+## Cost awareness
+- EKS control plane: $0.10/hour
+- t3.small node: $0.023/hour
+- Total: ~$0.123/hour
+- Always terraform destroy after testing
+
+## Commands used
+```bash
+terraform apply -auto-approve    # creates cluster in ~10 mins
+aws eks update-kubeconfig --name crms-dev --region ap-south-1
+kubectl get nodes                # verify worker node READY
+kubectl get namespaces           # default, kube-system, kube-public, kube-node-lease
+kubectl cluster-info             # control plane endpoint
+terraform destroy -auto-approve  # destroy after testing
+```
+
+## Milestone
+kubectl get nodes output:
+ip-10-0-x-x.ap-south-1.compute.internal Ready v1.31.14-eks-3385e9b
+
+## What we learned
+- How EKS control plane + worker nodes relate
+- Why IAM roles are required for both control plane and nodes
+- t3.small vs t3.medium trade-offs for cost vs capacity
+- aws eks update-kubeconfig wires kubectl to the cluster
\ No newline at end of file
diff --git a/docs/week-09-notes.md b/docs/week-09-notes.md
new file mode 100644
index 0000000..b1aa284
--- /dev/null
+++ b/docs/week-09-notes.md
@@ -0,0 +1,47 @@
+# Week 9 Notes — CRMS Deployed on Kubernetes
+
+## What we built
+- k8s/base/namespace.yaml: crms namespace
+- k8s/base/configmap.yaml: non-secret app config
+- k8s/base/secret.yaml: DATABASE_URL + SECRET_KEY
+- k8s/base/postgres-deployment.yaml: PostgreSQL + ClusterIP service
+- k8s/base/backend-deployment.yaml: FastAPI x2 replicas + health probes
+- k8s/base/frontend-deployment.yaml: React/nginx x2 + LoadBalancer service
+- k8s/base/hpa.yaml: HPA backend 2-10 pods, frontend 2-5 pods
+
+## Critical fix applied
+nginx upstream was "backend" — EKS has no DNS for that.
+Fixed to: crms-backend-service.crms.svc.cluster.local:8000
+Full Kubernetes DNS format: <service>.<namespace>.svc.cluster.local
+
+## Commands used
+```bash
+kubectl apply -f k8s/base/namespace.yaml  # always apply namespace first
+kubectl apply -f k8s/base/
+kubectl get pods -n crms -w
+kubectl get service crms-frontend-service -n crms
+kubectl logs <pod-name> -n crms           # debug crashes
+kubectl describe pod <pod-name> -n crms   # diagnose pending/error
+```
+
+## Milestone
+All 5 pods 1/1 Running:
+- crms-backend x2
+- crms-frontend x2
+- crms-db x1
+
+LoadBalancer URL live:
+ade5ebcd818514ed69e44386d4b3ca72.elb.ap-south-1.amazonaws.com
+
+## What we learned
+- Kubernetes DNS: services reachable by full DNS name within cluster
+- Namespace ordering matters: apply namespace before other resources
+- HPA scales pods based on CPU/memory and depends on the Kubernetes metrics server
+- LoadBalancer type service = AWS NLB provisioned automatically
+- imagePullPolicy: Always ensures latest image is pulled from GHCR
+
+## Issues and fixes
+- CrashLoopBackOff: nginx couldn't resolve "backend" hostname
+  Fix: use full K8s DNS crms-backend-service.crms.svc.cluster.local
+- Namespace not found on first apply: applied namespace.yaml separately first
+- GHCR images must be public for EKS to pull without auth
\ No newline at end of file
diff --git a/docs/week-11-notes.md b/docs/week-11-notes.md
new file mode 100644
index 0000000..32e4c2b
--- /dev/null
+++ b/docs/week-11-notes.md
@@ -0,0 +1,53 @@
+# Week 11 Notes — Prometheus + Grafana Observability
+
+## What we built
+- backend/app/main.py: prometheus-fastapi-instrumentator added
+  - Exposes /metrics endpoint automatically
+  - Tracks: request count, latency histograms, in-progress requests
+- observability/prometheus-values.yaml: kube-prometheus-stack Helm values
+  - Prometheus retention: 24h
+  - Scrapes crms-backend:8000/metrics
+  - Grafana enabled with admin password
+- observability/grafana-dashboards/crms-dashboard.json:
+  - Panel 1: API request rate (requests/sec)
+  - Panel 2: p99 API response latency
+  - Panel 3: Pod CPU usage by pod name
+  - Panel 4: Pod memory usage by pod name
+- observability/prometheus-rules.yaml: 3 alert rules
+  - CRMSHighErrorRate: fires if 5xx rate > 10% for 2m
+  - CRMSHighLatency: fires if p99 > 1s for 5m
+  - CRMSPodDown: fires if backend replicas < 1
+
+## Deploy monitoring stack
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+kubectl create namespace monitoring
+helm upgrade --install kube-prometheus-stack \
+  prometheus-community/kube-prometheus-stack \
+  --namespace monitoring \
+  --values observability/prometheus-values.yaml
+kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
+```
+
+## Access Grafana
+URL: http://localhost:3000
+Username: admin
+Password: crms-grafana-admin
+
+## Key metrics exposed by /metrics
+- http_requests_total: total requests by method, path, status
+- http_request_duration_seconds: latency histogram
+- http_requests_inprogress: concurrent requests
+
+## What we learned
+- How Prometheus scrapes metrics via pull model
+- PromQL: rate(), histogram_quantile(), kube_ metrics
+- Grafana dashboards: panels, data sources, time ranges
+- Alert rules: expr, for, severity, annotations
+- Why observability matters: you can't fix what you can't measure
+
+## Note on resources
+kube-prometheus-stack requires at least a t3.medium (4 GiB RAM) to run
+alongside CRMS application pods. A t3.small (2 GiB RAM) runs out of memory.
+Production: use t3.medium or larger nodes.
\ No newline at end of file
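
Reviewer sketch (not part of the patch): the week 11 notes describe three alert rules in observability/prometheus-rules.yaml but do not show them. The fragment below is one plausible shape for those rules, written as a Prometheus Operator `PrometheusRule`. The thresholds and `for` durations come from the notes. Everything else is an assumption: the resource name `crms-alerts`, the `release` label value, the deployment name `crms-backend`, the 5m rate windows, and the severity values. The actual file may use a different format or different expressions.

```yaml
# Hypothetical sketch of observability/prometheus-rules.yaml.
# Assumes the kube-prometheus-stack release is named "kube-prometheus-stack"
# and that its default rule selector matches on the "release" label.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crms-alerts                  # assumed name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed; must match the Helm release
spec:
  groups:
    - name: crms.rules
      rules:
        # Share of 5xx responses above 10% for 2 minutes.
        - alert: CRMSHighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.10
          for: 2m
          labels:
            severity: critical       # assumed severity
          annotations:
            summary: "CRMS backend 5xx rate above 10%"
        # p99 latency above 1 second for 5 minutes.
        - alert: CRMSHighLatency
          expr: |
            histogram_quantile(
              0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning        # assumed severity
          annotations:
            summary: "CRMS backend p99 latency above 1s"
        # No available backend replicas (deployment name is an assumption).
        - alert: CRMSPodDown
          expr: |
            kube_deployment_status_replicas_available{namespace="crms", deployment="crms-backend"} < 1
          labels:
            severity: critical       # assumed severity
          annotations:
            summary: "No CRMS backend replicas available"
```

The error-rate rule divides the 5xx request rate by the total request rate, so it stays meaningful as traffic grows. The latency rule aggregates histogram buckets by `le` before calling `histogram_quantile()`, which is needed for a quantile across all backend pods.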