9 changes: 9 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,9 @@
## Summary


## Test plan
- [ ] ...

## Documentation
- [ ] Updated relevant docs in `misc/website/docs/` (if adding/changing examples or architecture)
- [ ] Docs site builds cleanly (`cd misc/website && npm ci && npm run build`)
52 changes: 52 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,52 @@
name: docs

on:
  push:
    branches:
      - main
    paths:
      - "misc/website/**"
      - "examples/**/README.md"
      - ".github/workflows/docs.yml"
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
          cache-dependency-path: misc/website/package-lock.json

      - run: npm ci
        working-directory: misc/website

      - run: npm run build
        working-directory: misc/website

      - uses: actions/upload-pages-artifact@v3
        with:
          path: misc/website/build

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4
5 changes: 5 additions & 0 deletions .gitignore
@@ -54,3 +54,8 @@ examples/pod-autoscaling/keda/vllm-qwen3/model-qwen3-4b-fp8-with-sqs.yaml
# Claude Generated Artifacts for Management of Repo
PLAN.md
claude-md/

# Docusaurus
misc/website/node_modules/
misc/website/build/
misc/website/.docusaurus/
324 changes: 160 additions & 164 deletions README.md

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions SECURITY_CONSIDERATIONS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Security Considerations

Our code is continuously scanned using [Checkov](https://www.checkov.io/5.Policy%20Index/kubernetes.html). The following security considerations are documented for transparency:

|Checks |Details |Reasons |
|--- |--- |--- |
|CKV_TF_1 |Ensure Terraform module sources use a commit hash |For easy experimentation, we pin modules to a version instead of a commit hash. Consider pinning to a commit hash in a production cluster. [Read more on why module sources should be pinned to a commit hash.](https://medium.com/boostsecurity/erosion-of-trust-unmasking-supply-chain-vulnerabilities-in-the-terraform-registry-2af48a7eb2) |
|CKV2_K8S_6 |Minimize the admission of pods which lack an associated NetworkPolicy |All Pod-to-Pod communication is allowed by default for easy experimentation in this project. Amazon VPC CNI now supports [Kubernetes Network Policies](https://aws.amazon.com/blogs/containers/amazon-vpc-cni-now-supports-kubernetes-network-policies/) to secure network traffic in Kubernetes clusters. |
|CKV_K8S_8 |Liveness Probe Should be Configured |For easy experimentation, no health checks are performed against the container to determine whether it is alive. Consider implementing [health checks](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) in a production cluster. |
|CKV_K8S_9 |Readiness Probe Should be Configured |For easy experimentation, no health checks are performed against the container to determine whether it is ready to receive traffic. Consider implementing health checks in a production cluster. |
|CKV_K8S_22 |Use read-only filesystem for containers where possible |We've made an exception for workloads that require a read/write file system. See [Configure your images with read-only root file system](https://docs.aws.amazon.com/eks/latest/best-practices/pod-security.html#_configure_your_images_with_read_only_root_file_system). |
|CKV_K8S_23 |Minimize the admission of root containers |This project uses default root container configurations for demonstration purposes. While this doesn't follow security best practices, it ensures compatibility with demo images. For production, configure runAsNonRoot: true and follow [guidance](https://docs.docker.com/engine/reference/builder/#user) on building images with specified user ID. |
|CKV_K8S_37 |Minimize the admission of containers with capabilities assigned |For easy experimentation, we've made an exception for workloads that require added capabilities. For production, we recommend using the [capabilities field](https://docs.aws.amazon.com/eks/latest/best-practices/pod-security.html#_linux_capabilities), which grants specific privileges to a process without granting all the privileges of the root user. |
|CKV_K8S_40 |Containers should run as a high UID to avoid host conflict |We've used publicly available container images in this project for easy access. For test purposes, the container images' user IDs are left intact. See [how to define UID](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-pod). |
80 changes: 80 additions & 0 deletions examples/batch-jobs/README.md
@@ -0,0 +1,80 @@
# Batch Jobs: Protecting Long-Running Workloads from Disruption

## The problem

Karpenter (and EKS Auto Mode) continuously consolidates underutilized nodes. This is great for cost optimization, but catastrophic for long-running batch jobs. A 6-hour ML training run evicted at hour 5 wastes 5 hours of GPU compute. An ETL pipeline disrupted mid-write can leave data in an inconsistent state.

Without protection, consolidation treats your 8-hour training job the same as a stateless web server -- just another pod to reschedule.

## How `karpenter.sh/do-not-disrupt` works

Adding this annotation to a pod's metadata tells Auto Mode: "do not voluntarily evict this pod for consolidation or drift remediation."

```yaml
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"
```

When this annotation is present on any pod running on a node, that entire node becomes protected from voluntary disruption. The node will not be consolidated, drifted, or removed for emptiness as long as the annotated pod is running.
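
The node-level rule above can be sketched in a few lines of Python. This is a simplified model of the behavior as this README describes it, not Karpenter's actual implementation; the function name `node_is_protected` and the dictionary shape (a trimmed-down pod object) are assumptions made for illustration.

```python
DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"

# Pods in these phases have finished and no longer protect their node.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def node_is_protected(pods):
    """Return True if any unfinished pod on the node carries the annotation."""
    for pod in pods:
        annotations = pod.get("metadata", {}).get("annotations") or {}
        phase = pod.get("status", {}).get("phase")
        if phase in TERMINAL_PHASES:
            continue
        if annotations.get(DO_NOT_DISRUPT) == "true":
            return True
    return False


# Hypothetical pods sharing one node.
web = {"metadata": {"annotations": {}}, "status": {"phase": "Running"}}
training = {
    "metadata": {"annotations": {DO_NOT_DISRUPT: "true"}},
    "status": {"phase": "Running"},
}
finished = {
    "metadata": {"annotations": {DO_NOT_DISRUPT: "true"}},
    "status": {"phase": "Succeeded"},
}

print(node_is_protected([web, training]))  # True
print(node_is_protected([web, finished]))  # False
```

One annotated pod is enough to protect every other pod on the same node, which is why the annotation should be reserved for workloads that genuinely need it.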

## Scope of protection

| Disruption Type | Protected? | Example |
|----------------|-----------|---------|
| Consolidation (underutilized) | Yes | Karpenter wants to bin-pack pods onto fewer nodes |
| Drift remediation | Yes | AMI updated, Karpenter wants to roll nodes |
| Empty node removal | Yes | All other pods drained, but annotated pod remains |
| Spot interruption | **No** | AWS reclaims the instance with 2-min warning |
| Node health failure | **No** | EC2 status check fails |
| Manual `kubectl drain` | **No** | Human or automation explicitly drains |

**Key insight**: this protects against the scheduler's optimization decisions, not against infrastructure failures. For Spot protection, use on-demand instances. For health failures, implement checkpointing.

## Why annotation vs taint

These solve different problems:

- **Taints** control which pods CAN schedule onto a node (admission control)
- **do-not-disrupt** controls whether a node with this pod CAN be consolidated (eviction control)

A GPU taint prevents CPU pods from landing on GPU nodes. `do-not-disrupt` prevents Karpenter from evicting your training job to consolidate that GPU node.

## When to use

- ML training jobs (hours to days)
- ETL pipelines with expensive restart costs
- Video transcoding (long-running, stateful progress)
- Database migrations or backfills
- Any batch workload where: **restart cost > idle node cost**
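
The last rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: the cost model (lost hours times hourly rate versus unused capacity times hourly rate times remaining hours) and every number in it are assumptions, not figures from this project.

```python
def restart_cost(hours_of_work_lost, hourly_rate):
    """Compute spend thrown away if the job is evicted without a checkpoint."""
    return hours_of_work_lost * hourly_rate


def idle_node_cost(unused_fraction, hourly_rate, hours_remaining):
    """Extra spend from keeping a partly idle node that consolidation would remove."""
    return unused_fraction * hourly_rate * hours_remaining


def worth_protecting(hours_of_work_lost, hourly_rate, unused_fraction, hours_remaining):
    """Apply the rule: protect when restart cost exceeds idle node cost."""
    return restart_cost(hours_of_work_lost, hourly_rate) > idle_node_cost(
        unused_fraction, hourly_rate, hours_remaining
    )


# Hypothetical: 5 hours into a 6-hour job on a $4/hour node that is 40% unused.
print(worth_protecting(5, 4.0, 0.4, 1))     # True
# Hypothetical: a 3-minute task on a node that is 90% unused for 8 more hours.
print(worth_protecting(0.05, 4.0, 0.9, 8))  # False
```

Jobs that checkpoint frequently lose only the work since the last checkpoint, which lowers `hours_of_work_lost` and weakens the case for the annotation.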

## Deploy

```bash
kubectl apply -f batch-training-job.yaml

# Verify the job is running
kubectl get jobs -n batch-jobs
kubectl get pods -n batch-jobs -o wide
```

## What to observe

```bash
# Confirm the annotation is on the running pod
kubectl get pod -n batch-jobs -l app=ml-training -o jsonpath='{.items[0].metadata.annotations}'

# Watch Karpenter logs -- you should see "cannot disrupt" messages for this node
# (self-managed Karpenter only; EKS Auto Mode runs the controller outside your cluster)
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f | grep "cannot disrupt\|do-not-disrupt"

# Identify which node the job landed on
NODE=$(kubectl get pod -n batch-jobs -l app=ml-training -o jsonpath='{.items[0].spec.nodeName}')
echo "Protected node: $NODE"

# Verify that node is NOT being considered for consolidation
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "$NODE"

# Once the job completes, the annotation disappears with the pod.
# The node becomes eligible for consolidation again.
kubectl get nodes -w # Watch the node get consolidated after job completion
```
38 changes: 38 additions & 0 deletions examples/batch-jobs/batch-training-job.yaml
@@ -0,0 +1,38 @@
apiVersion: v1
kind: Namespace
metadata:
  name: batch-jobs
---
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training-example
  namespace: batch-jobs
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 2
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
      labels:
        app: ml-training
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: training
          image: public.ecr.aws/nvidia/cuda:12.4.0-base-ubuntu22.04
          command: ["sh", "-c", "echo 'Simulating 2-hour training job...' && sleep 7200"]
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
143 changes: 143 additions & 0 deletions examples/capacity-reservation/README.md
@@ -0,0 +1,143 @@
# On-Demand Capacity Reservation (ODCR) Targeting in EKS Auto Mode

## What are ODCRs?

On-Demand Capacity Reservations (ODCRs) let you reserve compute capacity in a specific Availability Zone for a specific instance type. Once created, the capacity is held for you regardless of whether any instances are running against it. You pay the on-demand rate for the reserved capacity whether it is used or not, so the goal is to ensure your workloads actually land on the reservation rather than launching as regular on-demand instances beside it.

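ODCRs are not the same as Reserved Instances or Savings Plans. Those are primarily billing constructs that apply discounts automatically to matching usage (zonal Reserved Instances, which also reserve capacity, are the exception). An ODCR is a capacity guarantee: the capacity is held for your account in that AZ until you cancel the reservation.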

## Why This Matters for ML/GPU Workloads

GPU instance families (p5, g6e, inf2, trn1) are frequently capacity-constrained in popular regions. You may submit a RunInstances call and receive an InsufficientInstanceCapacity error because the AZ is out of that type.

ODCRs solve this by pre-allocating the capacity. However, if your Karpenter/Auto Mode NodeClass does not explicitly target the reservation, launched instances will consume regular on-demand capacity and leave the ODCR idle (still costing you money). Correctly configuring `capacityReservationSelectorTerms` ensures nodes preferentially land on your reserved capacity.

Common scenarios:

- Multi-day distributed training jobs on p5.48xlarge
- Batch inference pipelines with predictable GPU demand
- Real-time inference with guaranteed baseline capacity
- Compliance requirements mandating dedicated or reserved tenancy

## How `capacityReservationSelectorTerms` Works

The NodeClass field `capacityReservationSelectorTerms` tells Auto Mode which ODCRs to target when launching nodes. There are three targeting strategies:

### 1. Target by Reservation ID (most specific)

```yaml
capacityReservationSelectorTerms:
  - id: cr-0a1b2c3d4e5f67890
```

Use this when you have a single known reservation and want deterministic placement.

### 2. Target by Tags (flexible, recommended)

```yaml
capacityReservationSelectorTerms:
  - tags:
      purpose: ml-training
      team: platform
```

Use this when you manage multiple reservations with a tagging convention. As you create or retire ODCRs, the NodeClass automatically picks up matching ones without manifest changes.

### 3. Target by Owner (for shared reservations)

```yaml
capacityReservationSelectorTerms:
  # Quote the account ID so YAML keeps it as a string (IDs can start with 0)
  - owner: "123456789012"
```

Use this when another account shares ODCRs with you via AWS Resource Access Manager (RAM).

You can combine multiple terms; Auto Mode evaluates them in order and uses the first reservation with available capacity.
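
The selection rule described above can be modeled as follows. This is a sketch of the behavior as this README states it (terms evaluated in order, first reservation with free capacity wins), not the controller's actual algorithm. The `Reservation` class, the function names, and the reservation IDs and account numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    id: str
    owner: str
    tags: dict = field(default_factory=dict)
    available: int = 0  # free slots left in the reservation


def term_matches(term, reservation):
    """A term matches when every field it specifies agrees with the reservation."""
    if "id" in term and term["id"] != reservation.id:
        return False
    if "owner" in term and term["owner"] != reservation.owner:
        return False
    for key, value in term.get("tags", {}).items():
        if reservation.tags.get(key) != value:
            return False
    return True


def pick_reservation(terms, reservations):
    """Walk the terms in order; return the first matching reservation with capacity."""
    for term in terms:
        for reservation in reservations:
            if reservation.available > 0 and term_matches(term, reservation):
                return reservation.id
    return None  # nothing usable


reservations = [
    Reservation("cr-full", "111111111111", {"purpose": "ml-training"}, available=0),
    Reservation("cr-open", "111111111111", {"purpose": "ml-training", "team": "platform"}, available=2),
]
terms = [{"id": "cr-full"}, {"tags": {"purpose": "ml-training"}}]

print(pick_reservation(terms, reservations))                        # cr-open
print(pick_reservation([{"owner": "999999999999"}], reservations))  # None
```

In this model the ID term is tried first but its reservation is full, so the tag term supplies the match.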

## Fallback Behavior

If all matching ODCRs are fully utilized (every slot occupied by a running instance), Auto Mode falls back to launching regular on-demand instances. Your workloads still schedule and run; they simply do not benefit from the reservation guarantee.

This means:

- Pods are never stuck Pending solely because a reservation is full.
- You do not need separate "overflow" NodePools for the non-ODCR case.
- The same NodePool handles both reserved and unreserved launches transparently.

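Compare `AvailableInstanceCount` with `TotalInstanceCount` in the EC2 Capacity Reservations console, or in the output of `aws ec2 describe-capacity-reservations`, to see whether your ODCRs are being utilized (used count = total minus available).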
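
The arithmetic behind utilization and fallback is simple enough to sketch. The EC2 `DescribeCapacityReservations` API reports `TotalInstanceCount` and `AvailableInstanceCount` per reservation; the helper names below and the sample numbers are assumptions for illustration.

```python
def utilization(total_instance_count, available_instance_count):
    """Return (used slots, used fraction) for one reservation."""
    used = total_instance_count - available_instance_count
    fraction = used / total_instance_count if total_instance_count else 0.0
    return used, fraction


def split_launches(requested_nodes, available_slots):
    """Return (nodes placed in the reservation, nodes falling back to on-demand)."""
    reserved = min(requested_nodes, available_slots)
    return reserved, requested_nodes - reserved


# Hypothetical 4-slot reservation with 1 slot still free.
print(utilization(4, 1))     # (3, 0.75)
# Asking for 6 nodes against 4 free slots: 4 reserved, 2 regular on-demand.
print(split_launches(6, 4))  # (4, 2)
```

A used fraction that stays well below 1.0 while on-demand nodes of the same instance type are running suggests the selector terms are not matching the reservation.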

## When to Use

| Scenario | Why ODCR helps |
|----------|---------------|
| GPU training jobs (multi-hour/day) | Capacity is held for the initial launch and for any replacement nodes mid-job |
| Batch inference with known parallelism | Ensures all workers launch simultaneously |
| Real-time inference baseline | Baseline capacity is always available; burst goes on-demand |
| Compliance / dedicated tenancy | Some regulations require pre-allocated, non-shared capacity |
| Event-driven spikes (launches, demos) | Reserve ahead, release after the event |

## Prerequisites

1. **An existing ODCR** in the target AZ for the instance type you need.
   Create one via the EC2 console or CLI:
   ```bash
   aws ec2 create-capacity-reservation \
     --instance-type g6e.xlarge \
     --instance-platform Linux/UNIX \
     --availability-zone us-west-2a \
     --instance-count 4 \
     --tag-specifications 'ResourceType=capacity-reservation,Tags=[{Key=purpose,Value=ml-training}]'
   ```
   See [AWS docs: Create a Capacity Reservation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html#capacity-reservations-create) for full options.

2. **Appropriate IAM permissions** on the node role to describe and use capacity reservations:
- `ec2:DescribeCapacityReservations`
- `ec2:RunInstances` with the reservation target

3. **Matching AZ and instance type** between the ODCR and the NodePool requirements. Auto Mode will not launch a g6e instance into a p5 reservation.

4. **Tags on the ODCR** if using tag-based selector terms (recommended for flexibility).

## Deploy

1. Render the template with your cluster values:
   ```bash
   terraform output -raw odcr_nodepool_manifest > odcr-nodepool.yaml
   ```
   Or substitute the variables manually in `odcr-nodepool.yaml.tpl`.

2. Apply to your cluster:
   ```bash
   kubectl apply -f odcr-nodepool.yaml
   ```

3. Launch a GPU workload that tolerates the `nvidia.com/gpu` taint:
   ```yaml
   tolerations:
     - key: "nvidia.com/gpu"
       operator: Equal
       value: "true"
       effect: NoSchedule
   resources:
     limits:
       nvidia.com/gpu: 1
   ```

## What to Observe

1. **EC2 Console > Capacity Reservations**: Watch `Used instance count` increase as Auto Mode launches nodes into the reservation.

2. **Node labels**: Nodes launched into an ODCR carry standard EC2 metadata. Check instance details:
   ```bash
   aws ec2 describe-instances --instance-ids <id> --query 'Reservations[].Instances[].CapacityReservationId'
   ```

3. **NodePool counters**: Verify the NodePool's resource usage is increasing:
   ```bash
   kubectl get nodepool odcr-gpu-nodepool -o yaml | grep -A5 status
   ```

4. **Fallback scenario**: If you scale beyond the reservation size, additional nodes will launch as regular on-demand. The `CapacityReservationId` field will be empty on those instances.
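
To check many instances at once, the JSON from `aws ec2 describe-instances` can be split by whether `CapacityReservationId` is set. The sketch below works on an inline sample rather than calling AWS; the function name and the instance and reservation IDs in the sample are made up for illustration.

```python
import json


def split_by_reservation(describe_instances_output):
    """Return (instance IDs in an ODCR, instance IDs running as regular on-demand)."""
    reserved, on_demand = [], []
    # Note: "Reservations" here is the EC2 launch grouping, not capacity reservations.
    for group in describe_instances_output.get("Reservations", []):
        for instance in group.get("Instances", []):
            if instance.get("CapacityReservationId"):
                reserved.append(instance["InstanceId"])
            else:
                on_demand.append(instance["InstanceId"])
    return reserved, on_demand


# Hypothetical, heavily trimmed describe-instances output.
sample = json.loads("""
{
  "Reservations": [
    {"Instances": [
      {"InstanceId": "i-aaa", "CapacityReservationId": "cr-0a1b2c3d4e5f67890"},
      {"InstanceId": "i-bbb"}
    ]},
    {"Instances": [
      {"InstanceId": "i-ccc", "CapacityReservationId": "cr-0a1b2c3d4e5f67890"}
    ]}
  ]
}
""")

reserved, on_demand = split_by_reservation(sample)
print(reserved)   # ['i-aaa', 'i-ccc']
print(on_demand)  # ['i-bbb']
```

Against a real cluster, the same function could be fed the parsed output of `aws ec2 describe-instances --output json`.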

## Cleanup

To release ODCR capacity when no longer needed:

1. Scale down or delete workloads that tolerate the GPU taint.
2. Delete the NodePool and NodeClass: `kubectl delete nodepool odcr-gpu-nodepool && kubectl delete nodeclass odcr-gpu-nodeclass`
3. Cancel the capacity reservation: `aws ec2 cancel-capacity-reservation --capacity-reservation-id cr-0a1b2c3d4e5f67890`