Setu (सेतु) - Sanskrit for "bridge", as in Ram Setu. This controller connects Kueue workload queuing with Karpenter node provisioning.
Status: Alpha — Functional and E2E-tested on AWS EKS. Use in non-production environments first. Contributions welcome!
- The Problem
- The Solution
- When to Use Setu
- Comparison with Alternatives
- Features
- How It Works
- Tech Stack
- Getting Started
- Configuration
- Usage Examples
- Metrics
- Architecture
- How Setu Fits In
- Development
- Project Structure
- Troubleshooting
- Community
Kueue is the standard for Kubernetes job queuing. It has native integration with Cluster Autoscaler via the ProvisioningRequest API — Kueue tells CAS to provision capacity before admitting workloads, so pods schedule instantly on warm nodes.
Karpenter does not implement ProvisioningRequest. If your cluster runs Karpenter, Kueue has no way to pre-provision nodes:
WITH CLUSTER AUTOSCALER WITH KARPENTER (without Setu)
─────────────────────── ────────────────────────────
Kueue ─ProvisioningRequest─▶ CAS Kueue ──── ??? ────▶ Karpenter
(native integration) (NO INTEGRATION)
1. Kueue asks CAS for capacity 1. Kueue admits workload
2. CAS provisions nodes 2. Pods go Pending
3. Nodes ready 3. Karpenter reacts to Pending pods
4. Kueue admits workload 4. Karpenter provisions nodes reactively
5. Pods schedule instantly 5. Pods finally schedule
✅ Zero cold-start ❌ Reactive cold-start
This is a known gap — see Kueue #5133 ("Kueue and Karpenter Support").
Setu bridges this gap today.
Setu uses two existing, stable APIs — Kueue's AdmissionCheck extensibility and Karpenter's NodeClaim v1 API — to close the gap:
┌────────────┐ ┌─────────────────┐ ┌─────────────┐
│ KUEUE │──────▶│ SETU │──────▶│ KARPENTER │
│ Workload │ │ Controller │ │ NodeClaim │
│ (Pending) │ │ Watch/Validate │ │ (Create) │
└────────────┘ │ Provision │ └──────┬──────┘
│ Approve/Reject │ │
└────────┬────────┘ ▼
│ ┌─────────────┐
▼ │ Cloud Node │
┌─────────────────┐ │ (EC2, etc) │
│ AdmissionCheck │ └─────────────┘
│ → Ready │
└─────────────────┘
Zero changes to Kueue. Zero changes to Karpenter. Setu is a standalone controller that uses only public, stable APIs from both projects.
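On the Kueue side, the wiring is a single AdmissionCheck resource pointing at Setu. A minimal sketch — the `controllerName` value here is illustrative, not confirmed; see deploy/manifests/admissioncheck.yaml for the actual manifest:

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: AdmissionCheck
metadata:
  name: karpenter-provision
spec:
  # Must match the controller name Setu registers with Kueue;
  # the value below is an assumption for illustration only.
  controllerName: setu.dev/karpenter-provision
```

Any ClusterQueue that lists this check will hold workloads until Setu flips the check to Ready.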
- You run Kueue + Karpenter and want proactive (not reactive) node provisioning
- You run distributed ML training (PyTorch, JAX, MPI) where all workers must start together
- You have GPU batch jobs that need gang scheduling (all-or-nothing GPU allocation)
- You run large-scale batch processing where reactive autoscaling adds unnecessary cold-start delays
- You operate multi-tenant clusters where Kueue manages fairness and Karpenter handles provisioning
- You are migrating from Cluster Autoscaler to Karpenter and need to keep Kueue's provisioning-aware admission
- You use Cluster Autoscaler — Kueue already integrates natively via ProvisioningRequest
- You don't use Kueue — Karpenter's reactive scaling may be sufficient for your workloads
- Your workloads are latency-tolerant and can wait for reactive node provisioning
Without Setu: Kueue admits a workload, pods go Pending, Karpenter reacts to pending pods and provisions nodes reactively. For gang workloads (e.g. 4-GPU training), partial allocation means some GPUs sit idle waiting for the rest — wasting money and time.
With Setu: Nodes are provisioned before pods exist. Setu creates all NodeClaims atomically, waits for every node to be Ready, then tells Kueue to admit. Pods schedule instantly on warm nodes. For gang workloads, all GPUs come online together.
| | CAS + ProvisioningRequest | Setu + Karpenter |
|---|---|---|
| Provisioner | Cluster Autoscaler | Karpenter |
| Kueue integration | Native (built-in) | Via Setu AdmissionCheck |
| Instance selection | Limited (ASG-based) | Advanced (Karpenter's optimizer) |
| Spot + fallback | Manual ASG config | Karpenter native |
| Consolidation | None | Karpenter native |
| Multi-cloud | AWS only (EKS) | AWS, GCP, Azure via Karpenter |
Choose Setu if you prefer Karpenter's instance selection, spot handling, and consolidation over CAS.
Kubernetes 1.35 introduced native gang scheduling (alpha, disabled by default) using SchedulingGate and coscheduling. This is a scheduler-level feature — it solves pod placement, not node provisioning:
| | K8s 1.35 Gang Scheduling | Setu |
|---|---|---|
| Layer | Scheduler (pod placement) | Infrastructure (node provisioning) |
| What it solves | Ensures pods schedule together on existing nodes | Ensures nodes exist before pods are created |
| Handles cold starts? | No — if nodes don't exist, all pods stay Pending together | Yes — creates NodeClaims proactively, nodes are warm before admission |
| Kueue aware? | Yes — Kueue uses it internally | Yes — Setu bridges Kueue to Karpenter |
| Karpenter aware? | No — Karpenter still reacts to pending pods | Yes — Setu tells Karpenter to provision before pods exist |
| Status | Alpha (K8s 1.35, disabled by default) | Works today on K8s 1.29+ |
K8s gang scheduling and Setu complement each other:
- K8s gang scheduling prevents pods from being fragmented — all pods in a group schedule together or not at all.
- Setu ensures the capacity exists in the first place — nodes are provisioned before Kueue admits the workload.
Without Setu, K8s gang scheduling + Karpenter = all pods gang-gated together, but Karpenter still reacts only when it sees pending pods. You still incur reactive cold-start delays. Setu eliminates that wait by pre-provisioning nodes before workload admission.
A common workaround is running low-priority "pause" pods to keep warm capacity. This approach:
- Wastes resources 24/7 (pods consume quota even when idle)
- Doesn't scale with workload demand
- Karpenter may consolidate away the warm nodes
- Provides no gang guarantees
Setu provisions capacity on-demand, per-workload, and cleans up when the workload completes.
| Feature | Description |
|---|---|
| Gang Scheduling | Atomic all-or-nothing node provisioning with rollback |
| Capacity Validation | Checks NodePool limits before provisioning |
| Quota Awareness | Advisory or enforcing ClusterQueue quota checking |
| NodeClass Validation | Pre-flight check that referenced NodeClass exists |
| Exponential Backoff | Retry failed provisions (5s to 80s, max 5 retries) |
| Finalizers | Guaranteed NodeClaim cleanup on workload deletion |
| Cloud Agnostic | Configurable NodeClass for AWS, GCP, Azure |
| Prometheus Metrics | Full observability (9 metrics with labels) |
| Leader Election | HA-ready with controller-runtime leader election |
| Accelerator Support | NVIDIA GPU, AWS Neuron (Trainium/Inferentia), custom resources |
User submits Job ──▶ Kueue creates Workload (AdmissionCheck: Pending)
──▶ Setu validates quota + capacity
──▶ Setu creates NodeClaim(s) via Karpenter API (gang-id labeled)
──▶ Karpenter provisions cloud instances
──▶ Setu polls until ALL NodeClaims are Ready
──▶ Setu approves AdmissionCheck
──▶ Kueue admits Workload
──▶ Pods schedule instantly on pre-provisioned nodes
Step by step:
- User submits a Job with the `kueue.x-k8s.io/queue-name` label
- Kueue creates a Workload with the `karpenter-provision` AdmissionCheck set to `Pending`
- Setu detects the pending check
- Setu validates ClusterQueue quota and NodePool capacity (advisory by default)
- Setu creates NodeClaim(s) — one per pod — all sharing a `gang-id` label
- Karpenter provisions cloud instances (EC2, GCE, Azure VMs)
- Setu polls NodeClaim status every 15 seconds
- All Ready? Setu approves the AdmissionCheck
- Any failed? Setu rolls back all NodeClaims and retries with exponential backoff
- Timeout (10 min)? Setu rejects the AdmissionCheck with a clear error
- Kueue admits the Workload, pods schedule instantly
- On workload deletion, Setu's finalizer cleans up all NodeClaims (no orphaned nodes)
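The NodeClaims created in step 5 are plain Karpenter v1 resources. A sketch of what one might look like — the gang-id label key and the requirement values are assumptions for illustration; only the `setu-<workload>-<index>` naming and the shared gang label are documented behavior:

```yaml
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  name: setu-training-job-0   # setu-<workload>-<index>
  labels:
    # Label key is illustrative; Setu groups a gang's claims
    # under a shared gang-id label.
    setu.dev/gang-id: training-job-abc12
spec:
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: gpu
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  resources:
    requests:
      nvidia.com/gpu: "1"
      cpu: "4"
      memory: 16Gi
```

Karpenter sees the NodeClaim and launches a matching instance — no pending pods required.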
For detailed architecture diagrams, see ARCHITECTURE.md.
| Component | Version | Role |
|---|---|---|
| Go | 1.22+ | Controller language |
| Kubernetes | 1.29+ | Target cluster |
| Kueue | v0.16+ | Job queuing and admission (v1beta2 API) |
| Karpenter | v1.0+ | Node provisioning (v1 NodeClaim API) |
| controller-runtime | v0.19 | Kubernetes controller framework |
| Helm | 3.x | For Helm-based install (optional) |
| Docker | 20.10+ | For building the controller image |
| Prometheus | any | Metrics collection (optional) |
| Environment | Kubernetes | Karpenter | Kueue | Result |
|---|---|---|---|---|
| AWS EKS | v1.35 | v1.8.2 | v0.16.0 | E2E passing (test output) |
Prerequisites: Kueue and Karpenter must already be installed on the cluster. See QUICKSTART.md for a complete AWS EKS setup from scratch.
Before deploying, build and push the controller image to a registry your cluster can pull from:
# Build for Linux (required for Kubernetes)
docker build --platform linux/amd64 -t <your-registry>/setu:latest .
# Push to your registry (ECR, GHCR, Docker Hub, etc.)
docker push <your-registry>/setu:latest

Or use the Makefile:
make docker-build IMG=<your-registry>/setu:latest
make docker-push IMG=<your-registry>/setu:latest

Note: On macOS/Apple Silicon, always include `--platform linux/amd64`. Docker defaults to arm64, which will not run on most Kubernetes nodes.
helm install setu charts/setu \
-n kueue-system \
--create-namespace \
--set image.repository=<your-registry>/setu \
--set image.tag=latest

This creates:
- ServiceAccount, ClusterRole, ClusterRoleBinding
- Deployment (1 replica with leader election)
- AdmissionCheck `karpenter-provision`
- Service (metrics on port 8080)
To customize:
helm install setu charts/setu \
-n kueue-system \
--create-namespace \
--set image.repository=<your-registry>/setu \
--set image.tag=latest \
--set controller.extraArgs='{--cloud-provider=gcp}' \
--set serviceMonitor.enabled=true

See charts/setu/values.yaml for all configurable values.
Step 1: Build and push your image
# Build for Linux (required for Kubernetes)
docker build --platform linux/amd64 -t <your-registry>/setu:<tag> .
# Push to your registry
docker push <your-registry>/setu:<tag>

Step 2: Update the manifest with your image
Edit deploy/manifests/deployment.yaml and replace the image on line 21:
image: <your-registry>/setu:<tag>  # Changed from ghcr.io/sanjeevrg89/setu:latest

Step 3: Deploy
kubectl apply -f deploy/manifests/rbac.yaml
kubectl apply -f deploy/manifests/admissioncheck.yaml
kubectl apply -f deploy/manifests/deployment.yaml

# Controller is running
kubectl get pods -n kueue-system -l app=setu
# AdmissionCheck is Active
kubectl get admissionchecks karpenter-provision -o yaml
# Look for: status.conditions[].type=Active, status=True

Create a ClusterQueue that references the `karpenter-provision` AdmissionCheck:
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: my-queue
spec:
  admissionChecksStrategy:
    admissionChecks:
      - name: karpenter-provision
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default
          resources:
            - name: "cpu"
              nominalQuota: 1000
            - name: "memory"
              nominalQuota: 4000Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 64

Create a LocalQueue in the user namespace:
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: default
  namespace: default
spec:
  clusterQueue: my-queue

| Flag | Default | Description |
|---|---|---|
| `--cloud-provider` | `aws` | Cloud provider preset (aws, gcp, azure) |
| `--enforce-quota` | `false` | Reject workloads exceeding ClusterQueue quota |
| `--enforce-capacity` | `false` | Reject workloads when NodePool capacity is insufficient |
| `--validate-node-class` | `true` | Pre-flight check that referenced NodeClass exists |
| `--node-class-group` | (from preset) | Override NodeClass API group |
| `--node-class-kind` | (from preset) | Override NodeClass CRD kind |
| `--node-class-default` | `default` | NodeClass name for CPU workloads |
| `--node-class-gpu` | `gpu` | NodeClass name for GPU/accelerator workloads |
| `--leader-elect` | `false` | Enable leader election for HA |
| `--metrics-bind-address` | `:8080` | Metrics endpoint |
| `--health-probe-bind-address` | `:8081` | Health/readiness probe endpoint |
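Flags are passed as container args on the controller Deployment. A sketch of the relevant excerpt — the specific values shown are illustrative, not defaults you must use:

```yaml
# Deployment excerpt (values illustrative)
containers:
  - name: setu
    image: <your-registry>/setu:latest
    args:
      - --cloud-provider=aws
      - --enforce-quota=false
      - --leader-elect=true
      - --metrics-bind-address=:8080
```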
| Provider | `--cloud-provider` | NodeClass Group | NodeClass Kind |
|---|---|---|---|
| AWS | `aws` | `karpenter.k8s.aws` | `EC2NodeClass` |
| GCP | `gcp` | `karpenter.gcp.io` | `GCPNodeClass` |
| Azure | `azure` | `karpenter.azure.com` | `AKSNodeClass` |
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  labels:
    kueue.x-k8s.io/queue-name: default
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
        - name: train
          image: pytorch/pytorch:latest
          resources:
            requests:
              nvidia.com/gpu: "1"
              cpu: "4"
              memory: "16Gi"
      restartPolicy: Never

Setu will create 4 NodeClaims (one per pod) as a gang. All 4 must become Ready before the workload is admitted.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
  labels:
    kueue.x-k8s.io/queue-name: default
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      containers:
        - name: worker
          image: busybox:latest
          command: ["sleep", "60"]
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      restartPolicy: Never

# Watch workloads
kubectl get workloads -w
# Watch NodeClaims created by Setu (named setu-<workload>-<index>)
kubectl get nodeclaims -w
# Watch pods schedule
kubectl get pods -w
# Check controller logs
kubectl logs -n kueue-system -l app=setu -f

All metrics use the `setu_` namespace prefix. Scrape port 8080 at `/metrics`.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `setu_workloads_processed_total` | Counter | `result` | Workloads processed (approved/rejected/error) |
| `setu_nodeclaims_created_total` | Counter | `nodepool`, `capacity_type` | NodeClaims created |
| `setu_nodeclaim_provisioning_duration_seconds` | Histogram | `nodepool`, `result` | Time for NodeClaim to become ready |
| `setu_gang_provisioning_attempts_total` | Counter | `result` | Gang attempts (success/rollback) |
| `setu_active_nodeclaims` | Gauge | `nodepool`, `state` | Currently active NodeClaims |
| `setu_admission_check_latency_seconds` | Histogram | `result` | End-to-end admission check latency |
| `setu_capacity_validation_total` | Counter | `nodepool`, `result` | NodePool capacity validation results |
| `setu_quota_check_total` | Counter | `queue`, `result` | ClusterQueue quota check results |
| `setu_retry_attempts_total` | Counter | `operation`, `attempt` | Retry attempts |
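A few example PromQL queries against these metrics. The label values (`approved`, bucket suffixes) follow standard Prometheus conventions and the labels listed above; verify them against your scraped series before alerting on them:

```promql
# Workload approval rate over the last 5 minutes
rate(setu_workloads_processed_total{result="approved"}[5m])

# p95 time for a NodeClaim to become ready, per NodePool
histogram_quantile(0.95,
  sum by (nodepool, le) (rate(setu_nodeclaim_provisioning_duration_seconds_bucket[5m])))

# Gang outcomes (success vs rollback) over the last hour
sum by (result) (increase(setu_gang_provisioning_attempts_total[1h]))
```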
Enable a ServiceMonitor for Prometheus Operator:
helm upgrade setu charts/setu -n kueue-system --set serviceMonitor.enabled=true

For the full architecture with detailed flow diagrams, CRD relationships, retry logic, and design decisions, see ARCHITECTURE.md.
Key design principles:
- Karpenter decides instance types, not Setu. Setu passes resource requirements via NodeClaim. Karpenter's scheduling logic selects optimal instances.
- Cloud-agnostic by design. NodeClass references are configurable via the `--cloud-provider` flag or explicit `--node-class-group`/`--node-class-kind` overrides.
- All accelerators, not just NVIDIA. Supports `nvidia.com/gpu`, `aws.amazon.com/neuron` (Trainium/Inferentia), and any custom accelerator resource.
- Operators choose enforcement level. Quota and capacity checks are advisory by default. Enable `--enforce-quota` or `--enforce-capacity` to reject workloads early.
- Fail fast, not fail slow. NodeClass existence is validated before creating NodeClaims. Missing NodeClass = immediate rejection, not a 10-minute timeout.
- Setu uses Karpenter's stable v1 NodeClaim API and Kueue's AdmissionCheck extensibility — both are public, stable APIs.
- Kubernetes 1.35 added native gang scheduling (alpha, disabled by default). This is a scheduler-level feature that ensures pods schedule together — it does not provision nodes. See Comparison with Alternatives for details on how Setu and K8s gang scheduling complement each other.
For community context behind the Kueue + Karpenter integration gap, see ISSUES.md.
# Install dependencies
make deps
# Build binary locally
make build
# Run tests (unit + integration)
make test
# Run controller locally against current kubeconfig
make run
# Lint
make lint
# Helm chart validation
make helm-lint

See Building the Docker Image for container builds and CONTRIBUTING.md for contribution guidelines.
setu/
├── cmd/main.go # Entry point, flag parsing
├── pkg/
│ ├── controller/
│ │ ├── controller.go # SetuReconciler (main reconcile loop)
│ │ ├── config.go # SetuConfig (cloud-agnostic config)
│ │ ├── nodeclaim.go # NodeClaim builder + requirement parser
│ │ ├── admissioncheck_reconciler.go # Marks AdmissionCheck as Active
│ │ ├── accessors.go # Unstructured field accessors
│ │ ├── *_test.go # Unit + integration tests
│ │ └── integration_test.go # envtest-based integration tests
│ └── metrics/metrics.go # Prometheus metrics (9 metrics)
├── charts/setu/ # Helm chart
│ ├── Chart.yaml
│ ├── values.yaml
│ └── templates/ # Deployment, RBAC, AdmissionCheck, PDB, etc.
├── deploy/manifests/ # Raw Kubernetes manifests
│ ├── admissioncheck.yaml # AdmissionCheck + example queues
│ ├── rbac.yaml # ServiceAccount, ClusterRole, ClusterRoleBinding
│ └── deployment.yaml # Controller Deployment
├── examples/
│ ├── test-workload.yaml # CPU gang-scheduled job (4 pods)
│ └── gpu-workload.yaml # GPU training job (4 GPUs)
├── test/e2e/ # End-to-end tests
├── scripts/
│ ├── setup-eks.sh # EKS cluster setup automation
│ └── cleanup-eks.sh # EKS cluster cleanup
├── .github/workflows/ # CI/CD (lint, test, build, release)
├── Dockerfile # Multi-stage distroless build
├── Makefile # Build, test, deploy targets
├── ARCHITECTURE.md # Design docs + flow diagrams
├── QUICKSTART.md # Full AWS EKS setup guide
├── DEPLOY.md # Production deployment guide
├── CONTRIBUTING.md # Contribution guidelines
├── CODE_OF_CONDUCT.md # Contributor Covenant
├── SECURITY.md # Vulnerability reporting
└── LICENSE # Apache 2.0
The Setu controller must mark the AdmissionCheck as Active on startup. If it's not Active, Kueue's ClusterQueue will stay inactive.
kubectl get admissionchecks karpenter-provision -o yaml
kubectl logs -n kueue-system -l app=setu --tail=50

# Check workload status and admission check state
kubectl describe workload <name>
# Check Setu logs for errors
kubectl logs -n kueue-system -l app=setu -f
# Ensure ClusterQueue references the admission check
kubectl get clusterqueue <name> -o yaml
# Verify LocalQueue points to correct ClusterQueue
kubectl get localqueue <name> -n <namespace> -o yaml | grep clusterQueue

Setu skips quota validation when the LocalQueue→ClusterQueue resolution fails (e.g., LocalQueue missing or `clusterQueue` field empty). This is safe in advisory mode (the default) — provisioning continues.
If you see workloads provisioned without quota validation:
# Verify LocalQueue exists and has clusterQueue set
kubectl get localqueue -A
kubectl get localqueue <name> -n <namespace> -o jsonpath='{.spec.clusterQueue}'
# Enable enforcement to reject workloads when quota can't be validated
# Edit deployment: add --enforce-quota=true flag

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50
# Check NodeClaim status
kubectl describe nodeclaim <name>

Common causes:
- Missing or misconfigured NodeClass (EC2NodeClass, GCPNodeClass, etc.)
- IAM permissions (for AWS: node role, Karpenter controller role)
- Subnet/security group tagging (`karpenter.sh/discovery`)
- `aws-auth` ConfigMap not updated with node role
Setu rejects the workload after 10 minutes if NodeClaims don't become Ready. All created NodeClaims are rolled back. Check the NodeClaim troubleshooting steps above.
helm uninstall setu -n kueue-system

kubectl delete -f deploy/manifests/deployment.yaml
kubectl delete -f deploy/manifests/admissioncheck.yaml
kubectl delete -f deploy/manifests/rbac.yaml

- Issues: GitHub Issues — bug reports and feature requests
- Contributing: CONTRIBUTING.md — how to contribute
- Code of Conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md — vulnerability reporting
- Kueue Documentation
- Kueue AdmissionCheck Concepts
- Karpenter Documentation
- Karpenter NodeClaim Concepts
- Kubernetes Gang Scheduling (v1.35 alpha)
- Kueue #5133: Kueue and Karpenter Support
- Karpenter #749: Manual Node Provisioning
- ARCHITECTURE.md — Detailed design and flow diagrams
- QUICKSTART.md — Full AWS EKS setup from scratch
- DEPLOY.md — Production deployment guide
Apache 2.0 — see LICENSE for details.