Setu - Kueue to Karpenter Controller

Setu (सेतु) - Sanskrit for "bridge", as in Ram Setu. This controller connects Kueue workload queuing with Karpenter node provisioning.


Status: Alpha — Functional and E2E-tested on AWS EKS. Use in non-production environments first. Contributions welcome!


The Problem

Kueue is the de facto standard for Kubernetes job queuing. It integrates natively with Cluster Autoscaler via the ProvisioningRequest API — Kueue tells CAS to provision capacity before admitting workloads, so pods schedule instantly on warm nodes.

Karpenter does not implement ProvisioningRequest. If your cluster runs Karpenter, Kueue has no way to pre-provision nodes:

  WITH CLUSTER AUTOSCALER                WITH KARPENTER (without Setu)
  ───────────────────────                ────────────────────────────
  Kueue ─ProvisioningRequest─▶ CAS      Kueue ──── ??? ────▶ Karpenter
       (native integration)                    (NO INTEGRATION)

  1. Kueue asks CAS for capacity          1. Kueue admits workload
  2. CAS provisions nodes                  2. Pods go Pending
  3. Nodes ready                           3. Karpenter reacts to Pending pods
  4. Kueue admits workload                 4. Karpenter provisions nodes reactively
  5. Pods schedule instantly               5. Pods finally schedule
     ✅ Zero cold-start                       ❌ Reactive cold-start

This is a known gap — see Kueue #5133 ("Kueue and Karpenter Support").

Setu bridges this gap today.


The Solution

Setu uses two existing, stable APIs — Kueue's AdmissionCheck extensibility and Karpenter's NodeClaim v1 API — to close the gap:

┌────────────┐       ┌─────────────────┐       ┌─────────────┐
│   KUEUE    │──────▶│      SETU       │──────▶│  KARPENTER  │
│  Workload  │       │   Controller    │       │  NodeClaim  │
│  (Pending) │       │  Watch/Validate │       │  (Create)   │
└────────────┘       │  Provision      │       └──────┬──────┘
                     │  Approve/Reject │              │
                     └────────┬────────┘              ▼
                              │               ┌─────────────┐
                              ▼               │  Cloud Node │
                     ┌─────────────────┐      │  (EC2, etc) │
                     │ AdmissionCheck  │      └─────────────┘
                     │ → Ready         │
                     └─────────────────┘

Zero changes to Kueue. Zero changes to Karpenter. Setu is a standalone controller that uses only public, stable APIs from both projects.


When to Use Setu

You need Setu if:

  • You run Kueue + Karpenter and want proactive (not reactive) node provisioning
  • You run distributed ML training (PyTorch, JAX, MPI) where all workers must start together
  • You have GPU batch jobs that need gang scheduling (all-or-nothing GPU allocation)
  • You run large-scale batch processing where reactive autoscaling adds unnecessary cold-start delays
  • You operate multi-tenant clusters where Kueue manages fairness and Karpenter handles provisioning
  • You are migrating from Cluster Autoscaler to Karpenter and need to keep Kueue's provisioning-aware admission

You do NOT need Setu if:

  • You use Cluster Autoscaler — Kueue already integrates natively via ProvisioningRequest
  • You don't use Kueue — Karpenter's reactive scaling may be sufficient for your workloads
  • Your workloads are latency-tolerant and can wait for reactive node provisioning

Before and After

Without Setu: Kueue admits a workload, its pods go Pending, and only then does Karpenter react and provision nodes. For gang workloads (e.g. a 4-GPU training job), partial allocation means some GPUs sit idle waiting for the rest — wasting money and time.

With Setu: Nodes are provisioned before pods exist. Setu creates all NodeClaims atomically, waits for every node to be Ready, then tells Kueue to admit. Pods schedule instantly on warm nodes. For gang workloads, all GPUs come online together.


Comparison with Alternatives

Setu vs. Cluster Autoscaler + ProvisioningRequest

                        CAS + ProvisioningRequest       Setu + Karpenter
  Provisioner           Cluster Autoscaler              Karpenter
  Kueue integration     Native (built-in)               Via Setu AdmissionCheck
  Instance selection    Limited (ASG-based)             Advanced (Karpenter's optimizer)
  Spot + fallback       Manual ASG config               Karpenter native
  Consolidation         None                            Karpenter native
  Multi-cloud           AWS only (EKS)                  AWS, GCP, Azure via Karpenter

Choose Setu if you prefer Karpenter's instance selection, spot handling, and consolidation over CAS.

Setu vs. Kubernetes 1.35 Gang Scheduling

Kubernetes 1.35 introduced native gang scheduling (alpha, disabled by default) using SchedulingGate and coscheduling. This is a scheduler-level feature — it solves pod placement, not node provisioning:

                         K8s 1.35 Gang Scheduling                                   Setu
  Layer                  Scheduler (pod placement)                                  Infrastructure (node provisioning)
  What it solves         Ensures pods schedule together on existing nodes           Ensures nodes exist before pods are created
  Handles cold starts?   No — if nodes don't exist, all pods stay Pending together  Yes — creates NodeClaims proactively; nodes are warm before admission
  Kueue aware?           Yes — Kueue uses it internally                             Yes — Setu bridges Kueue to Karpenter
  Karpenter aware?       No — Karpenter still reacts to pending pods                Yes — Setu tells Karpenter to provision before pods exist
  Status                 Alpha (K8s 1.35, disabled by default)                      Works today on K8s 1.29+

K8s gang scheduling and Setu complement each other:

  1. K8s gang scheduling prevents pods from being fragmented — all pods in a group schedule together or not at all.
  2. Setu ensures the capacity exists in the first place — nodes are provisioned before Kueue admits the workload.

Without Setu, pairing K8s gang scheduling with Karpenter means all pods are gang-gated together, but Karpenter still acts only once it sees pending pods, so you still incur the reactive cold-start delay. Setu eliminates that wait by pre-provisioning nodes before workload admission.

Setu vs. Over-Provisioning with Placeholder Pods

A common workaround is running low-priority "pause" pods to keep warm capacity. This approach:

  • Wastes resources 24/7 (pods consume quota even when idle)
  • Doesn't scale with workload demand
  • Risks Karpenter consolidating away the warm nodes
  • Provides no gang guarantees

Setu provisions capacity on-demand, per-workload, and cleans up when the workload completes.


Features

  Feature                Description
  Gang Scheduling        Atomic all-or-nothing node provisioning with rollback
  Capacity Validation    Checks NodePool limits before provisioning
  Quota Awareness        Advisory or enforcing ClusterQueue quota checking
  NodeClass Validation   Pre-flight check that referenced NodeClass exists
  Exponential Backoff    Retry failed provisions (5s to 80s, max 5 retries)
  Finalizers             Guaranteed NodeClaim cleanup on workload deletion
  Cloud Agnostic         Configurable NodeClass for AWS, GCP, Azure
  Prometheus Metrics     Full observability (9 metrics with labels)
  Leader Election        HA-ready with controller-runtime leader election
  Accelerator Support    NVIDIA GPU, AWS Neuron (Trainium/Inferentia), custom resources

How It Works

User submits Job ──▶ Kueue creates Workload (AdmissionCheck: Pending)
  ──▶ Setu validates quota + capacity
  ──▶ Setu creates NodeClaim(s) via Karpenter API (gang-id labeled)
  ──▶ Karpenter provisions cloud instances
  ──▶ Setu polls until ALL NodeClaims are Ready
  ──▶ Setu approves AdmissionCheck
  ──▶ Kueue admits Workload
  ──▶ Pods schedule instantly on pre-provisioned nodes

Step by step:

  1. User submits a Job with kueue.x-k8s.io/queue-name label
  2. Kueue creates a Workload with karpenter-provision AdmissionCheck = Pending
  3. Setu detects the pending check
  4. Setu validates ClusterQueue quota and NodePool capacity (advisory by default)
  5. Setu creates NodeClaim(s) — one per pod — all sharing a gang-id label
  6. Karpenter provisions cloud instances (EC2, GCE, Azure VMs)
  7. Setu polls NodeClaim status every 15 seconds
  8. All Ready? Setu approves the AdmissionCheck
  9. Any failed? Setu rolls back all NodeClaims and retries with exponential backoff
  10. Timeout (10 min)? Setu rejects the AdmissionCheck with a clear error
  11. Kueue admits the Workload, pods schedule instantly
  12. On workload deletion, Setu's finalizer cleans up all NodeClaims (no orphaned nodes)

For detailed architecture diagrams, see ARCHITECTURE.md.


Tech Stack

  Component            Version   Role
  Go                   1.22+     Controller language
  Kubernetes           1.29+     Target cluster
  Kueue                v0.16+    Job queuing and admission (v1beta2 API)
  Karpenter            v1.0+     Node provisioning (v1 NodeClaim API)
  controller-runtime   v0.19     Kubernetes controller framework
  Helm                 3.x       For Helm-based install (optional)
  Docker               20.10+    For building the controller image
  Prometheus           any       Metrics collection (optional)

Tested On

  Environment   Kubernetes   Karpenter   Kueue     Result
  AWS EKS       v1.35        v1.8.2      v0.16.0   E2E passing (test output)

Prerequisites: Kueue and Karpenter must already be installed on the cluster. See QUICKSTART.md for a complete AWS EKS setup from scratch.


Getting Started

Building the Docker Image

Before deploying, build and push the controller image to a registry your cluster can pull from:

# Build for Linux (required for Kubernetes)
docker build --platform linux/amd64 -t <your-registry>/setu:latest .

# Push to your registry (ECR, GHCR, Docker Hub, etc.)
docker push <your-registry>/setu:latest

Or use the Makefile:

make docker-build IMG=<your-registry>/setu:latest
make docker-push IMG=<your-registry>/setu:latest

Note: On macOS/Apple Silicon, always include --platform linux/amd64. Docker defaults to arm64, which will not run on most Kubernetes nodes.

Deployment

Option A: Helm Chart (Recommended)

helm install setu charts/setu \
  -n kueue-system \
  --create-namespace \
  --set image.repository=<your-registry>/setu \
  --set image.tag=latest

This creates:

  • ServiceAccount, ClusterRole, ClusterRoleBinding
  • Deployment (1 replica with leader election)
  • AdmissionCheck karpenter-provision
  • Service (metrics on port 8080)

To customize:

helm install setu charts/setu \
  -n kueue-system \
  --create-namespace \
  --set image.repository=<your-registry>/setu \
  --set image.tag=latest \
  --set controller.extraArgs='{--cloud-provider=gcp}' \
  --set serviceMonitor.enabled=true

See charts/setu/values.yaml for all configurable values.

Option B: Raw Manifests

Step 1: Build and push your image

# Build for Linux (required for Kubernetes)
docker build --platform linux/amd64 -t <your-registry>/setu:<tag> .

# Push to your registry
docker push <your-registry>/setu:<tag>

Step 2: Update the manifest with your image

Edit deploy/manifests/deployment.yaml and replace the image on line 21:

image: <your-registry>/setu:<tag>  # Changed from ghcr.io/sanjeevrg89/setu:latest

Step 3: Deploy

kubectl apply -f deploy/manifests/rbac.yaml
kubectl apply -f deploy/manifests/admissioncheck.yaml
kubectl apply -f deploy/manifests/deployment.yaml

Verify Installation

# Controller is running
kubectl get pods -n kueue-system -l app=setu

# AdmissionCheck is Active
kubectl get admissionchecks karpenter-provision -o yaml
# Look for: status.conditions[].type=Active, status=True

Configuration

Kueue Setup

Create a ClusterQueue that references the karpenter-provision AdmissionCheck:

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: my-queue
spec:
  admissionChecksStrategy:
    admissionChecks:
      - name: karpenter-provision
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default
          resources:
            - name: "cpu"
              nominalQuota: 1000
            - name: "memory"
              nominalQuota: 4000Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 64

Create a LocalQueue in the user namespace:

apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: default
  namespace: default
spec:
  clusterQueue: my-queue

Controller Flags

  Flag                          Default         Description
  --cloud-provider              aws             Cloud provider preset (aws, gcp, azure)
  --enforce-quota               false           Reject workloads exceeding ClusterQueue quota
  --enforce-capacity            false           Reject workloads when NodePool capacity is insufficient
  --validate-node-class         true            Pre-flight check that referenced NodeClass exists
  --node-class-group            (from preset)   Override NodeClass API group
  --node-class-kind             (from preset)   Override NodeClass CRD kind
  --node-class-default          default         NodeClass name for CPU workloads
  --node-class-gpu              gpu             NodeClass name for GPU/accelerator workloads
  --leader-elect                false           Enable leader election for HA
  --metrics-bind-address        :8080           Metrics endpoint
  --health-probe-bind-address   :8081           Health/readiness probe endpoint
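
These flags are passed as container args on the controller Deployment. A sketch of enabling strict enforcement in deploy/manifests/deployment.yaml — the container name and surrounding structure are assumptions, so check the actual manifest:

```yaml
# deployment.yaml (excerpt) — illustrative sketch only
spec:
  containers:
    - name: setu   # assumed container name
      args:
        - --leader-elect=true
        - --enforce-quota=true      # reject workloads exceeding ClusterQueue quota
        - --enforce-capacity=true   # reject workloads when NodePool capacity is short
```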

Cloud Provider Presets

  Provider   --cloud-provider   NodeClass Group       NodeClass Kind
  AWS        aws                karpenter.k8s.aws     EC2NodeClass
  GCP        gcp                karpenter.gcp.io      GCPNodeClass
  Azure      azure              karpenter.azure.com   AKSNodeClass
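
As a sketch, a Helm values file for a GCP cluster might combine the preset with an image override. The chart keys mirror the install examples in this README — verify them against charts/setu/values.yaml:

```yaml
# values-gcp.yaml — illustrative sketch only
image:
  repository: <your-registry>/setu
  tag: latest
controller:
  extraArgs:
    - --cloud-provider=gcp   # selects karpenter.gcp.io / GCPNodeClass
```

Then install with: helm install setu charts/setu -n kueue-system --create-namespace -f values-gcp.yaml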

Usage Examples

GPU Training Job (4-GPU Gang)

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  labels:
    kueue.x-k8s.io/queue-name: default
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
        - name: train
          image: pytorch/pytorch:latest
          resources:
            requests:
              nvidia.com/gpu: "1"
              cpu: "4"
              memory: "16Gi"
      restartPolicy: Never

Setu will create 4 NodeClaims (one per pod) as a gang. All 4 must become Ready before the workload is admitted.

CPU Batch Job

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
  labels:
    kueue.x-k8s.io/queue-name: default
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      containers:
        - name: worker
          image: busybox:latest
          command: ["sleep", "60"]
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      restartPolicy: Never

Watch the Flow

# Watch workloads
kubectl get workloads -w

# Watch NodeClaims created by Setu (named setu-<workload>-<index>)
kubectl get nodeclaims -w

# Watch pods schedule
kubectl get pods -w

# Check controller logs
kubectl logs -n kueue-system -l app=setu -f

Metrics

All metrics use the setu_ namespace prefix. Scrape port 8080 at /metrics.

  Metric                                         Type        Labels                    Description
  setu_workloads_processed_total                 Counter     result                    Workloads processed (approved/rejected/error)
  setu_nodeclaims_created_total                  Counter     nodepool, capacity_type   NodeClaims created
  setu_nodeclaim_provisioning_duration_seconds   Histogram   nodepool, result          Time for NodeClaim to become ready
  setu_gang_provisioning_attempts_total          Counter     result                    Gang attempts (success/rollback)
  setu_active_nodeclaims                         Gauge       nodepool, state           Currently active NodeClaims
  setu_admission_check_latency_seconds           Histogram   result                    End-to-end admission check latency
  setu_capacity_validation_total                 Counter     nodepool, result          NodePool capacity validation results
  setu_quota_check_total                         Counter     queue, result             ClusterQueue quota check results
  setu_retry_attempts_total                      Counter     operation, attempt        Retry attempts

Enable a ServiceMonitor for Prometheus Operator:

helm upgrade setu charts/setu -n kueue-system --set serviceMonitor.enabled=true
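
Once scraped, these metrics can drive alerts. A hedged example of alerting on gang rollbacks with the Prometheus Operator — the metric name and result label come from the table above, while the rule name, namespace, and threshold are illustrative:

```yaml
# setu-alerts.yaml — illustrative sketch only
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: setu-alerts
  namespace: kueue-system
spec:
  groups:
    - name: setu
      rules:
        - alert: SetuGangRollbacks
          expr: increase(setu_gang_provisioning_attempts_total{result="rollback"}[15m]) > 0
          labels:
            severity: warning
          annotations:
            summary: Setu rolled back at least one gang provisioning attempt
```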

Architecture

For the full architecture with detailed flow diagrams, CRD relationships, retry logic, and design decisions, see ARCHITECTURE.md.

Key design principles:

  1. Karpenter decides instance types, not Setu. Setu passes resource requirements via NodeClaim. Karpenter's scheduling logic selects optimal instances.
  2. Cloud-agnostic by design. NodeClass references are configurable via --cloud-provider flag or explicit --node-class-group/--node-class-kind overrides.
  3. All accelerators, not just NVIDIA. Supports nvidia.com/gpu, aws.amazon.com/neuron (Trainium/Inferentia), and any custom accelerator resource.
  4. Operators choose enforcement level. Quota and capacity checks are advisory by default. Enable --enforce-quota or --enforce-capacity to reject workloads early.
  5. Fail fast, not fail slow. NodeClass existence is validated before creating NodeClaims. Missing NodeClass = immediate rejection, not a 10-minute timeout.
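
Principle 1 can be illustrated by what a Setu-built NodeClaim might carry: only resource requirements and a NodeClass reference, never an instance type. The field paths follow Karpenter's v1 NodeClaim API, but the gang label key and helper function are our assumptions, not Setu's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildNodeClaim sketches mapping one pod's resource requests onto a
// Karpenter v1 NodeClaim. No instanceType is set — Karpenter's scheduler
// picks the instance from the requests and the NodeClass.
func buildNodeClaim(name, gangID string, requests map[string]string) map[string]any {
	return map[string]any{
		"apiVersion": "karpenter.sh/v1",
		"kind":       "NodeClaim",
		"metadata": map[string]any{
			"name":   name,
			"labels": map[string]any{"setu.io/gang-id": gangID}, // assumed label key
		},
		"spec": map[string]any{
			"nodeClassRef": map[string]any{
				"group": "karpenter.k8s.aws", // AWS preset; see Cloud Provider Presets
				"kind":  "EC2NodeClass",
				"name":  "gpu", // --node-class-gpu default
			},
			"resources": map[string]any{"requests": requests},
		},
	}
}

func main() {
	nc := buildNodeClaim("setu-training-job-0", "training-job", map[string]string{
		"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1",
	})
	b, _ := json.MarshalIndent(nc, "", "  ")
	fmt.Println(string(b))
}
```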

How Setu Fits In

  • Setu uses Karpenter's stable v1 NodeClaim API and Kueue's AdmissionCheck extensibility — both are public, stable APIs.
  • Kubernetes 1.35 added native gang scheduling (alpha, disabled by default). This is a scheduler-level feature that ensures pods schedule together — it does not provision nodes. See Comparison with Alternatives for details on how Setu and K8s gang scheduling complement each other.

For community context behind the Kueue + Karpenter integration gap, see ISSUES.md.


Development

# Install dependencies
make deps

# Build binary locally
make build

# Run tests (unit + integration)
make test

# Run controller locally against current kubeconfig
make run

# Lint
make lint

# Helm chart validation
make helm-lint

See Building the Docker Image for container builds and CONTRIBUTING.md for contribution guidelines.


Project Structure

setu/
├── cmd/main.go                          # Entry point, flag parsing
├── pkg/
│   ├── controller/
│   │   ├── controller.go                # SetuReconciler (main reconcile loop)
│   │   ├── config.go                    # SetuConfig (cloud-agnostic config)
│   │   ├── nodeclaim.go                 # NodeClaim builder + requirement parser
│   │   ├── admissioncheck_reconciler.go # Marks AdmissionCheck as Active
│   │   ├── accessors.go                 # Unstructured field accessors
│   │   ├── *_test.go                    # Unit + integration tests
│   │   └── integration_test.go          # envtest-based integration tests
│   └── metrics/metrics.go              # Prometheus metrics (9 metrics)
├── charts/setu/                         # Helm chart
│   ├── Chart.yaml
│   ├── values.yaml
│   └── templates/                       # Deployment, RBAC, AdmissionCheck, PDB, etc.
├── deploy/manifests/                    # Raw Kubernetes manifests
│   ├── admissioncheck.yaml              # AdmissionCheck + example queues
│   ├── rbac.yaml                        # ServiceAccount, ClusterRole, ClusterRoleBinding
│   └── deployment.yaml                  # Controller Deployment
├── examples/
│   ├── test-workload.yaml               # CPU gang-scheduled job (4 pods)
│   └── gpu-workload.yaml                # GPU training job (4 GPUs)
├── test/e2e/                            # End-to-end tests
├── scripts/
│   ├── setup-eks.sh                     # EKS cluster setup automation
│   └── cleanup-eks.sh                   # EKS cluster cleanup
├── .github/workflows/                   # CI/CD (lint, test, build, release)
├── Dockerfile                           # Multi-stage distroless build
├── Makefile                             # Build, test, deploy targets
├── ARCHITECTURE.md                      # Design docs + flow diagrams
├── QUICKSTART.md                        # Full AWS EKS setup guide
├── DEPLOY.md                            # Production deployment guide
├── CONTRIBUTING.md                      # Contribution guidelines
├── CODE_OF_CONDUCT.md                   # Contributor Covenant
├── SECURITY.md                          # Vulnerability reporting
└── LICENSE                              # Apache 2.0

Troubleshooting

AdmissionCheck not Active

The Setu controller must mark the AdmissionCheck as Active on startup. If it's not Active, Kueue's ClusterQueue will stay inactive.

kubectl get admissionchecks karpenter-provision -o yaml
kubectl logs -n kueue-system -l app=setu --tail=50

Workloads Stuck in Pending

# Check workload status and admission check state
kubectl describe workload <name>

# Check Setu logs for errors
kubectl logs -n kueue-system -l app=setu -f

# Ensure ClusterQueue references the admission check
kubectl get clusterqueue <name> -o yaml

# Verify LocalQueue points to correct ClusterQueue
kubectl get localqueue <name> -n <namespace> -o yaml | grep clusterQueue

Quota Check Skipped

Setu skips quota validation when the LocalQueue→ClusterQueue resolution fails (e.g., LocalQueue missing or clusterQueue field empty). This is safe in advisory mode (default) — provisioning continues.

If you see workloads provisioned without quota validation:

# Verify LocalQueue exists and has clusterQueue set
kubectl get localqueue -A
kubectl get localqueue <name> -n <namespace> -o jsonpath='{.spec.clusterQueue}'

# Enable enforcement to reject workloads when quota can't be validated
# Edit deployment: add --enforce-quota=true flag

NodeClaims Not Becoming Ready

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50

# Check NodeClaim status
kubectl describe nodeclaim <name>

Common causes:

  • Missing or misconfigured NodeClass (EC2NodeClass, GCPNodeClass, etc.)
  • IAM permissions (for AWS: node role, Karpenter controller role)
  • Subnet/security group tagging (karpenter.sh/discovery)
  • aws-auth ConfigMap not updated with node role

Provisioning Timeout (10 min)

Setu rejects the workload after 10 minutes if NodeClaims don't become Ready. All created NodeClaims are rolled back. Check the NodeClaim troubleshooting steps above.


Uninstall

Helm

helm uninstall setu -n kueue-system

Raw Manifests

kubectl delete -f deploy/manifests/deployment.yaml
kubectl delete -f deploy/manifests/admissioncheck.yaml
kubectl delete -f deploy/manifests/rbac.yaml

Community

References

License

Apache 2.0 — see LICENSE for details.
