Setu - Kueue to Karpenter Controller

Setu (सेतु) - Sanskrit for "bridge", as in Ram Setu. This controller connects Kueue workload queuing with Karpenter node provisioning.


Status: Alpha — Functional and E2E-tested on AWS EKS. Use in non-production environments first. Contributions welcome!


The Problem

Kueue is the de facto standard for Kubernetes job queuing. It integrates natively with Cluster Autoscaler via the ProvisioningRequest API — Kueue tells CAS to provision capacity before admitting workloads, so pods schedule instantly on warm nodes.

Karpenter does not implement ProvisioningRequest. If your cluster runs Karpenter, Kueue has no way to pre-provision nodes:

  WITH CLUSTER AUTOSCALER                WITH KARPENTER (without Setu)
  ───────────────────────                ────────────────────────────
  Kueue ─ProvisioningRequest─▶ CAS      Kueue ──── ??? ────▶ Karpenter
       (native integration)                    (NO INTEGRATION)

  1. Kueue asks CAS for capacity          1. Kueue admits workload
  2. CAS provisions nodes                  2. Pods go Pending
  3. Nodes ready                           3. Karpenter reacts to Pending pods
  4. Kueue admits workload                 4. Karpenter provisions nodes reactively
  5. Pods schedule instantly               5. Pods finally schedule
     ✅ Zero cold-start                       ❌ Reactive cold-start

This is a known gap — see Kueue #5133 ("Kueue and Karpenter Support").

Setu bridges this gap today.


The Solution

Setu uses two existing, stable APIs — Kueue's AdmissionCheck extensibility and Karpenter's NodeClaim v1 API — to close the gap:

┌────────────┐       ┌─────────────────┐       ┌─────────────┐
│   KUEUE    │──────▶│      SETU       │──────▶│  KARPENTER  │
│  Workload  │       │   Controller    │       │  NodeClaim  │
│  (Pending) │       │  Watch/Validate │       │  (Create)   │
└────────────┘       │  Provision      │       └──────┬──────┘
                     │  Approve/Reject │              │
                     └────────┬────────┘              ▼
                              │               ┌─────────────┐
                              ▼               │  Cloud Node │
                     ┌─────────────────┐      │  (EC2, etc) │
                     │ AdmissionCheck  │      └─────────────┘
                     │ → Ready         │
                     └─────────────────┘

Zero changes to Kueue. Zero changes to Karpenter. Setu is a standalone controller that uses only public, stable APIs from both projects.


When to Use Setu

You need Setu if:

  • You run Kueue + Karpenter and want proactive (not reactive) node provisioning
  • You run distributed ML training (PyTorch, JAX, MPI) where all workers must start together
  • You have GPU batch jobs that need gang scheduling (all-or-nothing GPU allocation)
  • You run large-scale batch processing where reactive autoscaling adds unnecessary cold-start delays
  • You operate multi-tenant clusters where Kueue manages fairness and Karpenter handles provisioning
  • You are migrating from Cluster Autoscaler to Karpenter and need to keep Kueue's provisioning-aware admission

You do NOT need Setu if:

  • You use Cluster Autoscaler — Kueue already integrates natively via ProvisioningRequest
  • You don't use Kueue — Karpenter's reactive scaling may be sufficient for your workloads
  • Your workloads are latency-tolerant and can wait for reactive node provisioning

Before and After

Without Setu: Kueue admits a workload, its pods go Pending, and only then does Karpenter react and provision nodes. For gang workloads (e.g. a 4-GPU training job), partial allocation means some GPUs sit idle waiting for the rest — wasting money and time.

With Setu: Nodes are provisioned before pods exist. Setu creates all NodeClaims atomically, waits for every node to be Ready, then tells Kueue to admit. Pods schedule instantly on warm nodes. For gang workloads, all GPUs come online together.


Comparison with Alternatives

Setu vs. Cluster Autoscaler + ProvisioningRequest

                        CAS + ProvisioningRequest       Setu + Karpenter
  Provisioner           Cluster Autoscaler              Karpenter
  Kueue integration     Native (built-in)               Via Setu AdmissionCheck
  Instance selection    Limited (ASG-based)             Advanced (Karpenter's optimizer)
  Spot + fallback       Manual ASG config               Karpenter native
  Consolidation         None                            Karpenter native
  Multi-cloud           AWS only (EKS)                  AWS, GCP, Azure via Karpenter

Choose Setu if you prefer Karpenter's instance selection, spot handling, and consolidation over CAS.

Setu vs. Kubernetes 1.35 Gang Scheduling

Kubernetes 1.35 introduced native gang scheduling (alpha, disabled by default) using SchedulingGate and coscheduling. This is a scheduler-level feature — it solves pod placement, not node provisioning:

                         K8s 1.35 Gang Scheduling                                   Setu
  Layer                  Scheduler (pod placement)                                  Infrastructure (node provisioning)
  What it solves         Ensures pods schedule together on existing nodes           Ensures nodes exist before pods are created
  Handles cold starts?   No — if nodes don't exist, all pods stay Pending together  Yes — creates NodeClaims proactively; nodes are warm before admission
  Kueue aware?           Yes — Kueue uses it internally                             Yes — Setu bridges Kueue to Karpenter
  Karpenter aware?       No — Karpenter still reacts to pending pods                Yes — Setu tells Karpenter to provision before pods exist
  Status                 Alpha (K8s 1.35, disabled by default)                      Works today on K8s 1.29+

K8s gang scheduling and Setu complement each other:

  1. K8s gang scheduling prevents pods from being fragmented — all pods in a group schedule together or not at all.
  2. Setu ensures the capacity exists in the first place — nodes are provisioned before Kueue admits the workload.

Without Setu, pairing K8s gang scheduling with Karpenter means all pods are gang-gated together, but Karpenter still acts only once it sees pending pods, so you still incur the reactive cold-start delay. Setu eliminates that wait by pre-provisioning nodes before workload admission.

Setu vs. Over-Provisioning with Placeholder Pods

A common workaround is running low-priority "pause" pods to keep warm capacity. This approach:

  • Wastes resources 24/7 (pods consume quota even when idle)
  • Doesn't scale with workload demand
  • Risks Karpenter consolidating away the warm nodes
  • Provides no gang guarantees

Setu provisions capacity on-demand, per-workload, and cleans up when the workload completes.


Features

  Feature                Description
  Gang Scheduling        Atomic all-or-nothing node provisioning with rollback
  Capacity Validation    Checks NodePool limits before provisioning
  Quota Awareness        Advisory or enforcing ClusterQueue quota checking
  NodeClass Validation   Pre-flight check that referenced NodeClass exists
  Exponential Backoff    Retry failed provisions (5s to 80s, max 5 retries)
  Finalizers             Guaranteed NodeClaim cleanup on workload deletion
  Cloud Agnostic         Configurable NodeClass for AWS, GCP, Azure
  Prometheus Metrics     Full observability (9 metrics with labels)
  Leader Election        HA-ready with controller-runtime leader election
  Accelerator Support    NVIDIA GPU, AWS Neuron (Trainium/Inferentia), custom resources

How It Works

User submits Job ──▶ Kueue creates Workload (AdmissionCheck: Pending)
  ──▶ Setu validates quota + capacity
  ──▶ Setu creates NodeClaim(s) via Karpenter API (gang-id labeled)
  ──▶ Karpenter provisions cloud instances
  ──▶ Setu polls until ALL NodeClaims are Ready
  ──▶ Setu approves AdmissionCheck
  ──▶ Kueue admits Workload
  ──▶ Pods schedule instantly on pre-provisioned nodes

Step by step:

  1. User submits a Job with kueue.x-k8s.io/queue-name label
  2. Kueue creates a Workload with karpenter-provision AdmissionCheck = Pending
  3. Setu detects the pending check
  4. Setu validates ClusterQueue quota and NodePool capacity (advisory by default)
  5. Setu creates NodeClaim(s) — one per pod — all sharing a gang-id label
  6. Karpenter provisions cloud instances (EC2, GCE, Azure VMs)
  7. Setu polls NodeClaim status every 15 seconds
  8. All Ready? Setu approves the AdmissionCheck
  9. Any failed? Setu rolls back all NodeClaims and retries with exponential backoff
  10. Timeout (10 min)? Setu rejects the AdmissionCheck with a clear error
  11. Kueue admits the Workload, pods schedule instantly
  12. On workload deletion, Setu's finalizer cleans up all NodeClaims (no orphaned nodes)

For detailed architecture diagrams, see ARCHITECTURE.md.


Tech Stack

  Component            Version   Role
  Go                   1.22+     Controller language
  Kubernetes           1.29+     Target cluster
  Kueue                v0.16+    Job queuing and admission (v1beta2 API)
  Karpenter            v1.0+     Node provisioning (v1 NodeClaim API)
  controller-runtime   v0.19     Kubernetes controller framework
  Helm                 3.x       For Helm-based install (optional)
  Docker               20.10+    For building the controller image
  Prometheus           any       Metrics collection (optional)

Tested On

  Environment   Kubernetes   Karpenter   Kueue     Result
  AWS EKS       v1.35        v1.8.2      v0.16.0   E2E passing (test output)

Prerequisites: Kueue and Karpenter must already be installed on the cluster. See QUICKSTART.md for a complete AWS EKS setup from scratch.


Getting Started

Building the Docker Image

Before deploying, build and push the controller image to a registry your cluster can pull from:

# Build for Linux (required for Kubernetes)
docker build --platform linux/amd64 -t <your-registry>/setu:latest .

# Push to your registry (ECR, GHCR, Docker Hub, etc.)
docker push <your-registry>/setu:latest

Or use the Makefile:

make docker-build IMG=<your-registry>/setu:latest
make docker-push IMG=<your-registry>/setu:latest

Note: On macOS/Apple Silicon, always include --platform linux/amd64. Docker defaults to arm64, which will not run on most Kubernetes nodes.

Deployment

Option A: Helm Chart (Recommended)

helm install setu charts/setu \
  -n kueue-system \
  --create-namespace \
  --set image.repository=<your-registry>/setu \
  --set image.tag=latest

This creates:

  • ServiceAccount, ClusterRole, ClusterRoleBinding
  • Deployment (1 replica with leader election)
  • AdmissionCheck karpenter-provision
  • Service (metrics on port 8080)

To customize:

helm install setu charts/setu \
  -n kueue-system \
  --create-namespace \
  --set image.repository=<your-registry>/setu \
  --set image.tag=latest \
  --set controller.extraArgs='{--cloud-provider=gcp}' \
  --set serviceMonitor.enabled=true

See charts/setu/values.yaml for all configurable values.

Option B: Raw Manifests

Step 1: Build and push your image

# Build for Linux (required for Kubernetes)
docker build --platform linux/amd64 -t <your-registry>/setu:<tag> .

# Push to your registry
docker push <your-registry>/setu:<tag>

Step 2: Update the manifest with your image

Edit deploy/manifests/deployment.yaml and replace the image on line 21:

image: <your-registry>/setu:<tag>  # Changed from ghcr.io/sanjeevrg89/setu:latest

Step 3: Deploy

kubectl apply -f deploy/manifests/rbac.yaml
kubectl apply -f deploy/manifests/admissioncheck.yaml
kubectl apply -f deploy/manifests/deployment.yaml

Verify Installation

# Controller is running
kubectl get pods -n kueue-system -l app=setu

# AdmissionCheck is Active
kubectl get admissionchecks karpenter-provision -o yaml
# Look for: status.conditions[].type=Active, status=True

Configuration

Kueue Setup

Create a ClusterQueue that references the karpenter-provision AdmissionCheck:

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: my-queue
spec:
  admissionChecksStrategy:
    admissionChecks:
      - name: karpenter-provision
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default
          resources:
            - name: "cpu"
              nominalQuota: 1000
            - name: "memory"
              nominalQuota: 4000Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 64

Create a LocalQueue in the user namespace:

apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: default
  namespace: default
spec:
  clusterQueue: my-queue

Controller Flags

  Flag                          Default         Description
  --cloud-provider              aws             Cloud provider preset (aws, gcp, azure)
  --enforce-quota               false           Reject workloads exceeding ClusterQueue quota
  --enforce-capacity            false           Reject workloads when NodePool capacity is insufficient
  --validate-node-class         true            Pre-flight check that referenced NodeClass exists
  --node-class-group            (from preset)   Override NodeClass API group
  --node-class-kind             (from preset)   Override NodeClass CRD kind
  --node-class-default          default         NodeClass name for CPU workloads
  --node-class-gpu              gpu             NodeClass name for GPU/accelerator workloads
  --leader-elect                false           Enable leader election for HA
  --metrics-bind-address        :8080           Metrics endpoint
  --health-probe-bind-address   :8081           Health/readiness probe endpoint
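
These flags are passed as container args on the controller Deployment. A sketch of enabling strict enforcement in deploy/manifests/deployment.yaml — the container name and surrounding structure are assumptions, so check the actual manifest:

```yaml
# deployment.yaml (excerpt) — illustrative sketch only
spec:
  containers:
    - name: setu   # assumed container name
      args:
        - --leader-elect=true
        - --enforce-quota=true      # reject workloads exceeding ClusterQueue quota
        - --enforce-capacity=true   # reject workloads when NodePool capacity is short
```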

Cloud Provider Presets

  Provider   --cloud-provider   NodeClass Group       NodeClass Kind
  AWS        aws                karpenter.k8s.aws     EC2NodeClass
  GCP        gcp                karpenter.gcp.io      GCPNodeClass
  Azure      azure              karpenter.azure.com   AKSNodeClass
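
As a sketch, a Helm values file for a GCP cluster might combine the preset with an image override. The chart keys mirror the install examples in this README — verify them against charts/setu/values.yaml:

```yaml
# values-gcp.yaml — illustrative sketch only
image:
  repository: <your-registry>/setu
  tag: latest
controller:
  extraArgs:
    - --cloud-provider=gcp   # selects karpenter.gcp.io / GCPNodeClass
```

Then install with: helm install setu charts/setu -n kueue-system --create-namespace -f values-gcp.yaml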

Usage Examples

GPU Training Job (4-GPU Gang)

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  labels:
    kueue.x-k8s.io/queue-name: default
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
        - name: train
          image: pytorch/pytorch:latest
          resources:
            requests:
              nvidia.com/gpu: "1"
              cpu: "4"
              memory: "16Gi"
      restartPolicy: Never

Setu will create 4 NodeClaims (one per pod) as a gang. All 4 must become Ready before the workload is admitted.

CPU Batch Job

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
  labels:
    kueue.x-k8s.io/queue-name: default
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      containers:
        - name: worker
          image: busybox:latest
          command: ["sleep", "60"]
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      restartPolicy: Never

Watch the Flow

# Watch workloads
kubectl get workloads -w

# Watch NodeClaims created by Setu (named setu-<workload>-<index>)
kubectl get nodeclaims -w

# Watch pods schedule
kubectl get pods -w

# Check controller logs
kubectl logs -n kueue-system -l app=setu -f

Metrics

All metrics use the setu_ namespace prefix. Scrape port 8080 at /metrics.

  Metric                                         Type        Labels                    Description
  setu_workloads_processed_total                 Counter     result                    Workloads processed (approved/rejected/error)
  setu_nodeclaims_created_total                  Counter     nodepool, capacity_type   NodeClaims created
  setu_nodeclaim_provisioning_duration_seconds   Histogram   nodepool, result          Time for NodeClaim to become ready
  setu_gang_provisioning_attempts_total          Counter     result                    Gang attempts (success/rollback)
  setu_active_nodeclaims                         Gauge       nodepool, state           Currently active NodeClaims
  setu_admission_check_latency_seconds           Histogram   result                    End-to-end admission check latency
  setu_capacity_validation_total                 Counter     nodepool, result          NodePool capacity validation results
  setu_quota_check_total                         Counter     queue, result             ClusterQueue quota check results
  setu_retry_attempts_total                      Counter     operation, attempt        Retry attempts

Enable a ServiceMonitor for Prometheus Operator:

helm upgrade setu charts/setu -n kueue-system --set serviceMonitor.enabled=true
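
Once scraped, these metrics can drive alerts. A hedged example of alerting on gang rollbacks with the Prometheus Operator — the metric name and result label come from the table above, while the rule name, namespace, and threshold are illustrative:

```yaml
# setu-alerts.yaml — illustrative sketch only
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: setu-alerts
  namespace: kueue-system
spec:
  groups:
    - name: setu
      rules:
        - alert: SetuGangRollbacks
          expr: increase(setu_gang_provisioning_attempts_total{result="rollback"}[15m]) > 0
          labels:
            severity: warning
          annotations:
            summary: Setu rolled back at least one gang provisioning attempt
```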

Architecture

For the full architecture with detailed flow diagrams, CRD relationships, retry logic, and design decisions, see ARCHITECTURE.md.

Key design principles:

  1. Karpenter decides instance types, not Setu. Setu passes resource requirements via NodeClaim. Karpenter's scheduling logic selects optimal instances.
  2. Cloud-agnostic by design. NodeClass references are configurable via --cloud-provider flag or explicit --node-class-group/--node-class-kind overrides.
  3. All accelerators, not just NVIDIA. Supports nvidia.com/gpu, aws.amazon.com/neuron (Trainium/Inferentia), and any custom accelerator resource.
  4. Operators choose enforcement level. Quota and capacity checks are advisory by default. Enable --enforce-quota or --enforce-capacity to reject workloads early.
  5. Fail fast, not fail slow. NodeClass existence is validated before creating NodeClaims. Missing NodeClass = immediate rejection, not a 10-minute timeout.
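
Principle 1 can be illustrated by what a Setu-built NodeClaim might carry: only resource requirements and a NodeClass reference, never an instance type. The field paths follow Karpenter's v1 NodeClaim API, but the gang label key and helper function are our assumptions, not Setu's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildNodeClaim sketches mapping one pod's resource requests onto a
// Karpenter v1 NodeClaim. No instanceType is set — Karpenter's scheduler
// picks the instance from the requests and the NodeClass.
func buildNodeClaim(name, gangID string, requests map[string]string) map[string]any {
	return map[string]any{
		"apiVersion": "karpenter.sh/v1",
		"kind":       "NodeClaim",
		"metadata": map[string]any{
			"name":   name,
			"labels": map[string]any{"setu.io/gang-id": gangID}, // assumed label key
		},
		"spec": map[string]any{
			"nodeClassRef": map[string]any{
				"group": "karpenter.k8s.aws", // AWS preset; see Cloud Provider Presets
				"kind":  "EC2NodeClass",
				"name":  "gpu", // --node-class-gpu default
			},
			"resources": map[string]any{"requests": requests},
		},
	}
}

func main() {
	nc := buildNodeClaim("setu-training-job-0", "training-job", map[string]string{
		"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1",
	})
	b, _ := json.MarshalIndent(nc, "", "  ")
	fmt.Println(string(b))
}
```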

How Setu Fits In

  • Setu uses Karpenter's stable v1 NodeClaim API and Kueue's AdmissionCheck extensibility — both are public, stable APIs.
  • Kubernetes 1.35 added native gang scheduling (alpha, disabled by default). This is a scheduler-level feature that ensures pods schedule together — it does not provision nodes. See Comparison with Alternatives for details on how Setu and K8s gang scheduling complement each other.

For community context behind the Kueue + Karpenter integration gap, see ISSUES.md.


Development

# Install dependencies
make deps

# Build binary locally
make build

# Run tests (unit + integration)
make test

# Run controller locally against current kubeconfig
make run

# Lint
make lint

# Helm chart validation
make helm-lint

See Building the Docker Image for container builds and CONTRIBUTING.md for contribution guidelines.


Project Structure

setu/
├── cmd/main.go                          # Entry point, flag parsing
├── pkg/
│   ├── controller/
│   │   ├── controller.go                # SetuReconciler (main reconcile loop)
│   │   ├── config.go                    # SetuConfig (cloud-agnostic config)
│   │   ├── nodeclaim.go                 # NodeClaim builder + requirement parser
│   │   ├── admissioncheck_reconciler.go # Marks AdmissionCheck as Active
│   │   ├── accessors.go                 # Unstructured field accessors
│   │   ├── *_test.go                    # Unit + integration tests
│   │   └── integration_test.go          # envtest-based integration tests
│   └── metrics/metrics.go              # Prometheus metrics (9 metrics)
├── charts/setu/                         # Helm chart
│   ├── Chart.yaml
│   ├── values.yaml
│   └── templates/                       # Deployment, RBAC, AdmissionCheck, PDB, etc.
├── deploy/manifests/                    # Raw Kubernetes manifests
│   ├── admissioncheck.yaml              # AdmissionCheck + example queues
│   ├── rbac.yaml                        # ServiceAccount, ClusterRole, ClusterRoleBinding
│   └── deployment.yaml                  # Controller Deployment
├── examples/
│   ├── test-workload.yaml               # CPU gang-scheduled job (4 pods)
│   └── gpu-workload.yaml                # GPU training job (4 GPUs)
├── test/e2e/                            # End-to-end tests
├── scripts/
│   ├── setup-eks.sh                     # EKS cluster setup automation
│   └── cleanup-eks.sh                   # EKS cluster cleanup
├── .github/workflows/                   # CI/CD (lint, test, build, release)
├── Dockerfile                           # Multi-stage distroless build
├── Makefile                             # Build, test, deploy targets
├── ARCHITECTURE.md                      # Design docs + flow diagrams
├── QUICKSTART.md                        # Full AWS EKS setup guide
├── DEPLOY.md                            # Production deployment guide
├── CONTRIBUTING.md                      # Contribution guidelines
├── CODE_OF_CONDUCT.md                   # Contributor Covenant
├── SECURITY.md                          # Vulnerability reporting
└── LICENSE                              # Apache 2.0

Troubleshooting

AdmissionCheck not Active

The Setu controller must mark the AdmissionCheck as Active on startup. If it's not Active, Kueue's ClusterQueue will stay inactive.

kubectl get admissionchecks karpenter-provision -o yaml
kubectl logs -n kueue-system -l app=setu --tail=50

Workloads Stuck in Pending

# Check workload status and admission check state
kubectl describe workload <name>

# Check Setu logs for errors
kubectl logs -n kueue-system -l app=setu -f

# Ensure ClusterQueue references the admission check
kubectl get clusterqueue <name> -o yaml

# Verify LocalQueue points to correct ClusterQueue
kubectl get localqueue <name> -n <namespace> -o yaml | grep clusterQueue

Quota Check Skipped

Setu skips quota validation when the LocalQueue→ClusterQueue resolution fails (e.g., LocalQueue missing or clusterQueue field empty). This is safe in advisory mode (default) — provisioning continues.

If you see workloads provisioned without quota validation:

# Verify LocalQueue exists and has clusterQueue set
kubectl get localqueue -A
kubectl get localqueue <name> -n <namespace> -o jsonpath='{.spec.clusterQueue}'

# Enable enforcement to reject workloads when quota can't be validated
# Edit deployment: add --enforce-quota=true flag

NodeClaims Not Becoming Ready

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50

# Check NodeClaim status
kubectl describe nodeclaim <name>

Common causes:

  • Missing or misconfigured NodeClass (EC2NodeClass, GCPNodeClass, etc.)
  • IAM permissions (for AWS: node role, Karpenter controller role)
  • Subnet/security group tagging (karpenter.sh/discovery)
  • aws-auth ConfigMap not updated with node role

Provisioning Timeout (10 min)

Setu rejects the workload after 10 minutes if NodeClaims don't become Ready. All created NodeClaims are rolled back. Check the NodeClaim troubleshooting steps above.


Uninstall

Helm

helm uninstall setu -n kueue-system

Raw Manifests

kubectl delete -f deploy/manifests/deployment.yaml
kubectl delete -f deploy/manifests/admissioncheck.yaml
kubectl delete -f deploy/manifests/rbac.yaml

Community

References

License

Apache 2.0 — see LICENSE for details.
