Add respectNodePodLimits scheduler flag to enforce per-node pod capacity #4841

Open
dejanzele wants to merge 1 commit into armadaproject:master from dejanzele:respect-node-pod-limits

Conversation


@dejanzele dejanzele commented Apr 16, 2026

Summary

  • Adds scheduling.respectNodePodLimits feature flag (default false) that enables the scheduler to track pods as a resource and reject scheduling to nodes that have exhausted their pod limit (node.Status.Allocatable["pods"])
  • When enabled, the scheduler programmatically registers pods in supportedResourceTypes and indexedResources at startup, and injects pods: 1 into every job's internal resource requirements
  • The executor now always reports non-Armada pod count in NonArmadaAllocatedResources so the scheduler can subtract system/DaemonSet pods from available capacity
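The pods: 1 injection described above can be sketched as follows. This is a minimal illustration under stated assumptions: injectPodResource and its map type are hypothetical names, not the actual jobDb code (the PR's real logic lives in getResourceRequirements).

```go
package main

import "fmt"

// injectPodResource is a hypothetical sketch of the behaviour described
// above: when the respectNodePodLimits flag is on, a job's internal
// resource requirements gain a "pods: 1" entry, so each job consumes
// exactly one pod slot on the node it binds to.
func injectPodResource(requirements map[string]int64, respectNodePodLimits bool) map[string]int64 {
	if !respectNodePodLimits {
		return requirements
	}
	// Mutation is safe only because callers are assumed to pass a fresh
	// copy of the map (as the review notes the real code does).
	if _, ok := requirements["pods"]; !ok {
		requirements["pods"] = 1
	}
	return requirements
}

func main() {
	req := map[string]int64{"cpu": 2000, "memory": 4096}
	req = injectPodResource(req, true)
	fmt.Println(req["pods"]) // prints 1
}
```

With the flag off, the map is returned untouched, which matches the default-false behaviour of the feature flag.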

Fixes #4515

This PR builds on top of #4517; big thanks to @Sovietaced for the initial work.

Operator upgrade notes

  • The executor change is unconditional. After the executor upgrade, NonArmadaAllocatedResources gains a pods key in every report, regardless of whether any scheduler has the flag enabled. Dashboards, metrics, or custom consumers that iterate this map generically (e.g. summing over all keys) will start including pod counts. Audit Prometheus/Grafana panels before rollout.
  • Rollback is clean. Reverting the scheduler flag to false stops the scheduler from tracking pods; reverting the executor binary removes the pods key from its reports. Neither requires data migration.
  • Rolling upgrade order is flexible. Old scheduler + new executor is safe (scheduler's FromNodeProto silently drops unknown resources). New scheduler + old executor is safe (only non-Armada pod accounting is slightly pessimistic until executors are upgraded).

Known limitations

  • pods is not added to dominantResourceFairnessResourcesToConsider. On dense-pod nodes (e.g. GKE's 110-pod limit) a queue running many small pods can monopolize pod slots without a fair-share penalty. Deferred at reviewer request; a follow-up will address it if this becomes a problem in practice.


greptile-apps bot commented Apr 16, 2026

Greptile Summary

This PR introduces a respectNodePodLimits scheduler flag (default false) that enforces per-node pod capacity by registering pods as a tracked resource, injecting pods: 1 into every job's requirements at runtime, and having the executor unconditionally report non-Armada pod counts in NonArmadaAllocatedResources. The implementation is correct end-to-end: ApplyRespectNodePodLimits is called before ResourceListFactory construction in both schedulerapp.go and simulator.go; mutation of the requirements map in getResourceRequirements is safe (it is backed by a fresh copy from K8sResourceListToMap); Clone() propagates the flag; and the eviction/unbind round-trip correctly restores pod slots.

Confidence Score: 5/5

Safe to merge; no blocking issues found across all changed files.

All changes are well-structured and correct: the resource injection is idempotent, the factory/jobDb initialization order is right in every entrypoint, and the test suite covers flag-off, flag-on, resolution normalization, factory-lacks-pods, and the eviction round-trip.

No files require special attention.

Important Files Changed

  • internal/scheduler/configuration/configuration.go — Adds the RespectNodePodLimits flag and an ApplyRespectNodePodLimits helper that idempotently registers pods with resolution 1 in both SupportedResourceTypes and IndexedResources.
  • internal/scheduler/jobdb/jobdb.go — Adds a respectNodePodLimits field with a setter and a correct Clone() copy; getResourceRequirements safely injects pods: 1 into a new map copy from safeGetRequirements.
  • internal/executor/utilisation/cluster_utilisation.go — Injects pods: 1 into each non-Armada pod's resource request so the scheduler can subtract the system pod count from available capacity; the mutation is safe because TotalPodResourceRequest returns a new map.
  • internal/scheduler/schedulerapp.go — Calls ApplyRespectNodePodLimits before NewResourceListFactory and calls SetRespectNodePodLimits on the jobDb after creation; the ordering is correct.
  • internal/scheduler/nodedb/respect_node_pod_limits_test.go — New end-to-end test verifying that the eviction/unbind round-trip restores the pod slot and allows a follow-up job to bind.
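The idempotent registration described for configuration.go can be sketched roughly like this. The type and function names below are illustrative assumptions modeled on the review's description, not the real Armada API:

```go
package main

import "fmt"

// ResourceType is a stand-in for a scheduler resource config entry.
type ResourceType struct {
	Name       string
	Resolution int64
}

// applyRespectNodePodLimits registers "pods" with resolution 1, but only
// if it is not already present, so calling it twice is a no-op — the
// idempotence property the review highlights.
func applyRespectNodePodLimits(supported []ResourceType) []ResourceType {
	for _, rt := range supported {
		if rt.Name == "pods" {
			return supported // already registered; nothing to do
		}
	}
	return append(supported, ResourceType{Name: "pods", Resolution: 1})
}

func main() {
	types := []ResourceType{{Name: "cpu", Resolution: 1}}
	types = applyRespectNodePodLimits(types)
	types = applyRespectNodePodLimits(types) // second call adds no duplicate
	fmt.Println(len(types))                  // prints 2
}
```

The ordering constraint noted for schedulerapp.go follows from this: registration must run before NewResourceListFactory so the factory sees pods as a known resource.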

Sequence Diagram

sequenceDiagram
    participant Exec as Executor
    participant Sched as Scheduler
    participant NodeDB as NodeDb
    participant JobDB as JobDb
    Note over Sched: Startup
    Sched->>Sched: ApplyRespectNodePodLimits adds pods to factory config
    Sched->>JobDB: SetRespectNodePodLimits(true)
    Note over Exec: Per heartbeat
    Exec->>Sched: NodeInfo{TotalResources[pods]=110, NonArmadaAllocated[pods]=5}
    Note over Sched: Node ingestion
    Sched->>NodeDB: AllocatableByPriority[p][pods] = 105
    Note over Sched: Scheduling loop
    Sched->>JobDB: NewJob injects pods:1
    Sched->>NodeDB: BindJobToNode → pods -= 1
    Note over Sched: Eviction
    Sched->>NodeDB: UnbindJobFromNode → pods += 1
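The arithmetic in the diagram above amounts to a simple slot budget. A minimal sketch, with illustrative names (this is not the nodedb implementation):

```go
package main

import "fmt"

// allocatablePods mirrors the node-ingestion step in the diagram:
// available pod slots = node.Status.Allocatable["pods"] minus the
// non-Armada (system/DaemonSet) pods the executor reports.
func allocatablePods(total, nonArmada int64) int64 {
	return total - nonArmada
}

func main() {
	slots := allocatablePods(110, 5) // e.g. GKE's 110-pod limit, 5 system pods
	fmt.Println(slots)               // prints 105

	slots-- // BindJobToNode: the job consumes one slot
	slots++ // UnbindJobFromNode: eviction restores the slot
	fmt.Println(slots) // prints 105 again: the round-trip is lossless
}
```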

Reviews (7). Last reviewed commit: "Add respectNodePodLimits scheduler flag ..."

Comment thread internal/scheduler/jobdb/jobdb_test.go Outdated
Comment thread internal/executor/utilisation/cluster_utilisation.go Outdated
@dejanzele force-pushed the respect-node-pod-limits branch 7 times, most recently from e95dd38 to 4faa4a9 on April 17, 2026 at 13:52
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele force-pushed the respect-node-pod-limits branch from 4faa4a9 to 57a9176 on April 17, 2026 at 13:54
@dejanzele
Member Author

@greptileai



Development

Successfully merging this pull request may close these issues.

Scheduler does not respect node pod limits
