Add respectNodePodLimits scheduler flag to enforce per-node pod capacity #4841
dejanzele wants to merge 1 commit into armadaproject:master
Conversation
Greptile Summary

Confidence Score: 5/5. Safe to merge; no blocking issues found across all changed files. All changes are well-structured and correct: the resource injection is idempotent, the factory/jobDb initialization order is right in every entrypoint, and the test suite covers flag-off, flag-on, resolution normalization, factory-lacks-pods, and the eviction round-trip. No files require special attention.
Sequence Diagram

sequenceDiagram
participant Exec as Executor
participant Sched as Scheduler
participant NodeDB as NodeDb
participant JobDB as JobDb
Note over Sched: Startup
Sched->>Sched: ApplyRespectNodePodLimits adds pods to factory config
Sched->>JobDB: SetRespectNodePodLimits(true)
Note over Exec: Per heartbeat
Exec->>Sched: NodeInfo{TotalResources[pods]=110, NonArmadaAllocated[pods]=5}
Note over Sched: Node ingestion
Sched->>NodeDB: AllocatableByPriority[p][pods] = 105
Note over Sched: Scheduling loop
Sched->>JobDB: NewJob injects pods:1
Sched->>NodeDB: BindJobToNode → pods -= 1
Note over Sched: Eviction
Sched->>NodeDB: UnbindJobFromNode → pods += 1
Force-pushed from e95dd38 to 4faa4a9
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Force-pushed from 4faa4a9 to 57a9176
Summary
- Adds a `scheduling.respectNodePodLimits` feature flag (default `false`) that enables the scheduler to track `pods` as a resource and reject scheduling to nodes that have exhausted their pod limit (`node.Status.Allocatable["pods"]`)
- Adds `pods` to `supportedResourceTypes` and `indexedResources` at startup, and injects `pods: 1` into every job's internal resource requirements
- The executor reports pods in `NonArmadaAllocatedResources` so the scheduler can subtract system/DaemonSet pods from available capacity
- Fixes #4515
This PR builds on top of #4517; big thanks to @Sovietaced for the initial work.
Operator upgrade notes
- `NonArmadaAllocatedResources` gains a `pods` key in every report regardless of whether any scheduler has the flag enabled. Dashboards, metrics, or custom consumers that iterate this map generically (e.g. sum over all keys) will start including pod counts. Audit Prometheus / Grafana panels before rollout.
- Setting the flag back to `false` stops the scheduler from tracking pods; reverting the executor binary removes the `pods` key from its reports. Neither requires data migration.
- Old scheduler + new executor is safe (`FromNodeProto` silently drops unknown resources). New scheduler + old executor is safe (only non-Armada pod accounting is slightly pessimistic until executors are upgraded).

Known limitations
- `pods` is not added to `dominantResourceFairnessResourcesToConsider`. On dense-pod nodes (e.g. GKE's 110-pod limit) a queue running many small pods can monopolize pod slots without a fair-share penalty. Deferred per reviewer request; follow-up if this becomes a problem in practice.