
Preempt api response #4845

Open
dslear wants to merge 10 commits into master from preempt-api-response

Conversation

@dslear
Collaborator

@dslear dslear commented Apr 16, 2026

What type of PR is this?

Enhancement

What this PR does / why we need it

When a job is preempted in Armada, it should have an error message that includes the user who requested preemption, as well as the reason if one exists. Currently, when a job running as part of a gang job is preempted, the rest of the jobs in the gang are also preempted but do not have the appropriate error message attached. This PR ensures that all jobs in the gang have the requesting user and reason as the error message.

Which issue(s) this PR fixes

Fixes #4832

Special notes for your reviewer

(Screenshots attached: 2026-04-16 at 2:28 PM and 2:29 PM)

@greptile-apps
Contributor

greptile-apps bot commented Apr 16, 2026

Greptile Summary

This PR fixes two related issues: it propagates the preemption reason (including the requesting user) to all jobs in a gang when one gang member is preempted, and it cleans up the preemption reason string format (removing the old " | Preempted by user" format with a leading space/pipe in favour of "Preempted by user - reason").

The core change in reconciliation.go introduces a gangPreemptionRequest struct that tracks the preemption reason per gang, so each gang's members receive the correctly formatted reason rather than nil. The buildPreemptionReason helper in instructions.go centralises formatting, with proper handling of all four user/reason combinations.
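The four user/reason combinations could be handled roughly as follows. This is a minimal sketch based on the summary above; the actual signature of buildPreemptionReason in instructions.go, and what it returns when inputs are empty, are assumptions here.

```go
package main

import "fmt"

// buildPreemptionReason formats a preemption message from an optional
// requesting user and an optional reason. Hypothetical sketch; the real
// helper in instructions.go may use pointers or a different signature.
func buildPreemptionReason(user, reason string) string {
	switch {
	case user != "" && reason != "":
		// The format described in the review: "Preempted by user - reason".
		return fmt.Sprintf("Preempted by %s - %s", user, reason)
	case user != "":
		return fmt.Sprintf("Preempted by %s", user)
	case reason != "":
		return reason
	default:
		// Fallback when neither field is set (assumed behaviour).
		return "Preempted"
	}
}

func main() {
	fmt.Println(buildPreemptionReason("alice", "node maintenance"))
	// prints: Preempted by alice - node maintenance
	fmt.Println(buildPreemptionReason("alice", ""))
	// prints: Preempted by alice
}
```

Centralising this in one helper avoids the old leading-pipe bug, since no branch ever concatenates a separator onto an empty field.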

Confidence Score: 5/5

Safe to merge — logic is correct, all four reason/user combinations are tested, and gang propagation is properly isolated per gang.

No P0 or P1 issues found. The gang preemption reason extraction correctly uses jst.Job.LatestRun() (the post-reconciliation run with PreemptReason set) rather than the original job. The per-gang grouping via gangPreemptionRequests correctly handles multiple concurrently-preempted gangs with different reasons. The format change from the old " | Preempted by user" pattern (with leading space/pipe on empty reason) to "Preempted by user - reason" is an intentional improvement and is thoroughly tested.

No files require special attention.

Important Files Changed

Filename Overview
internal/scheduler/jobdb/reconciliation.go Refactors gang preemption logic to carry per-gang preempt reasons; markJobsAsPreemptionRequested now accepts and applies *string reason to all gang members' runs.
internal/scheduleringester/instructions.go Extracts buildPreemptionReason helper; fixes an old bug where an empty reason produced a leading " | " in the message.
internal/scheduler/scheduler_test.go Adds expectedPreemptReasons test field and a new table-driven case verifying that preempt reason propagates to all gang members after a single member's run update.
internal/scheduleringester/instructions_test.go Updates existing preemption test expectation to match new format, adds "with reason" case, and adds a dedicated unit test for buildPreemptionReason covering all four input combinations.
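The per-gang grouping described for reconciliation.go could look roughly like this. The struct's real fields and the surrounding reconciliation code are not shown in the summary, so everything here is an illustrative guess.

```go
package main

import "fmt"

// gangPreemptionRequest is a sketch of the struct described above:
// one entry per gang, carrying the reason to apply to every member.
// Field names are assumptions, not the actual Armada definitions.
type gangPreemptionRequest struct {
	jobIds []string
	reason *string
}

// markJobsAsPreemptionRequested applies the same formatted reason to
// every job in the gang, instead of leaving non-triggering members nil.
// Here the jobdb is simplified to a map from job id to preempt reason.
func markJobsAsPreemptionRequested(reasons map[string]*string, req gangPreemptionRequest) {
	for _, id := range req.jobIds {
		reasons[id] = req.reason
	}
}

func main() {
	reason := "Preempted by alice - node maintenance"
	req := gangPreemptionRequest{
		jobIds: []string{"job-1", "job-2", "job-3"},
		reason: &reason,
	}
	reasons := map[string]*string{}
	markJobsAsPreemptionRequested(reasons, req)
	fmt.Println(*reasons["job-2"]) // every gang member shares the reason
}
```

Because each gangPreemptionRequest carries its own reason, two gangs preempted concurrently for different reasons cannot cross-contaminate, which matches the review's note about per-gang isolation.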

Sequence Diagram

```mermaid
sequenceDiagram
    participant API as API Layer
    participant Ingester as scheduleringester
    participant DB as Database
    participant Reconciler as reconciliation.go
    participant GangMember as Gang Member Jobs

    API->>Ingester: JobPreemptionRequested(jobId, reason, user)
    Ingester->>Ingester: buildPreemptionReason(reason, user)
    Ingester->>DB: MarkRunsForJobPreemptRequested{jobId: formattedReason}
    DB-->>DB: Store PreemptReason on run

    Note over Reconciler: Next scheduler cycle
    Reconciler->>DB: Fetch run updates
    Reconciler->>Reconciler: reconcileJobDifferences(gangJob1) → jst.PreemptionRequested=true
    Reconciler->>Reconciler: jst.Job.LatestRun().PreemptReason()
    Reconciler->>Reconciler: Append gangPreemptionRequest{jobIds, reason}
    Reconciler->>GangMember: markJobsAsPreemptionRequested(jobIds, reason)
    GangMember-->>Reconciler: PreemptionRequested=true, reason propagated
```

Reviews (3): Last reviewed commit: "Merge branch 'master' into preempt-api-r..."

dslear and others added 7 commits April 16, 2026 14:49
Signed-off-by: David Slear <david_slear@yahoo.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
This is a bad bit of tech debt that we should pay down.

For now, however, I've simply made sure all packages use unique db
numbers, to avoid clashes and flakes.

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
This sets the lease namespace in Helm to match the current namespace,
so deployments into k8s will always create the lease in the namespace
they are deployed to.

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
Add metrics to the event ingester showing the number of bytes each
event type takes up.

This gives us an approximate view of what is using all the space
in Redis.
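Per-event-type byte accounting can be sketched without any metrics library as a simple accumulator. The real change presumably registers Prometheus counters; this stand-in just shows the observation logic.

```go
package main

import "fmt"

// eventBytes accumulates the serialized size of each event type,
// approximating the per-type Redis usage described above.
// A real implementation would likely use a Prometheus CounterVec
// labelled by event type instead of a plain map.
type eventBytes map[string]int

// Observe records the payload size under the event's type.
func (m eventBytes) Observe(eventType string, payload []byte) {
	m[eventType] += len(payload)
}

func main() {
	m := eventBytes{}
	m.Observe("JobSucceeded", []byte("12345"))
	m.Observe("JobSucceeded", []byte("123"))
	m.Observe("JobPreempted", []byte("12"))
	fmt.Println(m["JobSucceeded"], m["JobPreempted"]) // prints: 8 2
}
```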

---------

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>

#### What type of PR is this?

Enhancement

#### What this PR does / why we need it

When a queue is cordoned, the scheduling algo already refuses to
schedule jobs from it (`constraints.CheckJobConstraints` returns
`QueueCordonedUnschedulableReason`). However, the submit check was still
running for cordoned queue jobs. This wasted compute on jobs that could
never be scheduled and marked them as "validated" with assigned pools,
which is misleading.

This PR makes `submitCheck` skip jobs from cordoned queues entirely,
leaving them unvalidated. If the queue is later uncordoned, the jobs
will be picked up by `UnvalidatedJobs()` on the next scheduler cycle and
validated normally. The filtering happens at the top of
`scheduler.submitCheck()`, alongside the existing terminal-job filter,
so cordoned queue jobs never enter the submit check pipeline.

If the queue cache is unavailable, the submit check fails fast and
returns the error, so the failure is never silent or undetected.
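The filtering and fail-fast behaviour described above can be sketched as follows. The types and method names (queueCache, filterCordoned, Queue.Cordoned) are illustrative assumptions, not the actual Armada API.

```go
package main

import (
	"errors"
	"fmt"
)

// Job and Queue are simplified stand-ins for Armada's real types.
type Job struct {
	ID    string
	Queue string
}

type Queue struct {
	Name     string
	Cordoned bool
}

// queueCache mimics a cache lookup that can fail; if it does, the
// submit check should fail fast rather than silently validate jobs.
type queueCache map[string]Queue

func (c queueCache) Get(name string) (Queue, error) {
	q, ok := c[name]
	if !ok {
		return Queue{}, errors.New("queue cache unavailable for " + name)
	}
	return q, nil
}

// filterCordoned drops jobs whose queue is cordoned, leaving them
// unvalidated so they are re-checked once the queue is uncordoned.
func filterCordoned(jobs []Job, cache queueCache) ([]Job, error) {
	out := make([]Job, 0, len(jobs))
	for _, j := range jobs {
		q, err := cache.Get(j.Queue)
		if err != nil {
			return nil, err // fail fast: no silent skipping
		}
		if q.Cordoned {
			continue // skipped, not rejected: cordoning is temporary
		}
		out = append(out, j)
	}
	return out, nil
}

func main() {
	cache := queueCache{"a": {Name: "a", Cordoned: true}, "b": {Name: "b"}}
	jobs := []Job{{ID: "1", Queue: "a"}, {ID: "2", Queue: "b"}}
	kept, _ := filterCordoned(jobs, cache)
	fmt.Println(len(kept)) // prints: 1
}
```

Skipping rather than rejecting matches the note above: a cordoned queue's jobs stay pending and are validated normally on a later cycle once the queue is uncordoned.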

#### Special notes for your reviewer

- The `Scheduler` struct gains a `queueCache` dependency. This is the
same `QueueCache` already used by the `SubmitChecker` and other
scheduler components.
- Cordoned queue jobs are **skipped** (left unvalidated), not rejected
(terminal failure). I did this intentionally as cordoning is temporary,
so jobs should survive it. Open to feedback on this approach, though.

---------

Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>
Co-authored-by: Rob Smith <34475852+robertdavidsmith@users.noreply.github.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
@dslear dslear force-pushed the preempt-api-response branch from f7ed8e7 to d232158 on April 16, 2026 19:49
@dslear dslear marked this pull request as draft April 17, 2026 15:11
Signed-off-by: David Slear <david_slear@yahoo.com>
@dslear dslear marked this pull request as ready for review April 17, 2026 21:17


Development

Successfully merging this pull request may close these issues.

Include user in preemption error message

4 participants