Conversation
**Greptile Summary**

This PR fixes two related issues: it propagates the preemption reason (including the requesting user) to all jobs in a gang when one gang member is preempted, and it cleans up the preemption reason string format, replacing the old ` | Preempted by user` pattern (which produced a leading space/pipe when the reason was empty) with `Preempted by user - reason`.

**Confidence Score: 5/5** — Safe to merge: the logic is correct, all four reason/user combinations are tested, and gang propagation is properly isolated per gang. No P0 or P1 issues found. The gang preemption reason extraction correctly uses `jst.Job.LatestRun()` (the post-reconciliation run with `PreemptReason` set) rather than the original job. The per-gang grouping via `gangPreemptionRequests` correctly handles multiple concurrently preempted gangs with different reasons. The format change is an intentional improvement and is thoroughly tested. No files require special attention.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant API as API Layer
    participant Ingester as scheduleringester
    participant DB as Database
    participant Reconciler as reconciliation.go
    participant GangMember as Gang Member Jobs
    API->>Ingester: JobPreemptionRequested(jobId, reason, user)
    Ingester->>Ingester: buildPreemptionReason(reason, user)
    Ingester->>DB: MarkRunsForJobPreemptRequested{jobId: formattedReason}
    DB-->>DB: Store PreemptReason on run
    Note over Reconciler: Next scheduler cycle
    Reconciler->>DB: Fetch run updates
    Reconciler->>Reconciler: reconcileJobDifferences(gangJob1) → jst.PreemptionRequested=true
    Reconciler->>Reconciler: jst.Job.LatestRun().PreemptReason()
    Reconciler->>Reconciler: Append gangPreemptionRequest{jobIds, reason}
    Reconciler->>GangMember: markJobsAsPreemptionRequested(jobIds, reason)
    GangMember-->>Reconciler: PreemptionRequested=true, reason propagated
```
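The `buildPreemptionReason` step in the diagram can be sketched as below. This is a minimal, hypothetical Go sketch of the new `Preempted by user <user> - <reason>` format described in the review summary; the fallback behaviour for a missing user or reason is an assumption, not Armada's actual implementation.

```go
package main

import "fmt"

// buildPreemptionReason mirrors the formatting step named in the sequence
// diagram. The body is a sketch of the new "Preempted by user <user> - <reason>"
// format; the fallbacks for a missing user or reason are assumptions.
func buildPreemptionReason(reason, user string) string {
	switch {
	case user != "" && reason != "":
		return fmt.Sprintf("Preempted by user %s - %s", user, reason)
	case user != "":
		return "Preempted by user " + user
	case reason != "":
		return reason
	default:
		return ""
	}
}

func main() {
	fmt.Println(buildPreemptionReason("node maintenance", "alice"))
	// -> Preempted by user alice - node maintenance
}
```

Note how the empty-reason case no longer leaves a dangling separator, which is the cleanup the summary calls out.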
Signed-off-by: David Slear <david_slear@yahoo.com>
This is a bad bit of tech debt we should pay down. However, for now, I've simply made sure all packages use unique DB numbers to avoid clashes and flakes.

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
This sets the lease namespace to match the current namespace in Helm, so deployments into k8s will always use the current namespace when creating the lease.

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
Add metrics to the event ingester to show the number of bytes each event type is taking up. This gives us an approximate view of what is using all the space in Redis.

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
#### What type of PR is this?
Enhancement

#### What this PR does / why we need it
When a queue is cordoned, the scheduling algo already refuses to schedule jobs from it (`constraints.CheckJobConstraints` returns `QueueCordonedUnschedulableReason`). However, the submit check was still running for cordoned-queue jobs. This wasted compute on jobs that could never be scheduled and marked them as "validated" with assigned pools, which is misleading.

This PR makes `submitCheck` skip jobs from cordoned queues entirely, leaving them unvalidated. If the queue is later uncordoned, the jobs will be picked up by `UnvalidatedJobs()` on the next scheduler cycle and validated normally. The filtering happens at the top of `scheduler.submitCheck()`, alongside the existing terminal-job filter, so cordoned-queue jobs never enter the submit check pipeline. If the queue cache is unavailable, the submit check fails fast and surfaces the error, so this doesn't go undetected or silently fail.

#### Special notes for your reviewer
- The `Scheduler` struct gains a `queueCache` dependency. This is the same `QueueCache` already used by the `SubmitChecker` and other scheduler components.
- Cordoned-queue jobs are **skipped** (left unvalidated), not rejected (terminal failure). I did this intentionally, as cordoning is temporary, so jobs should survive it. Open to feedback on this approach, though.

Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>
Co-authored-by: Rob Smith <34475852+robertdavidsmith@users.noreply.github.com>
Signed-off-by: David Slear <david_slear@yahoo.com>
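The skip-and-fail-fast behaviour described in that commit message can be sketched as below. This is a simplified Go model under assumed types (`queueInfo`, `jobToCheck`), not Armada's real `submitCheck` code: jobs from cordoned queues are dropped before the submit check runs, and a missing queue-cache entry surfaces an error instead of being swallowed.

```go
package main

import "fmt"

// queueInfo and jobToCheck are illustrative assumptions, not Armada types.
type queueInfo struct{ cordoned bool }

type jobToCheck struct {
	id    string
	queue string
}

// filterCordoned drops jobs whose queue is cordoned before the submit check runs.
func filterCordoned(jobs []jobToCheck, queues map[string]queueInfo) ([]jobToCheck, error) {
	out := make([]jobToCheck, 0, len(jobs))
	for _, j := range jobs {
		q, ok := queues[j.queue]
		if !ok {
			// fail fast: an unavailable queue cache must surface an error
			return nil, fmt.Errorf("queue %q not found in cache", j.queue)
		}
		if q.cordoned {
			continue // skipped, not rejected: the job stays unvalidated
		}
		out = append(out, j)
	}
	return out, nil
}

func main() {
	queues := map[string]queueInfo{"a": {cordoned: true}, "b": {}}
	jobs := []jobToCheck{{id: "1", queue: "a"}, {id: "2", queue: "b"}}
	kept, err := filterCordoned(jobs, queues)
	fmt.Println(kept, err)
}
```

Skipping (rather than failing) the cordoned jobs is what lets them survive a temporary cordon and be validated on a later cycle.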
#### What type of PR is this?
Enhancement
#### What this PR does / why we need it
When a job is preempted in Armada, it should have an error message that includes the user who requested preemption, as well as the reason if one exists. Currently, when a job running as part of a gang job is preempted, the rest of the jobs in the gang are also preempted but do not have the appropriate error message attached. This PR ensures that all jobs in the gang have the requesting user and reason as the error message.
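The propagation described above can be sketched as follows. This is a simplified model under assumed type and field names (`gangPreemptionRequest`, `preemptReason`, and so on, modelled on identifiers mentioned in the review summary), not the actual reconciliation code: each affected gang gets its own request carrying the reason from the member whose preemption was requested, and that reason is then applied to every member.

```go
package main

import "fmt"

type job struct {
	id                  string
	gangId              string
	preemptReason       string // reason recorded on the job's latest run
	preemptionRequested bool
}

type gangPreemptionRequest struct {
	jobIds []string
	reason string
}

// collectGangPreemptionRequests builds one request per affected gang so that
// concurrently preempted gangs each keep their own reason.
func collectGangPreemptionRequests(jobs []*job) []*gangPreemptionRequest {
	// reason per gang, taken from a member whose preemption was requested
	reasons := map[string]string{}
	for _, j := range jobs {
		if j.preemptionRequested && j.gangId != "" {
			if _, ok := reasons[j.gangId]; !ok {
				reasons[j.gangId] = j.preemptReason
			}
		}
	}
	// every member of an affected gang joins that gang's request
	members := map[string][]string{}
	for _, j := range jobs {
		if _, ok := reasons[j.gangId]; ok {
			members[j.gangId] = append(members[j.gangId], j.id)
		}
	}
	reqs := make([]*gangPreemptionRequest, 0, len(reasons))
	for gangId, reason := range reasons {
		reqs = append(reqs, &gangPreemptionRequest{jobIds: members[gangId], reason: reason})
	}
	return reqs
}

// markJobsAsPreemptionRequested propagates the gang's reason to every member.
func markJobsAsPreemptionRequested(byId map[string]*job, req *gangPreemptionRequest) {
	for _, jobId := range req.jobIds {
		if j, ok := byId[jobId]; ok {
			j.preemptionRequested = true
			j.preemptReason = req.reason
		}
	}
}

func main() {
	jobs := []*job{
		{id: "a", gangId: "g1", preemptReason: "Preempted by user alice - maintenance", preemptionRequested: true},
		{id: "b", gangId: "g1"}, // gang member that was not directly preempted
	}
	byId := map[string]*job{"a": jobs[0], "b": jobs[1]}
	for _, req := range collectGangPreemptionRequests(jobs) {
		markJobsAsPreemptionRequested(byId, req)
	}
	fmt.Println(byId["b"].preemptReason)
	// -> Preempted by user alice - maintenance
}
```

Grouping by gang before marking is what keeps the reasons isolated when several gangs are preempted in the same scheduler cycle.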
#### Which issue(s) this PR fixes
Fixes #4832
#### Special notes for your reviewer