SCHED-316: Add configurable job sources for Slurm exporter #2502
Merged
itechdima
reviewed
May 1, 2026
Collaborator
I don't get why we need this. Initially we had a problem that the exporter is too slow for big clusters. But this change adds retrieval of jobs from accounting, which will slow it down further.
Collaborator
Author
Added the ability to collect metrics from both the controller and accounting, as well as to choose which metrics to collect. For large clusters, you can trade off certain metrics in favor of performance.
Co-authored-by: itechdima <61321708+itechdima@users.noreply.github.com>
7600ba5 to 5e6f127
…se state filter and lookback window.
theyoprst
approved these changes
May 6, 2026
Problem
The exporter only read jobs from the Slurm controller API. On large clusters where the controller endpoint is overloaded, there was no way to switch the exporter to the accounting API as a fallback.
Solution
Added a JobSource knob with two values: controller (default, behavior unchanged) and accounting (slurmdbd). Only one source is used at a time, no dedupe or fan-out.

When accounting is selected:
- Query window: [now - AccountingJobsLookback, now + 5m]. Lookback defaults to 1h and accepts both Prometheus durations (1d, 1w) and Go durations (1.5h, 2h45m30.5s). The +5m end-time skew tolerates clock drift between slurmrestd, slurmctld, and slurmdbd.
- AccountingJobStates is a list of Slurm job-state strings (PENDING, RUNNING, COMPLETED, ...) forwarded verbatim to the accounting state query parameter. Empty list = no filter. Slurmdbd applies this to the historical states a job held during the window (sacct semantics), not to the current state of returned jobs.
- disable_truncate_usage_time=true, so original start/end timestamps survive the lookback window.
- skip_steps=true, to avoid downloading per-step payloads the exporter never reads.
- Cluster is intentionally left unset; soperator targets a single Slurm cluster per deployment.
- slurm_job_memory_bytes is omitted rather than reported as MemoryPerNode × 1 (which would silently undercount).

CRD fields (jobSource, accountingJobStates, accountingJobsLookback) are all marked EXPERIMENTAL in their godoc / CRD description. accountingJobsLookback has a CEL validation that rejects zero values. Lookback parsing is gated on --job-source=accounting, so a stale env value can't break the default controller path.
Testing
- go build ./..., go vet ./...
- go test ./... (full suite green)
- helm unittest helm/slurm-cluster (89/89)

New unit tests cover source dispatch, accounting state CSV, time-window via testing/synctest, disable_truncate_usage_time and skip_steps params, the dual Prom/Go duration parser, controller-mode tolerance for stale lookback values, and the NodeCount=nil memory-omission behavior.
Release Notes
Feature: Slurm exporter can now collect jobs from the Slurm accounting API (slurmdbd), useful when the controller endpoint is overloaded on large clusters.
Config: New CRD fields jobSource, accountingJobStates, accountingJobsLookback (and matching exporter flags --job-source, --accounting-job-states, --accounting-jobs-lookback). All experimental.