SCHED-316: Add configurable job sources for Slurm exporter #2502

Merged: theyoprst merged 2 commits into main from SCHED-316/0 on May 6, 2026
@Uburro (Collaborator) commented May 1, 2026

Problem

The exporter only read jobs from the Slurm controller API. On large clusters where the controller endpoint is overloaded, there was no way to switch the exporter to the accounting API as a fallback.

Solution

Added a JobSource knob with two values: controller (default, behavior unchanged) and accounting (slurmdbd). Only one source is used at a time; results are never deduplicated or fanned out across both.
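
A minimal sketch of what such a knob could look like in Go. The identifiers here (`JobSource`, `ParseJobSource`, `JobLister`, `pickLister`, `staticLister`) are hypothetical and chosen for illustration; they are not taken from the PR's code. The point is only that validation maps an empty value to the default and that exactly one lister is ever selected.

```go
package main

import "fmt"

// JobSource selects which Slurm API the exporter reads jobs from.
// Names are illustrative, not the PR's actual identifiers.
type JobSource string

const (
	JobSourceController JobSource = "controller" // default, existing behavior
	JobSourceAccounting JobSource = "accounting" // slurmdbd
)

// ParseJobSource validates a flag or CRD value.
// An empty string falls back to the default (controller).
func ParseJobSource(s string) (JobSource, error) {
	switch JobSource(s) {
	case "", JobSourceController:
		return JobSourceController, nil
	case JobSourceAccounting:
		return JobSourceAccounting, nil
	default:
		return "", fmt.Errorf("unknown job source %q (want %q or %q)",
			s, JobSourceController, JobSourceAccounting)
	}
}

// Job is a hypothetical, trimmed-down job record.
type Job struct {
	ID    int32
	State string
}

// JobLister is a hypothetical interface implemented once per source.
type JobLister interface {
	ListJobs() ([]Job, error)
}

// pickLister returns exactly one lister: no fan-out, no merging of results.
func pickLister(src JobSource, controller, accounting JobLister) JobLister {
	if src == JobSourceAccounting {
		return accounting
	}
	return controller
}

// staticLister is a stand-in for a real API client.
type staticLister []Job

func (s staticLister) ListJobs() ([]Job, error) { return s, nil }

func main() {
	src, err := ParseJobSource("accounting")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	lister := pickLister(src,
		staticLister{{ID: 1, State: "RUNNING"}},
		staticLister{{ID: 2, State: "COMPLETED"}})
	jobs, _ := lister.ListJobs()
	fmt.Println(src, jobs)
}
```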

When accounting is selected:

  • Time window is [now - AccountingJobsLookback, now + 5m]. Lookback defaults to 1h and accepts both Prometheus durations (1d, 1w) and Go durations (1.5h, 2h45m30.5s). The +5m end-time skew tolerates clock drift between slurmrestd, slurmctld, and slurmdbd.
  • AccountingJobStates is a list of Slurm job-state strings (PENDING, RUNNING, COMPLETED, ...) forwarded verbatim to the accounting state query parameter. Empty list = no filter. Slurmdbd applies this to the historical states a job held during the window (sacct semantics), not to the current state of returned jobs.
  • disable_truncate_usage_time=true, so original start/end timestamps survive the lookback window.
  • skip_steps=true, to avoid downloading per-step payloads the exporter never reads.
  • Cluster is intentionally left unset; soperator targets a single Slurm cluster per deployment.
  • For pending multi-node jobs the accounting API doesn't expose the originally requested node count, so slurm_job_memory_bytes is omitted rather than reported as MemoryPerNode × 1 (which would silently undercount).

CRD fields (jobSource, accountingJobStates, accountingJobsLookback) are all marked EXPERIMENTAL in their godoc / CRD description. accountingJobsLookback has a CEL validation that rejects zero values. Lookback parsing is gated on --job-source=accounting, so a stale env value can't break the default controller path.

Testing

  • go build ./..., go vet ./...
  • go test ./... (full suite green)
  • helm unittest helm/slurm-cluster (89/89)

New unit tests cover source dispatch, accounting state CSV, time-window via testing/synctest, disable_truncate_usage_time and skip_steps params, the dual Prom/Go duration parser, controller-mode tolerance for stale lookback values, and the NodeCount=nil memory-omission behavior.
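
The NodeCount=nil memory-omission behavior mentioned above can be sketched as follows. The struct and field names (`AccountingJob`, `MemoryPerNodeBytes`) are hypothetical stand-ins for the real API types; only the rule itself (omit the metric rather than assume one node) comes from this description.

```go
package main

import "fmt"

// AccountingJob is a hypothetical, trimmed-down view of an accounting record.
type AccountingJob struct {
	MemoryPerNodeBytes *uint64
	NodeCount          *uint32 // nil when the accounting API does not report it
}

// jobMemoryBytes returns the total requested memory and whether
// slurm_job_memory_bytes should be emitted at all.
func jobMemoryBytes(j AccountingJob) (uint64, bool) {
	if j.MemoryPerNodeBytes == nil || j.NodeCount == nil {
		// Omit the metric instead of assuming a single node,
		// which would silently undercount multi-node jobs.
		return 0, false
	}
	return *j.MemoryPerNodeBytes * uint64(*j.NodeCount), true
}

func main() {
	mem := uint64(4 << 30) // 4 GiB per node
	nodes := uint32(8)

	total, ok := jobMemoryBytes(AccountingJob{MemoryPerNodeBytes: &mem, NodeCount: &nodes})
	fmt.Println(total, ok)

	// Pending multi-node job: node count unknown, so no sample is produced.
	total, ok = jobMemoryBytes(AccountingJob{MemoryPerNodeBytes: &mem})
	fmt.Println(total, ok)
}
```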

Release Notes

Feature: Slurm exporter can now collect jobs from the Slurm accounting API (slurmdbd), useful when the controller endpoint is overloaded on large clusters.
Config: New CRD fields jobSource, accountingJobStates, accountingJobsLookback (and matching exporter flags --job-source, --accounting-job-states, --accounting-jobs-lookback). All experimental.
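
For illustration, the three exporter flags could be wired up with the standard `flag` package as below. The flag names and the 1h lookback default come from this description; everything else is an assumption for the example, including the use of the standard `flag` package, the comma-separated form of `--accounting-job-states`, and the `exporterFlags` / `parseFlags` names.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// exporterFlags is a hypothetical holder for the new options.
type exporterFlags struct {
	JobSource              string
	AccountingJobStates    []string
	AccountingJobsLookback string
}

// parseFlags parses the job-source related flags from args.
func parseFlags(args []string) (exporterFlags, error) {
	fs := flag.NewFlagSet("exporter", flag.ContinueOnError)
	var f exporterFlags
	var states string
	fs.StringVar(&f.JobSource, "job-source", "controller",
		"job source: controller or accounting (experimental)")
	fs.StringVar(&states, "accounting-job-states", "",
		"comma-separated Slurm job states; empty means no filter (experimental)")
	fs.StringVar(&f.AccountingJobsLookback, "accounting-jobs-lookback", "1h",
		"lookback window for accounting jobs (experimental)")
	if err := fs.Parse(args); err != nil {
		return exporterFlags{}, err
	}
	// Split the CSV, dropping empty entries and surrounding whitespace.
	for _, s := range strings.Split(states, ",") {
		if s = strings.TrimSpace(s); s != "" {
			f.AccountingJobStates = append(f.AccountingJobStates, s)
		}
	}
	return f, nil
}

func main() {
	f, err := parseFlags([]string{
		"--job-source=accounting",
		"--accounting-job-states=PENDING,RUNNING",
		"--accounting-jobs-lookback=1d",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", f)
}
```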

@Uburro added the feature label May 1, 2026
Comment thread internal/slurmapi/client.go Outdated
Comment thread internal/slurmapi/client.go Outdated
@theyoprst (Collaborator) commented

I don't get why we need this. Initially we had a problem that the exporter is too slow for big clusters. But this change adds retrieval of jobs from accounting, which will slow it down further.
@Uburro, how will it help?

@Uburro (Collaborator, Author) commented May 4, 2026

> I don't get why we need this. Initially we had a problem that the exporter is too slow for big clusters. But this change adds retrieval of jobs from accounting, which will slow it down further. @Uburro, how will it help?

Added the ability to collect metrics from both the controller and accounting, as well as to choose which metrics to collect. For large clusters, you can trade off certain metrics in favor of performance.

Co-authored-by: itechdima <61321708+itechdima@users.noreply.github.com>
@theyoprst force-pushed the SCHED-316/0 branch 5 times, most recently from 7600ba5 to 5e6f127 (May 5, 2026 16:14)
@theyoprst merged commit ec3de2c into main May 6, 2026
22 checks passed
@theyoprst deleted the SCHED-316/0 branch May 6, 2026 09:30