
feat(metrics): export job OpenMetrics from spurctld on :6822 #217

Open
sgopinath1 wants to merge 4 commits into ROCm:main from sgopinath1:job_metrics_export

@sgopinath1
Collaborator

Summary

OpenMetrics export from spurctld for cluster job gauges.

  • spur-metrics: OpenMetrics encoder and encode_job_metrics() for spur_jobs, per-state spur_jobs_{state}, and running allocation gauges (spur_jobs_cpus_alloc, spur_jobs_memory_alloc_bytes, spur_jobs_gpus_alloc). Golden fixture test included.
  • [metrics] config (spur-core): enabled, listen_addr (default [::]:6822), bind (loopback → 127.0.0.1:6822, all → all interfaces), high_cardinality (gates /metrics/jobs-users-accts).
  • spurctld: Axum metrics server on the effective listen address when [metrics].enabled; GET /metrics and GET /metrics/jobs return the job snapshot; Raft followers return 503; other Slurm paths return 404 until later domains land.
  • Removes temporary log_job_metrics_debug() health-loop logging (superseded by HTTP export).

Builds on merged job collection (ClusterManager::job_metrics() / JobMetricsSnapshot from #212).
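
To illustrate the gauge encoding summarized above, here is a minimal, std-only sketch. It is not the encoder in this PR: `JobSnapshot`, its fields, the choice of `running` and `pending` as the per-state examples, and the HELP strings are all assumptions made for the example. Only the metric names come from the summary.

```rust
use std::fmt::Write;

/// Hypothetical stand-in for the snapshot type; fields are assumptions.
struct JobSnapshot {
    total: u64,
    running: u64,
    pending: u64,
    cpus_alloc: u64,
    memory_alloc_bytes: u64,
    gpus_alloc: u64,
}

/// Append one gauge family (HELP, TYPE, and a single unlabeled sample).
fn gauge(out: &mut String, name: &str, help: &str, value: u64) {
    // Writing to a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} gauge");
    let _ = writeln!(out, "{name} {value}");
}

/// Encode the snapshot as text exposition lines.
fn encode_job_gauges(s: &JobSnapshot) -> String {
    let mut out = String::new();
    gauge(&mut out, "spur_jobs", "Total jobs known to the controller", s.total);
    gauge(&mut out, "spur_jobs_running", "Jobs in the running state", s.running);
    gauge(&mut out, "spur_jobs_pending", "Jobs in the pending state", s.pending);
    gauge(&mut out, "spur_jobs_cpus_alloc", "CPUs allocated to running jobs", s.cpus_alloc);
    gauge(
        &mut out,
        "spur_jobs_memory_alloc_bytes",
        "Memory allocated to running jobs, in bytes",
        s.memory_alloc_bytes,
    );
    gauge(&mut out, "spur_jobs_gpus_alloc", "GPUs allocated to running jobs", s.gpus_alloc);
    out
}
```

Comparing the full output string against a checked-in file is one way to get the kind of golden fixture test mentioned above.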

Deferred (follow-up PRs)

  • Node, scheduler, and partition metrics (collection + export)
  • K8s manifests (Service / port 6822 scrape)
  • deploy/metrics/ AMD device-metrics-exporter hooks
  • Bare-metal E2E curl against :6822

Test plan

  • cargo test -p spur-metrics
  • cargo test -p spurctld (includes leader 200 / follower 503 unit tests)
  • Manual: on a bare-metal leader, submit a job with sbatch, then run curl -sS http://127.0.0.1:6822/metrics/jobs | head; confirm that a follower returns 503

Add spur-metrics encoding for spur_jobs_* gauges with golden tests and
[metrics] config (listen :6822, bind loopback|all, high_cardinality).

Serve /metrics and /metrics/jobs from spurctld when enabled; Raft
followers return 503. Remove temporary log_job_metrics_debug path.
Copilot AI review requested due to automatic review settings May 23, 2026 16:30
@shiv-tyagi
Member

shiv-tyagi commented May 23, 2026 via email


Copilot AI left a comment


Pull request overview

Adds an HTTP metrics endpoint to spurctld (defaulting to port 6822) and introduces OpenMetrics/Prometheus-style text encoding in spur-metrics to export aggregated cluster job gauges.

Changes:

  • Added spur-metrics OpenMetrics text encoder plus encode_job_metrics() with a golden fixture test.
  • Added [metrics] configuration to spur-core and wired defaults into spurctld configs/tests.
  • Added an Axum-based metrics server in spurctld with leader-gating (followers return 503) and basic endpoint scaffolding.
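
The listen-address resolution implied by the [metrics] settings could look roughly like the sketch below. It assumes that `bind` replaces only the IP portion of `listen_addr` and keeps its port, and that an unparsable `listen_addr` is reported as an error. The `Bind` enum and `effective_listen_addr` function are hypothetical names, not the helper added in spur-core.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// Hypothetical model of the `bind` setting.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Bind {
    Loopback,
    All,
}

/// Resolve the address to listen on (assumed semantics, see lead-in).
fn effective_listen_addr(listen_addr: &str, bind: Bind) -> Result<SocketAddr, String> {
    // Surface a parse failure instead of falling back to a default port.
    let parsed: SocketAddr = listen_addr
        .parse()
        .map_err(|e| format!("invalid metrics.listen_addr {listen_addr:?}: {e}"))?;
    let ip = match bind {
        Bind::Loopback => IpAddr::V4(Ipv4Addr::LOCALHOST),
        Bind::All => IpAddr::V6(Ipv6Addr::UNSPECIFIED),
    };
    Ok(SocketAddr::new(ip, parsed.port()))
}
```

Under these assumptions, `[::]:6822` with `Bind::Loopback` resolves to `127.0.0.1:6822`, and a value such as `:6822` is rejected because it is not a complete socket address.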

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| crates/spurctld/src/metrics_server.rs | New Axum HTTP server exposing /metrics and /metrics/jobs, with leader-gated responses. |
| crates/spurctld/src/main.rs | Starts metrics server when [metrics].enabled; removes temporary debug logging call. |
| crates/spurctld/src/cluster.rs | Removes temporary log_job_metrics_debug() and updates default config in tests. |
| crates/spurctld/Cargo.toml | Adds axum dependency. |
| crates/spur-metrics/src/openmetrics.rs | Adds text exposition builder and label escaping helper. |
| crates/spur-metrics/src/job_export.rs | Implements job snapshot → text encoding + tests (including golden fixture). |
| crates/spur-metrics/src/lib.rs | Exposes new modules and re-exports encode_job_metrics. |
| crates/spur-metrics/tests/fixtures/job_metrics.prom | Golden expected output for job metrics export. |
| crates/spur-core/src/config.rs | Adds MetricsConfig + parsing/default tests + effective listen addr helper. |
| Cargo.lock | Records axum dependency addition. |
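
For readers unfamiliar with the label escaping that openmetrics.rs is described as providing: in the Prometheus and OpenMetrics text formats, a label value must escape backslash, double quote, and line feed. The sketch below shows that rule. The function name is hypothetical and the code is not copied from the PR.

```rust
/// Escape a label value for the text exposition format:
/// `\` becomes `\\`, `"` becomes `\"`, and a line feed becomes `\n`.
fn escape_label_value(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for c in raw.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}
```

Escaping of this kind matters most for labels that carry user-supplied strings, such as user or account names on a high-cardinality endpoint.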


Comment thread crates/spurctld/src/metrics_server.rs Outdated
Comment thread crates/spurctld/src/metrics_server.rs Outdated
Comment on lines +20 to +22
/// OpenMetrics 1.0 text exposition (Slurm 25.11 compatible).
pub const OPENMETRICS_CONTENT_TYPE: &str = "text/plain; version=0.0.4; charset=utf-8";

Collaborator Author


This is for parity with Slurm.
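
For context on this thread: `text/plain; version=0.0.4` is the media type of the Prometheus text exposition format, whereas the OpenMetrics 1.0 specification uses `application/openmetrics-text; version=1.0.0; charset=utf-8`. If both ever needed to be served, the choice could be keyed on the request's Accept header. The sketch below uses hypothetical names and a deliberately naive substring check; the PR itself serves the single fixed value shown above.

```rust
const PROMETHEUS_TEXT: &str = "text/plain; version=0.0.4; charset=utf-8";
const OPENMETRICS_TEXT: &str = "application/openmetrics-text; version=1.0.0; charset=utf-8";

/// Pick a response content type from an optional Accept header value.
/// Naive on purpose: ignores q-values and wildcard media ranges.
fn pick_content_type(accept: Option<&str>) -> &'static str {
    match accept {
        Some(a) if a.contains("application/openmetrics-text") => OPENMETRICS_TEXT,
        _ => PROMETHEUS_TEXT,
    }
}
```

Switching the media type would also change the body requirements, since OpenMetrics output must end with a `# EOF` line.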

Comment thread crates/spur-core/src/config.rs Outdated
Comment thread crates/spur-metrics/src/openmetrics.rs
sgopinath1 and others added 3 commits May 23, 2026 16:53
Co-authored-by: Cursor <cursoragent@cursor.com>
Bind the metrics listener before logging, check Raft leadership before
scanning jobs on /metrics/jobs, and return an error for invalid
metrics.listen_addr instead of silently defaulting the port.
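
The ordering described in this commit message (leadership check first, job scan second) can be sketched without any HTTP framework. `jobs_response`, the closure-based collector, and the response body text are hypothetical and only model the control flow; the real handler in metrics_server.rs is built on Axum.

```rust
/// Return (status, body) for a /metrics/jobs request.
/// `collect` is only invoked on the leader, so followers never scan jobs.
fn jobs_response<F>(is_leader: bool, collect: F) -> (u16, String)
where
    F: FnOnce() -> String,
{
    if !is_leader {
        // 503 Service Unavailable: this node is a Raft follower.
        return (503, String::from("not the Raft leader\n"));
    }
    (200, collect())
}
```

Taking the collector as a closure keeps the potentially expensive snapshot lazy, which makes the "check before scan" property easy to assert in a unit test.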
@sgopinath1 sgopinath1 force-pushed the job_metrics_export branch from 38591dc to da96c0e Compare May 24, 2026 01:36
3 participants