feat(metrics): export job OpenMetrics from spurctld on :6822 #217
Open
sgopinath1 wants to merge 4 commits into
Add spur-metrics encoding for spur_jobs_* gauges with golden tests and [metrics] config (listen :6822, bind loopback|all, high_cardinality). Serve /metrics and /metrics/jobs from spurctld when enabled; Raft followers return 503. Remove temporary log_job_metrics_debug path.
Member
Please merge #216 and rebase if CI / deny check fails.
On Sat, 23 May 2026 at 10:00 PM, Sudheendra Gopinath wrote:
> @sgopinath1 requested your review on: ROCm/spur#217 feat(metrics): export job OpenMetrics from spurctld on :6822 as a code owner.
Pull request overview
Adds an HTTP metrics endpoint to spurctld (defaulting to port 6822) and introduces OpenMetrics/Prometheus-style text encoding in spur-metrics to export aggregated cluster job gauges.
Changes:
- Added `spur-metrics` OpenMetrics text encoder plus `encode_job_metrics()` with a golden fixture test.
- Added `[metrics]` configuration to `spur-core` and wired defaults into `spurctld` configs/tests.
- Added an Axum-based metrics server in `spurctld` with leader-gating (followers return 503) and basic endpoint scaffolding.
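The leader-gating behaviour described above can be illustrated without Axum. This is a minimal sketch only: `MetricsResponse` and `metrics_jobs_response` are hypothetical names, not taken from the PR, and the 503 body text is an assumption. It shows the leadership check happening before any job data is rendered.

```rust
/// Minimal stand-in for an HTTP response: status code plus body.
/// (Hypothetical type; the PR itself uses Axum responses.)
#[derive(Debug, PartialEq)]
pub struct MetricsResponse {
    pub status: u16,
    pub body: String,
}

/// Hypothetical handler core: check leadership first, and only render the
/// job snapshot when this node is the Raft leader.
pub fn metrics_jobs_response<F>(is_leader: bool, render_jobs: F) -> MetricsResponse
where
    F: FnOnce() -> String,
{
    if !is_leader {
        // Followers answer 503 Service Unavailable without scanning jobs.
        return MetricsResponse {
            status: 503,
            body: "not the Raft leader\n".to_string(), // assumed body text
        };
    }
    MetricsResponse {
        status: 200,
        body: render_jobs(),
    }
}
```

Passing the renderer as a closure keeps the job scan lazy, so a follower never pays for it.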
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| crates/spurctld/src/metrics_server.rs | New Axum HTTP server exposing /metrics and /metrics/jobs, with leader-gated responses. |
| crates/spurctld/src/main.rs | Starts metrics server when [metrics].enabled; removes temporary debug logging call. |
| crates/spurctld/src/cluster.rs | Removes temporary log_job_metrics_debug() and updates default config in tests. |
| crates/spurctld/Cargo.toml | Adds axum dependency. |
| crates/spur-metrics/src/openmetrics.rs | Adds text exposition builder and label escaping helper. |
| crates/spur-metrics/src/job_export.rs | Implements job snapshot → text encoding + tests (including golden fixture). |
| crates/spur-metrics/src/lib.rs | Exposes new modules and re-exports encode_job_metrics. |
| crates/spur-metrics/tests/fixtures/job_metrics.prom | Golden expected output for job metrics export. |
| crates/spur-core/src/config.rs | Adds MetricsConfig + parsing/default tests + effective listen addr helper. |
| Cargo.lock | Records axum dependency addition. |
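As a rough illustration of the "text exposition builder and label escaping helper" row above, the sketch below escapes label values the way the Prometheus text format requires (backslash, double quote, newline) and formats one sample line. The function names and the `partition` label in the usage note are hypothetical; the PR's actual `openmetrics.rs` API is not shown in this thread.

```rust
/// Escape a label value for the text exposition format:
/// backslash, double quote, and newline must be escaped.
pub fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical helper: format one sample line such as `name{k="v"} 1`.
/// With no labels the braces are omitted entirely.
pub fn sample_line(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    if labels.is_empty() {
        return format!("{name} {value}\n");
    }
    let rendered: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{}\"", escape_label_value(v)))
        .collect();
    format!("{name}{{{}}} {value}\n", rendered.join(","))
}
```

For example, `sample_line("m", &[("partition", "gpu")], 1)` yields `m{partition="gpu"} 1` followed by a newline.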
Comment on lines +20 to +22
    /// OpenMetrics 1.0 text exposition (Slurm 25.11 compatible).
    pub const OPENMETRICS_CONTENT_TYPE: &str = "text/plain; version=0.0.4; charset=utf-8";
Collaborator
Author
This is for parity with Slurm.
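For context on this thread: the quoted constant carries the media type of the Prometheus text format 0.0.4, while OpenMetrics 1.0 registers its own media type. The sketch below only places the two side by side. `pick_content_type` is a hypothetical negotiation helper that is not part of this PR, which deliberately keeps the 0.0.4 value for Slurm parity.

```rust
/// Media type of the Prometheus text exposition format, version 0.0.4
/// (the value used by the constant under review).
pub const PROMETHEUS_TEXT_CONTENT_TYPE: &str = "text/plain; version=0.0.4; charset=utf-8";

/// Media type defined by the OpenMetrics 1.0 specification.
pub const OPENMETRICS_TEXT_CONTENT_TYPE: &str =
    "application/openmetrics-text; version=1.0.0; charset=utf-8";

/// Hypothetical helper (an assumption, not the PR's behaviour): answer with
/// the OpenMetrics media type only when the client asks for it explicitly.
pub fn pick_content_type(accept_header: &str) -> &'static str {
    if accept_header.contains("application/openmetrics-text") {
        OPENMETRICS_TEXT_CONTENT_TYPE
    } else {
        PROMETHEUS_TEXT_CONTENT_TYPE
    }
}
```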
Co-authored-by: Cursor <cursoragent@cursor.com>
Bind the metrics listener before logging, check Raft leadership before scanning jobs on /metrics/jobs, and return an error for invalid metrics.listen_addr instead of silently defaulting the port.
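The `listen_addr` handling in this commit message can be sketched with the standard library alone. The names `BindMode` and `effective_listen_addr` are assumptions made for illustration; the PR's real helper in `spur-core` may differ in signature and error type. The sketch assumes loopback mode keeps the configured port.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical mirror of the `bind` setting (`loopback` or `all`).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BindMode {
    Loopback,
    All,
}

/// Resolve the address the metrics listener should bind to.
/// An unparsable `listen_addr` is reported as an error rather than being
/// replaced by a default port.
pub fn effective_listen_addr(listen_addr: &str, bind: BindMode) -> Result<SocketAddr, String> {
    let parsed: SocketAddr = listen_addr
        .parse()
        .map_err(|e| format!("invalid metrics.listen_addr {listen_addr:?}: {e}"))?;
    Ok(match bind {
        // Loopback keeps the configured port but pins the IP to 127.0.0.1.
        BindMode::Loopback => SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), parsed.port()),
        // `all` uses the configured address unchanged.
        BindMode::All => parsed,
    })
}
```

Returning `Result` forces the caller to handle a bad address at startup instead of discovering later that the listener came up on an unexpected port.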
38591dc to da96c0e
Summary
OpenMetrics export from spurctld for cluster job gauges.
- `spur-metrics`: OpenMetrics encoder and `encode_job_metrics()` for `spur_jobs`, per-state `spur_jobs_{state}`, and running allocation gauges (`spur_jobs_cpus_alloc`, `spur_jobs_memory_alloc_bytes`, `spur_jobs_gpus_alloc`). Golden fixture test included.
- `[metrics]` config (`spur-core`): `enabled`, `listen_addr` (default `[::]:6822`), `bind` (`loopback` → `127.0.0.1:6822`, `all` → all interfaces), `high_cardinality` (gates `/metrics/jobs-users-accts`).
- `spurctld`: Axum metrics server on the effective listen address when `[metrics].enabled`; `GET /metrics` and `GET /metrics/jobs` return the job snapshot; Raft followers return 503; other Slurm paths return 404 until later domains land.
- Removed `log_job_metrics_debug()` health-loop logging (superseded by HTTP export).

Builds on merged job collection (`ClusterManager::job_metrics()` / `JobMetricsSnapshot` from #212).

Deferred (follow-up PRs)
deploy/metrics/AMD device-metrics-exporter hooks:6822

Test plan
- `cargo test -p spur-metrics`
- `cargo test -p spurctld` (includes leader 200 / follower 503 unit tests)
- `curl -sS http://127.0.0.1:6822/metrics/jobs | head` after `sbatch`; confirm follower returns 503
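To make the gauge names in the Summary concrete, here is a minimal sketch of turning a job snapshot into exposition text. `JobSnapshot`, its fields, and `encode_snapshot` are hypothetical stand-ins for the PR's `JobMetricsSnapshot` and `encode_job_metrics()`. The exact output layout (for example whether `# HELP` lines or a trailing `# EOF` are emitted) is defined by the golden fixture, not by this sketch.

```rust
/// Hypothetical stand-in for `JobMetricsSnapshot`; field names are assumptions.
pub struct JobSnapshot {
    pub total: u64,
    /// (state name, job count), e.g. ("running", 2).
    pub by_state: Vec<(String, u64)>,
    pub cpus_alloc: u64,
    pub memory_alloc_bytes: u64,
    pub gpus_alloc: u64,
}

/// Append one gauge: a `# TYPE` line followed by the sample line.
fn push_gauge(out: &mut String, name: &str, value: u64) {
    out.push_str(&format!("# TYPE {name} gauge\n{name} {value}\n"));
}

/// Encode the snapshot using the metric names listed in the Summary.
pub fn encode_snapshot(snapshot: &JobSnapshot) -> String {
    let mut out = String::new();
    push_gauge(&mut out, "spur_jobs", snapshot.total);
    for (state, count) in &snapshot.by_state {
        // Per-state gauges follow the `spur_jobs_{state}` pattern.
        push_gauge(&mut out, &format!("spur_jobs_{state}"), *count);
    }
    push_gauge(&mut out, "spur_jobs_cpus_alloc", snapshot.cpus_alloc);
    push_gauge(&mut out, "spur_jobs_memory_alloc_bytes", snapshot.memory_alloc_bytes);
    push_gauge(&mut out, "spur_jobs_gpus_alloc", snapshot.gpus_alloc);
    out
}
```

Keeping the encoder a pure function from snapshot to `String` is what makes a golden fixture comparison straightforward.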