Feature Description
AMD GPU observability for the fleet: node-level GPU temperature/power/utilization from the AMD node surfaced into Prometheus + Grafana, the AMD analog of the DCGM exporter used for NVIDIA. For the iGPU the realistic path is a community rocm-smi-based exporter plus the inference signals already emitted by llama.cpp /metrics.
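A rough sketch of the llama.cpp half of that path: pull the three SLO-relevant values out of a /metrics scrape. The metric names in `WANTED` (`llamacpp:predicted_tokens_seconds`, `llamacpp:requests_deferred`, `llamacpp:kv_cache_usage_ratio`) are assumptions based on what the llama.cpp server has exposed, and the sample text is illustrative, so check both against the actual build before relying on them.

```python
# Assumed metric names; verify against your llama.cpp server's /metrics output.
WANTED = {
    "tokens_per_second": "llamacpp:predicted_tokens_seconds",
    "queue_depth": "llamacpp:requests_deferred",
    "kv_cache_ratio": "llamacpp:kv_cache_usage_ratio",
}


def parse_metrics(text: str) -> dict:
    """Parse unlabeled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # labeled series are out of scope for this sketch
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # ignore values that are not numeric
    return samples


def slo_signals(text: str) -> dict:
    """Return the SLO signals, with None for any metric the scrape lacks."""
    parsed = parse_metrics(text)
    return {label: parsed.get(name) for label, name in WANTED.items()}


# Illustrative scrape body, not captured from a real server.
SAMPLE = """\
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 31.5
llamacpp:requests_deferred 2
"""

if __name__ == "__main__":
    print(slo_signals(SAMPLE))
```

A missing metric comes back as `None` instead of 0, so a dashboard can tell "not reported by this build" apart from "idle".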
Problem Statement
As a fleet operator, I want the AMD node's GPU health and the inference SLO signals in the same Grafana I use for NVIDIA, so that an AMD tier is observable, not a blind spot.
AMD's official device-metrics-exporter is Instinct/datacenter-scoped and does not enumerate the Strix Halo iGPU, so the NVIDIA DCGM approach does not port directly.
Proposed Solution
- Deploy a community rocm-smi-based exporter (e.g. rocm-smi-exporter) as a DaemonSet on AMD nodes for edge temperature and socket power; document the limited metric set honestly.
- Rely on llama.cpp /metrics (already enabled via --metrics) for the SLO-relevant signals: tokens/sec, queue depth, KV-cache occupancy. This is the primary inference signal and is backend-agnostic.
- A Grafana panel/row for the AMD tier, mirroring the GPU dashboard layout.
- Note the gap: official device-metrics-exporter is the path if/when Instinct/MI hardware is added; it does not help the APU.
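To make the "limited metric set" concrete, here is a minimal sketch of what a rocm-smi-based exporter does at its core: take the JSON that `rocm-smi --showtemp --showpower --json` prints and turn it into Prometheus text exposition lines. The JSON key strings, the metric names, and the sample payload are all assumptions for illustration. The key strings vary between ROCm releases, so inspect the real output on the node first.

```python
import json

# Hypothetical mapping from rocm-smi JSON keys to Prometheus metric names.
# Key strings differ across ROCm releases; adjust to what the node prints.
KEY_TO_METRIC = {
    "Temperature (Sensor edge) (C)": "amd_gpu_edge_temperature_celsius",
    "Current Socket Graphics Package Power (W)": "amd_gpu_socket_power_watts",
}


def render_metrics(smi_json: str) -> str:
    """Convert rocm-smi JSON text into Prometheus text exposition lines."""
    lines = []
    for card, fields in sorted(json.loads(smi_json).items()):
        if not isinstance(fields, dict):
            continue  # skip anything that is not a per-device mapping
        for key, metric in KEY_TO_METRIC.items():
            try:
                value = float(fields[key])
            except (KeyError, ValueError, TypeError):
                continue  # sensor absent or "N/A": omit rather than emit 0
            lines.append(f'{metric}{{card="{card}"}} {value}')
    return "".join(line + "\n" for line in lines)


# Illustrative input only, not captured from real hardware.
SAMPLE = json.dumps({
    "card0": {
        "Temperature (Sensor edge) (C)": "47.0",
        "Current Socket Graphics Package Power (W)": "N/A",
    }
})

if __name__ == "__main__":
    print(render_metrics(SAMPLE), end="")
```

Sensors that report "N/A" are dropped instead of exported as zero, which keeps the limited metric set visible in Grafana as missing series and not as misleading flat lines.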
Alternatives Considered
- amd/amd_smi_exporter: also datacenter/Instinct-oriented; unlikely to enumerate the iGPU cleanly.
- Official ROCm/device-metrics-exporter: Instinct-scoped; deferred to a future discrete-AMD tier.
Additional Context
gfx1151; scope the issue to "what actually reports numbers on this box."
Priority
Willingness to Contribute