[FEATURE] AMD GPU observability: rocm-smi exporter + llama.cpp /metrics + Grafana panel #700

@Defilan

Feature Description

AMD GPU observability for the fleet: surface node-level GPU temperature, power, and utilization from the AMD node into Prometheus and Grafana, as the AMD analog of the DCGM exporter used for NVIDIA. For the iGPU, the realistic path is a community rocm-smi-based exporter plus the inference signals that llama.cpp already emits on /metrics.
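
To make the rocm-smi half concrete, here is a minimal sketch of what such an exporter does at its core: take `rocm-smi` JSON output and turn it into Prometheus text exposition lines. The JSON field names and the `amd_gpu_*` metric names below are assumptions for illustration only; field names differ between ROCm releases, and a real community exporter will use its own naming. The sample payload is made up and stands in for the output of something like `rocm-smi --showtemp --showpower --showuse --json`.

```python
import json

# Hypothetical mapping from rocm-smi JSON field names to metric names.
# Both sides are assumptions: check the JSON your ROCm version actually emits.
FIELD_TO_METRIC = {
    "Temperature (Sensor edge) (C)": "amd_gpu_edge_temperature_celsius",
    "Average Graphics Package Power (W)": "amd_gpu_package_power_watts",
    "GPU use (%)": "amd_gpu_utilization_percent",
}


def to_prometheus(rocm_smi_json: str) -> str:
    """Convert rocm-smi style JSON into Prometheus text exposition lines."""
    data = json.loads(rocm_smi_json)
    lines = []
    for card, fields in sorted(data.items()):
        if not isinstance(fields, dict):
            continue  # ignore non-device entries
        for field, metric in FIELD_TO_METRIC.items():
            raw = fields.get(field)
            if raw is None:
                continue  # sensor not reported on this device
            try:
                value = float(raw)
            except (TypeError, ValueError):
                continue  # skip placeholders such as "N/A"
            lines.append(f'{metric}{{card="{card}"}} {value}')
    return "\n".join(lines) + "\n"


# Made-up sample payload; power is "N/A" to mimic a limited metric set.
SAMPLE = json.dumps({
    "card0": {
        "Temperature (Sensor edge) (C)": "47.0",
        "Average Graphics Package Power (W)": "N/A",
        "GPU use (%)": "12",
    }
})

if __name__ == "__main__":
    print(to_prometheus(SAMPLE), end="")
```

Skipping missing or non-numeric fields, rather than exporting zeros, keeps a limited sensor set from showing up as misleading flat lines in Grafana.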

Problem Statement

As a fleet operator, I want the AMD node's GPU health and the inference SLO signals in the same Grafana I use for NVIDIA, so that an AMD tier is observable, not a blind spot.

AMD's official device-metrics-exporter is Instinct/datacenter-scoped and does not enumerate the Strix Halo iGPU, so the NVIDIA DCGM approach does not port directly.

Proposed Solution

  • Deploy a community rocm-smi-based exporter (e.g. rocm-smi-exporter) as a DaemonSet on AMD nodes for edge temperature and socket power; document the limited metric set honestly.
  • Rely on llama.cpp /metrics (already enabled via --metrics) for the SLO-relevant signals: tokens/sec, queue depth, KV-cache occupancy. This is the primary inference signal and is backend-agnostic.
  • Add a Grafana panel/row for the AMD tier, mirroring the GPU dashboard layout.
  • Note the gap: the official device-metrics-exporter is the path if/when Instinct/MI hardware is added; it does not help the APU.
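
For the llama.cpp side, a rough sketch of pulling the three SLO signals out of a `/metrics` scrape might look like the following. The metric names (`llamacpp:predicted_tokens_seconds`, `llamacpp:requests_deferred`, `llamacpp:kv_cache_usage_ratio`) are as I recall them from the llama.cpp server README and should be treated as assumptions; the set can change between builds, so verify against the actual `/metrics` output. The sample text is made up, and the parser only handles unlabelled samples.

```python
# Assumed mapping from SLO signal to llama.cpp metric name; verify per build.
WANTED = {
    "tokens_per_second": "llamacpp:predicted_tokens_seconds",
    "queue_depth": "llamacpp:requests_deferred",
    "kv_cache_usage": "llamacpp:kv_cache_usage_ratio",
}


def parse_metrics(text: str) -> dict:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # not a numeric sample
    return samples


def slo_signals(text: str) -> dict:
    """Return the SLO signals; a metric missing from the scrape maps to None."""
    samples = parse_metrics(text)
    return {signal: samples.get(metric) for signal, metric in WANTED.items()}


# Made-up scrape; the KV-cache metric is deliberately absent.
SAMPLE_SCRAPE = """\
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 31.5
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 2
"""

if __name__ == "__main__":
    print(slo_signals(SAMPLE_SCRAPE))
```

Returning `None` for an absent metric makes it easy to flag a panel as "no data" instead of silently plotting zero, which matters if a given build does not expose one of these series.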

Alternatives Considered

  • amd/amd_smi_exporter: also datacenter/Instinct-oriented; unlikely to enumerate the iGPU cleanly.
  • Official ROCm/device-metrics-exporter: Instinct-scoped; deferred to a future discrete-AMD tier.

Additional Context

Priority

  • Medium - Nice to have

Willingness to Contribute

  • Yes, I can submit a PR

Labels

enhancement (New feature or request)