[FEATURE] AMD GPU observability: rocm-smi exporter + llama.cpp /metrics + Grafana panel #700

## Feature Description

Add AMD GPU observability for the fleet: surface node-level GPU temperature, power, and utilization from the AMD node in Prometheus and Grafana, as the AMD analog of the DCGM exporter used for NVIDIA. For the iGPU, the realistic path is a community `rocm-smi`-based exporter plus the inference signals that llama.cpp already emits on `/metrics`.

## Problem Statement

> As a fleet operator, I want the AMD node's GPU health and the inference SLO signals in the same Grafana I use for NVIDIA, so that an AMD tier is observable, not a blind spot.

AMD's official `device-metrics-exporter` is Instinct/datacenter-scoped and does not enumerate the Strix Halo iGPU, so the NVIDIA DCGM approach does not port directly.

## Proposed Solution

- Deploy a community `rocm-smi`-based exporter (e.g. `rocm-smi-exporter`) as a DaemonSet on AMD nodes for edge temperature and socket power; document the limited metric set honestly.
- Rely on llama.cpp `/metrics` (already enabled via `--metrics`) for the SLO-relevant signals: tokens/sec, queue depth, KV-cache occupancy. These are the primary inference signals and are backend-agnostic.
- Add a Grafana panel/row for the AMD tier, mirroring the existing GPU dashboard layout.
- Note the gap: the official `device-metrics-exporter` is the path if/when Instinct/MI hardware is added; it does not help the APU.
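To make the llama.cpp side concrete, here is a minimal sketch of pulling the three SLO signals out of a `/metrics` scrape in Prometheus text format. The metric names in `SLO_METRICS` are assumptions based on what llama.cpp server builds have exposed; they vary between versions (the KV-cache gauges in particular may be absent), so check them against the actual output on this node. The function name and the short signal labels are hypothetical.

```python
import re

# ASSUMPTION: metric names as seen in llama.cpp server /metrics output.
# They differ between builds; verify against the real scrape before relying on them.
SLO_METRICS = {
    "llamacpp:predicted_tokens_seconds": "decode_tokens_per_sec",
    "llamacpp:requests_deferred": "queue_depth",
    "llamacpp:kv_cache_usage_ratio": "kv_cache_occupancy",
}

# One sample line in Prometheus text format: name, optional {labels}, value.
_SAMPLE_LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def extract_slo_signals(metrics_text: str) -> dict[str, float]:
    """Return the SLO signals found in a /metrics scrape; absent ones are omitted."""
    signals: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_value = match.groups()
        if name in SLO_METRICS:
            try:
                signals[SLO_METRICS[name]] = float(raw_value)
            except ValueError:
                continue  # not a number; leave the signal out
    return signals
```

A signal missing from the returned dict means this build does not export it, which is worth recording in the dashboard notes rather than plotting as zero.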

## Alternatives Considered

- **`amd/amd_smi_exporter`:** also datacenter/Instinct-oriented; unlikely to enumerate the iGPU cleanly.
- **Official `ROCm/device-metrics-exporter`:** Instinct-scoped; deferred to a future discrete-AMD tier.

## Additional Context

- Related issues: #696 (epic)
- iGPU support in the community exporters, and the exact metric set they report, should be verified hands-on on `gfx1151`; scope the issue to "what actually reports numbers on this box."
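One way to run the "what actually reports numbers" check is to filter the JSON output of `rocm-smi` down to the fields that parse as numbers. This is a sketch under assumptions: it expects the per-device JSON object that `rocm-smi --json` prints, and the function name is hypothetical. The field names `rocm-smi` emits differ between ROCm versions, so the code does not hard-code any.

```python
import json


def fields_reporting_numbers(rocm_smi_json: str) -> dict[str, dict[str, float]]:
    """Map each device to the fields whose values parse as numbers.

    ASSUMPTION: input is a JSON object keyed by device (e.g. "card0"), each
    holding a flat object of field name -> value, as printed by rocm-smi --json.
    """
    data = json.loads(rocm_smi_json)
    report: dict[str, dict[str, float]] = {}
    for device, fields in data.items():
        if not isinstance(fields, dict):
            continue  # skip any non-device entries
        numeric: dict[str, float] = {}
        for key, value in fields.items():
            try:
                numeric[key] = float(value)
            except (TypeError, ValueError):
                continue  # "N/A", empty strings, nested objects
        if numeric:
            report[device] = numeric
    return report
```

Feeding it the output of something like `rocm-smi --showtemp --showpower --showuse --json` on the `gfx1151` box (flags to be confirmed against the installed version) gives the honest metric list to document for the exporter.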

## Priority

- [x] Medium - Nice to have

## Willingness to Contribute

- [x] Yes, I can submit a PR

