feat: add cluster-troubleshoot skill #33
Add OpenShift cluster diagnostics skill for investigating live cluster issues (pod crashes, node failures, operator degradation, etc.) using oc commands and Prometheus metrics.

Signed-off-by: Alberto Falossi <afalossi@redhat.com>
Assisted-by: Claude Code:claude-opus-4-6
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: falox

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
- falox
- tremes
- iNecas
- harche
You can drop me from this list; the core idea was to ensure that individual teams own their respective skills.
When diagnosing a specific symptom:

1. **Scope the blast radius**: is it one pod, one node, one namespace, or cluster-wide? This determines which layer to start from.
What is the target group of users/agents?
This seems a bit too general to me. If we want to use it in https://github.com/openshift/lightspeed-agentic-alerts-adapter, I would do something like:
1. **Start from the alert** — if the user provides an alert name, query Prometheus for its firing instance first:
```bash
# Assumes TOKEN and THANOS_URL are already set in the environment.
wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" \
"https://${THANOS_URL}/api/v1/query?query=$(python3 -c 'import urllib.parse; print(urllib.parse.quote("ALERTS{alertname=\"<NAME>\"}"))')" | jq '.data.result[]'
```
The response contains the full label set (namespace, pod, node, service, etc.) — use these exact values in all subsequent `oc` commands. If the alert is not currently firing, note this and proceed with the user-provided labels.
2. **Fetch the alerting rule** — query the rules API to get the alert's PromQL expression and thresholds:
```bash
# Assumes TOKEN and THANOS_URL are already set in the environment.
wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" \
"https://${THANOS_URL}/api/v1/rules" | jq '.data.groups[].rules[] | select(.name=="<NAME>" and .type=="alerting")'
```
The `expr` field tells you exactly which metric breached which threshold. Use it to query the underlying metric directly and understand the current vs. expected values. Also extract the rule's `annotations` — OpenShift alerts carry `description`, `summary`, and often `runbook_url`. The description explains the impact; the runbook provides vendor-recommended remediation steps. Follow the runbook if one exists before improvising.
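For the "query the underlying metric directly" part, the rule's `expr` has to be percent-encoded before it goes into the query URL. A minimal sketch of one way to do that without the `python3` one-liner, assuming ASCII-only expressions; the helper names `urlencode` and `query_url` are hypothetical and not part of the skill, and `THANOS_URL` is the same variable used in the commands above:

```bash
#!/usr/bin/env bash

# Percent-encode a string for use as a URL query parameter value.
# Pure bash; assumes the input is ASCII (multi-byte characters are not handled).
urlencode() {
  local s="$1" out="" c i
  local LC_ALL=C
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      # Unreserved characters pass through unchanged.
      [a-zA-Z0-9._~-]) out+="$c" ;;
      # Everything else becomes %XX (the "'c" form yields the character code).
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Build the instant-query URL for a PromQL expression (hypothetical helper).
query_url() {
  printf 'https://%s/api/v1/query?query=%s\n' "$THANOS_URL" "$(urlencode "$1")"
}
```

With these defined, the expression copied from the rule can be passed straight through, for example `wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" "$(query_url '<EXPR>')" | jq '.data.result[]'`, where `<EXPR>` stands for the rule's `expr` value.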