feat: add cluster-troubleshoot skill by falox · Pull Request #33 · openshift/agentic-skills

falox · 2026-06-15T07:20:59Z

Add OpenShift cluster diagnostics skill for investigating live cluster issues (pod crashes, node failures, operator degradation, etc.) using oc commands and Prometheus metrics.

Assisted-by: Claude Code:claude-opus-4-6

Signed-off-by: Alberto Falossi <afalossi@redhat.com>

openshift-ci · 2026-06-15T07:21:03Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-06-15T07:21:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: falox
Once this PR has been reviewed and has the lgtm label, please assign matzew for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

harche · 2026-06-15T13:02:29Z

+  - falox
+  - tremes
+  - iNecas
+  - harche


You can drop me from this list; the core idea was to ensure that individual teams own their respective skills.

tremes · 2026-06-22T11:59:25Z

+
+When diagnosing a specific symptom:
+
+1. **Scope the blast radius** — is it one pod, one node, one namespace, or cluster-wide? This determines which layer to start from.


What is the target group of users/agents?

This seems a bit too general to me. If we want to use it in https://github.com/openshift/lightspeed-agentic-alerts-adapter, I would do something like:

1. **Start from the alert** — if the user provides an alert name, query Prometheus for its firing instance first:

   ```bash
   wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" \
     "https://${THANOS_URL}/api/v1/query?query=$(python3 -c 'import urllib.parse; print(urllib.parse.quote("ALERTS{alertname=\"<NAME>\"}"))')" | jq '.data.result[]'
   ```

   The response contains the full label set (namespace, pod, node, service, etc.) — use these exact values in all subsequent `oc` commands. If the alert is not currently firing, note this and proceed with the user-provided labels.

2. **Fetch the alerting rule** — query the rules API to get the alert's PromQL expression and thresholds:

   ```bash
   wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" \
     "https://${THANOS_URL}/api/v1/rules" | jq '.data.groups[].rules[] | select(.name=="<NAME>" and .type=="alerting")'
   ```

   The `expr` field tells you exactly which metric breached which threshold. Use it to query the underlying metric directly and understand the current vs. expected values.

   Also extract the rule's `annotations` — OpenShift alerts carry `description`, `summary`, and often `runbook_url`. The description explains the impact; the runbook provides vendor-recommended remediation steps. Follow the runbook if one exists before improvising.
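The inline `python3 -c` call in step 1 only URL-encodes the PromQL selector. A minimal sketch of the same step as a reusable shell helper, assuming bash is available: `urlencode` and `alert_query_url` are hypothetical names, not part of the skill or of any OpenShift tooling. Unlike Python's `urllib.parse.quote`, which leaves `/` unescaped by default, this helper encodes every character outside the unreserved set.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the PR): percent-encode a PromQL
# selector and build the instant-query URL used in step 1.

# Percent-encode every character except RFC 3986 unreserved ones.
# Assumes ASCII input, which holds for typical alert names.
urlencode() {
  local LC_ALL=C s="$1" out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) printf -v hex '%%%02X' "'$c"; out+="$hex" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Build the query URL for a firing alert.
# $1: Thanos/Prometheus host (assumed to be the same value as THANOS_URL)
# $2: alert name
alert_query_url() {
  local host="$1" name="$2"
  printf 'https://%s/api/v1/query?query=%s\n' \
    "$host" "$(urlencode "ALERTS{alertname=\"${name}\"}")"
}
```

For example, `alert_query_url "$THANOS_URL" KubePodCrashLooping` prints a URL ending in `query=ALERTS%7Balertname%3D%22KubePodCrashLooping%22%7D`, which could be passed to the `wget` call in place of the inline Python.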

openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 15, 2026

harche mentioned this pull request Jun 15, 2026

[WIP] skills: Add monitoring skills (prometheus, monitoring-ops) #7

Closed

3 tasks

harche reviewed Jun 15, 2026

tremes reviewed Jun 22, 2026


falox wants to merge 1 commit into openshift:main from falox:cluster-troubleshoot-skill
