feat: add cluster-troubleshoot skill #33

Draft
falox wants to merge 1 commit into openshift:main from falox:cluster-troubleshoot-skill

falox commented Jun 15, 2026


Add OpenShift cluster diagnostics skill for investigating live cluster issues (pod crashes, node failures, operator degradation, etc.) using oc commands and Prometheus metrics.

Assisted-by: Claude Code:claude-opus-4-6

Signed-off-by: Alberto Falossi <afalossi@redhat.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 15, 2026
openshift-ci Bot commented Jun 15, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci Bot commented Jun 15, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: falox
Once this PR has been reviewed and has the lgtm label, please assign matzew for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- falox
- tremes
- iNecas
- harche

You can drop me from this list, the core idea was to ensure individual teams are the owners of the respective skills.


When diagnosing a specific symptom:

1. **Scope the blast radius** — is it one pod, one node, one namespace, or cluster-wide? This determines which layer to start from.

What is the target group of users/agents?

This seems a bit too general to me. If we want to use it in https://github.com/openshift/lightspeed-agentic-alerts-adapter, I would do something like:

1. **Start from the alert** — if the user provides an alert name, query Prometheus for its firing instance first:
   ```bash
   wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" \
     "https://${THANOS_URL}/api/v1/query?query=$(python3 -c 'import urllib.parse; print(urllib.parse.quote("ALERTS{alertname=\"<NAME>\"}"))')" | jq '.data.result[]'
   ```

   The response contains the full label set (namespace, pod, node, service, etc.) — use these exact values in all subsequent `oc` commands. If the alert is not currently firing, note this and proceed with the user-provided labels.

2. **Fetch the alerting rule** — query the rules API to get the alert's PromQL expression and thresholds:
   ```bash
   wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" \
     "https://${THANOS_URL}/api/v1/rules" | jq '.data.groups[].rules[] | select(.name=="<NAME>" and .type=="alerting")'
   ```

   The `expr` field tells you exactly which metric breached which threshold. Use it to query the underlying metric directly and understand the current vs. expected values. Also extract the rule's `annotations` — OpenShift alerts carry `description`, `summary`, and often `runbook_url`. The description explains the impact; the runbook provides vendor-recommended remediation steps. Follow the runbook if one exists before improvising.
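
A possible simplification for step 1, offered as a sketch only: if the agent's container might not ship `python3`, the PromQL selector could be percent-encoded in plain shell. The helper names `urlencode` and `alert_query_url` below are hypothetical (they are not in the PR), and the encoder assumes ASCII-only alert names.

```bash
#!/bin/sh
# Hypothetical helpers, not part of the PR. ASCII input is assumed.

# Percent-encode a string, leaving only RFC 3986 unreserved characters as-is.
# The function body runs in a subshell so its variables do not leak.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;   # "'c" yields the character code
    esac
  done
  printf '%s\n' "$out"
)

# Build the instant-query URL for the ALERTS series of one alert name.
# $1 = Thanos host (assumed reachable over https), $2 = alert name
alert_query_url() {
  printf 'https://%s/api/v1/query?query=%s\n' \
    "$1" "$(urlencode "ALERTS{alertname=\"$2\"}")"
}
```

With these defined, the first command would become `wget -qO- --no-check-certificate --header="Authorization: Bearer $TOKEN" "$(alert_query_url "$THANOS_URL" "<NAME>")" | jq '.data.result[]'`. Unlike `urllib.parse.quote` with its default arguments, this encoder also escapes `/`, which should be harmless inside a query-string value.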
