Merged
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -4,27 +4,37 @@

## [Unreleased]

### Added — Workflows

Check failure on line 7 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:7 MD022/blanks-around-headings Headings should be surrounded by blank lines [Expected: 1; Actual: 0; Below] [Context: "### Added — Workflows"] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md022.md
- **`/helm-chart-review`** — review Helm charts for security, reliability, and best practices (kubernetes/)

Check failure on line 8 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:8 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- **`/helm-chart-review`** — r..."] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md032.md
- **`/secrets-leak-scan`** — scan git repos for leaked secrets using gitleaks, trufflehog, or regex (security/)
- **`/incident-triage`** — guided first 15 minutes of a production incident (observability/)
- **`/release-checklist`** — pre-release safety gate covering scope, deploy order, rollback, tests, monitoring, and communication (cicd/)
- **`/repo-health`** — repository hygiene audit for docs, CI, ownership, branch/release hygiene, and secrets risk (security/)

### Added — Prompts

Check failure on line 14 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:14 MD022/blanks-around-headings Headings should be surrounded by blank lines [Expected: 1; Actual: 0; Below] [Context: "### Added — Prompts"] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md022.md
- **`pr-description.md`** — generate PR descriptions from diffs

Check failure on line 15 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:15 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- **`pr-description.md`** — ge..."] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md032.md
- **`explain-like-a-senior.md`** — explain infrastructure code to junior engineers
- **`runbook-from-incident.md`** — turn incident notes or post-mortems into reusable runbooks

### Added — Scripts

Check failure on line 19 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:19 MD022/blanks-around-headings Headings should be surrounded by blank lines [Expected: 1; Actual: 0; Below] [Context: "### Added — Scripts"] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md022.md
- **`aws-whoami.sh`** — quick AWS identity and account context check

Check failure on line 20 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:20 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- **`aws-whoami.sh`** — quick ..."] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md032.md
- **`stale-branches.sh`** — list git branches older than N days
- **`validate-repo.sh`** — local validation for workflow frontmatter, README links, executable scripts, and optional lint checks

### Added — CI

Check failure on line 24 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:24 MD022/blanks-around-headings Headings should be surrounded by blank lines [Expected: 1; Actual: 0; Below] [Context: "### Added — CI"] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md022.md
- GitHub Actions CI: markdown lint, link check, frontmatter validation, README link verification

Check failure on line 25 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:25 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- GitHub Actions CI: markdown ..."] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md032.md

### Improved

Check failure on line 27 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:27 MD022/blanks-around-headings Headings should be surrounded by blank lines [Expected: 1; Actual: 0; Below] [Context: "### Improved"] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md022.md
- **`/aws-account-audit`** — added `FAST=yes` input to skip slow per-policy IAM loops on large accounts

Check failure on line 28 in CHANGELOG.md (GitHub Actions / Lint & Validate):
CHANGELOG.md:28 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- **`/aws-account-audit`** — a..."] https://github.com/DavidAnson/markdownlint/blob/v0.37.4/doc/md032.md
- **`/aws-cost-quickscan`** — added `DEEP=yes` input for per-instance CPU utilization analysis
- **`/terraform-plan-review`** — added Step 0 with plan generation commands (including Terragrunt)
- **`/k8s-debug`** — enhanced log analysis (Step 5) with init container logs, structured error extraction, severity classification, and "noisiest pods" scan; added restart timeline analysis (Step 6a) and HPA health check (Step 6b); expanded triage cheat-sheet with startup-order, Redis, autoscaling, and webhook patterns
- **`/k8s-workload-debug`** — added init/sidecar analysis and GitOps/controller ownership checks
- **`/k8s-rbac-audit`** — added ServiceAccount token exposure checks
- **`/helm-release-debug`** — added ArgoCD/Flux ownership checks before suggesting manual Helm recovery
- **`/aws-vpc-debug`** — clarified source/destination variable resolution for VPC, subnet, security groups, and destination IP
- **`postmortem-writer.md`** — added SLO/data impact, recurrence risk, and action item type classification
- **`explain-like-a-senior.md`** — added prerequisite knowledge, safe validation, and team-question sections

---
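
The MD022 and MD032 failures annotated above both come from headings that are followed directly by a list. A minimal sketch of a local fix, assuming an `awk` is available; `fix_heading_blanks` is a hypothetical helper name, not part of this PR, and it naively treats any line starting with `#` characters and a space as a heading, including comment lines inside fenced code blocks:

```bash
# Hypothetical helper: print stdin with a blank line added after any
# ATX heading that is immediately followed by non-blank text.
fix_heading_blanks() {
  awk '
    prev_heading && $0 != "" { print "" }      # heading directly followed by text
    { print; prev_heading = ($0 ~ /^#+ /) }    # remember whether this line is a heading
  '
}

# Example: preview the result without modifying the file.
# fix_heading_blanks < CHANGELOG.md | diff CHANGELOG.md -
```

Because the function only writes to stdout, the result can be reviewed with `diff` before anything is overwritten.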

5 changes: 4 additions & 1 deletion README.md
@@ -48,13 +48,15 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
|---|---|---|---|
| [ci-debug](./workflows/cicd/ci-debug.md) | `/ci-debug` | Diagnose a failing CI/CD pipeline: parse build logs from Jenkins, GitHub Actions, GitLab CI, or Bitbucket Pipelines. Root cause analysis and fix suggestions. | Build log output. Optional: repo source, CI config file. |
| [jenkins-pipeline-review](./workflows/cicd/jenkins-pipeline-review.md) | `/jenkins-pipeline-review` | Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or `vars/*.groovy`. Optional: `repositories_v2.json`. |
| [release-checklist](./workflows/cicd/release-checklist.md) | `/release-checklist` | Pre-release safety gate: scope, deploy order, rollback, tests, monitoring, and communication before production release. | PR/diff summary. Optional: test results, plans, diffs. |
| [dockerfile-review](./workflows/containers/dockerfile-review.md) | `/dockerfile-review` | Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: `docker`, `trivy`. |

### Security

| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| [secrets-leak-scan](./workflows/security/secrets-leak-scan.md) | `/secrets-leak-scan` | Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: `gitleaks`, `trufflehog`. |
| [repo-health](./workflows/security/repo-health.md) | `/repo-health` | Audit repository hygiene: README, license, CI, branch/release hygiene, tracked secrets, ownership, and automation gaps. | Local git repo. Optional: `gh`, `jq`. |

### Observability & Incident

@@ -75,6 +77,7 @@ Reusable system prompts you can paste into any AI agent for common DevOps tasks:
| [code-review-devops](./prompts/code-review-devops.md) | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. |
| [pr-description](./prompts/pr-description.md) | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. |
| [explain-like-a-senior](./prompts/explain-like-a-senior.md) | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. |
| [runbook-from-incident](./prompts/runbook-from-incident.md) | Converts incident notes or post-mortems into reusable runbooks with diagnosis, mitigation, escalation, and follow-up steps. |

## Rules

@@ -95,6 +98,7 @@ Standalone shell utilities referenced by workflows or useful on their own:
| [k8s-snapshot.sh](./scripts/k8s-snapshot.sh) | `./k8s-snapshot.sh [namespace\|all] [output-dir]` — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. |
| [aws-whoami.sh](./scripts/aws-whoami.sh) | `./aws-whoami.sh [profile]` — quick AWS identity check: caller, region, account alias, org, SSO role. |
| [stale-branches.sh](./scripts/stale-branches.sh) | `./stale-branches.sh [days] [--remote]` — list git branches older than N days with last commit info. |
| [validate-repo.sh](./scripts/validate-repo.sh) | `./scripts/validate-repo.sh` — validate workflow frontmatter, README links, script executability, and optional lint checks. |
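
The `stale-branches.sh` row above describes listing branches older than N days. A rough sketch of how that could work, not the script shipped in this PR: it assumes a git version whose `for-each-ref` supports the `%(committerdate:unix)` format, covers local branches only (no `--remote` handling), and uses the hypothetical function names `filter_stale` and `stale_branches`.

```bash
# Hypothetical helper: read "<unix-timestamp> <branch>" lines on stdin and
# print the branches whose timestamp is older than the cutoff given as $1.
filter_stale() {
  awk -v cutoff="$1" '$1 < cutoff { print $2 }'
}

# List local branches whose last commit is older than N days (default 90).
stale_branches() {
  local days=${1:-90}
  local now cutoff
  now=$(date +%s)
  cutoff=$(( now - days * 86400 ))
  git for-each-ref --format='%(committerdate:unix) %(refname:short)' refs/heads |
    filter_stale "$cutoff"
}
```

Keeping the age filter separate from the `git` call makes the comparison easy to exercise with canned input.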

## Using a workflow

@@ -147,7 +151,6 @@ Ideas I plan to add (PRs welcome):
**Containers & CI/CD**
- [ ] `/image-cve-triage` — prioritise CVE scanner output by exploitability + fix availability
- [ ] `/github-actions-review` — security review of GitHub Actions workflow files
- [ ] `/release-checklist` — pre-release gate

**Observability & incident**
- [ ] `/prometheus-query-helper` — intent → PromQL with rationale
10 changes: 10 additions & 0 deletions prompts/explain-like-a-senior.md
@@ -30,6 +30,9 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
## Overview
<big picture: what this code does and why it exists>

## Prerequisite knowledge
<concepts the reader should understand first: VPC, IAM role, HPA, Terraform state, etc.>

## Walk-through
<section by section explanation>

@@ -44,13 +47,20 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
## Things to watch out for
<list of common mistakes or misconfigurations>

## How to validate it safely
<read-only commands, tests, dry-runs, or checks the reader can run>

## If I were reviewing this
<what a senior would suggest improving>

## Good questions to ask the team
<questions about ownership, production usage, failure modes, and historical context>
```

### Rules

- **No condescension.** Junior doesn't mean stupid. Explain clearly without being patronizing.
- **No hand-waving.** If you don't know why something is done a certain way, say "I'm not sure why this specific choice was made — it might be historical. Here's what I'd investigate."
- **Use the actual code.** Reference specific lines, variables, and resource names.
- **Teach safe validation.** Prefer read-only commands, dry-runs, local tests, and plan output.
- **Encourage questions.** End with "Good questions to ask your team about this: ..."
18 changes: 14 additions & 4 deletions prompts/postmortem-writer.md
@@ -33,7 +33,9 @@ Generate the post-mortem in this structure:
- **Services affected:** <list>
- **Duration of impact:** <how long users experienced degradation>
- **SLA impact:** <was an SLA breached?>
- **SLO impact:** <which SLO/error budget was consumed?>
- **Revenue impact:** <if applicable>
- **Data impact:** <data loss, corruption, delay, or "none">

## Timeline (UTC)

@@ -65,6 +67,13 @@ Generate the post-mortem in this structure:
- **What was the mitigation?** <rollback / config change / scale up / etc.>
- **Was the runbook followed?** <yes / no / no runbook existed>

## Recurrence risk

- **Likelihood of recurrence:** Low / Medium / High
- **Why:** <what conditions would cause this again?>
- **Existing guardrails:** <tests, alerts, automation, runbooks>
- **Missing guardrails:** <what would have prevented or shortened the incident?>

## Contributing factors

<List all factors that contributed. Not just the trigger, but also:>
@@ -91,10 +100,10 @@ Generate the post-mortem in this structure:

## Action items

| # | Action | Owner | Priority | Due date | Status |
|---|---|---|---|---|---|
| 1 | <specific action> | <name> | P1/P2/P3 | <date> | Open |
| 2 | ... | ... | ... | ... | ... |
| # | Action | Owner | Priority | Due date | Type | Status |
|---|---|---|---|---|---|---|
| 1 | <specific action> | <name> | P1/P2/P3 | <date> | Prevent / Detect / Mitigate | Open |
| 2 | ... | ... | ... | ... | ... | ... |

## Lessons learned

@@ -108,4 +117,5 @@ Generate the post-mortem in this structure:
- **Honest.** If the root cause is unknown, say so. "Root cause is not fully determined; the leading hypothesis is X" is better than guessing.
- **Action-oriented.** Every "what could be improved" must have a corresponding action item with an owner.
- **Time-bounded.** Action items need due dates. "Eventually" means "never."
- **Prevention-balanced.** Include at least one action item for detection/alerting and one for prevention when applicable.
- **Ask for missing information.** If the user's notes don't cover detection, response, or contributing factors, ask specifically.
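
The "Prevention-balanced" rule can be checked mechanically once a post-mortem is written. A minimal sketch, assuming the seven-column action-items table shown above (so `Type` is the seventh pipe-separated field); `has_balanced_actions` is a hypothetical name and is not part of this PR:

```bash
# Hypothetical check: read a post-mortem on stdin and succeed only if the
# action-items table has at least one "Prevent" and one "Detect" row.
has_balanced_actions() {
  awk -F'|' '
    $2 ~ /^ *[0-9]+ *$/ {                 # data rows start with a numeric "#" cell
      if ($7 ~ /Prevent/) prevent = 1
      if ($7 ~ /Detect/)  detect = 1
    }
    END { exit !(prevent && detect) }     # exit 0 only when both types were seen
  '
}

# Usage sketch (file name is illustrative):
# has_balanced_actions < postmortem.md || echo "Add a Prevent and a Detect action item"
```

The check assumes no cell contains a literal `|`; a table with escaped pipes would need a real Markdown parser.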
91 changes: 91 additions & 0 deletions prompts/runbook-from-incident.md
@@ -0,0 +1,91 @@
# Runbook From Incident — System Prompt

Paste this into any AI agent after an incident, post-mortem, or debugging session to turn the learned procedure into a reusable runbook.

---

## System prompt

You are a **senior SRE runbook writer**. Given incident notes, a post-mortem, chat transcript, or troubleshooting commands, create a practical runbook that another engineer can follow during a future incident.

### Output format

```markdown
# Runbook: <Problem / Alert / Service>

**Owner:** <team/person>
**Service:** <service/system>
**Severity:** <expected severity>
**Last updated:** <YYYY-MM-DD>
**Related alerts:** <alert names>
**Related dashboards:** <links or names>

---

## When to use this runbook

Use this when:
- <symptom 1>
- <symptom 2>

Do not use this when:
- <case where this runbook does not apply>

## Quick diagnosis

| Check | Command / Dashboard | Expected healthy result | Bad result |
|---|---|---|---|
| <check> | `<command>` | <healthy> | <bad> |

## Triage steps

### Step 1 — Confirm impact

```bash
<read-only command>
```

Expected result:
- <what healthy looks like>

If bad:
- <what to do next>

### Step 2 — Identify likely root cause

```bash
<read-only command>
```

## Mitigation options

> Do not execute mitigations automatically. Confirm environment and impact first.

| Option | When to use | Command | Risk | Rollback |
|---|---|---|---|---|
| Rollback | Bad deploy suspected | `<command>` | <risk> | <rollback> |
| Scale up | Load/resource pressure | `<command>` | <risk> | <rollback> |

## Escalation

Escalate when:
- <condition>

Escalate to:
- <team/person/channel>

## Post-incident follow-up

- [ ] Update this runbook with new findings
- [ ] Add/adjust alert if detection was slow
- [ ] Add test/guardrail if prevention was possible
```

### Rules

- **Prefer read-only diagnosis first.** Commands under diagnosis should not mutate state.
- **Separate diagnosis from mitigation.** Mitigation commands must be clearly marked and require human confirmation.
- **Make commands copy-pastable.** Use placeholders like `<namespace>` only when the value is genuinely environment-specific.
- **Include expected output.** A runbook is only useful if the reader knows what good and bad look like.
- **Preserve safety context.** Always include environment confirmation for production-impacting steps.
- **Avoid tribal knowledge.** If the original incident required someone knowing a hidden dependency, document it explicitly.
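
One way a generated runbook could honor the "require human confirmation" and "environment confirmation" rules is a small guard in front of every mitigation command. This is a sketch under stated assumptions, not something the prompt itself prescribes: `confirm_env` is a hypothetical helper, and the `kubectl rollout undo` line is only an example mitigation.

```bash
# Hypothetical guard: ask the operator to type the target environment name
# before a mitigation runs. Returns 0 only on an exact match.
confirm_env() {
  local expected=$1 answer
  printf 'About to run a mitigation against "%s". Type the environment name to continue: ' "$expected" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$expected" ]
}

# Usage sketch; placeholders stay as placeholders until filled in by the operator.
# confirm_env production && kubectl rollout undo deployment/<name> -n <namespace>
```

Typing the environment name, rather than answering `y`, makes it harder to confirm by reflex in the wrong terminal.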
78 changes: 78 additions & 0 deletions scripts/validate-repo.sh
@@ -0,0 +1,78 @@
#!/usr/bin/env bash
# ────────────────────────────────────────────────────────────────
# validate-repo.sh — Local validation for devops-ai-workflows
# ────────────────────────────────────────────────────────────────
# Usage: ./scripts/validate-repo.sh
#
# Checks:
# - workflow markdown files have YAML frontmatter
# - README links point to existing local files
# - shell scripts are executable
# - optional markdownlint/shellcheck if installed
# ────────────────────────────────────────────────────────────────
set -euo pipefail

ROOT_DIR=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
cd "$ROOT_DIR"

errors=0

echo "🔎 Validating devops-ai-workflows repo"
echo "Root: $ROOT_DIR"
echo ""

echo "== Workflow frontmatter =="
while IFS= read -r file; do
  if ! head -1 "$file" | grep -q '^---$'; then
    echo "❌ Missing frontmatter: $file"
    errors=$((errors + 1))
  fi
done < <(find workflows -name '*.md' | sort)
[ "$errors" -eq 0 ] && echo "✅ Workflow frontmatter OK"
echo ""

echo "== README local links =="
# Remember the running count so the OK line reflects this section only.
link_errors_before=$errors
while IFS= read -r link; do
  path=${link#./}
  if [ ! -e "$path" ]; then
    echo "❌ Broken README link: $link"
    errors=$((errors + 1))
  fi
done < <(grep -oE '\]\(\./[^)]+\)' README.md | sed -E 's/^.*\((.*)\)$/\1/' | sort -u)
[ "$errors" -eq "$link_errors_before" ] && echo "✅ README local links OK"
echo ""

echo "== Script executability =="
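# Remember the running count so the OK line reflects this section only.
script_errors_before=$errors
while IFS= read -r file; do
  if [ ! -x "$file" ]; then
    echo "❌ Script not executable: $file"
    errors=$((errors + 1))
  fi
done < <(find scripts -name '*.sh' | sort)
[ "$errors" -eq "$script_errors_before" ] && echo "✅ Scripts executable"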
echo ""

if command -v markdownlint >/dev/null 2>&1; then
  echo "== markdownlint =="
  markdownlint '**/*.md' || errors=$((errors + 1))
  echo ""
else
  echo "ℹ️ markdownlint not installed; skipping"
  echo ""
fi

if command -v shellcheck >/dev/null 2>&1; then
  echo "== shellcheck =="
  shellcheck scripts/*.sh || errors=$((errors + 1))
  echo ""
else
  echo "ℹ️ shellcheck not installed; skipping"
  echo ""
fi

if [ "$errors" -eq 0 ]; then
  echo "✅ Validation passed"
else
  echo "❌ Validation failed with $errors issue(s)"
  exit 1
fi
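
The frontmatter check in the script above only tests whether the first line is `---`. A stricter variant is sketched below, under the assumption that valid frontmatter is an opening `---` on line 1 followed by a later closing `---`; `has_frontmatter` is a hypothetical name and is not part of the script in this PR.

```bash
# Hypothetical stricter check: succeed only if the file opens with '---'
# on line 1 and contains a closing '---' on some later line.
has_frontmatter() {
  awk '
    NR == 1 && $0 != "---" { exit 1 }            # first line must open the block
    NR > 1  && $0 == "---" { closed = 1; exit }  # a later line must close it
    END { exit !closed }                         # empty or unclosed files fail
  ' "$1"
}

# Usage sketch inside the existing loop:
# if ! has_frontmatter "$file"; then echo "Missing or unclosed frontmatter: $file"; fi
```

This still does not parse the YAML itself; it only confirms that the delimiters are present.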