23seriy · May 22, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,27 +4,37 @@

 ## [Unreleased]

 ### Added — Workflows
 - **`/helm-chart-review`** — review Helm charts for security, reliability, and best practices (kubernetes/)
 - **`/secrets-leak-scan`** — scan git repos for leaked secrets using gitleaks, trufflehog, or regex (security/)
 - **`/incident-triage`** — guided first 15 minutes of a production incident (observability/)
+- **`/release-checklist`** — pre-release safety gate covering scope, deploy order, rollback, tests, monitoring, and communication (cicd/)
+- **`/repo-health`** — repository hygiene audit for docs, CI, ownership, branch/release hygiene, and secrets risk (security/)
 
 ### Added — Prompts
 - **`pr-description.md`** — generate PR descriptions from diffs
 - **`explain-like-a-senior.md`** — explain infrastructure code to junior engineers
+- **`runbook-from-incident.md`** — turn incident notes or post-mortems into reusable runbooks
 
 ### Added — Scripts
 - **`aws-whoami.sh`** — quick AWS identity and account context check
 - **`stale-branches.sh`** — list git branches older than N days
+- **`validate-repo.sh`** — local validation for workflow frontmatter, README links, executable scripts, and optional lint checks
 
 ### Added — CI
 - GitHub Actions CI: markdown lint, link check, frontmatter validation, README link verification

 ### Improved
 - **`/aws-account-audit`** — added `FAST=yes` input to skip slow per-policy IAM loops on large accounts
 - **`/aws-cost-quickscan`** — added `DEEP=yes` input for per-instance CPU utilization analysis
 - **`/terraform-plan-review`** — added Step 0 with plan generation commands (including Terragrunt)
 - **`/k8s-debug`** — enhanced log analysis (Step 5) with init container logs, structured error extraction, severity classification, and "noisiest pods" scan; added restart timeline analysis (Step 6a) and HPA health check (Step 6b); expanded triage cheat-sheet with startup-order, Redis, autoscaling, and webhook patterns
+- **`/k8s-workload-debug`** — added init/sidecar analysis and GitOps/controller ownership checks
+- **`/k8s-rbac-audit`** — added ServiceAccount token exposure checks
+- **`/helm-release-debug`** — added ArgoCD/Flux ownership checks before suggesting manual Helm recovery
+- **`/aws-vpc-debug`** — clarified source/destination variable resolution for VPC, subnet, security groups, and destination IP
+- **`postmortem-writer.md`** — added SLO/data impact, recurrence risk, and action item type classification
+- **`explain-like-a-senior.md`** — added prerequisite knowledge, safe validation, and team-question sections
 
 ---
 

diff --git a/README.md b/README.md
@@ -48,13 +48,15 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
 |---|---|---|---|
 | [ci-debug](./workflows/cicd/ci-debug.md) | `/ci-debug` | Diagnose a failing CI/CD pipeline: parse build logs from Jenkins, GitHub Actions, GitLab CI, or Bitbucket Pipelines. Root cause analysis and fix suggestions. | Build log output. Optional: repo source, CI config file. |
 | [jenkins-pipeline-review](./workflows/cicd/jenkins-pipeline-review.md) | `/jenkins-pipeline-review` | Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or `vars/*.groovy`. Optional: `repositories_v2.json`. |
+| [release-checklist](./workflows/cicd/release-checklist.md) | `/release-checklist` | Pre-release safety gate: scope, deploy order, rollback, tests, monitoring, and communication before production release. | PR/diff summary. Optional: test results, plans, diffs. |
 | [dockerfile-review](./workflows/containers/dockerfile-review.md) | `/dockerfile-review` | Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: `docker`, `trivy`. |
 
 ### Security
 
 | Workflow | Slash command | Description | Prerequisites |
 |---|---|---|---|
 | [secrets-leak-scan](./workflows/security/secrets-leak-scan.md) | `/secrets-leak-scan` | Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: `gitleaks`, `trufflehog`. |
+| [repo-health](./workflows/security/repo-health.md) | `/repo-health` | Audit repository hygiene: README, license, CI, branch/release hygiene, tracked secrets, ownership, and automation gaps. | Local git repo. Optional: `gh`, `jq`. |
 
 ### Observability & Incident
 
@@ -75,6 +77,7 @@ Reusable system prompts you can paste into any AI agent for common DevOps tasks:
 | [code-review-devops](./prompts/code-review-devops.md) | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. |
 | [pr-description](./prompts/pr-description.md) | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. |
 | [explain-like-a-senior](./prompts/explain-like-a-senior.md) | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. |
+| [runbook-from-incident](./prompts/runbook-from-incident.md) | Converts incident notes or post-mortems into reusable runbooks with diagnosis, mitigation, escalation, and follow-up steps. |
 
 ## Rules
 
@@ -95,6 +98,7 @@ Standalone shell utilities referenced by workflows or useful on their own:
 | [k8s-snapshot.sh](./scripts/k8s-snapshot.sh) | `./k8s-snapshot.sh [namespace\|all] [output-dir]` — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. |
 | [aws-whoami.sh](./scripts/aws-whoami.sh) | `./aws-whoami.sh [profile]` — quick AWS identity check: caller, region, account alias, org, SSO role. |
 | [stale-branches.sh](./scripts/stale-branches.sh) | `./stale-branches.sh [days] [--remote]` — list git branches older than N days with last commit info. |
+| [validate-repo.sh](./scripts/validate-repo.sh) | `./scripts/validate-repo.sh` — validate workflow frontmatter, README links, script executability, and optional lint checks. |
 
 ## Using a workflow
 
@@ -147,7 +151,6 @@ Ideas I plan to add (PRs welcome):
 **Containers & CI/CD**
 - [ ] `/image-cve-triage` — prioritise CVE scanner output by exploitability + fix availability
 - [ ] `/github-actions-review` — security review of GitHub Actions workflow files
-- [ ] `/release-checklist` — pre-release gate
 
 **Observability & incident**
 - [ ] `/prometheus-query-helper` — intent → PromQL with rationale

diff --git a/prompts/explain-like-a-senior.md b/prompts/explain-like-a-senior.md
@@ -30,6 +30,9 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 ## Overview
 <big picture: what this code does and why it exists>
 
+## Prerequisite knowledge
+<concepts the reader should understand first: VPC, IAM role, HPA, Terraform state, etc.>
+
 ## Walk-through
 <section by section explanation>
 
@@ -44,13 +47,20 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 ## Things to watch out for
 <list of common mistakes or misconfigurations>
 
+## How to validate it safely
+<read-only commands, tests, dry-runs, or checks the reader can run>
+
 ## If I were reviewing this
 <what a senior would suggest improving>
+
+## Good questions to ask the team
+<questions about ownership, production usage, failure modes, and historical context>
 ```
 
 ### Rules
 
 - **No condescension.** Junior doesn't mean stupid. Explain clearly without being patronizing.
 - **No hand-waving.** If you don't know why something is done a certain way, say "I'm not sure why this specific choice was made — it might be historical. Here's what I'd investigate."
 - **Use the actual code.** Reference specific lines, variables, and resource names.
+- **Teach safe validation.** Prefer read-only commands, dry-runs, local tests, and plan output.
 - **Encourage questions.** End with "Good questions to ask your team about this: ..."
diff --git a/prompts/postmortem-writer.md b/prompts/postmortem-writer.md
@@ -33,7 +33,9 @@ Generate the post-mortem in this structure:
 - **Services affected:** <list>
 - **Duration of impact:** <how long users experienced degradation>
 - **SLA impact:** <was an SLA breached?>
+- **SLO impact:** <which SLOs were affected and how much error budget was consumed?>
 - **Revenue impact:** <if applicable>
+- **Data impact:** <data loss, corruption, delay, or "none">
 
 ## Timeline (UTC)
 
@@ -65,6 +67,13 @@ Generate the post-mortem in this structure:
 - **What was the mitigation?** <rollback / config change / scale up / etc.>
 - **Was the runbook followed?** <yes / no / no runbook existed>
 
+## Recurrence risk
+
+- **Likelihood of recurrence:** Low / Medium / High
+- **Why:** <what conditions would cause this again?>
+- **Existing guardrails:** <tests, alerts, automation, runbooks>
+- **Missing guardrails:** <what would have prevented or shortened the incident?>
+
 ## Contributing factors
 
 <List all factors that contributed. Not just the trigger, but also:>
@@ -91,10 +100,10 @@ Generate the post-mortem in this structure:
 
 ## Action items
 
-| # | Action | Owner | Priority | Due date | Status |
-|---|---|---|---|---|---|
-| 1 | <specific action> | <name> | P1/P2/P3 | <date> | Open |
-| 2 | ... | ... | ... | ... | ... |
+| # | Action | Owner | Priority | Due date | Type | Status |
+|---|---|---|---|---|---|---|
+| 1 | <specific action> | <name> | P1/P2/P3 | <date> | Prevent / Detect / Mitigate | Open |
+| 2 | ... | ... | ... | ... | ... | ... |
 
 ## Lessons learned
 
@@ -108,4 +117,5 @@ Generate the post-mortem in this structure:
 - **Honest.** If the root cause is unknown, say so. "Root cause is not fully determined; the leading hypothesis is X" is better than guessing.
 - **Action-oriented.** Every "what could be improved" must have a corresponding action item with an owner.
 - **Time-bounded.** Action items need due dates. "Eventually" means "never."
+- **Prevention-balanced.** Include at least one action item for detection/alerting and one for prevention when applicable.
 - **Ask for missing information.** If the user's notes don't cover detection, response, or contributing factors, ask specifically.
diff --git a/prompts/runbook-from-incident.md b/prompts/runbook-from-incident.md
@@ -0,0 +1,91 @@
+# Runbook From Incident — System Prompt
+
+Paste this into any AI agent after an incident, post-mortem, or debugging session to turn what was learned into a reusable runbook.
+
+---
+
+## System prompt
+
+You are a **senior SRE runbook writer**. Given incident notes, a post-mortem, chat transcript, or troubleshooting commands, create a practical runbook that another engineer can follow during a future incident.
+
+### Output format
+
+````markdown
+# Runbook: <Problem / Alert / Service>
+
+**Owner:** <team/person>
+**Service:** <service/system>
+**Severity:** <expected severity>
+**Last updated:** <YYYY-MM-DD>
+**Related alerts:** <alert names>
+**Related dashboards:** <links or names>
+
+---
+
+## When to use this runbook
+
+Use this when:
+- <symptom 1>
+- <symptom 2>
+
+Do not use this when:
+- <case where this runbook does not apply>
+
+## Quick diagnosis
+
+| Check | Command / Dashboard | Expected healthy result | Bad result |
+|---|---|---|---|
+| <check> | `<command>` | <healthy> | <bad> |
+
+## Triage steps
+
+### Step 1 — Confirm impact
+
+```bash
+<read-only command>
+```
+
+Expected result:
+- <what healthy looks like>
+
+If bad:
+- <what to do next>
+
+### Step 2 — Identify likely root cause
+
+```bash
+<read-only command>
+```
+
+## Mitigation options
+
+> Do not execute mitigations automatically. Confirm environment and impact first.
+
+| Option | When to use | Command | Risk | Rollback |
+|---|---|---|---|---|
+| Rollback | Bad deploy suspected | `<command>` | <risk> | <rollback> |
+| Scale up | Load/resource pressure | `<command>` | <risk> | <rollback> |
+
+## Escalation
+
+Escalate when:
+- <condition>
+
+Escalate to:
+- <team/person/channel>
+
+## Post-incident follow-up
+
+- [ ] Update this runbook with new findings
+- [ ] Add/adjust alert if detection was slow
+- [ ] Add test/guardrail if prevention was possible
+````
+
+### Rules
+
+- **Prefer read-only diagnosis first.** Commands under diagnosis should not mutate state.
+- **Separate diagnosis from mitigation.** Mitigation commands must be clearly marked and require human confirmation.
+- **Make commands copy-pastable.** Use placeholders like `<namespace>` only when the value is genuinely environment-specific.
+- **Include expected output.** A runbook is only useful if the reader knows what good and bad look like.
+- **Preserve safety context.** Always include environment confirmation for production-impacting steps.
+- **Avoid tribal knowledge.** If the original incident required someone knowing a hidden dependency, document it explicitly.
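The rules above ask that mitigation commands require human confirmation and that production-impacting steps include an environment check. A minimal Bash sketch of one way a runbook could enforce that is shown here. The helper name `confirm_and_run` is hypothetical and not part of this PR; it assumes the operator retypes the environment name on stdin before the command runs.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of this PR): run a mitigation command only
# after the operator retypes the target environment name.
confirm_and_run() {
  local env_name=$1
  shift
  local answer
  # Prompts go to stderr so the command's own stdout stays clean.
  printf 'About to run in %s: %s\n' "$env_name" "$*" >&2
  printf 'Type the environment name to continue: ' >&2
  IFS= read -r answer || answer=""
  if [ "$answer" != "$env_name" ]; then
    echo "Aborted: confirmation did not match." >&2
    return 1
  fi
  "$@"
}

# Harmless demo: the confirmation is supplied via a here-string.
confirm_and_run staging echo "would restart service" <<< "staging"
```

In a real runbook the wrapped command would be the mitigation itself, for example `confirm_and_run production kubectl rollout undo deployment/<name> -n <namespace>`, while the read-only diagnosis commands stay unwrapped.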
diff --git a/scripts/validate-repo.sh b/scripts/validate-repo.sh
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+# ────────────────────────────────────────────────────────────────
+# validate-repo.sh — Local validation for devops-ai-workflows
+# ────────────────────────────────────────────────────────────────
+# Usage: ./scripts/validate-repo.sh
+#
+# Checks:
+#   - workflow markdown files have YAML frontmatter
+#   - README links point to existing local files
+#   - shell scripts are executable
+#   - optional markdownlint/shellcheck if installed
+# ────────────────────────────────────────────────────────────────
+set -euo pipefail
+
+ROOT_DIR=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
+cd "$ROOT_DIR"
+
+errors=0
+
+echo "🔎 Validating devops-ai-workflows repo"
+echo "Root: $ROOT_DIR"
+echo ""
+
+echo "== Workflow frontmatter =="
+while IFS= read -r file; do
+  if ! head -1 "$file" | grep -q '^---$'; then
+    echo "❌ Missing frontmatter: $file"
+    errors=$((errors + 1))
+  fi
+done < <(find workflows -name '*.md' | sort)
+[ "$errors" -eq 0 ] && echo "✅ Workflow frontmatter OK"
+echo ""
+section_start=$errors  # errors is cumulative, so remember where this section began
+echo "== README local links =="
+while IFS= read -r link; do
+  path=${link#./}
+  if [ ! -e "$path" ]; then
+    echo "❌ Broken README link: $link"
+    errors=$((errors + 1))
+  fi
+done < <(grep -oE '\]\(\./[^)]+\)' README.md | sed -E 's/^.*\((.*)\)$/\1/' | sort -u)
+[ "$errors" -eq "$section_start" ] && echo "✅ README local links OK"
+echo ""
+section_start=$errors  # errors is cumulative, so remember where this section began
+echo "== Script executability =="
+while IFS= read -r file; do
+  if [ ! -x "$file" ]; then
+    echo "❌ Script not executable: $file"
+    errors=$((errors + 1))
+  fi
+done < <(find scripts -name '*.sh' | sort)
+[ "$errors" -eq "$section_start" ] && echo "✅ Scripts executable"
+echo ""
+
+if command -v markdownlint >/dev/null 2>&1; then
+  echo "== markdownlint =="
+  markdownlint '**/*.md' || errors=$((errors + 1))
+  echo ""
+else
+  echo "ℹ️ markdownlint not installed; skipping"
+  echo ""
+fi
+
+if command -v shellcheck >/dev/null 2>&1; then
+  echo "== shellcheck =="
+  shellcheck scripts/*.sh || errors=$((errors + 1))
+  echo ""
+else
+  echo "ℹ️ shellcheck not installed; skipping"
+  echo ""
+fi
+
+if [ "$errors" -eq 0 ]; then
+  echo "✅ Validation passed"
+else
+  echo "❌ Validation failed with $errors issue(s)"
+  exit 1
+fi
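Because `errors` in `validate-repo.sh` is a single cumulative counter, reporting on one section at a time needs a baseline for that section. The sketch below shows one possible way to factor this into a helper. It is an illustration under stated assumptions, not the script's actual structure: `run_section`, `check_frontmatter`, and `check_executable` are hypothetical names, and the demo runs against a throwaway directory rather than the real repo.

```bash
#!/usr/bin/env bash
set -euo pipefail

errors=0

# Hypothetical helper (not part of this PR): run one check and report on
# that section alone by comparing the counter before and after.
run_section() {
  local title=$1
  shift
  local before=$errors
  echo "== $title =="
  "$@"
  if [ "$errors" -eq "$before" ]; then
    echo "✅ $title OK"
  else
    echo "❌ $title: $((errors - before)) issue(s)"
  fi
  echo ""
}

# Same test as the script: the first line of each file must be '---'.
check_frontmatter() {
  local dir=$1 file
  while IFS= read -r file; do
    if ! head -1 "$file" | grep -q '^---$'; then
      echo "❌ Missing frontmatter: $file"
      errors=$((errors + 1))
    fi
  done < <(find "$dir" -name '*.md' | sort)
}

# Same test as the script: every *.sh file must have an execute bit.
check_executable() {
  local dir=$1 file
  while IFS= read -r file; do
    if [ ! -x "$file" ]; then
      echo "❌ Script not executable: $file"
      errors=$((errors + 1))
    fi
  done < <(find "$dir" -name '*.sh' | sort)
}

# Demo on a throwaway tree so the sketch runs anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/workflows" "$demo/scripts"
printf '%s\n' '---' 'name: ok' '---' > "$demo/workflows/good.md"
printf '%s\n' '# no frontmatter' > "$demo/workflows/bad.md"
printf '%s\n' '#!/usr/bin/env bash' > "$demo/scripts/tool.sh"
chmod +x "$demo/scripts/tool.sh"

run_section "Workflow frontmatter" check_frontmatter "$demo/workflows"
run_section "Script executability" check_executable "$demo/scripts"
echo "Total issues: $errors"
rm -rf "$demo"
```

In the demo, the frontmatter section reports one issue and the executability section still reports OK, since each section is judged against its own baseline while the final total remains cumulative.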