From a685e39150ddb2d6e041bc69fb17d55498d1d5ef Mon Sep 17 00:00:00 2001
From: Sergei Olshanetski <solshanetski@proofpoint.com>
Date: Fri, 22 May 2026 09:14:53 -0400
Subject: [PATCH] Improve workflows and add repo operations helpers

Improve existing workflows:
- k8s-workload-debug: sidecar/init-container analysis and GitOps ownership checks
- helm-release-debug: GitOps ownership, CRD lifecycle, and hook cleanup checks
- aws-vpc-debug: clearer source/destination resolution, DNS split-horizon notes, Reachability Analyzer guidance
- k8s-rbac-audit: ServiceAccount token exposure checks

Improve prompts:
- postmortem-writer: SLO/data impact, recurrence risk, action item types
- explain-like-a-senior: prerequisite knowledge, safe validation, team questions

Add new entities:
- runbook-from-incident prompt
- release-checklist workflow
- repo-health workflow
- validate-repo.sh local validation script

Docs:
- update README and CHANGELOG
---
 CHANGELOG.md                               |  10 ++
 README.md                                  |   5 +-
 prompts/explain-like-a-senior.md           |  10 ++
 prompts/postmortem-writer.md               |  18 +-
 prompts/runbook-from-incident.md           |  91 +++++++++++
 scripts/validate-repo.sh                   |  78 +++++++++
 workflows/aws/aws-vpc-debug.md             |  63 +++++++
 workflows/cicd/release-checklist.md        | 181 +++++++++++++++++++++
 workflows/kubernetes/helm-release-debug.md |  42 +++++
 workflows/kubernetes/k8s-rbac-audit.md     |  24 +++
 workflows/kubernetes/k8s-workload-debug.md |  40 ++++-
 workflows/security/repo-health.md          | 181 +++++++++++++++++++++
 12 files changed, 737 insertions(+), 6 deletions(-)
 create mode 100644 prompts/runbook-from-incident.md
 create mode 100755 scripts/validate-repo.sh
 create mode 100644 workflows/cicd/release-checklist.md
 create mode 100644 workflows/security/repo-health.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4c15f2f..f724b18 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,14 +8,18 @@ All notable changes to this project will be documented in this file.
 - **`/helm-chart-review`** — review Helm charts for security, reliability, and best practices (kubernetes/)
 - **`/secrets-leak-scan`** — scan git repos for leaked secrets using gitleaks, trufflehog, or regex (security/)
 - **`/incident-triage`** — guided first 15 minutes of a production incident (observability/)
+- **`/release-checklist`** — pre-release safety gate covering scope, deploy order, rollback, tests, monitoring, and communication (cicd/)
+- **`/repo-health`** — repository hygiene audit for docs, CI, ownership, branch/release hygiene, and secrets risk (security/)
 
 ### Added — Prompts
 - **`pr-description.md`** — generate PR descriptions from diffs
 - **`explain-like-a-senior.md`** — explain infrastructure code to junior engineers
+- **`runbook-from-incident.md`** — turn incident notes or post-mortems into reusable runbooks
 
 ### Added — Scripts
 - **`aws-whoami.sh`** — quick AWS identity and account context check
 - **`stale-branches.sh`** — list git branches older than N days
+- **`validate-repo.sh`** — local validation for workflow frontmatter, README links, executable scripts, and optional lint checks
 
 ### Added — CI
 - GitHub Actions CI: markdown lint, link check, frontmatter validation, README link verification
@@ -25,6 +29,12 @@ All notable changes to this project will be documented in this file.
 - **`/aws-cost-quickscan`** — added `DEEP=yes` input for per-instance CPU utilization analysis
 - **`/terraform-plan-review`** — added Step 0 with plan generation commands (including Terragrunt)
 - **`/k8s-debug`** — enhanced log analysis (Step 5) with init container logs, structured error extraction, severity classification, and "noisiest pods" scan; added restart timeline analysis (Step 6a) and HPA health check (Step 6b); expanded triage cheat-sheet with startup-order, Redis, autoscaling, and webhook patterns
+- **`/k8s-workload-debug`** — added init/sidecar analysis and GitOps/controller ownership checks
+- **`/k8s-rbac-audit`** — added ServiceAccount token exposure checks
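+- **`/helm-release-debug`** — added ArgoCD/Flux ownership checks before suggesting manual Helm recovery, CRD dependency/lifecycle checks, and hook Job cleanup checks
+- **`/aws-vpc-debug`** — clarified source/destination variable resolution for VPC, subnet, security groups, and destination IP; added DNS split-horizon flags and optional Reachability Analyzer guidance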
+- **`postmortem-writer.md`** — added SLO/data impact, recurrence risk, and action item type classification
+- **`explain-like-a-senior.md`** — added prerequisite knowledge, safe validation, and team-question sections
 
 ---
 
diff --git a/README.md b/README.md
index fed27bd..e55a43e 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,7 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
 |---|---|---|---|
 | [ci-debug](./workflows/cicd/ci-debug.md) | `/ci-debug` | Diagnose a failing CI/CD pipeline: parse build logs from Jenkins, GitHub Actions, GitLab CI, or Bitbucket Pipelines. Root cause analysis and fix suggestions. | Build log output. Optional: repo source, CI config file. |
 | [jenkins-pipeline-review](./workflows/cicd/jenkins-pipeline-review.md) | `/jenkins-pipeline-review` | Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or `vars/*.groovy`. Optional: `repositories_v2.json`. |
+| [release-checklist](./workflows/cicd/release-checklist.md) | `/release-checklist` | Pre-release safety gate: scope, deploy order, rollback, tests, monitoring, and communication before production release. | PR/diff summary. Optional: test results, plans, diffs. |
 | [dockerfile-review](./workflows/containers/dockerfile-review.md) | `/dockerfile-review` | Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: `docker`, `trivy`. |
 
 ### Security
@@ -55,6 +56,7 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
 | Workflow | Slash command | Description | Prerequisites |
 |---|---|---|---|
 | [secrets-leak-scan](./workflows/security/secrets-leak-scan.md) | `/secrets-leak-scan` | Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: `gitleaks`, `trufflehog`. |
+| [repo-health](./workflows/security/repo-health.md) | `/repo-health` | Audit repository hygiene: README, license, CI, branch/release hygiene, tracked secrets, ownership, and automation gaps. | Local git repo. Optional: `gh`, `jq`. |
 
 ### Observability & Incident
 
@@ -75,6 +77,7 @@ Reusable system prompts you can paste into any AI agent for common DevOps tasks:
 | [code-review-devops](./prompts/code-review-devops.md) | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. |
 | [pr-description](./prompts/pr-description.md) | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. |
 | [explain-like-a-senior](./prompts/explain-like-a-senior.md) | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. |
+| [runbook-from-incident](./prompts/runbook-from-incident.md) | Converts incident notes or post-mortems into reusable runbooks with diagnosis, mitigation, escalation, and follow-up steps. |
 
 ## Rules
 
@@ -95,6 +98,7 @@ Standalone shell utilities referenced by workflows or useful on their own:
 | [k8s-snapshot.sh](./scripts/k8s-snapshot.sh) | `./k8s-snapshot.sh [namespace\|all] [output-dir]` — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. |
 | [aws-whoami.sh](./scripts/aws-whoami.sh) | `./aws-whoami.sh [profile]` — quick AWS identity check: caller, region, account alias, org, SSO role. |
 | [stale-branches.sh](./scripts/stale-branches.sh) | `./stale-branches.sh [days] [--remote]` — list git branches older than N days with last commit info. |
+| [validate-repo.sh](./scripts/validate-repo.sh) | `./scripts/validate-repo.sh` — validate workflow frontmatter, README links, script executability, and optional lint checks. |
 
 ## Using a workflow
 
@@ -147,7 +151,6 @@ Ideas I plan to add (PRs welcome):
 **Containers & CI/CD**
 - [ ] `/image-cve-triage` — prioritise CVE scanner output by exploitability + fix availability
 - [ ] `/github-actions-review` — security review of GitHub Actions workflow files
-- [ ] `/release-checklist` — pre-release gate
 
 **Observability & incident**
 - [ ] `/prometheus-query-helper` — intent → PromQL with rationale
diff --git a/prompts/explain-like-a-senior.md b/prompts/explain-like-a-senior.md
index b357531..1299049 100644
--- a/prompts/explain-like-a-senior.md
+++ b/prompts/explain-like-a-senior.md
@@ -30,6 +30,9 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 ## Overview
 <big picture: what this code does and why it exists>
 
+## Prerequisite knowledge
+<concepts the reader should understand first: VPC, IAM role, HPA, Terraform state, etc.>
+
 ## Walk-through
 <section by section explanation>
 
@@ -44,8 +47,14 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 ## Things to watch out for
 <list of common mistakes or misconfigurations>
 
+## How to validate it safely
+<read-only commands, tests, dry-runs, or checks the reader can run>
+
 ## If I were reviewing this
 <what a senior would suggest improving>
+
+## Good questions to ask the team
+<questions about ownership, production usage, failure modes, and historical context>
 ```
 
 ### Rules
@@ -53,4 +62,5 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 - **No condescension.** Junior doesn't mean stupid. Explain clearly without being patronizing.
 - **No hand-waving.** If you don't know why something is done a certain way, say "I'm not sure why this specific choice was made — it might be historical. Here's what I'd investigate."
 - **Use the actual code.** Reference specific lines, variables, and resource names.
+- **Teach safe validation.** Prefer read-only commands, dry-runs, local tests, and plan output.
 - **Encourage questions.** End with "Good questions to ask your team about this: ..."
diff --git a/prompts/postmortem-writer.md b/prompts/postmortem-writer.md
index f8b43a8..ebb6ba7 100644
--- a/prompts/postmortem-writer.md
+++ b/prompts/postmortem-writer.md
@@ -33,7 +33,9 @@ Generate the post-mortem in this structure:
 - **Services affected:** <list>
 - **Duration of impact:** <how long users experienced degradation>
 - **SLA impact:** <was an SLA breached?>
+- **SLO impact:** <which SLO/error budget was consumed?>
 - **Revenue impact:** <if applicable>
+- **Data impact:** <data loss, corruption, delay, or "none">
 
 ## Timeline (UTC)
 
@@ -65,6 +67,13 @@ Generate the post-mortem in this structure:
 - **What was the mitigation?** <rollback / config change / scale up / etc.>
 - **Was the runbook followed?** <yes / no / no runbook existed>
 
+## Recurrence risk
+
+- **Likelihood of recurrence:** Low / Medium / High
+- **Why:** <what conditions would cause this again?>
+- **Existing guardrails:** <tests, alerts, automation, runbooks>
+- **Missing guardrails:** <what would have prevented or shortened the incident?>
+
 ## Contributing factors
 
 <List all factors that contributed. Not just the trigger, but also:>
@@ -91,10 +100,10 @@ Generate the post-mortem in this structure:
 
 ## Action items
 
-| # | Action | Owner | Priority | Due date | Status |
-|---|---|---|---|---|---|
-| 1 | <specific action> | <name> | P1/P2/P3 | <date> | Open |
-| 2 | ... | ... | ... | ... | ... |
+| # | Action | Owner | Priority | Due date | Type | Status |
+|---|---|---|---|---|---|---|
+| 1 | <specific action> | <name> | P1/P2/P3 | <date> | Prevent / Detect / Mitigate | Open |
+| 2 | ... | ... | ... | ... | ... | ... |
 
 ## Lessons learned
 
@@ -108,4 +117,5 @@ Generate the post-mortem in this structure:
 - **Honest.** If the root cause is unknown, say so. "Root cause is not fully determined; the leading hypothesis is X" is better than guessing.
 - **Action-oriented.** Every "what could be improved" must have a corresponding action item with an owner.
 - **Time-bounded.** Action items need due dates. "Eventually" means "never."
+- **Prevention-balanced.** Include at least one action item for detection/alerting and one for prevention when applicable.
 - **Ask for missing information.** If the user's notes don't cover detection, response, or contributing factors, ask specifically.
diff --git a/prompts/runbook-from-incident.md b/prompts/runbook-from-incident.md
new file mode 100644
index 0000000..e2dcd8b
--- /dev/null
+++ b/prompts/runbook-from-incident.md
@@ -0,0 +1,91 @@
+# Runbook From Incident — System Prompt
+
+Paste this into any AI agent after an incident, post-mortem, or debugging session to turn the learned procedure into a reusable runbook.
+
+---
+
+## System prompt
+
+You are a **senior SRE runbook writer**. Given incident notes, a post-mortem, chat transcript, or troubleshooting commands, create a practical runbook that another engineer can follow during a future incident.
+
+### Output format
+
+```markdown
+# Runbook: <Problem / Alert / Service>
+
+**Owner:** <team/person>
+**Service:** <service/system>
+**Severity:** <expected severity>
+**Last updated:** <YYYY-MM-DD>
+**Related alerts:** <alert names>
+**Related dashboards:** <links or names>
+
+---
+
+## When to use this runbook
+
+Use this when:
+- <symptom 1>
+- <symptom 2>
+
+Do not use this when:
+- <case where this runbook does not apply>
+
+## Quick diagnosis
+
+| Check | Command / Dashboard | Expected healthy result | Bad result |
+|---|---|---|---|
+| <check> | `<command>` | <healthy> | <bad> |
+
+## Triage steps
+
+### Step 1 — Confirm impact
+
+```bash
+<read-only command>
+```
+
+Expected result:
+- <what healthy looks like>
+
+If bad:
+- <what to do next>
+
+### Step 2 — Identify likely root cause
+
+```bash
+<read-only command>
+```
+
+## Mitigation options
+
+> Do not execute mitigations automatically. Confirm environment and impact first.
+
+| Option | When to use | Command | Risk | Rollback |
+|---|---|---|---|---|
+| Rollback | Bad deploy suspected | `<command>` | <risk> | <rollback> |
+| Scale up | Load/resource pressure | `<command>` | <risk> | <rollback> |
+
+## Escalation
+
+Escalate when:
+- <condition>
+
+Escalate to:
+- <team/person/channel>
+
+## Post-incident follow-up
+
+- [ ] Update this runbook with new findings
+- [ ] Add/adjust alert if detection was slow
+- [ ] Add test/guardrail if prevention was possible
+```
+
+### Rules
+
+- **Prefer read-only diagnosis first.** Commands under diagnosis should not mutate state.
+- **Separate diagnosis from mitigation.** Mitigation commands must be clearly marked and require human confirmation.
+- **Make commands copy-pastable.** Use placeholders like `<namespace>` only when the value is genuinely environment-specific.
+- **Include expected output.** A runbook is only useful if the reader knows what good and bad look like.
+- **Preserve safety context.** Always include environment confirmation for production-impacting steps.
+- **Avoid tribal knowledge.** If the original incident required someone knowing a hidden dependency, document it explicitly.
diff --git a/scripts/validate-repo.sh b/scripts/validate-repo.sh
new file mode 100755
index 0000000..80950d5
--- /dev/null
+++ b/scripts/validate-repo.sh
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+# ────────────────────────────────────────────────────────────────
+# validate-repo.sh — Local validation for devops-ai-workflows
+# ────────────────────────────────────────────────────────────────
+# Usage: ./scripts/validate-repo.sh
+#
+# Checks:
+#   - workflow markdown files have YAML frontmatter
+#   - README links point to existing local files
+#   - shell scripts are executable
+#   - optional markdownlint/shellcheck if installed
+# ────────────────────────────────────────────────────────────────
+set -euo pipefail
+
+ROOT_DIR=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
+cd "$ROOT_DIR"
+
+errors=0
+
+echo "🔎 Validating devops-ai-workflows repo"
+echo "Root: $ROOT_DIR"
+echo ""
+
+echo "== Workflow frontmatter =="
+while IFS= read -r file; do
+  if ! head -1 "$file" | grep -q '^---$'; then
+    echo "❌ Missing frontmatter: $file"
+    errors=$((errors + 1))
+  fi
+done < <(find workflows -name '*.md' | sort)
+[ "$errors" -eq 0 ] && echo "✅ Workflow frontmatter OK"
+echo ""
+
+echo "== README local links =="
+while IFS= read -r link; do
+  path=${link#./}
+  if [ ! -e "$path" ]; then
+    echo "❌ Broken README link: $link"
+    errors=$((errors + 1))
+  fi
+done < <(grep -oE '\]\(\./[^)]+\)' README.md | sed -E 's/^.*\((.*)\)$/\1/' | sort -u)
+[ "$errors" -eq 0 ] && echo "✅ README local links OK"
+echo ""
+
+echo "== Script executability =="
+while IFS= read -r file; do
+  if [ ! -x "$file" ]; then
+    echo "❌ Script not executable: $file"
+    errors=$((errors + 1))
+  fi
+done < <(find scripts -name '*.sh' | sort)
+[ "$errors" -eq 0 ] && echo "✅ Scripts executable"
+echo ""
+
+if command -v markdownlint >/dev/null 2>&1; then
+  echo "== markdownlint =="
+  markdownlint '**/*.md' || errors=$((errors + 1))
+  echo ""
+else
+  echo "ℹ️ markdownlint not installed; skipping"
+  echo ""
+fi
+
+if command -v shellcheck >/dev/null 2>&1; then
+  echo "== shellcheck =="
+  shellcheck scripts/*.sh || errors=$((errors + 1))
+  echo ""
+else
+  echo "ℹ️ shellcheck not installed; skipping"
+  echo ""
+fi
+
+if [ "$errors" -eq 0 ]; then
+  echo "✅ Validation passed"
+else
+  echo "❌ Validation failed with $errors issue(s)"
+  exit 1
+fi
diff --git a/workflows/aws/aws-vpc-debug.md b/workflows/aws/aws-vpc-debug.md
index 6583f8e..307005a 100644
--- a/workflows/aws/aws-vpc-debug.md
+++ b/workflows/aws/aws-vpc-debug.md
@@ -32,6 +32,14 @@ Ask the user for the following:
 ```bash
 aws sts get-caller-identity
 REGION=${REGION:-$(aws configure get region)}
+SRC_PRIVATE_IP=""
+SRC_SUBNET_ID=""
+SRC_VPC_ID=""
+SRC_SECURITY_GROUPS=""
+DEST_IP="$DESTINATION"
+DST_VPC_ID=""
+DST_SUBNET_ID=""
+DST_SECURITY_GROUPS=""
 
 echo "=== Resolve SOURCE ==="
 # If SOURCE looks like an instance ID
@@ -39,6 +47,10 @@ if echo "$SOURCE" | grep -qE '^i-[0-9a-f]+$'; then
   aws ec2 describe-instances --region $REGION --instance-ids $SOURCE \
     --query 'Reservations[0].Instances[0].{InstanceId:InstanceId,PrivateIp:PrivateIpAddress,PublicIp:PublicIpAddress,SubnetId:SubnetId,VpcId:VpcId,SecurityGroups:SecurityGroups[].GroupId,Name:Tags[?Key==`Name`]|[0].Value}' \
     --output json
+  SRC_PRIVATE_IP=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
+  SRC_SUBNET_ID=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].SubnetId' --output text)
+  SRC_VPC_ID=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].VpcId' --output text)
+  SRC_SECURITY_GROUPS=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].SecurityGroups[].GroupId' --output text)
 fi
 
 # If SOURCE looks like an ENI
@@ -46,6 +58,10 @@ if echo "$SOURCE" | grep -qE '^eni-[0-9a-f]+$'; then
   aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE \
     --query 'NetworkInterfaces[0].{Id:NetworkInterfaceId,PrivateIp:PrivateIpAddress,SubnetId:SubnetId,VpcId:VpcId,SecurityGroups:Groups[].GroupId,Description:Description}' \
     --output json
+  SRC_PRIVATE_IP=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].PrivateIpAddress' --output text)
+  SRC_SUBNET_ID=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].SubnetId' --output text)
+  SRC_VPC_ID=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].VpcId' --output text)
+  SRC_SECURITY_GROUPS=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].Groups[].GroupId' --output text)
 fi
 
 echo "=== Resolve DESTINATION ==="
@@ -53,6 +69,7 @@ echo "=== Resolve DESTINATION ==="
 if echo "$DESTINATION" | grep -qE '[a-zA-Z]'; then
   echo "DNS resolution:"
   dig +short "$DESTINATION" 2>/dev/null || nslookup "$DESTINATION" 2>/dev/null || echo "Could not resolve"
+  DEST_IP=$(dig +short "$DESTINATION" 2>/dev/null | grep -E '^[0-9]+\.' | head -1); DEST_IP=${DEST_IP:-$DESTINATION}  # fall back to the raw value if nothing resolved
 fi
 
 # If DESTINATION looks like an RDS endpoint
@@ -62,6 +79,18 @@ if echo "$DESTINATION" | grep -qE '\.rds\.amazonaws\.com$'; then
     --query 'DBInstances[0].{Id:DBInstanceIdentifier,Endpoint:Endpoint,VpcSecurityGroups:VpcSecurityGroups[].VpcSecurityGroupId,SubnetGroup:DBSubnetGroup.DBSubnetGroupName}' \
     --output json 2>/dev/null || true
 fi
+
+echo "=== Resolved variables for later steps ==="
+cat <<EOF
+SRC_PRIVATE_IP=$SRC_PRIVATE_IP
+SRC_SUBNET_ID=$SRC_SUBNET_ID
+SRC_VPC_ID=$SRC_VPC_ID
+SRC_SECURITY_GROUPS=$SRC_SECURITY_GROUPS
+DEST_IP=$DEST_IP
+DST_VPC_ID=$DST_VPC_ID
+DST_SUBNET_ID=$DST_SUBNET_ID
+DST_SECURITY_GROUPS=$DST_SECURITY_GROUPS
+EOF
 ```
 
 Stop if the source cannot be resolved — report the error and suggest checking the instance/ENI ID.
@@ -90,6 +119,9 @@ if echo "$DEST_IP" | grep -qE '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)';
     --filters Name=addresses.private-ip-address,Values=$DEST_IP \
     --query 'NetworkInterfaces[0].{VpcId:VpcId,SubnetId:SubnetId,SecurityGroups:Groups[].GroupId,Description:Description}' \
     --output json 2>/dev/null || echo "Destination IP not found as ENI address"
+  DST_VPC_ID=$(aws ec2 describe-network-interfaces --region $REGION --filters Name=addresses.private-ip-address,Values=$DEST_IP --query 'NetworkInterfaces[0].VpcId' --output text 2>/dev/null || true)
+  DST_SUBNET_ID=$(aws ec2 describe-network-interfaces --region $REGION --filters Name=addresses.private-ip-address,Values=$DEST_IP --query 'NetworkInterfaces[0].SubnetId' --output text 2>/dev/null || true)
+  DST_SECURITY_GROUPS=$(aws ec2 describe-network-interfaces --region $REGION --filters Name=addresses.private-ip-address,Values=$DEST_IP --query 'NetworkInterfaces[0].Groups[].GroupId' --output text 2>/dev/null || true)
 fi
 
 echo "=== VPC peering connections ==="
@@ -108,6 +140,7 @@ Flag:
 
 - Source and destination in different VPCs without peering/TGW.
 - Public vs private subnet classification.
+- Missing resolved variables (`SRC_VPC_ID`, `SRC_SUBNET_ID`, `DEST_IP`) — later checks depend on them.
 
 ---
 
@@ -268,6 +301,36 @@ Flag:
 - `enableDnsSupport` or `enableDnsHostnames` disabled.
 - Destination hostname not resolvable.
 - Missing private hosted zone association for cross-VPC DNS.
+- Split-horizon mismatch: local laptop resolves a different IP than the source VPC would resolve.
+- Resolver endpoint exists but has unhealthy status or insufficient IP addresses.
+
+---
+
+## Step 7b — Optional Reachability Analyzer path check
+
+> Use this only if the source and destination are ENIs or instances and the account has EC2 Reachability Analyzer permissions. It creates no traffic; it analyzes control-plane configuration.
+
+```bash
+echo "=== Reachability Analyzer guidance ==="
+cat <<EOF
+If SOURCE and DESTINATION can be represented as ENIs, optionally run:
+  aws ec2 create-network-insights-path \\
+    --region $REGION \\
+    --source <source-eni-id> \\
+    --destination <destination-eni-id> \\
+    --protocol $(echo "$PROTOCOL" | tr '[:upper:]' '[:lower:]') \\
+    --destination-port $PORT
+
+Then:
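+  aws ec2 start-network-insights-analysis --region $REGION --network-insights-path-id <path-id>
+  aws ec2 describe-network-insights-analyses --region $REGION --network-insights-analysis-ids <analysis-id>
+
+Clean up (delete the analysis before the path):
+  aws ec2 delete-network-insights-analysis --region $REGION --network-insights-analysis-id <analysis-id> && aws ec2 delete-network-insights-path --region $REGION --network-insights-path-id <path-id>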
+EOF
+```
+
+Flag: if Reachability Analyzer identifies the exact blocker (route table, SG, NACL, endpoint policy), treat that finding as authoritative over heuristic guesses from earlier steps.
 
 ---
 
diff --git a/workflows/cicd/release-checklist.md b/workflows/cicd/release-checklist.md
new file mode 100644
index 0000000..68350c4
--- /dev/null
+++ b/workflows/cicd/release-checklist.md
@@ -0,0 +1,181 @@
+---
+description: Pre-release safety checklist for application, infrastructure, or platform changes. Reviews deploy order, rollback, tests, monitoring, and communication before release.
+---
+
+# /release-checklist — Pre-Release Safety Gate
+
+Use before releasing application code, infrastructure changes, Helm charts, Terraform modules, or CI/CD updates. The workflow produces a release readiness report and highlights blockers before production deployment.
+
+## Prerequisites
+
+- PR/diff summary or release notes.
+- Target environment and deployment method.
+- Optional: test results, Terraform plan, Helm diff, ArgoCD app status, CI build URL.
+
+## Inputs
+
+- **RELEASE_NAME** *(required)* — short name of the change/release.
+- **ENVIRONMENT** — target environment. Default: `staging`.
+- **CHANGE_TYPE** — `app` | `infra` | `helm` | `terraform` | `pipeline` | `mixed`.
+- **REPORT_DIR** — Default: `./release-checklist-reports`.
+
+---
+
+## Step 1 — Identify release scope
+
+Gather:
+
+- What is changing?
+- Which repos/services/environments are affected?
+- Is this a single-repo or multi-repo release?
+- Is there a database, schema, IAM, networking, or config change?
+- Is there a feature flag or staged rollout mechanism?
+
+Classify risk:
+
+| Risk | Criteria |
+|---|---|
+| Low | Backward-compatible, tested, easy rollback, small blast radius |
+| Medium | Config/IaC changes, multiple services, partial rollback complexity |
+| High | Data migration, IAM/networking, irreversible changes, production-wide impact |
+
+---
+
+## Step 2 — Validate test and build evidence
+
+Check:
+
+- CI build passed for the exact commit being deployed.
+- Unit/integration/e2e tests relevant to the change passed.
+- Security scans completed or exceptions are documented.
+- Artifact/image tag is immutable and traceable to commit SHA.
+- No local-only changes are required for deploy.
+
+For infrastructure:
+
+```bash
+terraform validate
+terraform plan -out=plan.bin
+terraform show -json plan.bin > plan.json
+```
+
+For Helm/Kubernetes:
+
+```bash
+helm lint <chart>
+helm template <release> <chart> --values <values.yaml> >/tmp/rendered.yaml
+kubectl diff -f /tmp/rendered.yaml --server-side 2>/dev/null || true
+```
+
+---
+
+## Step 3 — Deploy order and dependency check
+
+Document:
+
+| Item | Value |
+|---|---|
+| Must deploy before | <repos/services/infra> |
+| Must deploy after | <repos/services/infra> |
+| Can deploy independently | yes/no |
+| Requires feature flag | yes/no |
+| Requires maintenance window | yes/no |
+
+Common ordering rules:
+
+- Database/schema backward-compatible change before app rollout.
+- IAM/networking prerequisites before service deployment.
+- CRDs before custom resources.
+- Shared libraries/build seed changes before dependent service builds.
+- Producer/consumer API compatibility verified before either side is deployed.
+
+---
+
+## Step 4 — Rollback and recovery plan
+
+Every release needs a rollback plan:
+
+| Area | Rollback approach | Time estimate | Risk |
+|---|---|---|---|
+| App | redeploy previous image/tag | <time> | <risk> |
+| Helm | `helm rollback` or GitOps revert | <time> | <risk> |
+| Terraform | revert code + apply plan | <time> | <risk> |
+| DB | forward-fix / restore / migration rollback | <time> | <risk> |
+
+Flag blockers:
+
+- No rollback path for data migration.
+- Terraform plan destroys/replaces stateful resources.
+- Previous app version cannot run against new schema.
+- Rollback requires manual console steps.
+
+---
+
+## Step 5 — Monitoring and communication
+
+Before release:
+
+- Identify dashboards to watch.
+- Identify alerts expected to fire (if any).
+- Confirm on-call owner and escalation channel.
+- Define success metrics and abort thresholds.
+
+Example release watchlist:
+
+| Signal | Healthy | Abort threshold |
+|---|---|---|
+| Error rate | <1% | >5% for 5 min |
+| p95 latency | <baseline + 20% | >baseline + 50% |
+| Pod restarts | 0–1 expected | repeated CrashLoopBackOff |
+| Queue lag | stable/decreasing | sustained growth |
+
+---
+
+## Step 6 — Generate report
+
+Write:
+
+```
+$REPORT_DIR/release-checklist-<release-name>-<YYYYMMDD-HHMMSS>.md
+```
+
+### Report structure
+
+```markdown
+# Release Checklist Report
+
+| Field | Value |
+|---|---|
+| Release | <name> |
+| Environment | <env> |
+| Change type | <type> |
+| Risk | Low / Medium / High |
+| Verdict | Ready / Ready with cautions / Blocked |
+
+## Scope
+<what changes>
+
+## Evidence
+<tests, CI, plans, diffs>
+
+## Deploy order
+<dependencies and sequencing>
+
+## Rollback plan
+<commands/process and risk>
+
+## Monitoring plan
+<dashboards, alerts, abort thresholds>
+
+## Blockers / cautions
+<items to resolve before release>
+```
+
+---
+
+## Safety rules
+
+- This workflow is a **review gate**. It should not deploy anything.
+- Commands are validation/diff/plan commands only.
+- Never recommend production deploy if rollback is unknown for high-risk changes.
+- Never ignore failed tests or scans; document explicit risk acceptance if release proceeds.
diff --git a/workflows/kubernetes/helm-release-debug.md b/workflows/kubernetes/helm-release-debug.md
index 8bb52df..d7565d1 100644
--- a/workflows/kubernetes/helm-release-debug.md
+++ b/workflows/kubernetes/helm-release-debug.md
@@ -71,6 +71,25 @@ Flag: changes in image tags, replica counts, resources, securityContext, ingress
 
 ---
 
+## Step 3b — GitOps ownership check
+
+// turbo
+
+```bash
+echo "=== Helm release labels/secrets ==="
+kubectl -n $NAMESPACE get secret -l "owner=helm,name=$RELEASE" -o custom-columns=NAME:.metadata.name,STATUS:.metadata.labels.status,VERSION:.metadata.labels.version,CREATED:.metadata.creationTimestamp 2>/dev/null
+
+echo "=== ArgoCD Applications that may manage this release ==="
+kubectl get applications -A 2>/dev/null | grep -E "$RELEASE|$NAMESPACE" || echo "No matching ArgoCD Application found (or CRD not installed/no permission)"
+
+echo "=== Flux HelmReleases that may manage this release ==="
+kubectl get helmreleases -A 2>/dev/null | grep -E "$RELEASE|$NAMESPACE" || echo "No matching Flux HelmRelease found (or CRD not installed/no permission)"
+```
+
+Flag: release managed by ArgoCD/Flux where manual `helm rollback` or `helm upgrade` will conflict with the GitOps controller. Prefer changing the Git source and syncing the application.
+
+---
+
 ## Step 4 — Rendered manifest sanity
 
 // turbo
@@ -87,6 +106,27 @@ Flag: deprecated apiVersions (compare against current cluster version — see `/
 
 ---
 
+## Step 4b — CRD dependency and lifecycle checks
+
+// turbo
+
+```bash
+echo "=== CRDs referenced by rendered manifests ==="
+grep -E '^kind: ' /tmp/manifest.yaml | awk '{print $2}' | sort -u | while read -r kind; do
+  kubectl api-resources --no-headers 2>/dev/null | awk -v k="$kind" '$NF == k {print $1, $(NF-2), $NF}' || true  # match on the KIND column, not the resource name
+done
+
+echo "=== CRDs included in this release manifest ==="
+awk '/^kind: CustomResourceDefinition/{print "CRD found"} /^  name:/{if(found!=""){print $2; found=""}} /^kind: CustomResourceDefinition/{found=1}' /tmp/manifest.yaml
+
+echo "=== Recently established/non-established CRDs ==="
+kubectl get crd --sort-by=.metadata.creationTimestamp -o 'custom-columns=NAME:.metadata.name,ESTABLISHED:.status.conditions[?(@.type=="Established")].status,CREATED:.metadata.creationTimestamp' 2>/dev/null | tail -50
+```
+
+Flag: custom resources rendered before their CRDs exist, CRDs managed by a different chart/release, CRD upgrades that Helm will not apply automatically, or non-established CRDs blocking dependent resources.
+
+---
+
 ## Step 5 — Hooks (pre/post-install/upgrade/delete)
 
 ```bash
@@ -106,6 +146,8 @@ done
 
 Flag: `pre-upgrade` / `pre-install` jobs failing (these block the whole release); `post-delete` hooks left running on a previous uninstall.
 
+Also flag: hook Jobs with no `activeDeadlineSeconds`, no `ttlSecondsAfterFinished`, very high `backoffLimit`, or hooks missing `helm.sh/hook-delete-policy` (these often leave stale failed Jobs that block later upgrades).
+
 ---
 
 ## Step 6 — Resources actually created
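Note on the hook-Job flags added to Step 5 of `helm-release-debug.md` above: the step lists the settings to look for but has no command that checks them. The sketch below is one possible approach, assuming the hook manifests have been saved to a file (for example from `helm get hooks`). `check_hook_jobs` is a hypothetical helper, not part of this patch, and the awk matching is a line-based heuristic, not a YAML parser.

```shell
# check_hook_jobs FILE  (hypothetical helper, illustration only)
# Scans a multi-document manifest and reports hook Jobs that lack
# cleanup or timeout settings. Matching is line-based and heuristic.
check_hook_jobs() {
  awk '
    function report() {
      if (is_job && is_hook) {
        if (!has_policy)   print name ": missing helm.sh/hook-delete-policy"
        if (!has_deadline) print name ": missing activeDeadlineSeconds"
        if (!has_ttl)      print name ": missing ttlSecondsAfterFinished"
      }
      # Reset state for the next YAML document.
      is_job = 0; is_hook = 0; has_policy = 0; has_deadline = 0; has_ttl = 0
      name = "?"
    }
    BEGIN { name = "?" }
    /^---/                         { report(); next }
    /^kind: Job *$/                { is_job = 1 }
    /^  name: / && name == "?"     { name = $2 }      # first 2-space-indented name
    /helm\.sh\/hook"?:/            { is_hook = 1 }    # quoted or unquoted key
    /helm\.sh\/hook-delete-policy/ { has_policy = 1 }
    /activeDeadlineSeconds:/       { has_deadline = 1 }
    /ttlSecondsAfterFinished:/     { has_ttl = 1 }
    END { report() }
  ' "$1"
}
```

Usage would be along the lines of `helm -n $NAMESPACE get hooks $RELEASE > /tmp/hooks.yaml && check_hook_jobs /tmp/hooks.yaml`. Treat any output as a prompt to read the Job spec, not as a definitive finding.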
diff --git a/workflows/kubernetes/k8s-rbac-audit.md b/workflows/kubernetes/k8s-rbac-audit.md
index 4f01a90..7f4afe4 100644
--- a/workflows/kubernetes/k8s-rbac-audit.md
+++ b/workflows/kubernetes/k8s-rbac-audit.md
@@ -141,6 +141,30 @@ Flag: workloads still on `default` SA (best practice: dedicated SA per app); `au
 
 ---
 
+## Step 6b — ServiceAccount token exposure and secret references
+
+// turbo
+
+```bash
+[ "$NAMESPACE" = "all" ] && S="-A" || S="-n $NAMESPACE"
+
+echo "=== ServiceAccounts with automount enabled or unspecified ==="
+kubectl get serviceaccounts $S -o json | jq -r '
+  .items[] |
+  select(.automountServiceAccountToken == true or .automountServiceAccountToken == null) |
+  "\(.metadata.namespace)/\(.metadata.name)\tautomount=\(.automountServiceAccountToken // "default-true")"'
+
+echo "=== Pods mounting projected service account tokens ==="
+kubectl get pods $S -o json | jq -r '
+  .items[] |
+  select([.spec.volumes[]? | select(.projected.sources[]?.serviceAccountToken)] | length > 0) |
+  "\(.metadata.namespace)/\(.metadata.name)\tSA=\(.spec.serviceAccountName // "default")"'
+```
+
+Flag: ServiceAccounts that do not need Kubernetes API access but still automount tokens; pods using long-lived or broadly scoped projected tokens; high-privilege ServiceAccounts used by many workloads.
+
+---
+
 ## Step 7 — Aggregated ClusterRoles & built-in escalation paths
 
 // turbo
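Note on Step 6b of `k8s-rbac-audit.md` above: the flag "high-privilege ServiceAccounts used by many workloads" needs a per-ServiceAccount pod count, which the step's commands do not produce. A minimal sketch follows, assuming input lines of the form `<namespace> <serviceaccount>`. `count_sa_usage` is a hypothetical helper name, not part of this patch.

```shell
# count_sa_usage  (hypothetical helper, illustration only)
# Reads "<namespace> <serviceaccount>" lines on stdin and prints
# "<pod-count> <namespace>/<serviceaccount>", most used first.
count_sa_usage() {
  awk 'NF >= 2 { n[$1 "/" $2]++ } END { for (k in n) print n[k], k }' \
    | sort -k1,1nr -k2,2
}

# Typical use against a cluster (not run here); $S is the selector set in Step 6b:
#   kubectl get pods $S --no-headers \
#     -o custom-columns=NS:.metadata.namespace,SA:.spec.serviceAccountName | count_sa_usage
```

The counts only show how widely each ServiceAccount is shared. Whether an account is high-privilege still has to come from the role-binding steps earlier in the workflow.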
diff --git a/workflows/kubernetes/k8s-workload-debug.md b/workflows/kubernetes/k8s-workload-debug.md
index 3e735fd..118cfc2 100644
--- a/workflows/kubernetes/k8s-workload-debug.md
+++ b/workflows/kubernetes/k8s-workload-debug.md
@@ -76,6 +76,7 @@ kubectl -n $NAMESPACE get $KIND $NAME -o json | jq '{
   tolerations: .spec.template.spec.tolerations,
   affinity: .spec.template.spec.affinity,
   topologySpread: .spec.template.spec.topologySpreadConstraints,
+  automountServiceAccountToken: .spec.template.spec.automountServiceAccountToken,
   containers: [.spec.template.spec.containers[] | {
     name, image,
     resources,
@@ -89,7 +90,7 @@ kubectl -n $NAMESPACE get $KIND $NAME -o json | jq '{
 }'
 ```
 
-Flag: missing requests/limits, no probes, no readinessProbe (rolling updates lie about readiness), `latest` tag, runs as root, missing `imagePullSecrets` for private registry, oversized resources vs cluster.
+Flag: missing requests/limits, no probes, no readinessProbe (rolling updates lie about readiness), `latest` tag, runs as root, missing `imagePullSecrets` for private registry, oversized resources vs cluster, `automountServiceAccountToken: true` on workloads that do not call the Kubernetes API.
 
 ---
 
@@ -135,6 +136,26 @@ Search for: `error`, `fatal`, `panic`, `exception`, `traceback`, `failed`, `time
 
 ---
 
+## Step 5b — Sidecar and init-container analysis
+
+```bash
+kubectl -n $NAMESPACE get pods -l "$SEL" -o json | jq -r '
+  .items[] | .metadata.name as $p |
+  "POD=\($p)",
+  "  initContainers=\([.spec.initContainers[]?.name] | join(","))",
+  "  containers=\([.spec.containers[]?.name] | join(","))"'
+
+for p in $(kubectl -n $NAMESPACE get pods -l "$SEL" -o name); do
+  kubectl -n $NAMESPACE get $p -o json | jq -r '.spec.initContainers[]?.name' | while read c; do
+    [ -n "$c" ] && echo "===== $p init/$c =====" && kubectl -n $NAMESPACE logs $p -c "$c" --tail=$LOG_TAIL --timestamps 2>/dev/null
+  done
+done
+```
+
+Flag: init container failures hidden behind `Init:Error` / `PodInitializing`; sidecars such as Istio/Envoy, Vault agent, Fluent Bit, or OpenTelemetry collector restarting while the app container looks healthy; sidecar injection changing ports, env vars, or filesystem mounts.
+
+---
+
 ## Step 6 — Probe failure analysis
 
 // turbo
@@ -190,6 +211,23 @@ Flag: service with no endpoints, named-port mismatch, NetworkPolicy denying expe
 
 ---
 
+## Step 8b — GitOps and controller ownership
+
+```bash
+echo "=== GitOps/controller ownership labels ==="
+kubectl -n $NAMESPACE get $KIND $NAME -o json | jq '{labels: .metadata.labels, annotations: .metadata.annotations, ownerReferences: .metadata.ownerReferences}'
+
+echo "=== ArgoCD Applications referencing namespace/workload (if installed) ==="
+kubectl get applications.argoproj.io -A 2>/dev/null | grep -E "$NAMESPACE|$NAME" || echo "No matching ArgoCD Application found (or CRD not installed/no permission)"
+
+echo "=== Flux Kustomizations/HelmReleases (if installed) ==="
+kubectl get kustomizations.kustomize.toolkit.fluxcd.io,helmreleases.helm.toolkit.fluxcd.io -A 2>/dev/null | grep -E "$NAMESPACE|$NAME" || echo "No matching Flux resources found (or CRDs not installed/no permission)"
+```
+
+Flag: resource managed by ArgoCD/Flux/Helm where manual `kubectl edit/apply` changes will be reverted; sync status unhealthy/out-of-sync; multiple controllers owning the same workload.
+
+---
+
 ## Step 9 — Storage (PVCs and mounts)
 
 // turbo
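Note on Step 8b of `k8s-workload-debug.md` above: the step prints all labels and annotations and leaves the reader to spot ownership markers. The sketch below shows one way to narrow that down. `classify_owner_keys` is a hypothetical helper, not part of this patch. The key prefixes are ones these tools commonly set, and they should be checked against the local installation, since ArgoCD tracking in particular is configurable.

```shell
# classify_owner_keys  (hypothetical helper, illustration only)
# Reads label/annotation keys on stdin, one per line, and prints
# "<likely-manager> <key>" for keys that commonly indicate ownership.
classify_owner_keys() {
  while IFS= read -r key; do
    case "$key" in
      argocd.argoproj.io/*)          echo "argocd $key" ;;
      kustomize.toolkit.fluxcd.io/*) echo "flux-kustomization $key" ;;
      helm.toolkit.fluxcd.io/*)      echo "flux-helmrelease $key" ;;
      meta.helm.sh/release-name)     echo "helm $key" ;;
    esac
  done
}

# Typical use against a cluster (not run here):
#   kubectl -n $NAMESPACE get $KIND $NAME -o json \
#     | jq -r '((.metadata.labels // {}) + (.metadata.annotations // {})) | keys[]' \
#     | classify_owner_keys
```

More than one manager in the output is a hint toward the "multiple controllers owning the same workload" flag. No output does not prove the resource is unmanaged.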
diff --git a/workflows/security/repo-health.md b/workflows/security/repo-health.md
new file mode 100644
index 0000000..adb2291
--- /dev/null
+++ b/workflows/security/repo-health.md
@@ -0,0 +1,181 @@
+---
+description: Audit repository hygiene and maintainability. Checks README, license, CI, branch protection indicators, stale branches, secrets hygiene, dependency files, and release readiness.
+---
+
+# /repo-health — Repository Hygiene Audit
+
+Read-only repository audit for maintainability, security hygiene, and operational readiness. Useful before open-sourcing a repo, onboarding a team, or preparing a production service for long-term ownership.
+
+## Prerequisites
+
+- Local git repository.
+- Optional: `gh` CLI for GitHub metadata.
+- Optional: `jq`.
+
+## Inputs
+
+- **REPO_PATH** *(required)* — path to the git repository root.
+- **STALE_DAYS** — branch staleness threshold. Default: `90`.
+- **REPORT_DIR** — Default: `./repo-health-reports`.
+
+---
+
+## Step 1 — Basic repository inventory
+
+// turbo
+
+```bash
+cd "$REPO_PATH" || exit 1
+
+echo "=== Git basics ==="
+git remote -v
+git branch --show-current
+git log --oneline -5
+git status --short
+
+echo "=== Top-level files ==="
+find . -maxdepth 2 -type f ! -path './.git/*' | sed 's#^\./##' | sort | head -100
+```
+
+Flag:
+- No remote configured.
+- Dirty working tree when preparing release/PR.
+- Missing standard files: README, LICENSE, CHANGELOG, CONTRIBUTING, CODEOWNERS.
+
+---
+
+## Step 2 — Documentation and ownership
+
+// turbo
+
+```bash
+cd "$REPO_PATH" || exit 1
+
+for f in README.md LICENSE CHANGELOG.md CONTRIBUTING.md CODEOWNERS .github/CODEOWNERS docs/CODEOWNERS; do
+  [ -f "$f" ] && echo "FOUND: $f" || echo "MISSING: $f"
+done
+
+echo "=== README sections ==="
+grep -niE '^##? (Overview|Quick start|Usage|Development|Testing|Deployment|Configuration|Troubleshooting|Contributing|License)' README.md 2>/dev/null || true
+```
+
+Flag:
+- README without quick start, usage, testing, or deployment instructions.
+- No ownership file for review routing.
+- CHANGELOG absent for reusable tools/libraries.
+
+---
+
+## Step 3 — CI/CD and automation
+
+// turbo
+
+```bash
+cd "$REPO_PATH" || exit 1
+
+echo "=== CI configs ==="
+find .github/workflows .gitlab-ci.yml bitbucket-pipelines.yml Jenkinsfile -maxdepth 2 -type f 2>/dev/null | sort
+
+echo "=== Common quality configs ==="
+find . -maxdepth 3 -type f \( -name '.pre-commit-config.yaml' -o -name '.editorconfig' -o -name '.markdownlint*' -o -name 'renovate.json' -o -name 'dependabot.yml' \) | sort
+```
+
+Flag:
+- No CI workflow/pipeline.
+- No dependency update automation (`renovate`/`dependabot`).
+- No formatting/lint configuration for active languages.
+- CI exists but no security or dependency scanning.
+
+---
+
+## Step 4 — Secrets and ignore hygiene
+
+// turbo
+
+```bash
+cd "$REPO_PATH" || exit 1
+
+echo "=== .gitignore present ==="
+[ -f .gitignore ] && cat .gitignore || echo "NO .gitignore"
+
+echo "=== Sensitive files tracked ==="
+for pattern in '.env' '.env.*' '*.pem' '*.key' '*.p12' '*.pfx' '*.tfstate' '*.tfvars' 'kubeconfig' 'credentials' 'secrets.yaml' 'secrets.yml'; do
+  git ls-files -- "$pattern" "*/$pattern" 2>/dev/null | sed "s/^/TRACKED: /"  # "*/$pattern" also matches files in subdirectories
+done
+
+echo "=== High-confidence secret-looking strings in current tree (file:line only, values not printed) ==="
+git grep -nE 'AKIA[0-9A-Z]{16}|ghp_[0-9A-Za-z]{36}|glpat-[0-9A-Za-z_-]{20}|BEGIN (RSA |OPENSSH |EC )?PRIVATE KEY' 2>/dev/null | cut -d: -f1,2 | head -20 || true
+```
+
+If suspicious findings appear, recommend `/secrets-leak-scan` for full history scanning.
+
+---
+
+## Step 5 — Branch and release hygiene
+
+// turbo
+
+```bash
+cd "$REPO_PATH" || exit 1
+
+echo "=== Recent branches ==="
+git for-each-ref --sort=-committerdate --format='%(refname:short) %(committerdate:short) %(authorname)' refs/heads/ refs/remotes/origin/ | head -30
+
+echo "=== Tags/releases ==="
+git tag --sort=-creatordate | head -20
+```
+
+Flag:
+- Many stale branches older than `STALE_DAYS`.
+- No tags/releases for production software.
+- No clear branching strategy documented.
+
+---
+
+## Step 6 — Generate report
+
+Write:
+
+```
+$REPORT_DIR/repo-health-<repo-name>-<YYYYMMDD-HHMMSS>.md
+```
+
+### Report structure
+
+```markdown
+# Repository Health Report
+
+| Field | Value |
+|---|---|
+| Repo | <name> |
+| Branch | <branch> |
+| Generated | <timestamp> |
+| Verdict | Healthy / Needs attention / High risk |
+
+## Summary
+<top findings>
+
+## Documentation and ownership
+<findings>
+
+## CI/CD and automation
+<findings>
+
+## Security hygiene
+<findings>
+
+## Branch and release hygiene
+<findings>
+
+## Recommended actions
+<prioritized list>
+```
+
+---
+
+## Safety rules
+
+- This workflow is **read-only**.
+- Do not print full secret values. Redact if reporting.
+- Do not delete branches or modify repository settings.
+- If `gh` or remote API access is unavailable, record that limitation and continue with local checks.
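Note on Step 5 of `repo-health.md` above: the step flags branches older than `STALE_DAYS`, but its commands list branches by date without applying the threshold. A minimal sketch of that filter follows, assuming input lines of the form `<ref> <committer-epoch-seconds>`. `stale_refs` is a hypothetical helper name, not part of this patch.

```shell
# stale_refs NOW_EPOCH STALE_DAYS  (hypothetical helper, illustration only)
# Reads "<ref> <committer-epoch>" lines on stdin and prints
# "<ref> <age-in-days>" for refs whose last commit is older than STALE_DAYS.
stale_refs() {
  awk -v now="$1" -v days="$2" '
    $2 < now - days * 86400 { printf "%s %d\n", $1, (now - $2) / 86400 }
  '
}

# Typical use inside the repository (not run here):
#   git for-each-ref --format='%(refname:short) %(committerdate:unix)' \
#       refs/heads/ refs/remotes/origin/ \
#     | stale_refs "$(date +%s)" "${STALE_DAYS:-90}"
```

The helper only reads ref metadata, so it stays within the workflow's read-only safety rules. Passing the current time as an argument keeps the filter deterministic.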