From 3f7ff97b9ea48d4cf7dc0e153b72722a9e2284cb Mon Sep 17 00:00:00 2001
From: Sergei Olshanetski <solshanetski@proofpoint.com>
Date: Thu, 14 May 2026 10:33:14 -0400
Subject: [PATCH 1/2] Better workflows and more prompts

---
 prompts/explain-like-a-senior.md       | 56 ++++++++++++++++++++
 prompts/pr-description.md              | 54 ++++++++++++++++++++
 scripts/aws-whoami.sh                  | 45 ++++++++++++++++
 scripts/stale-branches.sh              | 71 ++++++++++++++++++++++++++
 workflows/aws/aws-account-audit.md     | 41 +++++++++++----
 workflows/aws/aws-cost-quickscan.md    | 46 ++++++++++++++++-
 workflows/iac/terraform-plan-review.md | 26 ++++++++++
 7 files changed, 326 insertions(+), 13 deletions(-)
 create mode 100644 prompts/explain-like-a-senior.md
 create mode 100644 prompts/pr-description.md
 create mode 100644 scripts/aws-whoami.sh
 create mode 100644 scripts/stale-branches.sh
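
When a loop like the ones in `scripts/stale-branches.sh` both reads from a command and updates a counter, where the loop runs decides whether the counter survives. The sketch below is a minimal illustration, not the script itself: `count_with_pipe` and `count_with_procsub` are hypothetical helpers, and `printf` stands in for `git for-each-ref`. It assumes bash with default options (no `lastpipe`).

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; printf stands in for `git for-each-ref`.

# Pipeline form: the while loop runs in a subshell,
# so the increments are discarded when the loop ends.
count_with_pipe() {
  local n=0
  printf '%s\n' "$@" | while read -r _; do
    n=$((n + 1))
  done
  echo "$n"
}

# Process-substitution form: the loop runs in the current shell,
# so the counter is still set after `done`.
count_with_procsub() {
  local n=0
  while read -r _; do
    n=$((n + 1))
  done < <(printf '%s\n' "$@")
  echo "$n"
}

count_with_pipe a b c      # prints 0
count_with_procsub a b c   # prints 3
```

The same reasoning applies to any summary line printed after a `producer | while read` loop: the total has to be computed in the shell that prints it.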
diff --git a/prompts/explain-like-a-senior.md b/prompts/explain-like-a-senior.md
new file mode 100644
index 0000000..b357531
--- /dev/null
+++ b/prompts/explain-like-a-senior.md
@@ -0,0 +1,56 @@
+# Explain Like a Senior — System Prompt
+
+Paste this into any AI agent when you want a clear, educational explanation of infrastructure code for a junior engineer or new team member.
+
+---
+
+## System prompt
+
+You are a **senior DevOps/SRE engineer** explaining infrastructure code to a junior team member. Your goal is to build understanding, not just describe syntax.
+
+### For each piece of code, explain
+
+1. **What it does** — plain English, no jargon. If jargon is unavoidable, define it.
+2. **Why it's designed this way** — what problem does this solve? What trade-offs were made?
+3. **What could go wrong** — common failure modes, misconfigurations, and gotchas.
+4. **How it connects** — how does this piece fit into the bigger picture? What depends on it? What does it depend on?
+5. **What you'd change** — if anything looks suboptimal, explain what a senior would do differently and why.
+
+### Explanation style
+
+- **Start with the big picture**, then zoom in. "This Terraform module creates a VPC with public and private subnets. Here's how each piece works..."
+- **Use analogies** where they help. "A NAT Gateway is like a mail forwarding service — private instances send mail through it so they can reach the internet without being directly addressable."
+- **Show the mental model.** How would a senior engineer think about this? What questions would they ask?
+- **Point out non-obvious things.** "This `depends_on` might look unnecessary, but without it, the Lambda function can be created before the policy is attached to its IAM role, and the first deploy fails."
+- **Be honest about complexity.** If something is genuinely confusing or poorly designed, say so — don't pretend it's simple.
+
+### Format
+
+```markdown
+## Overview
+<big picture: what this code does and why it exists>
+
+## Walk-through
+<section by section explanation>
+
+### <section name>
+**What:** <what this block does>
+**Why:** <why it's needed>
+**Gotcha:** <what could go wrong>
+
+## How it fits together
+<architecture context — what calls this, what this calls>
+
+## Things to watch out for
+<list of common mistakes or misconfigurations>
+
+## If I were reviewing this
+<what a senior would suggest improving>
+```
+
+### Rules
+
+- **No condescension.** Junior doesn't mean stupid. Explain clearly without being patronizing.
+- **No hand-waving.** If you don't know why something is done a certain way, say "I'm not sure why this specific choice was made — it might be historical. Here's what I'd investigate."
+- **Use the actual code.** Reference specific lines, variables, and resource names.
+- **Encourage questions.** End with "Good questions to ask your team about this: ..."
diff --git a/prompts/pr-description.md b/prompts/pr-description.md
new file mode 100644
index 0000000..a7a47e8
--- /dev/null
+++ b/prompts/pr-description.md
@@ -0,0 +1,54 @@
+# PR Description Generator — System Prompt
+
+Paste this into any AI agent along with your `git diff` or list of changes to generate a PR description.
+
+---
+
+## System prompt
+
+You are a **PR description writer** for a DevOps/infrastructure team. Given a diff, commit list, or description of changes, generate a clear, reviewable pull request description.
+
+### Output format
+
+```markdown
+## What
+
+<1-3 sentences: what this PR does in plain English>
+
+## Why
+
+<1-3 sentences: why this change is needed — the problem, feature request, or improvement>
+
+## How
+
+<bullet list of the key changes, grouped by file or area>
+
+## Testing
+
+<what was tested and how — manual steps, CI results, environments used>
+
+## Risk
+
+<what could go wrong, blast radius, rollback plan>
+- **Risk level:** Low / Medium / High
+- **Rollback:** <how to revert if needed>
+- **Affected environments:** <which envs will be impacted>
+
+## Checklist
+
+- [ ] Code follows project conventions
+- [ ] Tests added/updated
+- [ ] Documentation updated (if applicable)
+- [ ] No secrets or credentials in the diff
+- [ ] Reviewed for security implications
+```
+
+### Rules
+
+- **Be specific.** Don't say "updated the config" — say "changed the RDS instance class from `db.t3.medium` to `db.t3.large` to handle increased query load."
+- **Group changes logically.** If the PR touches 5 files across 2 concerns, group by concern, not by file.
+- **Flag breaking changes** prominently with ⚠️.
+- **Mention dependencies** — does this PR need to be merged/deployed before or after another PR?
+- **Include the diff context.** If the user provides a diff, reference specific file paths and line changes.
+- **Never include secret values** from the diff. If the diff contains credentials, flag it as a blocker.
+- **For infrastructure PRs**, always include: what resources are created/modified/destroyed, blast radius, and rollback plan.
diff --git a/scripts/aws-whoami.sh b/scripts/aws-whoami.sh
new file mode 100644
index 0000000..b6aedaa
--- /dev/null
+++ b/scripts/aws-whoami.sh
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+# ────────────────────────────────────────────────────────────────
+# aws-whoami.sh — Quick AWS identity and account context
+# ────────────────────────────────────────────────────────────────
+# Usage: ./aws-whoami.sh [profile]
+#
+# Shows: caller identity, account alias, region, organization,
+# and SSO role (if using AWS SSO).
+# ────────────────────────────────────────────────────────────────
+set -euo pipefail
+
+PROFILE_FLAG=""
+[ -n "${1:-}" ] && PROFILE_FLAG="--profile $1"
+
+echo "🔍 AWS Identity Check"
+echo "====================="
+echo ""
+
+echo "--- Caller Identity ---"
+aws sts get-caller-identity $PROFILE_FLAG --output table 2>&1
+
+echo ""
+echo "--- Region ---"
+REGION=$(aws configure get region $PROFILE_FLAG 2>/dev/null || echo "not set")
+echo "Region: $REGION"
+
+echo ""
+echo "--- Account Aliases ---"
+aws iam list-account-aliases $PROFILE_FLAG --query 'AccountAliases[]' --output text 2>/dev/null || echo "(none or no permission)"
+
+echo ""
+echo "--- Organization ---"
+aws organizations describe-organization $PROFILE_FLAG --query 'Organization.{Id:Id,Master:MasterAccountId,Email:MasterAccountEmail}' --output table 2>/dev/null || echo "Not in an org (or no permission)"
+
+echo ""
+echo "--- SSO Role (if applicable) ---"
+ARN=$(aws sts get-caller-identity $PROFILE_FLAG --query 'Arn' --output text 2>/dev/null)
+if echo "$ARN" | grep -q 'assumed-role'; then
+  ROLE=$(echo "$ARN" | awk -F/ '{print $2}')
+  SESSION_NAME=$(echo "$ARN" | awk -F/ '{print $3}')   # not USER: avoid clobbering the login-name env var
+  echo "Role: $ROLE"
+  echo "User: $SESSION_NAME"
+else
+  echo "Not using assumed role"
+fi
diff --git a/scripts/stale-branches.sh b/scripts/stale-branches.sh
new file mode 100644
index 0000000..87c9043
--- /dev/null
+++ b/scripts/stale-branches.sh
@@ -0,0 +1,71 @@
+#!/usr/bin/env bash
+# ────────────────────────────────────────────────────────────────
+# stale-branches.sh — List git branches older than N days
+# ────────────────────────────────────────────────────────────────
+# Usage: ./stale-branches.sh [days] [--remote]
+#
+# Defaults: 90 days, local branches only.
+# Add --remote to include remote tracking branches.
+# ────────────────────────────────────────────────────────────────
+set -euo pipefail
+
+DAYS="${1:-90}"
+INCLUDE_REMOTE=false
+[ "${2:-}" = "--remote" ] && INCLUDE_REMOTE=true
+
+CUTOFF=$(date -u -v-${DAYS}d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "${DAYS} days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null)
+
+echo "🌿 Stale Branch Report"
+echo "======================"
+echo "Repo: $(basename "$(git rev-parse --show-toplevel 2>/dev/null || echo '?')")"
+echo "Threshold: ${DAYS} days (before $(echo "$CUTOFF" | cut -dT -f1))"
+echo "Scope: $([ "$INCLUDE_REMOTE" = true ] && echo 'local + remote' || echo 'local only')"
+echo ""
+
+# Current branch (don't flag this one)
+CURRENT=$(git branch --show-current 2>/dev/null || echo "")
+
+echo "--- Stale local branches ---"
+stale_local=0
+for branch in $(git for-each-ref --sort=committerdate --format='%(refname:short) %(committerdate:iso8601)' refs/heads/ | while read name date; do
+  # Compare dates
+  branch_epoch=$(date -jf "%Y-%m-%d %H:%M:%S %z" "$date" +%s 2>/dev/null || date -d "$date" +%s 2>/dev/null || echo 0)
+  cutoff_epoch=$(date -jf "%Y-%m-%dT%H:%M:%SZ" "$CUTOFF" +%s 2>/dev/null || date -d "$CUTOFF" +%s 2>/dev/null || echo 0)
+  [ "$branch_epoch" -lt "$cutoff_epoch" ] 2>/dev/null && echo "$name"
+done); do
+  [ "$branch" = "$CURRENT" ] && continue
+  [ "$branch" = "main" ] || [ "$branch" = "master" ] && continue
+  last_commit=$(git log -1 --format='%ci (%cr)' "$branch" 2>/dev/null || echo "unknown")
+  author=$(git log -1 --format='%an' "$branch" 2>/dev/null || echo "unknown")
+  echo "  $branch"
+  echo "    Last commit: $last_commit"
+  echo "    Author: $author"
+  stale_local=$((stale_local + 1))
+done
+[ "$stale_local" -eq 0 ] && echo "  (none)"
+echo ""
+echo "Stale local branches: $stale_local"
+
+if [ "$INCLUDE_REMOTE" = true ]; then
+  echo ""
+  echo "--- Stale remote branches ---"
+  git fetch --prune 2>/dev/null || true
+  stale_remote=0
+  while read -r name date; do   # fed by process substitution at `done` so stale_remote survives the loop
+    # Skip HEAD and main/master
+    echo "$name" | grep -qE 'HEAD|/main$|/master$' && continue
+    branch_epoch=$(date -jf "%Y-%m-%d %H:%M:%S %z" "$date" +%s 2>/dev/null || date -d "$date" +%s 2>/dev/null || echo 0)
+    cutoff_epoch=$(date -jf "%Y-%m-%dT%H:%M:%SZ" "$CUTOFF" +%s 2>/dev/null || date -d "$CUTOFF" +%s 2>/dev/null || echo 0)
+    if [ "$branch_epoch" -lt "$cutoff_epoch" ] 2>/dev/null; then
+      last_commit=$(git log -1 --format='%ci (%cr)' "$name" 2>/dev/null || echo "unknown")
+      echo "  $name — $last_commit"
+      stale_remote=$((stale_remote + 1))
+    fi
+  done < <(git for-each-ref --sort=committerdate --format='%(refname:short) %(committerdate:iso8601)' refs/remotes/origin/)
+  echo ""
+  echo "Stale remote branches: $stale_remote"
+fi
+
+echo ""
+echo "💡 To delete a stale local branch:  git branch -d <branch>"
+echo "💡 To delete a stale remote branch: git push origin --delete <branch>"
diff --git a/workflows/aws/aws-account-audit.md b/workflows/aws/aws-account-audit.md
index cad6451..5605d0f 100644
--- a/workflows/aws/aws-account-audit.md
+++ b/workflows/aws/aws-account-audit.md
@@ -19,10 +19,13 @@ Ask the user for the following before starting (use sensible defaults if not pro
 - **PROFILE** — AWS CLI profile name. Default: current default profile.
 - **REGION** — primary region to audit. Default: current default region (`aws configure get region`).
 - **ALL_REGIONS** — `yes`/`no`. If `yes`, repeat region-scoped checks across all enabled regions. Default: `no` (primary region only).
+- **FAST** — `yes`/`no`. If `yes`, skip slow per-user/per-policy iteration loops and use bulk API calls only. Recommended for large enterprise accounts (>1000 roles or >500 policies) to avoid API throttling. Default: `no`.
 - **REPORT_DIR** — where to write the report. Default: `./aws-account-audit-reports`.
 
 Confirm the inputs and caller identity with the user before proceeding.
 
+> **Performance note:** On large enterprise accounts (thousands of roles/policies), the per-policy admin-access scan in Step 3 can take 30+ minutes due to AWS IAM API throttling. Use `FAST=yes` to skip these loops and rely on `list-entities-for-policy` bulk checks for `AdministratorAccess`, `IAMFullAccess`, and `PowerUserAccess` instead.
+
 ---
 
 ## Step 1 — Verify identity and account context
@@ -85,21 +88,37 @@ Flag:
 
 // turbo
 
+### Always run — bulk privilege checks (fast)
+
+```bash
+echo "=== Account summary (role/policy counts) ==="
+aws iam get-account-summary --output json 2>/dev/null | jq '{Users:.SummaryMap.Users,Roles:.SummaryMap.Roles,Policies:.SummaryMap.Policies,MFADevicesInUse:.SummaryMap.MFADevicesInUse}'
+
+echo "=== Entities with AdministratorAccess ==="
+aws iam list-entities-for-policy --policy-arn "arn:aws:iam::aws:policy/AdministratorAccess" \
+  --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName|length(@),Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null
+
+echo "=== Entities with IAMFullAccess ==="
+aws iam list-entities-for-policy --policy-arn "arn:aws:iam::aws:policy/IAMFullAccess" \
+  --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName|length(@),Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null
+
+echo "=== Entities with PowerUserAccess ==="
+aws iam list-entities-for-policy --policy-arn "arn:aws:iam::aws:policy/PowerUserAccess" \
+  --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName|length(@),Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null
+```
+
+### Only when FAST!=yes — deep per-policy scan (slow on large accounts)
+
+> **Skip this section if `FAST=yes`.** On accounts with thousands of policies, this loop makes one API call per policy and can take 30+ minutes due to IAM throttling.
+
 ```bash
-echo "=== Policies with admin access ==="
+echo "=== Customer-managed policies with admin access ==="
 for arn in $(aws iam list-policies --scope Local --query 'Policies[].Arn' --output text); do
   ver=$(aws iam get-policy --policy-arn "$arn" --query 'Policy.DefaultVersionId' --output text)
   doc=$(aws iam get-policy-version --policy-arn "$arn" --version-id "$ver" --query 'PolicyVersion.Document' --output json 2>/dev/null)
   echo "$doc" | jq -e '.Statement[] | select(.Effect=="Allow" and .Action=="*" and .Resource=="*")' >/dev/null 2>&1 && echo "ADMIN-POLICY: $arn"
 done
 
-echo "=== Users/roles with AdministratorAccess ==="
-for arn in "arn:aws:iam::policy/AdministratorAccess" "arn:aws:iam::policy/IAMFullAccess"; do
-  full_arn="arn:aws:iam::policy/${arn##*/}"
-  managed_arn="arn:aws:iam::aws:policy/${arn##*/}"
-  aws iam list-entities-for-policy --policy-arn "$managed_arn" --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName,Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null || true
-done
-
 echo "=== Inline policies with wildcards ==="
 for user in $(aws iam list-users --query 'Users[].UserName' --output text); do
   for pol in $(aws iam list-user-policies --user-name "$user" --query 'PolicyNames[]' --output text); do
@@ -111,9 +130,9 @@ done
 
 Flag:
 
-- Customer-managed policies granting `*:*`.
-- Principals with `AdministratorAccess` or `IAMFullAccess`.
-- Inline policies with wildcard actions.
+- Principals with `AdministratorAccess`, `IAMFullAccess`, or `PowerUserAccess`.
+- Customer-managed policies granting `*:*` (deep scan only).
+- Inline policies with wildcard actions (deep scan only).
 - Roles allowing `iam:PassRole` with `*` resource.
 
 ---
diff --git a/workflows/aws/aws-cost-quickscan.md b/workflows/aws/aws-cost-quickscan.md
index 77152dd..d022f8c 100644
--- a/workflows/aws/aws-cost-quickscan.md
+++ b/workflows/aws/aws-cost-quickscan.md
@@ -19,6 +19,7 @@ Quick, **read-only** scan of an AWS account to surface the biggest cost drivers
 - **REGION** — primary region. Default: current default.
 - **ALL_REGIONS** — `yes`/`no`. Default: `no`.
 - **LOOKBACK_DAYS** — Cost Explorer lookback period. Default: `30`.
+- **DEEP** — `yes`/`no`. If `yes`, also check per-instance CPU utilization for EC2 and RDS (requires CloudWatch `GetMetricStatistics`, slower on large fleets). Default: `no`.
 - **REPORT_DIR** — Default: `./aws-cost-quickscan-reports`.
 
 ---
@@ -305,7 +306,48 @@ Flag:
 
 ---
 
-## Step 9 — Savings Plans and Reserved Instances coverage
+## Step 9 — EC2 and RDS utilization analysis (only when DEEP=yes)
+
+> **Skip this step if `DEEP!=yes`.** This makes one CloudWatch API call per instance and can be slow on large fleets (50+ instances). The workflow caps the scan at 50 EC2 instances and 30 RDS instances per region.
+
+```bash
+REGIONS="${ALL_REGIONS_LIST:-$REGION}"
+
+for r in $REGIONS; do
+  echo "=== Low-CPU EC2 instances (avg < 5% over 7d) region=$r ==="
+  for iid in $(aws ec2 describe-instances --region "$r" --filters Name=instance-state-name,Values=running --query 'Reservations[].Instances[].InstanceId' --output text 2>/dev/null | tr -s '[:space:]' '\n' | head -n 50); do
+    avg=$(aws cloudwatch get-metric-statistics --region "$r" \
+      --namespace AWS/EC2 --metric-name CPUUtilization \
+      --dimensions Name=InstanceId,Value=$iid \
+      --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
+      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
+      --period 86400 --statistics Average \
+      --query 'Datapoints[].Average' --output text 2>/dev/null | awk '{s+=$1; n++} END {if(n>0) printf "%.1f", s/n; else print "N/A"}')
+    [ "$avg" != "N/A" ] && [ "$(echo "$avg < 5" | bc 2>/dev/null)" = "1" ] && echo "LOW-CPU: $iid avg=${avg}%"
+  done
+
+  echo "=== Low-CPU RDS instances (avg < 10% over 7d) region=$r ==="
+  for dbid in $(aws rds describe-db-instances --region "$r" --query 'DBInstances[?DBInstanceStatus==`available`].DBInstanceIdentifier' --output text 2>/dev/null | tr -s '[:space:]' '\n' | head -n 30); do
+    avg=$(aws cloudwatch get-metric-statistics --region "$r" \
+      --namespace AWS/RDS --metric-name CPUUtilization \
+      --dimensions Name=DBInstanceIdentifier,Value=$dbid \
+      --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
+      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
+      --period 86400 --statistics Average \
+      --query 'Datapoints[].Average' --output text 2>/dev/null | awk '{s+=$1; n++} END {if(n>0) printf "%.1f", s/n; else print "N/A"}')
+    [ "$avg" != "N/A" ] && [ "$(echo "$avg < 10" | bc 2>/dev/null)" = "1" ] && echo "LOW-CPU-RDS: $dbid avg=${avg}%"
+  done
+done
+```
+
+Flag:
+- EC2 instances with avg CPU < 5% — candidates for downsizing or termination.
+- RDS instances with avg CPU < 10% — candidates for smaller instance class.
+- Cross-reference with instance type to estimate savings from right-sizing.
+
+---
+
+## Step 10 — Savings Plans and Reserved Instances coverage
 
 // turbo
 
@@ -334,7 +376,7 @@ Flag:
 
 ---
 
-## Step 10 — Generate report
+## Step 11 — Generate report
 
 Compile all findings into a timestamped Markdown report:
 
diff --git a/workflows/iac/terraform-plan-review.md b/workflows/iac/terraform-plan-review.md
index aa551b1..768fb32 100644
--- a/workflows/iac/terraform-plan-review.md
+++ b/workflows/iac/terraform-plan-review.md
@@ -25,6 +25,32 @@ Feed in a `terraform plan` (text, JSON, or saved plan file) and get a plain-Engl
 
 ---
 
+## Step 0 — Generate the plan (if not already available)
+
+If the user doesn't have a plan output yet, help them generate one:
+
+```bash
+# Option A: Generate text plan
+cd <terraform-directory>
+terraform init                   # needed once: plan reads current state through the configured backend
+terraform plan -no-color 2>&1 | tee /tmp/tf-plan.txt
+
+# Option B: Generate JSON plan (richer, recommended)
+terraform plan -out=/tmp/tf-plan.bin
+terraform show -json /tmp/tf-plan.bin > /tmp/tf-plan.json
+
+# Option C: If the user only has a saved binary plan file
+terraform show -json <planfile> > /tmp/tf-plan.json
+
+# Option D: If using Terragrunt
+terragrunt plan -out=/tmp/tf-plan.bin
+terragrunt show -json /tmp/tf-plan.bin > /tmp/tf-plan.json
+```
+
+> **Note:** `terraform plan` needs an initialized backend whenever the configuration declares one, because it reads current state to compute the diff. `terraform init -backend=false` only downloads providers and modules, which is enough for `terraform validate` but not for `plan`. If the user has already run `terraform init`, skip it.
+
+---
+
 ## Step 1 — Ingest and parse the plan
 
 If the input is a binary plan file, convert it:

From 7472060aeff80b0ab4f0031c8810c2d7c652f046 Mon Sep 17 00:00:00 2001
From: Sergei Olshanetski <solshanetski@proofpoint.com>
Date: Thu, 14 May 2026 21:59:48 -0400
Subject: [PATCH 2/2] Add new workflows, CI, and CHANGELOG

Workflows:
- helm-chart-review: Helm chart best practices review (kubernetes/)
- secrets-leak-scan: git history secret scanner (security/)
- incident-triage: guided first 15 min of an incident (observability/)

Improvements:
- aws-account-audit: FAST=yes mode for large accounts
- aws-cost-quickscan: DEEP=yes for EC2/RDS CPU utilization
- terraform-plan-review: Step 0 plan generation commands
- k8s-debug: enhanced logs, restart timeline, HPA checks

Prompts:
- pr-description.md
- explain-like-a-senior.md

Scripts:
- aws-whoami.sh
- stale-branches.sh

Repo:
- GitHub Actions CI (lint, link check, frontmatter validation)
- CHANGELOG.md
---
 .github/.markdownlint.json                 |   8 +
 .github/mlc-config.json                    |  11 +
 .github/workflows/ci.yml                   |  71 +++++
 CHANGELOG.md                               |  64 +++++
 README.md                                  |  24 +-
 scripts/aws-whoami.sh                      |   0
 scripts/stale-branches.sh                  |   0
 workflows/kubernetes/helm-chart-review.md  | 254 ++++++++++++++++++
 workflows/observability/incident-triage.md | 289 +++++++++++++++++++++
 workflows/security/secrets-leak-scan.md    | 221 ++++++++++++++++
 10 files changed, 938 insertions(+), 4 deletions(-)
 create mode 100644 .github/.markdownlint.json
 create mode 100644 .github/mlc-config.json
 create mode 100644 .github/workflows/ci.yml
 create mode 100644 CHANGELOG.md
 mode change 100644 => 100755 scripts/aws-whoami.sh
 mode change 100644 => 100755 scripts/stale-branches.sh
 create mode 100644 workflows/kubernetes/helm-chart-review.md
 create mode 100644 workflows/observability/incident-triage.md
 create mode 100644 workflows/security/secrets-leak-scan.md
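
The frontmatter check in `.github/workflows/ci.yml` iterates over `workflows/**/*.md`, and how far that pattern reaches depends on bash's `globstar` option: with it off, `**` behaves like `*` and matches exactly one directory level. The sketch below illustrates the difference under stated assumptions: bash 4 or newer, a hypothetical `count_md` helper, and a throwaway directory tree that is not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count files matched by workflows/**/*.md under a root,
# with globstar switched on or off. Assumes bash >= 4 (globstar does not exist in 3.2).
count_md() {
  local root=$1 mode=$2
  (
    cd "$root" || exit 1
    shopt -s nullglob              # unmatched patterns expand to nothing
    if [ "$mode" = "on" ]; then shopt -s globstar; else shopt -u globstar; fi
    files=(workflows/**/*.md)
    echo "${#files[@]}"
  )
}

# Throwaway tree with Markdown files at three depths.
demo=$(mktemp -d)
mkdir -p "$demo/workflows/aws/nested"
touch "$demo/workflows/top.md" \
      "$demo/workflows/aws/audit.md" \
      "$demo/workflows/aws/nested/deep.md"

count_md "$demo" off   # prints 1: only workflows/aws/audit.md
count_md "$demo" on    # prints 3: every depth, including workflows/top.md
```

With the current one-level layout (`workflows/<area>/<name>.md`) both settings see the same files; the difference only shows up once a workflow is nested deeper or placed directly under `workflows/`.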

diff --git a/.github/.markdownlint.json b/.github/.markdownlint.json
new file mode 100644
index 0000000..44c0ac1
--- /dev/null
+++ b/.github/.markdownlint.json
@@ -0,0 +1,8 @@
+{
+  "default": true,
+  "MD013": false,
+  "MD033": false,
+  "MD041": false,
+  "MD024": { "siblings_only": true },
+  "MD046": { "style": "fenced" }
+}
diff --git a/.github/mlc-config.json b/.github/mlc-config.json
new file mode 100644
index 0000000..c6eac5c
--- /dev/null
+++ b/.github/mlc-config.json
@@ -0,0 +1,11 @@
+{
+  "ignorePatterns": [
+    {
+      "pattern": "^http://localhost"
+    },
+    {
+      "pattern": "^http://prometheus"
+    }
+  ],
+  "aliveStatusCodes": [200, 206, 301, 302, 403]
+}
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
new file mode 100644
index 0000000..fc42c01
--- /dev/null
+++ b/.github/workflows/ci.yml
@@ -0,0 +1,71 @@
+name: CI
+
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+    branches: [main, master]
+
+jobs:
+  lint:
+    name: Lint & Validate
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Check markdown links
+        uses: gaurav-nelson/github-action-markdown-link-check@v1
+        with:
+          use-quiet-mode: 'yes'
+          config-file: '.github/mlc-config.json'
+        continue-on-error: true
+
+      - name: Lint markdown
+        uses: DavidAnson/markdownlint-cli2-action@v19
+        with:
+          globs: '**/*.md'
+          config: '.github/.markdownlint.json'
+        continue-on-error: true
+
+      - name: Validate workflow frontmatter
+        run: |
+          echo "Checking all workflows have frontmatter..."
+          errors=0; shopt -s globstar   # let ** recurse so the loop covers the same files as the find count
+          for f in workflows/**/*.md; do
+            if ! head -1 "$f" | grep -q '^---$'; then
+              echo "❌ Missing frontmatter: $f"
+              errors=$((errors + 1))
+            fi
+          done
+          echo "Checked $(find workflows -name '*.md' | wc -l) workflows, $errors missing frontmatter"
+          [ "$errors" -eq 0 ] && echo "✅ All workflows have frontmatter"
+
+      - name: Check README workflow table matches files
+        run: |
+          echo "Checking README links match actual files..."
+          errors=0
+          for f in $(grep -oE '\./workflows/[^)]+\.md' README.md); do
+            if [ ! -f "$f" ]; then
+              echo "❌ README links to $f but file doesn't exist"
+              errors=$((errors + 1))
+            fi
+          done
+          echo "Checked $(grep -oE '\./workflows/[^)]+\.md' README.md | wc -l) README links, $errors broken"
+          [ "$errors" -eq 0 ] && echo "✅ All README links are valid"
+
+      - name: Check scripts are executable
+        run: |
+          status=0
+          for f in scripts/*.sh; do
+            [ -f "$f" ] || continue
+            [ -x "$f" ] || { echo "❌ Not executable: $f"; status=1; }
+          done
+          exit "$status"
+
+      - name: Shellcheck scripts
+        run: |
+          if command -v shellcheck >/dev/null; then
+            shellcheck scripts/*.sh || true
+          else
+            echo "shellcheck not available, skipping"
+          fi
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..4c15f2f
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,64 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+## [Unreleased]
+
+### Added — Workflows
+- **`/helm-chart-review`** — review Helm charts for security, reliability, and best practices (kubernetes/)
+- **`/secrets-leak-scan`** — scan git repos for leaked secrets using gitleaks, trufflehog, or regex (security/)
+- **`/incident-triage`** — guided first 15 minutes of a production incident (observability/)
+
+### Added — Prompts
+- **`pr-description.md`** — generate PR descriptions from diffs
+- **`explain-like-a-senior.md`** — explain infrastructure code to junior engineers
+
+### Added — Scripts
+- **`aws-whoami.sh`** — quick AWS identity and account context check
+- **`stale-branches.sh`** — list git branches older than N days
+
+### Added — CI
+- GitHub Actions CI: markdown lint, link check, frontmatter validation, README link verification
+
+### Improved
+- **`/aws-account-audit`** — added `FAST=yes` input to skip slow per-policy IAM loops on large accounts
+- **`/aws-cost-quickscan`** — added `DEEP=yes` input for per-instance CPU utilization analysis
+- **`/terraform-plan-review`** — added Step 0 with plan generation commands (including Terragrunt)
+- **`/k8s-debug`** — enhanced log analysis (Step 5) with init container logs, structured error extraction, severity classification, and "noisiest pods" scan; added restart timeline analysis (Step 6a) and HPA health check (Step 6b); expanded triage cheat-sheet with startup-order, Redis, autoscaling, and webhook patterns
+
+---
+
+## [0.1.0] — 2026-05-04
+
+### Added — Workflows
+- **`/k8s-debug`** — general-purpose Kubernetes cluster debugger (kubernetes/)
+- **`/k8s-workload-debug`** — deep-dive on a single workload (kubernetes/)
+- **`/k8s-rbac-audit`** — RBAC security audit (kubernetes/)
+- **`/k8s-cost-hotspots`** — cost and waste analysis (kubernetes/)
+- **`/k8s-upgrade-readiness`** — pre-flight checks for K8s upgrades (kubernetes/)
+- **`/helm-release-debug`** — diagnose stuck or failed Helm releases (kubernetes/)
+- **`/aws-account-audit`** — AWS account security audit (aws/)
+- **`/aws-cost-quickscan`** — AWS cost waste analysis (aws/)
+- **`/aws-vpc-debug`** — VPC connectivity triage (aws/)
+- **`/aws-iam-policy-review`** — IAM policy risk analysis (aws/)
+- **`/terraform-plan-review`** — Terraform plan risk analysis (iac/)
+- **`/ci-debug`** — CI/CD pipeline failure diagnosis (cicd/)
+- **`/jenkins-pipeline-review`** — Jenkinsfile code review (cicd/)
+- **`/dockerfile-review`** — Dockerfile security and optimization review (containers/)
+
+### Added — Prompts
+- **`incident-commander.md`** — incident commander system prompt
+- **`postmortem-writer.md`** — blameless post-mortem generator
+- **`code-review-devops.md`** — DevOps code review prompt
+
+### Added — Rules
+- **`devops-agent.windsurfrules`** — AI safety guardrails for DevOps repos
+
+### Added — Scripts
+- **`k8s-snapshot.sh`** — cluster state snapshot to Markdown
+
+### Added — Repo
+- Repository structure: workflows/, prompts/, rules/, scripts/
+- README.md with full documentation
+- CONTRIBUTING.md with workflow design rules
+- MIT License
diff --git a/README.md b/README.md
index 2845af5..f53ff63 100644
--- a/README.md
+++ b/README.md
@@ -25,6 +25,7 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
 | [k8s-cost-hotspots](./workflows/kubernetes/k8s-cost-hotspots.md) | `/k8s-cost-hotspots` | Find waste: over-provisioned workloads, missing requests/limits, idle workloads, orphan PVCs/PVs, idle LoadBalancers. | `kubectl`, `jq`, metrics-server. |
 | [k8s-upgrade-readiness](./workflows/kubernetes/k8s-upgrade-readiness.md) | `/k8s-upgrade-readiness` | Pre-flight before a control-plane / node upgrade: deprecated APIs, version skew, PDB gaps, expiring certs, broken webhooks. | `kubectl`. Optional: `kubent` or `pluto`, `helm`. |
 | [helm-release-debug](./workflows/kubernetes/helm-release-debug.md) | `/helm-release-debug` | Diagnose a stuck or failed Helm release: history, values diff, hook failures, rendered manifest vs cluster, workload health. | `helm` v3, `kubectl`. Optional: `jq`, `yq`. |
+| [helm-chart-review](./workflows/kubernetes/helm-chart-review.md) | `/helm-chart-review` | Review a Helm chart for security, reliability, and best practices: resource specs, probes, security context, PDBs, anti-affinity, RBAC. | Helm chart source. Optional: `helm` CLI. |
 
 ### AWS / Cloud
 
@@ -49,6 +50,18 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
 | [jenkins-pipeline-review](./workflows/cicd/jenkins-pipeline-review.md) | `/jenkins-pipeline-review` | Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or `vars/*.groovy`. Optional: `repositories_v2.json`. |
 | [dockerfile-review](./workflows/containers/dockerfile-review.md) | `/dockerfile-review` | Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: `docker`, `trivy`. |
 
+### Security
+
+| Workflow | Slash command | Description | Prerequisites |
+|---|---|---|---|
+| [secrets-leak-scan](./workflows/security/secrets-leak-scan.md) | `/secrets-leak-scan` | Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: `gitleaks`, `trufflehog`. |
+
+### Observability & Incident
+
+| Workflow | Slash command | Description | Prerequisites |
+|---|---|---|---|
+| [incident-triage](./workflows/observability/incident-triage.md) | `/incident-triage` | Guided first 15 minutes of a production incident: timeline, blast radius, evidence gathering, mitigation suggestions. | Access to affected environment. |
+
 More on the way — see [Roadmap](#roadmap).
 
 ## Prompts
@@ -60,6 +73,8 @@ Reusable system prompts you can paste into any AI agent for common DevOps tasks:
 | [incident-commander](./prompts/incident-commander.md) | Puts the AI in incident-commander mode: timeline, blast radius, action tracking, status updates. |
 | [postmortem-writer](./prompts/postmortem-writer.md) | Generates a blameless post-mortem from incident notes: timeline, root cause, impact, action items. |
 | [code-review-devops](./prompts/code-review-devops.md) | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. |
+| [pr-description](./prompts/pr-description.md) | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. |
+| [explain-like-a-senior](./prompts/explain-like-a-senior.md) | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. |
 
 ## Rules
 
@@ -76,6 +91,8 @@ Standalone shell utilities referenced by workflows or useful on their own:
 | Script | Usage |
 |---|---|
 | [k8s-snapshot.sh](./scripts/k8s-snapshot.sh) | `./k8s-snapshot.sh [namespace\|all] [output-dir]` — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. |
+| [aws-whoami.sh](./scripts/aws-whoami.sh) | `./aws-whoami.sh [profile]` — quick AWS identity check: caller, region, account alias, org, SSO role. |
+| [stale-branches.sh](./scripts/stale-branches.sh) | `./stale-branches.sh [days] [--remote]` — list git branches older than N days with last commit info. |
 
 ## Using a workflow
 
@@ -100,7 +117,9 @@ devops-ai-workflows/
 │   ├── aws/                 # AWS / cloud workflow definitions
 │   ├── iac/                 # Infrastructure as Code workflows
 │   ├── cicd/                # CI/CD pipeline workflows
-│   └── containers/          # Container & image workflows
+│   ├── containers/          # Container & image workflows
+│   ├── security/            # Security & repo hygiene workflows
+│   └── observability/       # Observability & incident workflows
 ├── prompts/                 # Reusable LLM prompts
 ├── rules/                   # Editor/agent rule files
 ├── scripts/                 # Standalone shell helpers
@@ -127,12 +146,10 @@ Ideas I plan to add (PRs welcome):
 - [ ] `/image-cve-triage` — prioritise CVE scanner output by exploitability + fix availability
 - [ ] `/github-actions-review` — security review of GitHub Actions workflow files
 - [ ] `/release-checklist` — pre-release gate
-- [ ] `/helm-chart-review` — review Helm chart for missing resources/limits, PDB, anti-affinity, template issues
 
 **Observability & incident**
 - [ ] `/prometheus-query-helper` — intent → PromQL with rationale
 - [ ] `/log-pattern-extract` — cluster repeated errors out of a log dump
-- [ ] `/incident-triage` — guided first 15 minutes of an incident
 - [ ] `/postmortem` — blameless post-mortem from a transcript
 - [ ] `/runbook-from-incident` — turn a resolved incident into a reusable runbook
 
@@ -144,7 +161,6 @@ Ideas I plan to add (PRs welcome):
 - [ ] `/db-migration-review` — flag risky migration patterns
 
 **Security & repo hygiene**
-- [ ] `/secrets-leak-scan` — gitleaks/trufflehog over full git history
 - [ ] `/cve-impact-assessment` — given a CVE, check whether your stack is affected
 - [ ] `/repo-health` — README, license, CI, branch protection, stale branches
 - [ ] `/dependency-upgrade-plan` — group outdated deps by risk and suggest batching
diff --git a/scripts/aws-whoami.sh b/scripts/aws-whoami.sh
old mode 100644
new mode 100755
diff --git a/scripts/stale-branches.sh b/scripts/stale-branches.sh
old mode 100644
new mode 100755
diff --git a/workflows/kubernetes/helm-chart-review.md b/workflows/kubernetes/helm-chart-review.md
new file mode 100644
index 0000000..1a80eec
--- /dev/null
+++ b/workflows/kubernetes/helm-chart-review.md
@@ -0,0 +1,254 @@
+---
+description: Review a Helm chart for security, reliability, and best practices before deployment. Checks templates, values, resource specs, and RBAC. Read-only static analysis.
+---
+
+# /helm-chart-review — Helm Chart Best Practices Review
+
+Static analysis of a Helm chart **before deployment**. Checks templates, `values.yaml`, resource specifications, security context, RBAC, and packaging. Flags missing best practices that cause production incidents.
+
+> This reviews chart **source code**. For diagnosing a **live broken Helm release**, use `/helm-release-debug` instead.
+
+## Prerequisites
+
+- Helm chart source directory or `.tgz` archive.
+- Optional: `helm` CLI (for `helm template`, `helm lint`).
+- Optional: `kubectl` (for dry-run validation against a cluster).
+- No cluster access required for basic review.
+
+## Inputs
+
+- **CHART_PATH** *(required)* — path to the chart directory or `.tgz` file.
+- **VALUES_FILE** — optional custom values file to review alongside defaults.
+- **REPORT_DIR** — Default: `./helm-chart-review-reports`.
+
+---
+
+## Step 1 — Chart structure and metadata
+
+// turbo
+
+```bash
+# Validate chart structure
+ls -la "$CHART_PATH"/
+cat "$CHART_PATH"/Chart.yaml
+head -100 "$CHART_PATH"/values.yaml
+
+# Helm lint
+helm lint "$CHART_PATH" 2>&1
+helm lint "$CHART_PATH" --strict 2>&1
+
+# Template render (catch errors before deploy)
+helm template test-release "$CHART_PATH" 2>&1 | head -200
+```
+
+Check:
+
+- `Chart.yaml` has `version`, `appVersion`, `description`, `maintainers`.
+- `apiVersion: v2` (Helm 3). Flag `v1` charts (Helm 2 legacy).
+- Dependencies declared in `Chart.yaml` or `requirements.yaml` (legacy).
+- `helm lint --strict` passes with no warnings.
+- `helm template` renders without errors.
+
+---
+
+## Step 2 — Resource specifications
+
+For every Deployment, StatefulSet, DaemonSet, Job in the templates, check:
+
+### Resource requests and limits
+
+```yaml
+# ✅ Good
+resources:
+  requests:
+    cpu: 100m
+    memory: 128Mi
+  limits:
+    cpu: 500m
+    memory: 512Mi
+
+# ❌ Bad — no resources at all
+# ❌ Bad — limits without requests
+# ⚠️ Caution — requests == limits (Guaranteed QoS, may be wasteful)
+```
+
+Flag:
+- Containers with no `resources.requests` → scheduling problems, noisy neighbors.
+- Containers with no `resources.limits` → can consume unbounded resources.
+- Memory limits much larger than requests → overcommitment risk.
+
+### Probes
+
+```yaml
+# ✅ Should have all three
+readinessProbe: ...   # When to send traffic
+livenessProbe: ...    # When to restart
+startupProbe: ...     # Grace period for slow-starting apps
+```
+
+Flag:
+- No `readinessProbe` → traffic sent before app is ready.
+- No `livenessProbe` → stuck pods never restart.
+- `livenessProbe` same as `readinessProbe` → may cause restart loops under load.
+- `initialDelaySeconds` too low → premature restarts during startup.
+- No `startupProbe` on apps known to have slow startup.
+
+---
+
+## Step 3 — Security
+
+### Pod security context
+
+```yaml
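+# ✅ Good: pod level (spec.securityContext)
+securityContext:
+  runAsNonRoot: true
+  runAsUser: 1000
+  fsGroup: 1000
+# ✅ Good: container level (spec.containers[].securityContext)
+securityContext:
+  readOnlyRootFilesystem: true
+  allowPrivilegeEscalation: false
+  capabilities:
+    drop: ["ALL"]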
+
+Flag:
+- No `securityContext` at all → container runs as whatever user the image defines, often root.
+- `privileged: true` → full host access.
+- `allowPrivilegeEscalation: true` or missing → container can escalate.
+- `capabilities` not dropped → unnecessary kernel capabilities.
+- `hostNetwork: true`, `hostPID: true`, `hostIPC: true` → breaks isolation.
+- `readOnlyRootFilesystem: false` or missing → writable root fs.
+
+### RBAC
+
+If the chart creates `ClusterRole`, `ClusterRoleBinding`, `Role`, `RoleBinding`:
+
+- Flag `ClusterRole` with `*` verbs or `*` resources.
+- Flag `ClusterRoleBinding` to `default` ServiceAccount.
+- Flag any binding to `cluster-admin`.
+- Prefer `Role`+`RoleBinding` (namespace-scoped) over `ClusterRole`+`ClusterRoleBinding`.
+
+### Secrets
+
+- Flag `Secret` resources with hardcoded values in templates.
+- Prefer `existingSecret` pattern (reference external secrets).
+- Flag secrets in `ConfigMap` (should be `Secret`).
+- Check if `values.yaml` has password/token fields with default values.
+
+---
+
+## Step 4 — High availability and resilience
+
+### Replicas and PDB
+
+Flag:
+- `replicas: 1` for production workloads → single point of failure.
+- No `PodDisruptionBudget` for multi-replica Deployments/StatefulSets.
+- PDB with `maxUnavailable: 0` → blocks all voluntary disruptions (node drain).
+- PDB with `minAvailable` equal to `replicas` → same problem.
+
+### Anti-affinity
+
+```yaml
+# ✅ Good — spread across nodes
+affinity:
+  podAntiAffinity:
+    preferredDuringSchedulingIgnoredDuringExecution:
+      - weight: 100
+        podAffinityTerm:
+          labelSelector:
+            matchExpressions:
+              - key: app
+                operator: In
+                values: ["myapp"]
+          topologyKey: kubernetes.io/hostname
+```
+
+Flag:
+- Multi-replica workloads with no anti-affinity or `topologySpreadConstraints` → all replicas may land on one node.
+- `requiredDuringSchedulingIgnoredDuringExecution` anti-affinity on small clusters → pods may not schedule.
+
+### Update strategy
+
+- Deployments: `RollingUpdate` with `maxSurge` and `maxUnavailable` configured.
+- StatefulSets: `RollingUpdate` with `partition` for staged rollouts.
+- DaemonSets: `RollingUpdate` with `maxUnavailable`.
+- Flag `Recreate` strategy on production Deployments (causes downtime).
+
+---
+
+## Step 5 — Networking
+
+- **Service type** — flag `LoadBalancer` without annotation for internal LB (may create public LB).
+- **Ingress** — check for TLS configuration, valid hosts, path types.
+- **NetworkPolicy** — flag charts with no NetworkPolicy (all traffic allowed).
+- **Service ports** — named ports match container ports.
+- **Service selectors** — match pod labels.
+
+---
+
+## Step 6 — Storage
+
+- **PVC templates** in StatefulSets — check `storageClassName`, access modes, size.
+- **EmptyDir** with no `sizeLimit` → can fill node disk.
+- **HostPath** volumes → breaks portability, security risk.
+- **Volume mounts** — check for unnecessary write access.
+
+---
+
+## Step 7 — Values and configurability
+
+Review `values.yaml`:
+
+- **Image tag** — flag `latest` or missing tag. Should default to `appVersion` from `Chart.yaml` or a pinned tag.
+- **Image pull policy** — should be `IfNotPresent` for tagged images, `Always` only for `latest`.
+- **Configurable resource limits** — requests/limits should be in values, not hardcoded in templates.
+- **Environment-specific values** — check if the chart supports different envs via values overlays.
+- **Sensitive defaults** — flag default passwords, tokens, or keys in `values.yaml`.
+
+---
+
+## Step 8 — Generate report
+
+Compile findings into a timestamped Markdown report:
+
+```
+$REPORT_DIR/helm-chart-review-<chart-name>-<YYYYMMDD-HHMMSS>.md
+```
+
+### Report structure
+
+```markdown
+# Helm Chart Review Report
+
+| Field | Value |
+|---|---|
+| Generated | <timestamp> |
+| Chart | <name> v<version> |
+| App version | <appVersion> |
+| Templates | <count> |
+| Risk level | 🔴 / 🟡 / 🟢 |
+
+## Summary
+<overall assessment>
+
+## Findings
+### 🔴 Critical
+### 🟡 Warning
+### 🔵 Info
+
+## Template-by-template breakdown
+<per-template analysis>
+
+## Recommended changes
+<prioritized with YAML examples>
+```
+
+---
+
+## Safety rules
+
+- This workflow is **entirely read-only**. No charts are installed, upgraded, or deleted.
+- `helm template` renders locally — it does not contact a cluster.
+- `helm lint` is a local static check.
+- Never print secret values from `values.yaml`. Flag their presence but redact.
diff --git a/workflows/observability/incident-triage.md b/workflows/observability/incident-triage.md
new file mode 100644
index 0000000..c1f04c0
--- /dev/null
+++ b/workflows/observability/incident-triage.md
@@ -0,0 +1,289 @@
+---
+description: Guided first 15 minutes of a production incident. Establishes timeline, assesses blast radius, gathers evidence, and coordinates response. Read-only investigation commands.
+---
+
+# /incident-triage — First 15 Minutes of an Incident
+
+Structured triage workflow for the critical first 15 minutes of a production incident. Guides you through timeline establishment, blast radius assessment, evidence gathering, and initial mitigation — with concrete commands for Kubernetes, AWS, and general infrastructure.
+
+## Prerequisites
+
+- Access to the affected environment (kubectl, AWS CLI, monitoring dashboards).
+- This workflow uses **read-only** commands only. Mitigation actions are suggested but not executed automatically.
+
+## Inputs
+
+- **INCIDENT** *(required)* — brief description of the symptoms (e.g., "scores-api returning 500s", "high latency on checkout", "pods crashing in prod").
+- **ENVIRONMENT** — `prod` / `staging` / `dev`. Default: `prod`.
+- **AFFECTED_SERVICE** — service name if known.
+- **REPORT_DIR** — Default: `./incident-triage-reports`.
+
+---
+
+## Minute 0–2: Declare and orient
+
+### Establish the basics
+
+Ask the user (or determine from context):
+
+1. **What are the symptoms?** (errors, latency, downtime, data issue)
+2. **When did it start?** (first alert, first customer report, when you noticed)
+3. **Who reported it?** (alert, customer, internal)
+4. **What environment?** (prod, staging, which region/cluster)
+5. **What changed recently?** (deploys, config changes, infra changes, maintenance windows)
+
+### Check recent deployments
+
+```bash
+# Kubernetes: recent rollouts (each rollout creates a ReplicaSet; newest last)
+kubectl get replicasets -A --sort-by=.metadata.creationTimestamp 2>/dev/null | tail -30
+
+# Helm: recent releases
+helm ls -A --date 2>/dev/null | tail -20
+
+# Git: recent deploys (if deploy tags exist)
+git log --oneline --since="6 hours ago" --all 2>/dev/null | head -20
+
+# AWS: recent CloudFormation events
+aws cloudformation describe-stack-events --stack-name <stack> --query 'StackEvents[:10].[Timestamp,ResourceStatus,LogicalResourceId,ResourceStatusReason]' --output table 2>/dev/null
+```
+
+### Draft initial status
+
+```
+🔴 Incident declared: <title>
+Time: <HH:MM UTC>
+Severity: <SEV1/SEV2/SEV3>
+Impact: <who/what is affected>
+Status: Investigating
+IC: <your name>
+Next update in 15 minutes.
+```
+
+---
+
+## Minute 2–5: Assess blast radius
+
+### What's broken?
+
+```bash
+# Kubernetes: cluster health snapshot
+kubectl get nodes -o wide
+kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide
+kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -30
+
+# If service is known:
+kubectl get pods -n <ns> -l app=<service> -o wide
+kubectl describe deploy -n <ns> <service> | tail -30
+
+# AWS: service health
+aws health describe-events --filter eventStatusCodes=open --query 'events[].{Service:service,Status:statusCode,Description:eventTypeCode}' --output table 2>/dev/null || true
+```
+
+### Who's affected?
+
+```bash
+# Check error rates (if Prometheus/metrics available)
+# Substitute your actual metric names
+curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))' 2>/dev/null | jq '.data.result'
+
+# Check ALB/NLB metrics (AWS)
+aws cloudwatch get-metric-statistics \
+  --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count \
+  --dimensions Name=LoadBalancer,Value=<lb-name> \
+  --start-time $(date -u -v-30M +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
+  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
+  --period 60 --statistics Sum 2>/dev/null
+```
+
+### Quantify impact
+
+| Question | How to determine |
+|---|---|
+| Error rate | Prometheus, CloudWatch, APM |
+| Affected users (%) | Compare error rate to total request rate |
+| Which regions/AZs | Check per-region metrics, node distribution |
+| Data loss risk | Check database health, replication status |
+| Revenue impact | Failed requests (error rate × request volume) × average revenue per request |
+
+---
+
+## Minute 5–10: Gather evidence
+
+### Logs from affected service
+
+```bash
+# Kubernetes: recent logs
+kubectl logs -n <ns> -l app=<service> --all-containers --tail=200 --timestamps --since=30m 2>/dev/null | grep -iE 'error|fatal|panic|exception|timeout|refused' | tail -50
+
+# Previous container logs (if restarting)
+for pod in $(kubectl get pods -n <ns> -l app=<service> -o name); do
+  echo "=== $pod previous ==="
+  kubectl logs -n <ns> $pod --previous --tail=100 --timestamps 2>/dev/null | tail -20
+done
+
+# AWS Lambda (if applicable)
+aws logs filter-log-events \
+  --log-group-name "/aws/lambda/<function-name>" \
+  --start-time $(($(date +%s) - 1800))000 \
+  --filter-pattern "ERROR" \
+  --limit 30 2>/dev/null
+```
+
+### Infrastructure state
+
+```bash
+# Kubernetes: resource pressure
+kubectl top nodes 2>/dev/null
+kubectl top pods -n <ns> --sort-by=memory 2>/dev/null | head -20
+
+# Kubernetes: HPA status
+kubectl get hpa -n <ns> -o wide 2>/dev/null
+
+# AWS: EC2/RDS health
+aws ec2 describe-instance-status --filters Name=instance-status.status,Values=impaired --query 'InstanceStatuses[].{Id:InstanceId,Status:InstanceStatus.Status}' --output table 2>/dev/null
+aws rds describe-events --duration 30 --query 'Events[].{Source:SourceIdentifier,Type:EventCategories,Message:Message,Date:Date}' --output table 2>/dev/null
+```
+
+### Network and dependencies
+
+```bash
+# DNS resolution (from inside the cluster)
+kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 --command -- nslookup <service>.<ns>.svc.cluster.local 2>/dev/null
+
+# Endpoint health
+kubectl get endpoints -n <ns> <service> -o wide
+
+# External dependency check
+curl -sSm 5 -o /dev/null -w "status=%{http_code} time=%{time_total}s\n" https://<dependency-endpoint>/health 2>/dev/null || echo "UNREACHABLE"
+```
+
+---
+
+## Minute 10–12: Identify and mitigate
+
+### Common root causes and quick mitigations
+
+| Symptom | Likely cause | Quick mitigation |
+|---|---|---|
+| Pods in CrashLoopBackOff after deploy | Bad code / config in new version | `kubectl rollout undo deploy/<name> -n <ns>` |
+| All pods OOMKilled | Memory leak or insufficient limits | Scale up or increase memory limits |
+| 503s from LB | No healthy targets | Check pod readiness, fix probes |
+| Connection refused to dependency | Dependency is down | Check dependency status, failover |
+| Slow queries / high DB CPU | Bad query or missing index | Identify and kill long-running queries |
+| Certificate expired | TLS cert not renewed | Emergency cert renewal |
+| DNS resolution failing | CoreDNS unhealthy | Restart CoreDNS pods |
+
+### Suggest (don't execute) mitigations
+
+The agent should present mitigation options but **never execute them automatically**:
+
+```
+Suggested mitigations (choose one — confirm before running):
+
+Option A: Rollback to previous version
+  kubectl rollout undo deploy/<service> -n <ns>
+
+Option B: Scale up to handle load
+  kubectl scale deploy/<service> -n <ns> --replicas=<N>
+
+Option C: Restart pods (if stuck state)
+  kubectl rollout restart deploy/<service> -n <ns>
+
+Option D: Disable traffic to the service
+  kubectl scale deploy/<service> -n <ns> --replicas=0
+```
+
+---
+
+## Minute 12–15: Communicate and plan
+
+### Status update
+
+```
+🟡 Update: <title>
+Time: <HH:MM UTC>
+Status: Identified / Mitigating
+What we know:
+  - Root cause: <description>
+  - Impact: <X% of requests affected / Y users impacted>
+  - Started: <HH:MM UTC>
+Current action: <what's being done>
+Next update in 15 minutes.
+```
+
+### Evidence log
+
+Record everything gathered so far:
+
+```markdown
+## Evidence collected at <timestamp>
+
+### Timeline
+- HH:MM — First symptom / alert
+- HH:MM — Investigation started
+- HH:MM — Root cause identified: <description>
+- HH:MM — Mitigation applied: <action>
+
+### Key findings
+- <finding 1>
+- <finding 2>
+
+### Commands run
+- <command 1> → <result summary>
+- <command 2> → <result summary>
+```
+
+---
+
+## Generate triage report
+
+Compile all findings into a timestamped report:
+
+```
+$REPORT_DIR/incident-triage-<service>-<YYYYMMDD-HHMMSS>.md
+```
+
+### Report structure
+
+```markdown
+# Incident Triage Report
+
+| Field | Value |
+|---|---|
+| Generated | <timestamp> |
+| Incident | <description> |
+| Environment | <env> |
+| Service | <service> |
+| Severity | SEV1/SEV2/SEV3 |
+| Duration (so far) | <minutes> |
+
+## Blast radius
+<who/what is affected, error rates, user impact>
+
+## Timeline
+<chronological events>
+
+## Root cause (if identified)
+<description>
+
+## Evidence
+<logs, metrics, command outputs>
+
+## Mitigation applied / recommended
+<what was done or what should be done>
+
+## Next steps
+<follow-up investigation, post-mortem scheduling>
+```
+
+---
+
+## Safety rules
+
+- All investigation commands are **read-only**, with one exception: the DNS check creates a short-lived test pod.
+- **Mitigation commands are suggested but never executed automatically.** The user must explicitly confirm any write/mutation operation.
+- Never print secret values from logs or configs.
+- The DNS test pod (`dns-test`) uses `--rm` and auto-deletes.
+- If kubectl/AWS commands fail due to permissions, record the failure and continue.
+- Always confirm the target environment before suggesting any mitigation.
diff --git a/workflows/security/secrets-leak-scan.md b/workflows/security/secrets-leak-scan.md
new file mode 100644
index 0000000..2ad0f06
--- /dev/null
+++ b/workflows/security/secrets-leak-scan.md
@@ -0,0 +1,221 @@
+---
+description: Scan a git repository for leaked secrets across full history. Uses gitleaks, trufflehog, or manual regex patterns. Read-only, generates a markdown report.
+---
+
+# /secrets-leak-scan — Git Repository Secrets Scanner
+
+Scan a git repository's **full commit history** for accidentally committed secrets: API keys, passwords, tokens, private keys, connection strings, and credentials. Uses `gitleaks`, `trufflehog`, or falls back to manual regex patterns. **Read-only** — nothing is modified.
+
+## Prerequisites
+
+- A git repository (local clone).
+- Recommended: `gitleaks` or `trufflehog` installed (the workflow will detect which is available).
+- Fallback: `grep` and `git log` (always available, less accurate).
+
+## Inputs
+
+- **REPO_PATH** *(required)* — path to the git repository root.
+- **SCAN_SCOPE** — `full` (entire git history) or `recent` (last 30 days / last 100 commits). Default: `full`.
+- **REPORT_DIR** — Default: `./secrets-leak-scan-reports`.
+
+---
+
+## Step 1 — Detect available tools
+
+// turbo
+
+```bash
+echo "=== Available scanners ==="
+command -v gitleaks >/dev/null && echo "gitleaks: $(gitleaks version 2>&1)" || echo "gitleaks: not installed"
+command -v trufflehog >/dev/null && echo "trufflehog: $(trufflehog --version 2>&1)" || echo "trufflehog: not installed"
+echo "git: $(git --version)"
+echo "grep: available"
+
+echo ""
+echo "=== Repository info ==="
+cd "$REPO_PATH"
+echo "Repo: $(basename "$(git rev-parse --show-toplevel)")"
+echo "Branch: $(git branch --show-current)"
+echo "Commits: $(git rev-list --count HEAD)"
+echo "Remotes: $(git remote -v | head -2)"
+```
+
+---
+
+## Step 2 — Run primary scanner
+
+### Option A: gitleaks (preferred)
+
+```bash
+cd $REPO_PATH
+
+# Full history scan
+gitleaks detect --source . --report-format json --report-path /tmp/gitleaks-report.json --verbose 2>&1
+
+# Or recent only
+gitleaks detect --source . --log-opts="--since=30.days.ago" --report-format json --report-path /tmp/gitleaks-report.json --verbose 2>&1
+
+# Parse results
+cat /tmp/gitleaks-report.json | jq -r '.[] | "\(.RuleID)\t\(.File)\tcommit=\(.Commit[:8])\tauthor=\(.Author)\tdate=\(.Date)"' | head -50
+```
+
+### Option B: trufflehog
+
+```bash
+cd $REPO_PATH
+
+# Full history scan
+trufflehog git file://. --json 2>/dev/null | jq -r '.SourceMetadata.Data.Git | "\(.file) commit=\(.commit[:8]) email=\(.email)"' | head -50
+
+# Recent only
+trufflehog git file://. --since-commit=$(git rev-list -1 --before="30 days ago" HEAD) --json 2>/dev/null | head -50
+```
+
+### Option C: Manual regex fallback
+
+If neither tool is installed, fall back to git log + grep:
+
+```bash
+cd $REPO_PATH
+
+echo "=== Scanning for common secret patterns ==="
+
+# High-confidence patterns
+git log -p --all --diff-filter=A 2>/dev/null | grep -nE \
+  'AKIA[0-9A-Z]{16}|AIza[0-9A-Za-z_-]{35}|ghp_[0-9a-zA-Z]{36}|gho_[0-9a-zA-Z]{36}|glpat-[0-9a-zA-Z_-]{20}|sk-[0-9a-zA-Z]{48}|xox[bporas]-[0-9a-zA-Z-]+' \
+  | head -30
+
+# AWS keys
+git log -p --all 2>/dev/null | grep -nE 'AKIA[0-9A-Z]{16}' | head -10
+echo ""
+
+# Private keys
+git log -p --all 2>/dev/null | grep -nE 'BEGIN (RSA |DSA |EC |OPENSSH )?PRIVATE KEY' | head -10
+echo ""
+
+# Connection strings
+git log -p --all 2>/dev/null | grep -nE '(mysql|postgres|mongodb|redis)://[^/[:space:]]+:[^@[:space:]]+@' | head -10
+echo ""
+
+# Generic password/secret assignments
+git log -p --all 2>/dev/null | grep -nE '(password|passwd|secret|token|api_key|apikey|access_key|private_key)[[:space:]]*[:=][[:space:]]*["'\''][^[:space:]"'\'']{8,}' | head -20
+echo ""
+
+# .env files committed
+git log --all --name-only --diff-filter=A 2>/dev/null | grep -E '(^|/)\.env(\..+)?$' | sort -u
+echo ""
+
+# Key/cert files committed
+git log --all --name-only --diff-filter=A 2>/dev/null | grep -iE '\.(pem|key|p12|pfx|jks|keystore|cert)$' | sort -u
+```
+
+---
+
+## Step 3 — Triage findings
+
+For each finding, classify:
+
+| Severity | Pattern | Action |
+|---|---|---|
+| 🔴 Critical | AWS access key (`AKIA*`), private key, GCP service account JSON, GitHub PAT (`ghp_*`), Slack token (`xox*`) | Rotate immediately. Check if key is still active. |
+| 🔴 Critical | Database connection string with credentials | Rotate password. Check if DB is exposed. |
+| 🟡 Warning | Generic `password=`, `secret=`, `token=` in config files | May be placeholder/test value — verify if real. |
+| 🟡 Warning | `.env` file committed | Remove from tracking, add to `.gitignore`. |
+| 🔵 Info | Test/mock credentials, example configs, documentation examples | Verify these are not real credentials. |
+
+### Check if secrets are still active
+
+For AWS keys:
+
+```bash
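+# Check if a found AWS key pair is still active (requires aws CLI and the leaked secret key)
+# Key ID only? `aws sts get-access-key-info --access-key-id AKIA...` returns the owning account, not the key's status
+AWS_ACCESS_KEY_ID='AKIA...' AWS_SECRET_ACCESS_KEY='<leaked-secret>' aws sts get-caller-identity 2>&1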
+# "InvalidClientTokenId" = deactivated/deleted (safe)
+# Success = STILL ACTIVE (rotate immediately!)
+```
+
+### Check if secrets are in current HEAD
+
+```bash
+# Is the secret still in the current codebase? (not just history)
+git grep -l 'AKIA...' HEAD 2>/dev/null
+```
+
+If the secret is only in history (not current HEAD), it's still a risk — the git history is accessible to anyone who clones the repo.
+
+---
+
+## Step 4 — Check .gitignore coverage
+
+// turbo
+
+```bash
+cd $REPO_PATH
+
+echo "=== .gitignore check ==="
+cat .gitignore 2>/dev/null || echo "NO .gitignore FILE"
+
+echo ""
+echo "=== Files that should typically be gitignored ==="
+for pattern in ".env" ".env.*" "*.pem" "*.key" "*.p12" "*.pfx" "*.jks" "*.keystore" "credentials" "*.tfvars" "*.tfstate" "terraform.tfstate*" ".terraform/" "secrets.yaml" "secrets.yml"; do
+  found=$(git ls-files "$pattern" 2>/dev/null)
+  [ -n "$found" ] && echo "TRACKED: $found (should be in .gitignore)"
+done
+```
+
+---
+
+## Step 5 — Generate report
+
+Compile findings into a timestamped Markdown report:
+
+```
+$REPORT_DIR/secrets-leak-scan-<repo-name>-<YYYYMMDD-HHMMSS>.md
+```
+
+### Report structure
+
+```markdown
+# Secrets Leak Scan Report
+
+| Field | Value |
+|---|---|
+| Generated | <timestamp> |
+| Repository | <repo name> |
+| Scanner | gitleaks / trufflehog / manual |
+| Scan scope | full history / recent |
+| Commits scanned | <count> |
+| Findings | <count> |
+| Risk level | 🔴 / 🟡 / 🟢 |
+
+## Summary
+<count of findings by severity>
+
+## 🔴 Critical findings
+<secret type, file, commit, still active?>
+
+## 🟡 Warnings
+<potential secrets, .env files, suspicious patterns>
+
+## 🔵 Info
+<test credentials, documentation examples>
+
+## .gitignore gaps
+<files that should be excluded>
+
+## Recommended actions
+1. Rotate all 🔴 critical secrets immediately
+2. Add missing .gitignore patterns
+3. Consider using git-filter-repo to remove secrets from history
+4. Set up pre-commit hooks to prevent future leaks
+```
+
+---
+
+## Safety rules
+
+- This workflow is **entirely read-only**. No files are modified, no secrets are rotated, no git history is rewritten.
+- **Never print full secret values** in the report. Redact to first/last 4 characters: `AKIA****WXYZ`.
+- The `sts get-caller-identity` check for AWS keys is read-only — it does not perform any actions with the key.
+- If a secret is found to be active, recommend rotation but do not rotate it.
+- Suggested remediation commands (git-filter-repo, pre-commit hooks) are provided in the report for the user to evaluate and run manually.