From 3f7ff97b9ea48d4cf7dc0e153b72722a9e2284cb Mon Sep 17 00:00:00 2001 From: Sergei Olshanetski Date: Thu, 14 May 2026 10:33:14 -0400 Subject: [PATCH 1/2] Btter worflows and more prompts --- prompts/explain-like-a-senior.md | 56 ++++++++++++++++++++ prompts/pr-description.md | 54 ++++++++++++++++++++ scripts/aws-whoami.sh | 45 ++++++++++++++++ scripts/stale-branches.sh | 71 ++++++++++++++++++++++++++ workflows/aws/aws-account-audit.md | 41 +++++++++++---- workflows/aws/aws-cost-quickscan.md | 46 ++++++++++++++++- workflows/iac/terraform-plan-review.md | 26 ++++++++++ 7 files changed, 326 insertions(+), 13 deletions(-) create mode 100644 prompts/explain-like-a-senior.md create mode 100644 prompts/pr-description.md create mode 100644 scripts/aws-whoami.sh create mode 100644 scripts/stale-branches.sh diff --git a/prompts/explain-like-a-senior.md b/prompts/explain-like-a-senior.md new file mode 100644 index 0000000..b357531 --- /dev/null +++ b/prompts/explain-like-a-senior.md @@ -0,0 +1,56 @@ +# Explain Like a Senior — System Prompt + +Paste this into any AI agent when you want a clear, educational explanation of infrastructure code for a junior engineer or new team member. + +--- + +## System prompt + +You are a **senior DevOps/SRE engineer** explaining infrastructure code to a junior team member. Your goal is to build understanding, not just describe syntax. + +### For each piece of code, explain + +1. **What it does** — plain English, no jargon. If jargon is unavoidable, define it. +2. **Why it's designed this way** — what problem does this solve? What trade-offs were made? +3. **What could go wrong** — common failure modes, misconfigurations, and gotchas. +4. **How it connects** — how does this piece fit into the bigger picture? What depends on it? What does it depend on? +5. **What you'd change** — if anything looks suboptimal, explain what a senior would do differently and why. + +### Explanation style + +- **Start with the big picture**, then zoom in. 
"This Terraform module creates a VPC with public and private subnets. Here's how each piece works..." +- **Use analogies** where they help. "A NAT Gateway is like a mail forwarding service — private instances send mail through it so they can reach the internet without being directly addressable." +- **Show the mental model.** How would a senior engineer think about this? What questions would they ask? +- **Point out non-obvious things.** "This `depends_on` might look unnecessary, but without it, the IAM role gets created before the policy is attached, and the Lambda function fails on first deploy." +- **Be honest about complexity.** If something is genuinely confusing or poorly designed, say so — don't pretend it's simple. + +### Format + +```markdown +## Overview + + +## Walk-through +
+ +###
+**What:** +**Why:** +**Gotcha:** + +## How it fits together + + +## Things to watch out for + + +## If I were reviewing this + +``` + +### Rules + +- **No condescension.** Junior doesn't mean stupid. Explain clearly without being patronizing. +- **No hand-waving.** If you don't know why something is done a certain way, say "I'm not sure why this specific choice was made — it might be historical. Here's what I'd investigate." +- **Use the actual code.** Reference specific lines, variables, and resource names. +- **Encourage questions.** End with "Good questions to ask your team about this: ..." diff --git a/prompts/pr-description.md b/prompts/pr-description.md new file mode 100644 index 0000000..a7a47e8 --- /dev/null +++ b/prompts/pr-description.md @@ -0,0 +1,54 @@ +# PR Description Generator — System Prompt + +Paste this into any AI agent along with your `git diff` or list of changes to generate a PR description. + +--- + +## System prompt + +You are a **PR description writer** for a DevOps/infrastructure team. Given a diff, commit list, or description of changes, generate a clear, reviewable pull request description. + +### Output format + +```markdown +## What + +<1-3 sentences: what this PR does in plain English> + +## Why + +<1-3 sentences: why this change is needed — the problem, feature request, or improvement> + +## How + + + +## Testing + + + +## Risk + + +- **Risk level:** Low / Medium / High +- **Rollback:** +- **Affected environments:** + +## Checklist + +- [ ] Code follows project conventions +- [ ] Tests added/updated +- [ ] Documentation updated (if applicable) +- [ ] No secrets or credentials in the diff +- [ ] Reviewed for security implications +``` + +### Rules + +- **Be specific.** Don't say "updated the config" — say "changed the RDS instance class from `db.t3.medium` to `db.t3.large` to handle increased query load." +- **Group changes logically.** If the PR touches 5 files across 2 concerns, group by concern, not by file. 
+- **Flag breaking changes** prominently with ⚠️. +- **Mention dependencies** — does this PR need to be merged/deployed before or after another PR? +- **Include the diff context.** If the user provides a diff, reference specific file paths and line changes. +- **Never include secret values** from the diff. If the diff contains credentials, flag it as a blocker. +- **For infrastructure PRs**, always include: what resources are created/modified/destroyed, blast radius, and rollback plan. diff --git a/scripts/aws-whoami.sh b/scripts/aws-whoami.sh new file mode 100644 index 0000000..b6aedaa --- /dev/null +++ b/scripts/aws-whoami.sh @@ -0,0 +1,45 @@ +#!/usr/bin/env bash +# ──────────────────────────────────────────────────────────────── +# aws-whoami.sh — Quick AWS identity and account context +# ──────────────────────────────────────────────────────────────── +# Usage: ./aws-whoami.sh [profile] +# +# Shows: caller identity, account alias, region, organization, +# and SSO role (if using AWS SSO). 
+# ──────────────────────────────────────────────────────────────── +set -euo pipefail + +PROFILE_FLAG="" +[ -n "${1:-}" ] && PROFILE_FLAG="--profile $1" + +echo "🔍 AWS Identity Check" +echo "=====================" +echo "" + +echo "--- Caller Identity ---" +aws sts get-caller-identity $PROFILE_FLAG --output table 2>&1 + +echo "" +echo "--- Region ---" +REGION=$(aws configure get region $PROFILE_FLAG 2>/dev/null || echo "not set") +echo "Region: $REGION" + +echo "" +echo "--- Account Aliases ---" +aws iam list-account-aliases $PROFILE_FLAG --query 'AccountAliases[]' --output text 2>/dev/null || echo "(none or no permission)" + +echo "" +echo "--- Organization ---" +aws organizations describe-organization $PROFILE_FLAG --query 'Organization.{Id:Id,Master:MasterAccountId,Email:MasterAccountEmail}' --output table 2>/dev/null || echo "Not in an org (or no permission)" + +echo "" +echo "--- SSO Role (if applicable) ---" +ARN=$(aws sts get-caller-identity $PROFILE_FLAG --query 'Arn' --output text 2>/dev/null) +if echo "$ARN" | grep -q 'assumed-role'; then + ROLE=$(echo "$ARN" | awk -F/ '{print $2}') + USER=$(echo "$ARN" | awk -F/ '{print $3}') + echo "Role: $ROLE" + echo "User: $USER" +else + echo "Not using assumed role" +fi diff --git a/scripts/stale-branches.sh b/scripts/stale-branches.sh new file mode 100644 index 0000000..87c9043 --- /dev/null +++ b/scripts/stale-branches.sh @@ -0,0 +1,71 @@ +#!/usr/bin/env bash +# ──────────────────────────────────────────────────────────────── +# stale-branches.sh — List git branches older than N days +# ──────────────────────────────────────────────────────────────── +# Usage: ./stale-branches.sh [days] [--remote] +# +# Defaults: 90 days, local branches only. +# Add --remote to include remote tracking branches. 
+# ──────────────────────────────────────────────────────────────── +set -euo pipefail + +DAYS="${1:-90}" +INCLUDE_REMOTE=false +[ "${2:-}" = "--remote" ] && INCLUDE_REMOTE=true + +CUTOFF=$(date -u -v-${DAYS}d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "${DAYS} days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null) + +echo "🌿 Stale Branch Report" +echo "======================" +echo "Repo: $(basename "$(git rev-parse --show-toplevel 2>/dev/null || echo '?')")" +echo "Threshold: ${DAYS} days (before $(echo "$CUTOFF" | cut -dT -f1))" +echo "Scope: $([ "$INCLUDE_REMOTE" = true ] && echo 'local + remote' || echo 'local only')" +echo "" + +# Current branch (don't flag this one) +CURRENT=$(git branch --show-current 2>/dev/null || echo "") + +echo "--- Stale local branches ---" +stale_local=0 +for branch in $(git for-each-ref --sort=committerdate --format='%(refname:short) %(committerdate:iso8601)' refs/heads/ | while read name date; do + # Compare dates + branch_epoch=$(date -jf "%Y-%m-%d %H:%M:%S %z" "$date" +%s 2>/dev/null || date -d "$date" +%s 2>/dev/null || echo 0) + cutoff_epoch=$(date -jf "%Y-%m-%dT%H:%M:%SZ" "$CUTOFF" +%s 2>/dev/null || date -d "$CUTOFF" +%s 2>/dev/null || echo 0) + [ "$branch_epoch" -lt "$cutoff_epoch" ] 2>/dev/null && echo "$name" +done); do + [ "$branch" = "$CURRENT" ] && continue + [ "$branch" = "main" ] || [ "$branch" = "master" ] && continue + last_commit=$(git log -1 --format='%ci (%cr)' "$branch" 2>/dev/null || echo "unknown") + author=$(git log -1 --format='%an' "$branch" 2>/dev/null || echo "unknown") + echo " $branch" + echo " Last commit: $last_commit" + echo " Author: $author" + stale_local=$((stale_local + 1)) +done +[ "$stale_local" -eq 0 ] && echo " (none)" +echo "" +echo "Stale local branches: $stale_local" + +if [ "$INCLUDE_REMOTE" = true ]; then + echo "" + echo "--- Stale remote branches ---" + git fetch --prune 2>/dev/null || true + stale_remote=0 + git for-each-ref --sort=committerdate --format='%(refname:short) 
%(committerdate:iso8601)' refs/remotes/origin/ | { while read name date; do + # Skip HEAD and main/master + echo "$name" | grep -qE 'HEAD|/main$|/master$' && continue + branch_epoch=$(date -jf "%Y-%m-%d %H:%M:%S %z" "$date" +%s 2>/dev/null || date -d "$date" +%s 2>/dev/null || echo 0) + cutoff_epoch=$(date -jf "%Y-%m-%dT%H:%M:%SZ" "$CUTOFF" +%s 2>/dev/null || date -d "$CUTOFF" +%s 2>/dev/null || echo 0) + if [ "$branch_epoch" -lt "$cutoff_epoch" ] 2>/dev/null; then + last_commit=$(git log -1 --format='%ci (%cr)' "$name" 2>/dev/null || echo "unknown") + echo " $name — $last_commit" + stale_remote=$((stale_remote + 1)) + fi + done + echo "" + echo "Stale remote branches: $stale_remote"; } +fi + +echo "" +echo "💡 To delete a stale local branch: git branch -d " +echo "💡 To delete a stale remote branch: git push origin --delete " diff --git a/workflows/aws/aws-account-audit.md b/workflows/aws/aws-account-audit.md index cad6451..5605d0f 100644 --- a/workflows/aws/aws-account-audit.md +++ b/workflows/aws/aws-account-audit.md @@ -19,10 +19,13 @@ Ask the user for the following before starting (use sensible defaults if not pro - **PROFILE** — AWS CLI profile name. Default: current default profile. - **REGION** — primary region to audit. Default: current default region (`aws configure get region`). - **ALL_REGIONS** — `yes`/`no`. If `yes`, repeat region-scoped checks across all enabled regions. Default: `no` (primary region only). +- **FAST** — `yes`/`no`. If `yes`, skip slow per-user/per-policy iteration loops and use bulk API calls only. Recommended for large enterprise accounts (>1000 roles or >500 policies) to avoid API throttling. Default: `no`. - **REPORT_DIR** — where to write the report. Default: `./aws-account-audit-reports`. Confirm the inputs and caller identity with the user before proceeding. 
+> **Performance note:** On large enterprise accounts (thousands of roles/policies), the per-policy admin-access scan in Step 3 can take 30+ minutes due to AWS IAM API throttling. Use `FAST=yes` to skip these loops and rely on `list-entities-for-policy` bulk checks for `AdministratorAccess`, `IAMFullAccess`, and `PowerUserAccess` instead. + --- ## Step 1 — Verify identity and account context @@ -85,21 +88,37 @@ Flag: // turbo +### Always run — bulk privilege checks (fast) + +```bash +echo "=== Account summary (role/policy counts) ===" +aws iam get-account-summary --output json 2>/dev/null | jq '{Users:.SummaryMap.Users,Roles:.SummaryMap.Roles,Policies:.SummaryMap.Policies,MFADevicesInUse:.SummaryMap.MFADevicesInUse}' + +echo "=== Entities with AdministratorAccess ===" +aws iam list-entities-for-policy --policy-arn "arn:aws:iam::aws:policy/AdministratorAccess" \ + --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName|length(@),Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null + +echo "=== Entities with IAMFullAccess ===" +aws iam list-entities-for-policy --policy-arn "arn:aws:iam::aws:policy/IAMFullAccess" \ + --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName|length(@),Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null + +echo "=== Entities with PowerUserAccess ===" +aws iam list-entities-for-policy --policy-arn "arn:aws:iam::aws:policy/PowerUserAccess" \ + --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName|length(@),Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null +``` + +### Only when FAST!=yes — deep per-policy scan (slow on large accounts) + +> **Skip this section if `FAST=yes`.** On accounts with thousands of policies, this loop makes one API call per policy and can take 30+ minutes due to IAM throttling. 
+ ```bash -echo "=== Policies with admin access ===" +echo "=== Customer-managed policies with admin access ===" for arn in $(aws iam list-policies --scope Local --query 'Policies[].Arn' --output text); do ver=$(aws iam get-policy --policy-arn "$arn" --query 'Policy.DefaultVersionId' --output text) doc=$(aws iam get-policy-version --policy-arn "$arn" --version-id "$ver" --query 'PolicyVersion.Document' --output json 2>/dev/null) echo "$doc" | jq -e '.Statement[] | select(.Effect=="Allow" and .Action=="*" and .Resource=="*")' >/dev/null 2>&1 && echo "ADMIN-POLICY: $arn" done -echo "=== Users/roles with AdministratorAccess ===" -for arn in "arn:aws:iam::policy/AdministratorAccess" "arn:aws:iam::policy/IAMFullAccess"; do - full_arn="arn:aws:iam::policy/${arn##*/}" - managed_arn="arn:aws:iam::aws:policy/${arn##*/}" - aws iam list-entities-for-policy --policy-arn "$managed_arn" --query '{Users:PolicyUsers[].UserName,Roles:PolicyRoles[].RoleName,Groups:PolicyGroups[].GroupName}' --output json 2>/dev/null || true -done - echo "=== Inline policies with wildcards ===" for user in $(aws iam list-users --query 'Users[].UserName' --output text); do for pol in $(aws iam list-user-policies --user-name "$user" --query 'PolicyNames[]' --output text); do @@ -111,9 +130,9 @@ done Flag: -- Customer-managed policies granting `*:*`. -- Principals with `AdministratorAccess` or `IAMFullAccess`. -- Inline policies with wildcard actions. +- Principals with `AdministratorAccess`, `IAMFullAccess`, or `PowerUserAccess`. +- Customer-managed policies granting `*:*` (deep scan only). +- Inline policies with wildcard actions (deep scan only). - Roles allowing `iam:PassRole` with `*` resource. 
+ --- diff --git a/workflows/aws/aws-cost-quickscan.md b/workflows/aws/aws-cost-quickscan.md index 77152dd..d022f8c 100644 --- a/workflows/aws/aws-cost-quickscan.md +++ b/workflows/aws/aws-cost-quickscan.md @@ -19,6 +19,7 @@ Quick, **read-only** scan of an AWS account to surface the biggest cost drivers - **REGION** — primary region. Default: current default. - **ALL_REGIONS** — `yes`/`no`. Default: `no`. - **LOOKBACK_DAYS** — Cost Explorer lookback period. Default: `30`. +- **DEEP** — `yes`/`no`. If `yes`, also check per-instance CPU utilization for EC2 and RDS (requires CloudWatch `GetMetricStatistics`, slower on large fleets). Default: `no`. - **REPORT_DIR** — Default: `./aws-cost-quickscan-reports`. --- @@ -305,7 +306,48 @@ Flag: --- -## Step 9 — Savings Plans and Reserved Instances coverage +## Step 9 — EC2 and RDS utilization analysis (only when DEEP=yes) + +> **Skip this step if `DEEP!=yes`.** This makes one CloudWatch API call per instance and can be slow on large fleets (50+ instances). The workflow is intended to cap each region at 50 EC2 instances and 30 RDS instances. 
+ +```bash +REGIONS="${ALL_REGIONS_LIST:-$REGION}" + +for r in $REGIONS; do + echo "=== Low-CPU EC2 instances (avg < 5% over 7d) region=$r ===" + for iid in $(aws ec2 describe-instances --region "$r" --filters Name=instance-state-name,Values=running --query 'Reservations[].Instances[].InstanceId' --output text 2>/dev/null | tr '\t' '\n' | head -50); do + avg=$(aws cloudwatch get-metric-statistics --region "$r" \ + --namespace AWS/EC2 --metric-name CPUUtilization \ + --dimensions Name=InstanceId,Value=$iid \ + --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \ + --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \ + --period 86400 --statistics Average \ + --query 'Datapoints[].Average' --output text 2>/dev/null | awk '{for(i=1;i<=NF;i++){s+=$i; n++}} END {if(n>0) printf "%.1f", s/n; else print "N/A"}') + [ "$avg" != "N/A" ] && [ "$(echo "$avg < 5" | bc 2>/dev/null)" = "1" ] && echo "LOW-CPU: $iid avg=${avg}%" + done + + echo "=== Low-CPU RDS instances (avg < 10% over 7d) region=$r ===" + for dbid in $(aws rds describe-db-instances --region "$r" --query 'DBInstances[?DBInstanceStatus==`available`].DBInstanceIdentifier' --output text 2>/dev/null | tr '\t' '\n' | head -30); do + avg=$(aws cloudwatch get-metric-statistics --region "$r" \ + --namespace AWS/RDS --metric-name CPUUtilization \ + --dimensions Name=DBInstanceIdentifier,Value=$dbid \ + --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \ + --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \ + --period 86400 --statistics Average \ + --query 'Datapoints[].Average' --output text 2>/dev/null | awk '{for(i=1;i<=NF;i++){s+=$i; n++}} END {if(n>0) printf "%.1f", s/n; else print "N/A"}') + [ "$avg" != "N/A" ] && [ "$(echo "$avg < 10" | bc 2>/dev/null)" = "1" ] && echo "LOW-CPU-RDS: $dbid avg=${avg}%" + done +done +``` + +Flag: +- EC2 instances with avg CPU < 5% — candidates for downsizing or termination. +- RDS instances with avg CPU < 10% — candidates for smaller instance class. 
+- Cross-reference with instance type to estimate savings from right-sizing. + +--- + +## Step 10 — Savings Plans and Reserved Instances coverage // turbo @@ -334,7 +376,7 @@ Flag: --- -## Step 10 — Generate report +## Step 11 — Generate report Compile all findings into a timestamped Markdown report: diff --git a/workflows/iac/terraform-plan-review.md b/workflows/iac/terraform-plan-review.md index aa551b1..768fb32 100644 --- a/workflows/iac/terraform-plan-review.md +++ b/workflows/iac/terraform-plan-review.md @@ -25,6 +25,32 @@ Feed in a `terraform plan` (text, JSON, or saved plan file) and get a plain-Engl --- +## Step 0 — Generate the plan (if not already available) + +If the user doesn't have a plan output yet, help them generate one: + +```bash +# Option A: Generate text plan +cd +terraform init # plan needs an initialized backend so it can read the current state +terraform plan -no-color 2>&1 | tee /tmp/tf-plan.txt + +# Option B: Generate JSON plan (richer, recommended) +terraform plan -out=/tmp/tf-plan.bin +terraform show -json /tmp/tf-plan.bin > /tmp/tf-plan.json + +# Option C: If the user only has a saved binary plan file +terraform show -json > /tmp/tf-plan.json + +# Option D: If using Terragrunt +terragrunt plan -out=/tmp/tf-plan.bin +terragrunt show -json /tmp/tf-plan.bin > /tmp/tf-plan.json +``` + +> **Note:** `terraform plan` reads the current state, so it needs an initialized backend; the plan itself does not change any resources. `terraform init -backend=false` only downloads providers and modules, which is enough for `terraform validate` but not for `plan` when a backend is configured. If the user has already run `terraform init`, skip this. 
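
Once a text plan exists at `/tmp/tf-plan.txt`, a quick first check is how many resources it would destroy. The sketch below is illustrative only: `plan_destroy_count` is a hypothetical helper, and it assumes the summary line has the form `Plan: N to add, N to change, N to destroy.` at the start of a line, as printed with `-no-color`.

```bash
#!/usr/bin/env bash
# Sketch only: plan_destroy_count is a hypothetical helper, not part of the workflow.

# Read `terraform plan -no-color` text on stdin and print the number of
# resources the summary line says will be destroyed. Prints 0 when no
# "Plan:" summary line is present (for example when there are no changes).
plan_destroy_count() {
  local count
  count=$(sed -n 's/^Plan:.* \([0-9][0-9]*\) to destroy.*/\1/p' | head -n 1)
  echo "${count:-0}"
}

# Example usage (commented out; requires the file from Option A):
# destroys=$(plan_destroy_count < /tmp/tf-plan.txt)
# [ "$destroys" -gt 0 ] && echo "Plan destroys $destroys resource(s); review carefully"
```

The JSON plan from Option B carries the same information per resource and is the better input for a real review; the text summary is only a fast sanity check.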
+ +--- + ## Step 1 — Ingest and parse the plan If the input is a binary plan file, convert it: From 7472060aeff80b0ab4f0031c8810c2d7c652f046 Mon Sep 17 00:00:00 2001 From: Sergei Olshanetski Date: Thu, 14 May 2026 21:59:48 -0400 Subject: [PATCH 2/2] Add new workflows, CI, and CHANGELOG Workflows: - helm-chart-review: Helm chart best practices review (kubernetes/) - secrets-leak-scan: git history secret scanner (security/) - incident-triage: guided first 15 min of an incident (observability/) Improvements: - aws-account-audit: FAST=yes mode for large accounts - aws-cost-quickscan: DEEP=yes for EC2/RDS CPU utilization - terraform-plan-review: Step 0 plan generation commands - k8s-debug: enhanced logs, restart timeline, HPA checks Prompts: - pr-description.md - explain-like-a-senior.md Scripts: - aws-whoami.sh - stale-branches.sh Repo: - GitHub Actions CI (lint, link check, frontmatter validation) - CHANGELOG.md --- .github/.markdownlint.json | 8 + .github/mlc-config.json | 11 + .github/workflows/ci.yml | 71 +++++ CHANGELOG.md | 64 +++++ README.md | 24 +- scripts/aws-whoami.sh | 0 scripts/stale-branches.sh | 0 workflows/kubernetes/helm-chart-review.md | 254 ++++++++++++++++++ workflows/observability/incident-triage.md | 289 +++++++++++++++++++++ workflows/security/secrets-leak-scan.md | 221 ++++++++++++++++ 10 files changed, 938 insertions(+), 4 deletions(-) create mode 100644 .github/.markdownlint.json create mode 100644 .github/mlc-config.json create mode 100644 .github/workflows/ci.yml create mode 100644 CHANGELOG.md mode change 100644 => 100755 scripts/aws-whoami.sh mode change 100644 => 100755 scripts/stale-branches.sh create mode 100644 workflows/kubernetes/helm-chart-review.md create mode 100644 workflows/observability/incident-triage.md create mode 100644 workflows/security/secrets-leak-scan.md diff --git a/.github/.markdownlint.json b/.github/.markdownlint.json new file mode 100644 index 0000000..44c0ac1 --- /dev/null +++ b/.github/.markdownlint.json @@ -0,0 
+1,8 @@ +{ + "default": true, + "MD013": false, + "MD033": false, + "MD041": false, + "MD024": { "siblings_only": true }, + "MD046": { "style": "fenced" } +} diff --git a/.github/mlc-config.json b/.github/mlc-config.json new file mode 100644 index 0000000..c6eac5c --- /dev/null +++ b/.github/mlc-config.json @@ -0,0 +1,11 @@ +{ + "ignorePatterns": [ + { + "pattern": "^http://localhost" + }, + { + "pattern": "^http://prometheus" + } + ], + "aliveStatusCodes": [200, 206, 301, 302, 403] +} diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..fc42c01 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,71 @@ +name: CI + +on: + push: + branches: [main, master] + pull_request: + branches: [main, master] + +jobs: + lint: + name: Lint & Validate + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Check markdown links + uses: gaurav-nelson/github-action-markdown-link-check@v1 + with: + use-quiet-mode: 'yes' + config-file: '.github/mlc-config.json' + continue-on-error: true + + - name: Lint markdown + uses: DavidAnson/markdownlint-cli2-action@v19 + with: + globs: '**/*.md' + config: '.github/.markdownlint.json' + continue-on-error: true + + - name: Validate workflow frontmatter + run: | + echo "Checking all workflows have frontmatter..." + errors=0 + for f in workflows/**/*.md; do + if ! head -1 "$f" | grep -q '^---$'; then + echo "❌ Missing frontmatter: $f" + errors=$((errors + 1)) + fi + done + echo "Checked $(find workflows -name '*.md' | wc -l) workflows, $errors missing frontmatter" + [ "$errors" -eq 0 ] && echo "✅ All workflows have frontmatter" + + - name: Check README workflow table matches files + run: | + echo "Checking README links match actual files..." + errors=0 + for f in $(grep -oE '\./workflows/[^)]+\.md' README.md); do + if [ ! 
-f "$f" ]; then + echo "❌ README links to $f but file doesn't exist" + errors=$((errors + 1)) + fi + done + echo "Checked $(grep -coE '\./workflows/[^)]+\.md' README.md) README links, $errors broken" + [ "$errors" -eq 0 ] && echo "✅ All README links are valid" + + - name: Check scripts are executable + run: | + for f in scripts/*.sh; do + [ -f "$f" ] || continue + if [ ! -x "$f" ]; then + echo "❌ Not executable: $f" + fi + done + + - name: Shellcheck scripts + run: | + if command -v shellcheck >/dev/null; then + shellcheck scripts/*.sh || true + else + echo "shellcheck not available, skipping" + fi diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..4c15f2f --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,64 @@ +# Changelog + +All notable changes to this project will be documented in this file. + +## [Unreleased] + +### Added — Workflows +- **`/helm-chart-review`** — review Helm charts for security, reliability, and best practices (kubernetes/) +- **`/secrets-leak-scan`** — scan git repos for leaked secrets using gitleaks, trufflehog, or regex (security/) +- **`/incident-triage`** — guided first 15 minutes of a production incident (observability/) + +### Added — Prompts +- **`pr-description.md`** — generate PR descriptions from diffs +- **`explain-like-a-senior.md`** — explain infrastructure code to junior engineers + +### Added — Scripts +- **`aws-whoami.sh`** — quick AWS identity and account context check +- **`stale-branches.sh`** — list git branches older than N days + +### Added — CI +- GitHub Actions CI: markdown lint, link check, frontmatter validation, README link verification + +### Improved +- **`/aws-account-audit`** — added `FAST=yes` input to skip slow per-policy IAM loops on large accounts +- **`/aws-cost-quickscan`** — added `DEEP=yes` input for per-instance CPU utilization analysis +- **`/terraform-plan-review`** — added Step 0 with plan generation commands (including Terragrunt) +- **`/k8s-debug`** — enhanced log analysis (Step 
5) with init container logs, structured error extraction, severity classification, and "noisiest pods" scan; added restart timeline analysis (Step 6a) and HPA health check (Step 6b); expanded triage cheat-sheet with startup-order, Redis, autoscaling, and webhook patterns + +--- + +## [0.1.0] — 2026-05-04 + +### Added — Workflows +- **`/k8s-debug`** — general-purpose Kubernetes cluster debugger (kubernetes/) +- **`/k8s-workload-debug`** — deep-dive on a single workload (kubernetes/) +- **`/k8s-rbac-audit`** — RBAC security audit (kubernetes/) +- **`/k8s-cost-hotspots`** — cost and waste analysis (kubernetes/) +- **`/k8s-upgrade-readiness`** — pre-flight checks for K8s upgrades (kubernetes/) +- **`/helm-release-debug`** — diagnose stuck or failed Helm releases (kubernetes/) +- **`/aws-account-audit`** — AWS account security audit (aws/) +- **`/aws-cost-quickscan`** — AWS cost waste analysis (aws/) +- **`/aws-vpc-debug`** — VPC connectivity triage (aws/) +- **`/aws-iam-policy-review`** — IAM policy risk analysis (aws/) +- **`/terraform-plan-review`** — Terraform plan risk analysis (iac/) +- **`/ci-debug`** — CI/CD pipeline failure diagnosis (cicd/) +- **`/jenkins-pipeline-review`** — Jenkinsfile code review (cicd/) +- **`/dockerfile-review`** — Dockerfile security and optimization review (containers/) + +### Added — Prompts +- **`incident-commander.md`** — incident commander system prompt +- **`postmortem-writer.md`** — blameless post-mortem generator +- **`code-review-devops.md`** — DevOps code review prompt + +### Added — Rules +- **`devops-agent.windsurfrules`** — AI safety guardrails for DevOps repos + +### Added — Scripts +- **`k8s-snapshot.sh`** — cluster state snapshot to Markdown + +### Added — Repo +- Repository structure: workflows/, prompts/, rules/, scripts/ +- README.md with full documentation +- CONTRIBUTING.md with workflow design rules +- MIT License diff --git a/README.md b/README.md index 2845af5..f53ff63 100644 --- a/README.md +++ b/README.md @@ 
-25,6 +25,7 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da | [k8s-cost-hotspots](./workflows/kubernetes/k8s-cost-hotspots.md) | `/k8s-cost-hotspots` | Find waste: over-provisioned workloads, missing requests/limits, idle workloads, orphan PVCs/PVs, idle LoadBalancers. | `kubectl`, `jq`, metrics-server. | | [k8s-upgrade-readiness](./workflows/kubernetes/k8s-upgrade-readiness.md) | `/k8s-upgrade-readiness` | Pre-flight before a control-plane / node upgrade: deprecated APIs, version skew, PDB gaps, expiring certs, broken webhooks. | `kubectl`. Optional: `kubent` or `pluto`, `helm`. | | [helm-release-debug](./workflows/kubernetes/helm-release-debug.md) | `/helm-release-debug` | Diagnose a stuck or failed Helm release: history, values diff, hook failures, rendered manifest vs cluster, workload health. | `helm` v3, `kubectl`. Optional: `jq`, `yq`. | +| [helm-chart-review](./workflows/kubernetes/helm-chart-review.md) | `/helm-chart-review` | Review a Helm chart for security, reliability, and best practices: resource specs, probes, security context, PDBs, anti-affinity, RBAC. | Helm chart source. Optional: `helm` CLI. | ### AWS / Cloud @@ -49,6 +50,18 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da | [jenkins-pipeline-review](./workflows/cicd/jenkins-pipeline-review.md) | `/jenkins-pipeline-review` | Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or `vars/*.groovy`. Optional: `repositories_v2.json`. | | [dockerfile-review](./workflows/containers/dockerfile-review.md) | `/dockerfile-review` | Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: `docker`, `trivy`. 
| +### Security + +| Workflow | Slash command | Description | Prerequisites | +|---|---|---|---| +| [secrets-leak-scan](./workflows/security/secrets-leak-scan.md) | `/secrets-leak-scan` | Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: `gitleaks`, `trufflehog`. | + +### Observability & Incident + +| Workflow | Slash command | Description | Prerequisites | +|---|---|---|---| +| [incident-triage](./workflows/observability/incident-triage.md) | `/incident-triage` | Guided first 15 minutes of a production incident: timeline, blast radius, evidence gathering, mitigation suggestions. | Access to affected environment. | + More on the way — see [Roadmap](#roadmap). ## Prompts @@ -60,6 +73,8 @@ Reusable system prompts you can paste into any AI agent for common DevOps tasks: | [incident-commander](./prompts/incident-commander.md) | Puts the AI in incident-commander mode: timeline, blast radius, action tracking, status updates. | | [postmortem-writer](./prompts/postmortem-writer.md) | Generates a blameless post-mortem from incident notes: timeline, root cause, impact, action items. | | [code-review-devops](./prompts/code-review-devops.md) | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. | +| [pr-description](./prompts/pr-description.md) | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. | +| [explain-like-a-senior](./prompts/explain-like-a-senior.md) | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. | ## Rules @@ -76,6 +91,8 @@ Standalone shell utilities referenced by workflows or useful on their own: | Script | Usage | |---|---| | [k8s-snapshot.sh](./scripts/k8s-snapshot.sh) | `./k8s-snapshot.sh [namespace\|all] [output-dir]` — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. 
| +| [aws-whoami.sh](./scripts/aws-whoami.sh) | `./aws-whoami.sh [profile]` — quick AWS identity check: caller, region, account alias, org, SSO role. | +| [stale-branches.sh](./scripts/stale-branches.sh) | `./stale-branches.sh [days] [--remote]` — list git branches older than N days with last commit info. | ## Using a workflow @@ -100,7 +117,9 @@ devops-ai-workflows/ │ ├── aws/ # AWS / cloud workflow definitions │ ├── iac/ # Infrastructure as Code workflows │ ├── cicd/ # CI/CD pipeline workflows -│ └── containers/ # Container & image workflows +│ ├── containers/ # Container & image workflows +│ ├── security/ # Security & repo hygiene workflows +│ └── observability/ # Observability & incident workflows ├── prompts/ # Reusable LLM prompts ├── rules/ # Editor/agent rule files ├── scripts/ # Standalone shell helpers @@ -127,12 +146,10 @@ Ideas I plan to add (PRs welcome): - [ ] `/image-cve-triage` — prioritise CVE scanner output by exploitability + fix availability - [ ] `/github-actions-review` — security review of GitHub Actions workflow files - [ ] `/release-checklist` — pre-release gate -- [ ] `/helm-chart-review` — review Helm chart for missing resources/limits, PDB, anti-affinity, template issues **Observability & incident** - [ ] `/prometheus-query-helper` — intent → PromQL with rationale - [ ] `/log-pattern-extract` — cluster repeated errors out of a log dump -- [ ] `/incident-triage` — guided first 15 minutes of an incident - [ ] `/postmortem` — blameless post-mortem from a transcript - [ ] `/runbook-from-incident` — turn a resolved incident into a reusable runbook @@ -144,7 +161,6 @@ Ideas I plan to add (PRs welcome): - [ ] `/db-migration-review` — flag risky migration patterns **Security & repo hygiene** -- [ ] `/secrets-leak-scan` — gitleaks/trufflehog over full git history - [ ] `/cve-impact-assessment` — given a CVE, check whether your stack is affected - [ ] `/repo-health` — README, license, CI, branch protection, stale branches - [ ] 
`/dependency-upgrade-plan` — group outdated deps by risk and suggest batching diff --git a/scripts/aws-whoami.sh b/scripts/aws-whoami.sh old mode 100644 new mode 100755 diff --git a/scripts/stale-branches.sh b/scripts/stale-branches.sh old mode 100644 new mode 100755 diff --git a/workflows/kubernetes/helm-chart-review.md b/workflows/kubernetes/helm-chart-review.md new file mode 100644 index 0000000..1a80eec --- /dev/null +++ b/workflows/kubernetes/helm-chart-review.md @@ -0,0 +1,254 @@ +--- +description: Review a Helm chart for security, reliability, and best practices before deployment. Checks templates, values, resource specs, and RBAC. Read-only static analysis. +--- + +# /helm-chart-review — Helm Chart Best Practices Review + +Static analysis of a Helm chart **before deployment**. Checks templates, `values.yaml`, resource specifications, security context, RBAC, and packaging. Flags missing best practices that cause production incidents. + +> This reviews chart **source code**. For diagnosing a **live broken Helm release**, use `/helm-release-debug` instead. + +## Prerequisites + +- Helm chart source directory or `.tgz` archive. +- Optional: `helm` CLI (for `helm template`, `helm lint`). +- Optional: `kubectl` (for dry-run validation against a cluster). +- No cluster access required for basic review. + +## Inputs + +- **CHART_PATH** *(required)* — path to the chart directory or `.tgz` file. +- **VALUES_FILE** — optional custom values file to review alongside defaults. +- **REPORT_DIR** — Default: `./helm-chart-review-reports`. 
+ +--- + +## Step 1 — Chart structure and metadata + +// turbo + +```bash +# Validate chart structure +ls -la $CHART_PATH/ +cat $CHART_PATH/Chart.yaml +cat $CHART_PATH/values.yaml | head -100 + +# Helm lint +helm lint $CHART_PATH 2>&1 +helm lint $CHART_PATH --strict 2>&1 + +# Template render (catch errors before deploy) +helm template test-release $CHART_PATH 2>&1 | head -200 +``` + +Check: + +- `Chart.yaml` has `version`, `appVersion`, `description`, `maintainers`. +- `apiVersion: v2` (Helm 3). Flag `v1` charts (Helm 2 legacy). +- Dependencies declared in `Chart.yaml` or `requirements.yaml` (legacy). +- `helm lint --strict` passes with no warnings. +- `helm template` renders without errors. + +--- + +## Step 2 — Resource specifications + +For every Deployment, StatefulSet, DaemonSet, Job in the templates, check: + +### Resource requests and limits + +```yaml +# ✅ Good +resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: 500m + memory: 512Mi + +# ❌ Bad — no resources at all +# ❌ Bad — limits without requests +# ⚠️ Caution — requests == limits (Guaranteed QoS, may be wasteful) +``` + +Flag: +- Containers with no `resources.requests` → scheduling problems, noisy neighbors. +- Containers with no `resources.limits` → can consume unbounded resources. +- Memory limits much larger than requests → overcommitment risk. + +### Probes + +```yaml +# ✅ Should have all three +readinessProbe: ... # When to send traffic +livenessProbe: ... # When to restart +startupProbe: ... # Grace period for slow-starting apps +``` + +Flag: +- No `readinessProbe` → traffic sent before app is ready. +- No `livenessProbe` → stuck pods never restart. +- `livenessProbe` same as `readinessProbe` → may cause restart loops under load. +- `initialDelaySeconds` too low → premature restarts during startup. +- No `startupProbe` on apps known to have slow startup. 
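The probe checks above can be approximated with a quick text scan of the rendered manifests. This is a minimal sketch, not part of the workflow itself: `check_probes` is a hypothetical helper name, and it only reports whether each probe type appears anywhere in the output, so a per-container review is still needed.

```bash
# Sketch: report which probe types appear in rendered manifests read from stdin.
# check_probes is a hypothetical helper, not a helm or kubectl command.
check_probes() {
  local manifest probe
  manifest=$(cat)
  for probe in readinessProbe livenessProbe startupProbe; do
    # Match the probe key at any indentation level
    if printf '%s\n' "$manifest" | grep -qE "^[[:space:]]*${probe}:"; then
      echo "${probe}: present"
    else
      echo "${probe}: MISSING"
    fi
  done
}

# Usage (assumes helm is installed and CHART_PATH is set):
#   helm template test-release "$CHART_PATH" | check_probes
```

A "present" result only means at least one container defines that probe; charts with several workloads should be checked template by template.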
+ +--- + +## Step 3 — Security + +### Pod security context + +```yaml +# ✅ Good +# Pod-level: spec.securityContext +securityContext: + runAsNonRoot: true + runAsUser: 1000 + fsGroup: 1000 +# Container-level: spec.containers[].securityContext +securityContext: + readOnlyRootFilesystem: true + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] +``` + +Flag: +- No `securityContext` at all → container runs as whatever user the image defines, often root. +- `privileged: true` → full host access. +- `allowPrivilegeEscalation: true` or missing → container can escalate. +- `capabilities` not dropped → unnecessary kernel capabilities. +- `hostNetwork: true`, `hostPID: true`, `hostIPC: true` → breaks isolation. +- `readOnlyRootFilesystem: false` or missing → writable root fs. + +### RBAC + +If the chart creates `ClusterRole`, `ClusterRoleBinding`, `Role`, `RoleBinding`: + +- Flag `ClusterRole` with `*` verbs or `*` resources. +- Flag `ClusterRoleBinding` to `default` ServiceAccount. +- Flag any binding to `cluster-admin`. +- Prefer `Role`+`RoleBinding` (namespace-scoped) over `ClusterRole`+`ClusterRoleBinding`. + +### Secrets + +- Flag `Secret` resources with hardcoded values in templates. +- Prefer `existingSecret` pattern (reference external secrets). +- Flag secrets in `ConfigMap` (should be `Secret`). +- Check if `values.yaml` has password/token fields with default values. + +--- + +## Step 4 — High availability and resilience + +### Replicas and PDB + +Flag: +- `replicas: 1` for production workloads → single point of failure. +- No `PodDisruptionBudget` for multi-replica Deployments/StatefulSets. +- PDB with `maxUnavailable: 0` → blocks all voluntary disruptions (node drain). +- PDB with `minAvailable` equal to `replicas` → same problem.
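The `minAvailable` rule above comes down to simple arithmetic: if the budget requires as many pods as exist, no pod may ever be evicted. A minimal sketch, assuming both values are plain integers (percentage budgets are not handled) and using the hypothetical helper name `pdb_check`:

```bash
# Sketch: decide whether a PDB leaves any pod that may be evicted.
# pdb_check REPLICAS MIN_AVAILABLE  (hypothetical helper, integers only)
pdb_check() {
  local replicas=$1 min_available=$2
  if [ "$min_available" -ge "$replicas" ]; then
    # Nothing can be evicted, so a node drain will hang on this workload
    echo "blocks-drain"
  else
    echo "ok: $((replicas - min_available)) pod(s) may be evicted"
  fi
}

pdb_check 3 3
pdb_check 3 2
```

The first call reports a blocking budget and the second leaves one evictable pod, which is the situation a reviewer would want for a three-replica Deployment.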
+ +### Anti-affinity + +```yaml +# ✅ Good — spread across nodes +affinity: + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchExpressions: + - key: app + operator: In + values: ["myapp"] + topologyKey: kubernetes.io/hostname +``` + +Flag: +- Multi-replica workloads with no anti-affinity → all pods on one node. +- `requiredDuringScheduling` anti-affinity on small clusters → pods may not schedule. + +### Update strategy + +- Deployments: `RollingUpdate` with `maxSurge` and `maxUnavailable` configured. +- StatefulSets: `RollingUpdate` with `partition` for staged rollouts. +- DaemonSets: `RollingUpdate` with `maxUnavailable`. +- Flag `Recreate` strategy on production Deployments (causes downtime). + +--- + +## Step 5 — Networking + +- **Service type** — flag `LoadBalancer` without annotation for internal LB (may create public LB). +- **Ingress** — check for TLS configuration, valid hosts, path types. +- **NetworkPolicy** — flag charts with no NetworkPolicy (all traffic allowed). +- **Service ports** — named ports match container ports. +- **Service selectors** — match pod labels. + +--- + +## Step 6 — Storage + +- **PVC templates** in StatefulSets — check `storageClassName`, access modes, size. +- **EmptyDir** with no `sizeLimit` → can fill node disk. +- **HostPath** volumes → breaks portability, security risk. +- **Volume mounts** — check for unnecessary write access. + +--- + +## Step 7 — Values and configurability + +Review `values.yaml`: + +- **Image tag** — flag `latest` or missing tag. Should default to `appVersion` from `Chart.yaml` or a pinned tag. +- **Image pull policy** — should be `IfNotPresent` for tagged images, `Always` only for `latest`. +- **Configurable resource limits** — requests/limits should be in values, not hardcoded in templates. +- **Environment-specific values** — check if the chart supports different envs via values overlays. 
+- **Sensitive defaults** — flag default passwords, tokens, or keys in `values.yaml`. + +--- + +## Step 8 — Generate report + +Compile findings into a timestamped Markdown report: + +``` +$REPORT_DIR/helm-chart-review--.md +``` + +### Report structure + +```markdown +# Helm Chart Review Report + +| Field | Value | +|---|---| +| Generated | | +| Chart | v | +| App version | | +| Templates | | +| Risk level | 🔴 / 🟡 / 🟢 | + +## Summary + + +## Findings +### 🔴 Critical +### 🟡 Warning +### 🔵 Info + +## Template-by-template breakdown + + +## Recommended changes + +``` + +--- + +## Safety rules + +- This workflow is **entirely read-only**. No charts are installed, upgraded, or deleted. +- `helm template` renders locally — it does not contact a cluster. +- `helm lint` is a local static check. +- Never print secret values from `values.yaml`. Flag their presence but redact. diff --git a/workflows/observability/incident-triage.md b/workflows/observability/incident-triage.md new file mode 100644 index 0000000..c1f04c0 --- /dev/null +++ b/workflows/observability/incident-triage.md @@ -0,0 +1,289 @@ +--- +description: Guided first 15 minutes of a production incident. Establishes timeline, assesses blast radius, gathers evidence, and coordinates response. Read-only investigation commands. +--- + +# /incident-triage — First 15 Minutes of an Incident + +Structured triage workflow for the critical first 15 minutes of a production incident. Guides you through timeline establishment, blast radius assessment, evidence gathering, and initial mitigation — with concrete commands for Kubernetes, AWS, and general infrastructure. + +## Prerequisites + +- Access to the affected environment (kubectl, AWS CLI, monitoring dashboards). +- This workflow uses **read-only** commands only. Mitigation actions are suggested but not executed automatically. 
+ +## Inputs + +- **INCIDENT** *(required)* — brief description of the symptoms (e.g., "scores-api returning 500s", "high latency on checkout", "pods crashing in prod"). +- **ENVIRONMENT** — `prod` / `staging` / `dev`. Default: `prod`. +- **AFFECTED_SERVICE** — service name if known. +- **REPORT_DIR** — Default: `./incident-triage-reports`. + +--- + +## Minute 0–2: Declare and orient + +### Establish the basics + +Ask the user (or determine from context): + +1. **What are the symptoms?** (errors, latency, downtime, data issue) +2. **When did it start?** (first alert, first customer report, when you noticed) +3. **Who reported it?** (alert, customer, internal) +4. **What environment?** (prod, staging, which region/cluster) +5. **What changed recently?** (deploys, config changes, infra changes, maintenance windows) + +### Check recent deployments + +```bash +# Kubernetes: recent rollouts +kubectl rollout history deploy -A 2>/dev/null | head -30 + +# Helm: recent releases (--date sorts by release date, newest last) +helm ls -A --date 2>/dev/null | tail -20 + +# Git: recent deploys (if deploy tags exist) +git log --oneline --since="6 hours ago" --all 2>/dev/null | head -20 + +# AWS: recent CloudFormation events +aws cloudformation describe-stack-events --stack-name <stack-name> --query 'StackEvents[:10].[Timestamp,ResourceStatus,LogicalResourceId,ResourceStatusReason]' --output table 2>/dev/null +``` + +### Draft initial status + +``` +🔴 Incident declared: <title> +Time: <HH:MM UTC> +Severity: <SEV1/SEV2/SEV3> +Impact: <who/what is affected> +Status: Investigating +IC: <your name> +Next update in 15 minutes. +``` + +--- + +## Minute 2–5: Assess blast radius + +### What's broken?
+ +```bash +# Kubernetes: cluster health snapshot +kubectl get nodes -o wide +kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide +kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -30 + +# If service is known: +kubectl get pods -n <ns> -l app=<service> -o wide +kubectl describe deploy -n <ns> <service> | tail -30 + +# AWS: service health +aws health describe-events --filter eventStatusCodes=open --query 'events[].{Service:service,Status:statusCode,Description:eventTypeCode}' --output table 2>/dev/null || true +``` + +### Who's affected? + +```bash +# Check error rates (if Prometheus/metrics available) +# Substitute your actual metric names +# -G with --data-urlencode keeps curl from globbing the {} and [] in the PromQL +curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))' 2>/dev/null | jq '.data.result' + +# Check ALB/NLB metrics (AWS) +aws cloudwatch get-metric-statistics \ + --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count \ + --dimensions Name=LoadBalancer,Value=<lb-name> \ + --start-time $(date -u -v-30M +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \ + --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \ + --period 60 --statistics Sum 2>/dev/null +``` + +### Quantify impact + +| Question | How to determine | +|---|---| +| Error rate | Prometheus, CloudWatch, APM | +| Affected users (%) | Compare error rate to total request rate | +| Which regions/AZs | Check per-region metrics, node distribution | +| Data loss risk | Check database health, replication status | +| Revenue impact | Error rate × average revenue per request | + +--- + +## Minute 5–10: Gather evidence + +### Logs from affected service + +```bash +# Kubernetes: recent logs +kubectl logs -n <ns> -l app=<service> --all-containers --tail=200 --timestamps --since=30m 2>/dev/null | grep -iE 'error|fatal|panic|exception|timeout|refused' | tail -50 + +# Previous container logs (if restarting) +for pod in
$(kubectl get pods -n <ns> -l app=<service> -o name); do + echo "=== $pod previous ===" + kubectl logs -n <ns> $pod --previous --tail=100 --timestamps 2>/dev/null | tail -20 +done + +# AWS Lambda (if applicable) +aws logs filter-log-events \ + --log-group-name "/aws/lambda/<function-name>" \ + --start-time $(($(date +%s) - 1800))000 \ + --filter-pattern "ERROR" \ + --limit 30 2>/dev/null +``` + +### Infrastructure state + +```bash +# Kubernetes: resource pressure +kubectl top nodes 2>/dev/null +kubectl top pods -n <ns> --sort-by=memory 2>/dev/null | head -20 + +# Kubernetes: HPA status +kubectl get hpa -n <ns> -o wide 2>/dev/null + +# AWS: EC2/RDS health +aws ec2 describe-instance-status --filters Name=instance-status.status,Values=impaired --query 'InstanceStatuses[].{Id:InstanceId,Status:InstanceStatus.Status}' --output table 2>/dev/null +aws rds describe-events --duration 30 --query 'Events[].{Source:SourceIdentifier,Type:EventCategories,Message:Message,Date:Date}' --output table 2>/dev/null +``` + +### Network and dependencies + +```bash +# DNS resolution (from inside the cluster) +kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 --command -- nslookup <service>.<ns>.svc.cluster.local 2>/dev/null + +# Endpoint health +kubectl get endpoints -n <ns> <service> -o wide + +# External dependency check +curl -sSm 5 -o /dev/null -w "status=%{http_code} time=%{time_total}s\n" https://<dependency-endpoint>/health 2>/dev/null || echo "UNREACHABLE" +``` + +--- + +## Minute 10–12: Identify and mitigate + +### Common root causes and quick mitigations + +| Symptom | Likely cause | Quick mitigation | +|---|---|---| +| Pods in CrashLoopBackOff after deploy | Bad code / config in new version | `kubectl rollout undo deploy/<name> -n <ns>` | +| All pods OOMKilled | Memory leak or insufficient limits | Scale up or increase memory limits | +| 503s from LB | No healthy targets | Check pod readiness, fix probes | +| Connection refused to dependency | Dependency is 
down | Check dependency status, failover | +| Slow queries / high DB CPU | Bad query or missing index | Identify and kill long-running queries | +| Certificate expired | TLS cert not renewed | Emergency cert renewal | +| DNS resolution failing | CoreDNS unhealthy | Restart CoreDNS pods | + +### Suggest (don't execute) mitigations + +The agent should present mitigation options but **never execute them automatically**: + +``` +Suggested mitigations (choose one — confirm before running): + +Option A: Rollback to previous version + kubectl rollout undo deploy/<service> -n <ns> + +Option B: Scale up to handle load + kubectl scale deploy/<service> -n <ns> --replicas=<N> + +Option C: Restart pods (if stuck state) + kubectl rollout restart deploy/<service> -n <ns> + +Option D: Disable traffic to the service + kubectl scale deploy/<service> -n <ns> --replicas=0 +``` + +--- + +## Minute 12–15: Communicate and plan + +### Status update + +``` +🟡 Update: <title> +Time: <HH:MM UTC> +Status: Identified / Mitigating +What we know: + - Root cause: <description> + - Impact: <X% of requests affected / Y users impacted> + - Started: <HH:MM UTC> +Current action: <what's being done> +Next update in 15 minutes. 
+``` + +### Evidence log + +Record everything gathered so far: + +```markdown +## Evidence collected at <timestamp> + +### Timeline +- HH:MM — First symptom / alert +- HH:MM — Investigation started +- HH:MM — Root cause identified: <description> +- HH:MM — Mitigation applied: <action> + +### Key findings +- <finding 1> +- <finding 2> + +### Commands run +- <command 1> → <result summary> +- <command 2> → <result summary> +``` + +--- + +## Step — Generate triage report + +Compile all findings into a timestamped report: + +``` +$REPORT_DIR/incident-triage-<service>-<YYYYMMDD-HHMMSS>.md +``` + +### Report structure + +```markdown +# Incident Triage Report + +| Field | Value | +|---|---| +| Generated | <timestamp> | +| Incident | <description> | +| Environment | <env> | +| Service | <service> | +| Severity | SEV1/SEV2/SEV3 | +| Duration (so far) | <minutes> | + +## Blast radius +<who/what is affected, error rates, user impact> + +## Timeline +<chronological events> + +## Root cause (if identified) +<description> + +## Evidence +<logs, metrics, command outputs> + +## Mitigation applied / recommended +<what was done or what should be done> + +## Next steps +<follow-up investigation, post-mortem scheduling> +``` + +--- + +## Safety rules + +- All investigation commands are **read-only**. +- **Mitigation commands are suggested but never executed automatically.** The user must explicitly confirm any write/mutation operation. +- Never print secret values from logs or configs. +- The DNS test pod (`dns-test`) uses `--rm` and auto-deletes. +- If kubectl/AWS commands fail due to permissions, record the failure and continue. +- Always confirm the target environment before suggesting any mitigation. 
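The "suggest, never execute" rule above could be enforced with a small confirmation gate around any mitigation command. This is a sketch under assumptions: `confirm_and_run` is a hypothetical wrapper, not something the workflow ships, and it expects the operator to type the literal word `yes`.

```bash
# Sketch: run a command only after the operator types "yes".
# confirm_and_run is a hypothetical wrapper; prompts go to stderr so that
# stdout carries only the command's own output (or "Skipped.").
confirm_and_run() {
  local answer
  printf 'About to run: %s\n' "$*" >&2
  printf 'Type "yes" to proceed: ' >&2
  read -r answer || answer=""
  if [ "$answer" = "yes" ]; then
    "$@"
  else
    echo "Skipped."
  fi
}

# Usage (not executed here):
#   confirm_and_run kubectl rollout undo deploy/<service> -n <ns>
```

Requiring the full word rather than a single keystroke is a deliberate choice for incident conditions, where a stray Enter should never trigger a rollback.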
diff --git a/workflows/security/secrets-leak-scan.md b/workflows/security/secrets-leak-scan.md new file mode 100644 index 0000000..2ad0f06 --- /dev/null +++ b/workflows/security/secrets-leak-scan.md @@ -0,0 +1,221 @@ +--- +description: Scan a git repository for leaked secrets across full history. Uses gitleaks, trufflehog, or manual regex patterns. Read-only, generates a markdown report. +--- + +# /secrets-leak-scan — Git Repository Secrets Scanner + +Scan a git repository's **full commit history** for accidentally committed secrets: API keys, passwords, tokens, private keys, connection strings, and credentials. Uses `gitleaks`, `trufflehog`, or falls back to manual regex patterns. **Read-only** — nothing is modified. + +## Prerequisites + +- A git repository (local clone). +- Recommended: `gitleaks` or `trufflehog` installed (the workflow will detect which is available). +- Fallback: `grep` and `git log` (always available, less accurate). + +## Inputs + +- **REPO_PATH** *(required)* — path to the git repository root. +- **SCAN_SCOPE** — `full` (entire git history) or `recent` (last 30 days / last 100 commits). Default: `full`. +- **REPORT_DIR** — Default: `./secrets-leak-scan-reports`. 
+ +--- + +## Step 1 — Detect available tools + +// turbo + +```bash +echo "=== Available scanners ===" +command -v gitleaks >/dev/null && echo "gitleaks: $(gitleaks version 2>&1)" || echo "gitleaks: not installed" +command -v trufflehog >/dev/null && echo "trufflehog: $(trufflehog --version 2>&1)" || echo "trufflehog: not installed" +echo "git: $(git --version)" +echo "grep: available" + +echo "" +echo "=== Repository info ===" +cd $REPO_PATH +echo "Repo: $(basename $(git rev-parse --show-toplevel))" +echo "Branch: $(git branch --show-current)" +echo "Commits: $(git rev-list --count HEAD)" +echo "Remotes: $(git remote -v | head -2)" +``` + +--- + +## Step 2 — Run primary scanner + +### Option A: gitleaks (preferred) + +```bash +cd $REPO_PATH + +# Full history scan +gitleaks detect --source . --report-format json --report-path /tmp/gitleaks-report.json --verbose 2>&1 + +# Or recent only +gitleaks detect --source . --log-opts="--since='30 days ago'" --report-format json --report-path /tmp/gitleaks-report.json --verbose 2>&1 + +# Parse results +cat /tmp/gitleaks-report.json | jq -r '.[] | "\(.RuleID)\t\(.File)\tcommit=\(.Commit[:8])\tauthor=\(.Author)\tdate=\(.Date)"' | head -50 +``` + +### Option B: trufflehog + +```bash +cd $REPO_PATH + +# Full history scan +trufflehog git file://. --json 2>/dev/null | jq -r '.SourceMetadata.Data.Git | "\(.file) commit=\(.commit[:8]) email=\(.email)"' | head -50 + +# Recent only +trufflehog git file://. 
--since-commit=$(git rev-list -1 --before="30 days ago" HEAD) --json 2>/dev/null | head -50 +``` + +### Option C: Manual regex fallback + +If neither tool is installed, fall back to git log + grep: + +```bash +cd "$REPO_PATH" + +echo "=== Scanning for common secret patterns ===" + +# grep -E does not understand \s or \x27, so the patterns below use POSIX classes + +# High-confidence patterns +git log -p --all --diff-filter=A 2>/dev/null | grep -nE \ + 'AKIA[0-9A-Z]{16}|AIza[0-9A-Za-z_-]{35}|ghp_[0-9a-zA-Z]{36}|gho_[0-9a-zA-Z]{36}|glpat-[0-9a-zA-Z-]{20}|sk-[0-9a-zA-Z]{48}|xox[bporas]-[0-9a-zA-Z-]+' \ + | head -30 + +# AWS keys +git log -p --all 2>/dev/null | grep -nE 'AKIA[0-9A-Z]{16}' | head -10 +echo "" + +# Private keys +git log -p --all 2>/dev/null | grep -nE 'BEGIN (RSA |DSA |EC |OPENSSH )?PRIVATE KEY' | head -10 +echo "" + +# Connection strings +git log -p --all 2>/dev/null | grep -nE '(mysql|postgres|mongodb|redis)://[^/[:space:]]+:[^@[:space:]]+@' | head -10 +echo "" + +# Generic password/secret assignments +git log -p --all 2>/dev/null | grep -nE '(password|passwd|secret|token|api_key|apikey|access_key|private_key)[[:space:]]*[:=][[:space:]]*["'\''][^[:space:]"'\'']{8,}' | head -20 +echo "" + +# .env files committed +git log --all --name-only --diff-filter=A 2>/dev/null | grep -E '^\.env$|\.env\.' | sort -u +echo "" + +# Key/cert files committed +git log --all --name-only --diff-filter=A 2>/dev/null | grep -iE '\.(pem|key|p12|pfx|jks|keystore|cert)$' | sort -u +``` + +--- + +## Step 3 — Triage findings + +For each finding, classify: + +| Severity | Pattern | Action | +|---|---|---| +| 🔴 Critical | AWS access key (`AKIA*`), private key, GCP service account JSON, GitHub PAT (`ghp_*`), Slack token (`xox*`) | Rotate immediately. Check if key is still active. | +| 🔴 Critical | Database connection string with credentials | Rotate password. Check if DB is exposed. | +| 🟡 Warning | Generic `password=`, `secret=`, `token=` in config files | May be placeholder/test value — verify if real. | +| 🟡 Warning | `.env` file committed | Remove from tracking, add to `.gitignore`.
| +| 🔵 Info | Test/mock credentials, example configs, documentation examples | Verify these are not real credentials. | + +### Check if secrets are still active + +For AWS keys: + +```bash +# Check if a found AWS key pair is still active (requires aws CLI and the leaked secret key) +# get-caller-identity has no --access-key-id flag; the credentials are passed via the environment +AWS_ACCESS_KEY_ID=AKIA... AWS_SECRET_ACCESS_KEY='<leaked-secret-key>' aws sts get-caller-identity 2>&1 +# "InvalidClientTokenId" = deactivated/deleted (safe) +# Success = STILL ACTIVE (rotate immediately!) +``` + +### Check if secrets are in current HEAD + +```bash +# Is the secret still in the current codebase? (not just history) +git grep -l 'AKIA...' HEAD 2>/dev/null +``` + +If the secret is only in history (not current HEAD), it's still a risk — the git history is accessible to anyone who clones the repo. + +--- + +## Step 4 — Check .gitignore coverage + +// turbo + +```bash +cd "$REPO_PATH" + +echo "=== .gitignore check ===" +cat .gitignore 2>/dev/null || echo "NO .gitignore FILE" + +echo "" +echo "=== Files that should typically be gitignored ===" +for pattern in ".env" ".env.*" "*.pem" "*.key" "*.p12" "*.pfx" "*.jks" "*.keystore" "credentials" "*.tfvars" "*.tfstate" "terraform.tfstate*" ".terraform/" "secrets.yaml" "secrets.yml"; do + found=$(git ls-files "$pattern" 2>/dev/null) + [ -n "$found" ] && echo "TRACKED: $found (should be in .gitignore)" +done +``` + +--- + +## Step 5 — Generate report + +Compile findings into a timestamped Markdown report: + +``` +$REPORT_DIR/secrets-leak-scan-<repo-name>-<YYYYMMDD-HHMMSS>.md +``` + +### Report structure + +```markdown +# Secrets Leak Scan Report + +| Field | Value | +|---|---| +| Generated | <timestamp> | +| Repository | <repo name> | +| Scanner | gitleaks / trufflehog / manual | +| Scan scope | full history / recent | +| Commits scanned | <count> | +| Findings | <count> | +| Risk level | 🔴 / 🟡 / 🟢 | + +## Summary +<count of findings by severity> + +## 🔴 Critical findings +<secret type, file, commit, still active?> + +## 🟡 Warnings +<potential secrets, .env files, suspicious patterns> + +## 🔵 Info +<test
credentials, documentation examples> + +## .gitignore gaps +<files that should be excluded> + +## Recommended actions +1. Rotate all 🔴 critical secrets immediately +2. Add missing .gitignore patterns +3. Consider using git-filter-repo to remove secrets from history +4. Set up pre-commit hooks to prevent future leaks +``` + +--- + +## Safety rules + +- This workflow is **entirely read-only**. No files are modified, no secrets are rotated, no git history is rewritten. +- **Never print full secret values** in the report. Redact to first/last 4 characters: `AKIA****WXYZ`. +- The `sts get-caller-identity` check for AWS keys is read-only — it does not perform any actions with the key. +- If a secret is found to be active, recommend rotation but do not rotate it. +- Suggested remediation commands (git-filter-repo, pre-commit hooks) are provided in the report for the user to evaluate and run manually.
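The redaction rule above (first and last 4 characters, as in `AKIA****WXYZ`) can be sketched as a small helper applied before any finding is written to the report. `redact_secret` is a hypothetical name, and fully masking values of 8 characters or fewer is an assumption added here so that short secrets are not exposed almost entirely.

```bash
# Sketch: keep only the first and last 4 characters of a secret.
# redact_secret is a hypothetical helper; values of 8 characters or
# fewer are masked completely (an assumption, not stated in the workflow).
redact_secret() {
  local s=$1 first last
  if [ "${#s}" -le 8 ]; then
    echo "****"
  else
    first=$(printf '%s' "$s" | cut -c1-4)
    last=$(printf '%s' "$s" | tail -c 4)
    echo "${first}****${last}"
  fi
}

redact_secret "AKIAABCDEFGHIJKLWXYZ"
```

For the sample value the helper prints `AKIA****WXYZ`, matching the format the safety rule asks for.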