From a685e39150ddb2d6e041bc69fb17d55498d1d5ef Mon Sep 17 00:00:00 2001
From: Sergei Olshanetski
Date: Fri, 22 May 2026 09:14:53 -0400
Subject: [PATCH] Improve workflows and add repo operations helpers

Improve existing workflows:
- k8s-workload-debug: sidecar/init-container analysis and GitOps ownership checks
- helm-release-debug: GitOps ownership, CRD lifecycle, and hook cleanup checks
- aws-vpc-debug: clearer source/destination resolution, DNS split-horizon notes, Reachability Analyzer guidance
- k8s-rbac-audit: ServiceAccount token exposure checks

Improve prompts:
- postmortem-writer: SLO/data impact, recurrence risk, action item types
- explain-like-a-senior: prerequisite knowledge, safe validation, team questions

Add new entities:
- runbook-from-incident prompt
- release-checklist workflow
- repo-health workflow
- validate-repo.sh local validation script

Docs:
- update README and CHANGELOG
---
 CHANGELOG.md                               |  10 ++
 README.md                                  |   5 +-
 prompts/explain-like-a-senior.md           |  10 ++
 prompts/postmortem-writer.md               |  18 +-
 prompts/runbook-from-incident.md           |  91 +++++++++++
 scripts/validate-repo.sh                   |  78 +++++++++
 workflows/aws/aws-vpc-debug.md             |  63 +++++++
 workflows/cicd/release-checklist.md        | 181 +++++++++++++++++++++
 workflows/kubernetes/helm-release-debug.md |  42 +++++
 workflows/kubernetes/k8s-rbac-audit.md     |  24 +++
 workflows/kubernetes/k8s-workload-debug.md |  40 ++++-
 workflows/security/repo-health.md          | 181 +++++++++++++++++++++
 12 files changed, 737 insertions(+), 6 deletions(-)
 create mode 100644 prompts/runbook-from-incident.md
 create mode 100755 scripts/validate-repo.sh
 create mode 100644 workflows/cicd/release-checklist.md
 create mode 100644 workflows/security/repo-health.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4c15f2f..f724b18 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,14 +8,18 @@ All notable changes to this project will be documented in this file.
 - **`/helm-chart-review`** — review Helm charts for security, reliability, and best practices (kubernetes/)
 - **`/secrets-leak-scan`** — scan git repos for leaked secrets using gitleaks, trufflehog, or regex (security/)
 - **`/incident-triage`** — guided first 15 minutes of a production incident (observability/)
+- **`/release-checklist`** — pre-release safety gate covering scope, deploy order, rollback, tests, monitoring, and communication (cicd/)
+- **`/repo-health`** — repository hygiene audit for docs, CI, ownership, branch/release hygiene, and secrets risk (security/)

 ### Added — Prompts
 - **`pr-description.md`** — generate PR descriptions from diffs
 - **`explain-like-a-senior.md`** — explain infrastructure code to junior engineers
+- **`runbook-from-incident.md`** — turn incident notes or post-mortems into reusable runbooks

 ### Added — Scripts
 - **`aws-whoami.sh`** — quick AWS identity and account context check
 - **`stale-branches.sh`** — list git branches older than N days
+- **`validate-repo.sh`** — local validation for workflow frontmatter, README links, executable scripts, and optional lint checks

 ### Added — CI
 - GitHub Actions CI: markdown lint, link check, frontmatter validation, README link verification
@@ -25,6 +29,12 @@ All notable changes to this project will be documented in this file.
 - **`/aws-cost-quickscan`** — added `DEEP=yes` input for per-instance CPU utilization analysis
 - **`/terraform-plan-review`** — added Step 0 with plan generation commands (including Terragrunt)
 - **`/k8s-debug`** — enhanced log analysis (Step 5) with init container logs, structured error extraction, severity classification, and "noisiest pods" scan; added restart timeline analysis (Step 6a) and HPA health check (Step 6b); expanded triage cheat-sheet with startup-order, Redis, autoscaling, and webhook patterns
+- **`/k8s-workload-debug`** — added init/sidecar analysis and GitOps/controller ownership checks
+- **`/k8s-rbac-audit`** — added ServiceAccount token exposure checks
+- **`/helm-release-debug`** — added ArgoCD/Flux ownership checks before suggesting manual Helm recovery
+- **`/aws-vpc-debug`** — clarified source/destination variable resolution for VPC, subnet, security groups, and destination IP
+- **`postmortem-writer.md`** — added SLO/data impact, recurrence risk, and action item type classification
+- **`explain-like-a-senior.md`** — added prerequisite knowledge, safe validation, and team-question sections

 ---

diff --git a/README.md b/README.md
index fed27bd..e55a43e 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,7 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
 |---|---|---|---|
 | [ci-debug](./workflows/cicd/ci-debug.md) | `/ci-debug` | Diagnose a failing CI/CD pipeline: parse build logs from Jenkins, GitHub Actions, GitLab CI, or Bitbucket Pipelines. Root cause analysis and fix suggestions. | Build log output. Optional: repo source, CI config file. |
 | [jenkins-pipeline-review](./workflows/cicd/jenkins-pipeline-review.md) | `/jenkins-pipeline-review` | Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or `vars/*.groovy`. Optional: `repositories_v2.json`. |
+| [release-checklist](./workflows/cicd/release-checklist.md) | `/release-checklist` | Pre-release safety gate: scope, deploy order, rollback, tests, monitoring, and communication before production release. | PR/diff summary. Optional: test results, plans, diffs. |
 | [dockerfile-review](./workflows/containers/dockerfile-review.md) | `/dockerfile-review` | Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: `docker`, `trivy`. |

 ### Security
@@ -55,6 +56,7 @@ A growing collection of **AI-agent workflows, prompts, and rules** for day-to-da
 | Workflow | Slash command | Description | Prerequisites |
 |---|---|---|---|
 | [secrets-leak-scan](./workflows/security/secrets-leak-scan.md) | `/secrets-leak-scan` | Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: `gitleaks`, `trufflehog`. |
+| [repo-health](./workflows/security/repo-health.md) | `/repo-health` | Audit repository hygiene: README, license, CI, branch/release hygiene, tracked secrets, ownership, and automation gaps. | Local git repo. Optional: `gh`, `jq`. |

 ### Observability & Incident

@@ -75,6 +77,7 @@ Reusable system prompts you can paste into any AI agent for common DevOps tasks:
 | [code-review-devops](./prompts/code-review-devops.md) | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. |
 | [pr-description](./prompts/pr-description.md) | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. |
 | [explain-like-a-senior](./prompts/explain-like-a-senior.md) | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. |
+| [runbook-from-incident](./prompts/runbook-from-incident.md) | Converts incident notes or post-mortems into reusable runbooks with diagnosis, mitigation, escalation, and follow-up steps. |

 ## Rules

@@ -95,6 +98,7 @@ Standalone shell utilities referenced by workflows or useful on their own:
 | [k8s-snapshot.sh](./scripts/k8s-snapshot.sh) | `./k8s-snapshot.sh [namespace\|all] [output-dir]` — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. |
 | [aws-whoami.sh](./scripts/aws-whoami.sh) | `./aws-whoami.sh [profile]` — quick AWS identity check: caller, region, account alias, org, SSO role. |
 | [stale-branches.sh](./scripts/stale-branches.sh) | `./stale-branches.sh [days] [--remote]` — list git branches older than N days with last commit info. |
+| [validate-repo.sh](./scripts/validate-repo.sh) | `./scripts/validate-repo.sh` — validate workflow frontmatter, README links, script executability, and optional lint checks. |

 ## Using a workflow

@@ -147,7 +151,6 @@ Ideas I plan to add (PRs welcome):
 **Containers & CI/CD**
 - [ ] `/image-cve-triage` — prioritise CVE scanner output by exploitability + fix availability
 - [ ] `/github-actions-review` — security review of GitHub Actions workflow files
-- [ ] `/release-checklist` — pre-release gate

 **Observability & incident**
 - [ ] `/prometheus-query-helper` — intent → PromQL with rationale
diff --git a/prompts/explain-like-a-senior.md b/prompts/explain-like-a-senior.md
index b357531..1299049 100644
--- a/prompts/explain-like-a-senior.md
+++ b/prompts/explain-like-a-senior.md
@@ -30,6 +33,9 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 ## Overview
+## Prerequisite knowledge
+
+
 ## Walk-through
@@ -44,8 +47,14 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 ## Things to watch out for
+## How to validate it safely
+
+
 ## If I were reviewing this
+
+## Good questions to ask the team
+
 ```
 ### Rules
@@ -53,4 +62,5 @@ You are a **senior DevOps/SRE engineer** explaining infrastructure code to a jun
 - **No condescension.** Junior doesn't mean stupid. Explain clearly without being patronizing.
 - **No hand-waving.** If you don't know why something is done a certain way, say "I'm not sure why this specific choice was made — it might be historical. Here's what I'd investigate."
 - **Use the actual code.** Reference specific lines, variables, and resource names.
+- **Teach safe validation.** Prefer read-only commands, dry-runs, local tests, and plan output.
 - **Encourage questions.** End with "Good questions to ask your team about this: ..."
diff --git a/prompts/postmortem-writer.md b/prompts/postmortem-writer.md
index f8b43a8..ebb6ba7 100644
--- a/prompts/postmortem-writer.md
+++ b/prompts/postmortem-writer.md
@@ -33,7 +35,9 @@ Generate the post-mortem in this structure:
 - **Services affected:**
 - **Duration of impact:**
 - **SLA impact:**
+- **SLO impact:**
 - **Revenue impact:**
+- **Data impact:**

 ## Timeline (UTC)

@@ -65,6 +67,13 @@ Generate the post-mortem in this structure:
 - **What was the mitigation?**
 - **Was the runbook followed?**

+## Recurrence risk
+
+- **Likelihood of recurrence:** Low / Medium / High
+- **Why:**
+- **Existing guardrails:**
+- **Missing guardrails:**
+
 ## Contributing factors
@@ -91,10 +100,10 @@ Generate the post-mortem in this structure:
 ## Action items

-| # | Action | Owner | Priority | Due date | Status |
-|---|---|---|---|---|---|
-| 1 | | | P1/P2/P3 | | Open |
-| 2 | ... | ... | ... | ... | ... |
+| # | Action | Owner | Priority | Due date | Type | Status |
+|---|---|---|---|---|---|---|
+| 1 | | | P1/P2/P3 | | Prevent / Detect / Mitigate | Open |
+| 2 | ... | ... | ... | ... | ... | ... |

 ## Lessons learned
@@ -108,4 +117,5 @@ Generate the post-mortem in this structure:
 - **Honest.** If the root cause is unknown, say so. "Root cause is not fully determined; the leading hypothesis is X" is better than guessing.
 - **Action-oriented.** Every "what could be improved" must have a corresponding action item with an owner.
 - **Time-bounded.** Action items need due dates. "Eventually" means "never."
+- **Prevention-balanced.** Include at least one action item for detection/alerting and one for prevention when applicable.
 - **Ask for missing information.** If the user's notes don't cover detection, response, or contributing factors, ask specifically.
diff --git a/prompts/runbook-from-incident.md b/prompts/runbook-from-incident.md
new file mode 100644
index 0000000..e2dcd8b
--- /dev/null
+++ b/prompts/runbook-from-incident.md
@@ -0,0 +1,91 @@
+# Runbook From Incident — System Prompt
+
+Paste this into any AI agent after an incident, post-mortem, or debugging session to turn the learned procedure into a reusable runbook.
+
+---
+
+## System prompt
+
+You are a **senior SRE runbook writer**. Given incident notes, a post-mortem, chat transcript, or troubleshooting commands, create a practical runbook that another engineer can follow during a future incident.
+
+### Output format
+
+```markdown
+# Runbook:
+
+**Owner:**
+**Service:**
+**Severity:**
+**Last updated:**
+**Related alerts:**
+**Related dashboards:**
+
+---
+
+## When to use this runbook
+
+Use this when:
+-
+-
+
+Do not use this when:
+-
+
+## Quick diagnosis
+
+| Check | Command / Dashboard | Expected healthy result | Bad result |
+|---|---|---|---|
+| | `` | | |
+
+## Triage steps
+
+### Step 1 — Confirm impact
+
+```bash
+
+```
+
+Expected result:
+-
+
+If bad:
+-
+
+### Step 2 — Identify likely root cause
+
+```bash
+
+```
+
+## Mitigation options
+
+> Do not execute mitigations automatically. Confirm environment and impact first.
+
+| Option | When to use | Command | Risk | Rollback |
+|---|---|---|---|---|
+| Rollback | Bad deploy suspected | `` | | |
+| Scale up | Load/resource pressure | `` | | |
+
+## Escalation
+
+Escalate when:
+-
+
+Escalate to:
+-
+
+## Post-incident follow-up
+
+- [ ] Update this runbook with new findings
+- [ ] Add/adjust alert if detection was slow
+- [ ] Add test/guardrail if prevention was possible
+```
+
+### Rules
+
+- **Prefer read-only diagnosis first.** Commands under diagnosis should not mutate state.
+- **Separate diagnosis from mitigation.** Mitigation commands must be clearly marked and require human confirmation.
+- **Make commands copy-pastable.** Use placeholders like `` only when the value is genuinely environment-specific.
+- **Include expected output.** A runbook is only useful if the reader knows what good and bad look like.
+- **Preserve safety context.** Always include environment confirmation for production-impacting steps.
+- **Avoid tribal knowledge.** If the original incident required someone knowing a hidden dependency, document it explicitly.
diff --git a/scripts/validate-repo.sh b/scripts/validate-repo.sh
new file mode 100755
index 0000000..80950d5
--- /dev/null
+++ b/scripts/validate-repo.sh
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+# ────────────────────────────────────────────────────────────────
+# validate-repo.sh — Local validation for devops-ai-workflows
+# ────────────────────────────────────────────────────────────────
+# Usage: ./scripts/validate-repo.sh
+#
+# Checks:
+#   - workflow markdown files have YAML frontmatter
+#   - README links point to existing local files
+#   - shell scripts are executable
+#   - optional markdownlint/shellcheck if installed
+# ────────────────────────────────────────────────────────────────
+set -euo pipefail
+
+ROOT_DIR=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
+cd "$ROOT_DIR"
+
+errors=0
+
+echo "🔎 Validating devops-ai-workflows repo"
+echo "Root: $ROOT_DIR"
+echo ""
+
+echo "== Workflow frontmatter =="
+while IFS= read -r file; do
+  if ! head -1 "$file" | grep -q '^---$'; then
+    echo "❌ Missing frontmatter: $file"
+    errors=$((errors + 1))
+  fi
+done < <(find workflows -name '*.md' | sort)
+[ "$errors" -eq 0 ] && echo "✅ Workflow frontmatter OK"
+echo ""
+
+echo "== README local links =="
+while IFS= read -r link; do
+  path=${link#./}
+  if [ ! -e "$path" ]; then
+    echo "❌ Broken README link: $link"
+    errors=$((errors + 1))
+  fi
+done < <(grep -oE '\]\(\./[^)]+\)' README.md | sed -E 's/^.*\((.*)\)$/\1/' | sort -u)
+[ "$errors" -eq 0 ] && echo "✅ README local links OK"
+echo ""
+
+echo "== Script executability =="
+while IFS= read -r file; do
+  if [ ! -x "$file" ]; then
+    echo "❌ Script not executable: $file"
+    errors=$((errors + 1))
+  fi
+done < <(find scripts -name '*.sh' | sort)
+[ "$errors" -eq 0 ] && echo "✅ Scripts executable"
+echo ""
+
+if command -v markdownlint >/dev/null 2>&1; then
+  echo "== markdownlint =="
+  markdownlint '**/*.md' || errors=$((errors + 1))
+  echo ""
+else
+  echo "ℹ️ markdownlint not installed; skipping"
+  echo ""
+fi
+
+if command -v shellcheck >/dev/null 2>&1; then
+  echo "== shellcheck =="
+  shellcheck scripts/*.sh || errors=$((errors + 1))
+  echo ""
+else
+  echo "ℹ️ shellcheck not installed; skipping"
+  echo ""
+fi
+
+if [ "$errors" -eq 0 ]; then
+  echo "✅ Validation passed"
+else
+  echo "❌ Validation failed with $errors issue(s)"
+  exit 1
+fi
diff --git a/workflows/aws/aws-vpc-debug.md b/workflows/aws/aws-vpc-debug.md
index 6583f8e..307005a 100644
--- a/workflows/aws/aws-vpc-debug.md
+++ b/workflows/aws/aws-vpc-debug.md
@@ -32,6 +32,14 @@ Ask the user for the following:
 ```bash
 aws sts get-caller-identity
 REGION=${REGION:-$(aws configure get region)}
+SRC_PRIVATE_IP=""
+SRC_SUBNET_ID=""
+SRC_VPC_ID=""
+SRC_SECURITY_GROUPS=""
+DEST_IP="$DESTINATION"
+DST_VPC_ID=""
+DST_SUBNET_ID=""
+DST_SECURITY_GROUPS=""

 echo "=== Resolve SOURCE ==="
 # If SOURCE looks like an instance ID
@@ -39,6 +47,10 @@ if echo "$SOURCE" | grep -qE '^i-[0-9a-f]+$'; then
   aws ec2 describe-instances --region $REGION --instance-ids $SOURCE \
     --query 'Reservations[0].Instances[0].{InstanceId:InstanceId,PrivateIp:PrivateIpAddress,PublicIp:PublicIpAddress,SubnetId:SubnetId,VpcId:VpcId,SecurityGroups:SecurityGroups[].GroupId,Name:Tags[?Key==`Name`]|[0].Value}' \
     --output json
+  SRC_PRIVATE_IP=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
+  SRC_SUBNET_ID=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].SubnetId' --output text)
+  SRC_VPC_ID=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].VpcId' --output text)
+  SRC_SECURITY_GROUPS=$(aws ec2 describe-instances --region $REGION --instance-ids $SOURCE --query 'Reservations[0].Instances[0].SecurityGroups[].GroupId' --output text)
 fi

 # If SOURCE looks like an ENI
@@ -46,6 +58,10 @@ if echo "$SOURCE" | grep -qE '^eni-[0-9a-f]+$'; then
   aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE \
     --query 'NetworkInterfaces[0].{Id:NetworkInterfaceId,PrivateIp:PrivateIpAddress,SubnetId:SubnetId,VpcId:VpcId,SecurityGroups:Groups[].GroupId,Description:Description}' \
     --output json
+  SRC_PRIVATE_IP=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].PrivateIpAddress' --output text)
+  SRC_SUBNET_ID=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].SubnetId' --output text)
+  SRC_VPC_ID=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].VpcId' --output text)
+  SRC_SECURITY_GROUPS=$(aws ec2 describe-network-interfaces --region $REGION --network-interface-ids $SOURCE --query 'NetworkInterfaces[0].Groups[].GroupId' --output text)
 fi

 echo "=== Resolve DESTINATION ==="
@@ -53,6 +69,7 @@ echo "=== Resolve DESTINATION ==="
 if echo "$DESTINATION" | grep -qE '[a-zA-Z]'; then
   echo "DNS resolution:"
   dig +short "$DESTINATION" 2>/dev/null || nslookup "$DESTINATION" 2>/dev/null || echo "Could not resolve"
+  DEST_IP=$(dig +short "$DESTINATION" 2>/dev/null | grep -E '^[0-9]+\.' | head -1 || echo "$DESTINATION")
 fi

 # If DESTINATION looks like an RDS endpoint
@@ -62,6 +79,18 @@ if echo "$DESTINATION" | grep -qE '\.rds\.amazonaws\.com$'; then
     --query 'DBInstances[0].{Id:DBInstanceIdentifier,Endpoint:Endpoint,VpcSecurityGroups:VpcSecurityGroups[].VpcSecurityGroupId,SubnetGroup:DBSubnetGroup.DBSubnetGroupName}' \
     --output json 2>/dev/null || true
 fi
+
+echo "=== Resolved variables for later steps ==="
+cat </dev/null || echo "Destination IP not found as ENI address"
+  DST_VPC_ID=$(aws ec2 describe-network-interfaces --region $REGION --filters Name=addresses.private-ip-address,Values=$DEST_IP --query 'NetworkInterfaces[0].VpcId' --output text 2>/dev/null || true)
+  DST_SUBNET_ID=$(aws ec2 describe-network-interfaces --region $REGION --filters Name=addresses.private-ip-address,Values=$DEST_IP --query 'NetworkInterfaces[0].SubnetId' --output text 2>/dev/null || true)
+  DST_SECURITY_GROUPS=$(aws ec2 describe-network-interfaces --region $REGION --filters Name=addresses.private-ip-address,Values=$DEST_IP --query 'NetworkInterfaces[0].Groups[].GroupId' --output text 2>/dev/null || true)
 fi
 echo "=== VPC peering connections ==="
@@ -108,6 +140,7 @@ Flag:
 - Source and destination in different VPCs without peering/TGW.
 - Public vs private subnet classification.
+- Missing resolved variables (`SRC_VPC_ID`, `SRC_SUBNET_ID`, `DEST_IP`) — later checks depend on them.

 ---
@@ -268,6 +301,36 @@ Flag:
 - `enableDnsSupport` or `enableDnsHostnames` disabled.
 - Destination hostname not resolvable.
 - Missing private hosted zone association for cross-VPC DNS.
+- Split-horizon mismatch: local laptop resolves a different IP than the source VPC would resolve.
+- Resolver endpoint exists but has unhealthy status or insufficient IP addresses.
+
+---
+
+## Step 7b — Optional Reachability Analyzer path check
+
+> Use this only if the source and destination are ENIs or instances and the account has EC2 Reachability Analyzer permissions. It creates no traffic; it analyzes control-plane configuration.
+
+```bash
+echo "=== Reachability Analyzer guidance ==="
+cat < \\
+    --destination \\
+    --protocol ${PROTOCOL^^} \\
+    --destination-port $PORT
+
+Then:
+  aws ec2 start-network-insights-analysis --network-insights-path-id
+  aws ec2 describe-network-insights-analyses --network-insights-analysis-ids
+
+Clean up:
+  aws ec2 delete-network-insights-path --network-insights-path-id
+EOF
+```
+
+Flag: Reachability Analyzer finding the exact blocker (route table, SG, NACL, endpoint policy) should override heuristic guesses from earlier steps.

 ---
diff --git a/workflows/cicd/release-checklist.md b/workflows/cicd/release-checklist.md
new file mode 100644
index 0000000..68350c4
--- /dev/null
+++ b/workflows/cicd/release-checklist.md
@@ -0,0 +1,181 @@
+---
+description: Pre-release safety checklist for application, infrastructure, or platform changes. Reviews deploy order, rollback, tests, monitoring, and communication before release.
+---
+
+# /release-checklist — Pre-Release Safety Gate
+
+Use before releasing application code, infrastructure changes, Helm charts, Terraform modules, or CI/CD updates. The workflow produces a release readiness report and highlights blockers before production deployment.
+
+## Prerequisites
+
+- PR/diff summary or release notes.
+- Target environment and deployment method.
+- Optional: test results, Terraform plan, Helm diff, ArgoCD app status, CI build URL.
+
+## Inputs
+
+- **RELEASE_NAME** *(required)* — short name of the change/release.
+- **ENVIRONMENT** — target environment. Default: `staging`.
+- **CHANGE_TYPE** — `app` | `infra` | `helm` | `terraform` | `pipeline` | `mixed`.
+- **REPORT_DIR** — Default: `./release-checklist-reports`.
+
+---
+
+## Step 1 — Identify release scope
+
+Gather:
+
+- What is changing?
+- Which repos/services/environments are affected?
+- Is this a single-repo or multi-repo release?
+- Is there a database, schema, IAM, networking, or config change?
+- Is there a feature flag or staged rollout mechanism?
+
+Classify risk:
+
+| Risk | Criteria |
+|---|---|
+| Low | Backward-compatible, tested, easy rollback, small blast radius |
+| Medium | Config/IaC changes, multiple services, partial rollback complexity |
+| High | Data migration, IAM/networking, irreversible changes, production-wide impact |
+
+---
+
+## Step 2 — Validate test and build evidence
+
+Check:
+
+- CI build passed for the exact commit being deployed.
+- Unit/integration/e2e tests relevant to the change passed.
+- Security scans completed or exceptions are documented.
+- Artifact/image tag is immutable and traceable to commit SHA.
+- No local-only changes are required for deploy.
+
+For infrastructure:
+
+```bash
+terraform validate
+terraform plan -out=plan.bin
+terraform show -json plan.bin > plan.json
+```
+
+For Helm/Kubernetes:
+
+```bash
+helm lint
+helm template --values >/tmp/rendered.yaml
+kubectl diff -f /tmp/rendered.yaml --server-side 2>/dev/null || true
+```
+
+---
+
+## Step 3 — Deploy order and dependency check
+
+Document:
+
+| Item | Value |
+|---|---|
+| Must deploy before | |
+| Must deploy after | |
+| Can deploy independently | yes/no |
+| Requires feature flag | yes/no |
+| Requires maintenance window | yes/no |
+
+Common ordering rules:
+
+- Database/schema backward-compatible change before app rollout.
+- IAM/networking prerequisites before service deployment.
+- CRDs before custom resources.
+- Shared libraries/build seed changes before dependent service builds.
+- Producer/consumer API compatibility verified before either side is deployed.
+
+---
+
+## Step 4 — Rollback and recovery plan
+
+Every release needs a rollback plan:
+
+| Area | Rollback approach | Time estimate | Risk |
+|---|---|---|---|
+| App | redeploy previous image/tag |