A growing collection of AI-agent workflows, prompts, and rules for day-to-day DevOps / SRE / platform work.
Note: "workflows" here means AI coding-agent workflows (Windsurf, Cursor, Claude Code, etc.) — not GitHub Actions.
| Folder | Purpose | Audience |
|---|---|---|
workflows/ |
Workflow definitions, grouped by domain | Everyone |
prompts/ |
Reusable system / task prompts (incident triage, code review, post-mortem, etc.) | Any LLM |
rules/ |
Editor / agent rule files (.windsurfrules, .cursorrules, Copilot instructions) |
Per-tool |
scripts/ |
Standalone shell scripts referenced by workflows | Anyone with a shell |
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| k8s-debug | /k8s-debug |
General-purpose, read-only cluster diagnostics across nodes, pods, workloads, networking, storage, RBAC, events, and resource pressure. | kubectl. Optional: jq, metrics-server. |
| k8s-workload-debug | /k8s-workload-debug |
Deep-dive on a single Deployment / StatefulSet / DaemonSet / Job / Pod: rollout, spec, probes, resources, logs, networking, storage, config. | kubectl. Optional: jq, metrics-server. |
| k8s-rbac-audit | /k8s-rbac-audit |
RBAC risk audit — wildcards, cluster-admin bindings, risky verb/resource combos, over-privileged ServiceAccounts, anonymous access. | kubectl, jq. Optional: kubectl-who-can. |
| k8s-cost-hotspots | /k8s-cost-hotspots |
Find waste: over-provisioned workloads, missing requests/limits, idle workloads, orphan PVCs/PVs, idle LoadBalancers. | kubectl, jq, metrics-server. |
| k8s-upgrade-readiness | /k8s-upgrade-readiness |
Pre-flight before a control-plane / node upgrade: deprecated APIs, version skew, PDB gaps, expiring certs, broken webhooks. | kubectl. Optional: kubent or pluto, helm. |
| helm-release-debug | /helm-release-debug |
Diagnose a stuck or failed Helm release: history, values diff, hook failures, rendered manifest vs cluster, workload health. | helm v3, kubectl. Optional: jq, yq. |
| helm-chart-review | /helm-chart-review |
Review a Helm chart for security, reliability, and best practices: resource specs, probes, security context, PDBs, anti-affinity, RBAC. | Helm chart source. Optional: helm CLI. |
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| aws-account-audit | /aws-account-audit |
Read-only AWS account security & hygiene audit: IAM, S3, EC2, RDS, CloudTrail, encryption, GuardDuty, SecurityHub. | aws CLI. Optional: jq. |
| aws-cost-quickscan | /aws-cost-quickscan |
Find AWS cost waste: idle EC2/RDS, unattached EBS, old snapshots, expensive log groups, NAT data processing, missing Savings Plans. | aws CLI, Cost Explorer enabled. Optional: jq. |
| aws-vpc-debug | /aws-vpc-debug |
Diagnose VPC connectivity: trace path across SGs, NACLs, route tables, NAT/IGW/TGW, VPC endpoints, DNS, and flow logs. | aws CLI. Optional: jq, dig. |
| aws-iam-policy-review | /aws-iam-policy-review |
Explain an IAM policy and flag risks: admin-equivalent access, privilege escalation paths, wildcard actions, missing conditions. | aws CLI. Optional: jq. |
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| terraform-plan-review | /terraform-plan-review |
Explain a Terraform plan and flag risky changes: destroys, replacements, security group mutations, IAM changes, blast radius. | terraform plan output. Optional: terraform CLI, jq. |
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| ci-debug | /ci-debug |
Diagnose a failing CI/CD pipeline: parse build logs from Jenkins, GitHub Actions, GitLab CI, or Bitbucket Pipelines. Root cause analysis and fix suggestions. | Build log output. Optional: repo source, CI config file. |
| jenkins-pipeline-review | /jenkins-pipeline-review |
Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or vars/*.groovy. Optional: repositories_v2.json. |
| release-checklist | /release-checklist |
Pre-release safety gate: scope, deploy order, rollback, tests, monitoring, and communication before production release. | PR/diff summary. Optional: test results, plans, diffs. |
| dockerfile-review | /dockerfile-review |
Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: docker, trivy. |
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| secrets-leak-scan | /secrets-leak-scan |
Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: gitleaks, trufflehog. |
| repo-health | /repo-health |
Audit repository hygiene: README, license, CI, branch/release hygiene, tracked secrets, ownership, and automation gaps. | Local git repo. Optional: gh, jq. |
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| incident-triage | /incident-triage |
Guided first 15 minutes of a production incident: timeline, blast radius, evidence gathering, mitigation suggestions. | Access to affected environment. |
More on the way — see Roadmap.
Reusable system prompts you can paste into any AI agent for common DevOps tasks:
| Prompt | What it does |
|---|---|
| incident-commander | Puts the AI in incident-commander mode: timeline, blast radius, action tracking, status updates. |
| postmortem-writer | Generates a blameless post-mortem from incident notes: timeline, root cause, impact, action items. |
| code-review-devops | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. |
| pr-description | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. |
| explain-like-a-senior | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. |
| runbook-from-incident | Converts incident notes or post-mortems into reusable runbooks with diagnosis, mitigation, escalation, and follow-up steps. |
Persistent instruction files that shape AI behavior. Copy into a project's .windsurf/rules/ or use as .windsurfrules:
| Rule file | What it does |
|---|---|
| devops-agent.windsurfrules | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context, GitOps awareness, multi-repo coordination. |
| terraform.windsurfrules | Terraform-specific: state safety, ForceNew attribute warnings, provider/module pinning, workspace safety, import workflow, prevent_destroy reminders. |
| kubernetes.windsurfrules | Kubernetes-specific: context verification, dry-run first, Helm safety, ArgoCD/GitOps awareness, secret handling, debugging approach, RBAC best practices. |
Standalone shell utilities referenced by workflows or useful on their own:
| Script | Usage |
|---|---|
| k8s-snapshot.sh | ./k8s-snapshot.sh [namespace|all] [output-dir] — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. |
| aws-whoami.sh | ./aws-whoami.sh [profile] — quick AWS identity check: caller, region, account alias, org, SSO role. |
| stale-branches.sh | ./stale-branches.sh [days] [--remote] — list git branches older than N days with last commit info. |
| validate-repo.sh | ./scripts/validate-repo.sh — validate workflow frontmatter, README links, script executability, and optional lint checks. |
Open the matching file in workflows/ and either:
- invoke it as a slash command if your agent supports workflow discovery from this repo,
- paste the relevant section into the agent's chat, or
- include the file as context and ask the agent to follow it.
Every workflow is just Markdown with shell commands. You can run the steps yourself in a terminal — no AI required.
devops-ai-workflows/
├── workflows/
│ ├── kubernetes/ # Kubernetes workflow definitions
│ ├── aws/ # AWS / cloud workflow definitions
│ ├── iac/ # Infrastructure as Code workflows
│ ├── cicd/ # CI/CD pipeline workflows
│ ├── containers/ # Container & image workflows
│ ├── security/ # Security & repo hygiene workflows
│ └── observability/ # Observability & incident workflows
├── prompts/ # Reusable LLM prompts
├── rules/ # Editor/agent rule files
├── scripts/ # Standalone shell helpers
├── CONTRIBUTING.md
├── LICENSE
└── README.md
Ideas I plan to add (PRs welcome):
AWS / cloud
-
/aws-eks-debug— bridge EKS + Kubernetes: node groups, OIDC, add-ons, IAM roles for service accounts -
/aws-rds-health— RDS/Aurora diagnostics: events, metrics, parameter groups, replication lag -
/aws-lambda-debug— Lambda diagnostics: errors, throttles, DLQ, VPC/ENI, CloudWatch logs -
/aws-ecs-service-debug— ECS/Fargate service rollout failures: task events, target group health, IAM roles
IaC
-
/terraform-state-debug— diagnose locks, drift, orphans -
/iac-secrets-scan— repo-wide hardcoded-secret sweep
Containers & CI/CD
-
/image-cve-triage— prioritise CVE scanner output by exploitability + fix availability -
/github-actions-review— security review of GitHub Actions workflow files
Observability & incident
-
/prometheus-query-helper— intent → PromQL with rationale -
/log-pattern-extract— cluster repeated errors out of a log dump -
/postmortem— blameless post-mortem from a transcript -
/runbook-from-incident— turn a resolved incident into a reusable runbook
Networking / database
-
/dns-debug— multi-resolver dig, propagation, DNSSEC -
/tls-cert-audit— chain inspection, expiry, weak ciphers across a list of hosts -
/postgres-health— bloat, long queries, replication lag, missing indexes -
/redis-health— memory pressure, slow log, persistence config, eviction patterns -
/db-migration-review— flag risky migration patterns
Security & repo hygiene
-
/cve-impact-assessment— given a CVE, check whether your stack is affected -
/repo-health— README, license, CI, branch protection, stale branches -
/dependency-upgrade-plan— group outdated deps by risk and suggest batching
See CONTRIBUTING.md. The short version:
- Add the canonical workflow to
workflows/<domain>/<name>.md. - Update the Available workflows table in this README.
- Keep workflows read-only by default. Anything mutating must be opt-in (e.g. a
DEEP=yesflag) and clearly flagged.
MIT — use freely, attribution appreciated but not required.