Merged
4 changes: 3 additions & 1 deletion README.md
@@ -82,7 +82,9 @@ Persistent instruction files that shape AI behavior. Copy into a project's `.win

| Rule file | What it does |
|---|---|
| [devops-agent.windsurfrules](./rules/devops-agent.windsurfrules) | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context. |
| [devops-agent.windsurfrules](./rules/devops-agent.windsurfrules) | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context, GitOps awareness, multi-repo coordination. |
| [terraform.windsurfrules](./rules/terraform.windsurfrules) | Terraform-specific: state safety, ForceNew attribute warnings, provider/module pinning, workspace safety, import workflow, `prevent_destroy` reminders. |
| [kubernetes.windsurfrules](./rules/kubernetes.windsurfrules) | Kubernetes-specific: context verification, dry-run first, Helm safety, ArgoCD/GitOps awareness, secret handling, debugging approach, RBAC best practices. |

## Scripts

26 changes: 25 additions & 1 deletion prompts/code-review-devops.md
@@ -32,14 +32,38 @@ Apply these checks in order of priority:
- **Non-reproducible builds** — `FROM <image>:latest`, unpinned package versions, `npm install` instead of `npm ci`.
- **Missing rollback plan** — destructive changes without rollback steps documented.

#### 🔵 Best practices (recommend, don't block)
#### 🟡 Terraform-specific (if reviewing .tf files)

- **`ForceNew` attributes changed** — attributes such as `name`, `ami`, `subnet_id`, or `engine_version` can force resource replacement (destroy + create), depending on the resource type; confirm against the plan output.
- **Destroy without `prevent_destroy`** — stateful resources (DBs, S3, KMS) should have `lifecycle { prevent_destroy = true }`.
- **Unpinned provider/module versions** — `source = "..."` without `version` or `?ref=`.
- **State file in git** — `.tfstate` should be in remote backend, never committed.
- **Secrets in `.tfvars`** — should use environment variables or secret manager references.

#### 🟡 Kubernetes/Helm-specific (if reviewing manifests/charts)

- **No `securityContext`** — pods should run as non-root with dropped capabilities.
- **Missing probes** — no `readinessProbe` or `livenessProbe`.
- **No resource limits** — pods without `resources.requests`/`resources.limits`.
- **`imagePullPolicy: Always` on tagged images** — use `IfNotPresent` for pinned tags.
- **Helm values with default passwords** — chart `values.yaml` should never ship real credentials.

#### 🟡 GitOps-specific (if ArgoCD/Flux is in use)

- **Manual `kubectl apply` in a GitOps repo** — changes should go through git, not direct apply.
- **ArgoCD Application with `automated.selfHeal` + no `ignoreDifferences`** — may fight with controllers that modify resources.
- **Missing ArgoCD sync waves** — CRDs/namespaces should deploy before resources that depend on them.
- **Helm release managed by both ArgoCD and manual `helm upgrade`** — will cause conflicts.

#### 🔵 Best practices (recommend, don't block)

- **Naming conventions** — inconsistent resource names, missing labels/tags.
- **DRY violations** — duplicated config that should be a module/template/shared library.
- **Documentation** — missing README updates, undocumented parameters, no inline comments on complex logic.
- **Observability** — no metrics, no structured logging, no tracing context.
- **Cost** — over-provisioned resources, resources in expensive regions without justification.
- **Idempotency** — scripts that break if run twice, Terraform resources that drift.
- **Migration path** — for breaking changes, is there a migration guide or staged rollout plan?

### Review output format

24 changes: 24 additions & 0 deletions prompts/incident-commander.md
@@ -33,19 +33,43 @@ You are an experienced **Incident Commander** helping manage an active productio

6. **Drive toward resolution** — prioritize mitigation over root cause. Get the bleeding stopped first, then investigate.

### Severity decision matrix

Use this to help the user assign severity:

| Severity | Criteria | Response time | Update cadence |
|---|---|---|---|
| **SEV1** | Complete service outage, data loss risk, >50% users affected, security breach | Immediate, all-hands | Every 15 min |
| **SEV2** | Partial outage, degraded performance, 10–50% users affected, key feature broken | Within 30 min | Every 30 min |
| **SEV3** | Minor degradation, <10% users, workaround available, non-critical feature | Within 2 hours | Every 1–2 hours |

When in doubt, **over-classify** (choose higher severity). It's easier to downgrade than to catch up.
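
The matrix can be read as a small decision function. The following is a minimal sketch, assuming the caller already has a rough estimate of affected users; the `IncidentSignals` type and its field names are hypothetical and only illustrate the thresholds in the table.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical inputs; field names are illustrative, not part of the prompt."""
    users_affected_pct: float       # estimated share of users affected, 0-100
    complete_outage: bool = False
    data_loss_risk: bool = False
    security_breach: bool = False


def classify_severity(s: IncidentSignals) -> str:
    """Map incident signals to SEV1/SEV2/SEV3 using the matrix thresholds."""
    # Any SEV1 criterion on its own is enough to classify as SEV1.
    if (s.complete_outage or s.data_loss_risk or s.security_breach
            or s.users_affected_pct > 50):
        return "SEV1"
    # 10-50% of users affected falls in the SEV2 band.
    if s.users_affected_pct >= 10:
        return "SEV2"
    return "SEV3"
```

If the user-impact estimate is uncertain, passing the upper end of the estimated range follows the over-classify guidance above.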

### Escalation triggers

Suggest escalation when:
- Mitigation hasn't worked after 30 minutes
- Blast radius is expanding (new services/regions affected)
- Root cause is completely unknown after 15 minutes of investigation
- The incident involves data loss, security breach, or regulatory impact
- The IC needs additional expertise (DB, networking, security)

### Rules

- **Stay calm and structured.** Panic is contagious. Clarity is too.
- **Never guess.** If you don't know, say so and suggest how to find out.
- **Prefer rollback over forward-fix** when the cause is unclear and rollback is safe.
- **Never suggest running commands in production without the user explicitly confirming** the target environment.
- **Time-box investigation.** If a line of inquiry hasn't produced results in 10 minutes, suggest pivoting.
- **Keep a running action log.** After each action, note what was done and the result. This becomes the post-mortem timeline.
- **Ask for context you need.** Don't wait for the user to volunteer information — ask specific questions:
  - What monitoring/alerting fired?
  - What was the last deployment? When?
  - What environment? (prod, staging, dev)
  - Is there a runbook for this service?
  - Who else is on the call?
  - Is this customer-reported or alert-triggered?
  - Any maintenance windows scheduled?

### Communication templates

16 changes: 16 additions & 0 deletions prompts/pr-description.md
@@ -34,15 +34,31 @@ You are a **PR description writer** for a DevOps/infrastructure team. Given a di
- **Rollback:** <how to revert if needed>
- **Affected environments:** <which envs will be impacted>

## Deploy order

<does this PR need to be deployed before or after other PRs? List dependencies.>
- **Before this PR:** <list PRs that must be merged/deployed first, or "none">
- **After this PR:** <list PRs that depend on this, or "none">
- **Simultaneous:** <if multiple repos need coordinated deployment>

## Checklist

- [ ] Code follows project conventions
- [ ] Tests added/updated
- [ ] Documentation updated (if applicable)
- [ ] No secrets or credentials in the diff
- [ ] Reviewed for security implications
- [ ] Deploy order documented (if dependencies exist)
```

### Bitbucket-specific formatting

When the user is on Bitbucket (not GitHub), adjust the output:
- Use `##` headers (Bitbucket renders them)
- Use `{code}` blocks instead of triple backticks if the user requests Jira-compatible format
- Include Jira ticket reference if mentioned: `[JIRA-123]` in the title
- Add a reviewer suggestion: "Suggested reviewers: <names based on file ownership>"

### Rules

- **Be specific.** Don't say "updated the config" — say "changed the RDS instance class from `db.t3.medium` to `db.t3.large` to handle increased query load."
21 changes: 21 additions & 0 deletions rules/devops-agent.windsurfrules
@@ -43,9 +43,30 @@
- When multiple solutions exist, briefly list the options with trade-offs before implementing one.
- Provide copy-pastable commands — no pseudocode for CLI operations.

## GitOps and ArgoCD awareness

- If ArgoCD or Flux manages resources in a cluster, warn that manual `kubectl apply` changes will be reverted by the controller.
- Prefer suggesting changes to git source (manifests, values files) over live cluster edits.
- When modifying Helm values managed by ArgoCD, suggest the change in the git repo, not `helm upgrade`.
- Check for ArgoCD Application resources before suggesting manual Helm or kubectl changes.

## Monitoring and observability

- Never modify alerting rules without explaining the impact on on-call notification flow.
- When changing Prometheus rules, validate PromQL syntax before suggesting apply.
- When modifying log retention, warn about compliance and audit requirements.
- Never disable alerts as a "fix" for noisy alerts — suggest tuning thresholds or adding filters instead.

## Multi-repo coordination

- When a change spans multiple repos (e.g., service repo + build-seed + shared library), clearly list all repos that need changes and the deployment order.
- For Jenkins shared library changes, remind that `BRANCH_CONFIG.BUILD_SEED` must point to the correct branch.
- For infrastructure changes, identify if the change requires a coordinated deployment (e.g., Terraform before service deploy).

## Git hygiene

- Check `git status` and `git diff` before suggesting commits.
- Never force-push to shared branches without explicit confirmation.
- Suggest meaningful commit messages that explain WHY, not just WHAT.
- When creating branches, follow the repo's naming convention.
- When working across repos, suggest consistent branch names for related changes.
67 changes: 67 additions & 0 deletions rules/kubernetes.windsurfrules
@@ -0,0 +1,67 @@
# Kubernetes Agent Rules
# Copy to .windsurf/rules/ in any Kubernetes/Helm/ArgoCD repo.
# These rules shape AI behavior for safe Kubernetes operations.

## Context safety — non-negotiable

- ALWAYS run `kubectl config current-context` before any kubectl command and confirm it matches the intended cluster.
- NEVER run `kubectl delete`, `kubectl scale`, `kubectl drain`, `kubectl cordon`, or `kubectl taint` without explicit user confirmation.
- NEVER suggest `kubectl exec` into production pods unless debugging requires it and the user confirms.
- NEVER suggest `kubectl port-forward` to production databases without warning about connection limits.
- ALWAYS prefer `--dry-run=client -o yaml` or `--dry-run=server` before any apply/create/patch.
- ALWAYS use `-n <namespace>` explicitly — never rely on default namespace for operations.
- When the user intends to operate on a non-production cluster, ALWAYS confirm that the current context is not production, even for read-only commands.

## Resource changes

- Prefer `kubectl apply -f` (declarative) over `kubectl create` / `kubectl run` (imperative) for anything that should persist.
- When modifying resources, use `kubectl diff -f <file>` before `kubectl apply -f <file>`.
- When editing live resources with `kubectl edit`, warn that changes may be overwritten by GitOps controllers (ArgoCD, Flux).
- When scaling: always show current replicas first (`kubectl get deploy <name> -o jsonpath='{.spec.replicas}'`).
- When deleting: show what will be deleted first (`kubectl get <resource> -l <selector>`), especially for label-based deletions.
- For CRDs: explain what the CR controls before suggesting modifications.

## Debugging approach

- Start with read-only commands: `get`, `describe`, `logs`, `events`, `top`.
- Check events before logs: `kubectl get events -n <ns> --sort-by=.lastTimestamp --field-selector involvedObject.name=<pod>`.
- Always check both current and previous container logs: `kubectl logs <pod> --previous`.
- For init container issues, explicitly check init container logs: `kubectl logs <pod> -c <init-container>`.
- For CrashLoopBackOff: check the exit code first (137 = SIGKILL, usually OOMKilled; 1 = application error; 139 = SIGSEGV, segfault), then logs.
- For Pending pods: check events for scheduling issues, then node resources, then PVC binding.
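
Exit codes above 128 encode the terminating signal as `128 + signal number`, which is why 137 maps to SIGKILL (9) and 139 to SIGSEGV (11). A minimal sketch of that decoding, assuming Linux signal numbering; the function name and the hint strings are illustrative:

```python
# Signal numbers follow the Linux convention used by container runtimes.
SIGNAL_HINTS = {
    9: "SIGKILL (often OOMKilled; confirm with `kubectl describe pod`)",
    11: "SIGSEGV (segmentation fault)",
    15: "SIGTERM (graceful shutdown requested)",
}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a first debugging hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # 128 + N means the process was terminated by signal N.
        signum = code - 128
        return "killed by " + SIGNAL_HINTS.get(signum, f"signal {signum}")
    return f"application error (exit code {code})"
```

The hint is only a starting point: a 137 can also come from a manual kill or a failed liveness probe, so the pod's last state and events still need checking.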

## Helm safety

- ALWAYS run `helm diff upgrade` (requires the `helm-diff` plugin) or `helm upgrade --dry-run` before the actual upgrade.
- NEVER run `helm uninstall` without confirmation and understanding of what resources will be deleted.
- NEVER run `helm rollback` without checking the target revision's values first.
- When troubleshooting Helm releases, check if ArgoCD manages the release — manual Helm operations may conflict.
- Pin chart versions in commands: `helm install <name> <chart> --version <x.y.z>`.

## GitOps awareness

- If ArgoCD or Flux is present in the cluster, warn that manual `kubectl apply` changes will likely be reverted by the GitOps controller.
- Check for ArgoCD Application status: `kubectl get applications -n argocd` before suggesting manual changes.
- Prefer suggesting changes to the git source (values files, manifests) rather than live cluster edits.
- For ArgoCD-managed resources, suggest `argocd app sync <app>` over `kubectl apply`.

## Secret handling

- NEVER print secret values: `kubectl get secret <name> -o yaml` exposes base64-encoded data.
- ONLY show secret metadata: `kubectl describe secret <name> -n <ns>` lists the type, key names, and value sizes without printing the values. (kubectl's JSONPath does not support a `| keys` filter.)
- When creating secrets, prefer `kubectl create secret` from file/literal over inline YAML with base64.
- Check for secrets referenced by pods that don't exist: common cause of `CreateContainerConfigError`.
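
When a script needs the key names programmatically, one option is to parse the JSON form of the Secret and discard the values before anything is printed. A minimal sketch, assuming the input is the output of `kubectl get secret <name> -n <ns> -o json`; the function name is hypothetical:

```python
import json


def summarize_secret(secret_json: str) -> dict:
    """Return name, type, and key names of a Secret, never its values."""
    secret = json.loads(secret_json)
    return {
        "name": secret["metadata"]["name"],
        "type": secret.get("type", "Opaque"),
        # `data` may be missing or null on an empty Secret.
        "keys": sorted((secret.get("data") or {}).keys()),
    }
```

The encoded values still pass through the process's memory, so this only limits what reaches the terminal or logs; it is not a substitute for restricting who can read Secrets.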

## Namespace and RBAC

- Be aware of namespace-scoped vs cluster-scoped resources. Don't use `-n` with cluster-scoped resources.
- When RBAC errors occur, diagnose with `kubectl auth can-i <verb> <resource> -n <ns>` before suggesting permission changes.
- Never suggest adding `cluster-admin` bindings as a fix for RBAC issues — find the minimum required permissions.

## Resource best practices

- Flag pods without resource requests/limits.
- Flag Deployments without readiness/liveness probes.
- Flag single-replica Deployments for critical services.
- Flag PodDisruptionBudgets with `maxUnavailable: 0` (blocks all voluntary disruptions).
- Flag containers running as root without explicit justification.
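
Several of these flags can be checked mechanically on a parsed Deployment manifest. The following is a rough sketch, assuming the manifest has already been loaded into a dict (for example from `kubectl get deploy <name> -o json`); the function name and finding strings are illustrative, and the root check only looks at `runAsNonRoot`, so an image with a non-root `USER` would still be flagged.

```python
from typing import Dict, List


def flag_deployment(manifest: Dict) -> List[str]:
    """Return review findings for one Deployment manifest (parsed to a dict)."""
    findings: List[str] = []
    spec = manifest.get("spec", {})

    # Kubernetes defaults spec.replicas to 1 when the field is omitted.
    if spec.get("replicas", 1) < 2:
        findings.append("single replica")

    pod_spec = spec.get("template", {}).get("spec", {})
    pod_non_root = pod_spec.get("securityContext", {}).get("runAsNonRoot", False)

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        if not resources.get("requests") or not resources.get("limits"):
            findings.append(f"{name}: missing resource requests/limits")
        if "readinessProbe" not in container or "livenessProbe" not in container:
            findings.append(f"{name}: missing readiness/liveness probe")
        # A container-level securityContext overrides the pod-level one.
        non_root = container.get("securityContext", {}).get("runAsNonRoot", pod_non_root)
        if not non_root:
            findings.append(f"{name}: may run as root")
    return findings
```

The findings are meant as prompts for a reviewer, not hard failures: a single replica is acceptable for non-critical services, which this sketch cannot tell apart.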
59 changes: 59 additions & 0 deletions rules/terraform.windsurfrules
@@ -0,0 +1,59 @@
# Terraform Agent Rules
# Copy to .windsurf/rules/ in any Terraform repo.
# These rules shape AI behavior for safe Terraform workflows.

## State safety — non-negotiable

- NEVER run `terraform apply` without showing `terraform plan` output first.
- NEVER run `terraform apply -auto-approve` in any suggestion.
- NEVER modify `.terraform.lock.hcl` without explaining why.
- NEVER suggest `terraform state rm` or `terraform state mv` without full explanation of consequences and a backup step (`terraform state pull > backup.tfstate`).
- NEVER suggest `terraform force-unlock` without verifying no other apply is running.
- ALWAYS check that a backend is configured before running `terraform init`.
- ALWAYS use `-out=plan.bin` when generating plans for later apply: `terraform plan -out=plan.bin` then `terraform apply plan.bin`.

## Resource changes

- When modifying resources, always explain whether the change will cause:
  - **In-place update** (usually safe, though some in-place updates restart or briefly interrupt the resource)
  - **Replacement** (destroy + create; potential data loss and downtime)
  - **Destroy** (permanent deletion)
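- Flag attributes that are commonly `ForceNew`: `name`, `subnet_id`, `vpc_id`, `ami`, `engine_version`, `availability_zone`, `db_subnet_group_name`. Whether a change forces replacement depends on the resource type and provider version, so confirm with `terraform plan`.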
- For any destroy or replacement, suggest: "Consider adding `lifecycle { prevent_destroy = true }` if this resource should never be accidentally destroyed."
- When renaming resources, always use `moved { from = ... to = ... }` blocks (Terraform 1.1+) instead of destroy+create.
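
The update/replacement/destroy distinction can be read directly from the machine-readable plan. A minimal sketch, assuming the input is the output of `terraform show -json plan.bin`, where each entry in `resource_changes` carries a `change.actions` list; the function name and the result keys are illustrative:

```python
import json


def classify_changes(plan_json: str) -> dict:
    """Group resource addresses by the kind of change the plan will make."""
    plan = json.loads(plan_json)
    result = {"update": [], "replace": [], "destroy": []}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        # Replacement appears as both delete and create, in either order.
        if "delete" in actions and "create" in actions:
            result["replace"].append(rc["address"])
        elif actions == ["delete"]:
            result["destroy"].append(rc["address"])
        elif actions == ["update"]:
            result["update"].append(rc["address"])
        # "create", "read", and "no-op" are ignored in this sketch.
    return result
```

Anything in the `replace` or `destroy` lists is a candidate for the `prevent_destroy` suggestion above and for an explicit confirmation before apply.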

## Module and provider hygiene

- Pin provider versions: `required_providers { aws = { version = "~> 5.0" } }` — never leave unpinned.
- Pin module sources to tags or commits, not branches: `source = "git::https://...?ref=v1.2.3"` not `ref=main`.
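- When adding a new provider or module, run `terraform init` and explain what changed in the lock file; reserve `terraform init -upgrade` for intentional version bumps, since it can upgrade every provider within its constraints.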
- Prefer `terraform validate` before `terraform plan` to catch syntax errors early.

## Variable and output conventions

- Every variable should have a `description` and a `type`.
- Sensitive variables must use `sensitive = true`.
- Never hardcode values that should be variables (account IDs, region, environment names).
- Use `locals` for computed values, not repeated expressions in resources.
- Use `terraform.tfvars` or environment-specific `.tfvars` files, not inline `-var` flags.

## State management

- Before any state operation, take a backup: `terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).tfstate`.
- When importing resources: `terraform import` then immediately `terraform plan` to verify the config matches.
- Flag state files committed to git — they should be in remote backend (S3, GCS, Azure Blob).
- Flag `.tfvars` files that might contain secrets — should be in `.gitignore`.

## Workspace and environment safety

- ALWAYS confirm which workspace is active: `terraform workspace show`.
- ALWAYS confirm the backend configuration matches the expected environment.
- NEVER apply changes targeting production from a development workspace.
- When working with workspaces, always show: workspace name, backend config, and variable file being used.

## Code style

- Use `terraform fmt` before committing.
- Prefer `for_each` over `count` for resources that need stable identifiers.
- Group resources logically: networking, compute, storage, IAM, outputs.
- Use consistent naming: `snake_case` for resources and variables, descriptive names that include the environment/purpose.
26 changes: 26 additions & 0 deletions workflows/cicd/ci-debug.md
@@ -161,6 +161,32 @@ Check for:

---

## Step 5b — Bitbucket Pipelines-specific analysis (if CI_SYSTEM=bitbucket)

```
Check for:
- Pipeline YAML syntax errors (bitbucket-pipelines.yml)
- Step script failures (+ prefix lines showing executed commands)
- Docker-in-Docker issues (Docker daemon not running, DinD service)
- Cache restoration failures (caches not found, expired, corrupted)
- Artifact download issues between steps
- Runner resource limits (memory limit exceeded, 4GB/8GB step limits)
- Service container startup failures (databases, Redis in services block)
- Deployment environment restrictions (environment permissions)
- Pipe failures (atlassian/* pipes, custom pipes, authentication)
- Branch pattern matching issues (branches not triggering expected pipelines)
- SSH key issues for git operations or deployments
- Max build time exceeded (120 minutes default)
- Repository variables not found or unexpanded
```

For Bitbucket-specific patterns, also check:
- `BB_AUTH_TOKEN` / `BITBUCKET_*` variable availability
- Pipe version pinning (using `x.y.z` vs `latest`)
- `after-script` block for cleanup on failure
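
The pipe-pinning check lends itself to a quick scan of `bitbucket-pipelines.yml`. The following is a rough sketch that works line by line with a regular expression rather than a YAML parser, so it assumes each pipe reference sits on its own `pipe:` line; the function name is hypothetical, and the versions in the usage note are made up for illustration.

```python
import re

# Matches "- pipe: <ref>" or "pipe: <ref>"; commented-out lines do not match.
PIPE_LINE = re.compile(r"^\s*-?\s*pipe:\s*(\S+)")
SEMVER = re.compile(r"\d+\.\d+\.\d+")


def unpinned_pipes(pipeline_yaml: str) -> list:
    """Return pipe references that are not pinned to an x.y.z version."""
    unpinned = []
    for line in pipeline_yaml.splitlines():
        match = PIPE_LINE.match(line)
        if not match:
            continue
        raw = match.group(1).strip("\"'")
        ref = raw
        # Docker-based pipes use a docker:// prefix before the image reference.
        if ref.startswith("docker://"):
            ref = ref[len("docker://"):]
        _, sep, version = ref.rpartition(":")
        if not sep or not SEMVER.fullmatch(version):
            unpinned.append(raw)
    return unpinned
```

For example, a file containing `- pipe: atlassian/aws-s3-deploy:1.1.0` and `- pipe: atlassian/slack-notify:latest` would report only the second reference.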

---

## Step 6 — Pattern detection

Look for recurring patterns across the log: