diff --git a/README.md b/README.md
index f53ff63..fed27bd 100644
--- a/README.md
+++ b/README.md
@@ -82,7 +82,9 @@ Persistent instruction files that shape AI behavior. Copy into a project's `.win

 | Rule file | What it does |
 |---|---|
-| [devops-agent.windsurfrules](./rules/devops-agent.windsurfrules) | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context. |
+| [devops-agent.windsurfrules](./rules/devops-agent.windsurfrules) | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context, GitOps awareness, multi-repo coordination. |
+| [terraform.windsurfrules](./rules/terraform.windsurfrules) | Terraform-specific: state safety, ForceNew attribute warnings, provider/module pinning, workspace safety, import workflow, `prevent_destroy` reminders. |
+| [kubernetes.windsurfrules](./rules/kubernetes.windsurfrules) | Kubernetes-specific: context verification, dry-run first, Helm safety, ArgoCD/GitOps awareness, secret handling, debugging approach, RBAC best practices. |

 ## Scripts

diff --git a/prompts/code-review-devops.md b/prompts/code-review-devops.md
index baaffa5..11141f9 100644
--- a/prompts/code-review-devops.md
+++ b/prompts/code-review-devops.md
@@ -32,7 +32,30 @@ Apply these checks in order of priority:
 - **Non-reproducible builds** — `FROM latest`, unpinned package versions, `npm install` instead of `npm ci`.
 - **Missing rollback plan** — destructive changes without rollback steps documented.

-#### 🔵 Best practices (recommend, don't block)
+#### 🟡 Terraform-specific (if reviewing .tf files)
+
+- **`ForceNew` attributes changed** — `name`, `ami`, `subnet_id`, `engine_version` cause resource replacement (destroy + create).
+- **Destroy without `prevent_destroy`** — stateful resources (DBs, S3, KMS) should have `lifecycle { prevent_destroy = true }`.
+- **Unpinned provider/module versions** — `source = "..."` without `version` or `?ref=`.
+- **State file in git** — `.tfstate` should be in remote backend, never committed.
+- **Secrets in `.tfvars`** — should use environment variables or secret manager references.
+
+#### 🟡 Kubernetes/Helm-specific (if reviewing manifests/charts)
+
+- **No `securityContext`** — pods should run as non-root with dropped capabilities.
+- **Missing probes** — no `readinessProbe` or `livenessProbe`.
+- **No resource limits** — pods without `resources.requests`/`resources.limits`.
+- **`imagePullPolicy: Always` on tagged images** — use `IfNotPresent` for pinned tags.
+- **Helm values with default passwords** — chart `values.yaml` should never ship real credentials.
+
+#### 🟡 GitOps-specific (if ArgoCD/Flux is in use)
+
+- **Manual `kubectl apply` in a GitOps repo** — changes should go through git, not direct apply.
+- **ArgoCD Application with `automated.selfHeal` + no `ignoreDifferences`** — may fight with controllers that modify resources.
+- **Missing ArgoCD sync waves** — CRDs/namespaces should deploy before resources that depend on them.
+- **Helm release managed by both ArgoCD and manual `helm upgrade`** — will cause conflicts.
+
+#### 🔵 Best practices (recommend, don't block)

 - **Naming conventions** — inconsistent resource names, missing labels/tags.
 - **DRY violations** — duplicated config that should be a module/template/shared library.
@@ -40,6 +63,7 @@ Apply these checks in order of priority:
 - **Observability** — no metrics, no structured logging, no tracing context.
 - **Cost** — over-provisioned resources, resources in expensive regions without justification.
 - **Idempotency** — scripts that break if run twice, Terraform resources that drift.
+- **Migration path** — for breaking changes, is there a migration guide or staged rollout plan?
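Review note: the unpinned-module check in the Terraform-specific list above can be approximated mechanically. The following is a minimal Python sketch, not part of this PR, assuming module sources are git URLs with an optional `?ref=` query. The helper name `is_pinned_source` and the `BRANCH_LIKE` set are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical set of ref values treated as moving branches rather than pins.
BRANCH_LIKE = {"main", "master", "develop", "HEAD"}


def is_pinned_source(source: str) -> bool:
    """Return True if a git module source has a ?ref= that is not a branch name."""
    # Terraform git sources look like git::https://host/repo.git?ref=v1.2.3
    url = source.removeprefix("git::")
    refs = parse_qs(urlparse(url).query).get("ref", [])
    if not refs:
        return False  # no ref at all: follows the default branch
    return refs[0] not in BRANCH_LIKE


print(is_pinned_source("git::https://example.com/org/mod.git?ref=v1.2.3"))
print(is_pinned_source("git::https://example.com/org/mod.git?ref=main"))
```

A string comparison cannot tell a tag from an unusually named branch, so a check like this only narrows what a reviewer needs to read.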

 ### Review output format

diff --git a/prompts/incident-commander.md b/prompts/incident-commander.md
index 0eb095c..1d56133 100644
--- a/prompts/incident-commander.md
+++ b/prompts/incident-commander.md
@@ -33,6 +33,27 @@ You are an experienced **Incident Commander** helping manage an active productio

 6. **Drive toward resolution** — prioritize mitigation over root cause. Get the bleeding stopped first, then investigate.

+### Severity decision matrix
+
+Use this to help the user assign severity:
+
+| Severity | Criteria | Response time | Update cadence |
+|---|---|---|---|
+| **SEV1** | Complete service outage, data loss risk, >50% users affected, security breach | Immediate, all-hands | Every 15 min |
+| **SEV2** | Partial outage, degraded performance, 10–50% users affected, key feature broken | Within 30 min | Every 30 min |
+| **SEV3** | Minor degradation, <10% users, workaround available, non-critical feature | Within 2 hours | Every 1–2 hours |
+
+When in doubt, **over-classify** (choose higher severity). It's easier to downgrade than to catch up.
+
+### Escalation triggers
+
+Suggest escalation when:
+- Mitigation hasn't worked after 30 minutes
+- Blast radius is expanding (new services/regions affected)
+- Root cause is completely unknown after 15 minutes of investigation
+- The incident involves data loss, security breach, or regulatory impact
+- The IC needs additional expertise (DB, networking, security)
+
 ### Rules

 - **Stay calm and structured.** Panic is contagious. Clarity is too.
@@ -40,12 +61,15 @@ You are an experienced **Incident Commander** helping manage an active productio
 - **Prefer rollback over forward-fix** when the cause is unclear and rollback is safe.
 - **Never suggest running commands in production without the user explicitly confirming** the target environment.
 - **Time-box investigation.** If a line of inquiry hasn't produced results in 10 minutes, suggest pivoting.
+- **Keep a running action log.** After each action, note what was done and the result. This becomes the post-mortem timeline.
 - **Ask for context you need.** Don't wait for the user to volunteer information — ask specific questions:
   - What monitoring/alerting fired?
   - What was the last deployment? When?
   - What environment? (prod, staging, dev)
   - Is there a runbook for this service?
   - Who else is on the call?
+  - Is this customer-reported or alert-triggered?
+  - Any maintenance windows scheduled?

 ### Communication templates

diff --git a/prompts/pr-description.md b/prompts/pr-description.md
index a7a47e8..4467eaf 100644
--- a/prompts/pr-description.md
+++ b/prompts/pr-description.md
@@ -34,6 +34,13 @@ You are a **PR description writer** for a DevOps/infrastructure team. Given a di
 - **Rollback:**
 - **Affected environments:**

+## Deploy order
+
+
+- **Before this PR:**
+- **After this PR:**
+- **Simultaneous:**
+
 ## Checklist

 - [ ] Code follows project conventions
@@ -41,8 +48,17 @@ You are a **PR description writer** for a DevOps/infrastructure team. Given a di
 - [ ] Documentation updated (if applicable)
 - [ ] No secrets or credentials in the diff
 - [ ] Reviewed for security implications
+- [ ] Deploy order documented (if dependencies exist)
 ```

+### Bitbucket-specific formatting
+
+When the user is on Bitbucket (not GitHub), adjust the output:
+- Use `##` headers (Bitbucket renders them)
+- Use `{code}` blocks instead of triple backticks if the user requests Jira-compatible format
+- Include Jira ticket reference if mentioned: `[JIRA-123]` in the title
+- Add reviewers suggestion: "Suggested reviewers: "
+
 ### Rules

 - **Be specific.** Don't say "updated the config" — say "changed the RDS instance class from `db.t3.medium` to `db.t3.large` to handle increased query load."
diff --git a/rules/devops-agent.windsurfrules b/rules/devops-agent.windsurfrules
index 94e6d56..313bed3 100644
--- a/rules/devops-agent.windsurfrules
+++ b/rules/devops-agent.windsurfrules
@@ -43,9 +43,30 @@
 - When multiple solutions exist, briefly list the options with trade-offs before implementing one.
 - Provide copy-pastable commands — no pseudocode for CLI operations.

+## GitOps and ArgoCD awareness
+
+- If ArgoCD or Flux manages resources in a cluster, warn that manual `kubectl apply` changes will be reverted by the controller.
+- Prefer suggesting changes to git source (manifests, values files) over live cluster edits.
+- When modifying Helm values managed by ArgoCD, suggest the change in the git repo, not `helm upgrade`.
+- Check for ArgoCD Application resources before suggesting manual Helm or kubectl changes.
+
+## Monitoring and observability
+
+- Never modify alerting rules without explaining the impact on on-call notification flow.
+- When changing Prometheus rules, validate PromQL syntax before suggesting apply.
+- When modifying log retention, warn about compliance and audit requirements.
+- Never disable alerts as a "fix" for noisy alerts — suggest tuning thresholds or adding filters instead.
+
+## Multi-repo coordination
+
+- When a change spans multiple repos (e.g., service repo + build-seed + shared library), clearly list all repos that need changes and the deployment order.
+- For Jenkins shared library changes, remind that `BRANCH_CONFIG.BUILD_SEED` must point to the correct branch.
+- For infrastructure changes, identify if the change requires a coordinated deployment (e.g., Terraform before service deploy).
+
 ## Git hygiene

 - Check `git status` and `git diff` before suggesting commits.
 - Never force-push to shared branches without explicit confirmation.
 - Suggest meaningful commit messages that explain WHY, not just WHAT.
 - When creating branches, follow the repo's naming convention.
+- When working across repos, suggest consistent branch names for related changes.
diff --git a/rules/kubernetes.windsurfrules b/rules/kubernetes.windsurfrules
new file mode 100644
index 0000000..262daa7
--- /dev/null
+++ b/rules/kubernetes.windsurfrules
@@ -0,0 +1,67 @@
+# Kubernetes Agent Rules
+# Copy to .windsurf/rules/ in any Kubernetes/Helm/ArgoCD repo.
+# These rules shape AI behavior for safe Kubernetes operations.
+
+## Context safety — non-negotiable
+
+- ALWAYS run `kubectl config current-context` before any kubectl command and confirm it matches the intended cluster.
+- NEVER run `kubectl delete`, `kubectl scale`, `kubectl drain`, `kubectl cordon`, or `kubectl taint` without explicit user confirmation.
+- NEVER suggest `kubectl exec` into production pods unless debugging requires it and the user confirms.
+- NEVER suggest `kubectl port-forward` to production databases without warning about connection limits.
+- ALWAYS prefer `--dry-run=client -o yaml` or `--dry-run=server` before any apply/create/patch.
+- ALWAYS use `-n ` explicitly — never rely on default namespace for operations.
+- ALWAYS double-check the context is not production when running even read-only commands, if the user's intent is to operate on a non-production cluster.
+
+## Resource changes
+
+- Prefer `kubectl apply -f` (declarative) over `kubectl create` / `kubectl run` (imperative) for anything that should persist.
+- When modifying resources, use `kubectl diff -f ` before `kubectl apply -f `.
+- When editing live resources with `kubectl edit`, warn that changes may be overwritten by GitOps controllers (ArgoCD, Flux).
+- When scaling: always show current replicas first (`kubectl get deploy -o jsonpath='{.spec.replicas}'`).
+- When deleting: show what will be deleted first (`kubectl get -l `), especially for label-based deletions.
+- For CRDs: explain what the CR controls before suggesting modifications.
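Review note: the confirmation rule under "Context safety" above could be backed by a small guard in tooling that wraps the agent. The following is a minimal Python sketch, not part of this PR, assuming each command arrives as a single string. `needs_confirmation` is a hypothetical helper; the verb list is copied from the rule.

```python
import shlex

# Verbs the rules above say must never run without explicit user confirmation.
CONFIRM_VERBS = {"delete", "scale", "drain", "cordon", "taint"}


def needs_confirmation(command: str) -> bool:
    """Return True if a kubectl command contains a verb that requires confirmation."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "kubectl":
        return False
    # Match the verb anywhere after `kubectl`, so flags placed before it
    # (e.g. `kubectl -n staging delete ...`) do not hide it. This errs on
    # the side of asking: a resource literally named "delete" also matches.
    return any(token in CONFIRM_VERBS for token in tokens[1:])


print(needs_confirmation("kubectl -n staging scale deploy web --replicas=3"))
print(needs_confirmation("kubectl get pods -n staging"))
```

Matching loosely is a deliberate choice here: a false positive costs one extra confirmation prompt, while a false negative lets a destructive command through.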
+
+## Debugging approach
+
+- Start with read-only commands: `get`, `describe`, `logs`, `events`, `top`.
+- Check events before logs: `kubectl get events -n --sort-by=.lastTimestamp --field-selector involvedObject.name=`.
+- Always check both current and previous container logs: `kubectl logs --previous`.
+- For init container issues, explicitly check init container logs: `kubectl logs -c `.
+- For CrashLoopBackOff: check exit code first (137=OOM, 1=app error, 139=segfault), then logs.
+- For Pending pods: check events for scheduling issues, then node resources, then PVC binding.
+
+## Helm safety
+
+- ALWAYS run `helm diff upgrade` or `helm upgrade --dry-run` before actual upgrade.
+- NEVER run `helm uninstall` without confirmation and understanding of what resources will be deleted.
+- NEVER run `helm rollback` without checking the target revision's values first.
+- When troubleshooting Helm releases, check if ArgoCD manages the release — manual Helm operations may conflict.
+- Pin chart versions in commands: `helm install --version `.
+
+## GitOps awareness
+
+- If ArgoCD or Flux is present in the cluster, warn that manual `kubectl apply` changes will likely be reverted by the GitOps controller.
+- Check for ArgoCD Application status: `kubectl get applications -n argocd` before suggesting manual changes.
+- Prefer suggesting changes to the git source (values files, manifests) rather than live cluster edits.
+- For ArgoCD-managed resources, suggest `argocd app sync ` over `kubectl apply`.
+
+## Secret handling
+
+- NEVER print secret values: `kubectl get secret -o yaml` exposes base64-encoded data.
+- ONLY show secret metadata: `kubectl get secret -o json | jq '{name: .metadata.name, type: .type, keys: (.data | keys)}'` (kubectl's JSONPath has no `keys` function), or `kubectl describe secret`, which lists key names and sizes but not values.
+- When creating secrets, prefer `kubectl create secret` from file/literal over inline YAML with base64.
+- Check for secrets referenced by pods that don't exist: common cause of `CreateContainerConfigError`.
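Review note: the metadata-only rule under "Secret handling" above can also be applied after the fact, by reducing the JSON from `kubectl get secret -o json` before anything is displayed. The following is a minimal Python sketch, not part of this PR. `secret_summary` is a hypothetical helper and the sample Secret is made up.

```python
import json


def secret_summary(secret_json: str) -> dict:
    """Reduce a Secret manifest (JSON text) to its name, type and key names."""
    secret = json.loads(secret_json)
    return {
        "name": secret["metadata"]["name"],
        "type": secret.get("type", "Opaque"),
        # Keep only the key names; the base64-encoded values are dropped.
        "keys": sorted(secret.get("data") or {}),
    }


# Hand-written sample, not real cluster output.
sample = json.dumps({
    "kind": "Secret",
    "metadata": {"name": "db-credentials"},
    "type": "Opaque",
    "data": {"password": "czNjcjN0", "username": "YWRtaW4="},
})
print(secret_summary(sample))
```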
+
+## Namespace and RBAC
+
+- Be aware of namespace-scoped vs cluster-scoped resources. Don't use `-n` with cluster-scoped resources.
+- When RBAC errors occur, diagnose with `kubectl auth can-i -n ` before suggesting permission changes.
+- Never suggest adding `cluster-admin` bindings as a fix for RBAC issues — find the minimum required permissions.
+
+## Resource best practices
+
+- Flag pods without resource requests/limits.
+- Flag Deployments without readiness/liveness probes.
+- Flag single-replica Deployments for critical services.
+- Flag PodDisruptionBudgets with `maxUnavailable: 0` (blocks all voluntary disruptions).
+- Flag containers running as root without explicit justification.
diff --git a/rules/terraform.windsurfrules b/rules/terraform.windsurfrules
new file mode 100644
index 0000000..3c6d682
--- /dev/null
+++ b/rules/terraform.windsurfrules
@@ -0,0 +1,59 @@
+# Terraform Agent Rules
+# Copy to .windsurf/rules/ in any Terraform repo.
+# These rules shape AI behavior for safe Terraform workflows.
+
+## State safety — non-negotiable
+
+- NEVER run `terraform apply` without showing `terraform plan` output first.
+- NEVER run `terraform apply -auto-approve` in any suggestion.
+- NEVER modify `.terraform.lock.hcl` without explaining why.
+- NEVER suggest `terraform state rm` or `terraform state mv` without full explanation of consequences and a backup step (`terraform state pull > backup.tfstate`).
+- NEVER suggest `terraform force-unlock` without verifying no other apply is running.
+- ALWAYS check if backend is configured before running init (`terraform init`).
+- ALWAYS use `-out=plan.bin` when generating plans for later apply: `terraform plan -out=plan.bin` then `terraform apply plan.bin`.
+
+## Resource changes
+
+- When modifying resources, always explain if the change will cause:
+  - **In-place update** (safe, no downtime)
+  - **Replacement** (destroy + create — potential data loss, downtime)
+  - **Destroy** (permanent deletion)
+- Flag `ForceNew` attributes: `name`, `subnet_id`, `vpc_id`, `ami`, `engine_version`, `availability_zone`, `db_subnet_group_name`.
+- For any destroy or replacement, suggest: "Consider adding `lifecycle { prevent_destroy = true }` if this resource should never be accidentally destroyed."
+- When renaming resources, always use `moved { from = ... to = ... }` blocks instead of destroy+create.
+
+## Module and provider hygiene
+
+- Pin provider versions: `required_providers { aws = { version = "~> 5.0" } }` — never leave unpinned.
+- Pin module sources to tags or commits, not branches: `source = "git::https://...?ref=v1.2.3"` not `ref=main`.
+- When adding a new provider or module, run `terraform init -upgrade` and explain what changed in the lock file.
+- Prefer `terraform validate` before `terraform plan` to catch syntax errors early.
+
+## Variable and output conventions
+
+- Every variable should have a `description` and a `type`.
+- Sensitive variables must use `sensitive = true`.
+- Never hardcode values that should be variables (account IDs, region, environment names).
+- Use `locals` for computed values, not repeated expressions in resources.
+- Use `terraform.tfvars` or environment-specific `.tfvars` files, not inline `-var` flags.
+
+## State management
+
+- Before any state operation, take a backup: `terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).tfstate`.
+- When importing resources: `terraform import` then immediately `terraform plan` to verify the config matches.
+- Flag state files committed to git — they should be in remote backend (S3, GCS, Azure Blob).
+- Flag `.tfvars` files that might contain secrets — should be in `.gitignore`.
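Review note: the three outcomes listed under "Resource changes" above (in-place update, replacement, destroy) correspond to the `actions` list that `terraform show -json plan.bin` reports for each entry in `resource_changes`. The following is a minimal Python sketch, not part of this PR. `classify` and `risky_changes` are hypothetical helpers, and the sample plan is hand-written rather than real Terraform output.

```python
import json


def classify(actions: list[str]) -> str:
    """Map a resource change's `actions` list to the categories used in the rules."""
    if set(actions) == {"delete", "create"}:
        return "replacement"  # covers both delete-then-create and create-then-delete
    if actions == ["delete"]:
        return "destroy"
    if actions == ["update"]:
        return "in-place update"
    return "other"  # create, read, no-op


def risky_changes(plan_json: str) -> dict[str, str]:
    """Return {resource address: category} for every replacement or destroy."""
    plan = json.loads(plan_json)
    result = {}
    for change in plan.get("resource_changes", []):
        category = classify(change["change"]["actions"])
        if category in ("replacement", "destroy"):
            result[change["address"]] = category
    return result


# Hand-written sample in the shape of the JSON plan representation.
sample_plan = json.dumps({"resource_changes": [
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    {"address": "aws_kms_key.old", "change": {"actions": ["delete"]}},
]})
print(risky_changes(sample_plan))
```

Reading the plan's own `actions` avoids maintaining a per-provider list of `ForceNew` attributes, which varies by resource type and provider version.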
+
+## Workspace and environment safety
+
+- ALWAYS confirm which workspace is active: `terraform workspace show`.
+- ALWAYS confirm the backend configuration matches the expected environment.
+- NEVER apply changes targeting production from a development workspace.
+- When working with workspaces, always show: workspace name, backend config, and variable file being used.
+
+## Code style
+
+- Use `terraform fmt` before committing.
+- Prefer `for_each` over `count` for resources that need stable identifiers.
+- Group resources logically: networking, compute, storage, IAM, outputs.
+- Use consistent naming: `snake_case` for resources and variables, descriptive names that include the environment/purpose.
diff --git a/workflows/cicd/ci-debug.md b/workflows/cicd/ci-debug.md
index 5804cc1..c10a9c3 100644
--- a/workflows/cicd/ci-debug.md
+++ b/workflows/cicd/ci-debug.md
@@ -161,6 +161,32 @@ Check for:

 ---

+## Step 5b — Bitbucket Pipelines-specific analysis (if CI_SYSTEM=bitbucket)
+
+```
+Check for:
+- Pipeline YAML syntax errors (bitbucket-pipelines.yml)
+- Step script failures (+ prefix lines showing executed commands)
+- Docker-in-Docker issues (Docker daemon not running, DinD service)
+- Cache restoration failures (caches not found, expired, corrupted)
+- Artifact download issues between steps
+- Runner resource limits (memory limit exceeded, 4GB/8GB step limits)
+- Service container startup failures (databases, Redis in services block)
+- Deployment environment restrictions (environment permissions)
+- Pipe failures (atlassian/* pipes, custom pipes, authentication)
+- Branch pattern matching issues (branches not triggering expected pipelines)
+- SSH key issues for git operations or deployments
+- Max build time exceeded (120 minutes default)
+- Repository variables not found or unexpanded
+```
+
+For Bitbucket-specific patterns, also check:
+- `BB_AUTH_TOKEN` / `BITBUCKET_*` variable availability
+- Pipe version pinning (using `x.y.z` vs `latest`)
+- `after-script` block for cleanup on failure
+
+---
+
 ## Step 6 — Pattern detection

 Look for recurring patterns across the log: