23seriy · May 18, 2026
diff --git a/README.md b/README.md
@@ -82,7 +82,9 @@ Persistent instruction files that shape AI behavior. Copy into a project's `.win
 
 | Rule file | What it does |
 |---|---|
-| [devops-agent.windsurfrules](./rules/devops-agent.windsurfrules) | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context. |
+| [devops-agent.windsurfrules](./rules/devops-agent.windsurfrules) | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context, GitOps awareness, multi-repo coordination. |
+| [terraform.windsurfrules](./rules/terraform.windsurfrules) | Terraform-specific: state safety, ForceNew attribute warnings, provider/module pinning, workspace safety, import workflow, `prevent_destroy` reminders. |
+| [kubernetes.windsurfrules](./rules/kubernetes.windsurfrules) | Kubernetes-specific: context verification, dry-run first, Helm safety, ArgoCD/GitOps awareness, secret handling, debugging approach, RBAC best practices. |
 
 ## Scripts
 

diff --git a/prompts/code-review-devops.md b/prompts/code-review-devops.md
@@ -32,14 +32,38 @@ Apply these checks in order of priority:
 - **Non-reproducible builds** — `FROM latest`, unpinned package versions, `npm install` instead of `npm ci`.
 - **Missing rollback plan** — destructive changes without rollback steps documented.
 
-#### 🔵 Best practices (recommend, don't block)
+#### 🟡 Terraform-specific (if reviewing .tf files)
+
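+- **`ForceNew` attributes changed** — attributes such as `name`, `ami`, `subnet_id`, or `engine_version` can force resource replacement (destroy + create), depending on the resource type; confirm by looking for `# forces replacement` in the plan output.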
+- **Destroy without `prevent_destroy`** — stateful resources (DBs, S3, KMS) should have `lifecycle { prevent_destroy = true }`.
+- **Unpinned provider/module versions** — `source = "..."` without `version` or `?ref=`.
+- **State file in git** — `.tfstate` should be in remote backend, never committed.
+- **Secrets in `.tfvars`** — should use environment variables or secret manager references.
+
+#### 🟡 Kubernetes/Helm-specific (if reviewing manifests/charts)
+
+- **No `securityContext`** — pods should run as non-root with dropped capabilities.
+- **Missing probes** — no `readinessProbe` or `livenessProbe`.
+- **No resource limits** — pods without `resources.requests`/`resources.limits`.
+- **`imagePullPolicy: Always` on tagged images** — use `IfNotPresent` for pinned tags.
+- **Helm values with default passwords** — chart `values.yaml` should never ship real credentials.
+
+#### 🟡 GitOps-specific (if ArgoCD/Flux is in use)
+
+- **Manual `kubectl apply` in a GitOps repo** — changes should go through git, not direct apply.
+- **ArgoCD Application with `automated.selfHeal` + no `ignoreDifferences`** — may fight with controllers that modify resources.
+- **Missing ArgoCD sync waves** — CRDs/namespaces should deploy before resources that depend on them.
+- **Helm release managed by both ArgoCD and manual `helm upgrade`** — will cause conflicts.
+
+#### 🔵 Best practices (recommend, don't block)
 
 - **Naming conventions** — inconsistent resource names, missing labels/tags.
 - **DRY violations** — duplicated config that should be a module/template/shared library.
 - **Documentation** — missing README updates, undocumented parameters, no inline comments on complex logic.
 - **Observability** — no metrics, no structured logging, no tracing context.
 - **Cost** — over-provisioned resources, resources in expensive regions without justification.
 - **Idempotency** — scripts that break if run twice, Terraform resources that drift.
+- **Migration path** — for breaking changes, is there a migration guide or staged rollout plan?
 
 ### Review output format
 

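Review note: the Kubernetes/Helm checks added to `prompts/code-review-devops.md` (no `securityContext`, missing probes, no resource limits) can be sketched as a small function over a parsed container spec. This is a minimal illustration, not part of the PR. The function name `review_container` and the finding strings are hypothetical, and it only inspects container-level fields, so a pod-level `securityContext` would need a separate check.

```python
def review_container(container: dict) -> list[str]:
    """Return review findings for one container spec (a parsed manifest dict)."""
    findings = []

    # securityContext: expect non-root and all capabilities dropped.
    security = container.get("securityContext") or {}
    if not security.get("runAsNonRoot"):
        findings.append("securityContext: runAsNonRoot is not set to true")
    dropped = (security.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        findings.append("securityContext: capabilities are not dropped")

    # Probes: both readiness and liveness should be declared.
    for probe in ("readinessProbe", "livenessProbe"):
        if probe not in container:
            findings.append(f"missing {probe}")

    # Resources: both requests and limits should be non-empty.
    resources = container.get("resources") or {}
    for key in ("requests", "limits"):
        if not resources.get(key):
            findings.append(f"missing resources.{key}")

    return findings
```

Calling `review_container({"name": "app"})` reports all six findings, while a container that sets every field returns an empty list.
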
diff --git a/prompts/incident-commander.md b/prompts/incident-commander.md
@@ -33,19 +33,43 @@ You are an experienced **Incident Commander** helping manage an active productio
 
 6. **Drive toward resolution** — prioritize mitigation over root cause. Get the bleeding stopped first, then investigate.
 
+### Severity decision matrix
+
+Use this to help the user assign severity:
+
+| Severity | Criteria | Response time | Update cadence |
+|---|---|---|---|
+| **SEV1** | Complete service outage, data loss risk, >50% users affected, security breach | Immediate, all-hands | Every 15 min |
+| **SEV2** | Partial outage, degraded performance, 10–50% users affected, key feature broken | Within 30 min | Every 30 min |
+| **SEV3** | Minor degradation, <10% users, workaround available, non-critical feature | Within 2 hours | Every 1–2 hours |
+
+When in doubt, **over-classify** (choose higher severity). It's easier to downgrade than to catch up.
+
+### Escalation triggers
+
+Suggest escalation when:
+- Mitigation hasn't worked after 30 minutes
+- Blast radius is expanding (new services/regions affected)
+- Root cause is completely unknown after 15 minutes of investigation
+- The incident involves data loss, security breach, or regulatory impact
+- The IC needs additional expertise (DB, networking, security)
+
 ### Rules
 
 - **Stay calm and structured.** Panic is contagious. Clarity is too.
 - **Never guess.** If you don't know, say so and suggest how to find out.
 - **Prefer rollback over forward-fix** when the cause is unclear and rollback is safe.
 - **Never suggest running commands in production without the user explicitly confirming** the target environment.
 - **Time-box investigation.** If a line of inquiry hasn't produced results in 10 minutes, suggest pivoting.
+- **Keep a running action log.** After each action, note what was done and the result. This becomes the post-mortem timeline.
 - **Ask for context you need.** Don't wait for the user to volunteer information — ask specific questions:
   - What monitoring/alerting fired?
   - What was the last deployment? When?
   - What environment? (prod, staging, dev)
   - Is there a runbook for this service?
   - Who else is on the call?
+  - Is this customer-reported or alert-triggered?
+  - Any maintenance windows scheduled?
 
 ### Communication templates
 

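Review note: the severity decision matrix added to `prompts/incident-commander.md` can be read as a simple decision function. The sketch below is one possible encoding under stated assumptions: the `Impact` fields and the function name `classify_severity` are hypothetical, "partial outage" and "key feature broken" are collapsed into a single flag, and exactly 10% and exactly 50% of users are treated as SEV2, matching the "10–50%" row.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    users_affected_pct: float
    complete_outage: bool = False
    data_loss_risk: bool = False
    security_breach: bool = False
    key_feature_broken: bool = False  # also stands in for "partial outage"


# Update cadence per severity, copied from the matrix.
UPDATE_CADENCE = {
    "SEV1": "every 15 min",
    "SEV2": "every 30 min",
    "SEV3": "every 1-2 hours",
}


def classify_severity(impact: Impact) -> str:
    """Pick the highest severity whose criteria match (over-classify when in doubt)."""
    if (impact.complete_outage or impact.data_loss_risk
            or impact.security_breach or impact.users_affected_pct > 50):
        return "SEV1"
    if impact.key_feature_broken or impact.users_affected_pct >= 10:
        return "SEV2"
    return "SEV3"
```

Checking SEV1 first mirrors the "over-classify" guidance: any single SEV1 criterion wins, even when the affected-user percentage is small.
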
diff --git a/prompts/pr-description.md b/prompts/pr-description.md
@@ -34,15 +34,31 @@ You are a **PR description writer** for a DevOps/infrastructure team. Given a di
 - **Rollback:** <how to revert if needed>
 - **Affected environments:** <which envs will be impacted>
 
+## Deploy order
+
+<does this PR need to be deployed before or after other PRs? List dependencies.>
+- **Before this PR:** <list PRs that must be merged/deployed first, or "none">
+- **After this PR:** <list PRs that depend on this, or "none">
+- **Simultaneous:** <if multiple repos need coordinated deployment>
+
 ## Checklist
 
 - [ ] Code follows project conventions
 - [ ] Tests added/updated
 - [ ] Documentation updated (if applicable)
 - [ ] No secrets or credentials in the diff
 - [ ] Reviewed for security implications
+- [ ] Deploy order documented (if dependencies exist)
 ```
 
+### Bitbucket-specific formatting
+
+When the user is on Bitbucket (not GitHub), adjust the output:
+- Use `##` headers (Bitbucket renders them)
+- Use `{code}` blocks instead of triple backticks if the user requests Jira-compatible format
+- Include Jira ticket reference if mentioned: `[JIRA-123]` in the title
+- Add reviewers suggestion: "Suggested reviewers: <names based on file ownership>"
+
 ### Rules
 
 - **Be specific.** Don't say "updated the config" — say "changed the RDS instance class from `db.t3.medium` to `db.t3.large` to handle increased query load."

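Review note: the "Deploy order" section added to `prompts/pr-description.md` describes a dependency graph between PRs. As a rough sketch, a valid deployment sequence can be derived with a topological sort from Python's standard `graphlib` module. The function name `deploy_order` and the PR labels in the usage note are hypothetical placeholders, not references to real PRs.

```python
from graphlib import TopologicalSorter


def deploy_order(before: dict[str, list[str]]) -> list[str]:
    """Return PRs in a deployable order.

    `before` maps each PR to the PRs that must be merged/deployed before it
    (the "Before this PR" field). Raises graphlib.CycleError if the
    dependencies are circular.
    """
    return list(TopologicalSorter(before).static_order())
```

For example, with `{"service#12": ["infra#7"], "infra#7": [], "dashboards#3": ["service#12"]}` the infrastructure PR is listed first and the dashboards PR last. A `CycleError` signals that the PRs need a coordinated ("Simultaneous") deployment or a restructured change.
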
diff --git a/rules/devops-agent.windsurfrules b/rules/devops-agent.windsurfrules
@@ -43,9 +43,30 @@
 - When multiple solutions exist, briefly list the options with trade-offs before implementing one.
 - Provide copy-pastable commands — no pseudocode for CLI operations.
 
+## GitOps and ArgoCD awareness
+
+- If ArgoCD or Flux manages resources in a cluster, warn that manual `kubectl apply` changes will be reverted by the controller.
+- Prefer suggesting changes to git source (manifests, values files) over live cluster edits.
+- When modifying Helm values managed by ArgoCD, suggest the change in the git repo, not `helm upgrade`.
+- Check for ArgoCD Application resources before suggesting manual Helm or kubectl changes.
+
+## Monitoring and observability
+
+- Never modify alerting rules without explaining the impact on on-call notification flow.
+- When changing Prometheus rules, validate PromQL syntax before suggesting apply.
+- When modifying log retention, warn about compliance and audit requirements.
+- Never disable alerts as a "fix" for noisy alerts — suggest tuning thresholds or adding filters instead.
+
+## Multi-repo coordination
+
+- When a change spans multiple repos (e.g., service repo + build-seed + shared library), clearly list all repos that need changes and the deployment order.
+- For Jenkins shared library changes, remind that `BRANCH_CONFIG.BUILD_SEED` must point to the correct branch.
+- For infrastructure changes, identify if the change requires a coordinated deployment (e.g., Terraform before service deploy).
+
 ## Git hygiene
 
 - Check `git status` and `git diff` before suggesting commits.
 - Never force-push to shared branches without explicit confirmation.
 - Suggest meaningful commit messages that explain WHY, not just WHAT.
 - When creating branches, follow the repo's naming convention.
+- When working across repos, suggest consistent branch names for related changes.
diff --git a/rules/kubernetes.windsurfrules b/rules/kubernetes.windsurfrules
@@ -0,0 +1,67 @@
+# Kubernetes Agent Rules
+# Copy to .windsurf/rules/ in any Kubernetes/Helm/ArgoCD repo.
+# These rules shape AI behavior for safe Kubernetes operations.
+
+## Context safety — non-negotiable
+
+- ALWAYS run `kubectl config current-context` before any kubectl command and confirm it matches the intended cluster.
+- NEVER run `kubectl delete`, `kubectl scale`, `kubectl drain`, `kubectl cordon`, or `kubectl taint` without explicit user confirmation.
+- NEVER suggest `kubectl exec` into production pods unless debugging requires it and the user confirms.
+- NEVER suggest `kubectl port-forward` to production databases without warning about connection limits.
+- ALWAYS prefer `--dry-run=client -o yaml` or `--dry-run=server` before any apply/create/patch.
+- ALWAYS use `-n <namespace>` explicitly — never rely on default namespace for operations.
+- When the user intends to operate on a non-production cluster, ALWAYS verify that the current context is not production, even before read-only commands.
+
+## Resource changes
+
+- Prefer `kubectl apply -f` (declarative) over `kubectl create` / `kubectl run` (imperative) for anything that should persist.
+- When modifying resources, use `kubectl diff -f <file>` before `kubectl apply -f <file>`.
+- When editing live resources with `kubectl edit`, warn that changes may be overwritten by GitOps controllers (ArgoCD, Flux).
+- When scaling: always show current replicas first (`kubectl get deploy <name> -o jsonpath='{.spec.replicas}'`).
+- When deleting: show what will be deleted first (`kubectl get <resource> -l <selector>`), especially for label-based deletions.
+- For CRDs: explain what the CR controls before suggesting modifications.
+
+## Debugging approach
+
+- Start with read-only commands: `get`, `describe`, `logs`, `events`, `top`.
+- Check events before logs: `kubectl get events -n <ns> --sort-by=.lastTimestamp --field-selector involvedObject.name=<pod>`.
+- Always check both current and previous container logs: `kubectl logs <pod> --previous`.
+- For init container issues, explicitly check init container logs: `kubectl logs <pod> -c <init-container>`.
+- For CrashLoopBackOff: check exit code first (137=SIGKILL, usually OOMKilled; 1=app error; 139=segfault), then logs.
+- For Pending pods: check events for scheduling issues, then node resources, then PVC binding.
+
+## Helm safety
+
+- ALWAYS run `helm diff upgrade` or `helm upgrade --dry-run` before actual upgrade.
+- NEVER run `helm uninstall` without confirmation and understanding of what resources will be deleted.
+- NEVER run `helm rollback` without checking the target revision's values first.
+- When troubleshooting Helm releases, check if ArgoCD manages the release — manual Helm operations may conflict.
+- Pin chart versions in commands: `helm install <name> <chart> --version <x.y.z>`.
+
+## GitOps awareness
+
+- If ArgoCD or Flux is present in the cluster, warn that manual `kubectl apply` changes will likely be reverted by the GitOps controller.
+- Check for ArgoCD Application status: `kubectl get applications -n argocd` before suggesting manual changes.
+- Prefer suggesting changes to the git source (values files, manifests) rather than live cluster edits.
+- For ArgoCD-managed resources, suggest `argocd app sync <app>` over `kubectl apply`.
+
+## Secret handling
+
+- NEVER print secret values: `kubectl get secret <name> -o yaml` exposes base64-encoded data.
+- ONLY show secret metadata: `kubectl describe secret <name>` lists the type and the key names with their byte sizes, not the values. (kubectl's JSONPath has no `keys` filter, so `{.data | keys}` does not work.)
+- When creating secrets, prefer `kubectl create secret` from file/literal over inline YAML with base64.
+- Check for secrets referenced by pods that don't exist: common cause of `CreateContainerConfigError`.
+
+## Namespace and RBAC
+
+- Be aware of namespace-scoped vs cluster-scoped resources. Don't use `-n` with cluster-scoped resources.
+- When RBAC errors occur, diagnose with `kubectl auth can-i <verb> <resource> -n <ns>` before suggesting permission changes.
+- Never suggest adding `cluster-admin` bindings as a fix for RBAC issues — find the minimum required permissions.
+
+## Resource best practices
+
+- Flag pods without resource requests/limits.
+- Flag Deployments without readiness/liveness probes.
+- Flag single-replica Deployments for critical services.
+- Flag PodDisruptionBudgets with `maxUnavailable: 0` (blocks all voluntary disruptions).
+- Flag containers running as root without explicit justification.
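
Review note: the exit-code triage in the "Debugging approach" section of `rules/kubernetes.windsurfrules` relies on the convention that exit codes above 128 mean "terminated by signal (code minus 128)". The sketch below decodes that with Python's standard `signal` module. The function name `describe_exit_code` and the hint strings are hypothetical, and the signal numbers assume a Linux host.

```python
import signal

# Short hints for the signals most often seen in CrashLoopBackOff triage.
SIGNAL_HINTS = {
    "SIGKILL": "often the OOM killer; check for reason OOMKilled",
    "SIGSEGV": "segmentation fault",
    "SIGTERM": "terminated, for example during a rollout or eviction",
}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short triage hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code} (unknown signal {code - 128})"
        hint = SIGNAL_HINTS.get(name, "terminated by signal")
        return f"{name} ({hint})"
    return f"application error (exit code {code})"
```

Exit code 137 decodes to `SIGKILL` (128 + 9) and 139 to `SIGSEGV` (128 + 11). A `SIGKILL` alone does not prove an OOM kill, so the container's last state should still be checked for the `OOMKilled` reason.
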
diff --git a/rules/terraform.windsurfrules b/rules/terraform.windsurfrules
@@ -0,0 +1,59 @@
+# Terraform Agent Rules
+# Copy to .windsurf/rules/ in any Terraform repo.
+# These rules shape AI behavior for safe Terraform workflows.
+
+## State safety — non-negotiable
+
+- NEVER run `terraform apply` without showing `terraform plan` output first.
+- NEVER run `terraform apply -auto-approve` in any suggestion.
+- NEVER modify `.terraform.lock.hcl` without explaining why.
+- NEVER suggest `terraform state rm` or `terraform state mv` without full explanation of consequences and a backup step (`terraform state pull > backup.tfstate`).
+- NEVER suggest `terraform force-unlock` without verifying no other apply is running.
+- ALWAYS check that a backend is configured before running `terraform init`.
+- ALWAYS use `-out=plan.bin` when generating plans for later apply: `terraform plan -out=plan.bin` then `terraform apply plan.bin`.
+
+## Resource changes
+
+- When modifying resources, always explain if the change will cause:
+  - **In-place update** (safe, no downtime)
+  - **Replacement** (destroy + create — potential data loss, downtime)
+  - **Destroy** (permanent deletion)
+- Flag attributes that are commonly `ForceNew` (the exact set depends on the resource type, so confirm in the plan output): `name`, `subnet_id`, `vpc_id`, `ami`, `engine_version`, `availability_zone`, `db_subnet_group_name`.
+- For any destroy or replacement, suggest: "Consider adding `lifecycle { prevent_destroy = true }` if this resource should never be accidentally destroyed."
+- When renaming resources, always use `moved { from = ... to = ... }` blocks instead of destroy+create.
+
+## Module and provider hygiene
+
+- Pin provider versions: `required_providers { aws = { version = "~> 5.0" } }` — never leave unpinned.
+- Pin module sources to tags or commits, not branches: `source = "git::https://...?ref=v1.2.3"` not `ref=main`.
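+- When adding a new provider or module, run `terraform init` and explain what changed in the lock file; reserve `terraform init -upgrade` for intentional version upgrades, since it can move existing providers to newer versions.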
+- Prefer `terraform validate` before `terraform plan` to catch syntax errors early.
+
+## Variable and output conventions
+
+- Every variable should have a `description` and a `type`.
+- Sensitive variables must use `sensitive = true`.
+- Never hardcode values that should be variables (account IDs, region, environment names).
+- Use `locals` for computed values, not repeated expressions in resources.
+- Use `terraform.tfvars` or environment-specific `.tfvars` files, not inline `-var` flags.
+
+## State management
+
+- Before any state operation, take a backup: `terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).tfstate`.
+- When importing resources: `terraform import` then immediately `terraform plan` to verify the config matches.
+- Flag state files committed to git — they should be in remote backend (S3, GCS, Azure Blob).
+- Flag `.tfvars` files that might contain secrets — should be in `.gitignore`.
+
+## Workspace and environment safety
+
+- ALWAYS confirm which workspace is active: `terraform workspace show`.
+- ALWAYS confirm the backend configuration matches the expected environment.
+- NEVER apply changes targeting production from a development workspace.
+- When working with workspaces, always show: workspace name, backend config, and variable file being used.
+
+## Code style
+
+- Use `terraform fmt` before committing.
+- Prefer `for_each` over `count` for resources that need stable identifiers.
+- Group resources logically: networking, compute, storage, IAM, outputs.
+- Use consistent naming: `snake_case` for resources and variables, descriptive names that include the environment/purpose.
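
Review note: the "Resource changes" section of `rules/terraform.windsurfrules` asks for every change to be classified as in-place update, replacement, or destroy. One way to sketch that is to read the `resource_changes` list from the JSON form of a saved plan (`terraform show -json plan.bin`), where each entry carries a `change.actions` list. The function names and the resource addresses in the usage note are hypothetical, and the input is assumed to be already parsed with `json.loads`.

```python
def classify_change(actions: list[str]) -> str:
    """Map a plan entry's `change.actions` list to a risk category."""
    acts = set(actions)
    if acts == {"create", "delete"}:
        # Either order (delete-then-create or create-then-delete) is a replacement.
        return "replacement"
    if acts == {"delete"}:
        return "destroy"
    if acts == {"update"}:
        return "in-place update"
    if acts == {"create"}:
        return "create"
    return "no change"  # covers "no-op" and "read"


def risky_changes(plan: dict) -> list[tuple[str, str]]:
    """List (address, category) for every replacement or destroy in the plan."""
    risky = []
    for rc in plan.get("resource_changes", []):
        category = classify_change(rc["change"]["actions"])
        if category in ("replacement", "destroy"):
            risky.append((rc["address"], category))
    return risky
```

For a plan that replaces `aws_db_instance.main`, updates `aws_s3_bucket.logs` in place, and deletes `aws_kms_key.old`, `risky_changes` returns only the first and last entries. Those are the ones where the `prevent_destroy` reminder applies.
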
diff --git a/workflows/cicd/ci-debug.md b/workflows/cicd/ci-debug.md
@@ -161,6 +161,32 @@ Check for:
 
 ---
 
+## Step 5b — Bitbucket Pipelines-specific analysis (if CI_SYSTEM=bitbucket)
+
+```
+Check for:
+- Pipeline YAML syntax errors (bitbucket-pipelines.yml)
+- Step script failures (+ prefix lines showing executed commands)
+- Docker-in-Docker issues (Docker daemon not running, DinD service)
+- Cache restoration failures (caches not found, expired, corrupted)
+- Artifact download issues between steps
+- Runner resource limits (memory limit exceeded, 4GB/8GB step limits)
+- Service container startup failures (databases, Redis in services block)
+- Deployment environment restrictions (environment permissions)
+- Pipe failures (atlassian/* pipes, custom pipes, authentication)
+- Branch pattern matching issues (branches not triggering expected pipelines)
+- SSH key issues for git operations or deployments
+- Max build time exceeded (120 minutes default)
+- Repository variables not found or unexpanded
+```
+
+For Bitbucket-specific patterns, also check:
+- `BB_AUTH_TOKEN` / `BITBUCKET_*` variable availability
+- Pipe version pinning (using `x.y.z` vs `latest`)
+- `after-script` block for cleanup on failure
+
+---
+
 ## Step 6 — Pattern detection
 
 Look for recurring patterns across the log: