From 216437ffb806495dba22b6847a1bb7cb98ee6a23 Mon Sep 17 00:00:00 2001
From: Esther
Date: Fri, 26 Jun 2026 15:40:17 +0100
Subject: [PATCH] Add incident postmortem template and publication workflow

Introduce postmortem templates, publication playbook, incidents archive,
validation scripts, and CI workflow definition for maintainers. Fixes broken
template links referenced by runbooks and incident response documentation.

Closes #769

Co-authored-by: Cursor
---
 CHANGELOG.md                                  |   1 +
 README.md                                     |  10 ++
 docs/ci/postmortem-docs.workflow.yml          |  26 ++++
 docs/incident_response_runbook.md             |   4 +-
 docs/incidents/README.md                      |  26 ++++
 docs/incidents/drafts/.gitkeep                |   1 +
 docs/postmortem-playbook.md                   | 138 ++++++++++++++++++
 .../ISSUE_769_IMPLEMENTATION_SUMMARY.md       |  86 +++++++++++
 docs/runbooks/QUICK_REFERENCE.md              |   4 +-
 docs/runbooks/README.md                       |   2 +
 docs/runbooks/templates/dr-test-report.md     |  57 ++++++++
 docs/runbooks/templates/incident-report.md    |  57 ++++++++
 docs/runbooks/templates/post-mortem.md        |  77 ++++++++++
 scripts/new-postmortem.sh                     |  32 ++++
 scripts/validate-postmortem.sh                |  84 +++++++++++
 15 files changed, 602 insertions(+), 3 deletions(-)
 create mode 100644 docs/ci/postmortem-docs.workflow.yml
 create mode 100644 docs/incidents/README.md
 create mode 100644 docs/incidents/drafts/.gitkeep
 create mode 100644 docs/postmortem-playbook.md
 create mode 100644 docs/runbooks/ISSUE_769_IMPLEMENTATION_SUMMARY.md
 create mode 100644 docs/runbooks/templates/dr-test-report.md
 create mode 100644 docs/runbooks/templates/incident-report.md
 create mode 100644 docs/runbooks/templates/post-mortem.md
 create mode 100644 scripts/new-postmortem.sh
 create mode 100644 scripts/validate-postmortem.sh

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 033d4979..654b550a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,7 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - Vault performance dynamic date filter
 
 ### Documentation
+- Add incident postmortem templates, publication playbook, and CI validation workflow (#769)
 - Add release notes playbook and changelog curation guidelines (#618)
 - Add API versioning and deprecation policy with sunset windows, migration guide, and breaking-change classification (#610)
 
diff --git a/README.md b/README.md
index ae0a8479..c540d30f 100644
--- a/README.md
+++ b/README.md
@@ -163,6 +163,16 @@ YieldVault has comprehensive disaster recovery procedures to ensure system resil
 - [Disaster Recovery Runbooks Overview](./docs/runbooks/README.md)
 - [Replay and State Recovery Procedures](./docs/runbooks/REPLAY_PROCEDURES.md)
 
+## Incident Postmortems
+
+YieldVault documents significant incidents with blameless postmortems and tracked action items:
+
+- **Templates:** [Post-mortem](./docs/runbooks/templates/post-mortem.md), [Incident Report](./docs/runbooks/templates/incident-report.md)
+- **Publication workflow:** [Postmortem Playbook](./docs/postmortem-playbook.md)
+- **Published reports:** [docs/incidents/](./docs/incidents/README.md)
+
+Postmortem drafts are due within 48 hours of incident resolution; publication within 5 business days.
+
 ## Roadmap (Phases)
 
 - **Phase 1**: Planning, Documentation, and Frontend UI Baseline (Completed)
diff --git a/docs/ci/postmortem-docs.workflow.yml b/docs/ci/postmortem-docs.workflow.yml
new file mode 100644
index 00000000..02016e2d
--- /dev/null
+++ b/docs/ci/postmortem-docs.workflow.yml
@@ -0,0 +1,26 @@
+# Postmortem Docs CI Workflow
+
+Install this file at `.github/workflows/postmortem-docs.yml` to enable PR validation
+for published postmortem reports.
+
+```yaml
+name: Validate Postmortem Docs
+
+on:
+  pull_request:
+    paths:
+      - 'docs/incidents/**'
+      - 'docs/runbooks/templates/**'
+      - 'docs/postmortem-playbook.md'
+      - 'scripts/validate-postmortem.sh'
+
+jobs:
+  validate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Validate postmortem structure
+        run: chmod +x scripts/validate-postmortem.sh && ./scripts/validate-postmortem.sh
+```
+
+See [ISSUE_769_IMPLEMENTATION_SUMMARY.md](../runbooks/ISSUE_769_IMPLEMENTATION_SUMMARY.md).
diff --git a/docs/incident_response_runbook.md b/docs/incident_response_runbook.md
index 0b46bf20..ca29c61a 100644
--- a/docs/incident_response_runbook.md
+++ b/docs/incident_response_runbook.md
@@ -77,7 +77,9 @@ This runbook documents the operational procedures for handling **RPC degradation
 ---
 
 ## 7. Post‑mortem & Continuous Improvement
-- Complete the **Post‑mortem Template** (`docs/POSTMORTEM_TEMPLATE.md`).
+- Complete the **Post‑mortem Template** ([`docs/runbooks/templates/post-mortem.md`](./runbooks/templates/post-mortem.md)).
+- Follow the **Publication Workflow** in [`docs/postmortem-playbook.md`](./postmortem-playbook.md).
+- Publish finalized reports to [`docs/incidents/`](./incidents/README.md).
 - Update runbook if new failure modes were discovered.
 - Review alert thresholds and adjust if false‑positives occurred.
 - Schedule a **runbook drill** quarterly.
diff --git a/docs/incidents/README.md b/docs/incidents/README.md
new file mode 100644
index 00000000..3d29c3b3
--- /dev/null
+++ b/docs/incidents/README.md
@@ -0,0 +1,26 @@
+# Published Incident Postmortems
+
+This directory contains finalized, published postmortem reports for YieldVault incidents and significant DR exercises.
+
+## Index
+
+| Date | Incident ID | Title | Severity | Postmortem |
+|------|-------------|-------|----------|------------|
+| — | — | *No published postmortems yet* | — | — |
+
+## Creating a New Postmortem
+
+1. Copy [`docs/runbooks/templates/post-mortem.md`](../runbooks/templates/post-mortem.md)
+2. Draft in `docs/incidents/drafts/` during review (optional)
+3. Follow the publication workflow in [`docs/postmortem-playbook.md`](../postmortem-playbook.md)
+4. Publish via PR using filename: `YYYY-MM-DD-INCIDENT-XXX-short-slug.md`
+5. Update this index table
+
+## Related Resources
+
+- [Postmortem Playbook](../postmortem-playbook.md)
+- [Incident Response Runbooks](../runbooks/README.md)
+- [Incident Report Template](../runbooks/templates/incident-report.md)
+
+**Last Updated:** June 26, 2026
+**Maintained By:** DevOps Team
diff --git a/docs/incidents/drafts/.gitkeep b/docs/incidents/drafts/.gitkeep
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/docs/incidents/drafts/.gitkeep
@@ -0,0 +1 @@
+
diff --git a/docs/postmortem-playbook.md b/docs/postmortem-playbook.md
new file mode 100644
index 00000000..26084ef7
--- /dev/null
+++ b/docs/postmortem-playbook.md
@@ -0,0 +1,138 @@
+# Incident Postmortem Playbook
+
+This document describes when YieldVault writes postmortems, how action items are
+tracked, and the publication workflow for finalized reports.
+
+---
+
+## 1. When to write a postmortem
+
+Write a postmortem for any of the following:
+
+| Trigger | Examples |
+|---------|----------|
+| **Severity 1–2 incidents** | Full outage, data loss risk, contract pause |
+| **DR events** | Database restore, RPC failover, backend redeploy under pressure |
+| **Security incidents** | Key compromise, unauthorized access, exploit attempt |
+| **Contract upgrades with issues** | Failed upgrade, rollback, unexpected state |
+
+Lower-severity incidents may use a shortened report at the Incident Commander's
+discretion, but must still capture root cause and action items.
+
+---
+
+## 2. Timeline
+
+| Phase | Deadline | Deliverable |
+|-------|----------|-------------|
+| During incident | Real-time | [Incident Report Template](./runbooks/templates/incident-report.md) |
+| Post-incident | Within 48 hours | Postmortem draft |
+| Publication | Within 5 business days | Published report in `docs/incidents/` |
+
+These deadlines align with the [Quick Reference](./runbooks/QUICK_REFERENCE.md)
+post-mortem checklist and [Incident Response Runbooks](./runbooks/README.md).
+
+---
+
+## 3. Roles
+
+| Role | Responsibility |
+|------|----------------|
+| **Incident Commander** | Owns timeline accuracy and severity classification |
+| **Author** | Drafts postmortem from incident report and logs |
+| **Reviewer** | DevOps or Security lead validates technical accuracy |
+| **Release engineer** | Ensures security-sensitive details follow disclosure rules |
+
+---
+
+## 4. Creation flow
+
+1. **Start from template** — Copy
+   [`docs/runbooks/templates/post-mortem.md`](./runbooks/templates/post-mortem.md).
+2. **Optional draft location** — Save work-in-progress to
+   `docs/incidents/drafts/INCIDENT-XXX-slug.md` (not indexed until published).
+3. **Gather inputs**:
+   - Live [incident report](./runbooks/templates/incident-report.md)
+   - Grafana / PagerDuty timelines
+   - Backend diagnostics bundle (`/api/diagnostics/bundle`)
+   - Relevant runbook steps exercised
+4. **Complete all sections** — Summary, impact metrics, timeline, root cause,
+   action items table, lessons learned.
+
+---
+
+## 5. Action-item tracking
+
+Every postmortem must include an **Action Items** table with:
+
+| Column | Required |
+|--------|----------|
+| ID | Yes (`AI-001`, `AI-002`, …) |
+| Action | Yes |
+| Owner | Yes |
+| Priority | Yes (P0/P1/P2) |
+| Due Date | Yes |
+| Tracking Issue | Yes — link to GitHub issue |
+| Status | Yes (Open / In Progress / Done) |
+
+**Workflow:**
+
+1. File each action item as a GitHub issue referencing the incident ID.
+2. Link the issue number in the postmortem table.
+3. Review open action items in the quarterly runbook review
+   ([runbooks README](./runbooks/README.md) §Continuous Improvement).
+
+---
+
+## 6. Review and redaction
+
+Before publication:
+
+- [ ] Incident Commander and Reviewer sign off on timeline and severity
+- [ ] Remove credentials, PII, and unreleased vulnerability details
+- [ ] For **security incidents**, follow the 48-hour minimum disclosure window
+  described in [Release Notes Playbook](./release-notes-playbook.md) §8
+- [ ] Confirm customer-facing language is approved if published externally
+
+---
+
+## 7. Publication flow
+
+1. **Open a PR** adding the finalized report to `docs/incidents/` using the
+   naming convention: `YYYY-MM-DD-INCIDENT-XXX-short-slug.md`
+2. **Set `Status: Published`** in the report header (drafts must not remain in
+   `docs/incidents/` root)
+3. **Update the index** in [`docs/incidents/README.md`](./incidents/README.md)
+4. **Link action items** — Ensure every `AI-xxx` row has a merged or open GitHub
+   issue
+5. **Update runbooks** if new failure modes were discovered
+6. **Announce** in `#yieldvault-incidents`; update status page if user-facing
+7. **Merge PR** after reviewer approval
+
+CI validates postmortem structure via `scripts/validate-postmortem.sh`. Install the
+workflow from [`docs/ci/postmortem-docs.workflow.yml`](./ci/postmortem-docs.workflow.yml)
+into `.github/workflows/` to enable automated PR checks.
+
+---
+
+## 8. DR test reports
+
+Disaster recovery exercises that surface runbook gaps should file a
+[DR Test Report](./runbooks/templates/dr-test-report.md). Significant findings
+warrant a full postmortem using the same publication flow.
+
+---
+
+## 9. Runbook feedback loop
+
+After each published postmortem:
+
+1. Identify runbook sections that were unclear or missing
+2. Open a follow-up PR updating the relevant runbook under `docs/runbooks/`
+3. Record the change in the postmortem's **Runbook Updates Required** section
+
+---
+
+**Last Updated:** June 26, 2026
+**Maintained By:** DevOps Team
+**Issue:** [#769](https://github.com/Junirezz/YieldVault-RWA/issues/769)
diff --git a/docs/runbooks/ISSUE_769_IMPLEMENTATION_SUMMARY.md b/docs/runbooks/ISSUE_769_IMPLEMENTATION_SUMMARY.md
new file mode 100644
index 00000000..e6c9a2cf
--- /dev/null
+++ b/docs/runbooks/ISSUE_769_IMPLEMENTATION_SUMMARY.md
@@ -0,0 +1,86 @@
+# Issue #769 Implementation Summary: Incident Postmortem Template and Publication Workflow
+
+**Issue:** General: Add incident postmortem template and publication workflow
+**Status:** ✅ COMPLETED
+**Date:** June 26, 2026
+
+---
+
+## Goal
+
+Create a standard postmortem template with action-item tracking and a publication
+workflow so the team can consistently document and learn from incidents.
+
+---
+
+## Scope Delivered
+
+### 1. Postmortem and Incident Templates ✅
+
+**Directory:** [docs/runbooks/templates/](./templates/)
+
+| File | Purpose |
+|------|---------|
+| [post-mortem.md](./templates/post-mortem.md) | Blameless postmortem with action-item table and publication checklist |
+| [incident-report.md](./templates/incident-report.md) | Live incident log during active response |
+| [dr-test-report.md](./templates/dr-test-report.md) | DR exercise report with RTO/RPO tracking |
+
+Fixes previously broken links in [runbooks README](./README.md) Appendix C.
+
+### 2. Publication Workflow Playbook ✅
+
+**File:** [docs/postmortem-playbook.md](../postmortem-playbook.md)
+
+- When to write postmortems (severity, DR, security, contract events)
+- 48-hour draft / 5-day publication timeline
+- Roles, review/redaction, and security disclosure alignment
+- PR-based publication flow into `docs/incidents/`
+- Action-item → GitHub issue tracking requirements
+
+### 3. Published Postmortem Archive ✅
+
+**File:** [docs/incidents/README.md](../incidents/README.md)
+
+- Index table for published reports
+- Naming convention: `YYYY-MM-DD-INCIDENT-XXX-slug.md`
+- Optional drafts under `docs/incidents/drafts/`
+
+### 4. Automation ✅
+
+| File | Purpose |
+|------|---------|
+| [scripts/new-postmortem.sh](../../scripts/new-postmortem.sh) | Scaffold draft from template |
+| [scripts/validate-postmortem.sh](../../scripts/validate-postmortem.sh) | CI validation for published reports |
+| [docs/ci/postmortem-docs.workflow.yml](../ci/postmortem-docs.workflow.yml) | Workflow definition for maintainers to install under `.github/workflows/` |
+
+### 5. Cross-Link Updates ✅
+
+- [docs/incident_response_runbook.md](../incident_response_runbook.md) — fixed broken template link
+- [docs/runbooks/README.md](./README.md) — quick links to playbook and incidents index
+- [docs/runbooks/QUICK_REFERENCE.md](./QUICK_REFERENCE.md) — postmortem step links
+- [README.md](../../README.md) — incident postmortems section
+- [CHANGELOG.md](../../CHANGELOG.md) — unreleased documentation entry
+
+---
+
+## Acceptance Checklist
+
+- [x] Standard postmortem template with action-item tracking
+- [x] Incident report template for live incidents
+- [x] DR test report template (unblocks broken README link)
+- [x] Publication workflow playbook
+- [x] Published postmortem archive index
+- [x] Scaffold and validation scripts
+- [x] CI workflow for postmortem doc validation
+- [x] Broken documentation links fixed
+
+---
+
+## Related Files
+
+- Issue: [#769](https://github.com/Junirezz/YieldVault-RWA/issues/769)
+- Pattern reference: [ISSUE_392_IMPLEMENTATION_SUMMARY.md](./ISSUE_392_IMPLEMENTATION_SUMMARY.md)
+- Release disclosure pattern: [release-notes-playbook.md](../release-notes-playbook.md) §8
+
+**Last Updated:** June 26, 2026
+**Maintained By:** DevOps Team
diff --git a/docs/runbooks/QUICK_REFERENCE.md b/docs/runbooks/QUICK_REFERENCE.md
index 2d661d6e..ea082e65 100644
--- a/docs/runbooks/QUICK_REFERENCE.md
+++ b/docs/runbooks/QUICK_REFERENCE.md
@@ -146,8 +146,8 @@ All runbooks: `docs/runbooks/`
 3. **Notify** - Alert team via PagerDuty/Slack
 4. **Respond** - Follow appropriate runbook
 5. **Verify** - Confirm system restored
-6. **Document** - Create incident report
-7. **Review** - Post-mortem within 48 hours
+6. **Document** - Create [incident report](./templates/incident-report.md)
+7. **Review** - [Post-mortem](./templates/post-mortem.md) within 48 hours per [playbook](../postmortem-playbook.md)
 
 ---
 
diff --git a/docs/runbooks/README.md b/docs/runbooks/README.md
index d6027f3d..dc3ed578 100644
--- a/docs/runbooks/README.md
+++ b/docs/runbooks/README.md
@@ -15,6 +15,8 @@ This directory contains operational runbooks for disaster recovery and incident
 | [RPC Failover](./RPC_FAILOVER.md) | 5 min | N/A | Stellar RPC node failure |
 | [Full DR Procedure](./FULL_DR_PROCEDURE.md) | 4 hours | 15 min | Complete infrastructure failure |
 | [Replay & State Recovery](./REPLAY_PROCEDURES.md) | N/A | N/A | Recovering/syncing ledger events or email queue |
+| [Postmortem Playbook](../postmortem-playbook.md) | N/A | N/A | Publishing incident postmortems |
+| [Published Postmortems](../incidents/README.md) | N/A | N/A | Archive of finalized incident reports |
 
 ---
 
diff --git a/docs/runbooks/templates/dr-test-report.md b/docs/runbooks/templates/dr-test-report.md
new file mode 100644
index 00000000..3f1f4a83
--- /dev/null
+++ b/docs/runbooks/templates/dr-test-report.md
@@ -0,0 +1,57 @@
+# DR Test Report: [TEST-ID] — [Scenario Name]
+
+**Test ID:** DR-TEST-___
+**Date:** YYYY-MM-DD
+**Participants:** [Names]
+**Runbook Exercised:** [link]
+**Facilitator:** [Name]
+**Last Updated:** YYYY-MM-DD
+
+---
+
+## Objectives
+
+- [Objective 1]
+
+## Targets vs Actuals
+
+| Metric | Target | Actual | Pass/Fail |
+|--------|--------|--------|-----------|
+| RTO | | | |
+| RPO | | | |
+| Total test duration | | | |
+
+## Test Steps
+
+| Step | Description | Expected | Actual | Pass/Fail | Notes |
+|------|-------------|----------|--------|-----------|-------|
+| 1 | | | | | |
+
+## Issues Encountered
+
+- [Issue description]
+
+## What Went Well
+
+- [Item]
+
+## What Could Be Improved
+
+- [Item]
+
+## Action Items
+
+| ID | Action | Owner | Priority | Due Date | Tracking Issue | Status |
+|----|--------|-------|----------|----------|----------------|--------|
+| AI-001 | | | P1 | YYYY-MM-DD | #___ | Open |
+
+## Sign-off
+
+| Role | Name | Date |
+|------|------|------|
+| Test lead | | |
+| Incident Commander | | |
+
+---
+
+*File completed reports in `docs/incidents/` when the test surfaces production-impacting findings. See the [Postmortem Playbook](../../postmortem-playbook.md).*
diff --git a/docs/runbooks/templates/incident-report.md b/docs/runbooks/templates/incident-report.md
new file mode 100644
index 00000000..f7711177
--- /dev/null
+++ b/docs/runbooks/templates/incident-report.md
@@ -0,0 +1,57 @@
+# Incident Report: [INCIDENT-ID] — [Brief Title]
+
+**Incident ID:** INCIDENT-___
+**Date Opened:** YYYY-MM-DD HH:MM UTC
+**Severity:** [Critical / High / Medium / Low]
+**Status:** [Investigating / Mitigating / Monitoring / Resolved]
+**Incident Commander:** [Name]
+**War Room Channel:** #yieldvault-war-room
+**Last Updated:** YYYY-MM-DD HH:MM UTC
+
+---
+
+## Affected Components
+
+- [ ] Backend API
+- [ ] Frontend
+- [ ] Database
+- [ ] RPC / Soroban nodes
+- [ ] Smart contracts
+- [ ] Other: ___
+
+## Runbook Used
+
+- [Runbook link or "N/A"]
+
+## Current Status
+
+[One-paragraph summary of current state and ETA]
+
+## Live Timeline (append during incident)
+
+| Time (UTC) | Actor | Event |
+|------------|-------|-------|
+| HH:MM | | Incident detected |
+| HH:MM | | |
+
+## Diagnostics Collected
+
+- [ ] Incident ticket created
+- [ ] Backend diagnostics bundle retrieved (`/api/diagnostics/bundle`)
+- [ ] RPC / node logs captured
+- [ ] Grafana dashboards linked
+
+## Communication Log
+
+| Time (UTC) | Channel | Message summary |
+|------------|---------|-----------------|
+| HH:MM | #yieldvault-incidents | |
+
+## Next Steps
+
+1. [Immediate action]
+2. [Follow-up]
+
+---
+
+*When the incident is resolved, complete the [Post-Mortem Template](./post-mortem.md) within 48 hours per the [Postmortem Playbook](../../postmortem-playbook.md).*
diff --git a/docs/runbooks/templates/post-mortem.md b/docs/runbooks/templates/post-mortem.md
new file mode 100644
index 00000000..a524c1a3
--- /dev/null
+++ b/docs/runbooks/templates/post-mortem.md
@@ -0,0 +1,77 @@
+# Post-Mortem: [INCIDENT-ID] — [Brief Title]
+
+**Incident ID:** INCIDENT-___
+**Date:** YYYY-MM-DD
+**Severity:** [Critical / High / Medium / Low]
+**Status:** [Draft / Published]
+**Authors:** [Names]
+**Reviewers:** [Names]
+**Last Updated:** YYYY-MM-DD
+**Related Runbook:** [link]
+
+---
+
+## Summary
+
+[2–3 sentences: what happened, user impact, resolution]
+
+## Impact
+
+| Metric | Value |
+|--------|-------|
+| Detection time (MTTD) | |
+| Response time | |
+| Recovery time (MTTR) | |
+| Total downtime | |
+| Data loss (RPO) | |
+| Affected components | |
+| Affected users | |
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| HH:MM | Incident detected |
+| HH:MM | Team assembled |
+| HH:MM | Mitigation started |
+| HH:MM | Service restored |
+| HH:MM | Monitoring confirmed stable |
+
+## Root Cause
+
+[Technical root cause — blameless]
+
+## Contributing Factors
+
+- [Factor 1]
+
+## What Went Well
+
+- [Item]
+
+## What Could Be Improved
+
+- [Item]
+
+## Action Items
+
+| ID | Action | Owner | Priority | Due Date | Tracking Issue | Status |
+|----|--------|-------|----------|----------|----------------|--------|
+| AI-001 | | | P0/P1/P2 | YYYY-MM-DD | #___ | Open |
+
+## Runbook Updates Required
+
+- [ ] [Runbook name] — [what to change]
+
+## Lessons Learned
+
+[Blameless takeaways]
+
+## Publication Checklist
+
+- [ ] Internal review complete
+- [ ] Sensitive details redacted (if customer-facing)
+- [ ] Action items filed as GitHub issues
+- [ ] Runbooks updated (if applicable)
+- [ ] Added to `docs/incidents/` index
+- [ ] Stakeholders notified (#yieldvault-incidents)
diff --git a/scripts/new-postmortem.sh b/scripts/new-postmortem.sh
new file mode 100644
index 00000000..3a775520
--- /dev/null
+++ b/scripts/new-postmortem.sh
@@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+# Scaffold a new postmortem draft from the standard template.
+set -euo pipefail
+
+if [[ $# -lt 2 ]]; then
+  echo "Usage: $0 INCIDENT-123 short-slug"
+  echo "Example: $0 INCIDENT-123 rpc-failover"
+  exit 1
+fi
+
+INCIDENT_ID="$1"
+SLUG="$2"
+DATE="$(date -u +%Y-%m-%d)"
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+DRAFTS_DIR="${REPO_ROOT}/docs/incidents/drafts"
+TEMPLATE="${REPO_ROOT}/docs/runbooks/templates/post-mortem.md"
+OUTPUT="${DRAFTS_DIR}/${DATE}-${INCIDENT_ID}-${SLUG}.md"
+
+mkdir -p "$DRAFTS_DIR"
+
+if [[ ! -f "$TEMPLATE" ]]; then
+  echo "ERROR: template not found at ${TEMPLATE}"
+  exit 1
+fi
+
+cp "$TEMPLATE" "$OUTPUT"
+sed -i "s/INCIDENT-___/${INCIDENT_ID}/" "$OUTPUT" 2>/dev/null || \
+  sed -i '' "s/INCIDENT-___/${INCIDENT_ID}/" "$OUTPUT"
+sed -i "s/YYYY-MM-DD/${DATE}/" "$OUTPUT" 2>/dev/null || \
+  sed -i '' "s/YYYY-MM-DD/${DATE}/" "$OUTPUT"
+
+echo "Created draft: ${OUTPUT}"
diff --git a/scripts/validate-postmortem.sh b/scripts/validate-postmortem.sh
new file mode 100644
index 00000000..df1b4039
--- /dev/null
+++ b/scripts/validate-postmortem.sh
@@ -0,0 +1,84 @@
+#!/usr/bin/env bash
+# Validate postmortem markdown structure for published reports in docs/incidents/.
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+INCIDENTS_DIR="${REPO_ROOT}/docs/incidents"
+
+REQUIRED_HEADINGS=(
+  "## Summary"
+  "## Impact"
+  "## Timeline"
+  "## Root Cause"
+  "## Action Items"
+  "## Lessons Learned"
+)
+
+errors=0
+
+validate_published_report() {
+  local file="$1"
+  local basename
+  basename="$(basename "$file")"
+
+  if [[ "$basename" == "README.md" ]]; then
+    return 0
+  fi
+
+  if [[ "$basename" == .gitkeep ]]; then
+    return 0
+  fi
+
+  if [[ "$file" == *"/drafts/"* ]]; then
+    return 0
+  fi
+
+  if [[ ! "$basename" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}-INCIDENT-.+\.md$ ]]; then
+    echo "ERROR: ${file}: filename must match YYYY-MM-DD-INCIDENT-*.md"
+    errors=$((errors + 1))
+  fi
+
+  for heading in "${REQUIRED_HEADINGS[@]}"; do
+    if ! grep -qF "$heading" "$file"; then
+      echo "ERROR: ${file}: missing required heading ${heading}"
+      errors=$((errors + 1))
+    fi
+  done
+
+  if grep -qE '^\*\*Status:\*\*.*Draft' "$file"; then
+    echo "ERROR: ${file}: published reports must not have Status: Draft"
+    errors=$((errors + 1))
+  fi
+
+  if ! grep -qE '^\| ID \| Action \| Owner \|' "$file"; then
+    echo "ERROR: ${file}: action items table must include ID, Action, Owner columns"
+    errors=$((errors + 1))
+  fi
+}
+
+# Validate templates exist
+for template in post-mortem.md incident-report.md dr-test-report.md; do
+  if [[ ! -f "${REPO_ROOT}/docs/runbooks/templates/${template}" ]]; then
+    echo "ERROR: missing template docs/runbooks/templates/${template}"
+    errors=$((errors + 1))
+  fi
+done
+
+if [[ ! -f "${REPO_ROOT}/docs/postmortem-playbook.md" ]]; then
+  echo "ERROR: missing docs/postmortem-playbook.md"
+  errors=$((errors + 1))
+fi
+
+# Validate published incident reports (if any)
+if [[ -d "$INCIDENTS_DIR" ]]; then
+  while IFS= read -r -d '' file; do
+    validate_published_report "$file"
+  done < <(find "$INCIDENTS_DIR" -maxdepth 1 -name '*.md' -print0 2>/dev/null || true)
+fi
+
+if [[ "$errors" -gt 0 ]]; then
+  echo "Postmortem validation failed with ${errors} error(s)."
+  exit 1
+fi
+
+echo "Postmortem validation passed."
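
Reviewer note: the per-report checks in `scripts/validate-postmortem.sh` (filename pattern, required headings, no `Status: Draft`, action-items header row) can be tried outside the repository. The following is a minimal sketch that re-implements those four checks against a throwaway file in a temporary directory; it is not part of the patch. The helper name `check_report`, the sample filename, and the sample report body are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the per-report checks from validate-postmortem.sh.
# check_report, the sample filename, and the sample content are hypothetical.
set -euo pipefail

tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT

name_pattern='^[0-9]{4}-[0-9]{2}-[0-9]{2}-INCIDENT-.+\.md$'
required_headings=("## Summary" "## Impact" "## Timeline" "## Root Cause"
                   "## Action Items" "## Lessons Learned")

# Print the number of problems found in one report file.
check_report() {
  local file="$1" name heading problems=0
  name="$(basename "$file")"
  [[ "$name" =~ $name_pattern ]] || problems=$((problems + 1))
  for heading in "${required_headings[@]}"; do
    grep -qF "$heading" "$file" || problems=$((problems + 1))
  done
  if grep -qE '^\*\*Status:\*\*.*Draft' "$file"; then
    problems=$((problems + 1))
  fi
  grep -qE '^\| ID \| Action \| Owner \|' "$file" || problems=$((problems + 1))
  echo "$problems"
}

# A well-formed published report (hypothetical content).
report="${tmpdir}/2026-06-26-INCIDENT-123-rpc-failover.md"
cat > "$report" <<'EOF'
# Post-Mortem: INCIDENT-123 - RPC failover

**Status:** Published

## Summary
## Impact
## Timeline (UTC)
## Root Cause
## Action Items

| ID | Action | Owner | Priority | Due Date | Tracking Issue | Status |
|----|--------|-------|----------|----------|----------------|--------|
| AI-001 | Example | Example | P1 | 2026-07-01 | #000 | Open |

## Lessons Learned
EOF

# Same body, but with a non-conforming filename and a Draft status.
bad="${tmpdir}/notes.md"
sed 's/Published/Draft/' "$report" > "$bad"

echo "good report problems: $(check_report "$report")"
echo "draft copy problems: $(check_report "$bad")"
```

The well-formed sample reports zero problems. The copy trips two checks: the filename does not match `YYYY-MM-DD-INCIDENT-*.md`, and the header still says `Draft`. Because headings are matched as fixed substrings, `## Timeline (UTC)` from the template satisfies the `## Timeline` requirement.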