Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -20,6 +20,7 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- Vault performance dynamic date filter

### Documentation
- Add incident postmortem templates, publication playbook, and CI validation workflow (#769)
- Add release notes playbook and changelog curation guidelines (#618)
- Add API versioning and deprecation policy with sunset windows, migration guide, and breaking-change classification (#610)

10 changes: 10 additions & 0 deletions README.md
@@ -163,6 +163,16 @@ YieldVault has comprehensive disaster recovery procedures to ensure system resil
- [Disaster Recovery Runbooks Overview](./docs/runbooks/README.md)
- [Replay and State Recovery Procedures](./docs/runbooks/REPLAY_PROCEDURES.md)

## Incident Postmortems

YieldVault documents significant incidents with blameless postmortems and tracked action items:

- **Templates:** [Post-mortem](./docs/runbooks/templates/post-mortem.md), [Incident Report](./docs/runbooks/templates/incident-report.md)
- **Publication workflow:** [Postmortem Playbook](./docs/postmortem-playbook.md)
- **Published reports:** [docs/incidents/](./docs/incidents/README.md)

Postmortem drafts are due within 48 hours of incident resolution, and finalized reports are published within 5 business days.

## Roadmap (Phases)

- **Phase 1**: Planning, Documentation, and Frontend UI Baseline (Completed)
26 changes: 26 additions & 0 deletions docs/ci/postmortem-docs.workflow.yml
@@ -0,0 +1,26 @@
# Postmortem Docs CI Workflow

Install this file at `.github/workflows/postmortem-docs.yml` to enable PR validation
for published postmortem reports.

```yaml
name: Validate Postmortem Docs

on:
  pull_request:
    paths:
      - 'docs/incidents/**'
      - 'docs/runbooks/templates/**'
      - 'docs/postmortem-playbook.md'
      - 'scripts/validate-postmortem.sh'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate postmortem structure
        run: chmod +x scripts/validate-postmortem.sh && ./scripts/validate-postmortem.sh
```

See [ISSUE_769_IMPLEMENTATION_SUMMARY.md](../runbooks/ISSUE_769_IMPLEMENTATION_SUMMARY.md).
4 changes: 3 additions & 1 deletion docs/incident_response_runbook.md
@@ -77,7 +77,9 @@ This runbook documents the operational procedures for handling **RPC degradation

---
## 7. Post‑mortem & Continuous Improvement
-- Complete the **Post‑mortem Template** (`docs/POSTMORTEM_TEMPLATE.md`).
+- Complete the **Post‑mortem Template** ([`docs/runbooks/templates/post-mortem.md`](./runbooks/templates/post-mortem.md)).
+- Follow the **Publication Workflow** in [`docs/postmortem-playbook.md`](./postmortem-playbook.md).
+- Publish finalized reports to [`docs/incidents/`](./incidents/README.md).
- Update runbook if new failure modes were discovered.
- Review alert thresholds and adjust if false‑positives occurred.
- Schedule a **runbook drill** quarterly.
26 changes: 26 additions & 0 deletions docs/incidents/README.md
@@ -0,0 +1,26 @@
# Published Incident Postmortems

This directory contains finalized, published postmortem reports for YieldVault incidents and significant DR exercises.

## Index

| Date | Incident ID | Title | Severity | Postmortem |
|------|-------------|-------|----------|------------|
| — | — | *No published postmortems yet* | — | — |

## Creating a New Postmortem

1. Copy [`docs/runbooks/templates/post-mortem.md`](../runbooks/templates/post-mortem.md)
2. Draft in `docs/incidents/drafts/` during review (optional)
3. Follow the publication workflow in [`docs/postmortem-playbook.md`](../postmortem-playbook.md)
4. Publish via PR using filename: `YYYY-MM-DD-INCIDENT-XXX-short-slug.md`
5. Update this index table
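
The naming convention in step 4 can be sketched as a small helper. This is only an illustration: the source does not define what `XXX` or the slug may contain, so the three-digit, zero-padded incident number and the lowercase hyphenated slug are assumptions, and `slugify`, `postmortem_filename`, and `FILENAME_RE` are hypothetical names that are not part of the repository.

```python
import re
from datetime import date

# Assumed shape: YYYY-MM-DD-INCIDENT-NNN-lowercase-slug.md
# (the three-digit number and the slug alphabet are assumptions, not repo rules).
FILENAME_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}-INCIDENT-\d{3}-[a-z0-9]+(?:-[a-z0-9]+)*\.md$"
)


def slugify(title: str) -> str:
    """Lowercase the title and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def postmortem_filename(day: date, incident_number: int, title: str) -> str:
    """Build a filename that follows the convention under the assumptions above."""
    return f"{day.isoformat()}-INCIDENT-{incident_number:03d}-{slugify(title)}.md"
```

A reviewer could apply the same pattern with `FILENAME_RE.match(name)` to flag files in `docs/incidents/` that do not follow the convention.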

## Related Resources

- [Postmortem Playbook](../postmortem-playbook.md)
- [Incident Response Runbooks](../runbooks/README.md)
- [Incident Report Template](../runbooks/templates/incident-report.md)

**Last Updated:** June 26, 2026
**Maintained By:** DevOps Team
1 change: 1 addition & 0 deletions docs/incidents/drafts/.gitkeep
@@ -0,0 +1 @@

138 changes: 138 additions & 0 deletions docs/postmortem-playbook.md
@@ -0,0 +1,138 @@
# Incident Postmortem Playbook

This document describes when YieldVault writes postmortems, how action items are
tracked, and the publication workflow for finalized reports.

---

## 1. When to write a postmortem

Write a postmortem for any of the following:

| Trigger | Examples |
|---------|----------|
| **Severity 1–2 incidents** | Full outage, data loss risk, contract pause |
| **DR events** | Database restore, RPC failover, backend redeploy under pressure |
| **Security incidents** | Key compromise, unauthorized access, exploit attempt |
| **Contract upgrades with issues** | Failed upgrade, rollback, unexpected state |

Lower-severity incidents may use a shortened report at the Incident Commander's
discretion, but must still capture root cause and action items.
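
The trigger table can be read as a simple decision rule. The sketch below is one possible encoding, not an official classifier: the `Incident` type, the `kind` labels, and the assumption that severity is an integer where 1 is the most severe are all introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for the non-severity triggers in the table above.
FULL_POSTMORTEM_KINDS = {"dr_event", "security_incident", "contract_upgrade_issue"}


@dataclass(frozen=True)
class Incident:
    severity: int          # assumed scale: 1 is the most severe
    kind: str = "general"  # hypothetical category label


def minimum_report(incident: Incident) -> str:
    """Return the least detailed report type the playbook would allow."""
    if incident.severity <= 2 or incident.kind in FULL_POSTMORTEM_KINDS:
        return "full"
    # A shortened report is at the Incident Commander's discretion and must
    # still capture root cause and action items.
    return "shortened"
```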

---

## 2. Timeline

| Phase | Deadline | Deliverable |
|-------|----------|-------------|
| During incident | Real-time | [Incident Report Template](./runbooks/templates/incident-report.md) |
| Post-incident | Within 48 hours | Postmortem draft |
| Publication | Within 5 business days | Published report in `docs/incidents/` |

These deadlines align with the [Quick Reference](./runbooks/QUICK_REFERENCE.md)
post-mortem checklist and [Incident Response Runbooks](./runbooks/README.md).
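
As a worked example of the two deadlines, the following sketch computes both from a resolution timestamp. It rests on assumptions the playbook does not state: business days are Monday through Friday with no holiday calendar, and the 5-business-day window is counted from incident resolution. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole days, counting only Monday to Friday (no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Draft is due 48 hours after resolution; publication 5 business days after."""
    return {
        "draft_due": resolved_at + timedelta(hours=48),
        "publication_due": add_business_days(resolved_at, 5),
    }
```

Under these assumptions, an incident resolved on a Friday has a draft deadline that falls on the weekend, while the publication deadline lands on the following Friday.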

---

## 3. Roles

| Role | Responsibility |
|------|----------------|
| **Incident Commander** | Owns timeline accuracy and severity classification |
| **Author** | Drafts postmortem from incident report and logs |
| **Reviewer** | DevOps or Security lead validates technical accuracy |
| **Release engineer** | Ensures security-sensitive details follow disclosure rules |

---

## 4. Creation flow

1. **Start from template:** Copy
   [`docs/runbooks/templates/post-mortem.md`](./runbooks/templates/post-mortem.md).
2. **Optional draft location:** Save work-in-progress to
   `docs/incidents/drafts/INCIDENT-XXX-slug.md` (not indexed until published).
3. **Gather inputs:**
   - Live [incident report](./runbooks/templates/incident-report.md)
   - Grafana / PagerDuty timelines
   - Backend diagnostics bundle (`/api/diagnostics/bundle`)
   - Relevant runbook steps exercised
4. **Complete all sections:** Summary, impact metrics, timeline, root cause,
   action items table, lessons learned.
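
A completeness check for step 4 might look like the sketch below. The heading names are inferred from the list in step 4 and may not match the exact headings in the template, and `missing_sections` is a hypothetical helper, not the logic of `scripts/validate-postmortem.sh`.

```python
import re

# Assumed heading keywords, inferred from step 4; the real template may differ.
REQUIRED_SECTIONS = (
    "Summary",
    "Impact",
    "Timeline",
    "Root Cause",
    "Action Items",
    "Lessons Learned",
)

HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(markdown: str) -> list[str]:
    """Return the required sections that no Markdown heading mentions."""
    headings = [match.group(1).lower() for match in HEADING_RE.finditer(markdown)]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(section.lower() in heading for heading in headings)
    ]
```

Matching by substring is a deliberate choice here so that a heading such as "Impact Metrics" satisfies the "Impact" requirement.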

---

## 5. Action-item tracking

Every postmortem must include an **Action Items** table with:

| Column | Required |
|--------|----------|
| ID | Yes (`AI-001`, `AI-002`, …) |
| Action | Yes |
| Owner | Yes |
| Priority | Yes (P0/P1/P2) |
| Due Date | Yes |
| Tracking Issue | Yes — link to GitHub issue |
| Status | Yes (Open / In Progress / Done) |

**Workflow:**

1. File each action item as a GitHub issue referencing the incident ID.
2. Link the issue number in the postmortem table.
3. Review open action items in the quarterly runbook review
([runbooks README](./runbooks/README.md) §Continuous Improvement).
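
The required columns can be checked mechanically. The sketch below parses a Markdown pipe table and reports problems per row. The column names and allowed values come from the table above, while the `YYYY-MM-DD` due-date format and the `#123` or `/issues/123` tracking-issue format are assumptions, as are the helper names.

```python
import re

# Column names and allowed values follow the table above.
COLUMNS = ["ID", "Action", "Owner", "Priority", "Due Date", "Tracking Issue", "Status"]
PRIORITIES = {"P0", "P1", "P2"}
STATUSES = {"Open", "In Progress", "Done"}


def parse_rows(table: str) -> list[dict[str, str]]:
    """Turn a Markdown pipe table into one dict per data row."""
    lines = [ln.strip() for ln in table.splitlines() if ln.strip().startswith("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the |---| separator
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


def row_errors(row: dict[str, str]) -> list[str]:
    """List every problem found in a single action-item row."""
    errors = []
    if not re.fullmatch(r"AI-\d{3}", row.get("ID", "")):
        errors.append("ID must look like AI-001")
    for column in ("Action", "Owner"):
        if not row.get(column):
            errors.append(f"{column} is required")
    if row.get("Priority") not in PRIORITIES:
        errors.append("Priority must be P0, P1, or P2")
    # Assumed date format: ISO 8601 calendar date.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", row.get("Due Date", "")):
        errors.append("Due Date must be YYYY-MM-DD")
    # Assumed link format: '#123' or a URL containing '/issues/123'.
    if not re.search(r"#\d+|/issues/\d+", row.get("Tracking Issue", "")):
        errors.append("Tracking Issue must reference a GitHub issue")
    if row.get("Status") not in STATUSES:
        errors.append("Status must be Open, In Progress, or Done")
    return errors
```

An unfilled template row (for example one that still contains `YYYY-MM-DD` and `#___`) fails several checks at once, which makes leftover placeholders easy to spot before publication.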

---

## 6. Review and redaction

Before publication:

- [ ] Incident Commander and Reviewer sign off on timeline and severity
- [ ] Remove credentials, PII, and unreleased vulnerability details
- [ ] For **security incidents**, follow the 48-hour minimum disclosure window
described in [Release Notes Playbook](./release-notes-playbook.md) §8
- [ ] Confirm customer-facing language is approved if published externally

---

## 7. Publication flow

1. **Open a PR** adding the finalized report to `docs/incidents/` using the
   naming convention: `YYYY-MM-DD-INCIDENT-XXX-short-slug.md`
2. **Set `Status: Published`** in the report header (drafts must not remain in
   the `docs/incidents/` root)
3. **Update the index** in [`docs/incidents/README.md`](./incidents/README.md)
4. **Link action items:** ensure every `AI-xxx` row links to a GitHub issue
   (open or closed)
5. **Update runbooks** if new failure modes were discovered
6. **Announce** in `#yieldvault-incidents`; update the status page if user-facing
7. **Merge PR** after reviewer approval

CI validates postmortem structure via `scripts/validate-postmortem.sh`. Install the
workflow from [`docs/ci/postmortem-docs.workflow.yml`](./ci/postmortem-docs.workflow.yml)
into `.github/workflows/` to enable automated PR checks.
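
The contents of `scripts/validate-postmortem.sh` are not shown here, so the following is only a sketch of one check from step 2 that such validation could perform: reports outside `drafts/` must carry a `Published` status. The header syntax (`Status: Published`, optionally bold as `**Status:** Published`) and the function name are assumptions.

```python
import re
from pathlib import PurePosixPath

# Accepts 'Status: X' and '**Status:** X' at the start of a line (assumed syntax).
STATUS_RE = re.compile(
    r"^(?:\*\*)?Status:(?:\*\*)?[ \t]*(.+?)[ \t]*$", flags=re.MULTILINE
)


def publication_errors(path: str, markdown: str) -> list[str]:
    """Check that a report outside drafts/ declares Status: Published."""
    in_drafts = "drafts" in PurePosixPath(path).parts
    match = STATUS_RE.search(markdown)
    if match is None:
        return ["missing Status header"]
    status = match.group(1)
    if not in_drafts and status != "Published":
        return [f"status is '{status}', expected 'Published' outside drafts/"]
    return []
```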

---

## 8. DR test reports

Disaster recovery exercises that surface runbook gaps should file a
[DR Test Report](./runbooks/templates/dr-test-report.md). Significant findings
warrant a full postmortem using the same publication flow.

---

## 9. Runbook feedback loop

After each published postmortem:

1. Identify runbook sections that were unclear or missing
2. Open a follow-up PR updating the relevant runbook under `docs/runbooks/`
3. Record the change in the postmortem's **Runbook Updates Required** section

---

**Last Updated:** June 26, 2026
**Maintained By:** DevOps Team
**Issue:** [#769](https://github.com/Junirezz/YieldVault-RWA/issues/769)
86 changes: 86 additions & 0 deletions docs/runbooks/ISSUE_769_IMPLEMENTATION_SUMMARY.md
@@ -0,0 +1,86 @@
# Issue #769 Implementation Summary: Incident Postmortem Template and Publication Workflow

**Issue:** General: Add incident postmortem template and publication workflow
**Status:** ✅ COMPLETED
**Date:** June 26, 2026

---

## Goal

Create a standard postmortem template with action-item tracking and a publication
workflow so the team can consistently document and learn from incidents.

---

## Scope Delivered

### 1. Postmortem and Incident Templates ✅

**Directory:** [docs/runbooks/templates/](./templates/)

| File | Purpose |
|------|---------|
| [post-mortem.md](./templates/post-mortem.md) | Blameless postmortem with action-item table and publication checklist |
| [incident-report.md](./templates/incident-report.md) | Live incident log during active response |
| [dr-test-report.md](./templates/dr-test-report.md) | DR exercise report with RTO/RPO tracking |

Fixes previously broken links in [runbooks README](./README.md) Appendix C.

### 2. Publication Workflow Playbook ✅

**File:** [docs/postmortem-playbook.md](../postmortem-playbook.md)

- When to write postmortems (severity, DR, security, contract events)
- 48-hour draft / 5-business-day publication timeline
- Roles, review/redaction, and security disclosure alignment
- PR-based publication flow into `docs/incidents/`
- Action-item → GitHub issue tracking requirements

### 3. Published Postmortem Archive ✅

**File:** [docs/incidents/README.md](../incidents/README.md)

- Index table for published reports
- Naming convention: `YYYY-MM-DD-INCIDENT-XXX-short-slug.md`
- Optional drafts under `docs/incidents/drafts/`

### 4. Automation ✅

| File | Purpose |
|------|---------|
| [scripts/new-postmortem.sh](../../scripts/new-postmortem.sh) | Scaffold draft from template |
| [scripts/validate-postmortem.sh](../../scripts/validate-postmortem.sh) | CI validation for published reports |
| [docs/ci/postmortem-docs.workflow.yml](../ci/postmortem-docs.workflow.yml) | Workflow definition for maintainers to install under `.github/workflows/` |

### 5. Cross-Link Updates ✅

- [docs/incident_response_runbook.md](../incident_response_runbook.md) — fixed broken template link
- [docs/runbooks/README.md](./README.md) — quick links to playbook and incidents index
- [docs/runbooks/QUICK_REFERENCE.md](./QUICK_REFERENCE.md) — postmortem step links
- [README.md](../../README.md) — incident postmortems section
- [CHANGELOG.md](../../CHANGELOG.md) — unreleased documentation entry

---

## Acceptance Checklist

- [x] Standard postmortem template with action-item tracking
- [x] Incident report template for live incidents
- [x] DR test report template (fixes the broken README link)
- [x] Publication workflow playbook
- [x] Published postmortem archive index
- [x] Scaffold and validation scripts
- [x] CI workflow for postmortem doc validation
- [x] Broken documentation links fixed

---

## Related Files

- Issue: [#769](https://github.com/Junirezz/YieldVault-RWA/issues/769)
- Pattern reference: [ISSUE_392_IMPLEMENTATION_SUMMARY.md](./ISSUE_392_IMPLEMENTATION_SUMMARY.md)
- Release disclosure pattern: [release-notes-playbook.md](../release-notes-playbook.md) §8

**Last Updated:** June 26, 2026
**Maintained By:** DevOps Team
4 changes: 2 additions & 2 deletions docs/runbooks/QUICK_REFERENCE.md
@@ -146,8 +146,8 @@ All runbooks: `docs/runbooks/`
3. **Notify** - Alert team via PagerDuty/Slack
4. **Respond** - Follow appropriate runbook
5. **Verify** - Confirm system restored
-6. **Document** - Create incident report
-7. **Review** - Post-mortem within 48 hours
+6. **Document** - Create [incident report](./templates/incident-report.md)
+7. **Review** - [Post-mortem](./templates/post-mortem.md) within 48 hours per [playbook](../postmortem-playbook.md)

---

2 changes: 2 additions & 0 deletions docs/runbooks/README.md
@@ -15,6 +15,8 @@ This directory contains operational runbooks for disaster recovery and incident
| [RPC Failover](./RPC_FAILOVER.md) | 5 min | N/A | Stellar RPC node failure |
| [Full DR Procedure](./FULL_DR_PROCEDURE.md) | 4 hours | 15 min | Complete infrastructure failure |
| [Replay & State Recovery](./REPLAY_PROCEDURES.md) | N/A | N/A | Recovering/syncing ledger events or email queue |
| [Postmortem Playbook](../postmortem-playbook.md) | N/A | N/A | Publishing incident postmortems |
| [Published Postmortems](../incidents/README.md) | N/A | N/A | Archive of finalized incident reports |

---

57 changes: 57 additions & 0 deletions docs/runbooks/templates/dr-test-report.md
@@ -0,0 +1,57 @@
# DR Test Report: [TEST-ID] — [Scenario Name]

**Test ID:** DR-TEST-___
**Date:** YYYY-MM-DD
**Participants:** [Names]
**Runbook Exercised:** [link]
**Facilitator:** [Name]
**Last Updated:** YYYY-MM-DD

---

## Objectives

- [Objective 1]

## Targets vs Actuals

| Metric | Target | Actual | Pass/Fail |
|--------|--------|--------|-----------|
| RTO | | | |
| RPO | | | |
| Total test duration | | | |

## Test Steps

| Step | Description | Expected | Actual | Pass/Fail | Notes |
|------|-------------|----------|--------|-----------|-------|
| 1 | | | | | |

## Issues Encountered

- [Issue description]

## What Went Well

- [Item]

## What Could Be Improved

- [Item]

## Action Items

| ID | Action | Owner | Priority | Due Date | Tracking Issue | Status |
|----|--------|-------|----------|----------|----------------|--------|
| AI-001 | | | P1 | YYYY-MM-DD | #___ | Open |

## Sign-off

| Role | Name | Date |
|------|------|------|
| Test lead | | |
| Incident Commander | | |

---

*File completed reports in `docs/incidents/` when the test surfaces production-impacting findings. See the [Postmortem Playbook](../../postmortem-playbook.md).*