Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
166 changes: 166 additions & 0 deletions INCIDENT_RESPONSE_RUNBOOK.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,166 @@
# NotifyChain Incident Response Runbook

This runbook defines how maintainers respond to production incidents affecting NotifyChain listener, scheduler, dashboard, and contract-facing operations.

## Severity Levels

| Severity | Definition | Examples | Initial Response Target |
|----------|------------|----------|-------------------------|
| SEV1 | Complete production outage or user-impacting data loss risk | Listener cannot ingest events, scheduler cannot dispatch due notifications, contract integration is unusable | 15 minutes |
| SEV2 | Major degradation with available workaround | Retry queue backing up, Discord delivery failing for many recipients, API 500 rate elevated | 30 minutes |
| SEV3 | Partial degradation or isolated customer impact | One endpoint failing, delayed metrics, single integration misconfigured | 1 business hour |
| SEV4 | Low-impact operational issue | Documentation mismatch, non-critical alert, cosmetic dashboard issue | 1 business day |

## Roles

| Role | Responsibilities |
|------|------------------|
| Incident Commander | Owns coordination, severity, timeline, and final resolution call. |
| Technical Lead | Investigates root cause, directs fixes, and validates recovery. |
| Communications Lead | Sends internal and external updates. |
| Scribe | Records timeline, commands, decisions, and follow-up items. |

For small incidents one maintainer may hold multiple roles, but the incident commander should still be explicit.

## Response Procedure

1. Acknowledge the alert or report in the incident channel.
2. Assign severity and roles.
3. Create or update an incident tracking issue.
4. Confirm current blast radius:
- Check `/health` and indexing health endpoints.
- Check listener logs for request IDs and correlation IDs.
- Check scheduler stats and overdue scheduled notifications.
- Check `notification_execution_log` for recent `FAILED` or repeated `RETRY` rows.
- Check rate-limit, archive, template, and analytics dependencies when affected.
5. Stabilize the system:
- Roll back the last risky deploy if symptoms began immediately after release.
- Pause high-volume jobs if they are amplifying failures.
- Recover stale scheduler locks if dispatchers crashed.
- Increase worker capacity only after confirming the backing service is healthy.
6. Communicate status using the templates below.
7. Verify recovery with user-visible checks and relevant tests.
8. Downgrade or resolve the incident.
9. Schedule a postmortem for SEV1 and SEV2 incidents.

## Scheduler And Retry Checks

Use these checks when scheduled notifications are late, missing, or retrying unexpectedly.

```sql
SELECT status, COUNT(*) AS count
FROM scheduled_notifications
GROUP BY status;
```

```sql
SELECT id, execute_at, retry_count, max_retries, next_retry_at, last_error
FROM scheduled_notifications
WHERE status IN ('PENDING', 'PROCESSING')
ORDER BY execute_at ASC
LIMIT 20;
```

```sql
SELECT scheduled_notification_id, execution_attempt, status, error_message, execution_time
FROM notification_execution_log
ORDER BY execution_time DESC
LIMIT 50;
```

Recovery guidance:

| Symptom | Action |
|---------|--------|
| Many `PROCESSING` rows have expired locks | Restart the listener or run the scheduler so stale-lock recovery returns rows to `PENDING`. |
| `PENDING` rows are overdue but not dispatched | Confirm scheduler is enabled, poll interval is sane, and worker shutdown is not in progress. |
| Rows retry immediately too often | Check `next_retry_at` and retry scheduler config (`baseDelayMs`, `multiplier`, `maxDelayMs`, `jitter`). |
| Rows hit `FAILED` quickly | Compare `retry_count` with `max_retries` and inspect `last_error` for permanent validation or configuration errors. |

## Communication Templates

### Initial Internal Update

```text
Incident opened: <SEV> - <short title>
Impact: <who/what is affected>
Started: <UTC timestamp>
Commander: <name>
Current hypothesis: <brief>
Next update: <time>
```

### External Status Update

```text
We are investigating an issue affecting <component>. Some users may experience <impact>. We will provide another update by <time>.
```

### Mitigation Update

```text
We have identified the cause as <cause> and applied <mitigation>. We are monitoring recovery and will confirm once service is stable.
```

### Resolution Update

```text
The issue affecting <component> has been resolved as of <UTC timestamp>. Impact was <summary>. We will publish follow-up actions after review.
```

## Resolution Criteria

An incident can be resolved when:

1. The user-facing symptom is no longer occurring.
2. Relevant health checks are green.
3. Queue backlog is draining or back to normal.
4. No new related errors are appearing in logs for at least one monitoring window.
5. The incident commander and technical lead agree recovery is stable.

## Postmortem Checklist

Complete this checklist for all SEV1 and SEV2 incidents.

- [ ] Timeline includes detection, acknowledgement, mitigation, and resolution timestamps.
- [ ] Customer impact is quantified.
- [ ] Root cause is stated without blame.
- [ ] Triggering condition is identified.
- [ ] Detection gaps are documented.
- [ ] Prevention actions have owners and due dates.
- [ ] Tests, alerts, docs, or runbooks are updated where needed.
- [ ] Follow-up issue links are added to the incident record.
- [ ] Final summary is shared with maintainers and affected stakeholders.

## Postmortem Template

```markdown
# Postmortem: <Incident Title>

Date:
Severity:
Status:
Incident Commander:
Technical Lead:

## Summary

## Impact

## Timeline

## Root Cause

## What Went Well

## What Went Poorly

## Where We Got Lucky

## Corrective Actions

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|

## Follow-up Links
```
2 changes: 2 additions & 0 deletions listener/API.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@

Base URL: `http://localhost:8787` (configured via `EVENTS_API_PORT`)

For a centralized list of API errors, causes, examples, and troubleshooting steps, see [API_ERROR_REFERENCE.md](./API_ERROR_REFERENCE.md).

---

## Events
Expand Down
152 changes: 152 additions & 0 deletions listener/API_ERROR_REFERENCE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
# Notification API Error Reference

This reference lists listener API error responses, common causes, and operator or client actions. Error bodies use JSON and generally follow:

```json
{ "error": "Human readable error message" }
```

Batch validation endpoints return structured validation results instead:

```json
{
"valid": false,
"processedCount": 0,
"errors": [
{
"index": -1,
"code": "PARSE_ERROR",
"message": "Request body must be valid JSON."
}
]
}
```

## HTTP Status Codes

| Status | Meaning | Typical Causes | Recommended Resolution |
|--------|---------|----------------|------------------------|
| 400 | Bad request | Invalid JSON, missing required fields, invalid IDs, invalid dates, invalid template or preference body | Fix request syntax and required fields before retrying. |
| 401 | Unauthorized | Missing webhook signature, missing key ID, unknown key ID, expired signature, invalid signature | Re-sign the exact raw body with a registered secret and send required headers. |
| 404 | Not found | Unknown route, scheduled notification ID not found, template not found, archived record not found | Verify the endpoint path and resource ID. |
| 429 | Too many requests | Rate limit exceeded for the client IP or API key | Wait for `Retry-After`, reduce request rate, or request a higher configured limit. |
| 500 | Internal server error | Database read/write failure, downstream service failure, unexpected handler error | Check listener logs with the request/correlation ID, database health, and downstream dependency status. |
| 503 | Service unavailable | Optional service disabled or unavailable, such as scheduler, templates, archive, analytics, metrics, or rate-limit metrics | Enable/configure the service or retry after the dependency is restored. |

## Scheduled Notifications

### `POST /api/schedule`

| Status | Error | Cause | Resolution |
|--------|-------|-------|------------|
| 400 | `Missing required fields: executeAt, payload, targetRecipient` | Request omitted one or more required scheduling fields. | Include `executeAt`, `payload`, and `targetRecipient`. |
| 400 | `executeAt is not a valid date` | `executeAt` could not be parsed into a valid JavaScript `Date`. | Send an ISO 8601 timestamp such as `2026-06-29T12:00:00.000Z`. |
| 500 | Runtime message from `NotificationAPI` or database | Date is expired, insert failed, or another scheduling error occurred. | Use a future `executeAt`; if it persists, inspect database logs. |
| 503 | `Scheduler not enabled` | Server was started without a configured `NotificationAPI`. | Configure and pass `notificationAPI` to the events server. |

Sample invalid date:

```json
{ "error": "executeAt is not a valid date" }
```

### `GET /api/schedule/:id`

| Status | Error | Cause | Resolution |
|--------|-------|-------|------------|
| 400 | `Invalid notification ID` | `:id` is not numeric. | Use an integer notification ID. |
| 404 | `Notification not found` | No scheduled notification exists with the requested ID. | Confirm the ID returned from `POST /api/schedule`. |
| 500 | Database error message | Lookup failed unexpectedly. | Check SQLite availability and listener logs. |
| 503 | `Scheduler not enabled` | Scheduler API is unavailable. | Enable scheduler configuration. |

### `GET /api/schedule/stats`

| Status | Error | Cause | Resolution |
|--------|-------|-------|------------|
| 500 | Database error message | Stats query failed. | Check database health and schema migrations. |
| 503 | `Scheduler not enabled` | Scheduler API is unavailable. | Enable scheduler configuration. |

## Batch Validation

### `POST /api/notifications/validate-batch`

| Status | Error Shape | Cause | Resolution |
|--------|-------------|-------|------------|
| 400 | `errors[].code = PARSE_ERROR` | Request body is not valid JSON. | Send a JSON array or `{ "notifications": [...] }`. |
| 400 | `INVALID_STRUCTURE` | Body is not an array and does not contain `notifications`. | Send an array of notification payloads. |
| 400 | `EMPTY_BATCH` | Batch has no items. | Include at least one notification. |
| 400 | `MISSING_FIELD`, `EMPTY_FIELD`, `INVALID_CHANNEL`, `DUPLICATE_RECIPIENT` | One or more notification items are invalid. | Correct the item fields shown in `errors[]`. |

Sample validation failure:

```json
{
"valid": false,
"processedCount": 0,
"errors": [
{
"index": 0,
"field": "recipient",
"code": "MISSING_FIELD",
"message": "Item at index [0]: Missing required field 'recipient'."
}
]
}
```

## Webhooks

### `POST /api/webhooks`

| Status | Error | Cause | Resolution |
|--------|-------|-------|------------|
| 400 | `Failed to read request body` | Body stream could not be collected. | Retry with a valid request body. |
| 401 | `Missing signature header` | `X-Signature` was not provided. | Include the HMAC-SHA256 signature header. |
| 401 | `Missing key-id header` | `X-Key-Id` was not provided. | Include the key ID matching a configured webhook secret. |
| 401 | `Unknown key-id` | Key ID is not configured. | Use a registered key ID or update server secrets. |
| 401 | `Request signature expired` | Timestamp is outside the allowed window. | Recreate and resend the webhook promptly. |
| 401 | `Invalid signature` | HMAC does not match the raw body. | Sign the exact raw request body with the correct secret. |

## Templates

| Endpoint | Status | Error | Cause | Resolution |
|----------|--------|-------|-------|------------|
| Template routes | 503 | `Template service not enabled` | Template service was not injected into the server. | Configure `templateService`. |
| `GET /api/templates/:id` | 404 | `Template not found` | Unknown template ID. | Verify the template ID. |
| `GET /api/templates/:id/audit` | 404 | `Template not found` | No template and no audit records exist for ID. | Verify ID or create template first. |
| `PUT /api/templates/:id` | 400 | `Invalid JSON` | Request body is malformed JSON. | Send valid JSON. |
| `PUT /api/templates/:id` | 400 | `Invalid body: ...` | Update payload failed validation. | Provide at least one valid template field. |
| `PUT /api/templates/:id` | 404 | Template repository not found message | Template ID does not exist. | Create the template or use the correct ID. |
| `POST /api/templates` | 400 | `Invalid body: id, name, type, and body are required` | Required create fields are missing. | Include all required fields. |
| `POST /api/templates` | 400 | Template validation message | Template failed repository validation. | Correct field types and values. |

## Preferences

### `PUT /api/preferences/:userId`

| Status | Error | Cause | Resolution |
|--------|-------|-------|------------|
| 400 | `Invalid body: expected { categories: { [key]: boolean } }` | `categories` is missing or not an object. | Send category flags as booleans. |
| 400 | `Invalid JSON` | Body is malformed JSON. | Send valid JSON. |

## History, Search, Archive, Analytics, and Metrics

| Endpoint Area | Status | Error | Cause | Resolution |
|---------------|--------|-------|-------|------------|
| `/api/notifications/history` | 500 | Database or service error message | History query failed. | Check database and execution log table. |
| `/api/search/suggestions` | 500 | Search service error message | Suggestion lookup failed. | Check search service logs and data source. |
| `/api/archive/run` | 503 | `Archive service not enabled` | Archive service unavailable. | Configure `archiveService`. |
| `/api/archive/:id` | 404 | `Archived record not found` | Unknown archive ID. | Verify the archive record ID. |
| `/api/archive*` | 500 | Archive error message | Archive query/run failed. | Inspect archive logs and backing store. |
| `/api/rate-limit/metrics` | 503 | `Rate limiting not enabled` | Rate limiter was not configured. | Configure rate limiting before querying metrics. |
| Metrics history routes | 503 | `Metrics history store unavailable` | Metrics store missing. | Configure `metricsStore`. |
| Analytics routes | 503 | `Analytics aggregator unavailable` | Aggregator disabled or unavailable. | Configure analytics aggregation. |

## Troubleshooting Checklist

1. Capture the HTTP status, response body, request ID, and correlation ID.
2. Confirm the endpoint is enabled in server options.
3. Validate request JSON, required fields, and date formats.
4. For scheduler errors, inspect `scheduled_notifications` and `notification_execution_log`.
5. For retry errors, check `retry_count`, `max_retries`, and `next_retry_at`.
6. For 500 errors, inspect listener logs and database migration state.
Loading
Loading