184 changes: 184 additions & 0 deletions docs/RUNBOOK-redis-latency.md
@@ -0,0 +1,184 @@
# Runbook: Redis Latency Alert — dark-redis-rust Fleet

> **Linked Gap:** [`docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md`](adr/gaps/GAP-001-fleet-wide-redis-latency.md)
> **Monitor:** `redis-latency`
> **Alert ID prototype:** `alert-factory-mvp-1`

---

## When to use this runbook

Use this runbook whenever the `redis-latency` Datadog monitor transitions to **Warning**
or **Alert** state for any environment in the `dark-redis-rust` fleet. Environments include
`mvp-test`, `gym-*`, and any newly provisioned `test-*` instances.

---

## Step 0 — Establish scope before acting

```
1. Open Datadog → Monitors → search "redis-latency".
2. Note which environments are currently in Alert vs Warning vs OK vs No Data.
3. Check the Datadog Events stream for the 30-minute window before the first trigger:
   - Filter: service:dark-redis-rust OR env:mvp-test OR env:gym-*
   - Look for: deploys, restarts, config changes, new monitor provisioning.
4. If multiple environments fired within 10 minutes of each other → suspect a shared
   infrastructure cause. Proceed to Step 1.
   If only one environment fired → proceed to Step 2 (isolated triage).
```

**Wave-pattern indicator:** ≥3 environments firing within a 10-minute window is strong
evidence of a shared cause (network, backing store, or fleet provisioning event).
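
The wave-pattern check can be sketched as a small filter over trigger times exported from Datadog. This is a minimal sketch under stated assumptions: the input format (one `<env> <epoch_seconds>` pair per line) and the function names `wave_size` and `is_wave` are hypothetical, and the window is anchored at the earliest trigger rather than sliding.

```bash
# Hypothetical helper: count distinct environments whose trigger falls within
# WINDOW seconds of the earliest trigger. stdin: "<env> <epoch_seconds>" per line.
wave_size() {
  local window="${1:-600}"   # 600 s = the 10-minute window named above
  sort -k2,2n | awk -v w="$window" '
    NR == 1 { first = $2 }                       # earliest trigger after sorting
    ($2 - first) <= w && !seen[$1]++ { n++ }     # count each environment once
    END { print n + 0 }
  '
}

# Succeeds when at least 3 distinct environments fired inside the window.
is_wave() {
  [ "$(wave_size "${1:-600}")" -ge 3 ]
}
```

For example, `printf 'gym-2450 1000\ngym-110e 1000\nmvp-test 1300\n' | is_wave && echo "suspect shared cause"` reports a wave, because all three triggers fall within 600 seconds of the first.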

---

## Step 1 — Check for shared infrastructure events

```bash
# Verify whether a network or backing-store event correlates with the alert window.
# (Commands below assume kubectl access to the relevant cluster.)

# List recent events in the namespace
kubectl get events -n dark-redis-rust --sort-by='.lastTimestamp' | tail -40

# Check node conditions for the relevant nodes
kubectl describe nodes | grep -A5 "Conditions:"

# Check whether a rollout happened around the trigger time
kubectl rollout history deployment/redis-rust -n dark-redis-rust

# Check whether an HPA scaled the workload around the trigger time
kubectl get hpa -n dark-redis-rust
```

If a scaling event or network disruption is confirmed → file an incident with the
infrastructure team and skip to Step 6 (monitor tuning) once the root cause is resolved.

---

## Step 2 — Inspect the Redis slowlog

Connect to the affected instance (substitute `mvp-test` / `gym-*` as appropriate):

```bash
# Port-forward to the Redis-rust instance
kubectl port-forward svc/redis-rust-mvp-test 6380:6379 -n dark-redis-rust

# In a separate terminal — inspect the slowlog
redis-cli -p 6380 SLOWLOG GET 25
```

**What to look for:**

| Command pattern | Likely cause | Action |
|----------------|-------------|--------|
| `KEYS *` | Full keyspace scan | Replace caller with `SCAN`; see HARNESS.md §shard-aggregated |
| `LRANGE key 0 -1` on large list | Large list traversal | Request bounded ranges (e.g. `LRANGE key 0 99`) and paginate |
| `HGETALL` on huge hash | Unbounded hash read | Read specific fields with `HGET`/`HMGET`, or iterate with `HSCAN` |
| `DEBUG SLEEP` | Test artifact | Remove from production path |
| Repeated `BGSAVE` / `BGREWRITEAOF` | Persistence pressure | See Step 3 |

Reset the slowlog after capturing:
```bash
redis-cli -p 6380 SLOWLOG RESET
```
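
Because `SLOWLOG RESET` discards the entries, it helps to save the capture to a file first. The sketch below only builds a timestamped filename; the helper name `slowlog_capture_name` and the naming scheme are assumptions, not an existing project convention.

```bash
# Hypothetical helper: build a UTC-timestamped filename for a slowlog capture.
slowlog_capture_name() {
  local env="$1"
  echo "slowlog-${env}-$(date -u +%Y%m%dT%H%M%SZ).txt"
}

# Example wiring (assumes the port-forward from this step is still running):
# redis-cli -p 6380 SLOWLOG GET 25 > "$(slowlog_capture_name mvp-test)" \
#   && redis-cli -p 6380 SLOWLOG RESET
```

Chaining the reset with `&&` means the slowlog is cleared only if the capture command succeeded.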

---

## Step 3 — Check Redis INFO stats

```bash
redis-cli -p 6380 INFO all | grep -E \
'connected_clients|blocked_clients|used_memory_human|used_memory_peak_human|'\
'rdb_last_bgsave_status|aof_last_rewrite_status|rdb_last_bgsave_time_sec|'\
'total_commands_processed|rejected_connections|evicted_keys|keyspace_hits|keyspace_misses'
```

**Thresholds to flag:**

| Stat | Concern threshold |
|------|------------------|
| `connected_clients` | > 80% of `maxclients` config value |
| `blocked_clients` | > 0 (clients are waiting on blocking calls such as `BLPOP`/`BRPOP`; investigate if unexpected) |
| `rdb_last_bgsave_status` | `err` |
| `rdb_last_bgsave_time_sec` | > 30s (slow RDB save; fork latency itself is reported by `latest_fork_usec`) |
| `evicted_keys` | Increasing over time (memory pressure) |
| `rejected_connections` | > 0 |
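
The table can be applied mechanically by piping the `INFO` output through a filter. This is a minimal sketch, not a project tool: the function name `flag_info_stats` is hypothetical, the `maxclients` value must be supplied by the caller (10000 is the upstream Redis default and may not match this fleet), and `evicted_keys` is left out because judging "increasing over time" needs two samples.

```bash
# Hypothetical helper: read `INFO all` output on stdin and print one line per
# stat that crosses a threshold from the table above.
# $1 = maxclients (assumed; defaults to the upstream Redis default of 10000).
flag_info_stats() {
  local maxclients="${1:-10000}"
  # redis-cli INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: -v max="$maxclients" '
    $1 == "connected_clients"        && $2 > 0.8 * max { print "FLAG connected_clients=" $2 }
    $1 == "blocked_clients"          && $2 > 0         { print "FLAG blocked_clients=" $2 }
    $1 == "rdb_last_bgsave_status"   && $2 == "err"    { print "FLAG rdb_last_bgsave_status=err" }
    $1 == "rdb_last_bgsave_time_sec" && $2 > 30        { print "FLAG rdb_last_bgsave_time_sec=" $2 }
    $1 == "rejected_connections"     && $2 > 0         { print "FLAG rejected_connections=" $2 }
  '
}

# Example wiring (assumes the port-forward from Step 2):
# redis-cli -p 6380 INFO all | flag_info_stats 10000
```

An empty result means none of the listed thresholds were crossed at the moment of the snapshot.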

---

## Step 4 — Cross-correlate gym-* instances

If the wave pattern was confirmed in Step 0, check whether `gym-*` instances have recovered:

```bash
# Check all gym-* instances in one pass
for env in gym-2450 gym-110e gym-run1-2ea1 gym-bbd4 gym-run5-ww7y; do
  echo "=== $env ==="
  redis-cli -h "redis-rust-${env}.dark-redis-rust.svc" INFO server | grep uptime_in_seconds
  redis-cli -h "redis-rust-${env}.dark-redis-rust.svc" INFO stats | grep total_commands_processed
done
```

If `gym-*` instances recovered on their own:
- Check their uptime — a restart is the most common auto-recovery.
- Check whether a network event resolved (correlate with cloud provider status page).
- Document recovery time in the incident ticket.

If `gym-*` instances are still in Alert:
- Escalate to the infrastructure team — this confirms a fleet-wide shared cause.
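
The uptime check can be made explicit by comparing `uptime_in_seconds` with the time elapsed since the alert fired. The sketch below assumes the caller knows the elapsed time in seconds; the function name `restarted_since_alert` is hypothetical.

```bash
# Hypothetical helper: succeed when the instance restarted after the alert fired.
# stdin: output of `redis-cli INFO server`; $1: seconds elapsed since the alert.
restarted_since_alert() {
  local elapsed="$1" uptime
  # Strip the CRLF line endings that redis-cli emits before parsing.
  uptime=$(tr -d '\r' | awk -F: '$1 == "uptime_in_seconds" { print $2 }')
  [ -n "$uptime" ] && [ "$uptime" -lt "$elapsed" ]
}

# Example wiring (assumes the alert fired 1800 s ago):
# redis-cli -h redis-rust-gym-2450.dark-redis-rust.svc INFO server \
#   | restarted_since_alert 1800 && echo "gym-2450 restarted after the alert"
```

An uptime shorter than the elapsed time means the process started after the alert, which points to a restart as the recovery mechanism.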

---

## Step 5 — Verify pod health

```bash
kubectl get pods -n dark-redis-rust -l env=mvp-test
kubectl describe pod <pod-name> -n dark-redis-rust | grep -A10 "Events:"
kubectl top pod <pod-name> -n dark-redis-rust
```

Restart a pod only if:
- The pod is in `CrashLoopBackOff` or `OOMKilled` state.
- Memory stats from Step 3 show the instance is paging or evicting keys at high rate.

```bash
# Rolling restart (zero-downtime only if more than one replica is ready)
kubectl rollout restart deployment/redis-rust-mvp-test -n dark-redis-rust
```
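
The pod-state criterion can be checked before running the restart. This is a sketch under stated assumptions: the function name `should_restart` is hypothetical, the example wiring assumes a single-container pod, and the memory criterion from Step 3 still needs a manual judgment.

```bash
# Hypothetical helper: succeed only when the container status matches the
# pod-state restart criteria above.
# $1 = current waiting reason, $2 = last termination reason (either may be empty).
should_restart() {
  case "$1" in CrashLoopBackOff) return 0 ;; esac
  case "$2" in OOMKilled) return 0 ;; esac
  return 1
}

# Example wiring (assumes a single-container pod):
# waiting=$(kubectl get pod <pod-name> -n dark-redis-rust \
#   -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
# last=$(kubectl get pod <pod-name> -n dark-redis-rust \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
# should_restart "$waiting" "$last" && echo "restart criteria met"
```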

---

## Step 6 — Tune the redis-latency monitor (post-incident)

The `redis-latency` monitor currently lacks environment tagging, which slows correlation.
After the immediate incident is resolved, update the monitor configuration:

1. Open the monitor in Datadog → Edit.
2. Add `env` as a **group-by** dimension so alerts fire per environment, not fleet-wide.
3. Add the following tags to alert notifications:
```
env:{{env.name}}
service:dark-redis-rust
alert_id:{{alert.id}}
```
4. Set a **No Data** notification threshold of 10 minutes so telemetry gaps surface
as their own alert rather than silently masking latency state.
5. Save and verify the monitor fires correctly in a staging environment.

---

## Escalation path

| Condition | Escalate to |
|-----------|------------|
| All `gym-*` still in Alert after 15 min | Infrastructure / SRE on-call |
| `rejected_connections > 0` | Database reliability team |
| `rdb_last_bgsave_status: err` | Storage / persistence team |
| Wave pattern repeats within 24h | Architecture review (see GAP-001) |

---

## References

- [GAP-001: Fleet-Wide Redis Latency Wave Pattern](adr/gaps/GAP-001-fleet-wide-redis-latency.md)
- [HARNESS.md — shard-aggregated commands](HARNESS.md)
- [perf_config.toml](../perf_config.toml) — `num_shards`, `max_size` config
- Redis documentation: [latency troubleshooting](https://redis.io/docs/manual/latency/), [INFO](https://redis.io/commands/info/)
113 changes: 113 additions & 0 deletions docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md
@@ -0,0 +1,113 @@
# GAP-001: Fleet-Wide Redis Latency Wave Pattern

**Status:** Investigating
**Severity:** High
**Discovered:** 2026-04-20
**Alert ID:** alert-factory-mvp-1
**Monitor Ref:** redis-latency
**Linked Issue:** is-019da93c-ccd6-7820-be36-c0018faf9c1c

---

## Summary

On 2026-04-20 at 04:45–04:50Z, a rolling wave of Redis latency alerts fired across
the dark-redis-rust fleet. Five `gym-*` environments triggered the `redis-latency`
monitor at 04:45Z, followed five minutes later by the `mvp-test` environment.
The coordinated propagation pattern points to a **shared infrastructure cause** rather
than an isolated instance failure, and no code deploy was recorded as a proximate trigger.

Telemetry gaps were also identified: the `redis-latency` monitor did not resolve to a
named Datadog entity directly, and zero correlated events (deploys, restarts, config
changes) were recorded for `dark-redis-rust` or `mvp-test` in the three-hour window
around the trigger time.

---

## Evidence

| Timestamp (Z) | Event |
|---------------|-------|
| 04:45:00 | `redis-latency` monitor fires on `gym-2450`, `gym-110e`, `gym-run1-2ea1`, `gym-bbd4`, `gym-run5-ww7y` |
| 04:49:00 | New `test-url-shortener-redis` monitors provisioned (fleet churn, possible connection pressure) |
| 04:50:00 | `redis-latency` monitor fires on `mvp-test` |
| — | Monitor state shows **No Data** for `redis-latency` during post-hoc query; telemetry gap confirmed |

- **Zero correlated events** found in the 3-hour window around trigger time.
- New monitor provisioning at 04:49Z coincides with the alert window.
- Rolling 5-minute propagation strongly implies a **shared backing store or network path**.

---

## Impact

| Dimension | Assessment |
|-----------|-----------|
| Correctness | No data loss expected; latency SLO breached |
| Availability | Degraded response times for all clients hitting affected instances |
| Scope | `mvp-test` confirmed; `gym-*` cluster (5 instances) likely; fleet-wide cannot be ruled out |
| Observability | `redis-latency` monitor not environment-tagged; future correlation is slow |

---

## Hypotheses (Descending Probability)

1. **Network congestion / increased RTT** to the Redis backing store — rolling-wave pattern
fits a network event propagating through the fleet.
2. **Hot key / slow command** (e.g., `KEYS`, `LRANGE` on a large list) causing head-of-line
blocking on every instance that runs the same workload.
3. **Connection pool exhaustion** triggered by a traffic spike or fleet provisioning churn
(the new `test-url-shortener-redis` monitors at 04:49Z are a candidate trigger).
4. **Redis memory pressure** (eviction storms, `BGSAVE`/`BGREWRITEAOF` fork latency)
spreading across instances with similar memory profiles.

---

## Required Investigative Actions

> See [`docs/RUNBOOK-redis-latency.md`](../RUNBOOK-redis-latency.md) for the step-by-step
> triage procedure.

1. **Inspect Redis slowlog** on `mvp-test` — identify commands exceeding latency SLO.
2. **Check `INFO` stats** on `mvp-test` at time of trigger:
`connected_clients`, `used_memory`, `rdb_last_bgsave_status`, `blocked_clients`.
3. **Verify pod health** on `mvp-test` via `k8s_provisioner` once write access is available.
4. **Cross-correlate `gym-*` instances** — confirm recovery status; if recovered, determine
what changed (auto-restart, memory freed, network event resolved).
5. **Tune the `redis-latency` monitor** to emit environment-tagged events for faster
future correlation.

---

## Potential Solutions

### Short-term (operational)
- Add `env` tag to all `redis-latency` monitor alerts so Datadog events are automatically
correlated by environment.
- Add a `slowlog-log-slower-than` config to the project's `perf_config.toml` / Docker config
to ensure slow commands are captured in the slowlog at an appropriate threshold.
- Document the triage runbook (see companion `docs/RUNBOOK-redis-latency.md`).
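
As a runtime counterpart to the `slowlog-log-slower-than` item, the threshold can be derived from a latency budget. This is a minimal sketch: it assumes `dark-redis-rust` accepts `CONFIG SET` the way upstream Redis does, the 5 ms budget in the example is illustrative rather than the project's SLO, and the function name `slowlog_threshold_args` is hypothetical.

```bash
# Hypothetical helper: print the redis-cli arguments that set the slowlog
# threshold from a latency budget given in milliseconds.
# Upstream Redis expresses slowlog-log-slower-than in microseconds.
slowlog_threshold_args() {
  local budget_ms="$1"
  echo "CONFIG SET slowlog-log-slower-than $(( budget_ms * 1000 ))"
}

# Example wiring (assumes a 5 ms budget and a local port-forward on 6380):
# redis-cli -p 6380 $(slowlog_threshold_args 5)
# redis-cli -p 6380 CONFIG GET slowlog-log-slower-than
```

A change made with `CONFIG SET` does not survive a restart, so the value still belongs in the persistent configuration named in the list above.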

### Medium-term (architectural)
- Introduce a connection-pool configuration section to `perf_config.toml` with explicit
`max_connections`, `min_idle`, and `connection_timeout` knobs, so pool exhaustion
is a tunable rather than a surprise.
- Add a health-check endpoint that exposes `INFO` snapshot data, enabling automated
pre-alert diagnosis.

### Long-term
- Consider per-environment Datadog monitors (or monitor scoping) instead of a single
fleet-wide `redis-latency` monitor to reduce correlation ambiguity.
- Evaluate whether the fleet's Redis backing stores share any network path or resource
that could be isolated to prevent wave propagation.

---

## Related

- Issue: `is-019da93c-ccd6-7820-be36-c0018faf9c1c`
- Alert: `alert-factory-mvp-1`
- Runbook: [`docs/RUNBOOK-redis-latency.md`](../RUNBOOK-redis-latency.md)
- Config: [`perf_config.toml`](../../perf_config.toml)
- Docker config: [`docker-benchmark/perf_config.toml`](../../docker-benchmark/perf_config.toml)
- ADR-009 (Security / TLS / ACL): [`docs/adr/009-security-tls-acl.md`](009-security-tls-acl.md)
2 changes: 1 addition & 1 deletion docs/adr/gaps/README.md
@@ -58,7 +58,7 @@ Discovered → Open → Investigating → ADR-Drafted → Closed

| Gap | Title | Status | Severity |
|-----|-------|--------|----------|
| - | (None yet) | - | - |
| [GAP-001](GAP-001-fleet-wide-redis-latency.md) | Fleet-Wide Redis Latency Wave Pattern | Investigating | High |

## Gap vs Deviation vs ADR
