nerdsane · Apr 20, 2026
diff --git a/docs/RUNBOOK-redis-latency.md b/docs/RUNBOOK-redis-latency.md
@@ -0,0 +1,184 @@
+# Runbook: Redis Latency Alert — dark-redis-rust Fleet
+
+> **Linked Gap:** [`docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md`](adr/gaps/GAP-001-fleet-wide-redis-latency.md)
+> **Monitor:** `redis-latency`
+> **Alert ID prototype:** `alert-factory-mvp-1`
+
+---
+
+## When to use this runbook
+
+Use this runbook whenever the `redis-latency` Datadog monitor transitions to **Warning**
+or **Alert** state for any environment in the `dark-redis-rust` fleet. Environments include
+`mvp-test`, `gym-*`, and any newly provisioned `test-*` instances.
+
+---
+
+## Step 0 — Establish scope before acting
+
+```
+1. Open Datadog → Monitors → search "redis-latency"
+2. Note which environments are currently in Alert vs Warning vs OK vs No Data.
+3. Check the Datadog Events stream for the 30-minute window before the first trigger:
+   - Filter: service:dark-redis-rust OR env:mvp-test OR env:gym-*
+   - Look for: deploys, restarts, config changes, new monitor provisioning.
+4. If multiple environments fired within 10 minutes of each other → suspect shared
+   infrastructure cause. Proceed to Step 1.
+   If only one environment fired → proceed to Step 2 (isolated triage).
+```
+
+**Wave-pattern indicator:** ≥3 environments firing within a 10-minute window is strong
+evidence of a shared cause (network, backing store, or fleet provisioning event).
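+
+The wave-pattern rule can be checked mechanically. The sketch below is illustrative only: it
+assumes the trigger times have been copied by hand into `HH:MM:SS env-name` lines (same UTC
+day), and the `wave_check` helper name is hypothetical, not part of any existing tooling.
+
+```bash
+# Reads "HH:MM:SS env-name" lines on stdin (hypothetical hand-made export).
+# Prints "wave" if >= 3 distinct environments fired within any 10-minute window,
+# otherwise "isolated".
+wave_check() {
+  sort | awk -v window=600 -v min_envs=3 '
+    {
+      split($1, t, ":")
+      ts[NR] = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
+      env[NR] = $2
+    }
+    END {
+      verdict = "isolated"
+      for (i = 1; i <= NR; i++) {
+        n = 0
+        split("", seen)                          # portable way to clear an array
+        for (j = i; j <= NR && ts[j] - ts[i] <= window; j++)
+          if (!(env[j] in seen)) { seen[env[j]] = 1; n++ }
+        if (n >= min_envs) verdict = "wave"
+      }
+      print verdict
+    }'
+}
+```
+
+Fed the 2026-04-20 timeline from GAP-001 (five `gym-*` triggers at 04:45:00 and `mvp-test` at
+04:50:00), it prints `wave`.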
+
+---
+
+## Step 1 — Check for shared infrastructure events
+
+```bash
+# Verify whether a network or backing-store event correlates with the alert window.
+# (Commands below assume kubectl access to the relevant cluster.)
+
+# List recent events in the namespace
+kubectl get events -n dark-redis-rust --sort-by='.lastTimestamp' | tail -40
+
+# Check node conditions for the relevant nodes
+kubectl describe nodes | grep -A5 "Conditions:"
+
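+# Check whether a rollout (new revision) happened around the trigger time.
+# Note: rollout history lists revisions only; scaling shows up in the events listed above.
+kubectl rollout history deployment/redis-rust -n dark-redis-rust
+# List any HPAs with their current and desired replica counts
+kubectl get hpa -n dark-redis-rust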
+```
+
+If a scaling event or network disruption is confirmed → file an incident with the
+infrastructure team and skip to Step 6 (monitor tuning) once the root cause is resolved.
+
+---
+
+## Step 2 — Inspect the Redis slowlog
+
+Connect to the affected instance (substitute `mvp-test` / `gym-*` as appropriate):
+
+```bash
+# Port-forward to the Redis-rust instance
+kubectl port-forward svc/redis-rust-mvp-test 6380:6379 -n dark-redis-rust
+
+# In a separate terminal — inspect the slowlog
+redis-cli -p 6380 SLOWLOG GET 25
+```
+
+**What to look for:**
+
+| Command pattern | Likely cause | Action |
+|----------------|-------------|--------|
+| `KEYS *` | Full keyspace scan | Replace caller with `SCAN`; see HARNESS.md §shard-aggregated |
+| `LRANGE key 0 -1` on large list | Large list traversal | Paginate with bounded `start`/`stop` ranges (`LRANGE` has no `LIMIT` option) |
+| `HGETALL` on huge hash | Unbounded hash read | Read specific fields with `HGET`/`HMGET`, or iterate with `HSCAN` |
+| `DEBUG SLEEP` | Test artifact | Remove from production path |
+| Repeated `BGSAVE` / `BGREWRITEAOF` | Persistence pressure | See Step 3 |
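+
+With more than a handful of entries, tallying the commands from the table above is faster than
+reading the slowlog by eye. A rough sketch, assuming `redis-cli` is writing to a pipe (so each
+reply element lands on its own line); `slowlog_tally` is a hypothetical helper, and a key or
+argument that happens to equal a command name would be miscounted.
+
+```bash
+# Usage: redis-cli -p 6380 SLOWLOG GET 25 | slowlog_tally
+# Prints "<count> <COMMAND>" for each watched command, most frequent first.
+slowlog_tally() {
+  tr -d '\r' | awk '
+    BEGIN {
+      n = split("KEYS LRANGE HGETALL DEBUG BGSAVE BGREWRITEAOF", names, " ")
+      for (i = 1; i <= n; i++) watch[names[i]] = 1
+    }
+    { cmd = toupper($0) }                 # command names are case-insensitive
+    (cmd in watch) { count[cmd]++ }
+    END { for (c in count) print count[c], c }
+  ' | sort -k1,1nr -k2,2
+}
+```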
+
+Reset the slowlog after capturing:
+```bash
+redis-cli -p 6380 SLOWLOG RESET
+```
+
+---
+
+## Step 3 — Check Redis INFO stats
+
+```bash
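+# One -e per pattern group: quoted strings separated by whitespace are NOT concatenated,
+# so the continuation lines would otherwise be parsed as file names.
+redis-cli -p 6380 INFO all | grep -E \
+  -e 'connected_clients|blocked_clients|used_memory_human|used_memory_peak_human' \
+  -e 'rdb_last_bgsave_status|aof_last_bgrewrite_status|rdb_last_bgsave_time_sec' \
+  -e 'total_commands_processed|rejected_connections|evicted_keys|keyspace_hits|keyspace_misses'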
+```
+
+**Thresholds to flag:**
+
+| Stat | Concern threshold |
+|------|------------------|
+| `connected_clients` | > 80% of `maxclients` config value |
+| `blocked_clients` | > 0 when no consumers are expected to block (clients waiting in `BLPOP`/`BRPOP`-style calls) |
+| `rdb_last_bgsave_status` | `err` |
+| `rdb_last_bgsave_time_sec` | > 30s (slow snapshot; see `latest_fork_usec` for fork latency itself) |
+| `evicted_keys` | Increasing over time (memory pressure) |
+| `rejected_connections` | > 0 |
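+
+The threshold table can be applied to a captured `INFO` dump in one pass. A minimal sketch,
+assuming stock Redis `INFO` field names and `field:value` lines; `flag_info` is a hypothetical
+helper, and `evicted_keys` is left out because "increasing over time" needs two samples.
+
+```bash
+# Usage (assumes the instance answers CONFIG GET, as stock Redis does):
+#   maxclients=$(redis-cli -p 6380 CONFIG GET maxclients | tail -n 1)
+#   redis-cli -p 6380 INFO all | flag_info "$maxclients"
+flag_info() {
+  # INFO replies use CRLF line endings; strip the CR before comparing values.
+  tr -d '\r' | awk -F: -v maxclients="$1" '
+    $1 == "connected_clients" && maxclients > 0 && $2 > 0.8 * maxclients {
+      print "FLAG connected_clients=" $2
+    }
+    $1 == "blocked_clients" && $2 > 0             { print "FLAG blocked_clients=" $2 }
+    $1 == "rdb_last_bgsave_status" && $2 == "err" { print "FLAG rdb_last_bgsave_status=err" }
+    $1 == "rdb_last_bgsave_time_sec" && $2 > 30   { print "FLAG rdb_last_bgsave_time_sec=" $2 }
+    $1 == "rejected_connections" && $2 > 0        { print "FLAG rejected_connections=" $2 }
+  '
+}
+```
+
+No output means none of the single-sample thresholds were crossed.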
+
+---
+
+## Step 4 — Cross-correlate gym-* instances
+
+If the wave pattern was confirmed in Step 0, check whether `gym-*` instances have recovered:
+
+```bash
+# Check all gym-* instances in one pass
+for env in gym-2450 gym-110e gym-run1-2ea1 gym-bbd4 gym-run5-ww7y; do
+  echo "=== $env ==="
+  redis-cli -h redis-rust-${env}.dark-redis-rust.svc INFO server | grep uptime_in_seconds
+  redis-cli -h redis-rust-${env}.dark-redis-rust.svc INFO stats | grep total_commands_processed
+done
+```
+
+If `gym-*` instances recovered on their own:
+- Check their uptime — a restart is the most common auto-recovery.
+- Check whether a network event resolved (correlate with cloud provider status page).
+- Document recovery time in the incident ticket.
+
+If `gym-*` instances are still in Alert:
+- Escalate to the infrastructure team — this confirms a fleet-wide shared cause.
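+
+The uptime check reduces to one comparison: an instance restarted after the alert if its
+uptime is shorter than the time elapsed since the trigger. A minimal sketch; `restarted_since`
+is a hypothetical helper, and the caller supplies the trigger time as a Unix epoch.
+
+```bash
+# restarted_since UPTIME_SECONDS ALERT_EPOCH [NOW_EPOCH]
+# Succeeds (exit 0) when the instance came up after the alert fired.
+restarted_since() {
+  local uptime="$1" alert_epoch="$2" now="${3:-$(date +%s)}"
+  [ "$uptime" -lt "$(( now - alert_epoch ))" ]
+}
+
+# Example wiring (host name follows the loop above; alert_epoch is set by the operator):
+#   up=$(redis-cli -h redis-rust-gym-2450.dark-redis-rust.svc INFO server \
+#          | tr -d '\r' | awk -F: '$1 == "uptime_in_seconds" { print $2 }')
+#   restarted_since "$up" "$alert_epoch" && echo "gym-2450 restarted after the alert"
+```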
+
+---
+
+## Step 5 — Verify pod health
+
+```bash
+kubectl get pods -n dark-redis-rust -l env=mvp-test
+kubectl describe pod <pod-name> -n dark-redis-rust | grep -A10 "Events:"
+kubectl top pod <pod-name> -n dark-redis-rust
+```
+
+Restart a pod only if:
+- The pod is in `CrashLoopBackOff` or `OOMKilled` state.
+- Memory stats from Step 3 show the instance is paging or evicting keys at a high rate.
+
+```bash
+# Rolling restart (zero-downtime only if the Deployment runs more than one replica)
+kubectl rollout restart deployment/redis-rust-mvp-test -n dark-redis-rust
+```
+
+---
+
+## Step 6 — Tune the redis-latency monitor (post-incident)
+
+The `redis-latency` monitor currently lacks environment tagging, which slows correlation.
+After the immediate incident is resolved, update the monitor configuration:
+
+1. Open the monitor in Datadog → Edit.
+2. Add `env` as a **group-by** dimension so alerts fire per environment, not fleet-wide.
+3. Add the following tags to alert notifications:
+   ```
+   env:{{env.name}}
+   service:dark-redis-rust
+   alert_id:{{alert.id}}
+   ```
+4. Set a **No Data** notification threshold of 10 minutes so telemetry gaps surface
+   as their own alert rather than silently masking latency state.
+5. Save and verify the monitor fires correctly in a staging environment.
+
+---
+
+## Escalation path
+
+| Condition | Escalate to |
+|-----------|------------|
+| All `gym-*` still in Alert after 15 min | Infrastructure / SRE on-call |
+| `rejected_connections > 0` | Database reliability team |
+| `rdb_last_bgsave_status: err` | Storage / persistence team |
+| Wave pattern repeats within 24h | Architecture review (see GAP-001) |
+
+---
+
+## References
+
+- [GAP-001: Fleet-Wide Redis Latency Wave Pattern](adr/gaps/GAP-001-fleet-wide-redis-latency.md)
+- [HARNESS.md — shard-aggregated commands](HARNESS.md)
+- [perf_config.toml](../perf_config.toml) — `num_shards`, `max_size` config
+- Redis documentation: [SLOWLOG](https://redis.io/docs/manual/latency/), [INFO](https://redis.io/commands/info/)
diff --git a/docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md b/docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md
@@ -0,0 +1,113 @@
+# GAP-001: Fleet-Wide Redis Latency Wave Pattern
+
+**Status:** Investigating
+**Severity:** High
+**Discovered:** 2026-04-20
+**Alert ID:** alert-factory-mvp-1
+**Monitor Ref:** redis-latency
+**Linked Issue:** is-019da93c-ccd6-7820-be36-c0018faf9c1c
+
+---
+
+## Summary
+
+On 2026-04-20 at 04:45–04:50Z, a rolling wave of Redis latency alerts fired across
+the dark-redis-rust fleet. Five `gym-*` environments triggered the `redis-latency`
+monitor at 04:45Z, followed five minutes later by the `mvp-test` environment.
+The coordinated propagation pattern points to a **shared infrastructure cause** rather
+than an isolated instance failure, and no code deploy was recorded as a proximate trigger.
+
+Telemetry gaps were also identified: the `redis-latency` monitor did not resolve to a
+named Datadog entity directly, and zero correlated events (deploys, restarts, config
+changes) were recorded for `dark-redis-rust` or `mvp-test` in the three-hour window
+around the trigger time.
+
+---
+
+## Evidence
+
+| Timestamp (Z) | Event |
+|---------------|-------|
+| 04:45:00 | `redis-latency` monitor fires on `gym-2450`, `gym-110e`, `gym-run1-2ea1`, `gym-bbd4`, `gym-run5-ww7y` |
+| 04:49:00 | New `test-url-shortener-redis` monitors provisioned (fleet churn, possible connection pressure) |
+| 04:50:00 | `redis-latency` monitor fires on `mvp-test` |
+| — | Monitor state shows **No Data** for `redis-latency` during post-hoc query; telemetry gap confirmed |
+
+- **Zero correlated events** found in the 3-hour window around trigger time.
+- New monitor provisioning at 04:49Z coincides with the alert window.
+- Rolling 5-minute propagation strongly implies a **shared backing store or network path**.
+
+---
+
+## Impact
+
+| Dimension | Assessment |
+|-----------|-----------|
+| Correctness | No data loss expected; latency SLO breached |
+| Availability | Degraded response times for all clients hitting affected instances |
+| Scope | `mvp-test` and five `gym-*` instances fired the monitor; fleet-wide impact cannot be ruled out |
+| Observability | `redis-latency` monitor not environment-tagged; future correlation is slow |
+
+---
+
+## Hypotheses (Descending Probability)
+
+1. **Network congestion / increased RTT** to the Redis backing store — rolling-wave pattern
+   fits a network event propagating through the fleet.
+2. **Hot key / slow command** (e.g., `KEYS`, `LRANGE` on a large list) causing head-of-line
+   blocking on every instance that serves the same workload; matching slowlog entries across
+   instances would support this.
+3. **Connection pool exhaustion** triggered by a traffic spike or fleet provisioning churn
+   (the new `test-url-shortener-redis` monitors at 04:49Z are a candidate trigger).
+4. **Redis memory pressure** (eviction storms, `BGSAVE`/`BGREWRITEAOF` fork latency)
+   spreading across instances with similar memory profiles.
+
+---
+
+## Required Investigative Actions
+
+> See [`docs/RUNBOOK-redis-latency.md`](../../RUNBOOK-redis-latency.md) for the step-by-step
+> triage procedure.
+
+1. **Inspect Redis slowlog** on `mvp-test` — identify commands exceeding latency SLO.
+2. **Check `INFO` stats** on `mvp-test` at time of trigger:
+   `connected_clients`, `used_memory`, `rdb_last_bgsave_status`, `blocked_clients`.
+3. **Verify pod health** on `mvp-test` via `k8s_provisioner` once write access is available.
+4. **Cross-correlate `gym-*` instances** — confirm recovery status; if recovered, determine
+   what changed (auto-restart, memory freed, network event resolved).
+5. **Tune the `redis-latency` monitor** to emit environment-tagged events for faster
+   future correlation.
+
+---
+
+## Potential Solutions
+
+### Short-term (operational)
+- Add `env` tag to all `redis-latency` monitor alerts so Datadog events are automatically
+  correlated by environment.
+- Add a `slowlog-log-slower-than` config to the project's `perf_config.toml` / Docker config
+  to ensure slow commands are captured in the slowlog at an appropriate threshold.
+- Document the triage runbook (see companion `docs/RUNBOOK-redis-latency.md`).
+
+### Medium-term (architectural)
+- Introduce a connection-pool configuration section to `perf_config.toml` with explicit
+  `max_connections`, `min_idle`, and `connection_timeout` knobs, so pool exhaustion
+  is a tunable rather than a surprise.
+- Add a health-check endpoint that exposes `INFO` snapshot data, enabling automated
+  pre-alert diagnosis.
+
+### Long-term
+- Consider per-environment Datadog monitors (or monitor scoping) instead of a single
+  fleet-wide `redis-latency` monitor to reduce correlation ambiguity.
+- Evaluate whether the fleet's Redis backing stores share any network path or resource
+  that could be isolated to prevent wave propagation.
+
+---
+
+## Related
+
+- Issue: `is-019da93c-ccd6-7820-be36-c0018faf9c1c`
+- Alert: `alert-factory-mvp-1`
+- Runbook: [`docs/RUNBOOK-redis-latency.md`](../../RUNBOOK-redis-latency.md)
+- Config: [`perf_config.toml`](../../../perf_config.toml)
+- Docker config: [`docker-benchmark/perf_config.toml`](../../../docker-benchmark/perf_config.toml)
+- ADR-009 (Security / TLS / ACL): [`docs/adr/009-security-tls-acl.md`](../009-security-tls-acl.md)
diff --git a/docs/adr/gaps/README.md b/docs/adr/gaps/README.md
@@ -58,7 +58,7 @@ Discovered → Open → Investigating → ADR-Drafted → Closed
 
 | Gap | Title | Status | Severity |
 |-----|-------|--------|----------|
-| - | (None yet) | - | - |
+| [GAP-001](GAP-001-fleet-wide-redis-latency.md) | Fleet-Wide Redis Latency Wave Pattern | Investigating | High |
 
 ## Gap vs Deviation vs ADR