From abb5447a439a2d4e678a000e57614d1259900f27 Mon Sep 17 00:00:00 2001 From: sesh nalla <39490039+nerdsane@users.noreply.github.com> Date: Mon, 20 Apr 2026 00:55:49 -0400 Subject: [PATCH 1/3] docs(gap): GAP-001 fleet-wide Redis latency wave pattern (is-019da93c) --- .../gaps/GAP-001-fleet-wide-redis-latency.md | 113 ++++++++++++++++++ 1 file changed, 113 insertions(+) create mode 100644 docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md diff --git a/docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md b/docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md new file mode 100644 index 0000000..42a8a2f --- /dev/null +++ b/docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md @@ -0,0 +1,113 @@ +# GAP-001: Fleet-Wide Redis Latency Wave Pattern + +**Status:** Investigating +**Severity:** High +**Discovered:** 2026-04-20 +**Alert ID:** alert-factory-mvp-1 +**Monitor Ref:** redis-latency +**Linked Issue:** is-019da93c-ccd6-7820-be36-c0018faf9c1c + +--- + +## Summary + +On 2026-04-20 at 04:45–04:50Z, a rolling wave of Redis latency alerts fired across +the dark-redis-rust fleet. Five `gym-*` environments triggered the `redis-latency` +monitor at 04:45Z, followed five minutes later by the `mvp-test` environment. +The coordinated propagation pattern points to a **shared infrastructure cause** rather +than an isolated instance failure, and no code deploy was recorded as a proximate trigger. + +Telemetry gaps were also identified: the `redis-latency` monitor did not resolve to a +named Datadog entity directly, and zero correlated events (deploys, restarts, config +changes) were recorded for `dark-redis-rust` or `mvp-test` in the three-hour window +around the trigger time. 
+ +--- + +## Evidence + +| Timestamp (Z) | Event | +|---------------|-------| +| 04:45:00 | `redis-latency` monitor fires on `gym-2450`, `gym-110e`, `gym-run1-2ea1`, `gym-bbd4`, `gym-run5-ww7y` | +| 04:49:00 | New `test-url-shortener-redis` monitors provisioned (fleet churn, possible connection pressure) | +| 04:50:00 | `redis-latency` monitor fires on `mvp-test` | +| — | Monitor state shows **No Data** for `redis-latency` during post-hoc query; telemetry gap confirmed | + +- **Zero correlated events** found in the 3-hour window around the trigger time. +- New monitor provisioning at 04:49Z coincides with the alert window. +- Rolling 5-minute propagation strongly implies a **shared backing store or network path**. + +--- + +## Impact + +| Dimension | Assessment | +|-----------|-----------| +| Correctness | No data loss expected; latency SLO breached | +| Availability | Degraded response times for all clients hitting affected instances | +| Scope | `mvp-test` confirmed; `gym-*` cluster (5 instances) likely; fleet-wide cannot be ruled out | +| Observability | `redis-latency` monitor not environment-tagged; future correlation is slow | + +--- + +## Hypotheses (Descending Probability) + +1. **Network congestion / increased RTT** to the Redis backing store — rolling-wave pattern + fits a network event propagating through the fleet. +2. **Hot key / slow command** (e.g., `KEYS`, `LRANGE` on a large list) causing head-of-line + blocking on every instance that receives the same slow-command workload. +3. **Connection pool exhaustion** triggered by a traffic spike or fleet provisioning churn + (the new `test-url-shortener-redis` monitors at 04:49Z are a candidate trigger). +4. **Redis memory pressure** (eviction storms, `BGSAVE`/`BGREWRITEAOF` fork latency) + spreading across instances with similar memory profiles. + +--- + +## Required Investigative Actions + +> See [`docs/RUNBOOK-redis-latency.md`](../../RUNBOOK-redis-latency.md) for the step-by-step +> triage procedure. + +1. 
**Inspect Redis slowlog** on `mvp-test` — identify commands exceeding latency SLO. +2. **Check `INFO` stats** on `mvp-test` at time of trigger: + `connected_clients`, `used_memory`, `rdb_last_bgsave_status`, `blocked_clients`. +3. **Verify pod health** on `mvp-test` via `k8s_provisioner` once write access is available. +4. **Cross-correlate `gym-*` instances** — confirm recovery status; if recovered, determine + what changed (auto-restart, memory freed, network event resolved). +5. **Tune the `redis-latency` monitor** to emit environment-tagged events for faster + future correlation. + +--- + +## Potential Solutions + +### Short-term (operational) +- Add `env` tag to all `redis-latency` monitor alerts so Datadog events are automatically + correlated by environment. +- Add a `slowlog-log-slower-than` config to the project's `perf_config.toml` / Docker config + to ensure slow commands are captured in the slowlog at an appropriate threshold. +- Document the triage runbook (see companion `docs/RUNBOOK-redis-latency.md`). + +### Medium-term (architectural) +- Introduce a connection-pool configuration section to `perf_config.toml` with explicit + `max_connections`, `min_idle`, and `connection_timeout` knobs, so pool exhaustion + is a tunable rather than a surprise. +- Add a health-check endpoint that exposes `INFO` snapshot data, enabling automated + pre-alert diagnosis. + +### Long-term +- Consider per-environment Datadog monitors (or monitor scoping) instead of a single + fleet-wide `redis-latency` monitor to reduce correlation ambiguity. +- Evaluate whether the fleet's Redis backing stores share any network path or resource + that could be isolated to prevent wave propagation. 
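
The short-term slowlog threshold and the medium-term connection-pool section proposed above could take roughly the following shape. This is only a sketch: the `[slowlog]` and `[connection_pool]` section names, the value formats, and every number are assumptions for illustration, not the current `perf_config.toml` schema. Only the key names `slowlog-log-slower-than`, `max_connections`, `min_idle`, and `connection_timeout` come from this document.

```toml
# Hypothetical sketch: section names, units, and values are illustrative
# assumptions, not the existing perf_config.toml schema.

[slowlog]
# Redis itself measures this threshold in microseconds (10000 = 10 ms).
slowlog-log-slower-than = 10000

[connection_pool]
max_connections = 64        # upper bound on pooled connections per client
min_idle = 4                # idle connections kept open between bursts
connection_timeout = "2s"   # how long a caller waits for a free connection
```

Keeping these knobs explicit would let a pool-exhaustion hypothesis be tested by changing one value and re-running the load, rather than by inferring limits from client defaults.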
+ +--- + +## Related + +- Issue: `is-019da93c-ccd6-7820-be36-c0018faf9c1c` +- Alert: `alert-factory-mvp-1` +- Runbook: [`docs/RUNBOOK-redis-latency.md`](../../RUNBOOK-redis-latency.md) +- Config: [`perf_config.toml`](../../../perf_config.toml) +- Docker config: [`docker-benchmark/perf_config.toml`](../../../docker-benchmark/perf_config.toml) +- ADR-009 (Security / TLS / ACL): [`docs/adr/009-security-tls-acl.md`](../009-security-tls-acl.md) From 22f2333f5f4df6068cec74ff0c12cae977837368 Mon Sep 17 00:00:00 2001 From: sesh nalla <39490039+nerdsane@users.noreply.github.com> Date: Mon, 20 Apr 2026 00:55:50 -0400 Subject: [PATCH 2/3] docs(runbook): Redis latency triage runbook for dark-redis-rust fleet (is-019da93c) --- docs/RUNBOOK-redis-latency.md | 184 ++++++++++++++++++++++++++++++++++ 1 file changed, 184 insertions(+) create mode 100644 docs/RUNBOOK-redis-latency.md diff --git a/docs/RUNBOOK-redis-latency.md b/docs/RUNBOOK-redis-latency.md new file mode 100644 index 0000000..1598759 --- /dev/null +++ b/docs/RUNBOOK-redis-latency.md @@ -0,0 +1,184 @@ +# Runbook: Redis Latency Alert — dark-redis-rust Fleet + +> **Linked Gap:** [`docs/adr/gaps/GAP-001-fleet-wide-redis-latency.md`](adr/gaps/GAP-001-fleet-wide-redis-latency.md) +> **Monitor:** `redis-latency` +> **Alert ID prototype:** `alert-factory-mvp-1` + +--- + +## When to use this runbook + +Use this runbook whenever the `redis-latency` Datadog monitor transitions to **Warning** +or **Alert** state for any environment in the `dark-redis-rust` fleet. Environments include +`mvp-test`, `gym-*`, and any newly provisioned `test-*` instances. + +--- + +## Step 0 — Establish scope before acting + +``` +1. Open Datadog → Monitors → search "redis-latency" +2. Note which environments are currently in Alert vs Warning vs OK vs No Data. +3. 
Check the Datadog Events stream for the 30-minute window before the first trigger: + - Filter: service:dark-redis-rust OR env:mvp-test OR env:gym-* + - Look for: deploys, restarts, config changes, new monitor provisioning. +4. If multiple environments fired within 10 minutes of each other → suspect shared + infrastructure cause. Proceed to Step 1. + If only one environment fired → proceed to Step 2 (isolated triage). +``` + +**Wave-pattern indicator:** ≥3 environments firing within a 10-minute window is strong +evidence of a shared cause (network, backing store, or fleet provisioning event). + +--- + +## Step 1 — Check for shared infrastructure events + +```bash +# Verify whether a network or backing-store event correlates with the alert window. +# (Commands below assume kubectl access to the relevant cluster.) + +# List recent events in the namespace +kubectl get events -n dark-redis-rust --sort-by='.lastTimestamp' | tail -40 + +# Check node conditions for the relevant nodes +kubectl describe nodes | grep -A5 "Conditions:" + +# Check if any HPA or Deployment scaling happened around the trigger time +kubectl rollout history deployment/redis-rust -n dark-redis-rust +``` + +If a scaling event or network disruption is confirmed → file an incident with the +infrastructure team and skip to Step 6 (monitor tuning) once the root cause is resolved. 
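
The wave-pattern indicator from Step 0 (at least 3 environments firing within a 10-minute window) can be checked mechanically before choosing between shared-cause and isolated triage. The sketch below is illustrative only: the function name `classify_alert_scope`, its `<epoch_seconds> <environment>` input format, and its output words are assumptions, and the trigger times must still be copied out of Datadog by hand.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the name, input format, and output words are
# assumptions for illustration; this is not part of the fleet tooling.
#
# stdin : one "<epoch_seconds> <environment>" pair per line, one line per
#         monitor trigger (order does not matter).
# $1    : window in seconds (default 600, the 10-minute rule from Step 0).
# $2    : minimum number of distinct environments (default 3).
# stdout: "wave" if enough distinct environments fired within the window
#         that starts at the earliest trigger, otherwise "isolated".
classify_alert_scope() {
  local window="${1:-600}" min_envs="${2:-3}"
  sort -n | awk -v window="$window" -v min_envs="$min_envs" '
    NR == 1 { first = $1 }                      # earliest trigger after sorting
    ($1 - first) <= window && !($2 in seen) {   # count each environment once
      seen[$2] = 1
      count++
    }
    END { print ((count + 0 >= min_envs + 0) ? "wave" : "isolated") }
  '
}
```

Fed the six triggers from the 2026-04-20 incident (five `gym-*` environments at 04:45Z and `mvp-test` 300 seconds later), this sketch classifies the event as `wave`. It measures the window from the earliest trigger only, which is a simplification; a sliding window would also catch waves that start after a single early outlier.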
+ +--- + +## Step 2 — Inspect the Redis slowlog + +Connect to the affected instance (substitute `mvp-test` / `gym-*` as appropriate): + +```bash +# Port-forward to the Redis-rust instance +kubectl port-forward svc/redis-rust-mvp-test 6380:6379 -n dark-redis-rust + +# In a separate terminal — inspect the slowlog +redis-cli -p 6380 SLOWLOG GET 25 +``` + +**What to look for:** + +| Command pattern | Likely cause | Action | +|----------------|-------------|--------| +| `KEYS *` | Full keyspace scan | Replace caller with `SCAN`; see HARNESS.md §shard-aggregated | +| `LRANGE key 0 -1` on large list | Large list traversal | Read bounded `start`/`stop` ranges and paginate | +| `HGETALL` on huge hash | Unbounded hash read | Read specific fields (`HGET`/`HMGET`) or iterate with `HSCAN` | +| `DEBUG SLEEP` | Test artifact | Remove from production path | +| Repeated `BGSAVE` / `BGREWRITEAOF` | Persistence pressure | See Step 3 | + +Reset the slowlog after capturing: +```bash +redis-cli -p 6380 SLOWLOG RESET +``` + +--- + +## Step 3 — Check Redis INFO stats + +```bash +redis-cli -p 6380 INFO all | grep -E \ +  -e 'connected_clients|blocked_clients|used_memory_human|used_memory_peak_human' \ +  -e 'rdb_last_bgsave_status|aof_last_rewrite_status|rdb_last_bgsave_time_sec' \ +  -e 'total_commands_processed|rejected_connections|evicted_keys|keyspace_hits|keyspace_misses' +``` + +**Thresholds to flag:** + +| Stat | Concern threshold | +|------|------------------| +| `connected_clients` | > 80% of `maxclients` config value | +| `blocked_clients` | > 0 (clients waiting in blocking calls such as `BLPOP`/`BRPOP`; investigate if unexpected) | +| `rdb_last_bgsave_status` | `err` | +| `rdb_last_bgsave_time_sec` | > 30s (slow snapshot; check `latest_fork_usec` for fork latency) | +| `evicted_keys` | Increasing over time (memory pressure) | +| `rejected_connections` | > 0 | + +--- + +## Step 4 — Cross-correlate gym-* instances + +If the wave pattern was confirmed in Step 0, check whether `gym-*` instances have recovered: + +```bash +# Check all gym-* instances in one pass +for env in gym-2450 gym-110e gym-run1-2ea1 gym-bbd4 gym-run5-ww7y; do + 
echo "=== $env ===" + redis-cli -h redis-rust-${env}.dark-redis-rust.svc INFO server | grep uptime_in_seconds + redis-cli -h redis-rust-${env}.dark-redis-rust.svc INFO stats | grep total_commands_processed +done +``` + +If `gym-*` instances recovered on their own: +- Check their uptime — a restart is the most common auto-recovery. +- Check whether a network event resolved (correlate with cloud provider status page). +- Document recovery time in the incident ticket. + +If `gym-*` instances are still in Alert: +- Escalate to the infrastructure team; a persisting multi-instance alert points to a fleet-wide shared cause. + +--- + +## Step 5 — Verify pod health + +```bash +kubectl get pods -n dark-redis-rust -l env=mvp-test +kubectl describe pod -n dark-redis-rust -l env=mvp-test | grep -A10 "Events:" +kubectl top pod -n dark-redis-rust +``` + +Restart a pod only if: +- The pod is in `CrashLoopBackOff`, or its last termination reason is `OOMKilled`. +- Memory stats from Step 3 show the instance is paging or evicting keys at a high rate. + +```bash +# Rolling restart (zero-downtime only if the Deployment runs more than one replica) +kubectl rollout restart deployment/redis-rust-mvp-test -n dark-redis-rust +``` + +--- + +## Step 6 — Tune the redis-latency monitor (post-incident) + +The `redis-latency` monitor currently lacks environment tagging, which slows correlation. +After the immediate incident is resolved, update the monitor configuration: + +1. Open the monitor in Datadog → Edit. +2. Add `env` as a **group-by** dimension so alerts fire per environment, not fleet-wide. +3. Add the following tags to alert notifications: + ``` + env:{{env.name}} + service:dark-redis-rust + alert_id:{{alert.id}} + ``` +4. Set a **No Data** notification threshold of 10 minutes so telemetry gaps surface + as their own alert rather than silently masking latency state. +5. Save and verify the monitor fires correctly in a staging environment. 
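
The point-in-time thresholds from Step 3 can also be applied mechanically to a captured `INFO` dump, which helps when comparing several instances during a wave. The following is a minimal sketch under stated assumptions: the function name `flag_info_stats` and its `stat=value` output format are hypothetical, and `maxclients` is passed in as an argument (defaulting to 10000, the stock Redis default) because this document does not say whether the fleet reports it in `INFO`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: name, argument, and output format are assumptions
# for illustration; this is not part of the fleet tooling.
#
# stdin : raw `redis-cli INFO` output ("key:value" lines, CRLF tolerated).
# $1    : the instance's maxclients setting (assumed default: 10000).
# stdout: one "stat=value" line per Step 3 threshold that is exceeded.
# Note  : evicted_keys is omitted on purpose; "increasing over time" needs
#         two samples and cannot be judged from a single dump.
flag_info_stats() {
  local maxclients="${1:-10000}"
  tr -d '\r' | awk -F: -v maxclients="$maxclients" '
    $1 == "connected_clients"        && $2 > 0.8 * maxclients { print $1 "=" $2 }
    $1 == "blocked_clients"          && $2 > 0                { print $1 "=" $2 }
    $1 == "rdb_last_bgsave_status"   && $2 == "err"           { print $1 "=" $2 }
    $1 == "rdb_last_bgsave_time_sec" && $2 > 30               { print $1 "=" $2 }
    $1 == "rejected_connections"     && $2 > 0                { print $1 "=" $2 }
  '
}
```

A typical invocation would be `redis-cli -p 6380 INFO all | flag_info_stats 10000`, with the second value replaced by the instance's real limit. Empty output means none of the single-sample thresholds were crossed; it does not rule out memory pressure, since eviction trends still have to be compared across two captures.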
+ +--- + +## Escalation path + +| Condition | Escalate to | +|-----------|------------| +| All `gym-*` still in Alert after 15 min | Infrastructure / SRE on-call | +| `rejected_connections > 0` | Database reliability team | +| `rdb_last_bgsave_status: err` | Storage / persistence team | +| Wave pattern repeats within 24h | Architecture review (see GAP-001) | + +--- + +## References + +- [GAP-001: Fleet-Wide Redis Latency Wave Pattern](adr/gaps/GAP-001-fleet-wide-redis-latency.md) +- [HARNESS.md — shard-aggregated commands](HARNESS.md) +- [perf_config.toml](../perf_config.toml) — `num_shards`, `max_size` config +- Redis documentation: [SLOWLOG](https://redis.io/docs/manual/latency/), [INFO](https://redis.io/commands/info/) From 82696da7b0e26464a975c0e0295a78fee96dc82b Mon Sep 17 00:00:00 2001 From: sesh nalla <39490039+nerdsane@users.noreply.github.com> Date: Mon, 20 Apr 2026 00:56:15 -0400 Subject: [PATCH 3/3] docs(gap): register GAP-001 in gaps index (is-019da93c) --- docs/adr/gaps/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/adr/gaps/README.md b/docs/adr/gaps/README.md index 20a3527..966d20a 100644 --- a/docs/adr/gaps/README.md +++ b/docs/adr/gaps/README.md @@ -58,7 +58,7 @@ Discovered → Open → Investigating → ADR-Drafted → Closed | Gap | Title | Status | Severity | |-----|-------|--------|----------| -| - | (None yet) | - | - | +| [GAP-001](GAP-001-fleet-wide-redis-latency.md) | Fleet-Wide Redis Latency Wave Pattern | Investigating | High | ## Gap vs Deviation vs ADR