Summary
- Service: Azure App Service sreproactive-vscode-39596 (rg: rg-sre-proactive-demo)
- Alert context: Plan CPU Sev2 fired 2026-05-10T05:19:37Z. Correlated response time regression vs baseline.
- Action taken: Verified staging & production health; performed slot swap (staging -> production) at 2026-05-10T05:24:25.917Z UTC.
- Outcome: Post-swap average response time improved to 1.78 ms (well below baseline 47.43 ms).
Evidence
- Baseline (from baseline.txt):
- BaselineResponseTime = 47.431782352941184 ms
- BaselineTimestamp = 2026-01-22T16:00:55.2520697Z
- Current (pre-swap) App Insights query window:
- Window: 2026-05-10T05:15:25Z .. 2026-05-10T05:20:25Z
- KQL:
requests
| where timestamp between (datetime(2026-05-10T05:15:25Z) .. datetime(2026-05-10T05:20:25Z))
| where cloud_RoleName <> '' and not (cloud_RoleName contains "staging")
| summarize CurrentResponseTime = avg(duration) by cloud_RoleName
| extend CurrentTimestamp = datetime(2026-05-10T05:20:25Z)
- Result: cloud_RoleName=sreproactive-vscode-39596; CurrentResponseTime=88.03081666666667 ms; CurrentTimestamp=2026-05-10T05:20:25Z
- Deviation vs baseline: +85.59%
- Post-swap verification query window:
- Window: 2026-05-10T05:23:11Z .. 2026-05-10T05:25:11Z
- KQL:
requests
| where timestamp between (datetime(2026-05-10T05:23:11Z) .. datetime(2026-05-10T05:25:11Z))
| where cloud_RoleName <> '' and not (cloud_RoleName contains "staging")
| summarize PostSwapResponseTime = avg(duration) by cloud_RoleName
| extend CurrentTimestamp = datetime(2026-05-10T05:25:11Z)
- Result: cloud_RoleName=sreproactive-vscode-39596; PostSwapResponseTime=1.7782 ms; CurrentTimestamp=2026-05-10T05:25:11Z
Health endpoints prior to swap
Artifacts
Next steps / Suspicions
- Investigate recent code/config changes in the previously active production slot that could explain the 85%+ latency increase.
- Review dependency latency (DB/external calls) around 05:15–05:20Z.
- Consider adding automated canary/slot soak validation to catch regressions pre-promotion.
Closing note
- Mitigation successful via slot swap; response time is better than baseline. Keeping this issue open for root-cause analysis.
This issue was created by sre-agent-proactive-demo--73aee8f4
Tracked by the SRE agent here
Summary
Evidence
requests
| where timestamp between (datetime(2026-05-10T05:15:25Z) .. datetime(2026-05-10T05:20:25Z))
| where cloud_RoleName <> '' and not (cloud_RoleName contains "staging")
| summarize CurrentResponseTime = avg(duration) by cloud_RoleName
| extend CurrentTimestamp = datetime(2026-05-10T05:20:25Z)
requests
| where timestamp between (datetime(2026-05-10T05:23:11Z) .. datetime(2026-05-10T05:25:11Z))
| where cloud_RoleName <> '' and not (cloud_RoleName contains "staging")
| summarize PostSwapResponseTime = avg(duration) by cloud_RoleName
| extend CurrentTimestamp = datetime(2026-05-10T05:25:11Z)
Health endpoints prior to swap
Artifacts
Next steps / Suspicions
Closing note
This issue was created by sre-agent-proactive-demo--73aee8f4
Tracked by the SRE agent here