Skip to content

High response time vs baseline on sreproactive-vscode-39596 – mitigated via slot swap (2026-05-10) #111

@gderossilive

Description

@gderossilive

Summary

  • Service: Azure App Service sreproactive-vscode-39596 (rg: rg-sre-proactive-demo)
  • Alert context: Plan CPU Sev2 fired 2026-05-10T05:19:37Z. Correlated response time regression vs baseline.
  • Action taken: Verified staging & production health; performed slot swap (staging -> production) at 2026-05-10T05:24:25.917Z UTC.
  • Outcome: Post-swap average response time improved to 1.78 ms (well below baseline 47.43 ms).

Evidence

  • Baseline (from baseline.txt):
    • BaselineResponseTime = 47.431782352941184 ms
    • BaselineTimestamp = 2026-01-22T16:00:55.2520697Z
  • Current (pre-swap) App Insights query window:
    • Window: 2026-05-10T05:15:25Z .. 2026-05-10T05:20:25Z
    • KQL:
      requests
      | where timestamp between (datetime(2026-05-10T05:15:25Z) .. datetime(2026-05-10T05:20:25Z))
      | where cloud_RoleName <> '' and not (cloud_RoleName contains "staging")
      | summarize CurrentResponseTime = avg(duration) by cloud_RoleName
      | extend CurrentTimestamp = datetime(2026-05-10T05:20:25Z)
    • Result: cloud_RoleName=sreproactive-vscode-39596; CurrentResponseTime=88.03081666666667 ms; CurrentTimestamp=2026-05-10T05:20:25Z
    • Deviation vs baseline: +85.59%
  • Post-swap verification query window:
    • Window: 2026-05-10T05:23:11Z .. 2026-05-10T05:25:11Z
    • KQL:
      requests
      | where timestamp between (datetime(2026-05-10T05:23:11Z) .. datetime(2026-05-10T05:25:11Z))
      | where cloud_RoleName <> '' and not (cloud_RoleName contains "staging")
      | summarize PostSwapResponseTime = avg(duration) by cloud_RoleName
      | extend CurrentTimestamp = datetime(2026-05-10T05:25:11Z)
    • Result: cloud_RoleName=sreproactive-vscode-39596; PostSwapResponseTime=1.7782 ms; CurrentTimestamp=2026-05-10T05:25:11Z

Health endpoints prior to swap

Artifacts

Next steps / Suspicions

  • Investigate recent code/config changes in the previously active production slot that could explain the 85%+ latency increase.
  • Review dependency latency (DB/external calls) around 05:15–05:20Z.
  • Consider adding automated canary/slot soak validation to catch regressions pre-promotion.

Closing note

  • Mitigation successful via slot swap; response time is better than baseline. Keeping this issue open for root-cause analysis.

This issue was created by sre-agent-proactive-demo--73aee8f4
Tracked by the SRE agent here

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions