
Auto-remediation: High response time on sreproactive-vscode-39596; slot swap executed 2026-05-10T05:31:43Z; latency still elevated #112

@gderossilive

Summary

  • Alert: Proactive Reliability (App Service) High Response Time Alert
  • Resource: sreproactive-vscode-39596 (/subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-sre-proactive-demo/providers/Microsoft.Web/sites/sreproactive-vscode-39596)
  • Action: Automatic deployment slot swap (staging → production) executed at 2026-05-10T05:31:43.153Z
  • Status: Response time remains elevated post-swap; investigation required

Evidence

  • Baseline (from baseline.txt):
    • BaselineResponseTime: 37.080758823529415 ms
    • BaselineTimestamp: 2026-05-09T09:30:44.0722477Z
  • Pre-swap (App Insights, 5m window):
    • Window: 2026-05-10T05:23:33Z .. 2026-05-10T05:28:33Z
    • Query:
      requests
      | where timestamp between (datetime(2026-05-10T05:23:33Z) .. datetime(2026-05-10T05:28:33Z))
      | where cloud_RoleName <> '' and not(cloud_RoleName contains "staging")
      | summarize CurrentResponseTime = avg(duration) by cloud_RoleName
      | extend CurrentTimestamp = datetime(2026-05-10T05:28:33Z)
    • Result: sreproactive-vscode-39596 CurrentResponseTime = 230.12468823529412 ms; CurrentTimestamp = 2026-05-10T05:28:33Z
    • Deviation vs baseline: +520.60%
  • Post-swap (App Insights, ~2.7m window):
    • Window: 2026-05-10T05:32:00Z .. 2026-05-10T05:34:40Z
    • Query:
      requests
      | where timestamp between (datetime(2026-05-10T05:32:00Z) .. datetime(2026-05-10T05:34:40Z))
      | where cloud_RoleName == "sreproactive-vscode-39596"
      | summarize AvgResponseTime = avg(duration)
      | extend CurrentTimestamp = datetime(2026-05-10T05:34:40Z)
    • Result: AvgResponseTime = 215.0364 ms; Deviation vs baseline: +479.91%
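The two deviation figures above can be reproduced directly from the reported values. The following is a minimal sketch, assuming deviation is computed as the relative change against BaselineResponseTime; the function and constant names are illustrative and are not taken from the agent's code.

```python
def percent_deviation(current_ms: float, baseline_ms: float) -> float:
    """Return the percentage change of current_ms relative to baseline_ms."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return (current_ms - baseline_ms) / baseline_ms * 100.0


# Values copied from the Evidence section of this issue.
BASELINE_MS = 37.080758823529415   # BaselineResponseTime (baseline.txt)
PRE_SWAP_MS = 230.12468823529412   # pre-swap CurrentResponseTime
POST_SWAP_MS = 215.0364            # post-swap AvgResponseTime

pre_dev = percent_deviation(PRE_SWAP_MS, BASELINE_MS)
post_dev = percent_deviation(POST_SWAP_MS, BASELINE_MS)

print(f"pre-swap:  {pre_dev:+.2f}%")   # +520.60%
print(f"post-swap: {post_dev:+.2f}%")  # +479.91%
```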

Health Endpoints

Notes

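  • The swap was triggered automatically because response time had degraded by at least 20% relative to baseline.
  • Despite the swap, average response time fell only from 230.12 ms to 215.04 ms and remains +479.91% above baseline. This suggests the regression is not limited to the build that was in the production slot before the swap.
  • Please investigate recent commits and configuration changes affecting the production slot, as well as the contents of the staging slot.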
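The reasoning in these notes can be sketched as a small check. The 20% trigger comes from the note above; reusing the same threshold to decide whether the swap resolved the problem is an assumption made here for illustration, and `assess_swap` is a hypothetical helper, not part of the agent.

```python
SWAP_THRESHOLD_PCT = 20.0  # trigger level stated in the notes (>= 20% degradation)


def assess_swap(baseline_ms, pre_ms, post_ms, threshold_pct=SWAP_THRESHOLD_PCT):
    """Summarize whether a swap was warranted and whether latency recovered."""
    pre_dev = (pre_ms - baseline_ms) / baseline_ms * 100.0
    post_dev = (post_ms - baseline_ms) / baseline_ms * 100.0
    return {
        "swap_triggered": pre_dev >= threshold_pct,
        # Assumption: "recovered" means back under the trigger threshold.
        "still_degraded": post_dev >= threshold_pct,
        "improvement_ms": pre_ms - post_ms,
    }


result = assess_swap(37.080758823529415, 230.12468823529412, 215.0364)
print(result)
```

With the figures from this issue, the swap trigger condition holds, the post-swap deviation is still above the threshold, and the improvement is only about 15 ms.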

Next Steps (suggested)

  • Review recent deployments/changes around 2026-05-10T05:20Z–05:35Z.
  • Compare application configuration between slots (app settings/connection strings).
  • Analyze dependencies (DB calls, external services) for latency.
  • Consider temporarily increasing the instance count if the latency proves to be load-related.
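For the slot configuration comparison, a minimal sketch follows. It assumes the app settings of each slot have already been exported as JSON lists of objects with `name` and `value` fields (the shape that `az webapp config appsettings list` is expected to produce; verify against your CLI version). The sample entries are hypothetical and do not come from this app.

```python
import json


def settings_to_dict(entries):
    """Map setting name -> value for a list of {"name": ..., "value": ...}."""
    return {entry["name"]: entry.get("value") for entry in entries}


def diff_settings(production, staging):
    """Report setting names that differ between two slots (values are not echoed)."""
    prod = settings_to_dict(production)
    stag = settings_to_dict(staging)
    shared = prod.keys() & stag.keys()
    return {
        "only_in_production": sorted(prod.keys() - stag.keys()),
        "only_in_staging": sorted(stag.keys() - prod.keys()),
        "different_values": sorted(k for k in shared if prod[k] != stag[k]),
    }


# Hypothetical sample data standing in for the exported JSON of each slot.
production_json = '[{"name": "DB_TIMEOUT", "value": "30"}, {"name": "CACHE", "value": "on"}]'
staging_json = '[{"name": "DB_TIMEOUT", "value": "5"}, {"name": "DEBUG", "value": "1"}]'

report = diff_settings(json.loads(production_json), json.loads(staging_json))
print(json.dumps(report, indent=2))
```

Only setting names are reported, so that secrets such as connection strings are not written to logs or issue comments.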

This issue was created by sre-agent-proactive-demo--73aee8f4
Tracked by the SRE agent here
