Fix DeploymentHealthCheck: pre-swap staging validation, post-swap warmup wait, and automatic rollback #108

Draft
Copilot wants to merge 2 commits into main from copilot/fix-high-response-time-issue-yet-again

Copilot AI commented May 6, 2026

The DeploymentHealthCheck subagent swapped staging → production on any regression, without checking staging health and without a warmup window before measuring results. In one case this left post-swap latency at 436 ms against a 26 ms baseline.

Root causes

  • No pre-swap staging check — agent swapped regardless of whether staging was actually faster; if staging was also degraded, the swap worsened production
  • 84-second post-swap measurement window — slot swaps trigger cold starts that spike latency for 3+ minutes, making the immediate re-check meaningless
  • No rollback path — a swap that didn't help stayed in production indefinitely
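
To put the numbers above in perspective, here is a minimal Python sketch of the arithmetic. The function names are hypothetical and not part of the subagent. The 180-second figure is the lower bound of the "3+ minutes" cold-start spike described above.

```python
# Hypothetical helpers for illustration; none of these names exist in the repo.

COLD_START_SPIKE_S = 180  # lower bound of the "3+ minutes" cold-start spike


def pct_over_baseline(observed_ms: float, baseline_ms: float) -> float:
    """Return how far observed latency sits above baseline, in percent."""
    return (observed_ms - baseline_ms) / baseline_ms * 100.0


def measured_after_warmup(delay_s: float, spike_s: float = COLD_START_SPIKE_S) -> bool:
    """True if the post-swap measurement starts after the cold-start spike."""
    return delay_s >= spike_s


if __name__ == "__main__":
    # 436 ms against a 26 ms baseline is roughly 1577% over baseline.
    print(round(pct_over_baseline(436, 26)))
    # An 84-second re-check still falls inside the cold-start spike.
    print(measured_after_warmup(84))
```

An 84-second re-check therefore samples latency while the new slot is still warming up, so a high reading does not indicate whether the swap helped.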

Changes

SubAgents/DeploymentHealthCheck.yaml

  • Step 7 – Pre-swap staging check: queries staging App Insights before acting; skips swap if staging response time is unavailable or ≥ production (SwapExecuted=false)
  • Step 9 – Warm-up wait: calls WaitInMilliSeconds(180000) after swap before any post-swap measurement
  • Step 10 – Post-swap verification + auto-rollback: re-queries App Insights; if still >20% over baseline, immediately swaps back and sets outcome "Swap reverted - post-swap still degraded"
  • Extended pre-swap query window from ago(2m) to ago(5m) for more stable averages
  • Added WaitInMilliSeconds to tools list
  • Enriched GitHub issue template and Teams post to include staging metrics, swap timestamp, and outcome label
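
The control flow described by Steps 7, 9, and 10 can be sketched roughly as follows. This is an illustrative Python model, not the YAML in this PR: the callables passed in (`query_prod_ms`, `query_staging_ms`, `swap_slots`, `wait_ms`) are hypothetical stand-ins for the App Insights queries, the slot swap, and `WaitInMilliSeconds`. The outcome labels other than "Swap reverted - post-swap still degraded" are assumptions. The sketch also assumes it is only invoked once a regression has already been detected.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REGRESSION_THRESHOLD = 0.20  # ">20% over baseline" (Step 10)
WARMUP_MS = 180_000          # WaitInMilliSeconds(180000) (Step 9)


@dataclass
class Result:
    swap_executed: bool
    outcome: str


def is_degraded(current_ms: float, baseline_ms: float) -> bool:
    """True if current latency is more than 20% over baseline."""
    return current_ms > baseline_ms * (1 + REGRESSION_THRESHOLD)


def remediate(
    baseline_ms: float,
    query_prod_ms: Callable[[], float],
    query_staging_ms: Callable[[], Optional[float]],
    swap_slots: Callable[[], None],
    wait_ms: Callable[[int], None],
) -> Result:
    prod_ms = query_prod_ms()
    staging_ms = query_staging_ms()

    # Step 7: skip the swap if staging is unavailable or not faster.
    if staging_ms is None or staging_ms >= prod_ms:
        return Result(False, "Swap skipped - staging not healthier")  # assumed label

    swap_slots()

    # Step 9: let cold starts settle before measuring anything.
    wait_ms(WARMUP_MS)

    # Step 10: verify, and swap back if production is still degraded.
    if is_degraded(query_prod_ms(), baseline_ms):
        swap_slots()
        return Result(True, "Swap reverted - post-swap still degraded")

    return Result(True, "Swap succeeded")  # assumed label
```

Passing the side effects in as callables keeps the decision logic testable without touching a real deployment; for example, `wait_ms` can be a no-op in tests and `time.sleep`-based in a live run.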

README.md

  • Updated demo flow description and ASCII diagram to reflect the new 3-phase remediation: staging check → swap + warmup → post-swap verify → optional rollback

…up wait, and automatic rollback

Agent-Logs-Url: https://github.com/gderossilive/AzSreAgentLab/sessions/22f59868-59af-4486-8b62-ee2564ac930c

Co-authored-by: gderossilive <20165027+gderossilive@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix high response time compared to baseline for sreproactive-vscode-39596" to "Fix DeploymentHealthCheck: pre-swap staging validation, post-swap warmup wait, and automatic rollback" May 6, 2026
Copilot AI requested a review from gderossilive May 6, 2026 17:00
Development

Successfully merging this pull request may close these issues.

High response time vs baseline; auto slot swap executed for sreproactive-vscode-39596
