diff --git a/demos/ProactiveReliabilityAppService/README.md b/demos/ProactiveReliabilityAppService/README.md
index 89c32c5..3096f97 100644
--- a/demos/ProactiveReliabilityAppService/README.md
+++ b/demos/ProactiveReliabilityAppService/README.md
@@ -177,13 +177,15 @@ Typical session (end-to-end):
   - It retrieves the most recent stored baseline (for example from `baseline.txt`).
   - It compares timestamps (current must be newer) and response time vs baseline (for example, >20% regression threshold) to decide whether remediation is required.
 - Remediation + verification
-  - It verifies the app/slot state (staging slot present; app reachable).
-  - It executes the remediation slot swap (swap `staging` → `production`, or swap back depending on which slot is healthy).
-  - It re-queries Application Insights to confirm improvement.
+  - It checks staging response time before swapping to confirm staging is actually healthier.
+  - It executes the remediation slot swap (swap `staging` → `production`) only when staging is faster.
+  - It waits 3 minutes after the swap for the new slot to warm up (cold starts can spike latency).
+  - It re-queries Application Insights to confirm improvement after the warmup window.
+  - If post-swap response time is still elevated (>20% over baseline), it automatically swaps back.
 - Comms + closure
-  - It posts a short deployment/health summary (for example to Teams) with links/metrics.
-  - Optionally, it opens a GitHub issue capturing findings and recommendations.
-  - It records an incident closure note (for example: “Impact cleared. App Service response time back to baseline; no residual impact.”).
+  - It posts a deployment/health summary (for example to Teams) with pre-swap, staging, and post-swap metrics.
+  - It opens a GitHub issue capturing findings and recommendations (always when a regression was detected).
+  - Outcomes: “Resolved by swap”, “Swap reverted - post-swap still degraded”, or “No swap performed - both slots degraded”.
 
 ASCII flow (typical):
 
@@ -203,8 +205,8 @@ ASCII flow (typical):
  |
  v
  +------------------------------+
- | Query Application Insights |
- | (current avg response time) |
+ | Query App Insights (prod) |
+ | (last 5m avg response time) |
  +--------------+---------------+
  |
  v
@@ -228,26 +230,42 @@ ASCII flow (typical):
  +--------+-------+
  | |
  v | +-----------------------+
- | | Verify slots/app |
- | | (prod/staging health) |
+ | | Query App Insights |
+ | | (staging slot, last5m)|
  | +----------+------------+
  | |
- | v
- | +-----------------------+
- | | Execute slot swap |
- | +----------+------------+
- | |
- | v
- | +-----------------------+
- | | Re-query App Insights |
- | | (confirm improvement) |
- | +----------+------------+
- | |
- v v
- +----------------+ +----------------------+
- | Post summary | | Post summary |
- | (Teams/email) | | + GitHub issue (opt) |
- +----------------+ +----------------------+
+ | +----------+-----------+
+ | | |
+ | v v
+ | +-----------------+ +---------------------+
+ | | Both slots slow | | Staging is faster |
+ | | (skip swap) | | Execute slot swap |
+ | +--------+--------+ +----------+----------+
+ | | |
+ | | v
+ | | +---------------------+
+ | | | Wait 3 min warmup |
+ | | +----------+----------+
+ | | |
+ | | v
+ | | +---------------------+
+ | | | Re-query App Insights|
+ | | | (post-swap, last 5m)|
+ | | +----------+----------+
+ | | |
+ | | +----------+-----------+
+ | | | |
+ | | v v
+ | | +-----------------+ +---------------------+
+ | | | Still elevated | | Back to baseline |
+ | | | (swap back) | | (resolved) |
+ | | +--------+--------+ +----------+----------+
+ | | | |
+ v v v v
+ +----------------------------------------------------------+
+ | Post summary (Teams) + GitHub issue (always if slow) |
+ | Outcome: Resolved / Swap reverted / No swap (both slow) |
+ +----------------------------------------------------------+
 ```
 
 Useful options:
diff --git
a/demos/ProactiveReliabilityAppService/SubAgents/DeploymentHealthCheck.yaml b/demos/ProactiveReliabilityAppService/SubAgents/DeploymentHealthCheck.yaml
index 6f41a3a..042e4a9 100644
--- a/demos/ProactiveReliabilityAppService/SubAgents/DeploymentHealthCheck.yaml
+++ b/demos/ProactiveReliabilityAppService/SubAgents/DeploymentHealthCheck.yaml
@@ -4,8 +4,8 @@ spec:
   name: DeploymentHealthCheck
   system_prompt: >+
     Goal: To identify if the current response time is higher than the baseline response time and decide
-    if a swap operation is needed, if a GitHub issue needs to be created, and then post an update to
-    Teams channel.
+    if a swap operation is needed, verify the swap improved performance, swap back if it did not, and
+    then post an update to Teams channel and file a GitHub issue.
 
     Resource Info:
 
@@ -21,10 +21,10 @@ spec:
 
     1. Connect to Application Insights using the Resource ID above.
 
-    2. Run App Insights Query:
+    2. Run App Insights Query for production response time:
          requests
-         | where timestamp >= ago(2m)
-         | where cloud_RoleName == 'sreproactive-vscode-39596'
+         | where timestamp >= ago(5m)
+         | where cloud_RoleName == 'sreproactive-vscode-39596'
          | summarize CurrentResponseTime = avg(duration)
          | extend CurrentTimestamp = now()
 
@@ -37,39 +37,101 @@ spec:
     5. Compare Timestamps. Compare CurrentTimestamp with BaselineTimestamp. Only proceed to next step
        if CurrentTimestamp is newer than BaselineTimestamp.
 
-    6. Compare Response Times and Auto-Swap. Compare CurrentResponseTime with BaselineResponseTime.
-       If CurrentResponseTime is greater than BaselineResponseTime by 20% or more, execute slot swap
-       without approval:
-         az webapp deployment slot swap --resource-group rg-sre-proactive-demo --name sreproactive-vscode-39596 --slot staging --target-slot production
+    6. Compare Response Times. Compare CurrentResponseTime with BaselineResponseTime.
+       If CurrentResponseTime is NOT greater than BaselineResponseTime by 20% or more, skip directly
+       to step 11 (no remediation required).
 
-    7. Create GitHub Issue (If Slow). If response time was slower by 20% or more, do a semantic
-       search of the code to identify why response time was slow, create recommendations on what to do,
-       and file a GitHub issue.
-
-    8. Post to Teams Channel at:
-
+    7. Pre-Swap Staging Health Check. Before swapping, query staging response time to confirm staging
+       is healthier than production:
+         requests
+         | where timestamp >= ago(5m)
+         | where cloud_RoleName contains "staging"
+         | summarize StagingResponseTime = avg(duration)
+         | extend StagingTimestamp = now()
+       If StagingResponseTime returns no data or StagingResponseTime >= CurrentResponseTime,
+       do NOT perform a swap (both slots may be degraded). Skip to step 10 with SwapExecuted=false.
+
+    8. Execute Swap (only if staging is healthier than production). Staging response time is lower
+       than production. Execute slot swap without approval:
+         az webapp deployment slot swap --resource-group rg-sre-proactive-demo --name sreproactive-vscode-39596 --slot staging --target-slot production
+       Record SwapExecuted=true and SwapTimestamp=now().
+
+    9. Wait for App Warm-up. Use WaitInMilliSeconds to wait 180000 ms (3 minutes) for the newly
+       swapped slot to complete its cold start before querying performance data.
+
+    10. Post-Swap Verification. Re-query App Insights for post-swap response time:
+         requests
+         | where timestamp >= ago(5m)
+         | where cloud_RoleName == 'sreproactive-vscode-39596'
+         | summarize PostSwapResponseTime = avg(duration)
+         | extend PostSwapTimestamp = now()
+
+       Evaluate the result:
+       - If PostSwapResponseTime <= BaselineResponseTime * 1.2 (within 20% of baseline):
+         Swap was successful. Set Outcome="Resolved by swap".
+       - If PostSwapResponseTime > BaselineResponseTime * 1.2 AND SwapExecuted=true:
+         Swap did NOT resolve the issue. Execute a swap back immediately:
+           az webapp deployment slot swap --resource-group rg-sre-proactive-demo --name sreproactive-vscode-39596 --slot staging --target-slot production
+         Set Outcome="Swap reverted - post-swap still degraded; manual investigation required".
+       - If SwapExecuted=false:
+         Set Outcome="No swap performed - both slots appeared degraded; manual investigation required".
+
+    11. Create GitHub Issue. Do a semantic search of the code to identify why response time was slow,
+       create recommendations on what to do, and file a GitHub issue using the following format:
+
+       Title: for
+
+       Body (Markdown):
+       ## Summary
+       - Alert detected at:
+       - Baseline: ms (from )
+       - Pre-swap production avg (last 5m): ms (% over baseline)
+       - Staging avg (pre-swap): ms (or "no data")
+       - Swap executed:
+       - Swap timestamp:
+       - Post-swap production avg (after 3m warmup): ms
+       - Outcome:
+
+       ## Evidence
+       - Pre-swap App Insights query:
+       - Post-swap App Insights query:
+
+       ## Code Analysis
+
+
+       ## Recommendations
+
+
+    12. Post to Teams Channel at:
+
        Teams Post Format:
        Deployment Health Check:
-
+
        ---
- Time of Deployment (slot swap):