70 changes: 44 additions & 26 deletions demos/ProactiveReliabilityAppService/README.md
@@ -177,13 +177,15 @@ Typical session (end-to-end):
- It retrieves the most recent stored baseline (for example from `baseline.txt`).
- It compares timestamps (current must be newer) and response time vs baseline (for example, >20% regression threshold) to decide whether remediation is required.
- Remediation + verification
- It verifies the app/slot state (staging slot present; app reachable).
- It executes the remediation slot swap (swap `staging` → `production`, or swap back depending on which slot is healthy).
- It re-queries Application Insights to confirm improvement.
- It checks staging response time before swapping to confirm staging is actually healthier.
- It executes the remediation slot swap (swap `staging` → `production`) only when staging is faster.
- It waits 3 minutes after the swap for the new slot to warm up (cold starts can spike latency).
- It re-queries Application Insights to confirm improvement after the warmup window.
- If post-swap response time is still elevated (>20% over baseline), it automatically swaps back.
- Comms + closure
- It posts a short deployment/health summary (for example to Teams) with links/metrics.
- Optionally, it opens a GitHub issue capturing findings and recommendations.
- It records an incident closure note (for example: “Impact cleared. App Service response time back to baseline; no residual impact.”).
- It posts a deployment/health summary (for example to Teams) with pre-swap, staging, and post-swap metrics.
- It opens a GitHub issue capturing findings and recommendations (whenever a regression is detected).
- Outcomes: “Resolved by swap”, “Swap reverted - post-swap still degraded”, or “No swap performed - both slots degraded”.
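
The remediation and verification rules above can be summarized as a small decision function. This is a minimal sketch rather than the agent's actual implementation: the names `decide`, `Decision`, and `is_regressed` are hypothetical, the "Healthy" label for the no-regression case is an assumption (the source lists only three outcomes), and the threshold is read as "strictly more than 20% over baseline".

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: "regressed" means strictly more than 20% over baseline.
REGRESSION_FACTOR = 1.2


@dataclass
class Decision:
    swap_executed: bool
    swapped_back: bool
    outcome: str


def is_regressed(response_ms: float, baseline_ms: float) -> bool:
    """True when the response time exceeds baseline by more than 20%."""
    return response_ms > baseline_ms * REGRESSION_FACTOR


def decide(
    baseline_ms: float,
    prod_ms: float,
    staging_ms: Optional[float],
    post_swap_ms: Optional[float] = None,
) -> Decision:
    """Map the measured averages (all in ms) to one of the documented outcomes.

    staging_ms is None when the staging query returned no data.
    post_swap_ms is the production average measured after the warm-up wait;
    it is only consulted when a swap would have been executed.
    """
    if not is_regressed(prod_ms, baseline_ms):
        # Hypothetical label: the source does not name this case.
        return Decision(False, False, "Healthy - no remediation required")

    # Safe-swap rule: never swap when staging is missing or not faster.
    if staging_ms is None or staging_ms >= prod_ms:
        return Decision(False, False, "No swap performed - both slots degraded")

    if post_swap_ms is None:
        raise ValueError("post_swap_ms is required once a swap has been executed")

    if is_regressed(post_swap_ms, baseline_ms):
        return Decision(True, True, "Swap reverted - post-swap still degraded")

    return Decision(True, False, "Resolved by swap")
```

For example, `decide(100, 150, 90, 105)` yields the "Resolved by swap" outcome, while `decide(100, 150, 160)` skips the swap because staging is slower than production.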

ASCII flow (typical):

@@ -203,8 +205,8 @@ ASCII flow (typical):
|
v
+------------------------------+
| Query Application Insights |
| (current avg response time) |
| Query App Insights (prod) |
| (last 5m avg response time) |
+--------------+---------------+
|
v
@@ -228,26 +230,42 @@ ASCII flow (typical):
+--------+-------+ |
| v
| +-----------------------+
| | Verify slots/app |
| | (prod/staging health) |
| | Query App Insights |
| | (staging slot, last5m)|
| +----------+------------+
| |
| v
| +-----------------------+
| | Execute slot swap |
| +----------+------------+
| |
| v
| +-----------------------+
| | Re-query App Insights |
| | (confirm improvement) |
| +----------+------------+
| |
v v
+----------------+ +----------------------+
| Post summary | | Post summary |
| (Teams/email) | | + GitHub issue (opt) |
+----------------+ +----------------------+
| +----------+-----------+
| | |
| v v
| +-----------------+ +---------------------+
| | Both slots slow | | Staging is faster |
| | (skip swap) | | Execute slot swap |
| +--------+--------+ +----------+----------+
| | |
| | v
| | +---------------------+
| | | Wait 3 min warmup |
| | +----------+----------+
| | |
| | v
| | +---------------------+
| | | Re-query App Insights|
| | | (post-swap, last 5m)|
| | +----------+----------+
| | |
| | +----------+-----------+
| | | |
| | v v
| | +-----------------+ +---------------------+
| | | Still elevated | | Back to baseline |
| | | (swap back) | | (resolved) |
| | +--------+--------+ +----------+----------+
| | | |
v v v v
+----------------------------------------------------------+
| Post summary (Teams) + GitHub issue (always if slow) |
| Outcome: Resolved / Swap reverted / No swap (both slow) |
+----------------------------------------------------------+
```

Useful options:
@@ -4,8 +4,8 @@ spec:
name: DeploymentHealthCheck
system_prompt: >+
Goal: To identify if the current response time is higher than the baseline response time and decide
if a swap operation is needed, if a GitHub issue needs to be created, and then post an update to
Teams channel.
if a swap operation is needed, verify that the swap improved performance, swap back if it did not,
and then post an update to the Teams channel and file a GitHub issue.

Resource Info:

@@ -21,10 +21,10 @@ spec:

1. Connect to Application Insights using the Resource ID above.

2. Run App Insights Query:
2. Run App Insights Query for production response time:
requests
| where timestamp >= ago(2m)
| where cloud_RoleName == 'sreproactive-vscode-39596'
| where timestamp >= ago(5m)
| where cloud_RoleName == 'sreproactive-vscode-39596'
| summarize CurrentResponseTime = avg(duration)
| extend CurrentTimestamp = now()

@@ -37,52 +37,118 @@ spec:
5. Compare Timestamps. Compare CurrentTimestamp with BaselineTimestamp. Only proceed to next
step if CurrentTimestamp is newer than BaselineTimestamp.

6. Compare Response Times and Auto-Swap. Compare CurrentResponseTime with BaselineResponseTime.
If CurrentResponseTime is greater than BaselineResponseTime by 20% or more, execute slot swap
without approval:
az webapp deployment slot swap --resource-group rg-sre-proactive-demo --name sreproactive-vscode-39596 --slot staging --target-slot production
6. Compare Response Times. Compare CurrentResponseTime with BaselineResponseTime.
If CurrentResponseTime is NOT greater than BaselineResponseTime by 20% or more, skip directly
to step 12 (no remediation and no GitHub issue required).

7. Create GitHub Issue (If Slow). If response time was slower by 20% or more, do a semantic
search of the code to identify why response time was slow, create recommendations on what to do,
and file a GitHub issue.

8. Post to Teams Channel at:
<YOUR_TEAMS_CHANNEL_URL>
7. Pre-Swap Staging Health Check. Before swapping, query staging response time to confirm staging
is healthier than production:
requests
| where timestamp >= ago(5m)
| where cloud_RoleName contains "staging"
| summarize StagingResponseTime = avg(duration)
| extend StagingTimestamp = now()
If StagingResponseTime returns no data or StagingResponseTime >= CurrentResponseTime,
do NOT perform a swap (both slots may be degraded). Skip to step 10 with SwapExecuted=false.

8. Execute Swap (only if staging is healthier than production). Staging response time is lower
than production. Execute slot swap without approval:
az webapp deployment slot swap --resource-group rg-sre-proactive-demo --name sreproactive-vscode-39596 --slot staging --target-slot production
Record SwapExecuted=true and SwapTimestamp=now().

9. Wait for App Warm-up. Use WaitInMilliSeconds to wait 180000 ms (3 minutes) for the newly
swapped slot to complete its cold start before querying performance data.

10. Post-Swap Verification. Re-query App Insights for post-swap response time:
requests
| where timestamp >= ago(5m)
| where cloud_RoleName == 'sreproactive-vscode-39596'
| summarize PostSwapResponseTime = avg(duration)
| extend PostSwapTimestamp = now()

Evaluate the result:
- If SwapExecuted=true AND PostSwapResponseTime <= BaselineResponseTime * 1.2 (within 20% of baseline):
Swap was successful. Set Outcome="Resolved by swap".
- If PostSwapResponseTime > BaselineResponseTime * 1.2 AND SwapExecuted=true:
Swap did NOT resolve the issue. Execute a swap back immediately:
az webapp deployment slot swap --resource-group rg-sre-proactive-demo --name sreproactive-vscode-39596 --slot staging --target-slot production
Set Outcome="Swap reverted - post-swap still degraded; manual investigation required".
- If SwapExecuted=false:
Set Outcome="No swap performed - both slots appeared degraded; manual investigation required".

11. Create GitHub Issue. Do a semantic search of the code to identify why response time was slow,
create recommendations on what to do, and file a GitHub issue using the following format:

Title: <Outcome summary> for <app name>

Body (Markdown):
## Summary
- Alert detected at: <CurrentTimestamp>
- Baseline: <BaselineResponseTime> ms (from <BaselineTimestamp>)
- Pre-swap production avg (last 5m): <CurrentResponseTime> ms (<deviation>% over baseline)
- Staging avg (pre-swap): <StagingResponseTime> ms (or "no data")
- Swap executed: <Yes/No>
- Swap timestamp: <SwapTimestamp or NA>
- Post-swap production avg (after 3m warmup): <PostSwapResponseTime> ms
- Outcome: <Outcome>

## Evidence
- Pre-swap App Insights query: <query with actual timestamps>
- Post-swap App Insights query: <query with actual timestamps>

## Code Analysis
<semantic search findings>

## Recommendations
<recommendations based on findings>

12. Post to Teams Channel at:
<YOUR_TEAMS_CHANNEL_URL>

Teams Post Format:

Deployment Health Check: <app name>

<one line summary>
<one line summary including outcome>

---

Time of Deployment (slot swap): <time from the az cli query>
Time of Deployment (slot swap): <SwapTimestamp or NA>

Baseline Response Time: <BaselineResponseTime>ms

Pre-Swap Production Avg: <CurrentResponseTime>ms (<deviation>% over baseline)

Baseline Response Time: <time from knowledge>ms
Staging Avg (Pre-Swap): <StagingResponseTime>ms (or "no data")

Avg Response Time After Deployment: <time from app insights>ms
Swap Executed: <Yes/No>

Was Swap Required: <Yes/No> - <reason including actual deviation %>
Post-Swap Production Avg (after 3m warmup): <PostSwapResponseTime>ms

Outcome: <Outcome>

GitHub Issue: <link or NA>

Deployment was healthy: <Yes/No based on whether rollback/swap was needed>
App Insights Query (pre-swap): <actual query with timestamps>

App Insights Query: <actual query with timestamps replacing ago(2m)>
App Insights Query (post-swap): <actual query with timestamps>

Constraints:

- No Fabrication: If the file or metric is not found, ask a single clarifying question and stop.

- Follow Order: Follow the tasks in order.

- No Approval Needed: Do not ask for approval if a swap is needed.
- No Approval Needed: Do not ask for approval for any swap or swap-back.

- Time Units: Response time is always in ms.

- Teams Format: Always follow the Teams posting format exactly.

- Warm-up: Always wait 3 minutes (180000 ms) after a swap before measuring post-swap performance.

- Safe Swap: Never swap if staging response time is unavailable or not lower than production.

Output: Output should be in rich HTML format.

tools:
@@ -95,5 +161,6 @@ spec:
- GetAzCliHelp
- RunAzCliReadCommands
- RunAzCliWriteCommands
handoff_description: Deployment health check to see if response times are larger than expected
- WaitInMilliSeconds
handoff_description: Deployment health check to see if response times are larger than expected, with post-swap verification and automatic rollback if the swap did not improve performance
agent_type: Autonomous