google · copybara-service · Jun 15, 2026
diff --git a/skills/cloud/agent-platform-alert-configuration/SKILL.md b/skills/cloud/agent-platform-alert-configuration/SKILL.md
diff --git a/skills/cloud/agent-platform-alert-configuration/assets/alerts_initial_duplicate.tf b/skills/cloud/agent-platform-alert-configuration/assets/alerts_initial_duplicate.tf
@@ -0,0 +1,19 @@
+resource "google_monitoring_alert_policy" "agent_error_rate_fast_burn" {
+  project      = var.project_id
+  display_name = "Agent Error Rate Fast Burn"
+  combiner     = "OR"
+  conditions {
+    display_name = "Error Rate Fast Burn"
+    condition_prometheus_query_language {
+      query    = <<-EOT
+        (
+          sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2..", reasoning_engine_id="12345"}[5m]))
+          /
+          sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{reasoning_engine_id="12345"}[5m]))
+          > (1 - var.slo_target) * 3
+        )
+      EOT
+      duration = "300s"
+    }
+  }
+}
diff --git a/skills/cloud/agent-platform-alert-configuration/assets/draft_invalid_query.tf b/skills/cloud/agent-platform-alert-configuration/assets/draft_invalid_query.tf
@@ -0,0 +1,11 @@
+resource "google_monitoring_alert_policy" "draft_policy" {
+  project      = var.project_id
+  display_name = "Draft Policy"
+  combiner     = "OR"
+  conditions {
+    display_name = "Draft Condition"
+    condition_prometheus_query_language {
+      query = "sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[5y])) / sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[5m]"
+    }
+  }
+}
diff --git a/skills/cloud/agent-platform-alert-configuration/assets/mock_bursty.json b/skills/cloud/agent-platform-alert-configuration/assets/mock_bursty.json
diff --git a/skills/cloud/agent-platform-alert-configuration/assets/mock_seasonal.json b/skills/cloud/agent-platform-alert-configuration/assets/mock_seasonal.json
diff --git a/skills/cloud/agent-platform-alert-configuration/assets/mock_steady.json b/skills/cloud/agent-platform-alert-configuration/assets/mock_steady.json
diff --git a/...ud/agent-platform-alert-configuration/references/has_historical_traffic_data.md b/...ud/agent-platform-alert-configuration/references/has_historical_traffic_data.md
@@ -0,0 +1,47 @@
+# Has Historical Traffic Data Available
+
+Use these instructions if the agent has historical metrics data available:
+
+## 1. Run Traffic Analyzer Script
+
+-   Run the `analyze_traffic.py` script to classify the metrics traffic pattern
+    profile using one of the following commands:
+    -   **Live Query**: `python3 scripts/analyze_traffic.py --live --project-id
+        [PROJECT_ID] --reasoning-engine-id [REASONING_ENGINE_ID]`
+    -   **Metrics File**: `python3 scripts/analyze_traffic.py --metrics-file
+        [PATH_TO_JSON]`
+-   **Handling Tool Failures**: If the `--live` command fails with
+    `CredentialsMissingError` (exit code 1), report the error and instruct the
+    user to run `gcloud auth application-default login` in their terminal.
+-   Map the profile classified by the script to a **Latency** alerting strategy:
+    -   **Steady**: Maps to **Long-Window Z-Score Baseline (1-week lookback)**
+        (safe since the script verified we have at least 14 days of history).
+    -   **Seasonal**: Maps to **Seasonal Decomposition** (average 1w and 1d).
+    -   **Bursty**: Maps to **Moving Averages** (1h baseline).
+-   **Fallback for Insufficient Data / No Traffic**:
+    -   If the script fails with a `ValueError` indicating insufficient data
+        points (fewer than 14 days of history), or if it outputs "New Agent / No
+        Traffic" (inactive agent), you MUST fall back to the user inquiry
+        instructions in
+        [no_historical_traffic_data.md](no_historical_traffic_data.md) to ask
+        the user for the expected traffic pattern.
+-   Regardless of the script's output profile, the other policies MUST use their
+    correct data-class defaults:
+    -   **Error Rate**: ALWAYS use **Multi-Window Multi-Burn Rate SLO Alerting**
+        (or ratio-based static limits).
+
+## 2. User Notification
+
+Clearly communicate the findings and selection at the start of your response:
+
+1.  Explain the classified traffic profile (Seasonal, Steady, or Bursty) output
+    by the metrics analysis script (citing indicators like standard deviation,
+    autocorrelation, or zero-ratio from the script output). If falling back to
+    user inquiry due to zero metrics or insufficient data, explain that.
+2.  Propose the corresponding alerting policy mapping (Latency matching the
+    traffic profile, Error Rate using SLO Burn Rate).
+3.  Ask the user if this expected profile mapping is correct or if they would
+    like to customize standard deviation thresholds.
+4.  Provide a brief plain-English explanation of what each proposed alert
+    measures and how its underlying algorithm works. Keep this explanation in
+    the conversational response text.
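
Review note: the profile-to-strategy mapping in this file can be sketched as a small lookup. This is only an illustration under stated assumptions: the function and constant names below are hypothetical and are not taken from `analyze_traffic.py`; only the profile labels and strategy names come from the text above. A `None` profile stands in for the `ValueError` (insufficient data) case.

```python
# Hypothetical sketch; names are illustrative, not part of analyze_traffic.py.
LATENCY_STRATEGY_BY_PROFILE = {
    "Steady": "Long-Window Z-Score Baseline (1-week lookback)",
    "Seasonal": "Seasonal Decomposition",
    "Bursty": "Moving Averages (1h baseline)",
}

# Error Rate never depends on the traffic profile.
ERROR_RATE_STRATEGY = "Multi-Window Multi-Burn Rate SLO Alerting"

# Sentinel meaning "fall back to no_historical_traffic_data.md and ask the user".
ASK_USER = "ask-user"


def choose_strategies(profile):
    """Return the alerting strategy for each policy given a classified profile.

    Any profile that is not Steady, Seasonal, or Bursty (for example
    "New Agent / No Traffic", or None for insufficient data) falls back
    to asking the user for the expected traffic pattern.
    """
    latency = LATENCY_STRATEGY_BY_PROFILE.get(profile, ASK_USER)
    return {"latency": latency, "error_rate": ERROR_RATE_STRATEGY}
```

Keeping the fallback as the lookup default means an unexpected script output degrades to the user inquiry path instead of silently selecting a strategy.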
diff --git a/...oud/agent-platform-alert-configuration/references/no_historical_traffic_data.md b/...oud/agent-platform-alert-configuration/references/no_historical_traffic_data.md
@@ -0,0 +1,69 @@
+# No Historical Traffic Data Available
+
+Use these instructions if there is no historical metrics data available for the
+agent (e.g., brand new agent):
+
+## 1. Ask the User for the Traffic Pattern and Handle Defaults
+
+-   Because no historical metrics data is available and we do NOT perform
+    traffic pattern inference based on the name, description, or context of the
+    agent, you MUST ask the user directly what traffic pattern they expect for
+    their agent.
+-   Present the user with the three options for customization:
+    -   **Steady/Consistent**: (e.g., continuous background tasks, sync jobs,
+        daemons). Maps **Latency** to **Short-Window Z-Score Baseline (1-hour
+        lookback)**.
+    -   **Bursty/Inconsistent**: (e.g., periodic background tasks, batch worker,
+        cron jobs). Maps **Latency** to **Moving Averages (1-hour baseline)**.
+    -   **Seasonal/Cyclical**: (e.g., user-facing systems, support portals,
+        chatbots). Maps **Latency** to **Seasonal Decomposition** (requires
+        offsets `1d` and `1w`).
+-   Inform the user that the default traffic pattern is **Steady/Consistent**
+    (which maps to Short-Window Z-Score Baseline), and that you will use this
+    default if they are unsure or do not specify one.
+-   **Handling Automated or Immediate Setup Requests**: If the user's prompt
+    asks you to configure or write the alerting policies immediately (e.g., "Set
+    up its alerting policies in 'monitoring/alerts.tf'"), or if you are running
+    in an automated/non-interactive script, you MUST NOT pause to wait for their
+    response. Instead, ask the question in your response, state that you are
+    deploying the default Steady/Consistent pattern because no choice was
+    specified yet, and **immediately proceed to generate and write the default
+    configuration (Steady / Consistent -> Short-Window Z-Score)**.
+-   Regardless of the selected traffic pattern, the other policies MUST use
+    their correct data-class defaults:
+
+    -   **Error Rate**: ALWAYS use **Multi-Window Multi-Burn Rate SLO Alerting**
+        (or ratio-based static limits).
+
+*   **Short-Window Z-Score / Moving Averages**: Require **1 hour** of traffic
+    history.
+
+*   **SLO Burn Rate (Error Rate)**: Requires up to **3 days** for the slow burn
+    component, though the fast burn component (1h/5m) will work after 1 hour.
+
+*   **Seasonal Decomposition**: Requires **1 week** of history (due to the `1w`
+    offset). **WARNING:** If the user switches to Seasonal Decomposition, warn
+    them that they will have a 1-week blind spot, and suggest starting with
+    **Short-Window Z-Score** or **Static Thresholds** as a temporary guard.
+
+## 2. User Notification
+
+Clearly communicate the lack of historical data, explain the options, and detail
+the immediate actions taken at the start of your response:
+
+1.  Explain that because the agent has no historical data, you cannot
+    automatically analyze the traffic pattern.
+2.  Ask the user directly what traffic pattern they expect (Steady, Seasonal, or
+    Bursty), detailing the mapping differences and the 1-week blind spot risk if
+    they choose Seasonal.
+3.  Inform the user that the default is **Steady / Consistent** (Short-Window
+    Z-Score algorithm for Latency) and that you will proceed with this default
+    if they are unsure or do not choose.
+4.  If the user accepts the default, explain that you have deployed the
+    Steady/Consistent default to ensure the files are configured immediately,
+    but they can request an update if they prefer another pattern.
+5.  Explain the warm-up periods (1 hour for Latency, up to 3 days for SLOs).
+6.  Propose the rest of the configuration mapping: Error Rate (SLO Burn Rate).
+7.  Provide a brief plain-English explanation of what each of the
+    proposed alerts measures and how the algorithm underlying each
+    one works.
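
Review note: the warm-up requirements listed in this file can be checked mechanically. The sketch below is a hedged illustration, not code from the skill: `WARM_UP` and `active_strategies` are hypothetical names, and the durations are copied from the bullets above (1 hour for Short-Window Z-Score, Moving Averages, and the fast burn component; 3 days for the slow burn component; 1 week for Seasonal Decomposition).

```python
from datetime import timedelta

# Minimum traffic history each strategy needs before it can evaluate.
# Durations come from the reference text; the names are hypothetical.
WARM_UP = {
    "Short-Window Z-Score": timedelta(hours=1),
    "Moving Averages": timedelta(hours=1),
    "SLO Fast Burn": timedelta(hours=1),
    "SLO Slow Burn": timedelta(days=3),
    "Seasonal Decomposition": timedelta(weeks=1),
}


def active_strategies(history):
    """Return the strategies whose warm-up period is covered by `history`."""
    return sorted(name for name, needed in WARM_UP.items() if history >= needed)


def blind_spot(strategy, history):
    """Return how much longer `strategy` stays blind, or zero if it is ready."""
    remaining = WARM_UP[strategy] - history
    return max(remaining, timedelta(0))
```

For an agent with two hours of history, `blind_spot("Seasonal Decomposition", timedelta(hours=2))` reports most of a week still to wait, which is the gap the Short-Window Z-Score or Static Thresholds guard is meant to cover.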
diff --git a/skills/cloud/agent-platform-alert-configuration/references/promql_queries.md b/skills/cloud/agent-platform-alert-configuration/references/promql_queries.md
@@ -0,0 +1,132 @@
+# PromQL Queries Reference
+
+This file contains the recommended PromQL queries and template configurations
+for monitoring Latency and Error Rates of Agent Platform agents.
+
+## Table of Contents
+
+-   [1. Latency (95th Percentile)](#1-latency-95th-percentile)
+    -   [Z-Score (Steady Traffic)](#z-score-recommended-for-steady-traffic)
+    -   [Moving Averages (Bursty Traffic)](#moving-averages-recommended-for-bursty-traffic)
+    -   [Seasonal Decomposition (Seasonal Traffic)](#seasonal-decomposition-recommended-for-traffic-with-seasonal-or-time-of-day-component)
+-   [2. Error Rate (SLO)](#2-error-rate-slo)
+    -   [Fast Burn SLO](#fast-burn-slo-1-hour-and-5-minute-windows)
+    -   [Slow Burn SLO](#slow-burn-slo-3-day-and-6-hour-windows)
+
+--------------------------------------------------------------------------------
+
+## 1. Latency (95th Percentile)
+
+### Z-Score (Recommended for Steady Traffic)
+
+#### Long-Window Z-Score (For Established Agents - >1 week history)
+
+Compares the 5-minute 95th percentile latency to the 1-week baseline.
+
+```promql
+abs(
+  histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id))
+  -
+  histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1w])) by (le, reasoning_engine_id))
+)
+/
+stddev_over_time(
+  (histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id)))[1w:5m]
+) > 3
+```
+
+*Note: The denominator uses a subquery `[1w:5m]` to compute the standard
+deviation of the 5-minute p95 latency over 1 week. The numerator uses the
+`[1w]` rate directly as the baseline, avoiding a second subquery for the mean.*
+
+#### Short-Window Z-Score (For Newer Agents - >1 hour history)
+
+Compares the 1-minute 95th percentile latency to the 1-hour baseline. Useful for
+quick activation on new agents.
+
+```promql
+abs(
+  histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1m])) by (le, reasoning_engine_id))
+  -
+  histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1h])) by (le, reasoning_engine_id))
+)
+/
+stddev_over_time(
+  (histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1m])) by (le, reasoning_engine_id)))[1h:1m]
+) > 3
+```
+
+### Moving Averages (Recommended for Bursty Traffic)
+
+Alerts when the 5-minute p95 latency exceeds 1.5 times the 1-hour p95 baseline.
+
+```promql
+histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id))
+>
+1.5 * histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1h])) by (le, reasoning_engine_id))
+```
+
+### Seasonal Decomposition (Recommended for traffic with seasonal or time-of-day component)
+
+> [!NOTE] For the Latency alert policy, use seasonal decomposition ONLY to
+> detect Latency spikes. Policies that use seasonal decomposition to track both
+> spikes and drops can trigger false alerts.
+
+Alerts when the 5-minute p95 latency exceeds twice the average of the 1-day
+and 1-week offset baselines.
+
+```promql
+histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id))
+/
+(
+  (
+    histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m] offset 1d)) by (le, reasoning_engine_id))
+    +
+    histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m] offset 1w)) by (le, reasoning_engine_id))
+  ) / 2
+)
+> 2
+```
+
+--------------------------------------------------------------------------------
+
+## 2. Error Rate (SLO)
+
+Always use Multi-Window Multi-Burn Rate SLOs. Z-score is not recommended
+because error data is sparse.
+
+### Fast Burn SLO (1-Hour and 5-Minute Windows)
+
+```promql
+(
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[5m])) by (reasoning_engine_id)
+  /
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[5m])) by (reasoning_engine_id)
+  > (1 - ${var.slo_target}) * 14.4
+)
+and
+(
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[1h])) by (reasoning_engine_id)
+  /
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[1h])) by (reasoning_engine_id)
+  > (1 - ${var.slo_target}) * 14.4
+)
+```
+
+### Slow Burn SLO (3-Day and 6-Hour Windows)
+
+```promql
+(
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[6h])) by (reasoning_engine_id)
+  /
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[6h])) by (reasoning_engine_id)
+  > (1 - ${var.slo_target}) * 1.0
+)
+and
+(
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[3d])) by (reasoning_engine_id)
+  /
+  sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[3d])) by (reasoning_engine_id)
+  > (1 - ${var.slo_target}) * 1.0
+)
+```
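
Review note: the thresholds in the two burn-rate queries reduce to simple arithmetic on the error budget, `(1 - slo_target) * burn_factor`. The sketch below illustrates that arithmetic and the "both windows must exceed the threshold" rule expressed by `and` in the PromQL. It is an assumption-laden illustration: the function names are hypothetical, and the error ratios are plain floats supplied by the caller, not values fetched from Cloud Monitoring.

```python
# Burn factors taken from the queries above: 14.4 for fast burn (5m and 1h
# windows) and 1.0 for slow burn (6h and 3d windows).
FAST_BURN_FACTOR = 14.4
SLOW_BURN_FACTOR = 1.0


def burn_rate_thresholds(slo_target):
    """Return the error-ratio thresholds for an SLO target such as 0.999."""
    error_budget = 1.0 - slo_target
    return {
        "fast_burn": error_budget * FAST_BURN_FACTOR,
        "slow_burn": error_budget * SLOW_BURN_FACTOR,
    }


def fast_burn_fires(ratio_5m, ratio_1h, slo_target):
    """Fire only when both the 5-minute and 1-hour error ratios exceed the threshold."""
    threshold = burn_rate_thresholds(slo_target)["fast_burn"]
    return ratio_5m > threshold and ratio_1h > threshold


def slow_burn_fires(ratio_6h, ratio_3d, slo_target):
    """Fire only when both the 6-hour and 3-day error ratios exceed the threshold."""
    threshold = burn_rate_thresholds(slo_target)["slow_burn"]
    return ratio_6h > threshold and ratio_3d > threshold
```

With a 99.9% target the fast-burn threshold works out to an error ratio of roughly 1.44%, so a brief 5-minute spike alone does not fire unless the 1-hour ratio is also above that level.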