288 changes: 288 additions & 0 deletions skills/cloud/agent-platform-alert-configuration/SKILL.md

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
resource "google_monitoring_alert_policy" "agent_error_rate_fast_burn" {
  project      = var.project_id
  display_name = "Agent Error Rate Fast Burn"
  combiner     = "OR"
  conditions {
    display_name = "Error Rate Fast Burn"
    condition_prometheus_query_language {
      query    = <<-EOT
        (
          sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2..", reasoning_engine_id="12345"}[5m]))
          /
          sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{reasoning_engine_id="12345"}[5m]))
          > (1 - ${var.slo_target}) * 3
        )
      EOT
      duration = "300s"
    }
  }
}
@@ -0,0 +1,11 @@
resource "google_monitoring_alert_policy" "draft_policy" {
  project      = var.project_id
  display_name = "Draft Policy"
  combiner     = "OR"
  conditions {
    display_name = "Draft Condition"
    condition_prometheus_query_language {
      query = "sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[5y])) / sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[5m]))"
    }
  }
}

@@ -0,0 +1,47 @@
# Has Historical Traffic Data Available

Use these instructions if the agent has historical metrics data available:

## 1. Run Traffic Analyzer Script

- Run the `analyze_traffic.py` script to classify the traffic pattern profile
  of the agent's metrics, using one of the following commands:
  - **Live Query**: `python3 scripts/analyze_traffic.py --live --project-id
    [PROJECT_ID] --reasoning-engine-id [REASONING_ENGINE_ID]`
  - **Metrics File**: `python3 scripts/analyze_traffic.py --metrics-file
    [PATH_TO_JSON]`
- **Handling Tool Failures**: If the `--live` command fails with
  `CredentialsMissingError` (exit code 1), report the error and instruct the
  user to run `gcloud auth application-default login` in their terminal.
- Map the classified traffic pattern profile to a **Latency** algorithm:
  - **Steady**: Maps to **Long-Window Z-Score Baseline (1-week lookback)**
    (safe because the script has verified at least 14 days of history).
  - **Seasonal**: Maps to **Seasonal Decomposition** (average of the `1w` and
    `1d` baselines).
  - **Bursty**: Maps to **Moving Averages** (1h baseline).
- **Fallback for Insufficient Data / No Traffic**:
  - If the script fails with a `ValueError` indicating insufficient data
    points (fewer than 14 days of history), or if it outputs "New Agent / No
    Traffic" (inactive agent), you MUST fall back to the user inquiry
    instructions in
    [no_historical_traffic_data.md](no_historical_traffic_data.md) and ask
    the user for the expected traffic pattern.
- Regardless of the script's output profile, the other policies MUST use their
  correct data-class defaults:
  - **Error Rate**: ALWAYS use **Multi-Window Multi-Burn Rate SLO Alerting**
    (or ratio-based static limits).
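
The mapping and fallback rules above can be sketched in a few lines of Python. This is only an illustration: the `select_policies` function, the dictionary names, and the assumption that the script's profile label arrives as a plain string such as `"Steady"` are hypothetical and not part of `analyze_traffic.py`.

```python
# Hypothetical helper; only the mapping itself comes from the instructions above.
LATENCY_ALGORITHM_BY_PROFILE = {
    "Steady": "Long-Window Z-Score Baseline (1-week lookback)",
    "Seasonal": "Seasonal Decomposition (average of 1w and 1d)",
    "Bursty": "Moving Averages (1h baseline)",
}

# Error Rate never depends on the traffic profile.
ERROR_RATE_ALGORITHM = "Multi-Window Multi-Burn Rate SLO Alerting"


def select_policies(profile):
    """Return the alerting plan for a classified profile.

    Any label outside the three known profiles (for example
    "New Agent / No Traffic") triggers the user-inquiry fallback.
    """
    latency = LATENCY_ALGORITHM_BY_PROFILE.get(profile)
    if latency is None:
        return {"action": "ask_user", "error_rate": ERROR_RATE_ALGORITHM}
    return {
        "action": "configure",
        "latency": latency,
        "error_rate": ERROR_RATE_ALGORITHM,
    }


if __name__ == "__main__":
    print(select_policies("Bursty"))
    print(select_policies("New Agent / No Traffic"))
```

Keeping Error Rate outside the lookup table makes it harder to accidentally tie the SLO policy to the traffic profile.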

## 2. User Notification

Clearly communicate the findings and selection at the start of your response:

1. Explain the classified traffic profile (Seasonal, Steady, or Bursty) output
   by the metrics analysis script, citing indicators such as standard
   deviation, autocorrelation, or zero-ratio from the script output. If you are
   falling back to user inquiry because of zero metrics or insufficient data,
   explain that instead.
2. Propose the corresponding alerting policy mapping (Latency matching the
   traffic profile, Error Rate using SLO Burn Rate).
3. Ask the user whether this profile mapping is correct or whether they would
   like to customize the standard deviation thresholds.
4. Provide a brief plain-English explanation of what each proposed alert
   measures and how its underlying algorithm works. Keep this explanation in
   the conversational response text.
@@ -0,0 +1,69 @@
# No Historical Traffic Data Available

Use these instructions if there is no historical metrics data available for the
agent (e.g., brand new agent):

## 1. Ask the User for the Traffic Pattern and Handle Defaults

- Because no historical metrics data is available and we do NOT infer the
  traffic pattern from the name, description, or context of the agent, you
  MUST ask the user directly what traffic pattern they expect for their
  agent.
- Present the user with the three options for customization:
  - **Steady/Consistent**: (e.g., continuous background tasks, sync jobs,
    daemons). Maps **Latency** to **Short-Window Z-Score Baseline (1-hour
    lookback)**.
  - **Bursty/Inconsistent**: (e.g., periodic background tasks, batch workers,
    cron jobs). Maps **Latency** to **Moving Averages (1-hour baseline)**.
  - **Seasonal/Cyclical**: (e.g., user-facing systems, support portals,
    chatbots). Maps **Latency** to **Seasonal Decomposition** (requires
    offsets `1d` and `1w`).
- Inform the user that the default traffic pattern is **Steady/Consistent**
  (which maps to Short-Window Z-Score Baseline), and that you will use this
  default if they have no preference or do not specify one.
- **Handling Automated or Immediate Setup Requests**: If the user's prompt
  asks you to configure or write the alerting policies immediately (e.g., "Set
  up its alerting policies in 'monitoring/alerts.tf'"), or if you are running
  in an automated/non-interactive script, you MUST NOT pause to wait for their
  response. Instead, ask the question in your response, state that you are
  deploying the default Steady/Consistent pattern because no choice has been
  specified yet, and **immediately proceed to generate and write the default
  configuration (Steady/Consistent -> Short-Window Z-Score)**.
- Regardless of the selected traffic pattern, the other policies MUST use
  their correct data-class defaults:
  - **Error Rate**: ALWAYS use **Multi-Window Multi-Burn Rate SLO Alerting**
    (or ratio-based static limits).

- **Warm-Up Periods**: Each algorithm needs some traffic history before it can
  evaluate:
  - **Short-Window Z-Score / Moving Averages**: Require **1 hour** of traffic
    history.
  - **SLO Burn Rate (Error Rate)**: Requires up to **3 days** for the slow
    burn component, though the fast burn component (1h/5m) will work after
    1 hour.
  - **Seasonal Decomposition**: Requires **1 week** of history (due to the
    `1w` offset). **WARNING:** If the user switches to Seasonal Decomposition,
    warn them that they will have a 1-week blind spot, and suggest starting
    with **Short-Window Z-Score** or **Static Thresholds** as a temporary
    guard.
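
The default selection and the warm-up periods described above can be combined into one small planning helper. The following Python sketch is illustrative only: the function names, the lowercase pattern keys, and the choice of Short-Window Z-Score as the temporary guard are assumptions made for the example (Static Thresholds would be an equally valid guard).

```python
# Warm-up requirements in hours, taken from the periods listed above.
WARM_UP_HOURS = {
    "short_window_z_score": 1,
    "moving_averages": 1,
    "slo_fast_burn": 1,
    "slo_slow_burn": 3 * 24,
    "seasonal_decomposition": 7 * 24,
}

# Pattern -> latency algorithm; the keys are hypothetical normalized labels.
LATENCY_ALGORITHM = {
    "steady": "short_window_z_score",
    "bursty": "moving_averages",
    "seasonal": "seasonal_decomposition",
}
DEFAULT_PATTERN = "steady"


def is_warmed_up(algorithm, history_hours):
    """True once the agent has enough traffic history for the algorithm."""
    return history_hours >= WARM_UP_HOURS[algorithm]


def plan_latency_alert(pattern, history_hours):
    """Return (algorithm, temporary_guard).

    A missing or unknown pattern falls back to the Steady default. The guard
    is set only while Seasonal Decomposition is still inside its 1-week
    blind spot; otherwise it is None.
    """
    key = (pattern or DEFAULT_PATTERN).strip().lower()
    chosen = LATENCY_ALGORITHM.get(key, LATENCY_ALGORITHM[DEFAULT_PATTERN])
    if chosen == "seasonal_decomposition" and not is_warmed_up(chosen, history_hours):
        return chosen, "short_window_z_score"
    return chosen, None


if __name__ == "__main__":
    print(plan_latency_alert(None, 0))
    print(plan_latency_alert("Seasonal", 24))
```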

## 2. User Notification

Clearly communicate the lack of historical data, explain the options, and detail
the immediate actions taken at the start of your response:

1. Explain that because the agent has no historical data, you cannot
   automatically analyze the traffic pattern.
2. Ask the user directly what traffic pattern they expect (Steady, Seasonal, or
   Bursty), detailing the mapping differences and the 1-week blind spot risk if
   they choose Seasonal.
3. Inform the user that the default is **Steady/Consistent** (Short-Window
   Z-Score algorithm for Latency) and that you will proceed with this default
   if they have no preference or do not choose.
4. If the default is used, explain that you have deployed the
   Steady/Consistent default so that the files are configured immediately,
   and that the user can request an update if they prefer another pattern.
5. Explain the warm-up periods (1 hour for Latency, up to 3 days for SLOs).
6. Propose the rest of the configuration mapping: Error Rate (SLO Burn Rate).
7. Provide a brief plain-English explanation of what each proposed alert
   measures and how its underlying algorithm works.
@@ -0,0 +1,132 @@
# PromQL Queries Reference

This file contains the recommended PromQL queries and template configurations
for monitoring Latency and Error Rates of Agent Platform agents.

## Table of Contents

- [1. Latency (95th Percentile)](#1-latency-95th-percentile)
  - [Z-Score (Steady Traffic)](#z-score-recommended-for-steady-traffic)
  - [Moving Averages (Bursty Traffic)](#moving-averages-recommended-for-bursty-traffic)
  - [Seasonal Decomposition (Seasonal Traffic)](#seasonal-decomposition-recommended-for-traffic-with-seasonal-or-time-of-day-component)
- [2. Error Rate (SLO)](#2-error-rate-slo)
  - [Fast Burn SLO](#fast-burn-slo-1-hour-and-5-minute-windows)
  - [Slow Burn SLO](#slow-burn-slo-3-day-and-6-hour-windows)

--------------------------------------------------------------------------------

## 1. Latency (95th Percentile)

### Z-Score (Recommended for Steady Traffic)

#### Long-Window Z-Score (For Established Agents - >1 week history)

Compares the 5-minute 95th percentile latency to the 1-week baseline.

```promql
abs(
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id))
-
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1w])) by (le, reasoning_engine_id))
)
/
stddev_over_time(
(histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id)))[1w:5m]
) > 3
```

*Note: The denominator uses a subquery `[1w:5m]` to calculate the standard
deviation of the 5-minute p95 latency over 1 week. The numerator takes the p95
over a `[1w]` rate directly as the baseline, which avoids a second subquery for
the mean.*
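
To make the arithmetic behind the query concrete, here is a minimal Python sketch of the same z-score test on plain numbers. The function names and sample values are hypothetical. `statistics.pstdev` is used because PromQL's `stddev_over_time` computes a population standard deviation. Returning `0.0` for a flat history is a choice made for this sketch only; the PromQL expression itself would divide by zero in that case.

```python
from statistics import pstdev


def latency_z_score(current_p95, baseline_p95, history_p95):
    """Absolute deviation of the current p95 from the baseline, in standard deviations."""
    spread = pstdev(history_p95)
    if spread == 0:
        # Sketch-only guard: a perfectly flat history has no meaningful z-score.
        return 0.0
    return abs(current_p95 - baseline_p95) / spread


def should_alert(current_p95, baseline_p95, history_p95, threshold=3.0):
    """Mirror the `> 3` comparison at the end of the query."""
    return latency_z_score(current_p95, baseline_p95, history_p95) > threshold


if __name__ == "__main__":
    history_ms = [90, 110, 90, 110]  # made-up 5-minute p95 samples
    print(latency_z_score(140, 100, history_ms))
    print(should_alert(140, 100, history_ms))
```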

#### Short-Window Z-Score (For Newer Agents - >1 hour history)

Compares the 1-minute 95th percentile latency to the 1-hour baseline. Useful for
quick activation on new agents.

```promql
abs(
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1m])) by (le, reasoning_engine_id))
-
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1h])) by (le, reasoning_engine_id))
)
/
stddev_over_time(
(histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1m])) by (le, reasoning_engine_id)))[1h:1m]
) > 3
```

### Moving Averages (Recommended for Bursty Traffic)

Fires when the 5-minute p95 latency exceeds 1.5 times the 1-hour p95 baseline.

```promql
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id))
>
1.5 * histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[1h])) by (le, reasoning_engine_id))
```

### Seasonal Decomposition (Recommended for traffic with seasonal or time-of-day component)

> [!NOTE]
> For the Latency alert policy, use seasonal decomposition ONLY to track
> latency spikes. Alert policies that use seasonal decomposition to track both
> spikes and drops can trigger false alerts.

Fires when the 5-minute p95 latency exceeds twice the average of the 1-day and
1-week lookback baselines.

```promql
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m])) by (le, reasoning_engine_id))
/
(
(
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m] offset 1d)) by (le, reasoning_engine_id))
+
histogram_quantile(0.95, sum(rate(aiplatform_googleapis_com:reasoning_engine_request_latencies_bucket[5m] offset 1w)) by (le, reasoning_engine_id))
) / 2
)
> 2
```
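
The same comparison can be written out on plain numbers. This Python sketch uses hypothetical function names and made-up latency values. It checks spikes only, in line with the note above, and it treats a missing (zero) baseline as "no signal", which is an assumption of the sketch and not something the query does.

```python
def seasonal_ratio(current_p95, p95_one_day_ago, p95_one_week_ago):
    """Ratio of the current p95 to the average of the two lookback baselines."""
    baseline = (p95_one_day_ago + p95_one_week_ago) / 2
    if baseline <= 0:
        # Sketch-only guard for agents with no traffic at the lookback points.
        return None
    return current_p95 / baseline


def is_latency_spike(current_p95, p95_one_day_ago, p95_one_week_ago, threshold=2.0):
    """Mirror the `> 2` comparison; drops below the baseline never alert."""
    ratio = seasonal_ratio(current_p95, p95_one_day_ago, p95_one_week_ago)
    return ratio is not None and ratio > threshold


if __name__ == "__main__":
    print(seasonal_ratio(900, 400, 500))
    print(is_latency_spike(1000, 400, 500))
```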

--------------------------------------------------------------------------------

## 2. Error Rate (SLO)

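Always use Multi-Window Multi-Burn Rate SLOs for error rate. Z-score is not
recommended because error events are sparse. In the queries below,
`${var.slo_target}` is a Terraform interpolation of the SLO target expressed as
a fraction, so `(1 - ${var.slo_target})` is the error budget.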

### Fast Burn SLO (1-Hour and 5-Minute Windows)

```promql
(
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[5m])) by (reasoning_engine_id)
/
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[5m])) by (reasoning_engine_id)
> (1 - ${var.slo_target}) * 14.4
)
and
(
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[1h])) by (reasoning_engine_id)
/
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[1h])) by (reasoning_engine_id)
> (1 - ${var.slo_target}) * 14.4
)
```

### Slow Burn SLO (3-Day and 6-Hour Windows)

```promql
(
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[6h])) by (reasoning_engine_id)
/
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[6h])) by (reasoning_engine_id)
> (1 - ${var.slo_target}) * 1.0
)
and
(
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count{response_code!~"2.."}[3d])) by (reasoning_engine_id)
/
sum(rate(aiplatform_googleapis_com:reasoning_engine_request_count[3d])) by (reasoning_engine_id)
> (1 - ${var.slo_target}) * 1.0
)
```
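
As a worked example of the thresholds in the two queries above: with an SLO target of 0.999, the error budget is 0.001, so the fast burn condition fires above an error ratio of 0.001 × 14.4 = 1.44% and the slow burn condition fires above 0.001 × 1.0 = 0.1%. The Python sketch below reproduces that arithmetic and the `and` between the two windows. The function names and sample ratios are hypothetical.

```python
def burn_rate_threshold(slo_target, burn_rate):
    """Error ratio above which the error budget burns faster than `burn_rate`."""
    return (1.0 - slo_target) * burn_rate


def multi_window_alert(short_window_ratio, long_window_ratio, slo_target, burn_rate):
    """Both windows must exceed the threshold, matching the `and` in the queries."""
    threshold = burn_rate_threshold(slo_target, burn_rate)
    return short_window_ratio > threshold and long_window_ratio > threshold


if __name__ == "__main__":
    slo = 0.999
    # Fast burn: 5-minute and 1-hour windows, burn rate 14.4.
    print(burn_rate_threshold(slo, 14.4))
    print(multi_window_alert(0.02, 0.02, slo, 14.4))
    # Slow burn: 6-hour and 3-day windows, burn rate 1.0.
    print(multi_window_alert(0.002, 0.0005, slo, 1.0))
```

Requiring both windows keeps a short error burst from paging on its own, while the short window lets the alert clear quickly once errors stop.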