17 changes: 11 additions & 6 deletions docs/ai/finops-agent.mdx
@@ -20,7 +20,11 @@ Before enabling the FinOps Agent, ensure the following:

## Enabling the FinOps Agent

### Step 1: Create the FinOps Agent Secret
### Step 1: Install a FinOps module

The FinOps Agent relies on a FinOps module to provide Kubernetes resource cost data. Install a FinOps module before enabling the agent — for example, the [OpenCost FinOps module](https://github.com/openchoreo/community-modules/tree/main/finops-opencost).

### Step 2: Create the FinOps Agent Secret

The FinOps Agent requires a Kubernetes Secret named `finops-agent` in the `openchoreo-observability-plane` namespace with the following keys:

@@ -63,7 +67,7 @@ spec:
EOF
```

### Step 2: Upgrade the Observability Plane
### Step 3: Upgrade the Observability Plane

Enable the FinOps Agent and configure the LLM model. The `--reuse-values` flag preserves your existing configuration.

@@ -73,7 +77,8 @@ Enable the FinOps Agent and configure the LLM model. The `--reuse-values` flag p
--namespace openchoreo-observability-plane \\
--reuse-values \\
--set finOpsAgent.enabled="true" \\
--set finOpsAgent.llmName=<model-name>`}
--set finOpsAgent.llmName=<model-name> \\
--set finOpsAgent.remediationEnabled=true`}
</CodeBlock>

:::note Supported Models
@@ -82,15 +87,15 @@ The FinOps Agent currently supports the [OpenAI](https://platform.openai.com/) G

If the observability plane and control plane are in separate clusters, also set `finOpsAgent.openchoreoApiUrl` to the control plane API URL (defaults to `http://api.openchoreo.localhost:8080`).

### Step 3: Register with the control plane
### Step 4: Register with the control plane

Configure `finOpsAgentURL` in the `ClusterObservabilityPlane` resource so the UI knows where to reach the FinOps Agent:

<CodeBlock language="bash">
{`kubectl patch clusterobservabilityplane default --type=merge -p '{"spec":{"finOpsAgentURL":"http://finops-agent.openchoreo.localhost:11080"}}'`}
</CodeBlock>
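
If the agent URL differs per environment, the merge-patch body can be generated rather than typed by hand. The following is a minimal sketch; `finops_agent_patch` is a hypothetical helper name and is not part of OpenChoreo. It assumes the URL contains no double quotes or backslashes, since it performs no JSON escaping.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of OpenChoreo): prints the merge-patch body
# that sets spec.finOpsAgentURL to the URL passed as the first argument.
# Assumption: the URL needs no JSON escaping.
finops_agent_patch() {
  local url="$1"
  printf '{"spec":{"finOpsAgentURL":"%s"}}' "$url"
}

# Example usage. These commands need cluster access, so they are left
# commented out here:
# kubectl patch clusterobservabilityplane default --type=merge \
#   -p "$(finops_agent_patch "http://finops-agent.openchoreo.localhost:11080")"
# kubectl get clusterobservabilityplane default \
#   -o jsonpath='{.spec.finOpsAgentURL}'
```

Reading the field back with `kubectl get ... -o jsonpath` is one way to confirm that the patch was applied before moving on.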

### Step 4: Verify the installation
### Step 5: Verify the installation

Check that the FinOps Agent pod is running:

@@ -171,7 +176,7 @@ kubectl exec -n openbao openbao-0 -- \
bao kv put secret/finops-sql-backend-uri value="postgresql+asyncpg://<USER>:<PASSWORD>@<HOST>:<PORT>/<DBNAME>"
```

Add the `SQL_BACKEND_URI` key to the ExternalSecret from [Step 1](#step-1-create-the-finops-agent-secret):
Add the `SQL_BACKEND_URI` key to the ExternalSecret from [Step 2](#step-2-create-the-finops-agent-secret):

```bash
kubectl patch externalsecret finops-agent -n openchoreo-observability-plane --type=json \
2 changes: 1 addition & 1 deletion docs/getting-started/try-it-out/on-k3d-locally.mdx
@@ -623,7 +623,7 @@ helm upgrade --install observability-metrics-prometheus \
oci://ghcr.io/openchoreo/helm-charts/observability-metrics-prometheus \
--create-namespace \
--namespace openchoreo-observability-plane \
--version 0.5.1
--version 0.6.1
```

#### Enable logs collection in the configured logs module
@@ -38,7 +38,7 @@ helm upgrade --install observability-metrics-prometheus \
oci://ghcr.io/openchoreo/helm-charts/observability-metrics-prometheus \
--create-namespace \
--namespace openchoreo-observability-plane \
--version 0.5.1
--version 0.6.1

step "Enabling logs collection in the configured logs module..."
helm upgrade observability-logs-opensearch \
2 changes: 1 addition & 1 deletion docs/getting-started/try-it-out/on-your-environment.mdx
@@ -968,7 +968,7 @@ helm upgrade --install observability-metrics-prometheus \
oci://ghcr.io/openchoreo/helm-charts/observability-metrics-prometheus \
--create-namespace \
--namespace openchoreo-observability-plane \
--version 0.5.1
--version 0.6.1
```

#### Install the traces module (OpenSearch)
98 changes: 96 additions & 2 deletions docs/tutorials/component-alerts-and-incidents.mdx
@@ -75,6 +75,7 @@ spec:
enum:
- log
- metric
- budget
description: "The data source type for the alert rule."
query:
type: string
@@ -143,6 +144,10 @@ spec:
type: boolean
default: false
description: "Enables incident creation when this alert fires. When enabled, a corresponding incident will be created in the incident management system."
triggerAiCostAnalysis:
type: boolean
default: false
description: "Enables AI-powered cost analysis when an incident is created for a budget alert. Provides automated cost breakdown and optimization recommendations. Requires incident.enabled to also be true and is only valid for budget source type."
triggerAiRca:
type: boolean
default: false
@@ -153,6 +158,10 @@ spec:
message: "A notification channel is mandatory for alert rules (incident-only rules are not supported). Provide environmentConfigs.actions.notifications.channels or set environment.defaultNotificationChannel."
- rule: "${!has(environmentConfigs.actions) || !has(environmentConfigs.actions.incident) || environmentConfigs.actions.incident.triggerAiRca == false || environmentConfigs.actions.incident.enabled == true}"
message: "incident.enabled must be true when triggerAiRca is true. AI-powered root cause analysis requires incident creation to be enabled."
- rule: "${!has(environmentConfigs.actions) || !has(environmentConfigs.actions.incident) || environmentConfigs.actions.incident.triggerAiCostAnalysis == false || environmentConfigs.actions.incident.enabled == true}"
message: "incident.enabled must be true when triggerAiCostAnalysis is true. AI-powered cost analysis requires incident creation to be enabled."
- rule: "${!has(environmentConfigs.actions) || !has(environmentConfigs.actions.incident) || environmentConfigs.actions.incident.triggerAiCostAnalysis == false || parameters.source.type == 'budget'}"
message: "triggerAiCostAnalysis can only be enabled for budget source type alerts."

creates:
- targetPlane: observabilityplane
@@ -189,7 +198,8 @@ spec:
? environmentConfigs.actions.notifications.channels
: [environment.defaultNotificationChannel]}
incident:
enabled: ${has(environmentConfigs.actions) && has(environmentConfigs.actions.incident) && (environmentConfigs.actions.incident.enabled || environmentConfigs.actions.incident.triggerAiRca)}
enabled: ${has(environmentConfigs.actions) && has(environmentConfigs.actions.incident) && (environmentConfigs.actions.incident.enabled || environmentConfigs.actions.incident.triggerAiRca || environmentConfigs.actions.incident.triggerAiCostAnalysis)}
triggerAiCostAnalysis: ${has(environmentConfigs.actions) && has(environmentConfigs.actions.incident) && environmentConfigs.actions.incident.enabled && environmentConfigs.actions.incident.triggerAiCostAnalysis}
triggerAiRca: ${has(environmentConfigs.actions) && has(environmentConfigs.actions.incident) && environmentConfigs.actions.incident.enabled && environmentConfigs.actions.incident.triggerAiRca}
EOF

@@ -403,6 +413,9 @@ Apply this step to create alert-rule trait instances for:
- `cart` component (metric-based alert)
- Trigger: memory usage of the cart component exceeds 70% for 2 minutes
- Trait instance name: `cartservice-high-memory-alert`
- `redis` component (budget-based alert)
- Trigger: cost of the redis component exceeds USD 2 in 5 minutes
- Trait instance name: `redis-budget-alert`

Attach the alert-rule traits to existing components by appending them to spec.traits if the field already exists,
or creating the array if it does not:
@@ -530,6 +543,47 @@ else
}
]'
fi

# redis component: budget-based alert
if kubectl get component redis -n default \
-o jsonpath='{.spec.traits[*].instanceName}' 2>/dev/null \
| tr ' ' '\n' | grep -qx "redis-budget-alert"; then
echo "Trait 'redis-budget-alert' already exists on component 'redis', skipping."
else
kubectl patch component redis -n default --type=json -p='[
{
"op": "add",
"path": "/spec/traits/-",
"value": {
"name": "observability-alert-rule",
"kind": "ClusterTrait",
"instanceName": "redis-budget-alert",
"parameters": {
"description": "Alert when redis cost for 5mins exceeds USD 2",
"severity": "warning",
"source": { "type": "budget" },
"condition": { "window": "5m", "interval": "1m", "operator": "gt", "threshold": 2 }
}
}
}
]' 2>/dev/null || kubectl patch component redis -n default --type=json -p='[
{
"op": "add",
"path": "/spec/traits",
"value": [{
"name": "observability-alert-rule",
"kind": "ClusterTrait",
"instanceName": "redis-budget-alert",
"parameters": {
"description": "Alert when redis cost for 5mins exceeds USD 2",
"severity": "warning",
"source": { "type": "budget" },
"condition": { "window": "5m", "interval": "1m", "operator": "gt", "threshold": 2 }
}
}]
}
]'
fi
```
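
The idempotency check that guards each patch in the script above (list the existing `instanceName` values, then look for an exact match) can be factored into a small helper. This is a sketch only: `has_trait_instance` is a hypothetical name, and unlike the script's `grep -qx` it uses `grep -qxF`, so the name is matched as a literal string rather than as a regular expression.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of OpenChoreo): reads a space-separated list
# of trait instance names on stdin and succeeds only if the name given as the
# first argument appears as an exact, whole entry.
has_trait_instance() {
  local wanted="$1"
  # One name per line, then a quiet, whole-line, fixed-string match.
  tr ' ' '\n' | grep -qxF -- "$wanted"
}

# Example usage. This needs cluster access, so it is left commented out:
# if kubectl get component redis -n default \
#      -o jsonpath='{.spec.traits[*].instanceName}' 2>/dev/null \
#      | has_trait_instance "redis-budget-alert"; then
#   echo "Trait 'redis-budget-alert' already exists on component 'redis', skipping."
# fi
```

Matching whole lines matters here: a substring match would treat `redis-budget-alert-2` as if `redis-budget-alert` were already attached.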

#### Notification Channels Are Configured Per Environment
@@ -551,6 +605,7 @@ This step creates `ReleaseBinding`s that:
- `frontend` component: overrides `PRODUCT_CATALOG_SERVICE_ADDR` to an invalid endpoint
- `recommendation` component: reduces CPU requests/limits to make high CPU easier to hit
- `cart` component: reduces memory requests/limits to make high memory easier to hit
- `redis` component: inflates CPU and memory requests/limits so the projected cost crosses the budget threshold quickly
- Configure alert behavior for the `development` environment via `traitEnvironmentConfigs`:
- Enable/disable alert rules
- Select notification channels
@@ -638,11 +693,42 @@ spec:
memory: "150Mi"
# Note: traitEnvironmentConfigs is omitted here.
# Defaults: alert enabled, incident creation disabled, no AI RCA, uses environment's default notification channel

---
# ReleaseBinding for redis component with oversized resource requests/limits to trigger a budget alert
apiVersion: openchoreo.dev/v1alpha1
kind: ReleaseBinding
metadata:
name: redis-development
namespace: default
spec:
owner:
projectName: gcp-microservice-demo
componentName: redis
environment: development
componentTypeEnvironmentConfigs:
resources:
requests:
cpu: "500m"
memory: "400Mi"
limits:
cpu: "1000m"
memory: "1000Mi"
traitEnvironmentConfigs:
redis-budget-alert:
enabled: true
actions:
notifications:
channels:
- "webhook-notification-channel-development"
incident:
enabled: true
triggerAiCostAnalysis: true
EOF
```

:::note
`actions.incident.triggerAiRca: true` requires `actions.incident.enabled: true`. AI root cause analysis can only be enabled when incident creation is enabled.
`actions.incident.triggerAiRca: true` and `actions.incident.triggerAiCostAnalysis: true` both require `actions.incident.enabled: true`. `triggerAiCostAnalysis` is only valid for alerts with `source.type: budget`.
:::
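
The constraints in the note can be checked locally before applying a `ReleaseBinding`. The sketch below mirrors the three validation rules from the trait definition as a plain shell function. `check_incident_actions` and its argument order are hypothetical, and the trait's own validation rules remain the authoritative check.

```bash
#!/usr/bin/env bash

# Hypothetical pre-flight check (not part of OpenChoreo).
# Arguments: <source_type> <incident_enabled> <trigger_ai_rca> <trigger_ai_cost_analysis>
# Returns 0 if the combination is allowed, 1 otherwise (reason on stderr).
check_incident_actions() {
  local source_type="$1" enabled="$2" rca="$3" cost="$4"

  # AI root cause analysis requires incident creation.
  if [ "$rca" = "true" ] && [ "$enabled" != "true" ]; then
    echo "incident.enabled must be true when triggerAiRca is true" >&2
    return 1
  fi

  # AI cost analysis requires incident creation.
  if [ "$cost" = "true" ] && [ "$enabled" != "true" ]; then
    echo "incident.enabled must be true when triggerAiCostAnalysis is true" >&2
    return 1
  fi

  # AI cost analysis is only valid for budget alerts.
  if [ "$cost" = "true" ] && [ "$source_type" != "budget" ]; then
    echo "triggerAiCostAnalysis can only be enabled for budget source type alerts" >&2
    return 1
  fi

  return 0
}

# Example: the redis ReleaseBinding settings from this tutorial.
# check_incident_actions budget true false true && echo "valid"
```

For instance, `check_incident_actions metric true false true` fails, because cost analysis is being requested for a non-budget alert.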

### Step 6: Trigger Alerts
@@ -686,6 +772,14 @@ Also you can acknowledge and resolve incidents via the Backstage portal when the

If you have properly configured the [SRE Agent](../ai/sre-agent.mdx), you can verify AI root cause analysis by checking the RCA reports in the Backstage portal when an incident is created.

### Step 9: Verify Budget Alert and AI Cost Analysis

Within a few minutes of applying the redis `ReleaseBinding`, the `redis-budget-alert` should fire because the inflated CPU/memory requests push the cost above the threshold.

- Confirm alert delivery to the configured webhook notification channel.
- Confirm that an incident was created for the budget alert.
- If the FinOps Agent is configured, an **AI cost analysis** report is generated for the incident — view it in the Backstage portal alongside the incident. The cost analysis report provides a cost optimization recommendation and lets you apply the recommendation automatically.

## Summary

You attached OpenChoreo **observability alert rules** to existing components (as `observability-alert-rule` traits), configured **email** and **webhook** notification channels per environment, and enabled **incident creation** (plus AI root cause analysis) via `ReleaseBinding` `traitEnvironmentConfigs`.