8 changes: 6 additions & 2 deletions docs/.custom_wordlist.txt
Original file line number Diff line number Diff line change
@@ -52,6 +52,7 @@ Furo
gb
gh
Gi
GiB
GitHub
github
GitOps
@@ -89,6 +90,7 @@ LogQL
loki
Makefile
matchers
Mem
MetalLB
MetricsEndpointProvider
Microceph
@@ -182,8 +184,8 @@ TLS
tls
TOC
toctree
Traefik
Traefik's
traefik
traefik's
txt
ubuntu
UI
@@ -194,6 +196,8 @@ unencrypted
URL
utils
uv
vCPU
vCPUs
venv
visualizes
VMs
22 changes: 14 additions & 8 deletions docs/explanation/generic-rules.md
@@ -1,13 +1,14 @@
---
myst:
html_meta:
description: "Understand COS generic alert rules for host health, including HostHealth and AggregatorHostHealth behavior and scope."
---

# Generic alert rule groups

The Canonical Observability Stack (COS) includes Generic alert rules which provide a minimal set of rules to inform admins when hosts in a deployment are unhealthy, unreachable, or otherwise unresponsive. This helps relieve charm authors from having to implement their host-health-related alerts per charm.
There are two generic alert rule groups: `HostHealth` and `AggregatorHostHealth`, each containing multiple alert rules.
This guide explains the purpose of each rule group and its alerts. For steps to troubleshoot firing alert rules, refer to the [troubleshooting guide](../how-to/troubleshooting).

The `HostHealth` and `AggregatorHostHealth` alert rule groups are applicable to the following deployment scenarios:

@@ -28,23 +29,26 @@ end
grafana-agent ---|prometheus_remote_write| prometheus
cos-proxy ---|metrics_endpoint| prometheus
```

You can find more information on these groups and the alert rules they contain below.

## `HostHealth` alert group

The `HostHealth` alert rule group contains the `HostDown` and `HostMetricsMissing` alert rules, which identify scenarios where a target is unreachable.

### `HostDown` alert rule

The `HostDown` alert rule is directly applicable to cases where a charm is being scraped by Prometheus for metrics. This rule notifies you when Prometheus (or Mimir) fails to scrape its target. The alert expression executes `up{...} < 1` with labels including the target's Juju topology: `juju_model`, `juju_application`, etc. The [`up` metric](https://prometheus.io/docs/concepts/jobs_instances/), on which this alert's expression relies, indicates the health or reachability of a node. For example, when `up` is 1 for a charm, Prometheus is successfully calling that charm's metrics endpoint and accessing the metrics exposed there.

This alert is especially important for COS Lite, where Prometheus scrapes charms for metrics directly. When this alert fires, Prometheus is unable to scrape a target for metrics, so `up` is 0.
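As an illustration of the shape of such a rule (a minimal sketch, not the exact rule shipped with COS — the topology label values and `for` duration below are hypothetical), a Prometheus alerting rule built on `up` could look like:

```yaml
groups:
  - name: HostHealth
    rules:
      - alert: HostDown
        # Fires when the scrape of this target fails (up < 1).
        # The juju_* label values are placeholders for the target's Juju topology.
        expr: up{juju_model="my-model", juju_application="my-app"} < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Scrape of {{ $labels.juju_application }} is failing"
```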


### `HostMetricsMissing` alert rule

```{note}
`HostMetricsMissing` is also used in the `AggregatorHostHealth` group. As part of the `HostHealth` group, however, it monitors the health of any charm (not just aggregators) whose metrics are collected by an aggregator and then remote written to a metrics backend. See the `AggregatorHostHealth` group for details on the distinction.
```

This alert notifies you when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether scrape succeeded. The alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, `juju_unit`, etc.

Like the `HostDown` rule, this rule gives you an idea of the health of a node and whether it is reachable. However, unlike `HostDown`, `HostMetricsMissing` applies to scenarios where metrics from a charm are remote written into Prometheus or Mimir rather than scraped. This rule is especially important in COS, where the use of Mimir instead of Prometheus requires metrics to be remote written (as Mimir does not scrape).
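A sketch of how an `absent()`-based expression differs from the `up < 1` check — again with hypothetical label values, not the exact COS rule:

```yaml
- alert: HostMetricsMissing
  # absent() returns 1 when no matching `up` time series exists at all,
  # which catches the remote-write path going silently dark even if the
  # aggregator's own scrape of the target succeeded.
  expr: absent(up{juju_model="my-model", juju_application="my-app", juju_unit="my-app/0"})
  for: 5m
  labels:
    severity: warning
```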

@@ -54,9 +58,11 @@ To provide an example that distinguishes between `HostDown` and `HostMetricsMiss
- In COS HA, a collector such as `opentelemetry-collector` scrapes Alertmanager and then remote writes the collected metrics into Mimir. In this scenario, in Mimir, we either have an `up` of 1 or no `up` series at all. Here, we need the `HostMetricsMissing` alert to be aware of the health of Alertmanager. Note that the aggregator's scrape of Alertmanager may succeed while `up` is still missing, because the aggregator is failing to remote write what it has scraped.

## `AggregatorHostHealth` alert group

The `AggregatorHostHealth` alert rule group focuses explicitly on the health of aggregators (remote writers), such as `opentelemetry-collector` and `grafana-agent`. This group contains the `HostMetricsMissing` and the `AggregatorMetricsMissing` alert rules.

### `HostMetricsMissing` alert rule

The `HostMetricsMissing` alert rule fires when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether scrape succeeded. The alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, `juju_unit`, etc. However, when it comes to aggregators, this rule indicates whether alerts from a collector itself are reaching the metrics backend.

When you have an aggregator charm (e.g. `opentelemetry-collector` or `grafana-agent`), this alert is duplicated per unit of that aggregator so that it identifies if a unit is missing a time series. For example, if you have 2 units of `opentelemetry-collector`, and one is behind a restrictive firewall, you should receive only one firing `HostMetricsMissing` alert.
@@ -66,9 +72,9 @@ By default, the severity of this alert is `warning`. However, when this alert is
```

### `AggregatorMetricsMissing` alert rule

Similar to `HostMetricsMissing`, this alert is applied to aggregators to ensure their `up` metric exists. The difference, however, is that `AggregatorMetricsMissing` **triggers only when _all units_ of an aggregator are down**. For this reason, the alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, but leaves out `juju_unit`. If you have 2 units of an aggregator and the `up` metric is missing for both, this alert will fire.

```{note}
By default, the severity of this alert is **always** `critical`.
```
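To illustrate the difference in label matching (with hypothetical values, not the exact COS rule), note that the expression below omits the `juju_unit` matcher, so `absent()` only returns 1 when no unit of the aggregator reports an `up` series:

```yaml
- alert: AggregatorMetricsMissing
  # No juju_unit matcher: a single healthy unit still produces a matching
  # `up` series, so this fires only when all units are down.
  expr: absent(up{juju_model="my-model", juju_application="opentelemetry-collector"})
  labels:
    severity: critical
```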

68 changes: 68 additions & 0 deletions docs/explanation/index.md
@@ -0,0 +1,68 @@
---
myst:
html_meta:
description: "Understand COS architecture and design decisions, telemetry models, Juju topology, stack variants, and alerting."
---

(explanation)=

# Explanation

These pages provide conceptual background and design intent for the COS
stack. Use this section to understand the why and how behind
our architecture, telemetry model, and operational choices.

## Overview

A high-level introduction to observability and the model-driven approach that COS uses.

```{toctree}
:maxdepth: 1

What is Observability? <https://canonical.com/observability/what-is-observability>
Model-Driven Observability <https://ubuntu.com/blog/tag/model-driven-observability>
```

## Topology & stack variants

Information about deployment topology, the Juju model layout, and the
different stack variants available (COS, and COS Lite).

```{toctree}
:maxdepth: 1

Juju Topology <juju-topology>
Stack variants <stack-variants>
```

## Architecture & design

These pages describe the architecture decisions, design goals, and the
telemetry pipelines we rely on. They are useful when evaluating how COS fits
into your observability strategy or when designing integrations.

```{toctree}
:maxdepth: 1

Design Goals <design-goals>
Logging Architecture <logging-architecture>
Telemetry Flow <telemetry-flow>
Telemetry Correlation <telemetry-correlation>
Telemetry Labels <telemetry-labels>
OpenTelemetry Protocol (OTLP) Juju Topology Labels <telemetry-otlp-topology-labels>
Data Integrity <data-integrity>
```

## Alerting & rules

Guidance about built-in alerting, charmed alert rules and how rules are
designed and managed across the stack.

```{toctree}
:maxdepth: 1

Charmed alert rules <charmed-rules>
Generic alert rules <generic-rules>
Dashboard upgrades and deduplication <dashboard-upgrades>
```
66 changes: 0 additions & 66 deletions docs/explanation/index.rst

This file was deleted.

72 changes: 72 additions & 0 deletions docs/how-to/index.md
@@ -0,0 +1,72 @@
---
myst:
html_meta:
description: "Practical how-to guides for operating Canonical Observability Stack, including migration, integration, telemetry configuration, and troubleshooting tasks."
---

(how-to)=

# How-to guides

These guides accompany you through the complete COS stack operations life cycle.

```{note}
If you are looking for instructions on how to get started with COS Lite, see
{ref}`the tutorial section <tutorial>`.
```

## Validating

These guides will help you validate new and existing deployments.

```{toctree}
:maxdepth: 1

Validate COS deployment <validate-cos-deployment>
```

## Migrating

These guides will assist existing users of other observability stacks offered by
Canonical in migrating to COS Lite or the full COS.

```{toctree}
:maxdepth: 1

Cross-track upgrade instructions <upgrade>
Migrate from LMA to COS Lite <migrate-lma-to-cos-lite>
Migrate from Grafana Agent to OpenTelemetry Collector <migrate-grafana-agent-to-otelcol>
```

## Configuring

Once COS has been deployed, the next natural step is to integrate your charms and workloads
with COS so you can observe them.

```{toctree}
:maxdepth: 1

Evaluate telemetry volume <evaluate-telemetry-volume>
Add tracing to COS Lite <add-tracing-to-cos-lite>
Add alert rules <adding-alert-rules>
Configure scrape jobs <configure-scrape-jobs>
Expose a metrics endpoint <exposing-a-metrics-endpoint>
Integrate COS Lite with uncharmed applications <integrating-cos-lite-with-uncharmed-applications>
Disable built-in charm alert rules <disable-charmed-rules>
Testing with MinIO <deploy-s3-integrator-and-minio>
Configure TLS encryption <configure-tls-encryption>
Selectively drop telemetry using scrape config <selectively-drop-telemetry-scrape-config>
Selectively drop telemetry using opentelemetry-collector <selectively-drop-telemetry-otelcol>
Tier OpenTelemetry Collector with different pipelines per data stream <tiered-otelcols>
```

## Troubleshooting

During continuous operations, you might sometimes run into issues that you need to resolve. These
how-to guides will assist you in troubleshooting COS in an effective manner.

```{toctree}
:maxdepth: 1

Troubleshooting <troubleshooting>
```
76 changes: 0 additions & 76 deletions docs/how-to/index.rst

This file was deleted.
