8 changes: 6 additions & 2 deletions docs/.custom_wordlist.txt
Original file line number Diff line number Diff line change
@@ -52,6 +52,7 @@ Furo
gb
gh
Gi
GiB
GitHub
github
GitOps
@@ -89,6 +90,7 @@ LogQL
loki
Makefile
matchers
Mem
MetalLB
MetricsEndpointProvider
Microceph
@@ -182,8 +184,8 @@ TLS
tls
TOC
toctree
Traefik
Traefik's
traefik
traefik's
txt
ubuntu
UI
@@ -194,6 +196,8 @@ unencrypted
URL
utils
uv
vCPU
vCPUs
venv
visualizes
VMs
22 changes: 14 additions & 8 deletions docs/explanation/generic-rules.md
@@ -1,13 +1,14 @@
---
myst:
html_meta:
description: "Understand COS generic alert rules for host health, including HostHealth and AggregatorHostHealth behavior and scope."
---

# Generic alert rule groups

The Canonical Observability Stack (COS) includes Generic alert rules which provide a minimal set of rules to inform admins when hosts in a deployment are unhealthy, unreachable, or otherwise unresponsive. This helps relieve charm authors from having to implement their host-health-related alerts per charm.
There are two generic alert rule groups: `HostHealth` and `AggregatorHostHealth`, each containing multiple alert rules.
This guide explains the purpose of each rule group and its alerts. For steps to troubleshoot firing alert rules, refer to the [troubleshooting guide](../how-to/troubleshooting).

The `HostHealth` and `AggregatorHostHealth` alert rule groups are applicable to the following deployment scenarios:

@@ -28,23 +29,26 @@ end
grafana-agent ---|prometheus_remote_write| prometheus
cos-proxy ---|metrics_endpoint| prometheus
```

You can find more information on these groups and the alert rules they contain below.

## `HostHealth` alert group

The `HostHealth` alert rule group contains the `HostDown` and `HostMetricsMissing` alert rules, which identify scenarios where a target is unreachable.

### `HostDown` alert rule

The `HostDown` alert rule is directly applicable to cases where a charm is being scraped by Prometheus for metrics. This rule notifies you when Prometheus (or Mimir) fails to scrape its target. The alert expression executes `up{...} < 1` with labels including the target's Juju topology: `juju_model`, `juju_application`, etc. The [`up` metric](https://prometheus.io/docs/concepts/jobs_instances/), on which this alert's expression relies, indicates the health or reachability of a node. For example, when `up` is 1 for a charm, Prometheus is successfully calling that charm's metrics endpoint and accessing the metrics exposed there.

This alert is especially important for COS Lite, where Prometheus scrapes charms for metrics directly. When this alert fires, Prometheus is unable to scrape a target for metrics, so `up` is 0.
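As an illustration of the shape of such a rule (a minimal sketch, not the exact rule shipped with COS — the topology label values and `for` duration below are hypothetical), a Prometheus alerting rule built on `up` could look like:

```yaml
groups:
  - name: HostHealth
    rules:
      - alert: HostDown
        # Fires when the scrape of this target fails (up < 1).
        # The juju_* label values are placeholders for the target's Juju topology.
        expr: up{juju_model="my-model", juju_application="my-app"} < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Scrape of {{ $labels.juju_application }} is failing"
```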


### `HostMetricsMissing` alert rule

```{note}
`HostMetricsMissing` is also used in the `AggregatorHostHealth` group. As part of the `HostHealth` group, however, it monitors the health of any charm (not just aggregators) whose metrics are collected by an aggregator and then remote written to a metrics backend. See the `AggregatorHostHealth` group for details on the distinction.
```

This alert notifies you when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether scrape succeeded. The alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, `juju_unit`, etc.

Like the `HostDown` rule, this rule gives you an idea of the health of a node and whether it is reachable. However, unlike `HostDown`, `HostMetricsMissing` applies to scenarios where metrics from a charm are remote written into Prometheus or Mimir rather than scraped. This rule is especially important in COS, where the use of Mimir instead of Prometheus requires metrics to be remote written (as Mimir does not scrape).
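A sketch of how an `absent()`-based expression differs from the `up < 1` check — again with hypothetical label values, not the exact COS rule:

```yaml
- alert: HostMetricsMissing
  # absent() returns 1 when no matching `up` time series exists at all,
  # which catches the remote-write path going silently dark even if the
  # aggregator's own scrape of the target succeeded.
  expr: absent(up{juju_model="my-model", juju_application="my-app", juju_unit="my-app/0"})
  for: 5m
  labels:
    severity: warning
```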

@@ -54,9 +58,11 @@ To provide an example that distinguishes between `HostDown` and `HostMetricsMiss
- In COS HA, a collector such as `opentelemetry-collector` scrapes Alertmanager and then remote writes the collected metrics into Mimir. In this scenario, in Mimir, we either have an `up` of 1 or no `up` series at all. Here, we need the `HostMetricsMissing` alert to be aware of the health of Alertmanager. Note that the aggregator's scrape of Alertmanager may succeed while `up` is still missing, because the aggregator is failing to remote write what it has scraped.

## `AggregatorHostHealth` alert group

The `AggregatorHostHealth` alert rule group focuses explicitly on the health of aggregators (remote writers), such as `opentelemetry-collector` and `grafana-agent`. This group contains the `HostMetricsMissing` and the `AggregatorMetricsMissing` alert rules.

### `HostMetricsMissing` alert rule

The `HostMetricsMissing` alert rule fires when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether scrape succeeded. The alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, `juju_unit`, etc. However, when it comes to aggregators, this rule indicates whether alerts from a collector itself are reaching the metrics backend.

When you have an aggregator charm (e.g. `opentelemetry-collector` or `grafana-agent`), this alert is duplicated per unit of that aggregator so that it identifies if a unit is missing a time series. For example, if you have 2 units of `opentelemetry-collector`, and one is behind a restrictive firewall, you should receive only one firing `HostMetricsMissing` alert.
@@ -66,9 +72,9 @@ By default, the severity of this alert is `warning`. However, when this alert is
```

### `AggregatorMetricsMissing` alert rule

Similar to `HostMetricsMissing`, this alert is applied to aggregators to ensure their `up` metric exists. The difference, however, is that `AggregatorMetricsMissing` **triggers only when _all units_ of an aggregator are down**. For this reason, the alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, but leaves out `juju_unit`. If you have 2 units of an aggregator and the `up` metric is missing for both, this alert will fire.

```{note}
By default, the severity of this alert is **always** `critical`.
```
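To illustrate the difference in label matching (with hypothetical values, not the exact COS rule), note that the expression below omits the `juju_unit` matcher, so `absent()` only returns 1 when no unit of the aggregator reports an `up` series:

```yaml
- alert: AggregatorMetricsMissing
  # No juju_unit matcher: a single healthy unit still produces a matching
  # `up` series, so this fires only when all units are down.
  expr: absent(up{juju_model="my-model", juju_application="opentelemetry-collector"})
  labels:
    severity: critical
```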

68 changes: 68 additions & 0 deletions docs/explanation/index.md
@@ -0,0 +1,68 @@
---
myst:
html_meta:
description: "Understand COS architecture and design decisions, telemetry models, Juju topology, stack variants, and alerting."
---

(explanation)=

# Explanation

These pages provide conceptual background and design intent for the COS
stack. Use this section to understand the why and how behind
our architecture, telemetry model, and operational choices.

## Overview

A high-level introduction to observability and the model-driven approach that COS uses.

```{toctree}
:maxdepth: 1

What is Observability? <https://canonical.com/observability/what-is-observability>
Model-Driven Observability <https://ubuntu.com/blog/tag/model-driven-observability>
```

## Topology & stack variants

Information about deployment topology, the Juju model layout, and the
different stack variants available (COS, and COS Lite).

```{toctree}
:maxdepth: 1

Juju Topology <juju-topology>
Stack variants <stack-variants>
```

## Architecture & design

These pages describe the architecture decisions, design goals, and the
telemetry pipelines we rely on. They are useful when evaluating how COS fits
into your observability strategy or when designing integrations.

```{toctree}
:maxdepth: 1

Design Goals <design-goals>
Logging Architecture <logging-architecture>
Telemetry Flow <telemetry-flow>
Telemetry Correlation <telemetry-correlation>
Telemetry Labels <telemetry-labels>
OpenTelemetry Protocol (OTLP) Juju Topology Labels <telemetry-otlp-topology-labels>
Data Integrity <data-integrity>
```

## Alerting & rules

Guidance about built-in alerting, charmed alert rules and how rules are
designed and managed across the stack.

```{toctree}
:maxdepth: 1

Charmed alert rules <charmed-rules>
Generic alert rules <generic-rules>
Dashboard upgrades and deduplication <dashboard-upgrades>
```
66 changes: 0 additions & 66 deletions docs/explanation/index.rst

This file was deleted.

72 changes: 72 additions & 0 deletions docs/how-to/index.md
@@ -0,0 +1,72 @@
---
myst:
html_meta:
description: "Practical how-to guides for operating Canonical Observability Stack, including migration, integration, telemetry configuration, and troubleshooting tasks."
---

(how-to)=

# How-to guides

These guides accompany you through the complete COS stack operations life cycle.

```{note}
If you are looking for instructions on how to get started with COS Lite, see
{ref}`the tutorial section <tutorial>`.
```

## Validating

These guides will help you validate new and existing deployments.

```{toctree}
:maxdepth: 1

Validate COS deployment <validate-cos-deployment>
```

## Migrating

These guides will assist existing users of other observability stacks offered by
Canonical in migrating to COS Lite or the full COS.

```{toctree}
:maxdepth: 1

Cross-track upgrade instructions <upgrade>
Migrate from LMA to COS Lite <migrate-lma-to-cos-lite>
Migrate from Grafana Agent to OpenTelemetry Collector <migrate-grafana-agent-to-otelcol>
```

## Configuring

Once COS has been deployed, the next natural step is to integrate your charms and workloads
with COS so you can observe them.

```{toctree}
:maxdepth: 1

Evaluate telemetry volume <evaluate-telemetry-volume>
Add tracing to COS Lite <add-tracing-to-cos-lite>
Add alert rules <adding-alert-rules>
Configure scrape jobs <configure-scrape-jobs>
Expose a metrics endpoint <exposing-a-metrics-endpoint>
Integrate COS Lite with uncharmed applications <integrating-cos-lite-with-uncharmed-applications>
Disable built-in charm alert rules <disable-charmed-rules>
Testing with MinIO <deploy-s3-integrator-and-minio>
Configure TLS encryption <configure-tls-encryption>
Selectively drop telemetry using scrape config <selectively-drop-telemetry-scrape-config>
Selectively drop telemetry using opentelemetry-collector <selectively-drop-telemetry-otelcol>
Tier OpenTelemetry Collector with different pipelines per data stream <tiered-otelcols>
```

## Troubleshooting

During continuous operations, you might sometimes run into issues that you need to resolve. These
how-to guides will assist you in troubleshooting COS in an effective manner.

```{toctree}
:maxdepth: 1

Troubleshooting <troubleshooting>
```
76 changes: 0 additions & 76 deletions docs/how-to/index.rst

This file was deleted.
