diff --git a/docs/.custom_wordlist.txt b/docs/.custom_wordlist.txt index 36b57d1f..47521a7f 100644 --- a/docs/.custom_wordlist.txt +++ b/docs/.custom_wordlist.txt @@ -52,6 +52,7 @@ Furo gb gh Gi +GiB GitHub github GitOps @@ -89,6 +90,7 @@ LogQL loki Makefile matchers +Mem MetalLB MetricsEndpointProvider Microceph @@ -182,8 +184,8 @@ TLS tls TOC toctree -Traefik -Traefik's +traefik +traefik's txt ubuntu UI @@ -194,6 +196,8 @@ unencrypted URL utils uv +vCPU +vCPUs venv visualizes VMs diff --git a/docs/explanation/generic-rules.md b/docs/explanation/generic-rules.md index 6331ad30..dc6ca8a6 100644 --- a/docs/explanation/generic-rules.md +++ b/docs/explanation/generic-rules.md @@ -1,13 +1,14 @@ --- myst: - html_meta: - description: "Understand COS generic alert rules for host health, including HostHealth and AggregatorHostHealth behavior and scope." + html_meta: + description: "Understand COS generic alert rules for host health, including HostHealth and AggregatorHostHealth behavior and scope." --- # Generic alert rule groups -The Canonical Observability Stack (COS) includes Generic alert rules which provide a minimal set of rules to inform admins when hosts in a deployment are unhealthy, unreachable, or otherwise unresponsive. This helps relieve charm authors from having to implement their host-health-related alerts per charm. + +The Canonical Observability Stack (COS) includes Generic alert rules which provide a minimal set of rules to inform admins when hosts in a deployment are unhealthy, unreachable, or otherwise unresponsive. This helps relieve charm authors from having to implement their host-health-related alerts per charm. There are two generic alert rule groups: `HostHealth` and `AggregatorHostHealth`, each containing multiple alert rules. -This guide explains the purpose of each rule group and its alerts. For steps to troubleshoot firing alert rules, refer to the [troubleshooting guide](../how-to/troubleshooting/troubleshoot-firing-alert-rules.md). 
+This guide explains the purpose of each rule group and its alerts. For steps to troubleshoot firing alert rules, refer to the [troubleshooting guide](../how-to/troubleshooting). The `HostHealth` and `AggregatorHostHealth` alert rule groups are applicable to the following deployment scenarios: @@ -28,23 +29,26 @@ end grafana-agent ---|prometheus_remote_write| prometheus cos-proxy ---|metrics_endpoint| prometheus ``` + You can find more information on these groups and the alert rules they contain below. ## `HostHealth` alert group + The `HostHealth` alert rule group contains the `HostDown` and `HostMetricsMissing` alert rules, identifying unreachable target scenarios. ### `HostDown` alert rule + The `HostDown` alert rule is directly applicable to cases where a charm is being scraped by Prometheus for metrics. This rule notifies you when Prometheus (or Mimir) fails to scrape its target. The alert expression executes `up{...} < 1` with labels including the target's Juju topology: `juju_model`, `juju_application`, etc. The [`up` metric](https://prometheus.io/docs/concepts/jobs_instances/), which is what this alert's expression relies on, indicates the health or reachability status of a node. For example, when `up` is 1 for a charm, this is a sign that Prometheus is able to successfully call the metrics endpoint of that charm and access the metrics that are exposed at that endpoint. This alert is especially important for COS Lite, where Prometheus is capable of scraping charms for metrics. The firing of this alert indicates that Prometheus is not able to scrape a target for metrics, leading to `up` being 0. - ### `HostMetricsMissing` alert rule + ```{note} `HostMetricsMissing` is also used in the `AggregatorHostHealth` group. As part of the `HostHealth` group, however, it monitors the health of any charm (not just aggregators) whose metrics are collected by an aggregator and then remote written to a metrics backend. 
See the `AggregatorHostHealth` group for details on the distinction. ``` -This alert notifies you when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether scrape succeeded. The alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, `juju_unit`, etc. +This alert notifies you when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether scrape succeeded. The alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, `juju_unit`, etc. Like the `HostDown` rule, this rule gives you an idea of the health of a node and whether it is reachable. However, unlike `HostDown`, `HostMetricsMissing` is used in scenarios where metrics from a charm are remote written into Prometheus or Mimir, as opposed to being scraped. This rule is especially important in COS, where the use of Mimir instead of Prometheus warrants metrics to be remote written (as Mimir does not scrape). @@ -54,9 +58,11 @@ To provide an example that distinguishes between `HostDown` and `HostMetricsMiss - In COS HA, a collector such as `opentelemetry-collector` scrapes Alertmanager and then remote writes the collected metrics into Mimir. In this scenario, in Mimir, we either have an `up` of 1 or an absent `up` altogether. Here, we need the `HostMetricsMissing` alert to be aware of the health of Alertmanager. Note that it is possible that the scrape of Alertmanager being made by the aggregator is successful and that `up` is missing because the aggregator is failing to remote write what it has scraped. ## `AggregatorHostHealth` alert group + The `AggregatorHostHealth` alert rule group focuses explicitly on the health of aggregators (remote writers), such as `opentelemetry-collector` and `grafana-agent`. This group contains the `HostMetricsMissing` and the `AggregatorMetricsMissing` alert rules. 
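As a rough sketch of the distinction (illustrative only; the label values, `for` durations, and exact expressions below are assumptions, not the rules as shipped by COS), the two expressions differ mainly in whether `juju_unit` is matched:

```yaml
groups:
  - name: AggregatorHostHealth
    rules:
      # Fires per unit: that unit's own `up` series is absent from the backend.
      - alert: HostMetricsMissing
        expr: absent(up{juju_model="cos", juju_application="opentelemetry-collector", juju_unit="opentelemetry-collector/0"})
        for: 5m
        labels:
          severity: warning
      # Fires only when the series is absent for the whole application
      # (no `juju_unit` matcher), i.e. all units are down.
      - alert: AggregatorMetricsMissing
        expr: absent(up{juju_model="cos", juju_application="opentelemetry-collector"})
        for: 5m
        labels:
          severity: critical
```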
### `HostMetricsMissing` alert rule + The `HostMetricsMissing` alert rule fires when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether scrape succeeded. The alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, `juju_unit`, etc. However, when it comes to aggregators, this rule indicates whether alerts from a collector itself are reaching the metrics backend. When you have an aggregator charm (e.g. `opentelemetry-collector` or `grafana-agent`), this alert is duplicated per unit of that aggregator so that it identifies if a unit is missing a time series. For example, if you have 2 units of `opentelemetry-collector`, and one is behind a restrictive firewall, you should receive only one firing `HostMetricsMissing` alert. @@ -66,9 +72,9 @@ By default, the severity of this alert is `warning`. However, when this alert is ``` ### `AggregatorMetricsMissing` alert rule -Similar to `HostMetricsMissing`, this alert is applied to aggregators to ensure their `up` metric exists. The difference, however, is that `AggregatorMetricsMissing` **triggers only when *all units* of an aggregator are down**. For this reason, the alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, but leaves out `juju_unit`. If you have 2 units of an aggregator and the `up` metric is missing for both, this alert will fire. + +Similar to `HostMetricsMissing`, this alert is applied to aggregators to ensure their `up` metric exists. The difference, however, is that `AggregatorMetricsMissing` **triggers only when _all units_ of an aggregator are down**. For this reason, the alert expression executes `absent(up{...})` with labels including the aggregator's Juju topology: `juju_model`, `juju_application`, but leaves out `juju_unit`. 
If you have 2 units of an aggregator and the `up` metric is missing for both, this alert will fire. ```{note} By default, the severity of this alert is **always** `critical`. ``` - diff --git a/docs/explanation/index.md b/docs/explanation/index.md new file mode 100644 index 00000000..2f819bc2 --- /dev/null +++ b/docs/explanation/index.md @@ -0,0 +1,68 @@ +--- +myst: + html_meta: + description: "Understand COS architecture and design decisions, telemetry models, Juju topology, stack variants, and alerting." +--- + +(explanation)= + +# Explanation + +These pages provide conceptual background and design intent for the COS +stack. Use this section to understand the why and how behind +our architecture, telemetry model, and operational choices. + +## Overview + +A high-level introduction to observability and the model-driven approach COS makes +use of. + +```{toctree} +:maxdepth: 1 + +What is Observability? +Model-Driven Observability +``` + +## Topology & stack variants + +Information about deployment topology, the Juju model layout, and the +different stack variants available (COS, and COS Lite). + +```{toctree} +:maxdepth: 1 + +Juju Topology +Stack variants +``` + +## Architecture & design + +These pages describe the architecture decisions, design goals, and the +telemetry pipelines we rely on. They are useful when evaluating how COS fits +into your observability strategy or when designing integrations. + +```{toctree} +:maxdepth: 1 + +Design Goals +Logging Architecture +Telemetry Flow +Telemetry Correlation +Telemetry Labels +Opentelemetry Protocol (OTLP) Juju Topology Labels +Data Integrity +``` + +## Alerting & rules + +Guidance about built-in alerting, charmed alert rules and how rules are +designed and managed across the stack. 
+ +```{toctree} +:maxdepth: 1 + +Charmed alert rules +Generic alert rules +Dashboard upgrades and deduplication +``` diff --git a/docs/explanation/index.rst b/docs/explanation/index.rst deleted file mode 100644 index 9e8a7757..00000000 --- a/docs/explanation/index.rst +++ /dev/null @@ -1,66 +0,0 @@ -.. meta:: - :description: Understand COS architecture and design decisions, telemetry models, Juju topology, stack variants, and alerting. - -.. _explanation: - -Explanation -*********** - -These pages provide conceptual background and design intent for the COS -stack. Use this section to understand the why and how behind -our architecture, telemetry model, and operational choices. - -Overview -======== - -A high-level introduction to observability and the model-driven approach COS makes -use of. - -.. toctree:: - :maxdepth: 1 - - What is Observability? - Model-Driven Observability - -Topology & stack variants -========================= - -Information about deployment topology, the Juju model layout, and the -different stack variants available (COS, and COS Lite). - -.. toctree:: - :maxdepth: 1 - - Juju Topology - Stack variants - -Architecture & design -===================== - -These pages describe the architecture decisions, design goals, and the -telemetry pipelines we rely on. They are useful when evaluating how COS fits -into your observability strategy or when designing integrations. - -.. toctree:: - :maxdepth: 1 - - Design Goals - Logging Architecture - Telemetry Flow - Telemetry Correlation - Telemetry Labels - Opentelemetry Protocol (OTLP) Juju Topology Labels - Data Integrity - -Alerting & rules -================= - -Guidance about built-in alerting, charmed alert rules and how rules are -designed and managed across the stack. - -.. 
toctree::
-   :maxdepth: 1
-
-   Charmed alert rules
-   Generic alert rules
-   Dashboard upgrades and deduplication
diff --git a/docs/how-to/index.md b/docs/how-to/index.md
new file mode 100644
index 00000000..1567b759
--- /dev/null
+++ b/docs/how-to/index.md
@@ -0,0 +1,72 @@
+---
+myst:
+  html_meta:
+    description: "Practical how-to guides for operating Canonical Observability Stack, including migration, integration, telemetry configuration, and troubleshooting tasks."
+---
+
+(how-to)=
+
+# How-to guides
+
+These guides accompany you through the complete COS stack operations life cycle.
+
+```{note}
+If you are looking for instructions on how to get started with COS Lite, see
+{ref}`the tutorial section `.
+```
+
+## Validating
+
+These guides help you validate new and existing deployments.
+
+```{toctree}
+:maxdepth: 1
+
+Validate COS deployment
+```
+
+## Migrating
+
+These guides will assist existing users of other observability stacks offered by
+Canonical in migrating to COS Lite or the full COS.
+
+```{toctree}
+:maxdepth: 1
+
+Cross-track upgrade instructions
+Migrate from LMA to COS Lite
+Migrate from Grafana Agent to OpenTelemetry Collector
+```
+
+## Configuring
+
+Once COS has been deployed, the next natural step is to integrate your charms and workloads
+with COS to actually observe them.
+
+```{toctree}
+:maxdepth: 1
+
+Evaluate telemetry volume
+Add tracing to COS Lite
+Add alert rules
+Configure scrape jobs
+Expose a metrics endpoint
+Integrate COS Lite with uncharmed applications
+Disable built-in charm alert rules
+Testing with Minio
+Configure TLS encryption
+Selectively drop telemetry using scrape config
+Selectively drop telemetry using opentelemetry-collector
+Tier OpenTelemetry Collector with different pipelines per data stream
+```
+
+## Troubleshooting
+
+During continuous operations, you might sometimes run into issues that you need to resolve. These
+how-to guides will assist you in troubleshooting COS in an effective manner.
+ +```{toctree} +:maxdepth: 1 + +Troubleshooting +``` diff --git a/docs/how-to/index.rst b/docs/how-to/index.rst deleted file mode 100644 index 18aabe7e..00000000 --- a/docs/how-to/index.rst +++ /dev/null @@ -1,76 +0,0 @@ -.. meta:: - :description: Practical how-to guides for operating Canonical Observability Stack, including migration, integration, telemetry configuration, and troubleshooting tasks. - -.. _how-to: - -How-to guides -************* - -These guides accompany you through the complete COS stack operations life cycle. - - -.. note:: - - If you are looking for instructions on how to get started with COS Lite, see - :ref:`the tutorial section `. - -Validating -========== - -These guides will help validating new and existing deployments. - -.. toctree:: - :maxdepth: 1 - - Validate COS deployment - -Migrating -========= - -These guides till assist existing users of other observability stacks offered by -Canonical in migrating to COS Lite or the full COS. - -.. toctree:: - :maxdepth: 1 - - Cross-track upgrade instructions - Migrate from LMA to COS Lite - Migrate from Grafana Agent to OpenTelemetry Collector - -Configuring -============= - -Once COS has been deployed, the next natural step would be to integrate your charms and workloads -with COS to actually observe them. - -.. toctree:: - :maxdepth: 1 - - Evaluate telemetry volume - Add tracing to COS Lite - Add alert rules - Configure scrape jobs - Expose a metrics endpoint - Integrate COS Lite with uncharmed applications - Disable built-in charm alert rules - Testing with Minio - Configure TLS encryption - Selectively drop telemetry using scrape config - Selectively drop telemetry using opentelemetry-collector - Tier OpenTelemetry Collector with different pipelines per data stream - -Troubleshooting -=============== - -During continuous operations, you might sometimes run into issues that you need to resolve. These -how-to guides will assist you in troubleshooting COS in an effective manner. - -.. 
toctree:: - :maxdepth: 1 - :hidden: - - Troubleshooting - -- `Troubleshoot "Gateway Address Unavailable" in Traefik `_ -- `Troubleshoot "socket: too many open files" `_ -- `Troubleshoot integrations `_ diff --git a/docs/how-to/migrate-gagent-to-otelcol.md b/docs/how-to/migrate-grafana-agent-to-otelcol.md similarity index 100% rename from docs/how-to/migrate-gagent-to-otelcol.md rename to docs/how-to/migrate-grafana-agent-to-otelcol.md diff --git a/docs/how-to/troubleshooting.md b/docs/how-to/troubleshooting.md new file mode 100644 index 00000000..54f3a9b0 --- /dev/null +++ b/docs/how-to/troubleshooting.md @@ -0,0 +1,655 @@ +--- +myst: + html_meta: + description: "Diagnose and fix common Canonical Observability Stack issues, including integrations, Grafana access, OpenTelemetry Collector errors, and alert rules." +--- + +# Troubleshooting + +## `Gateway address unavailable` + +Whenever Traefik is used to ingress your Kubernetes workloads, you might in some specific +cases encounter a "Gateway Address Unavailable" message. In this article, we'll go through +what you can do to remediate it. + +```{caution} +In this article, we will assume that you are running MicroK8s on either a bare-metal +or virtual machine. If your setup differs from this, parts of the how-to may still +apply, although you will need to tailor the exact steps and commands to your setup. +``` + +### Checklist + +- You have run `juju trust traefik --scope=cluster` +- The [MetalLB MicroK8s add-on](https://microk8s.io/docs/addon-metallb) is enabled. +- Traefik's service type is ``LoadBalancer``. +- An external IP address is assigned to Traefik. 
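The checklist above can also be run programmatically. The sketch below (an illustration, not part of COS tooling; the application name `traefik` is an assumption) inspects the parsed JSON from `kubectl get svc -A -o json` and reports whether the service is a `LoadBalancer` with an external IP:

```python
def check_service(svc_list: dict, name: str = "traefik") -> str:
    """Report whether `name` is a LoadBalancer service with an external IP.

    `svc_list` is the parsed output of `kubectl get svc -A -o json`,
    e.g. json.load(sys.stdin) when piped from kubectl.
    """
    for svc in svc_list.get("items", []):
        if svc["metadata"]["name"] != name:
            continue
        spec = svc.get("spec", {})
        if spec.get("type") != "LoadBalancer":
            return f"{name}: service type is {spec.get('type')}, expected LoadBalancer"
        # External IPs appear under status.loadBalancer.ingress once assigned.
        ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress", [])
        ips = [i["ip"] for i in ingress if "ip" in i]
        if not ips:
            return f"{name}: no external IP assigned (still pending?)"
        return f"{name}: OK, external IP {ips[0]}"
    return f"{name}: service not found"
```

Each of the "Possible causes" below corresponds to one of the failure strings this sketch would print.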
+ +### Possible causes + +#### The MetalLB add-on isn't enabled + +Check with: + +```bash +$ microk8s status -a metallb +``` + +If it is disabled, you can enable it with: + +```bash +$ IPADDR=$(ip -4 -j route get 2.2.2.2 | jq -r '.[] | .prefsrc') +$ microk8s enable metallb:$IPADDR-$IPADDR +``` + +This command will fetch the IPv4 address assigned to your host, and hand it to MetalLB +as an assignable IP. If the address range you want to hand to MetalLB differs from your +host IP, alter the `$IPADDR` variable to instead specify the range you want to assign, +for instance `IPADDR=10.0.0.1-10.0.0.100`. + +#### No external IP address is assigned to the Traefik service + +Does the Traefik service have an external IP assigned to it? Check with: + +```bash +$ JUJU_APP_NAME="traefik" +$ kubectl get svc -A -o wide | grep -E "^NAMESPACE|$JUJU_APP_NAME" +``` + +#### No available IP in address pool + +This can happen when: +- MetalLB has only one IP in its range but you deployed two instances of Traefik, + or when Traefik is forcefully removed (`--force --no-wait`) and a new Traefik + app is deployed immediately after. +- The [ingress](https://microk8s.io/docs/ingress) add-on is enabled. It's possible + that Nginx from the ingress add-on has claimed the `ExternalIP`. Disable Nginx and + re-enable MetalLB. + +Check with: + +```bash +$ kubectl get ipaddresspool -n metallb-system -o yaml && kubectl get all -n metallb-system +``` + +You could add more IPs to the range: + +```bash +$ FROM_IP="..." TO_IP="..." +$ microk8s enable metallb:$FROM_IP-$TO_IP +``` + +#### The Load Balancer service type reverted to `ClusterIP` + +Juju controller cycling may cause the type to revert from `LoadBalancer` back to +`ClusterIP`. + +Check with: + +```bash +$ kubectl get svc -A -o wide | grep -E "^NAMESPACE|LoadBalancer" +``` + +If Traefik isn't listed (it's not `LoadBalancer`), then recreate the pod to have it +re-trigger the assignment of the external IP with `kubectl delete` . 
It should be `LoadBalancer`
+when Kubernetes brings it back.
+
+#### Integration tests pass locally but fail on GitHub runners
+
+This used to happen when the GitHub runners were at peak usage, making the already small 2-vCPU, 7-GB
+runners run even slower. While this is not a satisfying answer, the best remedy may be to increase
+timeouts or to move CI jobs to internal runners.
+
+### Verification
+
+Verify that the Traefik Kubernetes service has now been assigned an external IP:
+
+```
+$ microk8s.kubectl get services -A
+
+NAMESPACE   NAME      TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
+cos         traefik   LoadBalancer   10.152.183.130   10.70.43.245   80:32343/TCP,443:30698/TCP   4d3h
+                                                      👆 - This one!
+```
+
+Verify that Traefik is functioning correctly by trying to trigger one of your ingress paths.
+If you have COS Lite deployed, you may check that it works as expected using the Catalogue charm:
+
+```bash
+# curl http:///-catalogue/
+# for example...
+$ curl http://10.70.43.245/cos-catalogue/
+```
+
+This command should return a long HTML code block if everything works as expected.
+
+## Grafana admin password
+
+Compare the output of:
+
+- Charm action: `juju run graf/0 get-admin-password`
+- Pebble plan: `juju ssh --container grafana graf/0 /charm/bin/pebble plan | grep GF_SECURITY_ADMIN_PASSWORD`
+- Secret content: obtain the secret ID from `juju secrets` and then `juju show-secret d6buvufmp25c7am9qqtg --reveal`
+
+All three should be identical. If they are not:
+
+1. Manually [reset the admin password](https://grafana.com/docs/grafana/latest/administration/cli/#reset-admin-password):
+   `juju ssh --container grafana graf/0 grafana cli --config /etc/grafana/grafana-config.ini admin reset-admin-password pa55w0rd`
+2. Update the secret with the same password: `juju update-secret d6buvufmp25c7am9qqtg password=pa55w0rd`
+3.
Run the action so the charm updates the Pebble service environment variable: `juju run graf/0 get-admin-password`
+
+## Integrations
+
+Integrating a charm with [COS](https://charmhub.io/topics/canonical-observability-stack) means:
+
+- having your app's metrics and corresponding alert rules reach [Prometheus](https://charmhub.io/prometheus-k8s/).
+- having your app's logs and corresponding alert rules reach [Loki](https://charmhub.io/loki-k8s/).
+- having your app's dashboards reach [Grafana](https://charmhub.io/grafana-k8s/).
+
+The COS team is responsible for some aspects of testing, and some aspects of testing belong to
+the charms integrating with COS.
+
+### Tests for the built-in alert rules
+
+#### Unit tests
+
+You can use:
+
+- `promtool test rules` (see details [here](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/))
+  to make sure they fire when you expect them to fire. As part of the test you hard-code the time
+  series values you are testing for.
+- `promtool check rules` (see details [here](https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-check))
+  to make sure the rules have valid syntax.
+- `cos-tool validate` (see details [here](https://github.com/canonical/cos-tool)). The advantage of
+  cos-tool is that the same executable can validate both Prometheus and Loki rules.
+
+Make sure your alerts manifest matches the output of:
+
+```bash
+$ juju ssh prometheus/0 curl localhost:9090/api/v1/rules | jq -r '.data.groups | .[] | .rules | .[] | .name'
+# and...
+$ juju ssh loki/0 curl localhost:3100/loki/api/v1/rules
+```
+
+#### Integration tests
+
+```{note}
+A fresh deployment shouldn't fire alerts. Alerts firing on a fresh deployment can happen when the alert rules
+do not take into account that there is no prior data, thus interpreting it as `0`.
+```
+
+### Tests for the metrics endpoint and scrape job
+
+#### Integration tests
+
+- `promtool check metrics` (see details [here](https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-check)) to lint the metrics endpoint,
+  e.g.
+  ```
+  curl -s http://localhost:8080/metrics | promtool check metrics
+  ```
+- For scrape targets: when related to Prometheus, and after a scrape interval elapses (default: `1m`), all
+  Prometheus targets listed in `GET /api/v1/targets` should be `"health": "up"`. Repeat the test with/without
+  ingress and TLS.
+- For remote-write (and scrape targets): when related to Prometheus, make sure that `GET /api/v1/labels`
+  and `GET /api/v1/label/juju_unit/values` have your charm listed.
+- Make sure that the metric names in your alert rules have matching metrics in your own `/metrics` endpoint.
+
+### Tests for log lines
+
+#### Integration tests
+
+When related to Loki, make sure your logging sources are listed in:
+  - `GET /loki/api/v1/label/filename/values`
+  - `GET /loki/api/v1/label/juju_unit/values`
+
+### Tests for dashboards
+
+#### Unit tests
+
+- JSON linting
+
+#### Integration tests
+
+Make sure the dashboards manifest you have in the charm matches:
+
+```bash
+$ juju ssh grafana/0 curl http://admin:password@localhost:3000/api/search
+```
+
+### Data duplication
+
+#### Multiple grafana-agent apps related to the same principal
+
+Charms should use `limit: 1` for the cos-agent relation (see example [here](https://github.com/canonical/zookeeper-operator/blob/main/metadata.yaml#L31)),
+but this cannot be enforced by grafana-agent itself. You can confirm this is the case with `jq`:
+
+```bash
+$ juju export-bundle | yq -o json '.'
| jq -r '
+  .applications as $apps |
+  .relations as $relations |
+  $apps
+  | to_entries
+  | map(select(.value.charm == "grafana-agent")) | map(.key) as $grafana_agents |
+  $apps
+  | to_entries
+  | map(.key) as $valid_apps |
+  $relations
+  | map({
+      app1: (.[0] | split(":")[0]),
+      app2: (.[1] | split(":")[0])
+    })
+  | map(select(
+      ((.app1 | IN($grafana_agents[])) and (.app2 | IN($valid_apps[]))) or
+      ((.app2 | IN($grafana_agents[])) and (.app1 | IN($valid_apps[])))
+    ))
+  | map(if .app1 | IN($grafana_agents[]) then .app2 else .app1 end)
+  | group_by(.)
+  | map({app: .[0], count: length})
+  | map(select(.count > 1))
+  '
+```
+
+If the same principal has more than one cos-agent relation, you would see output such as:
+
+```json
+[
+  {
+    "app": "openstack-exporter",
+    "count": 2
+  }
+]
+```
+
+Otherwise, you'd get:
+
+```bash
+jq: error (at :19): Cannot iterate over null (null)
+```
+
+(which is good).
+
+You can also achieve this using the status YAML. Save the following script to `is_multi_agent.py`:
+
+```python
+#!/usr/bin/env python3
+
+import yaml, sys
+
+status = yaml.safe_load(sys.stdin.read())
+
+# A mapping from grafana-agent app name to the list of apps it's subordinate to
+agents = {
+    k: v["subordinate-to"]
+    for k, v in status["applications"].items()
+    if v["charm"] == "grafana-agent"
+}
+
+for agent, principals in agents.items():
+    for p in principals:
+        for name, unit in status["applications"][p].get("units", {}).items():
+            subord_apps = {u.split("/")[0] for u in unit.get("subordinates", {})}
+            subord_agents = subord_apps & agents.keys()
+            if len(subord_agents) > 1:
+                print(
+                    f"{name} is related to more than one grafana-agent subordinate: {subord_agents}"
+                )
+```
+
+Then run it using:
+
+```bash
+$ juju status --format=yaml | ./is_multi_agent.py
+```
+
+If there is a problem, you would see output such as:
+
+```bash
+openstack-exporter/19 is related to more than one grafana-agent subordinate: {'grafana-agent-container',
'grafana-agent-vm'}
+```
+
+#### Grafana-agent related to multiple principals on the same machine
+
+The grafana-agent machine charm can only be related to one principal on the same machine.
+
+Save the following script to `is_multi.py`:
+
+```python
+#!/usr/bin/env python3
+
+import yaml, sys
+from itertools import combinations
+
+status = yaml.safe_load(sys.stdin.read())
+
+# A mapping from grafana-agent app name to the list of apps it's subordinate to
+agents = {
+    k: v["subordinate-to"]
+    for k, v in status["applications"].items()
+    if v["charm"] == "grafana-agent"
+}
+
+for agent, principals in agents.items():
+    # A mapping from app name to machines
+    machines = {
+        p: [u["machine"] for u in status["applications"][p].get("units", {}).values()]
+        for p in principals
+    }
+
+    for p1, p2 in combinations(principals, 2):
+        if overlap := set(machines[p1]) & set(machines[p2]):
+            print(
+                f"{agent} is subordinate to both '{p1}', '{p2}' in the same machines {overlap}"
+            )
+```
+
+Then run it with:
+
+```bash
+$ juju status --format=yaml | ./is_multi.py
+```
+
+If there is a problem, you would see output such as:
+
+```bash
+ga is subordinate to both 'co', 'nc' in the same machines {'24'}
+```
+
+### Additional thoughts
+
+- A rock's CI could dump a record of the `/metrics` endpoint each time the rock is built. This
+  way some integration tests could turn into unit tests.
+
+### See also
+
+- [Troubleshooting Prometheus Integrations](https://discourse.charmhub.io/t/prometheus-k8s-docs-troubleshooting-integrations/14351)
+- [Troubleshooting missing logs](https://discourse.charmhub.io/t/loki-k8s-docs-troubleshooting-missing-logs/14187)
+
+## `No data` in Grafana panels
+
+Data in Grafana panels is obtained by querying datasources.
+
+### Adjust the time range
+
+Check if there is any data when you change the
+[time range](https://grafana.com/docs/grafana-cloud/visualizations/dashboards/use-dashboards/#set-dashboard-time-range)
+to `1d`, `7d`, etc.
+Perhaps you had "no data" all along or it started happening only recently. + + +### Inspect variable values +Drop-down [variables](https://grafana.com/docs/grafana/latest/dashboards/variables/) +could be filtering out data incorrectly. +Under dashboard settings, inspect the current values of the variables. +- If you can find a combination of dropdown selections that results in data being shown, then + perhaps the offered variable options should be [narrowed down](https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/#add-a-query-variable) with a more accurate query. +- If the options listed in the dropdown are missing items you expect to be there, then the datasource might be + missing some telemetry, or perhaps we refer to a metric that does not exist, or apply a combination of labels that does not produce a result. + + +### Confirm the query is valid +[Edit the panel](https://grafana.com/docs/grafana/latest/panels-visualizations/panel-editor-overview/) +and incrementally simplify the faulty query, until data shows up. +For example, +- drop label matchers +- remove aggregation operations (`on`, `sum by`) +- replace `$__` interval macros with literals such as `5s` or `5m` +- remove drop-down variables from the query +- disable transformations or overrides that could potentially hide data + +Open the query inspector panel and check the response. + +If only some of the telemetry you expect to have does not exist, then perhaps a relation is missing (or duplicated). + + +### Check datasource connection +Test the datasource connection. +- URL correct? +- For TLS, does grafana trust the CA that signed the datasource? Perhaps there's a missing certificate-transfer relation? +- Credentials valid? +- Proxy configured? Proxy can be [configured](https://documentation.ubuntu.com/juju/latest/reference/configuration/list-of-model-configuration-keys/#model-config-http-proxy) per model. +- Datasource (backend) errors in the logs? 
+- Errors in Grafana server logs?
+
+### Test the query in the datasource UI
+
+Some datasources (backends, e.g. Prometheus) have their own UI where you can paste the query
+from the faulty Grafana panel. If the query works in the backend UI but not in Grafana,
+check the datasource connection.
+
+### Confirm that the relevant Juju relations are in place
+
+- Grafana should be related over the [grafana-source](https://charmhub.io/integrations/grafana_datasource) relation to all relevant datasources.
+- In typical deployments, telemetry is pushed from outside the model. Make sure the backends have an ingress relation.
+- For deployments that are TLS-terminated, Grafana needs a `receive-ca-cert` relation from Traefik.
+
+### Confirm backends are not out of disk space
+
+If a backend (e.g. Prometheus) runs out of disk space, then it will not ingest new
+telemetry.
+
+### Confirm you can curl the backend via its ingress URL
+
+- Can Grafana reach the datasource URL?
+- Can grafana-agent or opentelemetry-collector (or any other telemetry producer or aggregator) reach its backend?
+  For example, can grafana-agent reach Prometheus? Pay attention to http vs. https.
+
+## OpenTelemetry Collector
+
+### High resource usage
+
+#### Attempting to scrape too many logs?
+
+Inspect the list of files opened by otelcol and their size.
+ +```bash +juju ssh ubuntu/0 "sudo lsof -nP -p $(pgrep otelcol)" +``` + +You should see entries such as: + +``` +COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME +otelcol 45246 root 46r REG 8,1 11980753 3206003 /var/log/syslog +otelcol 45246 root 12r REG 8,1 292292 3205748 /var/log/lastlog +otelcol 45246 root 30r REG 8,1 157412 3161673 /var/log/auth.log +otelcol 45246 root 16r REG 8,1 96678 3195546 /var/log/juju/machine-lock.log +otelcol 45246 root 45r REG 8,1 77200 3205894 /var/log/cloud-init.log +otelcol 45246 root 35r REG 8,1 61211 3205745 /var/log/dpkg.log +otelcol 45246 root 25r REG 8,1 29037 3205893 /var/log/cloud-init-output.log +otelcol 45246 root 18r REG 8,1 6121 3205741 /var/log/apt/history.log +otelcol 45246 root 15r REG 8,1 1941 3206035 /var/log/unattended-upgrades/unattended-upgrades.log +otelcol 45246 root 17r REG 8,1 474 3183206 /var/log/alternatives.log +``` + +Compare the total size of logs to the available memory. + + +## `socket: too many open files` + +When deploying the Grafana Agent or Prometheus charms in large environments, +you may sometimes bump into an issue where the large amount of scrape targets +leads to the process hitting the max open files count, as set by ``ulimit``. + +This issue can be identified by looking in your Grafana Agent logs, or Prometheus +Scrape Targets in the UI, for the following kind of message: + +``` +Get "http://10.0.0.1:9275/metrics": dial tcp 10.0.0.1:9275: socket: too many open files +``` + +To resolve this, we need to increase the max open file limit of the Kubernetes +deployment itself. For MicroK8s, this would be done by increasing the limits in +`/var/snap/microk8s/current/args/containerd-env`. + +### 1. Juju SSH into the machine + +```bash +$ juju ssh uk8s/1 +``` + +Substitute `uk8s/1` with the name of your MicroK8s unit. If you have more than +one unit, you will need to repeat this for each of them. + +### 2. Open the ``containerd-env`` + +You can use whatever editor you prefer for this. 
In this how-to, we'll use ``vim``.
+
+```bash
+$ sudo vim /var/snap/microk8s/current/args/containerd-env
+```
+
+### 3. Increase the `ulimit`
+
+```diff
+
+# Attempt to change the maximum number of open file descriptors
+# this get inherited to the running containers
+#
+- ulimit -n 1024 || true
++ ulimit -n 65536 || true
+
+# Attempt to change the maximum locked memory limit
+# this get inherited to the running containers
+#
+- ulimit -l 1024 || true
++ ulimit -l 16384 || true
+```
+
+### 4. Restart the MicroK8s machine
+
+Restart the machine the MicroK8s unit is deployed on and then wait for it to come back up.
+
+```bash
+$ sudo reboot
+```
+
+### 5. Validate
+
+Validate that the change made it through and had the desired effect once the machine is
+back up and running.
+
+```bash
+$ juju ssh uk8s/1 cat /var/snap/microk8s/current/args/containerd-env
+
+[...]
+
+# Attempt to change the maximum number of open file descriptors
+# this get inherited to the running containers
+#
+ulimit -n 65536 || true
+
+# Attempt to change the maximum locked memory limit
+# this get inherited to the running containers
+#
+ulimit -l 16384 || true
+```
+
+## Firing alert rules
+
+This guide describes how to troubleshoot firing generic alert rules. For detailed explanations of the design and goals of these rules, refer to the [explanation page](/explanation/generic-rules).
+
+### How to troubleshoot the `HostDown` alert
+
+The `HostDown` alert is a sign that Prometheus is unable to scrape the metrics endpoint of the charm for which this alert is firing. The methods below can help pinpoint the issue.
+
+#### Ensure the workload is running
+
+It is possible that the charm being scraped by Prometheus is not running. Shell into the workload container and check the service status:
+
+```shell
+juju ssh
+```
+
+#### Ensure Prometheus is scraping the correct endpoint
+
+It is possible that Prometheus is not scraping the correct address, endpoint, or port.
When a charm is related to Prometheus for metrics scraping, the Prometheus config file adds the related charm's metrics endpoint address and port to its list of targets. For K8s charms, this address can be the pod's FQDN or the ingress address (if using Traefik, for example). If the charm being scraped does not write the address correctly, then Prometheus will be unable to reach it.
+
+Another possibility is that the charm does not specify the correct port or endpoint for its metrics. When a charm instantiates the `MetricsEndpointProvider` object, it needs to set the correct port and metrics endpoint. For example, Alertmanager exposes its metrics at the `/metrics` endpoint on port 9093. Charm authors should ensure these values are correctly set; otherwise, Prometheus may not have the correct information when attempting to scrape. Use the `ss` command to determine which ports are exposed by your workload.
+
+#### Ensure the correct firewall and SSL/TLS configurations are applied
+
+From inside the Prometheus container:
+1. View the Prometheus configuration file located at `/etc/prometheus/prometheus.yml`:
+```shell
+cat /etc/prometheus/prometheus.yml
+```
+2. Find the address of your target.
+3. Attempt to `curl` it from inside that container.
+```shell
+curl
+```
+4. Ensure the `curl` request is successful.
+
+A failed request can be due to a firewall issue. Ensure your firewall rules allow Prometheus to reach the instance.
+
+If your workload uses TLS communication, Prometheus needs to trust the CA that signed that workload's certificate in order to reach it. For example, if your charm's certificate is signed through an integration with Lego, Prometheus needs to have the CA certificate in its root store (through a `receive-ca-cert` relation) so it can communicate over HTTPS with your charm.
+
+### How to troubleshoot the `AggregatorHostHealth` alerts
+
+The `HostMetricsMissing` and `AggregatorMetricsMissing` alerts under the `AggregatorHostHealth` group are similar, with only differences in their severity and the units they are responsible for. As such, the methods to troubleshoot them are identical.
+
+#### Confirm the aggregator is running
+
+For machine charms, ensure the snap is running by checking its status on the machine hosting it. In this example, we'll assume that our aggregator is `grafana-agent` on a machine with ID 0.
+1. Shell into the machine:
+```shell
+juju ssh 0
+```
+2. Check the status of the `grafana-agent` snap:
+```shell
+sudo snap services grafana-agent
+```
+Ensure that the status of the snap is indicated as `active`.
+
+For K8s charms, ensure the relevant Pebble service is running by checking its status in the workload container. In this example, we'll assume we have the `opentelemetry-collector` K8s charm deployed with the name `otel` and we want to check the status of the Pebble service in the workload container in unit 0. The name of the workload container is `otelcol`.
+```{note}
+You need to know the name of the workload container in order to shell into it. You can find this information by consulting the `containers` section of a charm's `charmcraft.yaml` file. Alternatively, you can use `kubectl describe pod` to view the containers inside the pod.
+```
+1.
Shell into the workload container:
+```shell
+juju ssh --container otelcol otel/0
+```
+2. Check the status of the `otelcol` Pebble service:
+```shell
+pebble services otelcol
+```
+
+#### Confirm the backend is reachable
+
+It is possible that the aggregator is running, but failing to remote write metrics to the metrics backend. This can occur if there are network or firewall issues, leaving the aggregator unable to successfully reach the metrics backend's remote write endpoint.
+
+The cause can often be revealed by inspecting the workload logs for messages that suggest issues in reaching a host. The logs will often mention timeouts, DNS name resolution failures, TLS certificate issues, or, more broadly, "export failures".
+1. For machine aggregators, view the snap logs:
+```shell
+sudo snap logs opentelemetry-collector
+```
+2. For K8s aggregators, use `juju ssh` and `pebble logs` to view the workload logs. For example, for `opentelemetry-collector-k8s` unit 0, you will need to look at the Pebble logs in the `otelcol` container:
+```shell
+juju ssh --container otelcol opentelemetry-collector/0 pebble logs
+```
+
+In some cases, the backend may be unreachable due to SSL/TLS-related issues. This often happens when your aggregator is located outside the Juju model where your COS instance lives and TLS is used when the aggregator tries to reach the backend (external or full TLS). If you are using ingress, the aggregator must trust the CA that signed the certificate of the backend or ingress provider (e.g. Traefik).
+
+#### Inspect existing `up` time series
+
+Perhaps the metrics *do* reach Prometheus, but the `expr` labels we have rendered in the alert do not match the actual metric labels. You can confirm this by going to the Prometheus (or Grafana) UI and querying for `up`. Compare the labels of the returned `up` time series against the labels used in the alert's `expr`.
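A quick way to do that comparison is to query the Prometheus HTTP API directly and print the label set of every `up` series. A minimal sketch, assuming Prometheus is deployed as `prometheus/0` and listens on its default port 9090:

```shell
# Print the label set of every `up` time series, one JSON object per line,
# so it can be compared with the labels rendered into the alert's `expr`.
juju ssh prometheus/0 curl -s 'localhost:9090/api/v1/query?query=up' \
  | jq -c '.data.result[].metric'
```

Each printed object contains the Juju topology labels (`juju_model`, `juju_application`, and so on); a mismatch with the alert expression's matchers explains why the rule never selects the series.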
+ + +# Compressed rules in relation databags + +In some relations, rules are compressed in the databag and are not human readable, making troubleshooting difficult. Assuming your unit and endpoint are named `otelcol/0` and `receive-otlp` respectively, then you can view the compressed rules with: + +```bash +juju show-unit otelcol/0 --format=json | \ + jq -r '."otelcol/0"."relation-info"[] | select(.endpoint == "receive-otlp") | ."application-data".rules' + +> /Td6WFoAAATm1rRGAgAhARYAAAB0L ... IEAJVHNA5MGJt6AAGcCtk3AABCHzmZscRn+wIAAAAABFla +``` + +And decompress for troubleshooting with: +```bash +juju show-unit otelcol/0 --format=json | \ + jq -r '."otelcol/0"."relation-info"[] | select(.endpoint == "receive-otlp") | ."application-data".rules' | \ + base64 -d | xz -d | jq + +> {JSON rule content ...} +``` diff --git a/docs/how-to/troubleshooting/index.rst b/docs/how-to/troubleshooting/index.rst deleted file mode 100644 index 072d82e1..00000000 --- a/docs/how-to/troubleshooting/index.rst +++ /dev/null @@ -1,20 +0,0 @@ -.. meta:: - :description: Diagnose and fix common Canonical Observability Stack issues, including integrations, Grafana access, OpenTelemetry Collector errors, and alert rules. - -.. _troubleshooting: - -Troubleshooting -*************** - -.. 
toctree:: - :maxdepth: 1 - :titlesonly: - - Troubleshoot "Gateway Address Unavailable" in Traefik - Troubleshoot "socket: too many open files" - Troubleshoot integrations - Troubleshoot "no data" in Grafana panels - Troubleshoot firing alert rules - Troubleshoot grafana admin password - Troubleshoot OpenTelemetry Collector - Troubleshoot compressed rules in databag diff --git a/docs/how-to/troubleshooting/troubleshoot-compressed-rules-in-databag.md b/docs/how-to/troubleshooting/troubleshoot-compressed-rules-in-databag.md deleted file mode 100644 index 0fca4ac8..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-compressed-rules-in-databag.md +++ /dev/null @@ -1,19 +0,0 @@ -# Troubleshoot compressed rules in relation databags - -In some relations, rules are compressed in the databag and are not human readable, making troubleshooting difficult. Assuming your unit and endpoint are named `otelcol/0` and `receive-otlp` respectively, then you can view the compressed rules with: - -```bash -juju show-unit otelcol/0 --format=json | \ - jq -r '."otelcol/0"."relation-info"[] | select(.endpoint == "receive-otlp") | ."application-data".rules' - -> /Td6WFoAAATm1rRGAgAhARYAAAB0L ... 
IEAJVHNA5MGJt6AAGcCtk3AABCHzmZscRn+wIAAAAABFla -``` - -And decompress for troubleshooting with: -```bash -juju show-unit otelcol/0 --format=json | \ - jq -r '."otelcol/0"."relation-info"[] | select(.endpoint == "receive-otlp") | ."application-data".rules' | \ - base64 -d | xz -d | jq - -> {JSON rule content ...} -``` diff --git a/docs/how-to/troubleshooting/troubleshoot-firing-alert-rules.md b/docs/how-to/troubleshooting/troubleshoot-firing-alert-rules.md deleted file mode 100644 index 76f2b050..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-firing-alert-rules.md +++ /dev/null @@ -1,84 +0,0 @@ ---- -myst: - html_meta: - description: "Troubleshoot firing alert rules in COS through steps such as exploring workload health, and ensuring scraping, connectivity, and other configurations are correct." ---- - -# Troubleshoot firing alert rules -This guide describes how to troubleshoot firing generic alert rules. For detailed explanations on the design and goals of these rules, refer to the [explanation page](/explanation/generic-rules). - -## How to troubleshoot the `HostDown` alert -The `HostDown` alert is a sign that Prometheus is unable to scrape the metrics endpoint of the charm for whom this alert is firing. The methods below can help pinpoint the issue. - -### Ensure the workload is running -It is possible that the charm being scraped by Prometheus is not running. Shell into the workload container and check the service status: -```shell -juju ssh -``` - -### Ensure Prometheus is scraping the correct endpoint -It is possible that Prometheus is not scraping the correct address, endpoint, or port. When a charm is related to Prometheus for scraping of metrics, the Prometheus config file appends the related charm's metrics endpoint address and port into its list of targets. For K8s charms, this address can be the pod's FQDN or the ingress address (if using Traefik for example). 
If the charm being scraped does not write the address correctly, then Prometheus will be unable to reach it. - -Another possibility is that the charm does not specify the correct port or endpoint for its metrics. When a charm instantiates the `MetricsEndpointProvider` object, it needs to set the correct port and metrics endpoint. For example, Alertmanager exposes its metrics at the `/metrics` endpoint on port 9093. Charm authors should ensure these values are correctly set, otherwise Prometheus may not have the correct information when attempting to scrape. Use the `ss` command to determine which ports are exposed by your workload. - -### Ensure the correct firewall and SSL/TLS configurations are applied -From inside the Prometheus container: -1. View the Prometheus configuration file located at `/etc/prometheus/prometheus.yml` -```shell -cat /etc/prometheus/prometheus.yml -``` -2. Find the address of your target -3. Attempt to `curl` it from inside that container. -```shell -curl
-``` -4. Ensure the `curl` request is successful - -A failed request can be due to a firewall issue. Ensure your firewall rules allow Prometheus to reach the instance. - -If your workload uses TLS communication, Prometheus needs to trust that CA that signed that workload to be able to reach it. For example, if your charm is signed through an integration to Lego, Prometheus needs to have the CA cert in its root store (through a `receive-ca-cert` relation) so it can communicate in HTTPS with your charm. - -## How to troubleshoot the `AggregatorHostHealth` alerts -The `HostMetricsMissing` and `AggregatorMetricsMissing` alerts under the `AggregatorHostHealth` group are similar, with only differences in their severity and the units they are responsible for. As such, the methods to troubleshoot them are identical. -### Confirm the aggregator is running -For machine charms, ensure the snap is running by checking its status in the machine hosting it. In this example, we'll assume that our aggregator is `grafana-agent` on a machine with ID 0. -1. Shell into the machine: -```shell -juju ssh 0 -``` -2. Check the status of the `grafana-agent` snap: -```shell -sudo snap services grafana-agent -``` -Ensure that the status of the snap is indicated as `active`. - -For K8s charms, ensure the relevant pebble service is running by checking its status in the workload container. In this example, we'll assume we have the `opentelemetry-collector` k8s charm deployed with the name `otel` and we want to check the status of the pebble service in the workload container in unit 0. The name of the workload container is `otelcol`. -```{note} -You need to know the name of the workload container in order to shell into it. You can find this information by consulting the `containers` section of a charm's `charmcraft.yaml` file. Alternatively, you can use `kubectl describe pod` to view the containers inside the pod. -``` -1. 
Shell into the workload container: -```shell -juju ssh --container otelcol otel/0 -``` -2. Check the status of the `otelcol` pebble service: -```shell -pebble services otelcol -``` - -### Confirm the backend is reachable -It is possible that the aggregator is running, but failing to remote write metrics into the metrics backend. This can occur if there are network or firewall issues, leaving the aggregator unable to successfully hit the metrics backend's remote write endpoint. - -The causes in these cases can often be revealed by looking at the workload logs and looking for logs that suggest issues in reaching a host. The logs will often mention timeouts, DNS name resolution failures, TLS certificate issues, or more broadly "export failures". -1. For machine aggregators, view the snap logs: -```shell -sudo snap logs opentelemetry-collector -``` -2. For K8s aggregators, use `juju ssh` and `pebble logs` to view the workload logs. For example, for `opentelemetry-collector-k8s` unit 0, you will need to look at the Pebble logs in the `otelcol` container: -```shell -juju ssh --container otelcol opentelemetry-collector/0 pebble logs -``` - -In some cases, the backend may be unreachable due to SSL/TLS related issues. This often happens when your aggregator is located outside the Juju model where your COS instance lives and you are using TLS communication when the aggregator tries to reach the backend (external or full TLS). If you are using ingress, it is required for the aggregator to trust the CA that signed the backend or ingress provider (e.g. Traefik). - -### Inspect existing `up` time series -Perhaps the metrics *do* reach Prometheus, but the `expr` labels we have rendered in the alert do not match the actual metric labels. You can confirm by going to the Prometheus (or Grafana) UI and querying for `up`. Compare the set of labels you get for the returned `up` time series. 
diff --git a/docs/how-to/troubleshooting/troubleshoot-gateway-address-unavailable.md b/docs/how-to/troubleshooting/troubleshoot-gateway-address-unavailable.md deleted file mode 100644 index b1e2fd80..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-gateway-address-unavailable.md +++ /dev/null @@ -1,123 +0,0 @@ ---- -myst: - html_meta: - description: "Troubleshoot Gateway address unavailable errors in COS by exploring possible causes such as charm configuration and network reachability." ---- - -# Troubleshooting `Gateway address unavailable` - -Whenever Traefik is used to ingress your Kubernetes workloads, you might in some specific -cases encounter a "Gateway Address Unavailable" message. In this article, we'll go through -what you can do to remediate it. - -```{caution} -In this article, we will assume that you are running MicroK8s on either a bare-metal -or virtual machine. If your setup differs from this, parts of the how-to may still -apply, although you will need to tailor the exact steps and commands to your setup. -``` - -## Checklist - -- You have run `juju trust traefik --scope=cluster` -- The [MetalLB MicroK8s add-on](https://microk8s.io/docs/addon-metallb) is enabled. -- Traefik's service type is ``LoadBalancer``. -- An external IP address is assigned to Traefik. - -## Possible causes - -### The MetalLB add-on isn't enabled - -Check with: - -```bash -$ microk8s status -a metallb -``` - -If it is disabled, you can enable it with: - -```bash -$ IPADDR=$(ip -4 -j route get 2.2.2.2 | jq -r '.[] | .prefsrc') -$ microk8s enable metallb:$IPADDR-$IPADDR -``` - -This command will fetch the IPv4 address assigned to your host, and hand it to MetalLB -as an assignable IP. If the address range you want to hand to MetalLB differs from your -host IP, alter the `$IPADDR` variable to instead specify the range you want to assign, -for instance `IPADDR=10.0.0.1-10.0.0.100`. 
- -### No external IP address is assigned to the Traefik service - -Does the Traefik service have an external IP assigned to it? Check with: - -```bash -$ JUJU_APP_NAME="traefik" -$ kubectl get svc -A -o wide | grep -E "^NAMESPACE|$JUJU_APP_NAME" -``` - -### No available IP in address pool - -This can happen when: -- MetalLB has only one IP in its range but you deployed two instances of Traefik, - or when Traefik is forcefully removed (`--force --no-wait`) and a new Traefik - app is deployed immediately after. -- The [ingress](https://microk8s.io/docs/ingress) add-on is enabled. It's possible - that Nginx from the ingress add-on has claimed the `ExternalIP`. Disable Nginx and - re-enable MetalLB. - -Check with: - -```bash -$ kubectl get ipaddresspool -n metallb-system -o yaml && kubectl get all -n metallb-system -``` - -You could add more IPs to the range: - -```bash -$ FROM_IP="..." TO_IP="..." -$ microk8s enable metallb:$FROM_IP-$TO_IP -``` - -### The Load Balancer service type reverted to `ClusterIP` - -Juju controller cycling may cause the type to revert from `LoadBalancer` back to -`ClusterIP`. - -Check with: - -```bash -$ kubectl get svc -A -o wide | grep -E "^NAMESPACE|LoadBalancer" -``` - -If Traefik isn't listed (it's not `LoadBalancer`), then recreate the pod to have it -re-trigger the assignment of the external IP with `kubectl delete` . It should be `LoadBalancer` -when Kubernetes brings it back. - -### Integration tests pass locally but fail on GitHub runners - -This used to happen when the github runners were at peak usage, making the already small 2cpu7gb -runners run even slower. As much of a bad answer as this is, the best response may be to increase -timeouts or try to move CI jobs to internal runners. 
- -## Verification - -Verify that the Traefik Kubernetes service now has been assigned an external IP: - -``` - -$ microk8s.kubectl get services -A - -NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) -cos traefik LoadBalancer 10.152.183.130 10.70.43.245 80:32343/TCP,443:30698/TCP 4d3h - 👆 - This one! -``` - -Verify that Traefik is functioning correctly by trying to trigger one of your ingress paths. -If you have COS Lite deployed, you may check that if works as expected using the Catalogue charm: - -```bash -# curl http:///-catalogue/ -# for example... -$ curl http://10.70.43.245/cos-catalogue/ -``` - -This command should return a long HTML code block if everything works as expected. \ No newline at end of file diff --git a/docs/how-to/troubleshooting/troubleshoot-grafana-admin-password.md b/docs/how-to/troubleshooting/troubleshoot-grafana-admin-password.md deleted file mode 100644 index 80ae0c4e..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-grafana-admin-password.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -myst: - html_meta: - description: "Troubleshoot problems with the Grafana admin password in COS with repeatable reset steps, to regain access and restore administrator control quickly." ---- - -# Troubleshoot grafana admin password - -Compare the output of: - -- Charm action: `juju run graf/0 get-admin-password` -- Pebble plan: `juju ssh --container grafana graf/0 /charm/bin/pebble plan | grep GF_SECURITY_ADMIN_PASSWORD` -- Secret content: Obtain secret id from `juju secrets` and then `juju show-secret d6buvufmp25c7am9qqtg --reveal` - -All 3 should be identical. If they are not identical, - -1. Manually [reset the admin password](https://grafana.com/docs/grafana/latest/administration/cli/#reset-admin-password), - `juju ssh --container grafana graf/0 grafana cli --config /etc/grafana/grafana-config.ini admin reset-admin-password pa55w0rd` -2. Update the secret with the same: `juju update-secret d6buvufmp25c7am9qqtg password=pa55w0rd` -3. 
Run the action so the charm updates the pebble service environment variable: `juju run graf/0 get-admin-password` diff --git a/docs/how-to/troubleshooting/troubleshoot-integrations.md b/docs/how-to/troubleshooting/troubleshoot-integrations.md deleted file mode 100644 index d8602691..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-integrations.md +++ /dev/null @@ -1,236 +0,0 @@ ---- -myst: - html_meta: - description: "Troubleshoot COS integrations on various topics, such as alert rules, metrics and scraping, logs, dashboards, and data duplication." ---- - -# Troubleshooting integrations - -Integrating a charm with [COS](https://charmhub.io/topics/canonical-observability-stack) means: - -- having your app's metrics and corresponding alert rules reach [Prometheus](https://charmhub.io/prometheus-k8s/). -- having your app's logs and corresponding alert rules reach [Loki](https://charmhub.io/loki-k8s/). -- having your app's dashboards reach [grafana](https://charmhub.io/grafana-k8s/). - -The COS team is responsible for some aspects of testing, and some aspects of testing belong to -the charms integrating with COS. - -## Tests for the built-in alert rules - -### Unit tests - -You can use: - -- `promtool test rules` (see details [here](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/)) - to make sure they fire when you expect them to fire. As part of the test you hard-code the time - series values you are testing for. -- `promtool check rules` (see details [here](https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-check)) - to make sure the rules have valid syntax. -- `cos-tool validate` (see details [here](https://github.com/canonical/cos-tool)). The advantage of - cos-tool is that the same executable can validate both Prometheus and Loki rules. 
- -Make sure your alerts manifest matches the output of: - -```bash -$ juju ssh prometheus/0 curl localhost:9090/api/v1/rules | jq -r '.data.groups | .[] | .rules | .[] | .name' -# and... -$ juju ssh loki/0 curl localhost:3100/loki/api/v1/rules -``` - -### Integration tests - -```{note} -A fresh deployment shouldn't fire alerts. This can happen when the alert rules are not taking into account -that there is no prior data, thus interpreting it as `0`. -``` - -## Tests for the metrics endpoint and scrape job - -### Integration tests - -- `promtool check metrics` (see details [here](https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-check)) to lint the the metrics endpoint, - e.g. - ``` - curl -s http://localhost:8080/metrics | promtool check metrics`. - ``` -- For scrape targets: when related to prometheus, and after a scrape interval elapses (default: `1m`), all - prometheus targets listed in `GET /api/v1/targets` should be `"health": "up"`. Repeat the test with/without - ingress and TLS. -- For remote-write (and scrape targets): when related to prometheus, make sure that `GET /api/v1/labels` - and `GET /api/v1/label/juju_unit` have your charm listed. -- Make sure that the metric names in your alert rules have matching metrics in your own `/metrics` endpoint. 
- -## Tests for log lines - -### Integration tests - -When related to Loki, make sure your logging sources are listed in: - - `GET /loki/api/v1/label/filename/values` - - `GET /loki/api/v1/label/juju_unit/values` - -## Tests for dashboards - -### Unit tests - -* JSON linting - -### Integration tests - -Make sure the dashboards manifest you have in the charm matches: - -```bash -$ juju ssh grafana/0 curl http://admin:password@localhost:3000/api/search -``` - -## Data Duplication - -### Multiple grafana-agent apps related to the same principle - -Charms should use `limit: 1` for the cos-agent relation (see example [here](https://github.com/canonical/zookeeper-operator/blob/main/metadata.yaml#L31), -but this cannot be enforced by grafana-agent itself. You can confirm this is the case with `jq`: - -```bash -$ juju export-bundle | yq -o json '.' | jq -r ' - .applications as $apps | - .relations as $relations | - $apps - | to_entries - | map(select(.value.charm == "grafana-agent")) | map(.key) as $grafana_agents | - $apps - | to_entries - | map(.key) as $valid_apps | - $relations - | map({ - app1: (.[0] | split(":")[0]), - app2: (.[1] | split(":")[0]) - }) - | map(select( - ((.app1 | IN($grafana_agents[])) and (.app2 | IN($valid_apps[]))) or - ((.app2 | IN($grafana_agents[])) and (.app1 | IN($valid_apps[]))) - )) - | map(if .app1 | IN($grafana_agents[]) then .app2 else .app1 end) - | group_by(.) - | map({app: .[0], count: length}) - | map(select(.count > 1)) - ' -``` - -If the same principal has more than one cos-agent relation, you would see output such as: - -```json - -[ - { - "app": "openstack-exporter", - "count": 2 - } -] -``` - -Otherwise, you'd get: - -```bash -jq: error (at :19): Cannot iterate over null (null) -``` - -(which is good). - -You can achieve this also using the status YAML. 
Save the following script to `is_multi_agent.py`: - - - -```python -#!/usr/bin/env python3 - -import yaml, sys - -status = yaml.safe_load(sys.stdin.read()) - -# A mapping from grafana-agent app name to the list of apps it's subordiante to -agents = { - k: v["subordinate-to"] - for k, v in status["applications"].items() - if v["charm"] == "grafana-agent" -} -# print(agents) - -for agent, principals in agents.items(): - for p in principals: - for name, unit in status["applications"][p].get("units", {}).items(): - subord_apps = {u.split("/", -1)[0] for u in unit["subordinates"].keys()} - subord_agents = subord_apps & agents.keys() - if len(subord_agents) > 1: - print( - f"{name} is related to more than one grafana-agent subordinate: {subord_agents}" - ) -``` - -Then run it using: - -```bash -$ juju status --format=yaml | ./is_multi_agent.py -``` - -If there is a problem, you would see output such as: - -```bash -openstack-exporter/19 is related to more than one grafana-agent subordinate: {'grafana-agent-container', 'grafana-agent-vm'} -``` - -### Grafana-agent related to multiple principles on the same machine - -The grafana-agent machine charm can only be related to one principal in the same machine. 
- -Save the following script to `is_multi.py`: - -```python - -#!/usr/bin/env python3 - -import yaml, sys - -status = yaml.safe_load(sys.stdin.read()) - -# A mapping from grafana-agent app name to the list of apps it's subordiante to -agents = { - k: v["subordinate-to"] - for k, v in status["applications"].items() - if v["charm"] == "grafana-agent" -} - -for agent, principals in agents.items(): - # A mapping from app name to machines - machines = { - p: [u["machine"] for u in status["applications"][p].get("units", {}).values()] - for p in principals - } - - from itertools import combinations - - for p1, p2 in combinations(principals, 2): - if overlap := set(machines[p1]) & set(machines[p2]): - print( - f"{agent} is subordinate to both '{p1}', '{p2}' in the same machines {overlap}" - ) -``` - -Then run it with: - -```bash -$ juju status --format=yaml | ./is_multi.py -``` - -If there is a problem, you would see output such as: - -```bash -ga is subordinate to both 'co', 'nc' in the same machines {'24'} -``` - -## Additional thoughts -- A rock's CI could dump a record of the `/metrics` endpoint each time the rock is built. This - way some integration tests could turn into unit tests. 
- -## See also - -- [Troubleshooting Prometheus Integrations](https://discourse.charmhub.io/t/prometheus-k8s-docs-troubleshooting-integrations/14351) -- [Troubleshooting missing logs](https://discourse.charmhub.io/t/loki-k8s-docs-troubleshooting-missing-logs/14187) \ No newline at end of file diff --git a/docs/how-to/troubleshooting/troubleshoot-no-data-in-grafana-panels.md b/docs/how-to/troubleshooting/troubleshoot-no-data-in-grafana-panels.md deleted file mode 100644 index 294dc76d..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-no-data-in-grafana-panels.md +++ /dev/null @@ -1,74 +0,0 @@ ---- -myst: - html_meta: - description: "Troubleshoot data issues in Grafana panels using various methods, such as adjusting time ranges, inspecting variables, confirming query validity, and checking connections." ---- - -# Troubleshoot `no data` in Grafana panels - -Data in Grafana panels is obtained by querying datasources. - - -## Adjust the time range -Check if there is any data when you change the -[time range](https://grafana.com/docs/grafana-cloud/visualizations/dashboards/use-dashboards/#set-dashboard-time-range) -to `1d`, `7d`, etc. -Perhaps you had "no data" all along or it started happening only recently. - - -## Inspect variable values -Drop-down [variables](https://grafana.com/docs/grafana/latest/dashboards/variables/) -could be filtering out data incorrectly. -Under dashboard settings, inspect the current values of the variables. -- If you can find a combination of dropdown selections that results in data being shown, then - perhaps the offered variable options should be [narrowed down](https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/#add-a-query-variable) with a more accurate query. 
-- If the options listed in the dropdown are missing items you expect to be there, then the datasource might be - missing some telemetry, or perhaps we refer to a metric that does not exist, or apply a combination of labels that does not produce a result. - - -## Confirm the query is valid -[Edit the panel](https://grafana.com/docs/grafana/latest/panels-visualizations/panel-editor-overview/) -and incrementally simplify the faulty query, until data shows up. -For example, -- drop label matchers -- remove aggregation operations (`on`, `sum by`) -- replace `$__` interval macros with literals such as `5s` or `5m` -- remove drop-down variables from the query -- disable transformations or overrides that could potentially hide data - -Open the query inspector panel and check the response. - -If only some of the telemetry you expect to have does not exist, then perhaps a relation is missing (or duplicated). - - -## Check datasource connection -Test the datasource connection. -- URL correct? -- For TLS, does grafana trust the CA that signed the datasource? Perhaps there's a missing certificate-transfer relation? -- Credentials valid? -- Proxy configured? Proxy can be [configured](https://documentation.ubuntu.com/juju/latest/reference/configuration/list-of-model-configuration-keys/#model-config-http-proxy) per model. -- Datasource (backend) errors in the logs? -- Errors in grafana server logs? - - -## Test the query in the datasource UI -Some datasources (backends, e.g. Prometheus) have their own UI where you can paste the query -from the faulty Grafana panel. If the query works in the backend UI but not in Grafana, -check datasource connection. - - -## Confirm that the relevant juju relations are in place -- Grafana should be related over the [grafana-source](https://charmhub.io/integrations/grafana_datasource) relation to all relevant datasources. -- In typical deployments, telemetry is pushed from outside the model. Make sure the backends have an ingress relation. 
-- For deployment that are TLS-terminated, Grafana needs a `recieve-ca-cert` relation from Traefik. - - -## Confirm backends are not out of disk space -If a backend (e.g. Prometheus) runs out of disk space, then it will not ingest new -telemetry. - - -## Confirm you can curl the backend via its ingress URL -- Can grafana reach the datasource URL? -- Can grafana-agent or opentelemetry (or any other telemetry producer or aggregator) reach its backend? - For example, can grafana-agent reach prometheus? Pay attention to http vs. https. \ No newline at end of file diff --git a/docs/how-to/troubleshooting/troubleshoot-otelcol.md b/docs/how-to/troubleshooting/troubleshoot-otelcol.md deleted file mode 100644 index d73d5d4d..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-otelcol.md +++ /dev/null @@ -1,35 +0,0 @@ ---- -myst: - html_meta: - description: "Troubleshoot OpenTelemetry Collector issues in COS with various strategies, such as reviewing high resource usage and attempting to scrape too many logs." ---- - -# Troubleshoot OpenTelemetry Collector - -## High resource usage - -### Attempting to scrape too many logs? - -Inspect the list of files opened by otelcol and their size. 
- -```bash -juju ssh ubuntu/0 "sudo lsof -nP -p $(pgrep otelcol)" -``` - -You should see entries such as: - -``` -COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME -otelcol 45246 root 46r REG 8,1 11980753 3206003 /var/log/syslog -otelcol 45246 root 12r REG 8,1 292292 3205748 /var/log/lastlog -otelcol 45246 root 30r REG 8,1 157412 3161673 /var/log/auth.log -otelcol 45246 root 16r REG 8,1 96678 3195546 /var/log/juju/machine-lock.log -otelcol 45246 root 45r REG 8,1 77200 3205894 /var/log/cloud-init.log -otelcol 45246 root 35r REG 8,1 61211 3205745 /var/log/dpkg.log -otelcol 45246 root 25r REG 8,1 29037 3205893 /var/log/cloud-init-output.log -otelcol 45246 root 18r REG 8,1 6121 3205741 /var/log/apt/history.log -otelcol 45246 root 15r REG 8,1 1941 3206035 /var/log/unattended-upgrades/unattended-upgrades.log -otelcol 45246 root 17r REG 8,1 474 3183206 /var/log/alternatives.log -``` - -Compare the total size of logs to the available memory. diff --git a/docs/how-to/troubleshooting/troubleshoot-socket-too-many-open-files.md b/docs/how-to/troubleshooting/troubleshoot-socket-too-many-open-files.md deleted file mode 100644 index 2dd909bd..00000000 --- a/docs/how-to/troubleshooting/troubleshoot-socket-too-many-open-files.md +++ /dev/null @@ -1,85 +0,0 @@ ---- -myst: - html_meta: - description: "Troubleshoot socket too many open files errors when deploying the Grafana agent or Prometheus charms. Review logs and investigate scrape targets." ---- - -# Troubleshooting ``socket: too many open files`` - -When deploying the Grafana Agent or Prometheus charms in large environments, -you may sometimes bump into an issue where the large amount of scrape targets -leads to the process hitting the max open files count, as set by ``ulimit``. 
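Before raising any limits, it can help to confirm the limit the process is actually running with. A minimal check, reading from `/proc` and using the current shell's PID (`$$`) as a stand-in for the otelcol or Prometheus process you are investigating:

```shell
# Soft open-files limit of the current shell.
ulimit -n

# Effective "Max open files" limit of a given process, via /proc.
# Substitute the PID of the process you are debugging for $$.
grep "Max open files" "/proc/$$/limits"
```

If the reported limit is close to the number of files shown by `lsof`, the `ulimit` ceiling is the likely culprit.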
- -This issue can be identified by looking in your Grafana Agent logs, or Prometheus -Scrape Targets in the UI, for the following kind of message: - -``` -Get "http://10.0.0.1:9275/metrics": dial tcp 10.0.0.1:9275: socket: too many open files -``` - -To resolve this, we need to increase the max open file limit of the Kubernetes -deployment itself. For MicroK8s, this would be done by increasing the limits in -`/var/snap/microk8s/current/args/containerd-env`. - -## 1. Juju SSH into the machine - -```bash -$ juju ssh uk8s/1 -``` - -Substitute `uk8s/1` with the name of your MicroK8s unit. If you have more than -one unit, you will need to repeat this for each of them. - -## 2. Open the ``containerd-env`` - -You can use whatever editor you prefer for this. In this how-to, we'll use ``vim``. - -```bash -$ vim /var/snap/microk8s/current/args/containerd-env -``` - -## 3. Increase the `ulimit` - -```diff - -# Attempt to change the maximum number of open file descriptors -# this get inherited to the running containers -# -- ulimit -n 1024 || true -+ ulimit -n 65536 || true - -# Attempt to change the maximum locked memory limit -# this get inherited to the running containers -# -- ulimit -l 1024 || true -+ ulimit -l 16384 || true -``` - -## 4. Restart the MicroK8s machine - -Restart the machine the MicroK8s unit is deployed on and then wait for it to come back up. - -```bash -$ sudo reboot -``` - -## 5. Validate - -Validate that the change made it through and had the desired effect once the machine is -back up and running. - -```bash -$ juju ssh uk8s/1 cat /var/snap/microk8s/current/args/containerd-env - -[...] 
- -# Attempt to change the maximum number of open file descriptors -# this get inherited to the running containers -# -ulimit -n 65536 || true - -# Attempt to change the maximum locked memory limit -# this get inherited to the running containers -# -ulimit -l 16384 || true -``` \ No newline at end of file diff --git a/docs/how-to/upgrade.md b/docs/how-to/upgrade.md index a423ba46..bc5e9b33 100644 --- a/docs/how-to/upgrade.md +++ b/docs/how-to/upgrade.md @@ -6,6 +6,38 @@ myst: # Upgrade instructions +## COS 3 +### Migrate from COS 2 to COS 3 +Using Terraform: +1. Update all references to track 2/stable and then: + ```bash + terraform apply + ``` +2. Manually refresh all charms to 2/stable + ```bash + juju refresh --channel 2/stable + ``` +3. Update all references to track 3/stable and then: + ```bash + terraform apply + ``` + +### Migrate from COS Lite 2 to COS Lite 3 +Using Terraform: +1. Update all references to track 2/stable and then: + ```bash + terraform apply + ``` +2. Follow the instructions for `Without Terraform`. + +Without Terraform: +1. Refresh all track 2 charms so they point to the latest revision on `2/stable`. + ```bash + juju refresh --channel 2/stable + ``` +2. Refresh to track 3. + + ## COS 2 ### Migrate from COS Lite 1 to COS 2 diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 00000000..723a0319 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,67 @@ +--- +myst: + html_meta: + description: "Canonical Observability Stack documentation with tutorials, how-to guides, reference material, and architecture explanations for low-ops observability operations." +--- + +# Observability documentation + +Highly-integrated, low-operations observability stack powered by [Juju](https://documentation.ubuntu.com/juju/3.6/) and Kubernetes. + +The Canonical Observability Stack (COS) gathers, processes, visualizes, and alerts on telemetry generated by workloads running both within, and outside of, Juju. 
+ +By leveraging the topology model of Juju to contextualize the data, and charm relations to automate configuration and integration, it provides a low-ops observability suite based on best-in-class, open-source observability tools. + +For Site Reliability Engineers, COS provides a turn-key, out-of-the-box solution for improved day-2 operational insight. + +```{toctree} +:maxdepth: 1 +:hidden: + +tutorial/index +how-to/index +reference/index +explanation/index +``` + +````{grid} 1 1 2 2 +```{grid-item-card} {ref}`Tutorial ` + +**Get started** - a hands-on introduction for new users deploying COS. +``` + +```{grid-item-card} {ref}`How-to guides ` + +**Step-by-step guides** - learn key operations, ranging from exposing +a metrics endpoint to integrating COS Lite with uncharmed applications. + +``` +```` + +````{grid} 1 1 2 2 +:reverse: +```{grid-item-card} {ref}`Reference ` + +**Technical information** - specifications, APIs, architecture. +``` + +```{grid-item-card} {ref}`Explanation ` + +**Discussion and clarification** of key topics and concepts. + +``` + +```` + +## Project and community + +The Canonical Observability Stack is a member of the Canonical family. It's an open source project +that warmly welcomes community projects, contributions, suggestions, fixes +and constructive feedback. + +- [Join the Discourse community forum](https://discourse.charmhub.io/c/charm/observability/62) +- [Join the Matrix community chat](https://matrix.to/#/#cos:ubuntu.com) +- [Contribute on GitHub](https://github.com/canonical/observability) + +- [Code of conduct](https://ubuntu.com/community/ethos/code-of-conduct) +- [Canonical contributor license agreement](https://canonical.com/legal/contributors) diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index f92251e2..00000000 --- a/docs/index.rst +++ /dev/null @@ -1,62 +0,0 @@ -.. 
meta:: - :description: Canonical Observability Stack documentation with tutorials, how-to guides, reference material, and architecture explanations for low-ops observability operations. - -Observability documentation -=========================== - -Highly-integrated, low-operations observability stack powered by `Juju `_ and Kubernetes. - -The Canonical Observability Stack (COS) gathers, processes, visualizes, and alerts on telemetry generated by workloads running both within, and outside of, Juju. - -By leveraging the topology model of Juju to contextualize the data, and charm relations to automate configuration and integration, it provides a low-ops observability suite based on best-in-class, open-source observability tools. - -For Site Reliability Engineers, COS provides a turn-key, out-of-the-box solution for improved day-2 operational insight. - - -.. toctree:: - :maxdepth: 1 - :hidden: - - tutorial/index - how-to/index - reference/index - explanation/index - - -.. grid:: 1 1 2 2 - - .. grid-item-card:: :ref:`Tutorial ` - - **Get started** - a hands-on introduction for new users deploying COS. - - .. grid-item-card:: :ref:`How-to guides ` - - **Step-by-step guides** - learn key operations, ranging from exposing - a metrics endpoint to integrating COS Lite with uncharmed applications. - -.. grid:: 1 1 2 2 - :reverse: - - .. grid-item-card:: :ref:`Reference ` - - **Technical information** - specifications, APIs, architecture. - - .. grid-item-card:: :ref:`Explanation ` - - **Discussion and clarification** of key topics and concepts. - - -Project and community -===================== - -The Canonical Observability Stack is a member of the Canonical family. It's an open source project -that warmly welcomes community projects, contributions, suggestions, fixes -and constructive feedback. 
- -* `Join the Discourse community forum `_ -* `Join the Matrix community chat `_ -* `Contribute on GitHub `_ - -* `Code of conduct `_ -* `Canonical contributor license agreement - `_ diff --git a/docs/reference/best-practices/index.md b/docs/reference/best-practices/index.md new file mode 100644 index 00000000..fb6f971d --- /dev/null +++ b/docs/reference/best-practices/index.md @@ -0,0 +1,18 @@ +--- +myst: + html_meta: + description: "Apply Canonical Observability Stack deployment best practices for topology, lifecycle, networking, and storage decisions in production environments." +--- + +(best-practices)= + +# Deployment best practices + +```{toctree} +:maxdepth: 1 + +Topology +Lifecycle +Storage +Networking +``` diff --git a/docs/reference/best-practices/index.rst b/docs/reference/best-practices/index.rst deleted file mode 100644 index 9b729835..00000000 --- a/docs/reference/best-practices/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -.. meta:: - :description: Apply Canonical Observability Stack deployment best practices for topology, lifecycle, networking, and storage decisions in production environments. - -.. _best-practices: - -Deployment best practices -********************************* - -.. toctree:: - :maxdepth: 1 - - Topology - Lifecycle - Storage - Networking diff --git a/docs/reference/charms.md b/docs/reference/charms.md deleted file mode 100644 index e05dcf00..00000000 --- a/docs/reference/charms.md +++ /dev/null @@ -1,54 +0,0 @@ ---- -myst: - html_meta: - description: "Browse the complete reference for Canonical Observability Stack charms and operators, including component roles and deployment-relevant details." 
---- - -# Charms - -## COS - -| Project | Substrate | Charmhub | Source Code | Bug Report | -|--------------------------|-----------|----------------------------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------------| -| Catalogue | K8s | [Charmhub](https://charmhub.io/catalogue-k8s) | [Source](https://github.com/canonical/catalogue-k8s-operator) | [Issues](https://github.com/canonical/catalogue-k8s-operator/issues) | -| Grafana | K8s | [Charmhub](https://charmhub.io/grafana-k8s) | [Source](https://github.com/canonical/grafana-k8s-operator) | [Issues](https://github.com/canonical/grafana-k8s-operator/issues) | -| Loki Coordinator | K8s | [Charmhub](https://charmhub.io/loki-coordinator-k8s) | [Source](https://github.com/canonical/loki-coordinator-k8s-operator) | [Issues](https://github.com/canonical/loki-coordinator-k8s-operator/issues) | -| Loki Worker | K8s | [Charmhub](https://charmhub.io/loki-worker-k8s) | [Source](https://github.com/canonical/loki-worker-k8s-operator) | [Issues](https://github.com/canonical/loki-worker-k8s-operator/issues) | -| Mimir Coordinator | K8s | [Charmhub](https://charmhub.io/mimir-coordinator-k8s) | [Source](https://github.com/canonical/mimir-coordinator-k8s-operator) | [Issues](https://github.com/canonical/mimir-coordinator-k8s-operator/issues) | -| Mimir Worker | K8s | [Charmhub](https://charmhub.io/mimir-worker-k8s) | [Source](https://github.com/canonical/mimir-worker-k8s-operator) | [Issues](https://github.com/canonical/mimir-worker-k8s-operator/issues) | -| S3 Integrator | Any | [Charmhub](https://charmhub.io/s3-integrator) | [Source](https://github.com/canonical/s3-integrator) | [Issues](https://github.com/canonical/s3-integrator/issues) | -| Self-signed Certificates | Any | [Charmhub](https://charmhub.io/self-signed-certificates) | 
[Source](https://github.com/canonical/self-signed-certificates-operator) | [Issues](https://github.com/canonical/self-signed-certificates-operator/issues) | -| Tempo Coordinator | K8s | [Charmhub](https://charmhub.io/tempo-coordinator-k8s) | [Source](https://github.com/canonical/tempo-coordinator-k8s-operator) | [Issues](https://github.com/canonical/tempo-coordinator-k8s-operator/issues) | -| Tempo Worker | K8s | [Charmhub](https://charmhub.io/tempo-worker-k8s) | [Source](https://github.com/canonical/tempo-worker-k8s-operator) | [Issues](https://github.com/canonical/tempo-worker-k8s-operator/issues) | -| Traefik | K8s | [Charmhub](https://charmhub.io/traefik-k8s) | [Source](https://github.com/canonical/traefik-k8s-operator) | [Issues](https://github.com/canonical/traefik-k8s-operator/issues) | - -## COS Lite - -| Project | Substrate | Charmhub | Source Code | Bug Report | -|--------------------------|-----------|----------------------------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------------| -| Alertmanager | K8s | [Charmhub](https://charmhub.io/alertmanager-k8s) | [Source](https://github.com/canonical/alertmanager-k8s-operator) | [Issues](https://github.com/canonical/alertmanager-k8s-operator/issues) | -| Catalogue | K8s | [Charmhub](https://charmhub.io/catalogue-k8s) | [Source](https://github.com/canonical/catalogue-k8s-operator) | [Issues](https://github.com/canonical/catalogue-k8s-operator/issues) | -| Grafana | K8s | [Charmhub](https://charmhub.io/grafana-k8s) | [Source](https://github.com/canonical/grafana-k8s-operator) | [Issues](https://github.com/canonical/grafana-k8s-operator/issues) | -| Loki | K8s | [Charmhub](https://charmhub.io/loki-k8s) | [Source](https://github.com/canonical/loki-k8s-operator) | [Issues](https://github.com/canonical/loki-k8s-operator/issues) | -| Prometheus | K8s | 
[Charmhub](https://charmhub.io/prometheus-k8s) | [Source](https://github.com/canonical/prometheus-k8s-operator) | [Issues](https://github.com/canonical/prometheus-k8s-operator/issues) | -| Traefik | K8s | [Charmhub](https://charmhub.io/traefik-k8s) | [Source](https://github.com/canonical/traefik-k8s-operator) | [Issues](https://github.com/canonical/traefik-k8s-operator/issues) | -| Self-signed Certificates | Any | [Charmhub](https://charmhub.io/self-signed-certificates) | [Source](https://github.com/canonical/self-signed-certificates-operator) | [Issues](https://github.com/canonical/self-signed-certificates-operator/issues) | - -## Peripheral charms - -| Project | Substrate | Charmhub | Source Code | Bug Report | -|--------------------------|-----------|--------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------| -| Blackbox Exporter | K8s | [Charmhub](https://charmhub.io/blackbox-exporter-k8s) | [Source](https://github.com/canonical/blackbox-exporter-k8s-operator) | [Issues](https://github.com/canonical/blackbox-exporter-k8s-operator/issues) | -| Blackbox Exporter | Machine | [Charmhub](https://charmhub.io/blackbox-exporter) | [Source](https://github.com/canonical/blackbox-exporter-operator) | [Issues](https://github.com/canonical/blackbox-exporter-operator/issues) | -| COS Configuration | K8s | [Charmhub](https://charmhub.io/cos-configuration-k8s) | [Source](https://github.com/canonical/cos-configuration-k8s-operator) | [Issues](https://github.com/canonical/cos-configuration-k8s-operator/issues) | -| COS Proxy | Machines | [Charmhub](https://charmhub.io/cos-proxy) | [Source](https://github.com/canonical/cos-proxy-operator) | [Issues](https://github.com/canonical/cos-proxy-operator/issues) | -| Grafana Agent | K8s | [Charmhub](https://charmhub.io/grafana-agent-k8s) | 
[Source](https://github.com/canonical/grafana-agent-k8s-operator) | [Issues](https://github.com/canonical/grafana-agent-k8s-operator/issues) | -| Grafana Agent | Machines | [Charmhub](https://charmhub.io/grafana-agent) | [Source](https://github.com/canonical/grafana-agent-operator) | [Issues](https://github.com/canonical/grafana-agent-operator/issues) | -| Opentelemetry Collector | K8s | [Charmhub](https://charmhub.io/opentelemetry-collector-k8s) | [Source](https://github.com/canonical/opentelemetry-collector-k8s-operator) | [Issues](https://github.com/canonical/opentelemetry-collector-k8s-operator/issues) | -| Opentelemetry Collector | Machines | [Charmhub](https://charmhub.io/opentelemetry-collector) | [Source](https://github.com/canonical/grafana-agent-operator) | [Issues](https://github.com/canonical/opentelemetry-collector-operator/issues) -| Karma | K8s | [Charmhub](https://charmhub.io/karma-k8s) | [Source](https://github.com/canonical/karma-k8s-operator) | [Issues](https://github.com/canonical/karma-k8s-operator/issues) | -| Karma Alertmanager Proxy | K8s | [Charmhub](https://charmhub.io/karma-alertmanager-proxy-k8s) | [Source](https://github.com/canonical/karma-alertmanager-proxy-k8s-operator) | [Issues](https://github.com/canonical/karma-alertmanager-proxy-k8s-operator/issues) | -| Prometheus Scrape Config | Any | [Charmhub](https://charmhub.io/prometheus-scrape-config-k8s) | [Source](https://github.com/canonical/prometheus-scrape-config-k8s-operator) | [Issues](https://github.com/canonical/prometheus-scrape-config-k8s-operator/issues) | -| Prometheus Scrape Target | Any | [Charmhub](https://charmhub.io/prometheus-scrape-target-k8s) | [Source](https://github.com/canonical/prometheus-scrape-target-k8s-operator) | [Issues](https://github.com/canonical/prometheus-scrape-target-k8s-operator/issues) | -| Script Exporter | K8s | [Charmhub](https://charmhub.io/script-exporter) | [Source](https://github.com/canonical/script-exporter-operator) | 
[Issues](https://github.com/canonical/script-exporter-operator/issues) | -| SNMP Exporter | Machines | - | [Source](https://github.com/canonical/snmp-exporter-operator) | [Issues](https://github.com/canonical/snmp-exporter-operator/issues) | diff --git a/docs/reference/cos-components.md b/docs/reference/cos-components.md new file mode 100644 index 00000000..00058491 --- /dev/null +++ b/docs/reference/cos-components.md @@ -0,0 +1,82 @@ +--- +myst: + html_meta: + description: "Browse the complete reference for Canonical Observability Stack charms, rocks and snaps, including component roles, deployment-relevant details and registry locations." +--- + +# COS components + +## COS charms + +| Charm | Substrate | Workload version | Contributing | +| ------------------------------------------------------------------------ | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [Catalogue](https://charmhub.io/catalogue-k8s) | K8s | | [Source](https://github.com/canonical/catalogue-k8s-operator), [issues](https://github.com/canonical/catalogue-k8s-operator/issues) | +| [Grafana](https://charmhub.io/grafana-k8s) | K8s | | [Source](https://github.com/canonical/grafana-k8s-operator), [issues](https://github.com/canonical/grafana-k8s-operator/issues) | +| [Loki Coordinator](https://charmhub.io/loki-coordinator-k8s) | K8s | | [Source](https://github.com/canonical/loki-coordinator-k8s-operator), [issues](https://github.com/canonical/loki-coordinator-k8s-operator/issues) | +| [Loki Worker](https://charmhub.io/loki-worker-k8s) | K8s | | [Source](https://github.com/canonical/loki-worker-k8s-operator), [issues](https://github.com/canonical/loki-worker-k8s-operator/issues) | +| [Mimir Coordinator](https://charmhub.io/mimir-coordinator-k8s) | K8s | | [Source](https://github.com/canonical/mimir-coordinator-k8s-operator), 
[issues](https://github.com/canonical/mimir-coordinator-k8s-operator/issues) | +| [Mimir Worker](https://charmhub.io/mimir-worker-k8s) | K8s | | [Source](https://github.com/canonical/mimir-worker-k8s-operator), [issues](https://github.com/canonical/mimir-worker-k8s-operator/issues) | +| [S3 Integrator](https://charmhub.io/s3-integrator) | Any | | [Source](https://github.com/canonical/s3-integrator), [issues](https://github.com/canonical/s3-integrator/issues) | +| [Self-signed Certificates](https://charmhub.io/self-signed-certificates) | Any | | [Source](https://github.com/canonical/self-signed-certificates-operator), [issues](https://github.com/canonical/self-signed-certificates-operator/issues) | +| [Tempo Coordinator](https://charmhub.io/tempo-coordinator-k8s) | K8s | | [Source](https://github.com/canonical/tempo-coordinator-k8s-operator), [issues](https://github.com/canonical/tempo-coordinator-k8s-operator/issues) | +| [Tempo Worker](https://charmhub.io/tempo-worker-k8s) | K8s | | [Source](https://github.com/canonical/tempo-worker-k8s-operator), [issues](https://github.com/canonical/tempo-worker-k8s-operator/issues) | +| [Traefik](https://charmhub.io/traefik-k8s) | K8s | | [Source](https://github.com/canonical/traefik-k8s-operator), [issues](https://github.com/canonical/traefik-k8s-operator/issues) | + +## COS Lite charms + +| Charm | Substrate | Workload version | Contributing | +| ------------------------------------------------------------------------ | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [Alertmanager](https://charmhub.io/alertmanager-k8s) | K8s | | [Source](https://github.com/canonical/alertmanager-k8s-operator), [issues](https://github.com/canonical/alertmanager-k8s-operator/issues) | +| [Catalogue](https://charmhub.io/catalogue-k8s) | K8s | | [Source](https://github.com/canonical/catalogue-k8s-operator), 
[issues](https://github.com/canonical/catalogue-k8s-operator/issues) | +| [Grafana](https://charmhub.io/grafana-k8s) | K8s | | [Source](https://github.com/canonical/grafana-k8s-operator), [issues](https://github.com/canonical/grafana-k8s-operator/issues) | +| [Loki](https://charmhub.io/loki-k8s) | K8s | | [Source](https://github.com/canonical/loki-k8s-operator), [issues](https://github.com/canonical/loki-k8s-operator/issues) | +| [Prometheus](https://charmhub.io/prometheus-k8s) | K8s | | [Source](https://github.com/canonical/prometheus-k8s-operator), [issues](https://github.com/canonical/prometheus-k8s-operator/issues) | +| [Traefik](https://charmhub.io/traefik-k8s) | K8s | | [Source](https://github.com/canonical/traefik-k8s-operator), [issues](https://github.com/canonical/traefik-k8s-operator/issues) | +| [Self-signed Certificates](https://charmhub.io/self-signed-certificates) | Any | | [Source](https://github.com/canonical/self-signed-certificates-operator), [issues](https://github.com/canonical/self-signed-certificates-operator/issues) | + +## Peripheral charms + +| Charm | Substrate | Workload version | Contributing | +| ---------------------------------------------------------------------------- | --------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [Blackbox Exporter](https://charmhub.io/blackbox-exporter-k8s) | K8s | | [Source](https://github.com/canonical/blackbox-exporter-k8s-operator), [issues](https://github.com/canonical/blackbox-exporter-k8s-operator/issues) | +| [Blackbox Exporter](https://charmhub.io/blackbox-exporter) | Machine | | [Source](https://github.com/canonical/blackbox-exporter-operator), [issues](https://github.com/canonical/blackbox-exporter-operator/issues) | +| [COS Configuration](https://charmhub.io/cos-configuration-k8s) | K8s | | 
[Source](https://github.com/canonical/cos-configuration-k8s-operator), [issues](https://github.com/canonical/cos-configuration-k8s-operator/issues) |
+| [COS Proxy](https://charmhub.io/cos-proxy) | Machines | | [Source](https://github.com/canonical/cos-proxy-operator), [issues](https://github.com/canonical/cos-proxy-operator/issues) |
+| [Grafana Agent](https://charmhub.io/grafana-agent-k8s) | K8s | | [Source](https://github.com/canonical/grafana-agent-k8s-operator), [issues](https://github.com/canonical/grafana-agent-k8s-operator/issues) |
+| [Grafana Agent](https://charmhub.io/grafana-agent) | Machines | | [Source](https://github.com/canonical/grafana-agent-operator), [issues](https://github.com/canonical/grafana-agent-operator/issues) |
+| [Opentelemetry Collector](https://charmhub.io/opentelemetry-collector-k8s) | K8s | | [Source](https://github.com/canonical/opentelemetry-collector-k8s-operator), [issues](https://github.com/canonical/opentelemetry-collector-k8s-operator/issues) |
+| [Opentelemetry Collector](https://charmhub.io/opentelemetry-collector) | Machines | | [Source](https://github.com/canonical/opentelemetry-collector-operator), [issues](https://github.com/canonical/opentelemetry-collector-operator/issues) |
+| [Karma](https://charmhub.io/karma-k8s) | K8s | | [Source](https://github.com/canonical/karma-k8s-operator), [issues](https://github.com/canonical/karma-k8s-operator/issues) |
+| [Karma Alertmanager Proxy](https://charmhub.io/karma-alertmanager-proxy-k8s) | K8s | | [Source](https://github.com/canonical/karma-alertmanager-proxy-k8s-operator), [issues](https://github.com/canonical/karma-alertmanager-proxy-k8s-operator/issues) |
+| [Prometheus Scrape Config](https://charmhub.io/prometheus-scrape-config-k8s) | Any | | [Source](https://github.com/canonical/prometheus-scrape-config-k8s-operator), [issues](https://github.com/canonical/prometheus-scrape-config-k8s-operator/issues) |
+| [Prometheus Scrape Target](https://charmhub.io/prometheus-scrape-target-k8s)
| Any | | [Source](https://github.com/canonical/prometheus-scrape-target-k8s-operator), [issues](https://github.com/canonical/prometheus-scrape-target-k8s-operator/issues) | +| [Script Exporter](https://charmhub.io/script-exporter) | K8s | | [Source](https://github.com/canonical/script-exporter-operator), [issues](https://github.com/canonical/script-exporter-operator/issues) | +| SNMP Exporter | Machines | | [Source](https://github.com/canonical/snmp-exporter-operator), [issues](https://github.com/canonical/snmp-exporter-operator/issues) | + +## Rocks + +| Rock | Contributing | +| ----------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`ubuntu/alertmanager`](https://hub.docker.com/r/ubuntu/alertmanager) | [Source](https://github.com/canonical/alertmanager-rock), [issues](https://github.com/canonical/alertmanager-rock/issues) | +| [`ubuntu/blackbox-exporter`](https://hub.docker.com/r/ubuntu/blackbox-exporter) | [Source](https://github.com/canonical/blackbox-exporter-rock), [issues](https://github.com/canonical/blackbox-exporter-rock/issues) | +| `ghcr.io/canonical/git-sync` | [Source](https://github.com/canonical/git-sync-rock), [issues](https://github.com/canonical/git-sync-rock/issues) | +| [`ubuntu/grafana-agent`](https://hub.docker.com/r/ubuntu/grafana-agent) | [Source](https://github.com/canonical/grafana-agent-rock), [issues](https://github.com/canonical/grafana-agent-rock/issues) | +| [`ubuntu/grafana`](https://hub.docker.com/r/ubuntu/grafana) | [Source](https://github.com/canonical/grafana-rock), [issues](https://github.com/canonical/grafana-rock/issues) | +| [`ubuntu/karma`](https://hub.docker.com/r/ubuntu/karma) | [Source](https://github.com/canonical/karma-rock), [issues](https://github.com/canonical/karma-rock/issues) | +| 
[`ubuntu/loki`](https://hub.docker.com/r/ubuntu/loki) | [Source](https://github.com/canonical/loki-rock), [issues](https://github.com/canonical/loki-rock/issues) | +| [`ubuntu/mimir`](https://hub.docker.com/r/ubuntu/mimir) | [Source](https://github.com/canonical/mimir-rock), [issues](https://github.com/canonical/mimir-rock/issues) | +| [`ubuntu/nginx-prometheus-exporter`](https://hub.docker.com/r/ubuntu/nginx-prometheus-exporter) | [Source](https://github.com/canonical/nginx-prometheus-exporter-rock), [issues](https://github.com/canonical/nginx-prometheus-exporter-rock/issues) | +| [`ubuntu/node-exporter`](https://hub.docker.com/r/ubuntu/node-exporter) | [Source](https://github.com/canonical/node-exporter-rock), [issues](https://github.com/canonical/node-exporter-rock/issues) | +| [`ubuntu/opentelemetry-collector`](https://hub.docker.com/r/ubuntu/opentelemetry-collector) | [Source](https://github.com/canonical/opentelemetry-collector-rock), [issues](https://github.com/canonical/opentelemetry-collector-rock/issues) | +| [`ubuntu/parca`](https://hub.docker.com/r/ubuntu/parca) | [Source](https://github.com/canonical/parca-rock), [issues](https://github.com/canonical/parca-rock/issues) | +| [`ubuntu/prometheus-pushgateway`](https://hub.docker.com/r/ubuntu/prometheus-pushgateway) | [Source](https://github.com/canonical/prometheus-pushgateway-rock), [issues](https://github.com/canonical/prometheus-pushgateway-rock/issues) | +| [`ghcr.io/canonical/s3proxy`](https://github.com/canonical/s3proxy-rock/pkgs/container/s3proxy) | [Source](https://github.com/canonical/s3proxy-rock), [issues](https://github.com/canonical/s3proxy-rock/issues) | +| [`ubuntu/tempo`](https://hub.docker.com/r/ubuntu/tempo) | [Source](https://github.com/canonical/tempo-rock), [issues](https://github.com/canonical/tempo-rock/issues) | +| [`ubuntu/xk6`](https://hub.docker.com/r/ubuntu/xk6) | [Source](https://github.com/canonical/xk6-rock), [issues](https://github.com/canonical/xk6-rock/issues) | + +## 
Snaps
+
+| Snap | Contributing |
+| ----------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
+| [Grafana Agent](https://snapcraft.io/grafana-agent) | [Source](https://github.com/canonical/grafana-agent-snap), [issues](https://github.com/canonical/grafana-agent-snap/issues) |
+| [OpenTelemetry Collector](https://snapcraft.io/opentelemetry-collector) | [Source](https://github.com/canonical/opentelemetry-collector-snap), [issues](https://github.com/canonical/opentelemetry-collector-snap/issues) |
diff --git a/docs/reference/index.md b/docs/reference/index.md
new file mode 100644
index 00000000..ec5290d5
--- /dev/null
+++ b/docs/reference/index.md
@@ -0,0 +1,71 @@
+---
+myst:
+  html_meta:
+    description: "Use this COS reference material for release information, security details, Juju topology, compatibility matrices, and operational best practices."
+---
+
+(reference)=
+
+# Reference
+
+These pages include reference material for deploying, operating and integrating with COS.
+Use it when you need exact requirements, compatibility matrices, or operational hardening
+guidance.
+
+## Release & lifecycle
+
+Information about releases, timelines and our policy for updates and
+compatibility.
+
+```{toctree}
+:maxdepth: 1
+
+Release Notes
+Release Policy
+System Requirements
+```
+
+## Security
+
+Hardware, software and security guidance required for production use. Consult
+these pages before planning or hardening a deployment.
+
+```{toctree}
+:maxdepth: 1
+
+Security Hardening Guide
+Cryptographic Documentation
+```
+
+## Topology
+
+Topology reference pages describing how COS makes use of Juju topology as telemetry labels.
+ +```{toctree} +:maxdepth: 1 + +Model Topology for COS Lite +Juju Topology Labels +``` + +## Integrations & artifacts + +Compatibility and packaging information for charms, snaps, and rocks (OCI images). + +```{toctree} +:maxdepth: 1 + +Integration Matrix +COS components +``` + +## Deployment best practices + +Operational guidance and recommended patterns for deploying and managing +COS in production. + +```{toctree} +:maxdepth: 1 + +Deployment Best Practices +``` diff --git a/docs/reference/index.rst b/docs/reference/index.rst deleted file mode 100644 index a704918b..00000000 --- a/docs/reference/index.rst +++ /dev/null @@ -1,71 +0,0 @@ -.. meta:: - :description: Use this COS reference material for release information, security details, Juju topology, compatibility matrices, and operational best practices. - -.. _reference: - -Reference -********* - -These pages include reference material for deploying, operating and integrating with COS. -Use it when you need exact requirements, compatibility matrices, or operational hardening -guidance. - -Release & lifecycle -==================== - -Information about releases, timelines and our policy for updates and -compatibility. - -.. toctree:: - :maxdepth: 1 - - Release Notes - Release Policy - System Requirements - -Security -======================= - -Hardware, software and security guidance required for production use. Consult -these pages before planning or hardening a deployment. - -.. toctree:: - :maxdepth: 1 - - Security Hardening Guide - Cryptographic Documentation - -Topology -================= - -Topology reference pages describing how COS makes us of Juju topology as telemetry labels. - -.. toctree:: - :maxdepth: 1 - - Model Topology for COS Lite - Juju Topology Labels - -Integrations & artifacts -========================= - -Compatibility and packaging information for charms, snaps, and rocks (OCI images). - -.. 
toctree:: - :maxdepth: 1 - - Integration Matrix - Charms - Snaps - Rock OCI Images - -Deployment best practices -========================= - -Operational guidance and recommended patterns for deploying and managing -COS in production. - -.. toctree:: - :maxdepth: 1 - - Deployment Best Practices diff --git a/docs/reference/release-notes.md b/docs/reference/release-notes.md index 884a23ca..19f121c7 100644 --- a/docs/reference/release-notes.md +++ b/docs/reference/release-notes.md @@ -4,44 +4,56 @@ myst: description: "Read COS 2 release notes to track new features, review requirements and compatibility, peripheral-charm changes, and breaking and deprecated changes." --- -# COS 2 release notes -October 2025 +# COS 3 release notes +May 2026 -These release notes cover new features and changes in COS 2. +These release notes cover new features and changes in COS 3. -COS 2 is designated as a short-term support release. This means it will continue to receive security updates and critical bug fixes for 9 months. +COS 3 newer versions of all underlying charms, as well as new features around charmed opentelemetry-collector. + +COS 3 is designated as a long-term support (LTS) release. This means it will continue to receive security updates and critical bug fixes for 15 years. + +If you have COS 2 installed, make plans to upgrade to COS 3 before July 2026. See our [release policy](release-policy.md) and [upgrade instructions](../how-to/upgrade.md). +To report bugs or security issues, refer to the index of [COS components](../reference/cos-components). + ## Requirements and compatibility See [system requirements](system-requirements.md). -COS 2 is compatible with Juju v3.6+. When deployed using terraform, Juju v3.6.11+ is recommended. +COS 3 is compatible with Juju v3.6+. 
+ + +## What's new in COS 3 + + +### COS components + +| Component | Version | +|--------------------------|---------| +| alertmanager | 0.x | +| catalogue | | +| grafana | 12.x | +| loki | 3.x | +| mimir | 3.x | +| opentelemetry-collector | 0.x | +| s3-integrator | | +| self-signed-certificates | | +| tempo | 2.x | +| traefik | 2.x | + + +### COS Lite components + + -## What's new in COS 2 -- Terraform modules for [COS](https://github.com/canonical/observability-stack/tree/main/terraform/cos) - and [COS Lite](https://github.com/canonical/observability-stack/tree/main/terraform/cos-lite). - As Juju bundles are deprecated, the standard way of deploying COS is now using the - [Juju Terraform provider](https://registry.terraform.io/providers/juju/juju/latest/docs). - - [Telemetry correlation](../explanation/telemetry-correlation.md) is now automatically enabled when you deploy COS using the - Terraform module. -- **Grafana v12**. We upgraded the workload version from Grafana 9 to Grafana 12. A thorough review of Grafana's breaking changes and how they affect us is available [on Discourse](https://discourse.charmhub.io/t/cos-will-start-using-grafana-12-what-changed/18868). -- **Opentelemetry collector**. Charmed opentelemetry-collector was released. The charm was designed to be a drop-in replacement for - the grafana-agent charm (upstream grafana-agent is EOL since November 2025, and we will support charmed grafana-agent until July 2026). -- `extra_alert_labels` config option. A new config option in grafana-agent and opentelemetry-collector enabled adding custom labels to alert rules. Custom labels are useful for differentiating alerts coming from sites with different SLAs. -- **API links in catalogue-k8s**. The cards in catalogue-k8s now support extra links for documentation and APIs. COS charms now provide links to the workload API, making it easier to locate ingress URLs -for workloads without a web UI. 
## Notable changes in peripheral charms
-- Multiple scripts in script-exporter. The script exporter VM charm can now take an archive of scripts. It can now be deployed on 20.04, 22.04 and 24.04.
-- Prior to cos-proxy `rev166`, duplicate telemetry exists after `upgrade` and `config-changed` hooks. Follow [these remediation steps](https://github.com/canonical/cos-proxy-operator/issues/208#issuecomment-3367094739) to resolve, requiring you to upgrade cos-proxy to `>=rev166` via `track2`. This also includes the feature of scrape-configs and alert rules for `cos-agent` relations.
-## Backwards-incompatible changes
-- Charms from track 2 can be deployed on juju models v3.6+.
-Terraform module variable `model_name` renamed to `model` in all charms.
-Grafana v12 changes how the panel view URL is generated for repeated panels. Links to **repeated panels** in a dashboard changed slightly; previously bookmarked links to a repeated panel (not its dashboard) will no longer work.
+## Backwards-incompatible changes
+- If you are using charmed Grafana Agent to push telemetry to COS, note that the vendor announced end-of-life, so we will not be supporting the charm beyond July 2026. Make plans to [upgrade to charmed OpenTelemetry Collector](../how-to/migrate-grafana-agent-to-otelcol).
## Deprecated features
-- The `LogProxyConsumer` charm library (owned by Loki) is deprecated. Use Pebble log forwarding instead.
diff --git a/docs/reference/release-policy.md b/docs/reference/release-policy.md
index fa511d57..d9ba258f 100644
--- a/docs/reference/release-policy.md
+++ b/docs/reference/release-policy.md
@@ -10,6 +10,15 @@ Our release policy includes two kinds of releases, short-term releases and long-
We release every six months, same as Ubuntu, and our LTS releases coincide with [Ubuntu's LTS release cadence](https://ubuntu.com/about/release-cycle).
+
+## Long-term support
+
+| Track | Release date | End of life | Ubuntu base | Min.
Juju version | +| ------- | ------------- | ----------------- | ------------------------------------ | ----------------- | +| `3` | 2026-05 | 2036-04 (approx.) | 26.04 (rocks), 22.04+ (subordinates) | 3.6 | + + + ## Short-term releases Short-term releases are supported for nine months by providing security patches and critical bug fixes. @@ -19,13 +28,6 @@ Short-term releases are supported for nine months by providing security patches | `1` | 2025-05 | 2026-02 | 24.04 (rocks) | 3.1 | Mimir 2.x, Prometheus 2.x, Loki 2.x (COS Lite), Loki 3.0 (COS), Grafana 9.x, Grafana Agent 0.40.4 | -## Long-term support - -| Track | Release date | End of life | Ubuntu base | Min. Juju version | Brief summary | -| ------- | ------------------- | ------------------- | ------------------------------------ | ----------------- | ------------- | -| `3-lts` | 2026-04 (predicted) | 2038-04 (predicted) | 26.04 (rocks), 22.04+ (subordinates) | | | - - ## Charmhub tracks and git branches We create the Charmhub track at beginning of a cycle, and a git branch at end of cycle. For example, during June-September 2025, we had track `2` on Charmhub, but only `track/1` and `main` branches on github. diff --git a/docs/reference/rock-oci-images.md b/docs/reference/rock-oci-images.md deleted file mode 100644 index 1237b34c..00000000 --- a/docs/reference/rock-oci-images.md +++ /dev/null @@ -1,26 +0,0 @@ ---- -myst: - html_meta: - description: "Reference COS Rock OCI images with registry locations, source repositories, and image details needed for reproducible deployments." 
---- - -# Rock OCI Images - -| Image | Registry | Source Code | Bug Report | -|------------------------------------|---------------------------------------------------------------------------|-----------------------------------------------------------------------|------------------------------------------------------------------------------| -| `ubuntu/alertmanager` | [Image](https://hub.docker.com/r/ubuntu/alertmanager) | [Source](https://github.com/canonical/alertmanager-rock) | [Issues](https://github.com/canonical/alertmanager-rock/issues) | -| `ubuntu/blackbox-exporter` | [Image](https://hub.docker.com/r/ubuntu/blackbox-exporter) | [Source](https://github.com/canonical/blackbox-exporter-rock) | [Issues](https://github.com/canonical/blackbox-exporter-rock/issues) | -| `ghcr.io/canonical/git-sync` | -- | [Source](https://github.com/canonical/git-sync-rock) | [Issues](https://github.com/canonical/git-sync-rock/issues) | -| `ubuntu/grafana-agent` | [Image](https://hub.docker.com/r/ubuntu/grafana-agent) | [Source](https://github.com/canonical/grafana-agent-rock) | [Issues](https://github.com/canonical/grafana-agent-rock/issues) | -| `ubuntu/grafana` | [Image](https://hub.docker.com/r/ubuntu/grafana) | [Source](https://github.com/canonical/grafana-rock) | [Issues](https://github.com/canonical/grafana-rock/issues) | -| `ubuntu/karma` | [Image](https://hub.docker.com/r/ubuntu/karma) | [Source](https://github.com/canonical/karma-rock) | [Issues](https://github.com/canonical/karma-rock/issues) | -| `ubuntu/loki` | [Image](https://hub.docker.com/r/ubuntu/loki) | [Source](https://github.com/canonical/loki-rock) | [Issues](https://github.com/canonical/loki-rock/issues) | -| `ubuntu/mimir` | [Image](https://hub.docker.com/r/ubuntu/mimir) | [Source](https://github.com/canonical/mimir-rock) | [Issues](https://github.com/canonical/mimir-rock/issues) | -| `ubuntu/nginx-prometheus-exporter` | [Image](https://hub.docker.com/r/ubuntu/nginx-prometheus-exporter) | 
[Source](https://github.com/canonical/nginx-prometheus-exporter-rock) | [Issues](https://github.com/canonical/nginx-prometheus-exporter-rock/issues) | -| `ubuntu/node-exporter` | [Image](https://hub.docker.com/r/ubuntu/node-exporter) | [Source](https://github.com/canonical/node-exporter-rock) | [Issues](https://github.com/canonical/node-exporter-rock/issues) | -| `ubuntu/opentelemetry-collector` | [Image](https://hub.docker.com/r/ubuntu/opentelemetry-collector) | [Source](https://github.com/canonical/opentelemetry-collector-rock) | [Issues](https://github.com/canonical/opentelemetry-collector-rock/issues) | -| `ubuntu/parca` | [Image](https://hub.docker.com/r/ubuntu/parca) | [Source](https://github.com/canonical/parca-rock) | [Issues](https://github.com/canonical/parca-rock/issues) | -| `ubuntu/prometheus-pushgateway` | [Image](https://hub.docker.com/r/ubuntu/prometheus-pushgateway) | [Source](https://github.com/canonical/prometheus-pushgateway-rock) | [Issues](https://github.com/canonical/prometheus-pushgateway-rock/issues) | -| `ghcr.io/canonical/s3proxy` | [Image](https://github.com/canonical/s3proxy-rock/pkgs/container/s3proxy) | [Source](https://github.com/canonical/s3proxy-rock) | [Issues](https://github.com/canonical/s3proxy-rock/issues) | -| `ubuntu/tempo` | [Image](https://hub.docker.com/r/ubuntu/tempo) | [Source](https://github.com/canonical/tempo-rock) | [Issues](https://github.com/canonical/tempo-rock/issues) | -| `ubuntu/xk6` | [Image](https://hub.docker.com/r/ubuntu/xk6) | [Source](https://github.com/canonical/xk6-rock) | [Issues](https://github.com/canonical/xk6-rock/issues) | diff --git a/docs/reference/snaps.md b/docs/reference/snaps.md deleted file mode 100644 index 14d525a3..00000000 --- a/docs/reference/snaps.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -myst: - html_meta: - description: "Explore the COS snaps reference for package names, store locations, and source links for observability-related snaps." 
----
-
-# Snaps
-
-| Image | Snapcraft Store | Source Code | Bug Report |
-|-------------------------|-------------------------------------------------------|---------------------------------------------------------------------|----------------------------------------------------------------------------|
-| Grafana Agent | [Store](https://snapcraft.io/grafana-agent) | [Source](https://github.com/canonical/grafana-agent-snap) | [Issues](https://github.com/canonical/grafana-agent-snap/issues) |
-| OpenTelemetry Collector | [Store](https://snapcraft.io/opentelemetry-collector) | [Source](https://github.com/canonical/opentelemetry-collector-snap) | [Issues](https://github.com/canonical/opentelemetry-collector-snap/issues) |
diff --git a/docs/reference/system-requirements.md b/docs/reference/system-requirements.md
index 1ef0e1aa..347f4d2d 100644
--- a/docs/reference/system-requirements.md
+++ b/docs/reference/system-requirements.md
@@ -7,7 +7,28 @@ myst:
# System requirements

## COS
-3 nodes of 8cpu16gb or better.
+3 nodes with at least 8 vCPUs and 16 GB of memory each, and at least 100 GB of disk space.
+
## COS Lite
-4cpu8gb or better.
+As a general rule, plan for a VM with at least 4 vCPUs and 8 GB of memory.
+
+If you have an [estimate for the expected telemetry rate](../how-to/evaluate-telemetry-volume.md), refer to the tables below.
+
+
+### Metrics (Prometheus)
+
+| Metrics/min | vCPUs | Mem (GB) | Disk (GiB/day) |
+|-------------|-------|----------|----------------|
+| 1 M | 2 | 6 | 6 |
+| 3 M | 3 | 9 | 14 |
+| 6 M | 3 | 14 | 27 |
+
+
+### Logs (Loki)
+
+| Logs/min | vCPUs | Mem (GB) | Disk (GiB/day) |
+|----------|-------|----------|----------------|
+| 60 k | 5 | 8 | 20 |
+| 180 k | 5 | 8 | 60 |
+| 360 k | 6 | 8 | 120 |
\ No newline at end of file
diff --git a/docs/tutorial/index.rst b/docs/tutorial/index.md
similarity index 59%
rename from docs/tutorial/index.rst
rename to docs/tutorial/index.md
index d820d0b4..6746ced0 100644
--- a/docs/tutorial/index.rst
+++ b/docs/tutorial/index.md
@@ -1,58 +1,59 @@
-..
meta:: - :description: Follow hands-on COS tutorials to deploy, configure, and operate Canonical Observability Stack with reproducible steps and practical examples. +--- +myst: + html_meta: + description: "Follow hands-on COS tutorials to deploy, configure, and operate Canonical Observability Stack with reproducible steps and practical examples." +--- -.. _tutorial: +(tutorial)= -Tutorial -******** +# Tutorial If you want to learn the basics from experience, then our tutorial will help you acquire the necessary competencies from real-life examples with fully reproducible steps. -Installation -============ +## Installation -Get COS up and running on your MicroK8s environment with ease. Each of these +Get COS up and running on your MicroK8s environment with ease. Each of these paths of the tutorial will walk you through the steps required to deploy COS or COS Lite, Juju-based observability stacks running on Kubernetes. -.. toctree:: - :maxdepth: 1 +```{toctree} +:maxdepth: 1 - 1. Deploy the observability stack +Deploy the observability stack +``` -Configuration -============= +## Configuration In this part of the tutorial you will learn how to make COS automatically sync the alert rules of your git repository to your metrics backend using the COS Configuration charm. -.. toctree:: - :maxdepth: 1 +```{toctree} +:maxdepth: 1 - 2. Sync alert rules from Git +Sync alert rules from Git +``` -Instrumentation -=============== +## Instrumentation -Bridge the gap between COS Lite running in Kubernetes and your application -running on a machine. Discover how to collect telemetry data from your charmed +Bridge the gap between COS Lite running in Kubernetes and your application +running on a machine. Discover how to collect telemetry data from your charmed application using the Grafana Agent machine charm. -.. toctree:: - :maxdepth: 1 +```{toctree} +:maxdepth: 1 - 3. 
Instrument machine charms +Instrument machine charms +``` - -Redaction -========= +## Redaction By implementing a solid redaction strategy you can mitigate the risk of unwanted data leaks. This helps to comply with information security policies which outline the need for redacting personally identifiable information (PII), credentials, and other sensitive data. -.. toctree:: - :maxdepth: 1 +```{toctree} +:maxdepth: 1 - 4. Redact sensitive data +Redact sensitive data +``` diff --git a/docs/tutorial/installation/index.md b/docs/tutorial/installation/index.md new file mode 100644 index 00000000..33a80275 --- /dev/null +++ b/docs/tutorial/installation/index.md @@ -0,0 +1,26 @@ +--- +myst: + html_meta: + description: "Follow step-by-step installation guides for deploying COS and COS Lite on Kubernetes and MicroK8s environments." +--- + +(installation)= + +# Deploying the observability stack + +## COS + +```{toctree} +:maxdepth: 1 + +COS on Canonical K8s +``` + +## COS Lite + +```{toctree} +:maxdepth: 1 + +COS Lite on MicroK8s +COS Lite on Canonical K8s +``` diff --git a/docs/tutorial/installation/index.rst b/docs/tutorial/installation/index.rst deleted file mode 100644 index cbaa45fe..00000000 --- a/docs/tutorial/installation/index.rst +++ /dev/null @@ -1,23 +0,0 @@ -.. meta:: - :description: Follow step-by-step installation guides for deploying COS and COS Lite on Kubernetes and MicroK8s environments. - -.. _installation: - -Deploying the observability stack -********************************* - -COS -=== -.. toctree:: - :maxdepth: 1 - - COS on Canonical K8s - - -COS Lite -======== -.. toctree:: - :maxdepth: 1 - - COS Lite on MicroK8s - COS Lite on Canonical K8s
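The COS Lite sizing tables added to `docs/reference/system-requirements.md` in this diff lend themselves to a quick capacity estimate. The sketch below is illustrative only: the helper name and the assumption of roughly linear scaling between the published rows are mine, not part of the docs.

```python
# Hypothetical capacity estimator for COS Lite, based on the Prometheus
# sizing table in docs/reference/system-requirements.md. Linear
# interpolation between the published rows is an assumption for
# illustration; validate real deployments against observed usage.

# (metrics per minute, vCPUs, memory in GB, disk in GiB/day)
PROMETHEUS_SIZING = [
    (1_000_000, 2, 6, 6),
    (3_000_000, 3, 9, 14),
    (6_000_000, 3, 14, 27),
]


def estimate_prometheus_resources(metrics_per_min: int) -> dict:
    """Interpolate sizing for a metrics rate, clamping to the table bounds."""
    rows = PROMETHEUS_SIZING
    if metrics_per_min <= rows[0][0]:
        _, cpu, mem, disk = rows[0]
        return {"vcpus": cpu, "mem_gb": mem, "disk_gib_per_day": disk}
    if metrics_per_min >= rows[-1][0]:
        _, cpu, mem, disk = rows[-1]
        return {"vcpus": cpu, "mem_gb": mem, "disk_gib_per_day": disk}
    for (r0, c0, m0, d0), (r1, c1, m1, d1) in zip(rows, rows[1:]):
        if r0 <= metrics_per_min <= r1:
            # Linear blend between the two surrounding table rows.
            t = (metrics_per_min - r0) / (r1 - r0)
            return {
                "vcpus": round(c0 + t * (c1 - c0), 1),
                "mem_gb": round(m0 + t * (m1 - m0), 1),
                "disk_gib_per_day": round(d0 + t * (d1 - d0), 1),
            }


if __name__ == "__main__":
    # Roughly 2 M metrics/min sits halfway between the 1 M and 3 M rows.
    print(estimate_prometheus_resources(2_000_000))
```

The same table-driven approach would apply to the Loki rows; for planning purposes, round any interpolated value up to the next whole vCPU or GB.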