66 changes: 66 additions & 0 deletions docs/how-to/correlate-colocated.md
@@ -0,0 +1,66 @@
# How to correlate node-exporter metrics with multiple co-located VM charms

The otelcol charms deploy `node_exporter` as a singleton snap in a given machine.
However, multiple principal charms may be co-located on the same machine.
This document shows how to correlate between node-exporter metrics and co-located charms.
Comment on lines +3 to +5

Suggested change
The otelcol charms deploy `node_exporter` as a singleton snap in a given machine
However, multiple principal charms may be co-located on the same machine.
This document shows how to correlate between node-exporter metrics and co-located charms.
The OpenTelemetry Collector (`otelcol`) charms deploy `node-exporter` as a singleton snap in a given machine. However, multiple principal charms may be co-located on the same machine.
<Insert 1: What does this info mean / why is the default behavior confusing?>
<Insert 2: Why does the user need to know this?>
This document describes how to <Insert 3: Do what?>

Explanation of my suggestions

Basic

  • I'd recommend fully naming OpenTelemetry Collector on first use. It's also acceptable to call it otelcol (if you think that's obvious to your user base), but it should still be code-formatted in all instances ("The `otelcol` charms deploy..."). I guess you could also write Otelcol (capital O, no code formatting), but that looks odd IMO.
  • It's a hyphen, based on your repo: node-exporter. This should be formatted consistently in all instances.

Format
The original format doesn't make clear where the doc is going and why: why does the user want to "correlate between node-exporter metrics and co-located charms"? (i.e., why would a user find this doc and do this?)

This is really important though, not only to frame the rest of the guide well, but it also helps confirm to the user they're in the right location at all (e.g., even if the rest of the doc sucks, you would at least know you were in the right place / know if the doc did or didn't resolve your issue)

I added some placeholders, but the format I was going for is:
{current system behavior}
{why that behavior is confusing}
{why resolving this matters for users}
{what this document provides}

So a re-written version would be like (but change the wording as necessary or if I've misunderstood something!)

"
The OpenTelemetry Collector (otelcol) charms deploy node-exporter as a singleton snap in a given machine. Additionally, multiple principal charms may be co-located on the same machine.

When node-exporter metrics are forwarded by otelcol, they include labels that identify the machine where the metrics were collected. Since these labels are shared by all charms running on that machine, the metrics don't directly indicate which charm produced the specific metric.

To understand which charm is responsible for a specific metric, you need to correlate node-exporter metrics with the charms running on the same machine.

This document describes how to perform that correlation.
"


## Manually, via label inspection
A node-exporter metric such as `node_cpu_seconds_total`, is forwarded by otelcol with labels `juju_model`, `juju_model_uuid` and `instance`, all of which are common to otelcol itself and any co-located charms. The `juju_charm` and `juju_application` labels for node-exporter metrics would have otelcol information.

Suggested change
A node-exporter metric such as `node_cpu_seconds_total`, is forwarded by otelcol with labels `juju_model`, `juju_model_uuid` and `instance`, all of which are common to otelcol itself and any co-located charms. The `juju_charm` and `juju_application` labels for node-exporter metrics would have otelcol information.
A `node-exporter` metric, such as `node_cpu_seconds_total`, is forwarded by `otelcol` with labels including: `juju_model`, `juju_model_uuid` and `instance`. These labels are common to `otelcol` itself and any co-located charms.
The `juju_model` and `juju_model_uuid` labels identify the Juju model where the metric was collected. The `instance` label identifies the specific machine within that model where the metric was collected.

Explanation

I found it confusing that 3 labels were introduced, but 2 of them basically disappeared (and were "replaced" by two other ones - also it's easy to gloss over and not notice juju_charm and juju_application were introduced, and aren't the same juju_* ones just mentioned with instance).

So note that I removed the reference to juju_charm and juju_application - do these need to be here? IMO it seems clear from the example that has otelcol info


Note the `instance` label. For example, in the following node-exporter metric, the instance is `juju-b2b564-0.lxd`:

Suggested change
Note the `instance` label. For example, in the following node-exporter metric, the instance is `juju-b2b564-0.lxd`:
For example, in the following `node-exporter` metric:

(this is tied to my next suggestion but) I've made it more lightweight - IMO giving the specific ID feels easier to understand if presented after the example instead, because it's unnecessary to hold that in your brain before seeing any example to stick it to


```
node_cpu_seconds_total{
cpu="7",
instance="juju-b2b564-0.lxd",
job="juju_welcome-lxd_377f2555_otelcol1_node-exporter",
juju_application="otelcol1",
juju_charm="opentelemetry-collector",
juju_model="welcome-lxd",
juju_model_uuid="377f2555-db6c-4b2b-89c9-422668b2b564",
mode="user"
}
```

Now you can query for the application metrics you are interested in, filtering results with the label matcher `instance="juju-b2b564-0.lxd"`.
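
As an illustration of that filter, here is a minimal Python sketch that builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) with the `instance` label matcher applied. The base URL `http://localhost:9090`, the metric name `up`, and the helper name `instant_query_url` are assumptions made for this example; substitute your own Prometheus-compatible endpoint and the application metric you care about.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, metric: str, instance: str) -> str:
    """Build an instant-query URL for `metric`, restricted to one instance.

    Hypothetical helper for illustration; not part of any charm or library.
    """
    # PromQL selector with an exact-match label matcher on `instance`.
    promql = f'{metric}{{instance="{instance}"}}'
    # urlencode percent-encodes the braces, quotes and equals sign.
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


# Placeholder endpoint and metric name (assumptions, see above).
print(instant_query_url("http://localhost:9090", "up", "juju-b2b564-0.lxd"))
```

Any series returned by such a query comes from a workload on the same machine as the node-exporter metric shown earlier.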

Suggested change
Now you can query for the application metrics you are interested in, filtering results with the label matcher `instance="juju-b2b564-0.lxd"`.
The instance is `juju-b2b564-0.lxd`. Now, you can query for the application metrics you're interested in, filtering results with the label matcher `instance="juju-b2b564-0.lxd"`.


## Project charm labels onto node-exporter metrics
Every unit of otelcol renders "annotations" that look as follows:

```
subordinate_charm_info{
collector_unit="otelcol1/0",
instance="juju-b2b564-0.lxd",
job="juju_welcome-lxd_377f2555_otelcol1_node-exporter",
juju_application="otelcol1",
juju_charm="opentelemetry-collector",
juju_model="welcome-lxd",
juju_model_uuid="377f2555-db6c-4b2b-89c9-422668b2b564",
related_unit="ubuntu1/0"
}
```

Use the vector matching modifiers `on` and `group_right` to project labels from the annotation metric onto the node-exporter metrics.

```
label_replace(
  label_replace(
    max without (cpu, mode) (
      rate(node_cpu_seconds_total[5m]) * 100
    ) * on(instance, juju_model, juju_model_uuid) group_right
    subordinate_charm_info,
    "juju_application", "$1", "related_unit", "([^/]+)/.*"
  ),
  "juju_unit", "$1", "related_unit", "(.*)"
)
```

Let's break this down:

Suggested change
Let's break this down:
Here's what's happening:

Personal preference here, but I suggested the change because "break down" is more complicated: it's a "phrasal verb" (more than one word that together carries a single verb meaning).

- `rate(node_cpu_seconds_total[5m]) * 100` is the raw data we're interested in (multiplied by 100 to convert to a percentage).
- `max without (cpu, mode)` is an aggregation that "collapses" the timeseries into a unique set, in preparation for the join (`group_right`).
- `on(instance, juju_model, juju_model_uuid) group_right` is a "join" operation that matches series from both sides by the listed labels. With `group_right`, the result keeps the labels of the right-hand side (`subordinate_charm_info`).
- The `label_replace` calls overwrite the `juju_application` label (from otelcol) with the application name extracted from `related_unit`, and set `juju_unit` to the full `related_unit` value (the unit otelcol is related to).
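
To make the join semantics concrete, the following Python sketch simulates the query on in-memory data. It is a simplified model under stated assumptions, not how Prometheus evaluates PromQL: each series is a `(labels, value)` pair, the annotation metric is assumed to have the value 1, and the function name `project_charm_labels` is hypothetical.

```python
import re

MATCH_LABELS = ("instance", "juju_model", "juju_model_uuid")


def project_charm_labels(node_series, info_series, on=MATCH_LABELS):
    """Simulate `node * on(...) group_right info` plus both label_replace calls."""
    # Left-hand ("one") side: at most one series per match key, which is
    # what `max without (cpu, mode)` guarantees in the real query.
    left = {}
    for labels, value in node_series:
        key = tuple(labels.get(name) for name in on)
        if key in left:
            raise ValueError("duplicate series on the 'one' side of the join")
        left[key] = value

    result = []
    for labels, value in info_series:
        key = tuple(labels.get(name) for name in on)
        if key not in left:
            continue  # no node-exporter series for this machine
        out = dict(labels)  # group_right keeps the right-hand side's labels
        unit = labels.get("related_unit", "")
        # Inner label_replace: the regex is fully anchored, as in PromQL.
        match = re.fullmatch(r"([^/]+)/.*", unit)
        if match:
            out["juju_application"] = match.group(1)
        # Outer label_replace: copy the whole related unit name.
        if unit:
            out["juju_unit"] = unit
        result.append((out, left[key] * value))
    return result


machine = {
    "instance": "juju-b2b564-0.lxd",
    "juju_model": "welcome-lxd",
    "juju_model_uuid": "377f2555-db6c-4b2b-89c9-422668b2b564",
}
node = [({**machine, "juju_application": "otelcol1"}, 12.5)]
info = [
    ({**machine, "juju_application": "otelcol1", "related_unit": "ubuntu1/0"}, 1.0),
    ({**machine, "juju_application": "otelcol1", "related_unit": "ubuntu2/0"}, 1.0),
]
for labels, value in project_charm_labels(node, info):
    print(labels["juju_application"], labels["juju_unit"], value)
```

In this sketch, one node-exporter series fans out into one series per related unit, each carrying the same value. The `ubuntu2/0` unit and the CPU value 12.5 are made-up sample data.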

## References
- Robust Perception, [Exposing the software version to Prometheus](https://www.robustperception.io/exposing-the-software-version-to-prometheus/), August 22, 2016.
- Julien Pivotto, Brian Brazil, [Prometheus Up & Running](https://www.oreilly.com/library/view/prometheus-up/9781098131135/), page 97.
3 changes: 2 additions & 1 deletion docs/how-to/index.rst
@@ -55,7 +55,8 @@ with COS to actually observe them.
Selectively drop telemetry using scrape config <selectively-drop-telemetry-scrape-config>
Selectively drop telemetry using opentelemetry-collector <selectively-drop-telemetry-otelcol>
Tier OpenTelemetry Collector with different pipelines per data stream <tiered-otelcols>

Correlate node-exporter metrics with multiple co-located VM charms <correlate-colocated>

Troubleshooting
===============
