121 changes: 112 additions & 9 deletions content/telegraf/controller/agents/status.md
Original file line number Diff line number Diff line change
@@ -1,24 +1,127 @@
---
title: Set agent statuses
description: >
Understand how {{% product-name %}} receives and displays agent statuses from
the heartbeat output plugin.
Configure agent status evaluation using CEL expressions in the Telegraf
heartbeat output plugin and view statuses in {{% product-name %}}.
menu:
telegraf_controller:
name: Set agent statuses
parent: Manage agents
weight: 104
related:
- /telegraf/controller/reference/agent-status-eval/, Agent status evaluation reference
- /telegraf/controller/agents/reporting-rules/
- /telegraf/v1/output-plugins/heartbeat/, Heartbeat output plugin
---

Agent statuses come from the Telegraf heartbeat output plugin and are sent with
each heartbeat request.
The plugin reports an `ok` status.
Agent statuses reflect the health of a Telegraf instance based on runtime data.
The Telegraf [heartbeat output plugin](/telegraf/v1/output-plugins/heartbeat/)
evaluates [Common Expression Language (CEL)](/telegraf/controller/reference/agent-status-eval/)
expressions against agent metrics, error counts, and plugin statistics to
determine the status sent with each heartbeat.

> [!Note]
> A future Telegraf release will let you configure logic that sets the status value.
{{% product-name %}} also applies reporting rules to detect stale agents.
If an agent does not send a heartbeat within the rule's threshold, Controller
marks the agent as **Not Reporting** until it resumes sending heartbeats.
> #### Requires Telegraf v1.38.2+
>
> Agent status evaluation in the heartbeat output plugin requires Telegraf
> v1.38.2+.

## Status values

{{% product-name %}} displays the following agent statuses:

| Status | Source | Description |
| :---------------- | :------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Ok** | Heartbeat plugin | The agent is healthy. Set when the `ok` CEL expression evaluates to `true`. |
| **Warn** | Heartbeat plugin | The agent has a potential issue. Set when the `warn` CEL expression evaluates to `true`. |
| **Fail** | Heartbeat plugin | The agent has a critical problem. Set when the `fail` CEL expression evaluates to `true`. |
| **Undefined** | Heartbeat plugin | No expression matched and the `default` is set to `undefined`, or the `initial` status is `undefined`. |
| **Not Reporting** | {{% product-name %}} | The agent has not sent a heartbeat within the [reporting rule](/telegraf/controller/agents/reporting-rules/) threshold. {{% product-name %}} applies this status automatically. |

## How status evaluation works

You define CEL expressions for `ok`, `warn`, and `fail` in the
`[outputs.heartbeat.status]` section of your heartbeat plugin configuration.
Telegraf evaluates expressions in a configurable order and assigns the status
of the first expression that evaluates to `true`.

For full details on evaluation flow, configuration options, and available
variables and functions, see the
[Agent status evaluation reference](/telegraf/controller/reference/agent-status-eval/).

## Configure agent statuses

To configure status evaluation, add `"status"` to the `include` list in your
heartbeat plugin configuration and define CEL expressions in the
`[outputs.heartbeat.status]` section.

### Example: Basic health check

Report `ok` when metrics are flowing.
If no metrics arrive, fall back to the `fail` status.

{{% telegraf/dynamic-values %}}
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "&{agent_id}"
token = "${INFLUX_TOKEN}"
interval = "1m"
include = ["hostname", "statistics", "configs", "logs", "status"]

[outputs.heartbeat.status]
ok = "metrics > 0"
default = "fail"
```
{{% /telegraf/dynamic-values %}}

### Example: Error-based status

Warn when any errors are logged, and fail when more than 10 errors are logged.
Because `order` lists `fail` first, the more severe status takes precedence.

{{% telegraf/dynamic-values %}}
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "&{agent_id}"
token = "${INFLUX_TOKEN}"
interval = "1m"
include = ["hostname", "statistics", "configs", "logs", "status"]

[outputs.heartbeat.status]
ok = "log_errors == 0 && log_warnings == 0"
warn = "log_errors > 0"
fail = "log_errors > 10"
order = ["fail", "warn", "ok"]
default = "ok"
```
{{% /telegraf/dynamic-values %}}

### Example: Composite condition

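Combine the error count with output buffer fullness.
Warn when errors are logged or any `influxdb_v2` output reports a
`buffer_fullness` above `0.8`.
Fail when more than five errors are logged and an output reports a
`buffer_fullness` above `0.9`.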

{{% telegraf/dynamic-values %}}
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "&{agent_id}"
token = "${INFLUX_TOKEN}"
interval = "1m"
include = ["hostname", "statistics", "configs", "logs", "status"]

[outputs.heartbeat.status]
ok = "metrics > 0 && log_errors == 0"
warn = "log_errors > 0 || (has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8))"
fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)"
order = ["fail", "warn", "ok"]
default = "ok"
```
{{% /telegraf/dynamic-values %}}

For more examples including buffer health, plugin-specific checks, and
time-based expressions, see
[CEL expression examples](/telegraf/controller/reference/agent-status-eval/examples/).

## View an agent's status

@@ -0,0 +1,97 @@
---
title: Agent status evaluation
description: >
Reference documentation for Common Expression Language (CEL) expressions used
to evaluate Telegraf agent status.
menu:
telegraf_controller:
name: Agent status evaluation
parent: Reference
weight: 107
related:
- /telegraf/controller/agents/status/
- /telegraf/v1/output-plugins/heartbeat/
---

The Telegraf [heartbeat output plugin](/telegraf/v1/output-plugins/heartbeat/)
uses CEL expressions to evaluate agent status based on runtime data such as
metric counts, error rates, and plugin statistics.
[CEL (Common Expression Language)](https://cel.dev) is a lightweight expression
language designed for evaluating simple conditions.

## How status evaluation works

You define CEL expressions for three status levels in the
`[outputs.heartbeat.status]` section of your Telegraf configuration:

- **ok** — The agent is healthy.
- **warn** — The agent has a potential issue.
- **fail** — The agent has a critical problem.

Each expression is a CEL program that returns a boolean value.
Telegraf evaluates expressions in a configurable order (default:
`ok`, `warn`, `fail`) and assigns the status of the **first expression that
evaluates to `true`**.

If no expression evaluates to `true`, the `default` status is used
(default: `"ok"`).
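
As a minimal sketch of the fallback behavior, the following fragment defines
only `ok` and `warn` and sets `default` to `undefined`.
The expressions and thresholds are illustrative, not recommended values.

```toml
[outputs.heartbeat.status]
## fail is not set, so it keeps its default expression ("false")
## and never matches.
ok = "metrics > 0"
warn = "log_errors > 0"

## Used only when neither expression above evaluates to true,
## for example when no metrics have arrived and no errors were logged.
default = "undefined"
```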

### Initial status

Use the `initial` setting to define a status before the first Telegraf flush
cycle.
If `initial` is not set or is empty, Telegraf evaluates the status expressions
immediately, even before the first flush.
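
The following sketch shows one way `initial` might be used.
It assumes that `metrics` is still `0` before the first flush, in which case
the `ok` expression would not match and a newly started agent would report the
`default` status.

```toml
[outputs.heartbeat.status]
ok = "metrics > 0"
default = "fail"

## Report "undefined" until the first flush cycle instead of evaluating
## the expressions right away (assumption: metrics is 0 at startup, so
## the agent would otherwise fall through to "fail").
initial = "undefined"
```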

### Evaluation order

The `order` setting controls which expressions are evaluated and in what
sequence.

> [!Note]
> If you omit a status from the `order` list, its expression is **not
> evaluated**.
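
As an illustration, the following fragment checks `fail` before `ok` and leaves
`warn` out of the `order` list.
The expressions are illustrative and reuse the variables shown elsewhere on
this page.

```toml
[outputs.heartbeat.status]
ok = "metrics > 0"
warn = "log_errors > 0"
fail = "log_errors > 10"

## "fail" is evaluated first, then "ok".
## "warn" is not listed, so its expression is never evaluated and the
## agent never reports "warn" from an expression match.
order = ["fail", "ok"]
default = "ok"
```

With this configuration, an agent with `log_errors` between 1 and 10 reports
`ok` if metrics are flowing, because the `warn` expression is skipped.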

## Configuration reference

Configure status evaluation in the `[outputs.heartbeat.status]` section of the
heartbeat output plugin.
You must include `"status"` in the `include` list for status evaluation to take
effect.

```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]

[outputs.heartbeat.status]
## CEL expressions that return a boolean.
## The first expression that evaluates to true sets the status.
ok = "metrics > 0"
warn = "log_errors > 0"
fail = "log_errors > 10"

## Evaluation order (default: ["ok", "warn", "fail"])
order = ["ok", "warn", "fail"]

## Default status when no expression matches
## Options: "ok", "warn", "fail", "undefined"
default = "ok"

## Initial status before the first flush cycle
## Options: "ok", "warn", "fail", "undefined", ""
# initial = ""
```

| Option | Type | Default | Description |
|:-------|:-----|:--------|:------------|
| `ok` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **ok**. |
| `warn` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **warn**. |
| `fail` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **fail**. |
| `order` | list of strings | `["ok", "warn", "fail"]` | Order in which expressions are evaluated. |
| `default` | string | `"ok"` | Status used when no expression evaluates to `true`. Options: `ok`, `warn`, `fail`, `undefined`. |
| `initial` | string | `""` | Status before the first flush. Options: `ok`, `warn`, `fail`, `undefined`, `""` (empty = evaluate expressions). |

{{< children hlevel="h2" >}}