Add optional flag to metricbeat modules to suppress health degradation on fetch errors#49492
Oddly wants to merge 1 commit into `elastic:main`
Conversation
…n on fetch errors

Adds an `optional` config flag to metricbeat modules. When set, fetch failures are still logged and tracked but do not mark the agent as Degraded. This lets operators build broad Fleet policies with service integrations that may not be running on every enrolled host, without those absent services dragging the entire agent into a degraded state.

The flag follows the same pattern as the existing `failure_threshold` setting: parsed from module config in `createWrapper`, stored on `metricSetWrapper`, and checked in `handleFetchError`. All periodic metricbeat modules get this for free.

Fixes: elastic/elastic-agent#12885
📝 Walkthrough

The changes add an optional configuration flag to metric sets that modifies health reporting behavior. When enabled, metric set fetch failures no longer cause agent health degradation. The flag is introduced in the stream health settings, propagated to metric set wrapper instances, and integrated into the error handling logic. Corresponding test cases validate that status degradation is skipped for optional metric sets while error tracking and recovery behavior remain unchanged.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Put in draft for now; please don't merge. See the linked issue.
…eams from degrading agent health
Adds a `suppress_health_degradation` config flag at the stream level in the
management framework. When set, fetch failures are still logged and tracked
but do not mark the agent as Degraded. This lets operators build broad Fleet
policies with integrations that may not be running on every enrolled host,
without those absent services dragging the entire agent into a degraded state.
Unlike a per-input approach, this change lives in the health aggregation layer
(`calcState()` in `unit.go`) so every input type gets the flag for free: CEL,
httpjson, metricbeat modules, filebeat inputs, and any future inputs that
report per-stream health.
The flag is read from the stream's existing `Source` protobuf field — no proto
changes or per-input modifications needed.
Behaviour when `suppress_health_degradation: true`:
- Input still retries every `period`
- Errors are logged at ERROR level
- Per-stream status still reports Degraded/Failed in the streams payload
- The stream is excluded from the unit's aggregate health calculation
- On recovery, stream status resets to Running as usual
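The aggregation behaviour above can be sketched as follows. The `calcState` name comes from the description of `unit.go`, but the stream struct, status constants, and function shape are simplified assumptions for illustration, not the actual elastic-agent code.

```go
package main

import "fmt"

// Simplified status levels standing in for the agent's unit states,
// ordered so that a larger value means worse health.
const (
	Healthy = iota
	Degraded
	Failed
)

// stream is a simplified stand-in: Status is the per-stream health, and
// Suppress mirrors the `suppress_health_degradation` flag read from the
// stream's Source field.
type stream struct {
	Status   int
	Suppress bool
}

// calcState aggregates per-stream health into a unit-level state, skipping
// streams whose degradation is suppressed. Per-stream status is untouched,
// so the streams payload still reports Degraded/Failed individually.
func calcState(streams []stream) int {
	state := Healthy
	for _, s := range streams {
		if s.Suppress {
			continue // excluded from the aggregate calculation
		}
		if s.Status > state {
			state = s.Status
		}
	}
	return state
}

func main() {
	mixed := []stream{{Status: Failed, Suppress: true}, {Status: Healthy}}
	fmt.Println(calcState(mixed)) // prints: 0 (the suppressed failure is ignored)
}
```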
Usage in Fleet integrations (no Kibana changes needed):
```yaml
# data_stream manifest.yml
- name: suppress_health_degradation
type: bool
title: Suppress Health Degradation
description: >-
When enabled, failures collecting this data stream will not mark the
agent as degraded. Use for data streams expected to fail on some hosts.
required: false
show_user: false
default: false
# stream.yml.hbs
{{#if suppress_health_degradation}}
suppress_health_degradation: {{suppress_health_degradation}}
{{/if}}
```
Users can also inject `suppress_health_degradation: true` via Fleet's Advanced
Settings YAML on any existing integration without package changes.
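For example, the Advanced Settings route needs no package changes at all; a minimal override might look like this (the comment describes an assumed Fleet UI location, which may vary by version):

```yaml
# Pasted into the stream's advanced/custom YAML in the integration policy
suppress_health_degradation: true
```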
Tested on a live Elastic Agent 9.3.1 cluster across three input types (CEL,
redis/metrics, nginx/metrics) with 9 permutations:
| Streams | Endpoint | suppress | Expected | Actual |
|-----------|----------|-----------------------------|----------|----------|
| 1 stream | working | not set | HEALTHY | HEALTHY |
| 1 stream | dead | not set | DEGRADED | DEGRADED |
| 1 stream | dead | true | HEALTHY | HEALTHY |
| 1 stream | dead | false | DEGRADED | DEGRADED |
| 2 streams | dead | mixed (true + not set) | DEGRADED | DEGRADED |
| 2 streams | dead | both true | HEALTHY | HEALTHY |
| 2 streams | dead | both not set | DEGRADED | DEGRADED |
| 2 streams | working | both true | HEALTHY | HEALTHY |
| 3 streams | dead | mixed (not set+true+false) | DEGRADED | DEGRADED |
Supersedes elastic#49492
Fixes: elastic/elastic-agent#12885
Yes, you are right. Closed!
Proposed commit message

Adds an `optional` config flag to metricbeat modules. When set, fetch failures are still logged and tracked but do not mark the agent as Degraded. This lets operators build broad Fleet policies with service integrations that may not be running on every enrolled host, without those absent services dragging the entire agent into a degraded state.

The flag follows the same pattern as the existing `failure_threshold` setting: parsed from module config in `createWrapper`, stored on `metricSetWrapper`, and checked in `handleFetchError`. All periodic metricbeat modules get this for free.

Behaviour when `optional: true`:
- Fetch still retries every `period`
- The `consecutive_failures` monitoring metric keeps incrementing
- `UpdateStatus(Degraded)` is never called

Usage in Fleet integrations (no Kibana changes needed): users can also set `optional: true` directly in Fleet's Advanced Settings YAML on any existing integration.

Backport requested: 9.3, 9.2, 8.19

Fixes: elastic/elastic-agent#12885