
Add optional flag to metricbeat modules to suppress health degradation on fetch errors#49492

Closed
Oddly wants to merge 1 commit into elastic:main from Oddly:optional-metricbeat-input

Conversation

@Oddly Oddly commented Mar 16, 2026

Proposed commit message

Adds an optional config flag to metricbeat modules. When set, fetch failures
are still logged and tracked but do not mark the agent as Degraded. This lets
operators build broad Fleet policies with service integrations that may not be
running on every enrolled host, without those absent services dragging the entire
agent into a degraded state.

The flag follows the same pattern as the existing failure_threshold setting:
parsed from module config in createWrapper, stored on metricSetWrapper, and
checked in handleFetchError. All periodic metricbeat modules get this for free.

Behaviour when optional: true:

  • Fetch still retries every period
  • Errors are logged at ERROR level
  • consecutive_failures monitoring metric keeps incrementing
  • UpdateStatus(Degraded) is never called
  • On recovery, status resets to Running as usual

Usage in Fleet integrations (no Kibana changes needed):

```yaml
# data_stream manifest.yml
- name: optional
  type: bool
  title: Do not report as degraded when collection fails
  default: false
  description: >
    When enabled, fetch failures for this data stream do not affect agent
    health. The agent keeps retrying and logging errors, but stays Healthy
    in Fleet.

# stream.yml.hbs
{{#if optional}}
optional: {{optional}}
{{/if}}
```

Users can also set optional: true directly in Fleet's Advanced Settings YAML
on any existing integration.
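For example, such an override might look like this (a sketch only — the key name comes from this PR, and Fleet merges Advanced Settings YAML into the generated stream config):

```yaml
# Fleet → integration policy → Advanced Settings YAML (illustrative)
optional: true
```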

Backport requested: 9.3, 9.2, 8.19

Fixes: elastic/elastic-agent#12885

…n on fetch errors

Adds an `optional` config flag to metricbeat modules. When set, fetch failures
are still logged and tracked but do not mark the agent as Degraded. This lets
operators build broad Fleet policies with service integrations that may not be
running on every enrolled host, without those absent services dragging the entire
agent into a degraded state.

The flag follows the same pattern as the existing `failure_threshold` setting:
parsed from module config in `createWrapper`, stored on `metricSetWrapper`, and
checked in `handleFetchError`. All periodic metricbeat modules get this for free.

Fixes: elastic/elastic-agent#12885
@Oddly Oddly requested a review from a team as a code owner March 16, 2026 11:51
@Oddly Oddly requested review from AndersonQ and khushijain21 March 16, 2026 11:51
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 16, 2026
@github-actions

🤖 GitHub comments

Just comment with:

  • run docs-build: Re-trigger the docs validation (use unformatted text in the comment!)

@mergify mergify bot assigned Oddly Mar 16, 2026
@mergify

mergify bot commented Mar 16, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @Oddly? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@coderabbitai

coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 9dd7c3c and 353f438.

📒 Files selected for processing (2)
  • metricbeat/mb/module/wrapper.go
  • metricbeat/mb/module/wrapper_internal_test.go

📝 Walkthrough

The changes add an optional configuration flag to metric sets that modifies health reporting behavior. When enabled, metric set fetch failures no longer cause agent health degradation. The flag is introduced in the stream health settings, propagated to metric set wrapper instances, and integrated into the error handling logic. Corresponding test cases validate that status degradation is skipped for optional metric sets while error tracking and recovery behavior remain unchanged.

🚥 Pre-merge checks

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | The PR successfully implements the optional flag feature requested in issue #12885, allowing metric sets to fail without degrading agent health while maintaining error logging and retry behavior. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to implementing the optional flag feature: config parsing, wrapper storage, error handling logic, and test coverage. No unrelated modifications detected. |



@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Mar 16, 2026
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Mar 16, 2026
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@Oddly Oddly marked this pull request as draft March 16, 2026 12:42
@Oddly

Oddly commented Mar 16, 2026

Put in draft for now, don't merge please. See issue.

Oddly added a commit to Oddly/beats that referenced this pull request Mar 17, 2026
…eams from degrading agent health

Adds a `suppress_health_degradation` config flag at the stream level in the
management framework. When set, fetch failures are still logged and tracked
but do not mark the agent as Degraded. This lets operators build broad Fleet
policies with integrations that may not be running on every enrolled host,
without those absent services dragging the entire agent into a degraded state.

Unlike a per-input approach, this change lives in the health aggregation layer
(`calcState()` in `unit.go`) so every input type gets the flag for free: CEL,
httpjson, metricbeat modules, filebeat inputs, and any future inputs that
report per-stream health.

The flag is read from the stream's existing `Source` protobuf field — no proto
changes or per-input modifications needed.

Behaviour when `suppress_health_degradation: true`:

- Input still retries every `period`
- Errors are logged at ERROR level
- Per-stream status still reports Degraded/Failed in the streams payload
- The stream is excluded from the unit's aggregate health calculation
- On recovery, stream status resets to Running as usual

Usage in Fleet integrations (no Kibana changes needed):

```yaml
# data_stream manifest.yml
- name: suppress_health_degradation
  type: bool
  title: Suppress Health Degradation
  description: >-
    When enabled, failures collecting this data stream will not mark the
    agent as degraded. Use for data streams expected to fail on some hosts.
  required: false
  show_user: false
  default: false

# stream.yml.hbs
{{#if suppress_health_degradation}}
suppress_health_degradation: {{suppress_health_degradation}}
{{/if}}
```

Users can also inject `suppress_health_degradation: true` via Fleet's Advanced
Settings YAML on any existing integration without package changes.

Tested on a live Elastic Agent 9.3.1 cluster across three input types (CEL,
redis/metrics, nginx/metrics) with 9 permutations:

| Streams   | Endpoint | suppress                    | Expected | Actual   |
|-----------|----------|-----------------------------|----------|----------|
| 1 stream  | working  | not set                     | HEALTHY  | HEALTHY  |
| 1 stream  | dead     | not set                     | DEGRADED | DEGRADED |
| 1 stream  | dead     | true                        | HEALTHY  | HEALTHY  |
| 1 stream  | dead     | false                       | DEGRADED | DEGRADED |
| 2 streams | dead     | mixed (true + not set)      | DEGRADED | DEGRADED |
| 2 streams | dead     | both true                   | HEALTHY  | HEALTHY  |
| 2 streams | dead     | both not set                | DEGRADED | DEGRADED |
| 2 streams | working  | both true                   | HEALTHY  | HEALTHY  |
| 3 streams | dead     | mixed (not set+true+false)  | DEGRADED | DEGRADED |

Supersedes elastic#49492

Fixes: elastic/elastic-agent#12885
@AndersonQ

@Oddly I think we can close this as #49511 replaces it, right?

@Oddly Oddly closed this Mar 24, 2026
@Oddly

Oddly commented Mar 24, 2026

Yes, you are right. Closed!


Labels

Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team


Development

Successfully merging this pull request may close these issues.

Allow inputs to fail without degrading agent health

4 participants