
Add optional flag to metricbeat modules to suppress health degradation on fetch errors#49492

Closed
Oddly wants to merge 1 commit into elastic:main from Oddly:optional-metricbeat-input

Conversation

@Oddly Oddly commented Mar 16, 2026

Proposed commit message

Adds an optional config flag to metricbeat modules. When set, fetch failures
are still logged and tracked but do not mark the agent as Degraded. This lets
operators build broad Fleet policies with service integrations that may not be
running on every enrolled host, without those absent services dragging the entire
agent into a degraded state.

The flag follows the same pattern as the existing failure_threshold setting:
parsed from module config in createWrapper, stored on metricSetWrapper, and
checked in handleFetchError. All periodic metricbeat modules get this for free.

Behaviour when optional: true:

  • Fetch still retries every period
  • Errors are logged at ERROR level
  • consecutive_failures monitoring metric keeps incrementing
  • UpdateStatus(Degraded) is never called
  • On recovery, status resets to Running as usual

Usage in Fleet integrations (no Kibana changes needed):

```yaml
# data_stream manifest.yml
- name: optional
  type: bool
  title: Do not report as degraded when collection fails
  default: false
  description: >
    When enabled, fetch failures for this data stream do not affect agent
    health. The agent keeps retrying and logging errors, but stays Healthy
    in Fleet.

# stream.yml.hbs
{{#if optional}}
optional: {{optional}}
{{/if}}
```

Users can also set optional: true directly in Fleet's Advanced Settings YAML
on any existing integration.
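For example, such an override might look like this (a sketch only — the key name comes from this PR, and Fleet merges Advanced Settings YAML into the generated stream config):

```yaml
# Fleet → integration policy → Advanced Settings YAML (illustrative)
optional: true
```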

Backport requested: 9.3, 9.2, 8.19

Fixes: elastic/elastic-agent#12885

…n on fetch errors

Adds an `optional` config flag to metricbeat modules. When set, fetch failures
are still logged and tracked but do not mark the agent as Degraded. This lets
operators build broad Fleet policies with service integrations that may not be
running on every enrolled host, without those absent services dragging the entire
agent into a degraded state.

The flag follows the same pattern as the existing `failure_threshold` setting:
parsed from module config in `createWrapper`, stored on `metricSetWrapper`, and
checked in `handleFetchError`. All periodic metricbeat modules get this for free.

Fixes: elastic/elastic-agent#12885
@Oddly Oddly requested a review from a team as a code owner March 16, 2026 11:51
@Oddly Oddly requested review from AndersonQ and khushijain21 March 16, 2026 11:51
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 16, 2026
@github-actions

🤖 GitHub comments

Just comment with:

  • run docs-build: Re-trigger the docs validation (use unformatted text in the comment!)

@mergify mergify bot assigned Oddly Mar 16, 2026
@mergify

mergify bot commented Mar 16, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @Oddly? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@coderabbitai

coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 9dd7c3c and 353f438.

📒 Files selected for processing (2)
  • metricbeat/mb/module/wrapper.go
  • metricbeat/mb/module/wrapper_internal_test.go

📝 Walkthrough

The changes add an optional configuration flag to metric sets that modifies health reporting behavior. When enabled, metric set fetch failures no longer cause agent health degradation. The flag is introduced in the stream health settings, propagated to metric set wrapper instances, and integrated into the error handling logic. Corresponding test cases validate that status degradation is skipped for optional metric sets while error tracking and recovery behavior remain unchanged.

🚥 Pre-merge checks

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | The PR successfully implements the optional flag feature requested in issue #12885, allowing metric sets to fail without degrading agent health while maintaining error logging and retry behavior. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to implementing the optional flag feature: config parsing, wrapper storage, error handling logic, and test coverage. No unrelated modifications detected. |



@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Mar 16, 2026
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Mar 16, 2026
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@Oddly Oddly marked this pull request as draft March 16, 2026 12:42
@Oddly

Oddly commented Mar 16, 2026

Put in draft for now, don't merge please. See issue.

Oddly added a commit to Oddly/beats that referenced this pull request Mar 17, 2026
…eams from degrading agent health

Adds a `suppress_health_degradation` config flag at the stream level in the
management framework. When set, fetch failures are still logged and tracked
but do not mark the agent as Degraded. This lets operators build broad Fleet
policies with integrations that may not be running on every enrolled host,
without those absent services dragging the entire agent into a degraded state.

Unlike a per-input approach, this change lives in the health aggregation layer
(`calcState()` in `unit.go`) so every input type gets the flag for free: CEL,
httpjson, metricbeat modules, filebeat inputs, and any future inputs that
report per-stream health.

The flag is read from the stream's existing `Source` protobuf field — no proto
changes or per-input modifications needed.

Behaviour when `suppress_health_degradation: true`:

- Input still retries every `period`
- Errors are logged at ERROR level
- Per-stream status still reports Degraded/Failed in the streams payload
- The stream is excluded from the unit's aggregate health calculation
- On recovery, stream status resets to Running as usual

Usage in Fleet integrations (no Kibana changes needed):

```yaml
# data_stream manifest.yml
- name: suppress_health_degradation
  type: bool
  title: Suppress Health Degradation
  description: >-
    When enabled, failures collecting this data stream will not mark the
    agent as degraded. Use for data streams expected to fail on some hosts.
  required: false
  show_user: false
  default: false

# stream.yml.hbs
{{#if suppress_health_degradation}}
suppress_health_degradation: {{suppress_health_degradation}}
{{/if}}
```

Users can also inject `suppress_health_degradation: true` via Fleet's Advanced
Settings YAML on any existing integration without package changes.

Tested on a live Elastic Agent 9.3.1 cluster across three input types (CEL,
redis/metrics, nginx/metrics) with 9 permutations:

| Streams   | Endpoint | suppress                    | Expected | Actual   |
|-----------|----------|-----------------------------|----------|----------|
| 1 stream  | working  | not set                     | HEALTHY  | HEALTHY  |
| 1 stream  | dead     | not set                     | DEGRADED | DEGRADED |
| 1 stream  | dead     | true                        | HEALTHY  | HEALTHY  |
| 1 stream  | dead     | false                       | DEGRADED | DEGRADED |
| 2 streams | dead     | mixed (true + not set)      | DEGRADED | DEGRADED |
| 2 streams | dead     | both true                   | HEALTHY  | HEALTHY  |
| 2 streams | dead     | both not set                | DEGRADED | DEGRADED |
| 2 streams | working  | both true                   | HEALTHY  | HEALTHY  |
| 3 streams | dead     | mixed (not set+true+false)  | DEGRADED | DEGRADED |

Supersedes elastic#49492

Fixes: elastic/elastic-agent#12885
@AndersonQ

@Oddly I think we can close this as #49511 replaces it, right?

@Oddly Oddly closed this Mar 24, 2026
@Oddly

Oddly commented Mar 24, 2026

Yes, you are right. Closed!


Labels

Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team


Development

Successfully merging this pull request may close these issues.

Allow inputs to fail without degrading agent health

4 participants