
Add suppress_health_degradation stream config to prevent optional streams from degrading agent health#49511

Open
Oddly wants to merge 2 commits into elastic:main from Oddly:suppress-health-degradation

Conversation

@Oddly

@Oddly Oddly commented Mar 17, 2026

Proposed commit message

Adds a suppress_health_degradation config flag at the stream level in the
management framework. When set, fetch failures from any input type are still
logged and tracked but do not mark the agent as Degraded. This lets operators
build broad Fleet policies with integrations that may not be running on every
enrolled host, without those absent services dragging the entire agent into a
degraded state.

Unlike a per-input approach (which we tried first), this change lives in the health aggregation layer
(calcState() in unit.go) so every input type gets the flag for free: CEL,
httpjson, metricbeat modules, filebeat inputs, and any future inputs that
report per-stream health.

The flag is read from the stream's existing Source protobuf field.
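As a rough sketch of that lookup (illustrative only — the actual PR reads the protobuf `Source` struct, e.g. via `structpb`'s `AsMap()`; the function name `suppressFromSource` is made up here):

```go
package main

import "fmt"

// suppressFromSource pulls the flag out of a stream's decoded Source config.
// The protobuf Struct is assumed to already be converted to a plain map
// (for example with structpb.Struct's AsMap); that conversion is not shown.
func suppressFromSource(src map[string]interface{}) bool {
	v, ok := src["suppress_health_degradation"].(bool)
	return ok && v
}

func main() {
	cfg := map[string]interface{}{"suppress_health_degradation": true}
	fmt.Println(suppressFromSource(cfg))                      // true
	fmt.Println(suppressFromSource(map[string]interface{}{})) // false: absent means off
}
```

Because the flag is read generically from the stream config, no per-input plumbing is needed; a missing or non-boolean value simply leaves suppression off.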

Behaviour when suppress_health_degradation: true:

  • Input still retries every period
  • Errors are logged at ERROR level
  • Per-stream status still reports Degraded/Failed in the streams payload
  • The stream is excluded from the unit's aggregate health calculation
  • On recovery, stream status resets to Running as usual
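Put together, the aggregation behaviour above can be sketched like this (an illustrative reconstruction, not the actual `calcState()` from `unit.go` — the `streamState` type and field names are assumptions based on this description):

```go
package main

import "fmt"

// Status models the health levels, ordered from best to worst.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

// streamState is a hypothetical per-stream record carrying the new flag.
type streamState struct {
	state                     Status
	msg                       string
	suppressHealthDegradation bool
}

// calcState aggregates unit health as the worst non-suppressed stream state.
// Suppressed streams keep their own state/msg for the per-stream payload,
// but are skipped here so they cannot degrade the unit.
func calcState(streams map[string]streamState) (Status, string) {
	agg, msg := Running, ""
	for id, s := range streams {
		if s.suppressHealthDegradation {
			continue
		}
		if s.state > agg {
			agg, msg = s.state, fmt.Sprintf("stream %s: %s", id, s.msg)
		}
	}
	return agg, msg
}

func main() {
	streams := map[string]streamState{
		"redis/metrics": {state: Failed, msg: "connection refused", suppressHealthDegradation: true},
		"nginx/metrics": {state: Running},
	}
	agg, _ := calcState(streams)
	fmt.Println(agg == Running) // true: the failed-but-suppressed stream is excluded
}
```

Note the recovery case falls out for free: once a suppressed stream returns to Running, nothing special needs to happen at the aggregate level.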

Usage in Fleet integrations:

```yaml
# data_stream manifest.yml
- name: suppress_health_degradation
  type: bool
  title: Suppress Health Degradation
  description: >-
    When enabled, failures collecting this data stream will not mark the
    agent as degraded. Use for data streams expected to fail on some hosts.
  required: false
  show_user: false
  default: false
```

```handlebars
# stream.yml.hbs
{{#if suppress_health_degradation}}
suppress_health_degradation: {{suppress_health_degradation}}
{{/if}}
```

Users can also inject suppress_health_degradation: true via Fleet's Advanced
Settings YAML on any existing integration without package changes.
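For example, pasting something like this into the stream's advanced YAML box would turn suppression on for that one stream (illustrative snippet; the exact label and location in the Fleet UI may vary by Kibana version):

```yaml
# Fleet > integration policy > stream > advanced options > custom YAML
suppress_health_degradation: true
```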

Tested on a live Elastic Agent 9.3.1 cluster across three input types (CEL,
redis/metrics, nginx/metrics) with 9 permutations:

| Streams   | Endpoint | suppress                       | State    |
|-----------|----------|--------------------------------|----------|
| 1 stream  | working  | not set                        | HEALTHY  |
| 1 stream  | dead     | not set                        | DEGRADED |
| 1 stream  | dead     | true                           | HEALTHY  |
| 1 stream  | dead     | false                          | DEGRADED |
| 2 streams | dead     | mixed (true + not set)         | DEGRADED |
| 2 streams | dead     | both true                      | HEALTHY  |
| 2 streams | dead     | both not set                   | DEGRADED |
| 2 streams | working  | both true                      | HEALTHY  |
| 3 streams | dead     | mixed (not set + true + false) | DEGRADED |

Replaces #49492

Fixes: elastic/elastic-agent#12885

@Oddly Oddly requested a review from a team as a code owner March 17, 2026 07:48
@Oddly Oddly requested review from AndersonQ and VihasMakwana March 17, 2026 07:48
@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 17, 2026
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify bot commented Mar 17, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @Oddly? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the minor version digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@mergify mergify bot assigned Oddly Mar 17, 2026
@coderabbitai

coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 97fc9346-075f-4a3b-9af5-9361efe2aed1

📥 Commits

Reviewing files that changed from the base of the PR and between 74e2856 and 692bce9.

📒 Files selected for processing (3)
  • changelog/fragments/1774128388-suppress-health-degradation.yaml
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go
✅ Files skipped from review due to trivial changes (1)
  • changelog/fragments/1774128388-suppress-health-degradation.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go

📝 Walkthrough

Walkthrough

Adds a per-stream suppressHealthDegradation boolean to unitState and initializes it from each stream's suppress_health_degradation source config in getStreamStates. updateStateForStream and unit.update preserve and update this flag when stream entries change. calcState excludes streams with suppressHealthDegradation=true from the unit-level aggregated health and message while still reporting their individual stream status. Tests cover parsing, suppression flips, recovery, and payload publication on flag changes.

Suggested labels

Team:Elastic-Agent-Data-Plane

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | PR implements suppress_health_degradation flag addressing issue #12885's primary objective: allowing optional streams to fail without degrading agent health while preserving logging and per-stream status. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to implementing suppress_health_degradation: unit.go health aggregation logic, comprehensive unit tests, and changelog fragment. |



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@x-pack/libbeat/management/unit.go`:
- Around line 369-375: The update() path currently only mutates
u.streamStates[*].suppressHealthDegradation and never recomputes the unit
aggregate health, leaving stale Failed/Degraded unit state; after you change
existing.suppressHealthDegradation in update(), force a recompute and publish of
the unit health so suppression flips take effect immediately—either invoke
updateStateForStream(...) for that stream in a way that bypasses the "ignore
same-state" check (add a force parameter or call a new helper), or implement a
small helper that scans u.streamStates to recalculate the aggregate health and
calls the existing publish/update routine to emit the new unit state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: abb20d59-2b6e-4e8a-a10e-63d15e50284c

📥 Commits

Reviewing files that changed from the base of the PR and between 79621d9 and 6d1ee4d.

📒 Files selected for processing (2)
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go

@Oddly Oddly force-pushed the suppress-health-degradation branch from dac7eef to f2ff4f3 on March 17, 2026 08:07
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
x-pack/libbeat/management/unit.go (1)

369-375: ⚠️ Potential issue | 🔴 Critical

Recompute and publish unit health when suppression flips.

update() mutates existing.suppressHealthDegradation but never recalculates/emits aggregate unit status. If a stream is already Degraded/Failed, unit health can stay stale indefinitely because same-state stream updates are ignored later (Line 323).

Proposed fix (minimal):

```diff
 func (u *agentUnit) update(cu *client.Unit) {
 	u.mtx.Lock()
 	defer u.mtx.Unlock()

 	u.softDeleted = false
 	u.clientUnit = cu
+	suppressionChanged := false

 	inputStatus := getStatus(cu.Expected().State)
 	if u.inputLevelState.state != inputStatus {
 		u.inputLevelState = unitState{
 			state: inputStatus,
 		}
 	}

 	newStreamStates, newStreamIDs := getStreamStates(cu.Expected())

 	for key, state := range newStreamStates {
 		if existing, exists := u.streamStates[key]; exists {
-			// Preserve current health state but update the suppressHealthDegradation flag
-			// in case the stream config changed.
-			existing.suppressHealthDegradation = state.suppressHealthDegradation
+			if existing.suppressHealthDegradation != state.suppressHealthDegradation {
+				suppressionChanged = true
+			}
+			existing.suppressHealthDegradation = state.suppressHealthDegradation
 			u.streamStates[key] = existing
 			continue
 		}

 		u.streamStates[key] = state
 	}
@@
 	switch {
@@
 	}
+
+	if suppressionChanged {
+		state, msg := u.calcState()
+		streamsPayload := make(map[string]interface{}, len(u.streamStates))
+		for id, streamState := range u.streamStates {
+			streamsPayload[id] = map[string]interface{}{
+				"status": getUnitState(streamState.state).String(),
+				"error":  streamState.msg,
+			}
+		}
+		if err := u.clientUnit.UpdateState(getUnitState(state), msg, map[string]interface{}{"streams": streamsPayload}); err != nil {
+			u.logger.Warnf("failed to update state for input %s: %v", u.ID(), err)
+		}
+	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@x-pack/libbeat/management/unit.go` around lines 369 - 375, In update(), when
you mutate existing.suppressHealthDegradation for entries in u.streamStates,
detect that the suppression flag flipped and then trigger the unit-level health
recomputation/publish path so the aggregate unit health is recalculated and
emitted; specifically, inside the loop that updates
existing.suppressHealthDegradation (in update()), after assigning
existing.suppressHealthDegradation = state.suppressHealthDegradation, call the
same routine you use for stream-state transitions to recompute and publish
aggregate unit health (i.e., the unit health recompute/publish codepath) so unit
health doesn't remain stale when suppression toggles.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da31b39f-bea6-4a10-b13e-ec60181529fd

📥 Commits

Reviewing files that changed from the base of the PR and between dac7eef and f2ff4f3.

📒 Files selected for processing (2)
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • x-pack/libbeat/management/unit_test.go

@Oddly Oddly force-pushed the suppress-health-degradation branch from f2ff4f3 to 74e2856 on March 17, 2026 08:32
@Oddly
Author

Oddly commented Mar 17, 2026

Live cluster validation

Tested on a patched Elastic Agent 9.3.1 built from this branch via mage otel:crossBuild.

Fresh install (policy created from scratch each time):

| Endpoint                 | suppress                       | Result   |
|--------------------------|--------------------------------|----------|
| working                  | not set                        | HEALTHY  |
| dead                     | not set                        | DEGRADED |
| dead                     | true                           | HEALTHY  |
| dead                     | false                          | DEGRADED |
| both dead (2 streams)    | mixed (true + not set)         | DEGRADED |
| both dead (2 streams)    | both true                      | HEALTHY  |
| both dead (2 streams)    | both not set                   | DEGRADED |
| both working (2 streams) | both true                      | HEALTHY  |
| all dead (3 streams)     | mixed (not set + true + false) | DEGRADED |

In-place update (same policy updated via PUT without agent restart, exercises the update() recompute path):

| Endpoint  | suppress  | Result   |
|-----------|-----------|----------|
| dead      | not set   | DEGRADED |
| dead      | → true    | HEALTHY  |
| dead      | → false   | DEGRADED |
| dead      | → true    | HEALTHY  |
| → working | true      | HEALTHY  |
| → dead    | true      | HEALTHY  |
| dead      | → not set | DEGRADED |
| → working | not set   | HEALTHY  |

Cross-input types (three integrations on one agent policy):

| Input type    | Endpoint | suppress | Result     |
|---------------|----------|----------|------------|
| CEL (Zabbix)  | working  | no       | HEALTHY    |
| redis/metrics | dead     | true     | suppressed |
| nginx/metrics | dead     | no       | DEGRADED   |

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) label Mar 17, 2026
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 17, 2026
@pierrehilbert pierrehilbert requested a review from faec March 17, 2026 10:30
@AndersonQ
Member

/test

@AndersonQ
Member

You need to update/format the files. `make update` should do the trick. You can check if all is good by running `make check`.

@AndersonQ
Member

/test


Labels

Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team


Development

Successfully merging this pull request may close these issues.

Allow inputs to fail without degrading agent health

5 participants