
Add suppress_health_degradation stream config to prevent optional streams from degrading agent health#49511

Open
Oddly wants to merge 2 commits into elastic:main from Oddly:suppress-health-degradation

Conversation

@Oddly

@Oddly Oddly commented Mar 17, 2026

Proposed commit message

Adds a suppress_health_degradation config flag at the stream level in the
management framework. When set, fetch failures from any input type are still
logged and tracked but do not mark the agent as Degraded. This lets operators
build broad Fleet policies with integrations that may not be running on every
enrolled host, without those absent services dragging the entire agent into a
degraded state.

Unlike a per-input approach (which we tried first), this change lives in the health aggregation layer
(calcState() in unit.go) so every input type gets the flag for free: CEL,
httpjson, metricbeat modules, filebeat inputs, and any future inputs that
report per-stream health.

The flag is read from the stream's existing Source protobuf field.
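As a rough sketch of that lookup (illustrative only — the actual PR reads the protobuf `Source` struct, e.g. via `structpb`'s `AsMap()`; the function name `suppressFromSource` is made up here):

```go
package main

import "fmt"

// suppressFromSource pulls the flag out of a stream's decoded Source config.
// The protobuf Struct is assumed to already be converted to a plain map
// (for example with structpb.Struct's AsMap); that conversion is not shown.
func suppressFromSource(src map[string]interface{}) bool {
	v, ok := src["suppress_health_degradation"].(bool)
	return ok && v
}

func main() {
	cfg := map[string]interface{}{"suppress_health_degradation": true}
	fmt.Println(suppressFromSource(cfg))                      // true
	fmt.Println(suppressFromSource(map[string]interface{}{})) // false: absent means off
}
```

Because the flag is read generically from the stream config, no per-input plumbing is needed; a missing or non-boolean value simply leaves suppression off.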

Behaviour when suppress_health_degradation: true:

  • Input still retries every period
  • Errors are logged at ERROR level
  • Per-stream status still reports Degraded/Failed in the streams payload
  • The stream is excluded from the unit's aggregate health calculation
  • On recovery, stream status resets to Running as usual
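Put together, the aggregation behaviour above can be sketched like this (an illustrative reconstruction, not the actual `calcState()` from `unit.go` — the `streamState` type and field names are assumptions based on this description):

```go
package main

import "fmt"

// Status models the health levels, ordered from best to worst.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

// streamState is a hypothetical per-stream record carrying the new flag.
type streamState struct {
	state                     Status
	msg                       string
	suppressHealthDegradation bool
}

// calcState aggregates unit health as the worst non-suppressed stream state.
// Suppressed streams keep their own state/msg for the per-stream payload,
// but are skipped here so they cannot degrade the unit.
func calcState(streams map[string]streamState) (Status, string) {
	agg, msg := Running, ""
	for id, s := range streams {
		if s.suppressHealthDegradation {
			continue
		}
		if s.state > agg {
			agg, msg = s.state, fmt.Sprintf("stream %s: %s", id, s.msg)
		}
	}
	return agg, msg
}

func main() {
	streams := map[string]streamState{
		"redis/metrics": {state: Failed, msg: "connection refused", suppressHealthDegradation: true},
		"nginx/metrics": {state: Running},
	}
	agg, _ := calcState(streams)
	fmt.Println(agg == Running) // true: the failed-but-suppressed stream is excluded
}
```

Note the recovery case falls out for free: once a suppressed stream returns to Running, nothing special needs to happen at the aggregate level.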

Usage in Fleet integrations:

```yaml
# data_stream manifest.yml
- name: suppress_health_degradation
  type: bool
  title: Suppress Health Degradation
  description: >-
    When enabled, failures collecting this data stream will not mark the
    agent as degraded. Use for data streams expected to fail on some hosts.
  required: false
  show_user: false
  default: false
```

```handlebars
# stream.yml.hbs
{{#if suppress_health_degradation}}
suppress_health_degradation: {{suppress_health_degradation}}
{{/if}}
```

Users can also inject suppress_health_degradation: true via Fleet's Advanced
Settings YAML on any existing integration without package changes.
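For example, pasting something like this into the stream's advanced YAML box would turn suppression on for that one stream (illustrative snippet; the exact label and location in the Fleet UI may vary by Kibana version):

```yaml
# Fleet > integration policy > stream > advanced options > custom YAML
suppress_health_degradation: true
```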

Tested on a live Elastic Agent 9.3.1 cluster across three input types (CEL,
redis/metrics, nginx/metrics) with 9 permutations:

| Streams   | Endpoint | suppress                       | State    |
|-----------|----------|--------------------------------|----------|
| 1 stream  | working  | not set                        | HEALTHY  |
| 1 stream  | dead     | not set                        | DEGRADED |
| 1 stream  | dead     | true                           | HEALTHY  |
| 1 stream  | dead     | false                          | DEGRADED |
| 2 streams | dead     | mixed (true + not set)         | DEGRADED |
| 2 streams | dead     | both true                      | HEALTHY  |
| 2 streams | dead     | both not set                   | DEGRADED |
| 2 streams | working  | both true                      | HEALTHY  |
| 3 streams | dead     | mixed (not set + true + false) | DEGRADED |

Replaces #49492

Fixes: elastic/elastic-agent#12885

@Oddly Oddly requested a review from a team as a code owner March 17, 2026 07:48
@Oddly Oddly requested review from AndersonQ and VihasMakwana March 17, 2026 07:48
@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 17, 2026
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify bot commented Mar 17, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @Oddly? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the minor version digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@mergify mergify bot assigned Oddly Mar 17, 2026
@coderabbitai

coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 97fc9346-075f-4a3b-9af5-9361efe2aed1

📥 Commits

Reviewing files that changed from the base of the PR and between 74e2856 and 692bce9.

📒 Files selected for processing (3)
  • changelog/fragments/1774128388-suppress-health-degradation.yaml
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go
✅ Files skipped from review due to trivial changes (1)
  • changelog/fragments/1774128388-suppress-health-degradation.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go

📝 Walkthrough

Walkthrough

Adds a per-stream suppressHealthDegradation boolean to unitState and initializes it from each stream's suppress_health_degradation source config in getStreamStates. updateStateForStream and unit.update preserve and update this flag when stream entries change. calcState excludes streams with suppressHealthDegradation=true from the unit-level aggregated health and message while still reporting their individual stream status. Tests cover parsing, suppression flips, recovery, and payload publication on flag changes.

Suggested labels

Team:Elastic-Agent-Data-Plane

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | PR implements suppress_health_degradation flag addressing issue #12885's primary objective: allowing optional streams to fail without degrading agent health while preserving logging and per-stream status. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to implementing suppress_health_degradation: unit.go health aggregation logic, comprehensive unit tests, and changelog fragment. |



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@x-pack/libbeat/management/unit.go`:
- Around line 369-375: The update() path currently only mutates
u.streamStates[*].suppressHealthDegradation and never recomputes the unit
aggregate health, leaving stale Failed/Degraded unit state; after you change
existing.suppressHealthDegradation in update(), force a recompute and publish of
the unit health so suppression flips take effect immediately—either invoke
updateStateForStream(...) for that stream in a way that bypasses the "ignore
same-state" check (add a force parameter or call a new helper), or implement a
small helper that scans u.streamStates to recalculate the aggregate health and
calls the existing publish/update routine to emit the new unit state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: abb20d59-2b6e-4e8a-a10e-63d15e50284c

📥 Commits

Reviewing files that changed from the base of the PR and between 79621d9 and 6d1ee4d.

📒 Files selected for processing (2)
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go

@Oddly Oddly force-pushed the suppress-health-degradation branch from dac7eef to f2ff4f3 on March 17, 2026 08:07
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
x-pack/libbeat/management/unit.go (1)

369-375: ⚠️ Potential issue | 🔴 Critical

Recompute and publish unit health when suppression flips.

update() mutates existing.suppressHealthDegradation but never recalculates/emits aggregate unit status. If a stream is already Degraded/Failed, unit health can stay stale indefinitely because same-state stream updates are ignored later (Line 323).

Proposed fix (minimal):

```diff
 func (u *agentUnit) update(cu *client.Unit) {
 	u.mtx.Lock()
 	defer u.mtx.Unlock()

 	u.softDeleted = false
 	u.clientUnit = cu
+	suppressionChanged := false

 	inputStatus := getStatus(cu.Expected().State)
 	if u.inputLevelState.state != inputStatus {
 		u.inputLevelState = unitState{
 			state: inputStatus,
 		}
 	}

 	newStreamStates, newStreamIDs := getStreamStates(cu.Expected())

 	for key, state := range newStreamStates {
 		if existing, exists := u.streamStates[key]; exists {
-			// Preserve current health state but update the suppressHealthDegradation flag
-			// in case the stream config changed.
-			existing.suppressHealthDegradation = state.suppressHealthDegradation
+			if existing.suppressHealthDegradation != state.suppressHealthDegradation {
+				suppressionChanged = true
+			}
+			existing.suppressHealthDegradation = state.suppressHealthDegradation
 			u.streamStates[key] = existing
 			continue
 		}

 		u.streamStates[key] = state
 	}
@@
 	switch {
@@
 	}
+
+	if suppressionChanged {
+		state, msg := u.calcState()
+		streamsPayload := make(map[string]interface{}, len(u.streamStates))
+		for id, streamState := range u.streamStates {
+			streamsPayload[id] = map[string]interface{}{
+				"status": getUnitState(streamState.state).String(),
+				"error":  streamState.msg,
+			}
+		}
+		if err := u.clientUnit.UpdateState(getUnitState(state), msg, map[string]interface{}{"streams": streamsPayload}); err != nil {
+			u.logger.Warnf("failed to update state for input %s: %v", u.ID(), err)
+		}
+	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@x-pack/libbeat/management/unit.go` around lines 369 - 375, In update(), when
you mutate existing.suppressHealthDegradation for entries in u.streamStates,
detect that the suppression flag flipped and then trigger the unit-level health
recomputation/publish path so the aggregate unit health is recalculated and
emitted; specifically, inside the loop that updates
existing.suppressHealthDegradation (in update()), after assigning
existing.suppressHealthDegradation = state.suppressHealthDegradation, call the
same routine you use for stream-state transitions to recompute and publish
aggregate unit health (i.e., the unit health recompute/publish codepath) so unit
health doesn't remain stale when suppression toggles.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da31b39f-bea6-4a10-b13e-ec60181529fd

📥 Commits

Reviewing files that changed from the base of the PR and between dac7eef and f2ff4f3.

📒 Files selected for processing (2)
  • x-pack/libbeat/management/unit.go
  • x-pack/libbeat/management/unit_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • x-pack/libbeat/management/unit_test.go

@Oddly Oddly force-pushed the suppress-health-degradation branch from f2ff4f3 to 74e2856 on March 17, 2026 08:32
@Oddly
Author

Oddly commented Mar 17, 2026

Live cluster validation

Tested on a patched Elastic Agent 9.3.1 built from this branch via mage otel:crossBuild.

Fresh install (policy created from scratch each time):

| Endpoint                 | suppress                       | Result   |
|--------------------------|--------------------------------|----------|
| working                  | not set                        | HEALTHY  |
| dead                     | not set                        | DEGRADED |
| dead                     | true                           | HEALTHY  |
| dead                     | false                          | DEGRADED |
| both dead (2 streams)    | mixed (true + not set)         | DEGRADED |
| both dead (2 streams)    | both true                      | HEALTHY  |
| both dead (2 streams)    | both not set                   | DEGRADED |
| both working (2 streams) | both true                      | HEALTHY  |
| all dead (3 streams)     | mixed (not set + true + false) | DEGRADED |

In-place update (same policy updated via PUT without agent restart, exercises the update() recompute path):

| Endpoint  | suppress  | Result   |
|-----------|-----------|----------|
| dead      | not set   | DEGRADED |
| dead      | → true    | HEALTHY  |
| dead      | → false   | DEGRADED |
| dead      | → true    | HEALTHY  |
| → working | true      | HEALTHY  |
| → dead    | true      | HEALTHY  |
| dead      | → not set | DEGRADED |
| → working | not set   | HEALTHY  |

Cross-input types (three integrations on one agent policy):

| Input type    | Endpoint | suppress | Result     |
|---------------|----------|----------|------------|
| CEL (Zabbix)  | working  | no       | HEALTHY    |
| redis/metrics | dead     | true     | suppressed |
| nginx/metrics | dead     | no       | DEGRADED   |

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) label Mar 17, 2026
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Mar 17, 2026
@pierrehilbert pierrehilbert requested a review from faec March 17, 2026 10:30
@AndersonQ
Member

/test

@AndersonQ
Member

You need to update/format the files. `make update` should do the trick. You can check if all is good by running `make check`.

@AndersonQ
Member

/test


Labels

Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team


Development

Successfully merging this pull request may close these issues.

Allow inputs to fail without degrading agent health

5 participants