Skip to content

Fix journalctl process lifecycle and cleanup bugs#49528

Open
belimawr wants to merge 42 commits intoelastic:mainfrom
belimawr:fix-journalctl-life-cycle
Open

Fix journalctl process lifecycle and cleanup bugs#49528
belimawr wants to merge 42 commits intoelastic:mainfrom
belimawr:fix-journalctl-life-cycle

Conversation

@belimawr
Copy link
Contributor

@belimawr belimawr commented Mar 17, 2026

Proposed commit message

This PR hardens process management in the journalctl wrapper by fixing several lifecycle
edge cases: factory failures now return a true nil interface (avoiding panics on unusable
instances), Kill() now treats already-finished processes as non-errors and waits for
full shutdown/reaping before returning, and the stdout reader removes unreachable "file 
already closed" special-casing now that goroutine/wait ordering is correct. Together,
these changes make shutdown deterministic and prevent misleading errors and goroutine
leaks.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

## Disruptive User Impact
## Author's Checklist

How to test this PR locally

Run the Journald input tests:

cd filebeat
go test ./input/journald/... -v
mage BuildSystemTestBinary
go test -count=1 -v -tags=integration ./tests/integration -run=TestJournald

Related issues

## Use cases
## Screenshots
## Logs

philippkahr and others added 30 commits March 5, 2026 16:59
 - Set timezone to match host
 - Add systemd 242
 - use absolute path for journald.conf
Kill() previously only sent SIGKILL and returned immediately, leaving
the stdout/stderr reader goroutines and the process-wait goroutine
still running. This meant Close() could return before cleanup finished,
potentially leaking goroutines.
Kill() now calls waitDone.Wait() after sending SIGKILL, blocking until
all background goroutines have exited and the process has been reaped.

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor-CLI, Model: Claude 4.6 Opus (Thinking)
belimawr added 10 commits March 13, 2026 12:16
When journalctl exits on its own (crash or short-lived commands like
--version), Kill() returned "process already finished" which Close()
propagated as a misleading error. Swallow os.ErrProcessDone since the
process is already gone and cleanup via waitDone.Wait() still runs.
The readersWG fix ensures cmd.Wait() is only called after the reader
goroutines finish, so the pipe can no longer be closed while readers
are active. The fs.PathError branch handling "file already closed" is
now unreachable. Replace it with a simple non-EOF error log, matching
the stderr reader.

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor-CLI, Model: Claude 4.6 Opus (Thinking)
The factory error paths returned &journalctl{}, a non-nil pointer to
an unusable struct. Because the return type is the Jctl interface, this
value appears non-nil to callers, defeating nil checks and panicking
on any method call. Return nil so the interface value is a true nil.

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor-CLI, Model: Claude 4.6 Opus (Thinking)
@belimawr belimawr self-assigned this Mar 17, 2026
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Mar 17, 2026
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 17, 2026
@github-actions
Copy link
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Copy link
Contributor

mergify bot commented Mar 17, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

When Kill was called, if the stdout reader goroutine was trying to
send on the channel and nobody was listening, the goroutine would get
stuck and Kill would block forever.

This commit fixes it by using a channel to notify this goroutine to
return when Kill is called.
@belimawr belimawr marked this pull request as ready for review March 23, 2026 15:28
@belimawr belimawr requested a review from a team as a code owner March 23, 2026 15:28
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@coderabbitai
Copy link

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough

Walkthrough

This pull request refactors the journalctl process lifecycle management. It adds a timeout to test container cleanup, improves error handling in factory initialization to avoid partially initialized instances, introduces a stop channel for coordinated reader shutdown, simplifies error logging for stdout reads, and enhances the Kill() method to explicitly close the stop channel, send SIGKILL, and wait for background goroutines before returning. Additionally, the Jctl interface now includes a Kill() method with explicit lifecycle expectations.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • 🛠️ Update Documentation: Commit on current branch
  • 🛠️ Update Documentation: Create PR

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@filebeat/input/journald/pkg/journalctl/reader.go`:
- Around line 75-78: Update all call sites of the Jctl.Kill() method to handle
its returned error (it now returns error). Locate every place that calls Kill()
(including the failing test that previously ignored the return value) and
replace the unused call with proper error handling: capture the returned error
from Jctl.Kill(), and in production code return or log it appropriately, or in
tests assert/fail on non-nil error (e.g., t.Fatalf or require.NoError). Ensure
every invocation of Kill() checks the error and propagates or reports it instead
of discarding it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ca9af9dc-afd2-4bc7-beb1-cacfbaf4a0e8

📥 Commits

Reviewing files that changed from the base of the PR and between 01782c7 and daeee8f.

📒 Files selected for processing (3)
  • filebeat/input/journald/pkg/journalctl/chroot_test.go
  • filebeat/input/journald/pkg/journalctl/journalctl.go
  • filebeat/input/journald/pkg/journalctl/reader.go

Comment on lines +75 to 78
// Kill terminates the journalctl process and blocks until all
// background goroutines (stdout/stderr readers and the process-wait
// goroutine) have exited.
Kill() error
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle Kill() errors at all call sites after the interface change (current CI blocker).

Jctl.Kill() now returns error, but at least one caller still ignores it (filebeat/input/journald/pkg/journalctl/chroot_test.go, Line 168), which is failing golangci-lint (errcheck). Please update remaining call sites to check/handle the return value.

Suggested fix (example for the failing call site)
- defer jctl.Kill() // nolint: deadcode // It's a test, there is nothing to do
+ t.Cleanup(func() {
+ 	require.NoError(t, jctl.Kill(), "failed to stop journalctl")
+ })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@filebeat/input/journald/pkg/journalctl/reader.go` around lines 75 - 78,
Update all call sites of the Jctl.Kill() method to handle its returned error (it
now returns error). Locate every place that calls Kill() (including the failing
test that previously ignored the return value) and replace the unused call with
proper error handling: capture the returned error from Jctl.Kill(), and in
production code return or log it appropriately, or in tests assert/fail on
non-nil error (e.g., t.Fatalf or require.NoError). Ensure every invocation of
Kill() checks the error and propagates or reports it instead of discarding it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants