Fix journalctl process lifecycle and cleanup bugs#49528
Fix journalctl process lifecycle and cleanup bugs#49528belimawr wants to merge 42 commits intoelastic:mainfrom
Conversation
- Set timezone to match host - Add systemd 242 - use absolute path for journald.conf
Kill() previously only sent SIGKILL and returned immediately, leaving the stdout/stderr reader goroutines and the process-wait goroutine still running. This meant Close() could return before cleanup finished, potentially leaking goroutines. Kill() now calls waitDone.Wait() after sending SIGKILL, blocking until all background goroutines have exited and the process has been reaped. GenAI-Assisted: Yes Human-Reviewed: Yes Tool: Cursor-CLI, Model: Claude 4.6 Opus (Thinking)
When journalctl exits on its own (crash or short-lived commands like --version), Kill() returned "process already finished" which Close() propagated as a misleading error. Swallow os.ErrProcessDone since the process is already gone and cleanup via waitDone.Wait() still runs.
The readersWG fix ensures cmd.Wait() is only called after the reader goroutines finish, so the pipe can no longer be closed while readers are active. The fs.PathError branch handling "file already closed" is now unreachable. Replace it with a simple non-EOF error log, matching the stderr reader. GenAI-Assisted: Yes Human-Reviewed: Yes Tool: Cursor-CLI, Model: Claude 4.6 Opus (Thinking)
The factory error paths returned &journalctl{}, a non-nil pointer to
an unusable struct. Because the return type is the Jctl interface, this
value appears non-nil to callers, defeating nil checks and panicking
on any method call. Return nil so the interface value is a true nil.
GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor-CLI, Model: Claude 4.6 Opus (Thinking)
🤖 GitHub commentsJust comment with:
|
|
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
When Kill was called, if the stdout reader goroutine was trying to send on the channel and nobody was listening, the goroutine would get stuck and Kill would block forever. This commit fixes it by using a channel to notify this goroutine to return when Kill is called.
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
📝 WalkthroughWalkthroughThis pull request refactors the journalctl process lifecycle management. It adds a timeout to test container cleanup, improves error handling in factory initialization to avoid partially initialized instances, introduces a stop channel for coordinated reader shutdown, simplifies error logging for stdout reads, and enhances the Kill() method to explicitly close the stop channel, send SIGKILL, and wait for background goroutines before returning. Additionally, the Jctl interface now includes a Kill() method with explicit lifecycle expectations. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@filebeat/input/journald/pkg/journalctl/reader.go`:
- Around line 75-78: Update all call sites of the Jctl.Kill() method to handle
its returned error (it now returns error). Locate every place that calls Kill()
(including the failing test that previously ignored the return value) and
replace the unused call with proper error handling: capture the returned error
from Jctl.Kill(), and in production code return or log it appropriately, or in
tests assert/fail on non-nil error (e.g., t.Fatalf or require.NoError). Ensure
every invocation of Kill() checks the error and propagates or reports it instead
of discarding it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ca9af9dc-afd2-4bc7-beb1-cacfbaf4a0e8
📒 Files selected for processing (3)
filebeat/input/journald/pkg/journalctl/chroot_test.gofilebeat/input/journald/pkg/journalctl/journalctl.gofilebeat/input/journald/pkg/journalctl/reader.go
| // Kill terminates the journalctl process and blocks until all | ||
| // background goroutines (stdout/stderr readers and the process-wait | ||
| // goroutine) have exited. | ||
| Kill() error |
There was a problem hiding this comment.
Handle Kill() errors at all call sites after the interface change (current CI blocker).
Jctl.Kill() now returns error, but at least one caller still ignores it (filebeat/input/journald/pkg/journalctl/chroot_test.go, Line 168), which is failing golangci-lint (errcheck). Please update remaining call sites to check/handle the return value.
Suggested fix (example for the failing call site)
- defer jctl.Kill() // nolint: deadcode // It's a test, there is nothing to do
+ t.Cleanup(func() {
+ require.NoError(t, jctl.Kill(), "failed to stop journalctl")
+ })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@filebeat/input/journald/pkg/journalctl/reader.go` around lines 75 - 78,
Update all call sites of the Jctl.Kill() method to handle its returned error (it
now returns error). Locate every place that calls Kill() (including the failing
test that previously ignored the return value) and replace the unused call with
proper error handling: capture the returned error from Jctl.Kill(), and in
production code return or log it appropriately, or in tests assert/fail on
non-nil error (e.g., t.Fatalf or require.NoError). Ensure every invocation of
Kill() checks the error and propagates or reports it instead of discarding it.
Proposed commit message
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesI have added tests that prove my fix is effective or that my feature works. Where relevant, I have used thestresstest.shscript to run them under stress conditions and race detector to verify their stability../changelog/fragmentsusing the changelog tool.## Disruptive User Impact## Author's ChecklistHow to test this PR locally
Run the Journald input tests:
Related issues
--boot all#49445## Use cases## Screenshots## Logs