Fix Auditd client goroutine leak #42933

fearful-symmetry wants to merge 6 commits into elastic:main from
Conversation
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
```go
// If we don't wait for the socket to drain, the calling function may tinker with
// the AuditClient pointer while the goroutine is trying to drain it
```
Can you clarify this comment? AFAICS the *libaudit.AuditClient that's passed in here is never altered after the call returns.
```go
// If we don't wait for the socket to drain, the calling function can
// assign a new client instance to the same pointer while our goroutine is using it.
```
If this is the situation, then it seems to me that referring to the containing struct and nilling out the field on completion would do this without a wait group. This would be

```go
defer func() {
	ms.control.Close()
	client := ms.client
	ms.client = nil
	closeAuditClient(client, ms.log)
}()
```

in *MetricSet.Run.
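A minimal, self-contained sketch of the ownership hand-off being proposed (all names here, `metricSet`, `auditClient`, `closeAuditClient`, are mock stand-ins for the real auditbeat types, not the actual implementation):

```go
// Sketch: nil out the struct field before closing, so nothing else can
// reach the old client value while it is being drained.
package main

import "fmt"

type auditClient struct{ name string }

func (c *auditClient) drain() { fmt.Println("draining", c.name) }

type metricSet struct {
	control chan struct{}
	client  *auditClient
}

// closeAuditClient stands in for the real helper; here it just drains.
func closeAuditClient(c *auditClient) {
	if c != nil {
		c.drain()
	}
}

func (ms *metricSet) Run() {
	defer func() {
		close(ms.control)
		client := ms.client // take sole ownership of the old value
		ms.client = nil     // relinquish it from the struct
		closeAuditClient(client)
	}()
	// ... event loop would go here ...
}

func main() {
	ms := &metricSet{control: make(chan struct{}), client: &auditClient{name: "audit"}}
	ms.Run()
	fmt.Println("client field is nil:", ms.client == nil)
}
```

Once `ms.client` is nil, any later accidental use of the field fails loudly instead of racing with the drain.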
Are we looking at the same chunk of code? The problem is in this loop:
```go
for {
	select {
	case <-reporter.Done():
		return
	case <-timer.C:
		if status, err := client.GetStatus(); err == nil {
			ms.updateKernelLostMetric(status.Lost)
		} else {
			ms.log.Error("get status request failed:", err)
			closeAuditClient(client, ms.log)
			client, err = libaudit.NewAuditClient(nil)
			if err != nil {
				ms.log.Errorw("Failure creating audit monitoring client", "error", err)
				reporter.Error(err)
				return
			}
		}
	}
}
```
We have client being closed in the call which runs in a goroutine, and then client is overwritten by the call to libaudit.NewAuditClient, so the old value is no longer available to anyone except the scope in the previous call to closeAuditClient. There is another case like that nearby which is similarly safe. The case in *MetricSet.Run is different: ms.client is passed in and retained in the MetricSet. It's never used after that, but it could be, so nilling it out is a relinquishment of ownership.
So you're suggesting that instead of the waitgroup, we just move the variables around so we don't conflict with the goroutine's reading behavior?
So, assuming that is what you're suggesting, I feel like the waitgroup behavior is a little safer, since that way any future users of closeAuditClient don't have to care about its behavior, and can assume that it's no longer using any pointers once the function returns.
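A self-contained sketch of the wait-group behavior being defended here (hedged: `mockClient` and `drainSocket` are hypothetical stand-ins for `*libaudit.AuditClient` and its drain loop, not the real API):

```go
// Sketch: closeAuditClient spawns a goroutine to drain the client but
// blocks until that goroutine finishes, so callers can assume no pointer
// to the client outlives the call.
package main

import (
	"fmt"
	"sync"
	"time"
)

type mockClient struct{ drained bool }

func (c *mockClient) drainSocket() {
	time.Sleep(10 * time.Millisecond) // simulate a slow socket drain
	c.drained = true
}

func closeAuditClient(c *mockClient) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		c.drainSocket()
	}()
	wg.Wait() // do not return while the goroutine still holds c
}

func main() {
	c := &mockClient{}
	closeAuditClient(c)
	fmt.Println("drained before return:", c.drained)
}
```

The upside, as noted above, is that the safety guarantee lives inside `closeAuditClient` rather than in every caller.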
It's not that we are moving the variable around; we're enforcing ownership. However, ownership is already enforced in a way that would prevent any of the bugs you're seeing, which suggests to me that the issue is not the behaviour the comment describes: the *libaudit.AuditClient is not being used in a racy way via the value of client (or one of the same things held by the MetricSet). This is further supported by your observation that the wait group fixes the issue. What I think is going on is that the close of the old handle, and in particular the draining of the client, is too slow, so the creation of new handles results in a dynamically increasing goroutine count, rather than a true goroutine leak. I don't have strong evidence of this, but it could be obtained by instrumenting the handle opens and closes.
Short answer though is that I think the wait group is probably the right thing to do (it would be nice to have firm evidence of the actual underlying issue), but the explanation in the comment is not correct, and the fix should be done in the library, not here. Doing it in the library (in the Close method) means that all client code would be safe and would not have to reimplement this obviously subtle set of actions.
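What a library-level fix might look like in outline (hedged: this is not go-libaudit's actual `Close` implementation; `client` and its `pending` queue are invented stand-ins for the netlink socket state):

```go
// Sketch: a Close method that drains synchronously before releasing the
// underlying resource, so every caller gets the safe behaviour for free
// and never needs its own wait group or drain goroutine.
package main

import "fmt"

type client struct {
	pending []string // messages still queued on the socket
	closed  bool
}

// Close drains all pending messages before marking the socket closed.
func (c *client) Close() error {
	for len(c.pending) > 0 {
		c.pending = c.pending[1:] // discard one queued message
	}
	c.closed = true
	return nil
}

func main() {
	c := &client{pending: []string{"a", "b", "c"}}
	if err := c.Close(); err != nil {
		panic(err)
	}
	fmt.Println("pending:", len(c.pending), "closed:", c.closed)
}
```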
Oh, you mean the file descriptor. Sorry, I was a bit confused by the comment.
So, I did some more testing, and it looks like you're right, the race doesn't happen with the pointers between the parent and child threads, and instead it looks like the file handle, as you mentioned.
As far as the fix, I think this should probably go in libaudit, but I'm also wondering if we should do a more comprehensive refactor of the read/write logic to try and avoid this entirely.
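One shape such a refactor could take (hedged: a generic Go ownership pattern, not a proposal from this PR; `reader`, `event`, and the channels are all hypothetical): route all socket reads through a single owner goroutine, so no other goroutine ever touches the client directly.

```go
// Sketch: one goroutine owns the (mock) socket and is the only reader;
// consumers receive events over a channel and signal shutdown via stop.
package main

import "fmt"

type event struct{ seq int }

func reader(events chan<- event, stop <-chan struct{}) {
	seq := 0
	for {
		select {
		case <-stop:
			close(events) // no one else can read after the owner exits
			return
		case events <- event{seq: seq}:
			seq++
		}
	}
}

func main() {
	events := make(chan event)
	stop := make(chan struct{})
	go reader(events, stop)

	for i := 0; i < 3; i++ {
		fmt.Println("got event", (<-events).seq)
	}
	close(stop)
	for range events { // drain until the owner closes the channel
	}
	fmt.Println("reader stopped")
}
```

With this structure, "close" is just closing the stop channel; there is no window where two goroutines share the socket.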
> the race doesn't happen with the pointers between the parent and child threads, and instead it looks like the file handle
Excellent. Though I did not mean the file handle, but rather the handle to the audit client; functionally, here, they are the same thing.
> As far as the fix, I think this should probably go in libaudit, but I'm also wondering if we should do a more comprehensive refactor of the read/write logic to try and avoid this entirely.
Agree.
efd6 left a comment:
This LGTM, though I think the work should happen in the library. Up to others as to whether it should be done here first as well.
Yeah, agreed, currently deciding on the where/how of this as well.
Going to start tinkering with a more comprehensive fix in go-libaudit.
Closing in favor of #43352
Proposed commit message

This fixes a rather severe (and interesting) goroutine leak in `Run()`: if `client.GetStatus()` fails, we can assign a new client to the `client` pointer before the goroutine created by `closeAuditClient()` exits, causing that goroutine to live forever and periodically read from the client socket.

I can't get `client.GetStatus()` to fail on its own, but I did some testing where I hard-coded an error return, and auditbeat will grow about 0-3 goroutines a minute. In the real world, I've seen this result in 30K or so goroutines after a few days of running.

This just adds a waitgroup, so `closeAuditClient()` blocks until it's actually done.

This currently doesn't add any tests; I'm debating whether we want to refactor `Run()` so we can better add unit tests to this. Ideally we would come up with something that can test both this and the original bug fixed by #26032.

We might also want to make `ms.kernelLost.enabled` configurable, to enable workarounds for this in the future.

Checklist

- CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.