Fix Auditd client goroutine leak #42933

Closed
fearful-symmetry wants to merge 6 commits into elastic:main from fearful-symmetry:auditd_client_go_leak

Conversation

@fearful-symmetry
Contributor

Proposed commit message

This fixes a rather severe (and interesting) goroutine leak in Run(): if client.GetStatus() fails, we can assign a new client to the client pointer before the goroutine created by closeAuditClient() exits, causing that goroutine to live forever and periodically read from the client socket.

I can't get client.GetStatus() to fail on its own, but in testing where I hard-coded an error return, auditbeat grows by about 0-3 goroutines a minute. In the real world, I've seen this result in 30K or so goroutines after a few days of running.

This just adds a waitgroup, so closeAuditClient() blocks until the draining goroutine is actually done.
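For illustration, here is a minimal, self-contained sketch of that waitgroup pattern. The auditClient type, drain method, and closeAuditClient signature are placeholders standing in for the real beats/go-libaudit code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// auditClient stands in for *libaudit.AuditClient; names here are
// illustrative, not the actual beats implementation.
type auditClient struct{ closed chan struct{} }

// drain simulates reading the remaining events off the socket.
func (c *auditClient) drain() {
	time.Sleep(10 * time.Millisecond)
}

func (c *auditClient) Close() { close(c.closed) }

// closeAuditClient drains the socket in a goroutine, but blocks on a
// WaitGroup so the caller cannot replace or reuse the client while the
// goroutine is still touching it.
func closeAuditClient(c *auditClient) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		c.drain()
		c.Close()
	}()
	wg.Wait() // returns only after the draining goroutine has exited
}

func main() {
	c := &auditClient{closed: make(chan struct{})}
	closeAuditClient(c)
	select {
	case <-c.closed:
		fmt.Println("client fully closed before closeAuditClient returned")
	default:
		fmt.Println("drain goroutine still running")
	}
}
```

The point of wg.Wait() is that, once closeAuditClient returns, the caller is free to overwrite the client variable without any goroutine still reading from the old one.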

This currently doesn't add any tests; I'm debating whether we want to refactor Run() so we can better add unit tests for this. Ideally we would come up with something that can test both this and the original bug fixed by #26032.

We might also want to make ms.kernelLost.enabled configurable, to enable workarounds for this in the future.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry self-assigned this Feb 26, 2025
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner February 26, 2025 22:03
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 26, 2025
@botelastic

botelastic bot commented Feb 26, 2025

This pull request doesn't have a Team:<team> label.

@fearful-symmetry fearful-symmetry changed the title Auditd client go leak Fix Auditd client goroutine leak Feb 26, 2025
@mergify
Contributor

mergify bot commented Feb 26, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

Contributor

@efd6 efd6 left a comment

I think this is sort of unfortunate. It looks like the draining behaviour should be in the library. I looked into this and this was also suggested in the change that added the deadlock protection.

Comment on lines +175 to +176
// If we don't wait for the socket to drain, the calling function may tinker with
// the AuditClient pointer while the goroutine is trying to drain it
Contributor

Can you clarify this comment? AFAICS the *libaudit.AuditClient that's passed in here is never altered after the call returns.

Comment on lines +176 to +177
// If we don't wait for the socket to drain, the calling function can
// assign a new client instance to the same pointer while our goroutine is using it.
Contributor

If this is the situation, then it seems to me that referring to the containing struct and nilling out the field on completion would do this without a wait group. This would be

	defer func() {
		ms.control.Close()
		client := ms.client
		ms.client = nil
		closeAuditClient(client, ms.log)
	}()

in *MetricSet.Run.

Contributor Author

Are we looking at the same chunk of code? The problem is in this loop:

			for {
				select {
				case <-reporter.Done():
					return
				case <-timer.C:
					if status, err := client.GetStatus(); err == nil {
						ms.updateKernelLostMetric(status.Lost)
					} else {
						ms.log.Error("get status request failed:", err)
						closeAuditClient(client, ms.log)
						client, err = libaudit.NewAuditClient(nil)
						if err != nil {
							ms.log.Errorw("Failure creating audit monitoring client", "error", err)
							reporter.Error(err)
							return
						}
					}
				}
			}

Contributor

Yep.

Contributor

We have client being closed in the call which runs in a goroutine, and then client is overwritten by the call to libaudit.NewAuditClient, so the old value is no longer available to anyone except the scope in the previous call to closeAuditClient. There is another case like that nearby which is similarly safe. The case in *MetricSet.Run is different, since ms.client is passed in and retained in the MetricSet; it's never used after that, but it could be, so nilling it out is a relinquishment of ownership.

Contributor Author

So you're suggesting that instead of the waitgroup, we just move the variables around so we don't conflict with the goroutine's reading behavior?

Contributor Author

So, assuming that is what you're suggesting, I feel like the waitgroup behavior is a little safer, since then any future users of closeAuditClient don't have to care about its internals, and can assume that it's no longer using any pointers once the function returns.

Contributor

It's not that we are moving the variable around; we're just enforcing ownership. However, the ownership is already enforced in a way that would prevent any of the bugs that you see. This suggests to me that the issue is not the behaviour that the comment describes; that is, the *libaudit.AuditClient is not being used in a racy way via the value of client (or one of the same things held by the MetricSet). This is further supported by the fact that you observe that the wait group fixes the issue. What I think is going on is that the close of the old handle, and in particular the draining of the client, is too slow, so the creation of new handles results in a dynamically increasing goroutine count rather than a true goroutine leak. I don't have strong evidence of this, but it could be obtained by instrumenting the handle opens and closes.

Short answer, though, is that I think the wg is probably the right thing to do (it would be nice to have firm evidence of the actual underlying issue), but the explanation in the comment is not correct, and the fix should be done in the library, not here. Doing it in the library (in the Close method) means that all client code would be safe and would not have to reimplement this obviously subtle set of actions.

Contributor Author

Oh, you mean the file descriptor. Sorry, I was a bit confused by the comment.

So, I did some more testing, and it looks like you're right: the race isn't between the pointers in the parent and child goroutines; instead it looks like it's the file handle, as you mentioned.

As far as the fix, I think this should probably go in libaudit, but I'm also wondering if we should do a more comprehensive refactor of the read/write logic to try and avoid this entirely.
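As a sketch of what a library-side fix could look like (every name and field here is hypothetical, not the real go-libaudit API), a Close method that drains synchronously and is idempotent would let callers stop caring about the handle entirely:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// AuditClient is a hypothetical stand-in for go-libaudit's client;
// the buffered channel stands in for the netlink socket.
type AuditClient struct {
	mu     sync.Mutex
	closed bool
	events chan string
}

// Close drains any pending messages synchronously before releasing
// the handle, and is safe to call more than once.
func (c *AuditClient) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return errors.New("already closed")
	}
	for { // drain whatever is buffered before tearing down
		select {
		case <-c.events:
		default:
			c.closed = true
			return nil
		}
	}
}

func main() {
	c := &AuditClient{events: make(chan string, 2)}
	c.events <- "pending event"
	fmt.Println("first close:", c.Close())
	fmt.Println("second close:", c.Close())
}
```

With draining owned by Close, the Run() loop could reconnect without any caller-side waitgroup, which is the "do it in the library" direction discussed above.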

Contributor

the race doesn't happen with the pointers between the parent and child threads, and instead it looks like the file handle

Excellent. Though I did not mean the file handle, but rather the handle to the audit client. Functionally here, they are the same thing.

As far as the fix, I think this should probably go in libaudit, but I'm also wondering if we should do a more comprehensive refactor of the read/write logic to try and avoid this entirely.

Agree.

Contributor

@efd6 efd6 left a comment

This LGTM, though I think the work should happen in the library. Up to others as to whether it should be done here first as well.

@fearful-symmetry
Contributor Author

Yeah, agreed, currently deciding on the where/how of this as well.

@fearful-symmetry
Contributor Author

Going to start tinkering with a more comprehensive fix in go-libaudit.

@fearful-symmetry
Contributor Author

Closing in favor of #43352
