Fix Auditd client goroutine leak #42933

fearful-symmetry wants to merge 6 commits into elastic:main from
Conversation
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
```go
// If we don't wait for the socket to drain, the calling function may tinker with
// the AuditClient pointer while the goroutine is trying to drain it
```
Can you clarify this comment? AFAICS the *libaudit.AuditClient that's passed in here is never altered after the call returns.
```go
// If we don't wait for the socket to drain, the calling function can
// assign a new client instance to the same pointer while our goroutine is using it.
```
If this is the situation, then it seems to me that referring to the containing struct and nilling out the field on completion would do this without a wait group. This would be

```go
defer func() {
	ms.control.Close()
	client := ms.client
	ms.client = nil
	closeAuditClient(client, ms.log)
}()
```

in *MetricSet.Run.
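A minimal, self-contained sketch of the ownership hand-off being proposed (all names here, `metricSet`, `auditClient`, `closeAuditClient`, are mock stand-ins for the real auditbeat types, not the actual implementation):

```go
// Sketch: nil out the struct field before closing, so nothing else can
// reach the old client value while it is being drained.
package main

import "fmt"

type auditClient struct{ name string }

func (c *auditClient) drain() { fmt.Println("draining", c.name) }

type metricSet struct {
	control chan struct{}
	client  *auditClient
}

// closeAuditClient stands in for the real helper; here it just drains.
func closeAuditClient(c *auditClient) {
	if c != nil {
		c.drain()
	}
}

func (ms *metricSet) Run() {
	defer func() {
		close(ms.control)
		client := ms.client // take sole ownership of the old value
		ms.client = nil     // relinquish it from the struct
		closeAuditClient(client)
	}()
	// ... event loop would go here ...
}

func main() {
	ms := &metricSet{control: make(chan struct{}), client: &auditClient{name: "audit"}}
	ms.Run()
	fmt.Println("client field is nil:", ms.client == nil)
}
```

Once `ms.client` is nil, any later accidental use of the field fails loudly instead of racing with the drain.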
Are we looking at the same chunk of code? The problem is in this loop:
```go
for {
	select {
	case <-reporter.Done():
		return
	case <-timer.C:
		if status, err := client.GetStatus(); err == nil {
			ms.updateKernelLostMetric(status.Lost)
		} else {
			ms.log.Error("get status request failed:", err)
			closeAuditClient(client, ms.log)
			client, err = libaudit.NewAuditClient(nil)
			if err != nil {
				ms.log.Errorw("Failure creating audit monitoring client", "error", err)
				reporter.Error(err)
				return
			}
		}
	}
}
```
We have client being closed in the call which runs in a goroutine, and then client is overwritten by the call to libaudit.NewAuditClient, so the old value is no longer available to anyone except the scope in the previous call to closeAuditClient. There is another case like that nearby which is similarly safe. The case in *MetricSet.Run is different: ms.client is passed in and retained in the MetricSet. It's never used after that, but it could be, so nilling it out is a relinquishment of ownership.
So you're suggesting that instead of the waitgroup, we just move the variables around so we don't conflict with the goroutine's reading behavior?
So, assuming that is what you're suggesting, I feel like the waitgroup behavior is a little safer, since that way any future users of closeAuditClient don't have to care about its behavior, and can assume that it's no longer using any pointers once the function returns.
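A self-contained sketch of the wait-group behavior being defended here (hedged: `mockClient` and `drainSocket` are hypothetical stand-ins for `*libaudit.AuditClient` and its drain loop, not the real API):

```go
// Sketch: closeAuditClient spawns a goroutine to drain the client but
// blocks until that goroutine finishes, so callers can assume no pointer
// to the client outlives the call.
package main

import (
	"fmt"
	"sync"
	"time"
)

type mockClient struct{ drained bool }

func (c *mockClient) drainSocket() {
	time.Sleep(10 * time.Millisecond) // simulate a slow socket drain
	c.drained = true
}

func closeAuditClient(c *mockClient) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		c.drainSocket()
	}()
	wg.Wait() // do not return while the goroutine still holds c
}

func main() {
	c := &mockClient{}
	closeAuditClient(c)
	fmt.Println("drained before return:", c.drained)
}
```

The upside, as noted above, is that the safety guarantee lives inside `closeAuditClient` rather than in every caller.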
It's not that we are moving the variable around; we're enforcing ownership. However, ownership is already enforced in a way that would prevent any of the bugs you're seeing, which suggests to me that the issue is not the behaviour the comment describes: the *libaudit.AuditClient is not being used in a racy way via the value of client (or one of the same things held by the MetricSet). This is further supported by your observation that the wait group fixes the issue. What I think is going on is that the close of the old handle, and in particular the draining of the client, is too slow, so the creation of new handles results in a dynamically increasing goroutine count, rather than a true goroutine leak. I don't have strong evidence of this, but it could be obtained by instrumenting the handle opens and closes.
Short answer though is that I think the wait group is probably the right thing to do (it would be nice to have firm evidence of the actual underlying issue), but the explanation in the comment is not correct, and the fix should be done in the library, not here. Doing it in the library (in the Close method) means that all client code would be safe and would not have to reimplement this obviously subtle set of actions.
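What a library-level fix might look like in outline (hedged: this is not go-libaudit's actual `Close` implementation; `client` and its `pending` queue are invented stand-ins for the netlink socket state):

```go
// Sketch: a Close method that drains synchronously before releasing the
// underlying resource, so every caller gets the safe behaviour for free
// and never needs its own wait group or drain goroutine.
package main

import "fmt"

type client struct {
	pending []string // messages still queued on the socket
	closed  bool
}

// Close drains all pending messages before marking the socket closed.
func (c *client) Close() error {
	for len(c.pending) > 0 {
		c.pending = c.pending[1:] // discard one queued message
	}
	c.closed = true
	return nil
}

func main() {
	c := &client{pending: []string{"a", "b", "c"}}
	if err := c.Close(); err != nil {
		panic(err)
	}
	fmt.Println("pending:", len(c.pending), "closed:", c.closed)
}
```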
Oh, you mean the file descriptor. Sorry, I was a bit confused by the comment.
So, I did some more testing, and it looks like you're right, the race doesn't happen with the pointers between the parent and child threads, and instead it looks like the file handle, as you mentioned.
As far as the fix, I think this should probably go in libaudit, but I'm also wondering if we should do a more comprehensive refactor of the read/write logic to try and avoid this entirely.
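One shape such a refactor could take (hedged: a generic Go ownership pattern, not a proposal from this PR; `reader`, `event`, and the channels are all hypothetical): route all socket reads through a single owner goroutine, so no other goroutine ever touches the client directly.

```go
// Sketch: one goroutine owns the (mock) socket and is the only reader;
// consumers receive events over a channel and signal shutdown via stop.
package main

import "fmt"

type event struct{ seq int }

func reader(events chan<- event, stop <-chan struct{}) {
	seq := 0
	for {
		select {
		case <-stop:
			close(events) // no one else can read after the owner exits
			return
		case events <- event{seq: seq}:
			seq++
		}
	}
}

func main() {
	events := make(chan event)
	stop := make(chan struct{})
	go reader(events, stop)

	for i := 0; i < 3; i++ {
		fmt.Println("got event", (<-events).seq)
	}
	close(stop)
	for range events { // drain until the owner closes the channel
	}
	fmt.Println("reader stopped")
}
```

With this structure, "close" is just closing the stop channel; there is no window where two goroutines share the socket.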
> the race doesn't happen with the pointers between the parent and child threads, and instead it looks like the file handle
Excellent. Though I did not mean the file handle, but rather the handle to the audit client; functionally, here, they are the same thing.
> As far as the fix, I think this should probably go in libaudit, but I'm also wondering if we should do a more comprehensive refactor of the read/write logic to try and avoid this entirely.
Agree.
efd6 left a comment:
This LGTM, though I think the work should happen in the library. Up to others as to whether it should be done here first as well.
Yeah, agreed, currently deciding on the where/how of this as well.
Going to start tinkering with a more comprehensive fix in go-libaudit.
Closing in favor of #43352
Proposed commit message

This fixes a rather severe (and interesting) goroutine leak in `Run()`: if `client.GetStatus()` fails, we can assign a new client to the `client` pointer before the goroutine created by `closeAuditClient()` exits, causing that goroutine to live forever and periodically read from the client socket.

I can't get `client.GetStatus()` to fail on its own, but I did some testing where I hard-coded an error return, and auditbeat will grow about 0-3 goroutines a minute. In the real world, I've seen this result in 30K or so goroutines after a few days of running.

This just adds a waitgroup, so `closeAuditClient()` blocks until it's actually done.

This currently doesn't add any tests; I'm debating whether we want to refactor `Run()` so we can better add unit tests to this. Ideally we would come up with something that can test both this and the original bug fixed by #26032.

We might also want to make `ms.kernelLost.enabled` configurable, to enable workarounds for this in the future.

Checklist

- CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.