[Network/system-probe] stream-ish network connections over our unix socket#16806
[Network/system-probe] stream-ish network connections over our unix socket#16806
Conversation
e4e3d17 to
ae1aa78
Compare
|
|
||
| // GetActiveConnections returns the delta for connection info from the last time it was called with the same clientID | ||
| func (t *Tracer) GetActiveConnections(clientID string) (*network.Connections, error) { | ||
| func (t *Tracer) GetActiveConnections(clientID string, maxConnectionPerMessage int) (*network.Connections, bool, error) { |
There was a problem hiding this comment.
I think it would be better if the paging was one level up as it is more a function of the request. There is also this for reference: https://cloud.google.com/apis/design/design_patterns#list_pagination
There was a problem hiding this comment.
I can't do one level up, as the goal is to reduce the allocation during the aggregation of dns, http by processing only a page/subset of connections. NPM connections as reference.
On paging we can embedded the mechanism in the protobuf, but would matter as it's used only on our internal api (unix socket)
There was a problem hiding this comment.
The paging needs to be a little more sophisticated to handle cases like where process-agent may crash/restart. Since process-agent uses the same client id every time, just using client id to track pages won't be sufficient.
For moving one level up, I meant do the pagination in modules/network_tracer.go. Another option would be to add something like GetActiveConnectionsPaged to the tracer which would entail fewer side-affects to existing code.
There was a problem hiding this comment.
Adding GetActiveConnectionsPaged sound good
moving up to modules/network_tracer.go it's too late we will got the allocation spike from dns, http already
There was a problem hiding this comment.
I thought the allocations were coming from aggregations, and those happen in writeConnections, right? In the call to Marshal?
There was a problem hiding this comment.
yes it happen on writeConnections, that why I give only a subset of connections
Next patch would fix the GetDelta()->GetStats per protocols that iterate only on a subset of connections, this would limit the allocation even more
c2afcb6 to
ba4c0ce
Compare
ba4c0ce to
a6b8dcc
Compare
Bloop Bleep... Dogbot HereRegression Detector ResultsRun ID: 19a6319c-eb64-471d-9227-5b10dddd1e3d ExplanationA regression test is an integrated performance test for Because a target's optimization goal performance in each experiment will vary somewhat each time it is run, we can only estimate mean differences in optimization goal relative to the baseline target. We express these differences as a percentage change relative to the baseline target, denoted "Δ mean %". These estimates are made to a precision that balances accuracy and cost control. We represent this precision as a 90.00% confidence interval denoted "Δ mean % CI": there is a 90.00% chance that the true value of "Δ mean %" is in that interval. We decide whether a change in performance is a "regression" -- a change worth investigating further -- if both of the following two criteria are true:
The table below, if present, lists those experiments that have experienced a statistically significant change in mean optimization goal performance between baseline and comparison SHAs with 90.00% confidence OR have been detected as newly erratic. Negative values of "Δ mean %" mean that baseline is faster, whereas positive values of "Δ mean %" mean that comparison is faster. Results that do not exhibit more than a ±5.00% change in their mean optimization goal are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting change is observed. Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%:
Fine details of change detection per experiment.
|
| if !more { | ||
| return connections, err | ||
| } | ||
| connections.Aggregate(cnx) |
There was a problem hiding this comment.
Where is this defined? I couldn't find it in agent-payload
There was a problem hiding this comment.
| ebpfTracer connection.Tracer | ||
| bpfTelemetry *telemetry.EBPFTelemetry | ||
|
|
||
| clientConnections map[string]connections |
There was a problem hiding this comment.
This could use a comment explaining how it's used. It took some reading to figure out that this field was used to cache the last run of the tracer
| t.clientConnections[clientID] = connections{ | ||
| latestTime: latestTime, | ||
| active: t.activeBuffer.Connections(), | ||
| } |
There was a problem hiding this comment.
This will not copy the underlying buffer no? So if getActiveConnections() gets called with another client, the current value in clientConnections[firstClient] will get clobbered
edae1bf to
48ea60f
Compare
48ea60f to
f044638
Compare
f044638 to
f53fc73
Compare
What does this PR do?
This would avoid spike memory allocation on system-probe as
we will pull all the NPM connections, but for DNS and USM we will populate only a bucket (maxConnectionsPerMessage)
and send it to process-agent when requested.
The mechanism relay on http status code 206 (StatusPartialContent), meaning the caller need to call system-probe again as we have more data.
Motivation
Avoid spike allocation and OOMs in containerized production
Additional Notes
Depend on agent-payload PR
Let's say it's a first step, as we could a better job on process-agent and system-probe
Would require specific E2E tests
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
Reviewer's Checklist
Triagemilestone is set.major_changelabel if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be use during QA to make sure each team pay extra attention to the changed behavior. For any customer facing change use a releasenote.changelog/no-changeloglabel has been applied.qa/skip-qalabel is not applied.team/..label has been applied, indicating the team(s) that should QA this change.need-change/operatorandneed-change/helmlabels have been applied.k8s/<min-version>label, indicating the lowest Kubernetes version compatible with this feature.