Skip to content

update ECS fargate collector to use v4 endpoint#23253

Merged
dd-mergequeue[bot] merged 11 commits intomainfrom
kangyi/ecs-fargate
Apr 22, 2024
Merged

update ECS fargate collector to use v4 endpoint#23253
dd-mergequeue[bot] merged 11 commits intomainfrom
kangyi/ecs-fargate

Conversation

@kangyili
Copy link
Copy Markdown
Contributor

@kangyili kangyili commented Feb 28, 2024

What does this PR do?

This PR is based on #21836. It updates ECS Fargate Collector to use v4 endpoint is feature flag is true. We can set this feature flag to true in future so we can fully depend on version detection

func (c *collector) setTaskCollectionParser() {
_, err := ecsmeta.V4FromCurrentTask()
if c.taskCollectionEnabled && err == nil {
c.taskCollectionParser = c.parseTaskFromV4Endpoint
return
}
c.taskCollectionParser = c.parseTaskFromV2Endpoint
}

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

@pr-commenter
Copy link
Copy Markdown

pr-commenter bot commented Feb 28, 2024

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: 7d872c5d-5368-4a56-944c-f56428a08997
Baseline: 51a2884
Comparison: e7cfc51

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI
file_to_blackhole % cpu utilization -1.36 [-7.63, +4.92]

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI
file_tree memory utilization +1.17 [+1.05, +1.29]
uds_dogstatsd_to_api_cpu % cpu utilization +0.93 [-1.94, +3.80]
basic_py_check % cpu utilization +0.92 [-1.50, +3.34]
otel_to_otel_logs ingress throughput +0.03 [-0.61, +0.66]
trace_agent_msgpack ingress throughput +0.00 [-0.00, +0.00]
uds_dogstatsd_to_api ingress throughput -0.00 [-0.06, +0.06]
trace_agent_json ingress throughput -0.00 [-0.01, +0.01]
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.05, +0.05]
pycheck_1000_100byte_tags % cpu utilization -0.25 [-5.19, +4.69]
tcp_syslog_to_blackhole ingress throughput -0.32 [-0.40, -0.24]
process_agent_standard_check_with_stats memory utilization -0.50 [-0.54, -0.46]
process_agent_standard_check memory utilization -0.52 [-0.57, -0.48]
idle memory utilization -1.01 [-1.05, -0.97]
process_agent_real_time_mode memory utilization -1.02 [-1.06, -0.98]
file_to_blackhole % cpu utilization -1.36 [-7.63, +4.92]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@kangyili kangyili force-pushed the kangyi/ecs branch 2 times, most recently from 8d02b6a to 09f57df Compare March 1, 2024 09:19
@kangyili kangyili force-pushed the kangyi/ecs-fargate branch from b15bc95 to ae81e64 Compare March 1, 2024 09:21
@kangyili kangyili force-pushed the kangyi/ecs-fargate branch from ae81e64 to e7cfc51 Compare March 15, 2024 15:55
@kangyili kangyili marked this pull request as ready for review March 15, 2024 15:56
@kangyili kangyili requested a review from a team as a code owner March 15, 2024 15:56
@kangyili kangyili requested a review from a team as a code owner March 19, 2024 14:09
@kangyili kangyili force-pushed the kangyi/ecs-fargate branch from e7cfc51 to f2f8373 Compare March 19, 2024 14:12
@kangyili kangyili removed the request for review from a team March 19, 2024 14:13
@pr-commenter
Copy link
Copy Markdown

pr-commenter bot commented Mar 19, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=32510944 --os-family=ubuntu

@pr-commenter
Copy link
Copy Markdown

pr-commenter bot commented Mar 19, 2024

Regression Detector

Regression Detector Results

Run ID: e1c20b57-dd8b-4d1c-968f-fd12b125818c
Baseline: 4276b6e
Comparison: 4953559

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI
file_to_blackhole % cpu utilization -1.24 [-6.79, +4.31]

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI
process_agent_real_time_mode memory utilization +1.09 [+1.04, +1.13]
tcp_syslog_to_blackhole ingress throughput +0.67 [+0.58, +0.76]
basic_py_check % cpu utilization +0.42 [-1.97, +2.81]
process_agent_standard_check_with_stats memory utilization +0.23 [+0.17, +0.29]
otel_to_otel_logs ingress throughput +0.02 [-0.40, +0.44]
trace_agent_msgpack ingress throughput +0.02 [+0.01, +0.03]
trace_agent_json ingress throughput +0.01 [-0.00, +0.03]
file_tree memory utilization -0.00 [-0.13, +0.12]
tcp_dd_logs_filter_exclude ingress throughput -0.01 [-0.05, +0.03]
uds_dogstatsd_to_api ingress throughput -0.03 [-0.24, +0.18]
idle memory utilization -0.55 [-0.60, -0.51]
process_agent_standard_check memory utilization -0.61 [-0.68, -0.55]
pycheck_1000_100byte_tags % cpu utilization -1.06 [-5.90, +3.79]
uds_dogstatsd_to_api_cpu % cpu utilization -1.08 [-4.02, +1.86]
file_to_blackhole % cpu utilization -1.24 [-6.79, +4.31]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go Outdated
Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go Outdated
Copy link
Copy Markdown
Contributor

@clamoriniere clamoriniere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few questions

@kangyili kangyili force-pushed the kangyi/ecs-fargate branch from f2f8373 to 8ebf82c Compare March 28, 2024 16:20
@kangyili kangyili force-pushed the kangyi/ecs-fargate branch from 8ebf82c to b181131 Compare March 28, 2024 16:49
@kangyili kangyili requested a review from clamoriniere April 4, 2024 08:46
@kangyili kangyili force-pushed the kangyi/ecs-fargate branch from b181131 to 3ea06d6 Compare April 16, 2024 13:40
Base automatically changed from kangyi/ecs to main April 16, 2024 14:14
@kangyili kangyili requested a review from a team April 18, 2024 12:25
@kangyili kangyili added this to the 7.54.0 milestone Apr 19, 2024
Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go
Name: taskID,
},
ClusterName: parseClusterName(task.ClusterName),
Region: parseRegion(task.ClusterName),
Copy link
Copy Markdown
Contributor

@adel121 adel121 Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

I know this was not introduced in your PR, but mentioning it in case you have time to improve things a bit.

you can already use the AWS public go sdk to get the region instead of manually parsing the ARN.

Reference

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think importing sdk will increase the pkg size and the binary size test can fail

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point 👍


// the AvailabilityZone metadata is only available for
// Fargate tasks using platform version 1.4 or later
AvailabilityZone: task.AvailabilityZone,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here if we are using a platform with version older than 1.4? Does this return an empty string in that case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Errorf("unexpected entity type: %T", entity)
}
}
require.Equal(t, 4, count)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like your tests only include ecs task events, but no container events. is there any reason for this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It tests both 2 events

// one ECS task event and three container events should be notified
require.Len(t, store.notifiedEvents, 4)

require.Equal(t, "ecs-cluster", entity.ClusterName)
require.Equal(t, "my-redis", entity.Family)
require.Equal(t, "1", entity.Version)
require.Equal(t, workloadmeta.ECSLaunchTypeFargate, entity.LaunchType)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we asserting the Containers field?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +61 to +64
require.Equal(t, "RUNNING", entity.KnownStatus)
require.Equal(t, "awslogs", entity.LogDriver)
require.Len(t, entity.Networks, 1)
require.Equal(t, "awsvpc", entity.Networks[0].NetworkMode)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why didn't we assert these for the case of v2?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

@adel121 adel121 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kangyili for addressing all comments 🙇

@kangyili
Copy link
Copy Markdown
Contributor Author

/merge

@dd-devflow
Copy link
Copy Markdown

dd-devflow bot commented Apr 22, 2024

🚂 MergeQueue

Pull request added to the queue.

This build is going to start soon! (estimated merge in less than 25m)

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit 808c008 into main Apr 22, 2024
@dd-mergequeue dd-mergequeue bot deleted the kangyi/ecs-fargate branch April 22, 2024 07:37
YoannGh added a commit that referenced this pull request Apr 22, 2024
paulcacheux added a commit that referenced this pull request Apr 22, 2024
dd-mergequeue bot pushed a commit that referenced this pull request Apr 22, 2024
kangyili added a commit that referenced this pull request Apr 22, 2024
dd-mergequeue bot pushed a commit that referenced this pull request Apr 24, 2024
…4924)

* Revert "Revert "update ECS fargate collector to use v4 endpoint (#23253)" (#2…"

This reverts commit d8fc7e1.

* init clients
alexgallotta pushed a commit that referenced this pull request May 9, 2024
* update ECS collector to use v4 endpoint

* address feedback

* address feedback

* update ECS fargate collector to use v4 endpoint

* address feedback

* address feedback

* address feedback
alexgallotta pushed a commit that referenced this pull request May 9, 2024
…4924)

* Revert "Revert "update ECS fargate collector to use v4 endpoint (#23253)" (#2…"

This reverts commit d8fc7e1.

* init clients
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants