update ECS fargate collector to use v4 endpoint by kangyili · Pull Request #23253 · DataDog/datadog-agent

kangyili · 2024-02-28T14:39:14Z

What does this PR do?

This PR is based on #21836. It updates ECS Fargate Collector to use v4 endpoint is feature flag is true. We can set this feature flag to true in future so we can fully depend on version detection

datadog-agent/comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go

Lines 80 to 87 in 1b10170

    
           func (c *collector) setTaskCollectionParser() { 
        
           	_, err := ecsmeta.V4FromCurrentTask() 
        
           	if c.taskCollectionEnabled && err == nil { 
        
           		c.taskCollectionParser = c.parseTaskFromV4Endpoint 
        
           		return 
        
           	} 
        
           	c.taskCollectionParser = c.parseTaskFromV2Endpoint 
        
           }

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

pr-commenter · 2024-02-28T16:55:58Z

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: 7d872c5d-5368-4a56-944c-f56428a08997
Baseline: 51a2884
Comparison: e7cfc51

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI
➖	file_to_blackhole	% cpu utilization	-1.36	[-7.63, +4.92]

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI
➖	file_tree	memory utilization	+1.17	[+1.05, +1.29]
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.93	[-1.94, +3.80]
➖	basic_py_check	% cpu utilization	+0.92	[-1.50, +3.34]
➖	otel_to_otel_logs	ingress throughput	+0.03	[-0.61, +0.66]
➖	trace_agent_msgpack	ingress throughput	+0.00	[-0.00, +0.00]
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.06, +0.06]
➖	trace_agent_json	ingress throughput	-0.00	[-0.01, +0.01]
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.05, +0.05]
➖	pycheck_1000_100byte_tags	% cpu utilization	-0.25	[-5.19, +4.69]
➖	tcp_syslog_to_blackhole	ingress throughput	-0.32	[-0.40, -0.24]
➖	process_agent_standard_check_with_stats	memory utilization	-0.50	[-0.54, -0.46]
➖	process_agent_standard_check	memory utilization	-0.52	[-0.57, -0.48]
➖	idle	memory utilization	-1.01	[-1.05, -0.97]
➖	process_agent_real_time_mode	memory utilization	-1.02	[-1.06, -0.98]
➖	file_to_blackhole	% cpu utilization	-1.36	[-7.63, +4.92]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".

pr-commenter · 2024-03-19T16:09:37Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=32510944 --os-family=ubuntu

pr-commenter · 2024-03-19T16:38:18Z

Regression Detector

Regression Detector Results

Run ID: e1c20b57-dd8b-4d1c-968f-fd12b125818c
Baseline: 4276b6e
Comparison: 4953559

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI
➖	file_to_blackhole	% cpu utilization	-1.24	[-6.79, +4.31]

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI
➖	process_agent_real_time_mode	memory utilization	+1.09	[+1.04, +1.13]
➖	tcp_syslog_to_blackhole	ingress throughput	+0.67	[+0.58, +0.76]
➖	basic_py_check	% cpu utilization	+0.42	[-1.97, +2.81]
➖	process_agent_standard_check_with_stats	memory utilization	+0.23	[+0.17, +0.29]
➖	otel_to_otel_logs	ingress throughput	+0.02	[-0.40, +0.44]
➖	trace_agent_msgpack	ingress throughput	+0.02	[+0.01, +0.03]
➖	trace_agent_json	ingress throughput	+0.01	[-0.00, +0.03]
➖	file_tree	memory utilization	-0.00	[-0.13, +0.12]
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.01	[-0.05, +0.03]
➖	uds_dogstatsd_to_api	ingress throughput	-0.03	[-0.24, +0.18]
➖	idle	memory utilization	-0.55	[-0.60, -0.51]
➖	process_agent_standard_check	memory utilization	-0.61	[-0.68, -0.55]
➖	pycheck_1000_100byte_tags	% cpu utilization	-1.06	[-5.90, +3.79]
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-1.08	[-4.02, +1.86]
➖	file_to_blackhole	% cpu utilization	-1.24	[-6.79, +4.31]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".

clamoriniere

few questions

adel121 · 2024-04-19T14:23:48Z

+			Name: taskID,
+		},
+		ClusterName: parseClusterName(task.ClusterName),
+		Region:      parseRegion(task.ClusterName),


nit

I know this was not introduced in your PR, but mentioning it in case you have time to improve things a bit.

you can already use the AWS public go sdk to get the region instead of manually parsing the ARN.

Reference

I think importing sdk will increase the pkg size and the binary size test can fail

Good point 👍

adel121 · 2024-04-19T14:25:36Z

+
+		// the AvailabilityZone metadata is only available for
+		// Fargate tasks using platform version 1.4 or later
+		AvailabilityZone: task.AvailabilityZone,


What happens here if we are using a platform with version older than 1.4? Does this return an empty string in that case?

It would be empty as it's our current behaviour https://github.com/DataDog/datadog-agent/blob/main/comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go#L122

adel121 · 2024-04-19T14:33:46Z

+			t.Errorf("unexpected entity type: %T", entity)
+		}
+	}
+	require.Equal(t, 4, count)


It seems like your tests only include ecs task events, but no container events. is there any reason for this?

It tests both 2 events

datadog-agent/comp/core/workloadmeta/collectors/internal/ecsfargate/v2parser_test.go

Lines 44 to 45 in 4953559

// one ECS task event and three container events should be notified

require.Len(t, store.notifiedEvents, 4)

adel121 · 2024-04-19T14:34:47Z

+			require.Equal(t, "ecs-cluster", entity.ClusterName)
+			require.Equal(t, "my-redis", entity.Family)
+			require.Equal(t, "1", entity.Version)
+			require.Equal(t, workloadmeta.ECSLaunchTypeFargate, entity.LaunchType)


Why aren't we asserting the Containers field?

adel121 · 2024-04-19T14:38:27Z

+			require.Equal(t, "RUNNING", entity.KnownStatus)
+			require.Equal(t, "awslogs", entity.LogDriver)
+			require.Len(t, entity.Networks, 1)
+			require.Equal(t, "awsvpc", entity.Networks[0].NetworkMode)


why didn't we assert these for the case of v2?

adel121

Thanks @kangyili for addressing all comments 🙇

kangyili · 2024-04-22T07:07:22Z

/merge

dd-devflow · 2024-04-22T07:07:27Z

🚂 MergeQueue

Pull request added to the queue.

This build is going to start soon! (estimated merge in less than 25m)

Use /merge -c to cancel this operation!

This reverts commit 808c008.

…4914) This reverts commit 808c008.

…)" (#2…" This reverts commit d8fc7e1.

…4924) * Revert "Revert "update ECS fargate collector to use v4 endpoint (#23253)" (#2…" This reverts commit d8fc7e1. * init clients

* update ECS collector to use v4 endpoint * address feedback * address feedback * update ECS fargate collector to use v4 endpoint * address feedback * address feedback * address feedback

…4914) This reverts commit 808c008.

…4924) * Revert "Revert "update ECS fargate collector to use v4 endpoint (#23253)" (#2…" This reverts commit d8fc7e1. * init clients

kangyili added changelog/no-changelog No changelog entry needed [deprecated] team/container-app labels Feb 28, 2024

kangyili force-pushed the kangyi/ecs-fargate branch from 6680e26 to 975bac6 Compare February 28, 2024 14:41

kangyili mentioned this pull request Feb 28, 2024

Update Workloadmeta ECS collector to use metadata v4 endpoint #21836

Merged

10 tasks

kangyili force-pushed the kangyi/ecs branch from e7d3c14 to f6ddfaa Compare February 28, 2024 15:16

kangyili force-pushed the kangyi/ecs-fargate branch from 975bac6 to b15bc95 Compare February 28, 2024 15:17

kangyili force-pushed the kangyi/ecs branch 2 times, most recently from 8d02b6a to 09f57df Compare March 1, 2024 09:19

kangyili force-pushed the kangyi/ecs-fargate branch from b15bc95 to ae81e64 Compare March 1, 2024 09:21

kangyili force-pushed the kangyi/ecs branch from 09f57df to 3163cee Compare March 15, 2024 15:54

kangyili force-pushed the kangyi/ecs-fargate branch from ae81e64 to e7cfc51 Compare March 15, 2024 15:55

kangyili marked this pull request as ready for review March 15, 2024 15:56

kangyili requested a review from a team as a code owner March 15, 2024 15:56

kangyili mentioned this pull request Mar 18, 2024

[Orchestrator] add new check to collect ecs tasks #22060

Merged

10 tasks

update ECS collector to use v4 endpoint

b6f1f0b

kangyili force-pushed the kangyi/ecs branch from 3caa3f2 to b6f1f0b Compare March 19, 2024 14:09

kangyili requested a review from a team as a code owner March 19, 2024 14:09

kangyili force-pushed the kangyi/ecs-fargate branch from e7cfc51 to f2f8373 Compare March 19, 2024 14:12

kangyili removed the request for review from a team March 19, 2024 14:13

clamoriniere reviewed Mar 22, 2024

View reviewed changes

Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go Outdated

clamoriniere reviewed Mar 22, 2024

View reviewed changes

Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go Outdated

clamoriniere reviewed Mar 22, 2024

View reviewed changes

address feedback

66b205d

kangyili force-pushed the kangyi/ecs-fargate branch from f2f8373 to 8ebf82c Compare March 28, 2024 16:20

Merge branch 'main' into kangyi/ecs

999f087

kangyili force-pushed the kangyi/ecs-fargate branch from 8ebf82c to b181131 Compare March 28, 2024 16:49

kangyili requested a review from clamoriniere April 4, 2024 08:46

address feedback

3ea06d6

kangyili force-pushed the kangyi/ecs-fargate branch from b181131 to 3ea06d6 Compare April 16, 2024 13:40

Base automatically changed from kangyi/ecs to main April 16, 2024 14:14

Merge branch 'main' into kangyi/ecs-fargate

1b10170

kangyili requested a review from a team April 18, 2024 12:25

Merge branch 'main' into kangyi/ecs-fargate

ce05414

kangyili added this to the 7.54.0 milestone Apr 19, 2024

adel121 reviewed Apr 19, 2024

View reviewed changes

Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go

adel121 reviewed Apr 19, 2024

View reviewed changes

clamoriniere approved these changes Apr 19, 2024

View reviewed changes

adel121 reviewed Apr 19, 2024

View reviewed changes

Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/v4parser_test.go

address feedback

4953559

adel121 approved these changes Apr 19, 2024

View reviewed changes

dd-mergequeue bot merged commit 808c008 into main Apr 22, 2024

dd-mergequeue bot deleted the kangyi/ecs-fargate branch April 22, 2024 07:37

YoannGh added a commit that referenced this pull request Apr 22, 2024

Revert "update ECS fargate collector to use v4 endpoint (#23253)"

d0eb4e1

This reverts commit 808c008.

paulcacheux added a commit that referenced this pull request Apr 22, 2024

Revert "update ECS fargate collector to use v4 endpoint (#23253)"

513bf32

This reverts commit 808c008.

paulcacheux mentioned this pull request Apr 22, 2024

Revert "update ECS fargate collector to use v4 endpoint" #24914

Merged

dd-mergequeue bot pushed a commit that referenced this pull request Apr 22, 2024

Revert "update ECS fargate collector to use v4 endpoint (#23253)" (#2…

d8fc7e1

…4914) This reverts commit 808c008.

kangyili added a commit that referenced this pull request Apr 22, 2024

Revert "Revert "update ECS fargate collector to use v4 endpoint (#23253…

5b80ea5

…)" (#2…" This reverts commit d8fc7e1.

alexgallotta pushed a commit that referenced this pull request May 9, 2024

Revert "update ECS fargate collector to use v4 endpoint (#23253)" (#2…

c9e4934

…4914) This reverts commit 808c008.

	func (c *collector) setTaskCollectionParser() {
	_, err := ecsmeta.V4FromCurrentTask()
	if c.taskCollectionEnabled && err == nil {
	c.taskCollectionParser = c.parseTaskFromV4Endpoint
	return
	}
	c.taskCollectionParser = c.parseTaskFromV2Endpoint
	}

	// one ECS task event and three container events should be notified
	require.Len(t, store.notifiedEvents, 4)

Conversation

kangyili commented Feb 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Uh oh!

pr-commenter bot commented Feb 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bloop Bleep... Dogbot Here

Regression Detector Results

No significant changes in experiment optimization goals

Experiments ignored for regressions

Fine details of change detection per experiment

Explanation

Uh oh!

pr-commenter bot commented Mar 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test changes on VM

Uh oh!

pr-commenter bot commented Mar 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression Detector

Regression Detector Results

No significant changes in experiment optimization goals

Experiments ignored for regressions

Fine details of change detection per experiment

Explanation

Uh oh!

Uh oh!

Uh oh!

clamoriniere left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adel121 Apr 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adel121 left a comment

Choose a reason for hiding this comment

Uh oh!

kangyili commented Apr 22, 2024

Uh oh!

dd-devflow bot commented Apr 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kangyili commented Feb 28, 2024 •

edited

Loading

pr-commenter bot commented Feb 28, 2024 •

edited

Loading

pr-commenter bot commented Mar 19, 2024 •

edited

Loading

pr-commenter bot commented Mar 19, 2024 •

edited

Loading

adel121 Apr 19, 2024 •

edited

Loading