Update Workloadmeta ECS collector to use metadata v4 endpoint by kangyili · Pull Request #21836 · DataDog/datadog-agent

kangyili · 2024-01-03T10:46:46Z

Motivation

A new core check has been introduced to collect ECS tasks through Workloadmeta. Workloadmeta currently relies only on the v1/tasks endpoint, which offers limited information compared to the more recent v4 endpoint. The v4 endpoint, however, cannot return a list of tasks. This PR therefore enhances Workloadmeta to query the v4 endpoint for each task returned by the v1/tasks endpoint.

What does this PR do?

This PR is based on

Related PR

update ECS fargate collector to use v4 endpoint #23253

This PR is used by

[Orchestrator] add new check to collect ecs tasks #22060

ECS Collector

If ecs_task_collection_enabled is set to true, the ECS agent v4 metadata endpoint is queried for each task returned by the v1/tasks endpoint.
The file structure has been slightly reorganised by adding v1/v4 parser files. This PR also detects the endpoint version and uses the parser that matches the detected version.
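
The flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the types `taskV1` and `taskV4`, the `v4Fetcher` function type, and the fallback to v1-only data when a v4 lookup fails are all assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// taskV1 is a hypothetical stand-in for an entry returned by v1/tasks.
type taskV1 struct {
	ARN string
}

// taskV4 is a hypothetical stand-in for the richer v4 task metadata.
type taskV4 struct {
	ARN  string
	Tags map[string]string
}

// v4Fetcher is an assumed function type that queries the v4 endpoint for one task.
type v4Fetcher func(ctx context.Context, t taskV1) (taskV4, error)

// enrichTasks walks the v1 task list and asks the v4 endpoint about each task.
// When collection is disabled, or a v4 lookup fails, the task keeps only the
// information that v1 provides (this fallback is an assumption of the sketch).
func enrichTasks(ctx context.Context, enabled bool, tasks []taskV1, fetch v4Fetcher) []taskV4 {
	out := make([]taskV4, 0, len(tasks))
	for _, t := range tasks {
		if enabled {
			if full, err := fetch(ctx, t); err == nil {
				out = append(out, full)
				continue
			}
		}
		out = append(out, taskV4{ARN: t.ARN})
	}
	return out
}

// runDemo enriches two fake tasks; the v4 lookup fails for the second one.
func runDemo() []taskV4 {
	fetch := func(_ context.Context, t taskV1) (taskV4, error) {
		if t.ARN == "task-b" {
			return taskV4{}, errors.New("v4 endpoint unavailable")
		}
		return taskV4{ARN: t.ARN, Tags: map[string]string{"source": "v4"}}, nil
	}
	tasks := []taskV1{{ARN: "task-a"}, {ARN: "task-b"}}
	return enrichTasks(context.Background(), true, tasks, fetch)
}

func main() {
	for _, t := range runDemo() {
		fmt.Println(t.ARN, t.Tags)
	}
}
```

In this sketch the feature flag is passed in as a plain boolean so that the loop can be exercised without any agent configuration.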

Given the ECS agent metadata endpoint's default rate limit of 40 requests per minute (source), a rate limiter within the worker has been implemented to prevent throttling.

datadog-agent/comp/core/workloadmeta/collectors/internal/ecs/worker.go

Lines 23 to 28 in 66b205d

    
    type worker[T any] struct {
    	processFunc func(ctx context.Context, task v1.Task) (T, error)
    	taskQueue   workqueue.RateLimitingInterface
    	taskCache   *cache.Cache
    }
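
To illustrate the rate-limiting idea in isolation, here is a minimal token-bucket sketch that uses only the standard library. It is not the implementation in this PR, which relies on a `workqueue.RateLimitingInterface`; the `tokenBucket` type, its methods, and the burst value of 5 are assumptions chosen for the example. Only the 40 requests per minute figure comes from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a minimal, single-goroutine rate limiter sketch.
// It is a hypothetical stand-in used for illustration only.
type tokenBucket struct {
	capacity float64   // maximum burst size
	tokens   float64   // tokens currently available
	perSec   float64   // refill rate in tokens per second
	last     time.Time // last time the bucket was refilled
}

func newTokenBucket(perMinute, burst int, now time.Time) *tokenBucket {
	return &tokenBucket{
		capacity: float64(burst),
		tokens:   float64(burst),
		perSec:   float64(perMinute) / 60.0,
		last:     now,
	}
}

// allow reports whether one request may be sent at time now.
func (b *tokenBucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.perSec
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// countAllowed simulates n back-to-back requests at one instant and
// returns how many of them pass the limiter.
func countAllowed(b *tokenBucket, n int, at time.Time) int {
	allowed := 0
	for i := 0; i < n; i++ {
		if b.allow(at) {
			allowed++
		}
	}
	return allowed
}

// demo sends 10 requests immediately and 10 more four seconds later.
func demo() (first, second int) {
	start := time.Unix(0, 0)
	b := newTokenBucket(40, 5, start) // 40 requests/minute; burst of 5 is an assumed value
	first = countAllowed(b, 10, start)
	second = countAllowed(b, 10, start.Add(4*time.Second))
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first, second)
}
```

The clock is passed in explicitly so the behaviour can be checked without sleeping; a production limiter would also need to be safe for concurrent use.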

Each pull operation initiates a worker to fetch tasks from the v4 endpoint. If a task is already present in the cache, it will not be collected again until the cache's TTL has expired.

datadog-agent/comp/core/workloadmeta/collectors/internal/ecs/v4parser.go

Lines 32 to 33 in 66b205d

    
    taskWorker := newWorker(c.taskRateRPS, c.taskRateBurst, c.taskCache, c.getTaskWithTagsFromV4Endpoint)
    processed, rest, skipped := taskWorker.execute(ctx, tasks)
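
The cache behaviour described above (a task that is still cached is not fetched again until its TTL expires) can be sketched as follows. This is an illustration under assumptions, not the worker from this PR: the `ttlCache` type, the `pull` helper, and the 3 minute TTL are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// ttlCache is a tiny, hypothetical stand-in for the task cache.
type ttlCache struct {
	ttl     time.Duration
	expires map[string]time.Time
}

func newTTLCache(ttl time.Duration) *ttlCache {
	return &ttlCache{ttl: ttl, expires: make(map[string]time.Time)}
}

// fresh reports whether key was stored and has not yet expired at time now.
func (c *ttlCache) fresh(key string, now time.Time) bool {
	exp, ok := c.expires[key]
	return ok && now.Before(exp)
}

// put records key with an expiry of now plus the cache TTL.
func (c *ttlCache) put(key string, now time.Time) {
	c.expires[key] = now.Add(c.ttl)
}

// pull splits task ARNs into those fetched now and those skipped because a
// fresh cache entry exists.
func pull(c *ttlCache, arns []string, now time.Time) (fetched, skipped []string) {
	for _, arn := range arns {
		if c.fresh(arn, now) {
			skipped = append(skipped, arn)
			continue
		}
		// A real collector would query the v4 endpoint here.
		c.put(arn, now)
		fetched = append(fetched, arn)
	}
	return fetched, skipped
}

// demo runs three pulls and returns how many tasks each pull fetched.
func demo() []int {
	c := newTTLCache(3 * time.Minute) // the TTL value is an assumption
	arns := []string{"task-a", "task-b"}
	t0 := time.Unix(0, 0)
	var counts []int
	for _, now := range []time.Time{t0, t0.Add(time.Minute), t0.Add(4 * time.Minute)} {
		fetched, _ := pull(c, arns, now)
		counts = append(counts, len(fetched))
	}
	return counts
}

func main() {
	fmt.Println(demo())
}
```

Skipping cached tasks keeps the number of v4 requests per pull low, which matters given the endpoint's rate limit.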

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Set the feature flag DD_ECS_TASK_COLLECTION_ENABLED=true and verify that it has no significant impact on agent performance.
This feature flag lets Workloadmeta query an additional endpoint to get more data, which will be used by the ECS check.

Reviewer's Checklist

pr-commenter · 2024-01-03T12:05:14Z

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: f3698333-5b33-4754-bdac-88bce0a5e340
Baseline: b4f0a17
Comparison: 3caa3f2

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI
➖	file_to_blackhole	% cpu utilization	+4.01	[-2.51, +10.53]

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI
➖	file_to_blackhole	% cpu utilization	+4.01	[-2.51, +10.53]
➖	basic_py_check	% cpu utilization	+2.34	[-0.07, +4.74]
➖	uds_dogstatsd_to_api	ingress throughput	+0.00	[-0.06, +0.06]
➖	trace_agent_msgpack	ingress throughput	+0.00	[-0.00, +0.00]
➖	trace_agent_json	ingress throughput	-0.00	[-0.01, +0.01]
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.06, +0.06]
➖	file_tree	memory utilization	-0.31	[-0.42, -0.21]
➖	process_agent_standard_check_with_stats	memory utilization	-0.36	[-0.39, -0.33]
➖	process_agent_standard_check	memory utilization	-0.62	[-0.66, -0.59]
➖	process_agent_real_time_mode	memory utilization	-0.64	[-0.68, -0.61]
➖	idle	memory utilization	-0.79	[-0.82, -0.76]
➖	tcp_syslog_to_blackhole	ingress throughput	-1.27	[-1.35, -1.20]
➖	pycheck_1000_100byte_tags	% cpu utilization	-1.30	[-6.20, +3.60]
➖	otel_to_otel_logs	ingress throughput	-1.72	[-2.37, -1.08]
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-2.24	[-4.95, +0.47]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
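
The three criteria above can be expressed as a small predicate. The following is only a sketch of the decision rule as stated in this report, not the regression detector's actual code; the `experiment` type and `isRegression` function are hypothetical names, and the last sample row is invented purely to show a positive case.

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds one row of the results table (hypothetical type).
type experiment struct {
	name      string
	deltaMean float64 // Δ mean %
	ciLow     float64 // lower bound of the 90% CI
	ciHigh    float64 // upper bound of the 90% CI
	erratic   bool    // true when the experiment is configured as erratic
}

// effectTolerance is the |Δ mean %| threshold quoted in the report.
const effectTolerance = 5.0

// isRegression applies the three criteria: the effect is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.deltaMean) >= effectTolerance
	ciExcludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && ciExcludesZero && !e.erratic
}

func main() {
	rows := []experiment{
		{"file_to_blackhole", 4.01, -2.51, 10.53, true},
		{"otel_to_otel_logs", -1.72, -2.37, -1.08, false},
		{"made_up_example", 6.00, 2.00, 10.00, false}, // invented row, not from the report
	}
	for _, r := range rows {
		fmt.Println(r.name, isRegression(r))
	}
}
```

Note that otel_to_otel_logs has a confidence interval that excludes zero but is still not flagged, because its effect size is below the 5.00% tolerance.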

GustavoCaso · 2024-01-12T11:14:39Z

@kangyili, fantastic work 🎉

One thing I found while reviewing is that it is quite difficult to hold all of the changes in my head at once. Not saying you have to, but if possible, splitting the work into smaller PRs would make them much easier to review.

zhuminyi · 2024-04-08T12:31:10Z

Can you update QA information?

zhuminyi · 2024-04-08T14:04:05Z

LGTM

GustavoCaso · 2024-04-15T16:20:59Z

+// IsTaskCollectionEnabled returns true if the task metadata collection is enabled for core agent
+// If agent launch type is EC2, collector will query the latest ECS metadata endpoint for each task returned by v1/tasks
+// If agent launch type is Fargate, collector will query the latest ECS metadata endpoint
+func IsTaskCollectionEnabled() bool {


Following the other comment about using the config component, I would suggest passing the config component to this function. That way we do not use pkg/config here either 😄
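
A rough sketch of what that suggestion could look like is shown below. The `configReader` interface and `mapConfig` type are hypothetical and exist only for this example; the real config component's interface may differ. The key name `ecs_task_collection_enabled` is taken from the PR description.

```go
package main

import "fmt"

// configReader is a hypothetical, minimal view of the config component.
type configReader interface {
	GetBool(key string) bool
}

// mapConfig is an in-memory implementation used only for this example.
type mapConfig map[string]bool

func (m mapConfig) GetBool(key string) bool { return m[key] }

// isTaskCollectionEnabled reads the setting from the injected config
// instead of reaching for a package-level global.
func isTaskCollectionEnabled(cfg configReader) bool {
	return cfg.GetBool("ecs_task_collection_enabled")
}

func main() {
	fmt.Println(isTaskCollectionEnabled(mapConfig{"ecs_task_collection_enabled": true}))
	fmt.Println(isTaskCollectionEnabled(mapConfig{}))
}
```

Injecting the config this way also makes the function straightforward to unit test with a fake.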

GustavoCaso · 2024-04-15T16:23:10Z

+}
+
+// ParseV4Task parses a metadata v4 task into a workloadmeta.ECSTask
+func ParseV4Task(task v3or4.Task, seen map[workloadmeta.EntityID]struct{}) []workloadmeta.CollectorEvent {


Why are these functions (ParseV4Task and ParseV4TaskContainers) extracted from the comp/core/workloadmeta/collectors/internal/ecs/v4parser.go package? The util package is usually reserved for functions that are shared by multiple packages, but these functions are only used in comp/core/workloadmeta/collectors/internal/ecs/v4parser.go. I would suggest moving them to that package.

#21836 (comment)

GustavoCaso · 2024-04-15T16:36:57Z

+
+//go:build docker
+
+package util


Based on this PR, I'm not sure we need a util package. Both TaskParser and IsTaskCollectionEnabled are only used in comp/core/workloadmeta/collectors/internal/ecs/ecs.go.

Based on that and the comments below, we can probably move the functions we have here into the packages that use them.

Actually, I have another PR to update the ECS Fargate collector. Both collectors use TaskParser and IsTaskCollectionEnabled, so I created ecs_util.go.
https://github.com/DataDog/datadog-agent/pull/23253/files#diff-b755149238dd28f3c616d3974000977c004ebfb915be4340ce601f868a43420bR37

GustavoCaso

Thank you so much for addressing all the feedback 🎉

kangyili · 2024-04-16T13:17:29Z

/merge

dd-devflow · 2024-04-16T13:17:34Z

🚂 MergeQueue

Pull request added to the queue.

There are 4 builds ahead! (estimated merge in less than 2h)

Use /merge -c to cancel this operation!

* update ECS collector to use v4 endpoint
* address feedback
* address feedback

github-advanced-security AI found potential problems Jan 3, 2024

Comment thread comp/core/workloadmeta/collectors/util/ecs_util.go Fixed

AliDatadog reviewed Jan 3, 2024

Comment thread comp/core/workloadmeta/collectors/internal/ecs/ecs.go Outdated

AliDatadog reviewed Jan 3, 2024

Comment thread comp/core/workloadmeta/collectors/internal/ecs/ecs.go Outdated

kangyili marked this pull request as ready for review January 5, 2024 13:36

kangyili requested review from a team as code owners January 5, 2024 13:36

kangyili changed the title ~~[Orchestrator] Collection ECS Tasks~~ [Orchestrator] Add new check to collect ECS tasks from ESC-EC2 and ECS-Fargate Jan 5, 2024

kangyili added the [deprecated] team/container-app label Jan 5, 2024

kangyili force-pushed the kangyi/ecs branch from f67c3d6 to b941227 Compare January 5, 2024 14:14

kangyili requested a review from a team as a code owner January 5, 2024 14:14

kangyili added this to the 7.52.0 milestone Jan 5, 2024

kangyili commented Jan 5, 2024

Comment thread go.mod Outdated

kangyili mentioned this pull request Jan 5, 2024

add ecs task payload DataDog/agent-payload#281

Merged

kangyili modified the milestones: 7.52.0, Triage Jan 5, 2024

kangyili force-pushed the kangyi/ecs branch from b941227 to 6458f1a Compare January 5, 2024 15:12

estherk15 approved these changes Jan 5, 2024

Comment thread releasenotes/notes/add-new-check-orchestrator-ecs-fa5d1511b1a550c3.yaml Outdated

kangyili mentioned this pull request Jan 10, 2024

Send ECS task lifecycle events #21984

Merged

10 tasks

GustavoCaso reviewed Jan 12, 2024

Comment thread comp/core/workloadmeta/collectors/internal/ecsfargate/ecsfargate.go Outdated