Fix OTLP HTTP/protobuf export failures and improve OTLP integration test reliability#8449
Fix OTLP HTTP/protobuf export failures and improve OTLP integration test reliability#8449
Conversation
|
@codex review |
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing This PR (8449) and master. ✅ No regressions detected - check the details below Full Metrics ComparisonFakeDbCommand
HttpMessageHandler
Comparison explanationExecution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). Duration chartsFakeDbCommand (.NET Framework 4.8)gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (73ms) : 70, 75
master - mean (74ms) : 70, 77
section Bailout
This PR (8449) - mean (80ms) : 76, 84
master - mean (77ms) : 74, 79
section CallTarget+Inlining+NGEN
This PR (8449) - mean (1,079ms) : 1034, 1123
master - mean (1,080ms) : 1022, 1139
FakeDbCommand (.NET Core 3.1)gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (114ms) : 107, 120
master - mean (116ms) : 108, 124
section Bailout
This PR (8449) - mean (117ms) : 111, 123
master - mean (114ms) : 112, 117
section CallTarget+Inlining+NGEN
This PR (8449) - mean (797ms) : 769, 825
master - mean (800ms) : 774, 826
FakeDbCommand (.NET 6)gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (102ms) : 96, 108
master - mean (100ms) : 95, 105
section Bailout
This PR (8449) - mean (105ms) : 100, 110
master - mean (105ms) : 99, 110
section CallTarget+Inlining+NGEN
This PR (8449) - mean (946ms) : 908, 985
master - mean (942ms) : 910, 974
FakeDbCommand (.NET 8)gantt
title Execution time (ms) FakeDbCommand (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (99ms) : 95, 103
master - mean (101ms) : 97, 105
section Bailout
This PR (8449) - mean (103ms) : 98, 108
master - mean (101ms) : 99, 103
section CallTarget+Inlining+NGEN
This PR (8449) - mean (826ms) : 780, 872
master - mean (825ms) : 786, 863
HttpMessageHandler (.NET Framework 4.8)gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (210ms) : 205, 214
master - mean (204ms) : 199, 209
section Bailout
This PR (8449) - mean (214ms) : 209, 218
master - mean (207ms) : 204, 211
section CallTarget+Inlining+NGEN
This PR (8449) - mean (1,224ms) : 1175, 1272
master - mean (1,200ms) : 1158, 1241
HttpMessageHandler (.NET Core 3.1)gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (303ms) : 297, 310
master - mean (294ms) : 288, 301
section Bailout
This PR (8449) - mean (302ms) : 296, 308
master - mean (295ms) : 291, 300
section CallTarget+Inlining+NGEN
This PR (8449) - mean (1,006ms) : 977, 1035
master - mean (991ms) : 966, 1017
HttpMessageHandler (.NET 6)gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (293ms) : 286, 300
master - mean (288ms) : 282, 294
section Bailout
This PR (8449) - mean (295ms) : 290, 301
master - mean (289ms) : 283, 294
section CallTarget+Inlining+NGEN
This PR (8449) - mean (1,178ms) : 1136, 1219
master - mean (1,167ms) : 1133, 1202
HttpMessageHandler (.NET 8)gantt
title Execution time (ms) HttpMessageHandler (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8449) - mean (295ms) : 289, 302
master - mean (288ms) : 281, 294
section Bailout
This PR (8449) - mean (294ms) : 288, 300
master - mean (288ms) : 283, 293
section CallTarget+Inlining+NGEN
This PR (8449) - mean (1,067ms) : 998, 1136
master - mean (1,063ms) : 980, 1145
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BenchmarksBenchmark execution time: 2026-04-17 13:35:19 Comparing candidate commit 7b707ac in PR branch Found 1 performance improvements and 2 performance regressions! Performance is the same for 24 metrics, 0 unstable metrics, 87 known flaky benchmarks.
|
b6d7dd5 to
96ecfea
Compare
| <type fullname="System.Net.DnsEndPoint" /> | ||
| <type fullname="System.Net.EndPoint" /> | ||
| <type fullname="System.Net.HttpStatusCode" /> | ||
| <type fullname="System.Net.HttpVersion" /> |
There was a problem hiding this comment.
The removal of System.Net.HttpVersion appears to be due to the removal of DefaultRequestVersion = HttpVersion.Version20 and DefaultVersionPolicy = HttpVersionPolicy.RequestVersionOrHigher
| DefaultRequestVersion = HttpVersion.Version20, | ||
| DefaultVersionPolicy = HttpVersionPolicy.RequestVersionOrHigher | ||
| }; | ||
| return new HttpClient(handler); |
There was a problem hiding this comment.
I couldn't find any documentation / reason why it had to be HTTP/2 so I think this is fine, but feel free to request it to be reverted.
96ecfea to
f827579
Compare
The vendored gRPC client already sets HTTP/2 per-request via RequireHttp2, so forcing DefaultRequestVersion=HTTP/2 on the shared HttpClient breaks HTTP/protobuf export against HTTP/1.1-only servers (e.g., dd-apm-test-agent on port 4318). Remove System.Net.HttpVersion from trimming XML since it is no longer referenced.
Shut down the OtlpExporter in OtlpSubmissionLogSink.DisposeAsync() so the underlying HttpClient is disposed after the final flush.
Replace fire-and-forget session clear with a retry helper that waits for the test-agent HTTP endpoint to be ready. Replace single-GET data fetch with a polling helper (WaitForTestAgentData) that retries for up to 30 seconds, accounting for export flush timing after process exit. Fix incorrect env var OTEL_LOG_EXPORT_INTERVAL (does not exist) to OTEL_BLRP_SCHEDULE_DELAY. Set to 500ms so the OTel SDK gets multiple periodic exports before LoggerProviderSdk.Dispose() hits its 5s shutdown timeout. This is especially important for gRPC, where the first export warms the HTTP/2 connection. Set OTEL_METRIC_EXPORT_INTERVAL to 60000ms so only the shutdown flush fires, preventing duplicate metric batches from observable instruments that broke snapshot comparison. Remove [Flaky] attributes from SubmitsOtlpMetrics and SubmitsOtlpLogs.
b8990e6 to
3a29f10
Compare
Summary of changes
This fixes some export failures occurring in OTLP Logs/Metrics exports that were caused by both bugs within the exporter code (forced HTTP/2 and a unclosed
_otlpExporter) and also with some test changes.Reason for change
I've been noticing on many of my CI runs that I kept running into the following error that requires a retry:
On looking further I found a few issues of note:
HttpClients for Logs/Metrics for our OTLP implementation forced HTTP/2. However, the test agent that we us only supports an HTTP/1.1 aiohttp server. Note: this is technically a production change for a test issue I'm fine reverting this and just leaving these tests as[Flaky]OtlpSubmissionLogSink.DisposeAsync()never called_otlpExporter.Shutdown()which appears to have left theHttpClientopen.OTEL_LOG_EXPORT_INTERVALinstead ofOTEL_BLRP_SCHEDULE_DELAY(former doesn't seem to exist ever or anymore)Implementation details
It does seem that the test agent is a bit finicky to work with and since we don't use it very much anywhere else I opted to go for some polling strategies instead of single shots for getting request data.
OTEL_LOG_EXPORT_INTERVALtoOTEL_BLRP_SCHEDULE_DELAY/test/session/clearwithClearTestAgentSessionretry helperWaitForTestAgentDatapolling helper (30s timeout, 500ms interval)OTEL_METRIC_EXPORT_INTERVAL=60000so only the shutdown flush fires (one clean batch)OTEL_BLRP_SCHEDULE_DELAY=500; The OpenTelemetry SDK has a hard 5s shutdown timeout that we hit so this should be good enough to get our data out and flush it before it is timed out. I'll add this to the Other Details as well. This seems to be the flakiest bit left[Flaky]fromSubmitsOtlpMetricsandSubmitsOtlpLogsTest coverage
I'm going to run the CI a few times against this to see if this helps get the tests running quicker and pass
Other details
The test-agent's HTTP server on port 4318 is created in agent.py: otlp_http_app = make_otlp_http_app(agent) # line 2436 otlp_http_runner = web.AppRunner(otlp_http_app) # line 2470 otlp_http_site = web.TCPSite(otlp_http_runner, port=parsed_args.otlp_http_port) # line 2490 This is aiohttp.web -- its TCPSite serves HTTP/1.1 only. aiohttp does not support HTTP/2. You can see it declared as a dependency in setup.py: install_requires=[ "aiohttp", ... ] There's no h2, hyper, hypercorn, or any other HTTP/2 library in the dependencies. For contrast, port 4317 (gRPC) uses grpc_aio.server() (line 2481), which speaks HTTP/2 natively -- that's why gRPC tests were never affected. Source: https://github.com/DataDog/dd-apm-test-agent -- setup.py for dependencies, ddapm_test_agent/agent.py lines 2436-2490 for server setup.