[system_tests] Fix race conditions in 'set_and_wait_rc' by counting telemetry events#6371
[system_tests] Fix race conditions in 'set_and_wait_rc' by counting telemetry events#6371vlad-scherbich wants to merge 5 commits intomainfrom
Conversation
|
|
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🔗 Commit SHA: b6929c3 | Docs | Datadog PR Page | Was this helpful? Give us feedback! |
1efc4da to
a8cf21a
Compare
43ab89f to
6f22dcf
Compare
0fff71e to
7885079
Compare
|
@cbeauchesne , second attempt to generalize the fix for #6342 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 057ab1dc44
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if context.library.name in _SLOW_TRACERS: | ||
| # these tracers do not reliably emit app-client-configuration-change on RC update | ||
| _set_rc(test_agent, rc_config, config_id) |
There was a problem hiding this comment.
Clear stale RC requests in slow-tracer fallback path
In the context.library.name in _SLOW_TRACERS branch, set_and_wait_rc skips telemetry gating and immediately waits for ACKNOWLEDGED, but wait_for_rc_apply_state matches any previously recorded ACK in the session. Because RC polling is continuous, dotnet/php/ruby can have stale ACK requests from the prior config, so this call may return before the new config is applied; that recreates the same race this patch is trying to eliminate for those tracers.
Useful? React with 👍 / 👎.
| on RC update, so we skip the telemetry wait and use config_id filtering to avoid stale ACKs. | ||
| """ | ||
| rc_config = _create_rc_config(config_overrides) | ||
| resolved_config_id = config_id or str(hash(json.dumps(rc_config))) |
There was a problem hiding this comment.
Does json.dumps provide deterministic ordering of fields if you don't pass sort_fields or whatever the flag is? I think it could mean the resolved_config_id is not deterministic either (and I'm not sure whether it matters in this context)
| for _ in range(_MAX_RC_EVENT_WAIT_LOOPS): | ||
| if test_agent.count_telemetry_events("app-client-configuration-change") > pre_count: | ||
| break | ||
| time.sleep(0.01) | ||
| else: |
There was a problem hiding this comment.
for / else really is something I'll never be able to wrap my head around, but good job using it here 😅
| if message.get("request_type") == event_name: | ||
| if message.get("application", {}).get("language_version") != "SIDECAR": | ||
| count += 1 |
There was a problem hiding this comment.
Could we do guard-style here? like if not: continue instead of if: if: if: do()?
https://datadoghq.atlassian.net/browse/PROF-13796
Motivation
Out of the latest 100 flaked system-test CI runs on
dd-trace-pymain, 20 are dynamic configuration tests (15% of total). See JIRA for details.Root cause was identified as a race condition in
set_and_wait_rc. The function can match staleapp-client-configuration-changetelemetry events from a previous RC update and return before the new config is actually applied.Tests fixed
note: obtained a list of 11 unique flaky tests from 2026 via this script
test_dynamic_configuration.py::TestDynamicConfigSamplingRules::test_remote_sampling_rules_retentiontest_dynamic_configuration.py::TestDynamicConfigV1::test_trace_sampling_rate_override_defaulttest_dynamic_configuration.py::TestDynamicConfigSamplingRules::test_capability_tracing_sample_rulestest_dynamic_configuration.py::TestDynamicConfigSamplingRules::test_trace_sampling_rules_override_envOther tests fixed (same root cause):
test_apply_state,test_trace_sampling_rate_override_env,test_trace_sampling_rate_with_sampling_rules,test_log_injection_enabled,test_tracing_client_tracing_tags,test_trace_sampling_rules_override_rate,test_trace_sampling_rules_with_tags.Changes
set_and_wait_rc: Replacewait_for_telemetry_event(..., clear=True)with telemetry event counting. Snapshot the count ofapp-client-configuration-changeevents before setting RC, then poll until the count increases—guaranteeing we match a genuinely new event, not a stale one.set_and_wait_rc(slow tracers): For dotnet/php/ruby, which skip telemetry gating, addtest_agent.clear()after setting RC to discard stale RC requests from prior configs. Prevents matching an old ACK before the new config is applied.test_capability_tracing_sample_rules: Increasewait_loopsfrom 100 to 400 (~4s) so the library has enough time to send its first RC request.Reviewer checklist
tests/ormanifests/is modified ? I have the approval from R&P teambuild-XXX-imagelabel is present