Skip to content

Comments

[system_tests] Fix race conditions in 'set_and_wait_rc' by counting telemetry events#6371

Draft
vlad-scherbich wants to merge 5 commits intomainfrom
vlad/fix-dynamic-config-flakes
Draft

[system_tests] Fix race conditions in 'set_and_wait_rc' by counting telemetry events#6371
vlad-scherbich wants to merge 5 commits intomainfrom
vlad/fix-dynamic-config-flakes

Conversation

@vlad-scherbich
Copy link
Contributor

@vlad-scherbich vlad-scherbich commented Feb 23, 2026

https://datadoghq.atlassian.net/browse/PROF-13796

Motivation

Out of the latest 100 flaked system-test CI runs on dd-trace-py main, 20 are dynamic configuration tests (15% of total). See JIRA for details.

Root cause was identified as a race condition in set_and_wait_rc. The function can match stale app-client-configuration-change telemetry events from a previous RC update and return before the new config is actually applied.

Tests fixed

note: obtained a list of 11 unique flaky tests from 2026 via this script

Rank Hits Test
1 6 test_dynamic_configuration.py::TestDynamicConfigSamplingRules::test_remote_sampling_rules_retention
2 6 test_dynamic_configuration.py::TestDynamicConfigV1::test_trace_sampling_rate_override_default
3 5 test_dynamic_configuration.py::TestDynamicConfigSamplingRules::test_capability_tracing_sample_rules
4 3 test_dynamic_configuration.py::TestDynamicConfigSamplingRules::test_trace_sampling_rules_override_env

Other tests fixed (same root cause): test_apply_state, test_trace_sampling_rate_override_env, test_trace_sampling_rate_with_sampling_rules, test_log_injection_enabled, test_tracing_client_tracing_tags, test_trace_sampling_rules_override_rate, test_trace_sampling_rules_with_tags.

Changes

  • set_and_wait_rc: Replace wait_for_telemetry_event(..., clear=True) with telemetry event counting. Snapshot the count of app-client-configuration-change events before setting RC, then poll until the count increases—guaranteeing we match a genuinely new event, not a stale one.
  • set_and_wait_rc (slow tracers): For dotnet/php/ruby, which skip telemetry gating, add test_agent.clear() after setting RC to discard stale RC requests from prior configs. Prevents matching an old ACK before the new config is applied.
  • test_capability_tracing_sample_rules: Increase wait_loops from 100 to 400 (~4s) so the library has enough time to send its first RC request.
  • reverted unflake a System Test: 'test_remote_sampling_rules_retention' #6342 , as this generalized fix applies there as well

Reviewer checklist

  • Anything but tests/ or manifests/ is modified ? I have the approval from R&P team
  • A docker base image is modified?
    • the relevant build-XXX-image label is present
  • A scenario is added, removed or renamed?

@github-actions
Copy link
Contributor

github-actions bot commented Feb 23, 2026

CODEOWNERS have been resolved as:

tests/appsec/test_asm_standalone.py                                     @DataDog/asm-libraries @DataDog/system-tests-core
tests/parametric/test_dynamic_configuration.py                          @DataDog/system-tests-core @DataDog/apm-sdk-capabilities
utils/docker_fixtures/_test_agent.py                                    @DataDog/system-tests-core

@datadog-official
Copy link

datadog-official bot commented Feb 23, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: b6929c3 | Docs | Datadog PR Page | Was this helpful? Give us feedback!

@vlad-scherbich vlad-scherbich force-pushed the vlad/fix-dynamic-config-flakes branch 2 times, most recently from 1efc4da to a8cf21a Compare February 23, 2026 21:10
@vlad-scherbich vlad-scherbich changed the title Attempt to fix race conditions in 'set_and_wait_rc' by counting telem… [system_tests] Fix race conditions in 'set_and_wait_rc' by counting telemetry events Feb 24, 2026
@vlad-scherbich vlad-scherbich force-pushed the vlad/fix-dynamic-config-flakes branch 2 times, most recently from 43ab89f to 6f22dcf Compare February 24, 2026 16:55
@vlad-scherbich vlad-scherbich force-pushed the vlad/fix-dynamic-config-flakes branch from 0fff71e to 7885079 Compare February 24, 2026 17:14
@vlad-scherbich vlad-scherbich marked this pull request as ready for review February 24, 2026 21:04
@vlad-scherbich vlad-scherbich requested review from a team as code owners February 24, 2026 21:04
@vlad-scherbich vlad-scherbich requested review from mtoffl01 and removed request for a team February 24, 2026 21:04
@vlad-scherbich
Copy link
Contributor Author

vlad-scherbich commented Feb 24, 2026

@cbeauchesne , second attempt to generalize the fix for #6342

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 057ab1dc44

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 206 to 208
if context.library.name in _SLOW_TRACERS:
# these tracers do not reliably emit app-client-configuration-change on RC update
_set_rc(test_agent, rc_config, config_id)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clear stale RC requests in slow-tracer fallback path

In the context.library.name in _SLOW_TRACERS branch, set_and_wait_rc skips telemetry gating and immediately waits for ACKNOWLEDGED, but wait_for_rc_apply_state matches any previously recorded ACK in the session. Because RC polling is continuous, dotnet/php/ruby can have stale ACK requests from the prior config, so this call may return before the new config is applied; that recreates the same race this patch is trying to eliminate for those tracers.

Useful? React with 👍 / 👎.

@vlad-scherbich vlad-scherbich marked this pull request as draft February 24, 2026 22:54
on RC update, so we skip the telemetry wait and use config_id filtering to avoid stale ACKs.
"""
rc_config = _create_rc_config(config_overrides)
resolved_config_id = config_id or str(hash(json.dumps(rc_config)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does json.dumps provide deterministic ordering of fields if you don't pass sort_fields or whatever the flag is? I think it could mean the resolved_config_id is not deterministic either (and I'm not sure whether it matters in this context)

Comment on lines +219 to +223
for _ in range(_MAX_RC_EVENT_WAIT_LOOPS):
if test_agent.count_telemetry_events("app-client-configuration-change") > pre_count:
break
time.sleep(0.01)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for / else really is something I'll never be able to wrap my head around, but good job using it here 😅

Comment on lines +725 to +727
if message.get("request_type") == event_name:
if message.get("application", {}).get("language_version") != "SIDECAR":
count += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do guard-style here? like if not: continue instead of if: if: if: do()?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants