Summary of changes
Add a targeted retry wrapper around docker run in run-in-docker.yml to handle transient Docker daemon errors without blanket-retrying the entire CI step.
Reason for change
After merging #8358 (Docker daemon readiness checks), we still observed transient failures where the daemon was confirmed ready but docker run failed with D-Bus/cgroup errors like:
unable to start unit "docker-....scope": Message recipient disconnected from message bus without replying
The daemon is healthy (passes docker info), but systemd's D-Bus has a momentary hiccup during container cgroup creation. This is a transient infrastructure issue, not a test failure.
Implementation details
Adds a docker_run_with_retry shell function that wraps the docker run invocation
Retries only on exit code 125 (a Docker daemon error, such as a container creation failure), for up to 3 attempts with a 10s delay
Does not retry on any other exit code (test/build failures inside the container pass through immediately)
Always logs the exit code on failure (##[warning]), so we can observe actual exit codes for future Docker errors regardless of whether they trigger a retry
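The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the function name, the exit code 125, the 3 attempts, the 10s delay, and the ##[warning] prefix come from the description. The environment variable names and the wording of the log message are assumptions made for the example.

```shell
# Hypothetical sketch of a targeted retry wrapper around `docker run`.
# DOCKER_RUN_MAX_ATTEMPTS and DOCKER_RUN_RETRY_DELAY are assumed names,
# introduced here only so the defaults can be overridden.
docker_run_with_retry() {
  local max_attempts="${DOCKER_RUN_MAX_ATTEMPTS:-3}"
  local delay_seconds="${DOCKER_RUN_RETRY_DELAY:-10}"
  local attempt=1
  local exit_code=0

  while true; do
    exit_code=0
    # `|| exit_code=$?` keeps the wrapper safe under `set -e`.
    docker run "$@" || exit_code=$?

    if [ "$exit_code" -eq 0 ]; then
      return 0
    fi

    # Always log the exit code, whether or not it triggers a retry.
    echo "##[warning]docker run failed with exit code ${exit_code} (attempt ${attempt}/${max_attempts})"

    # Only 125 (Docker daemon error) is retried; any other code comes from
    # the container itself and is passed through immediately.
    if [ "$exit_code" -ne 125 ] || [ "$attempt" -ge "$max_attempts" ]; then
      return "$exit_code"
    fi

    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
}
```

A call site would then replace docker run with the wrapper and keep its arguments unchanged, for example docker_run_with_retry --rm example-image ./build.sh, where the image and script names are placeholders.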
Test coverage
CI pipeline change — no unit tests. Validated by reviewing Docker's exit code conventions (125 = daemon error).
Comparing candidate commit 38f6271 in PR branch nacho/DockerRunRetry with baseline commit 58f22e8 in branch master.
Found 1 performance improvement and 1 performance regression! Performance is the same for 25 metrics, 0 unstable metrics, 87 known flaky benchmarks.
Explanation
This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:
🟩 = significantly better candidate vs. baseline
🟥 = significantly worse candidate vs. baseline
We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.
If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.
Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.
More details about the CI and significant changes
You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.
CIs of the difference of means are often centered around 0%, because often changes are not that big:
---------------------------------(------|---^--------)-------------------------------->
                               -0.6%   0%  0.3%    +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'
As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).
For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:
----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1% 1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
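The significance rule described above can be illustrated with a small shell helper. This is only a sketch: the function name is hypothetical, and it assumes the threshold is applied symmetrically, so that a CI counts as significant when it lies entirely above +threshold or entirely below -threshold. The benchmarking platform's actual implementation is not shown in this comment and may differ.

```shell
# Hypothetical helper: exits 0 when the CI [lower, upper] lies entirely
# outside the band [-threshold, +threshold]. All values are percentages.
# Assumes a symmetric threshold, which is an assumption of this sketch.
ci_is_significant() {
  awk -v lo="$1" -v hi="$2" -v thr="$3" \
    'BEGIN { exit !((lo > thr) || (hi < -thr)) }'
}

# The two CIs drawn in the diagrams above, with a 1% threshold:
if ci_is_significant 1.3 3.1 1; then echo "significant"; fi
if ! ci_is_significant -0.6 1.2 1; then echo "not significant"; fi
```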
ignore allocated_mem [+0 bytes; +0 bytes] or [-0.005%; +0.005%]
ignore execution_time [+1.444ms; +5.347ms] or [+0.733%; +2.716%]
ignore throughput [+22002.269op/s; +36507.050op/s] or [+3.072%; +5.097%]
Other details
Follow-up to #8358.