chore(ci): stabilize memory-all — bound concurrency, shard across 2 runners, verbose go-test streaming#20965
Merged
Merged
Conversation
Print every test as it runs so the streamed CircleCI log captures test-start events. Today --format=testname only emits a line when a whole package finishes, so when a runner dies mid-job, the log is truncated at a buffered package boundary and we lose all signal about which test was actually running at the time of the kill. This is the cheapest way to get per-test breadcrumbs without changing where artifacts are written (artifacts are not produced when the runner itself dies, so off-box capture isn't an option).
Adds `-p` to bound how many test packages execute simultaneously:
- `go-tests-{short,full,fraud-proofs}`: `-p=4` on the gotestsum
invocations. Previously unbounded (defaulted to GOMAXPROCS=16-20),
so heavy e2e packages on `op-e2e/system/*` could all run at once
on the same shard.
- `op-acceptance-tests`: drop `DEFAULT_JOBS` from `$CPU_COUNT` to a
fixed `3`. Each acceptance test package launches a full devstack
(L1 geth + op-node + op-reth/op-geth + batcher + proposer +
challenger), so 20 concurrent packages saturated CPU and RAM on
the runner.
Motivation: #20966 and the wave of
"context deadline exceeded" failures on memory-all-* jobs (e.g.
job 5097460, 24 simultaneous failures) both point at resource
contention from unbounded package-level test parallelism. The
resources tab on those runs showed sustained 100% CPU and 100% RAM.
`-parallel` (intra-package `t.Parallel()` cap) is left unchanged.
Contributor
Author
|
Claude: Baseline durations captured for comparison after this PR runs CI. Sources: last 5 successful
After this PR's
|
Previous cap of -p=3 / -parallel=10 made memory-all-* jobs run ~40m (roughly 2x the develop baseline of ~21m). Wall time was dominated by serializing 77 acceptance test packages through only 3 slots. Only 13 of 177 acceptance tests call t.Parallel(), so -parallel is essentially a no-op for this workload and the total concurrency budget is dominated by -p. Bump -p to 8 (still well below the ~20 crash threshold) and pin -parallel=2 to keep the few opt-in parallel tests from compounding the concurrent-devstack count. Expected wall time: ~15 min, in line with or slightly below the pre-cap baseline.
Previous -p=8 -parallel=2 (cap 16) was aimed wrong: I had under-counted acceptance tests as ~7% parallel, but the actual ratio is ~83% parallel once devstack's ParallelT wrapper (op-devstack/devtest/testing.go:409) is counted. Each ParallelT test spins its own devstack inside the test function, so the real concurrency cost is -p * -parallel, not -p alone. Set -parallel=1 so intra-package parallelism is disabled and the only knob is -p, which directly equals concurrent test devstacks. Pick -p=12 to land wall time near the ~21min pre-cap baseline while staying well under the observed ~30-effective-devstack crash threshold.
Apply `parallelism: 2` to all three memory-all-* variants and run each shard with `-p=8` (down from `-p=12`). With ~205 test-minutes of total work and a single ~7-minute longest test, a timing-balanced 2-way split brings per-shard wall to ~13 min (vs ~22-24 min today) while cutting per-box concurrent devstacks from 12 to 8. Restores the test-level splitting recipe from #19832: enumerate tests with `go test -list`, split via `circleci tests split --split-by=timings`, feed the assigned subset back as a `-run=^(...)$` regex. Local runs and single-node CI behave identically to today. The new `acceptance_test_jobs` job parameter exports ACCEPTANCE_TEST_JOBS only when non-empty, so the justfile's DEFAULT_JOBS=12 stays in place for any caller that doesn't override.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bundle of fixes aimed at the elevated flake rate currently blocking merges (#20966 and the recent wave of mass
context deadline exceededfailures onmemory-all-*jobs, e.g. job 5097460 with 24 simultaneous failures and 100% CPU/RAM on the resources tab).Switch gotestsum to
--format=standard-verboseongo-tests-{short,full,fraud-proofs}so each test's=== RUN/--- PASSevents stream to the CircleCI log as they happen. Withtestnameformatting, the streamed log truncated at the same buffered package boundary every time, so we never saw which test was running when the runner died. Standard-verbose makes the in-flight test visible even when no artifacts are produced (which is exactly what happens when the runner agent itself dies —store_artifactsnever runs). Log volume stays well under CircleCI's 4MB-per-step cap.Bound
go testconcurrency on heavy shards so the runner doesn't get saturated by concurrent devstacks.go-tests-{short,full,fraud-proofs}: add-p=4. Previously unbounded (defaulted to GOMAXPROCS = 16–20), so all heavyop-e2e/system/*packages on a single shard could run simultaneously. Intra-packaget.Parallel()is rare in these suites, so-parallelis left at the existingPARALLEL=12env override.op-acceptance-tests: each ParallelT test (op-devstack/devtest/testing.go:409) spins its own full devstack, and ~83% of acceptance tests are ParallelT (127/153 non-helper test funcs). Collapse to a single concurrency axis by setting-parallel=1(disable intra-package parallelism) and-p=12directly. 12 is well below the observed ~30-effective-devstack crash threshold.Shard the memory-all matrix across 2 runners to recover wall time and cut per-box devstack pressure. Restores the test-level splitting from chore(ci): add test-level parallelism to memory-all-opn-op-geth #19832: enumerate tests with
go test -list, split viacircleci tests split --split-by=timings, run the assigned subset via-run. Each sharded job uses-p=8per shard (down from 12 single-box), so per-box concurrent devstacks drop from 12 → 8 while wall time drops below the pre-cap baseline. Local runs and any single-node caller are unaffected because the recipe falls back to running everything whenCIRCLE_NODE_TOTALis unset.Runtime comparison
go-tests-fulldoesn't run on PRs, so its post-change data will land when this merges.-p=12 -parallel=1-p=8 -parallel=1go-tests-short(-p=4)memory-all-opn-op-gethmemory-all-opn-op-rethmemory-all-kona-op-rethAll three sharded memory-all variants finished below the pre-cap develop baseline while also halving per-box concurrent devstack pressure (~20 unbounded → 8 capped).
¹
go-tests-shortdoesn't run on develop'smainworkflow; baseline taken from this PR's pre-cap run (job 5097444).