
chore(ci): stabilize memory-all — bound concurrency, shard across 2 runners, verbose go-test streaming #20965

Merged: ajsutton merged 5 commits into develop from aj/chore/gotestsum-verbose on May 22, 2026.

@ajsutton (Contributor) commented May 21, 2026

Bundle of fixes aimed at the elevated flake rate currently blocking merges (#20966 and the recent wave of mass "context deadline exceeded" failures on memory-all-* jobs, e.g. job 5097460 with 24 simultaneous failures and 100% CPU/RAM on the resources tab).

  1. Switch gotestsum to --format=standard-verbose on go-tests-{short,full,fraud-proofs} so each test's === RUN / --- PASS events stream to the CircleCI log as they happen. With testname formatting, the streamed log truncated at the same buffered package boundary every time, so we never saw which test was running when the runner died. Standard-verbose makes the in-flight test visible even when no artifacts are produced (which is exactly what happens when the runner agent itself dies — store_artifacts never runs). Log volume stays well under CircleCI's 4MB-per-step cap.

  2. Bound go test concurrency on heavy shards so the runner doesn't get saturated by concurrent devstacks.

    • go-tests-{short,full,fraud-proofs}: add -p=4. Previously unbounded (defaulted to GOMAXPROCS = 16–20), so all heavy op-e2e/system/* packages on a single shard could run simultaneously. Intra-package t.Parallel() is rare in these suites, so -parallel is left at the existing PARALLEL=12 env override.
    • op-acceptance-tests: each ParallelT test (op-devstack/devtest/testing.go:409) spins its own full devstack, and ~83% of acceptance tests are ParallelT (127/153 non-helper test funcs). Collapse to a single concurrency axis by setting -parallel=1 (disable intra-package parallelism) and -p=12 directly. 12 is well below the observed ~30-effective-devstack crash threshold.
  3. Shard the memory-all matrix across 2 runners to recover wall time and cut per-box devstack pressure. Restores the test-level splitting from #19832 (chore(ci): add test-level parallelism to memory-all-opn-op-geth): enumerate tests with go test -list, split via circleci tests split --split-by=timings, run the assigned subset via -run. Each sharded job uses -p=8 per shard (down from 12 single-box), so per-box concurrent devstacks drop from 12 → 8 while wall time drops below the pre-cap baseline. Local runs and any single-node caller are unaffected because the recipe falls back to running everything when CIRCLE_NODE_TOTAL is unset.
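As a rough illustration of items 1 and 2, the sketch below assembles the kind of gotestsum command line they describe. The helper name `build_go_test_cmd`, the `PKGS` variable, and the `./...` package pattern are assumptions made for this example and are not taken from the repository's CI config. Only `--format=standard-verbose`, `-p`, `-parallel`, and the `PARALLEL=12` default come from the description above.

```shell
# Hypothetical helper: build the gotestsum invocation described in items 1 and 2.
# PKGS and the function name are illustrative assumptions.
build_go_test_cmd() {
  pkg_jobs="${1:-4}"            # -p: how many packages are tested concurrently
  parallel="${PARALLEL:-12}"    # -parallel: cap on t.Parallel() tests within a package
  pkgs="${PKGS:-./...}"
  printf 'gotestsum --format=standard-verbose -- -p=%s -parallel=%s %s\n' \
    "$pkg_jobs" "$parallel" "$pkgs"
}

# Print the command instead of running it, so the sketch works without gotestsum installed.
build_go_test_cmd 4
```

Everything after the bare `--` is passed through to `go test`, which is why the concurrency flags sit on that side of the separator.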

Runtime comparison

go-tests-full doesn't run on PRs, so its post-change data will land when this merges.

| Job | Develop baseline (median) | Single-box -p=12 -parallel=1 | Final: 2-way shard -p=8 -parallel=1 | Δ vs baseline |
|---|---|---|---|---|
| go-tests-short (-p=4) | 392s (6m32s)¹ | 386s (6m26s) | 388s (6m28s) | ~0% |
| memory-all-opn-op-geth | 1243s (20m43s) | 1372s (22m52s) | 1041s (17m21s) | −16% |
| memory-all-opn-op-reth | 1270s (21m10s) | 1465s (24m25s) (1 pre-existing flake) | 1151s (19m11s) | −9% |
| memory-all-kona-op-reth | 1243s (20m43s) | 1380s (23m00s) | 1055s (17m35s) | −15% |

All three sharded memory-all variants finished below the pre-cap develop baseline while also cutting per-box concurrent devstacks by more than half (~20 unbounded → 8 capped).

¹ go-tests-short doesn't run on develop's main workflow; baseline taken from this PR's pre-cap run (job 5097444).

Print every test as it runs so the streamed CircleCI log captures
test-start events. Today --format=testname only emits a line when a
whole package finishes, so when a runner dies mid-job, the log is
truncated at a buffered package boundary and we lose all signal about
which test was actually running at the time of the kill.

This is the cheapest way to get per-test breadcrumbs without changing
where artifacts are written (artifacts are not produced when the runner
itself dies, so off-box capture isn't an option).

Adds `-p` to bound how many test packages execute simultaneously:

- `go-tests-{short,full,fraud-proofs}`: `-p=4` on the gotestsum
  invocations. Previously unbounded (defaulted to GOMAXPROCS=16-20),
  so heavy e2e packages on `op-e2e/system/*` could all run at once
  on the same shard.
- `op-acceptance-tests`: drop `DEFAULT_JOBS` from `$CPU_COUNT` to a
  fixed `3`. Each acceptance test package launches a full devstack
  (L1 geth + op-node + op-reth/op-geth + batcher + proposer +
  challenger), so 20 concurrent packages saturated CPU and RAM on
  the runner.

Motivation: #20966 and the wave of
"context deadline exceeded" failures on memory-all-* jobs (e.g.
job 5097460, 24 simultaneous failures) both point at resource
contention from unbounded package-level test parallelism. The
resources tab on those runs showed sustained 100% CPU and 100% RAM.

`-parallel` (intra-package `t.Parallel()` cap) is left unchanged.
@ajsutton changed the title from "chore(ci): switch gotestsum format to standard-verbose" to "chore(ci): unblock CI — verbose test streaming + cap package concurrency" on May 21, 2026
@ajsutton (Contributor, Author) commented:

Claude: Baseline durations captured for comparison after this PR runs CI.

Sources: last 5 successful main-workflow runs on develop (pipelines 125915, 125917, 125918, 125933, 126031). go-tests-short does not run on develop, so its baseline is taken from this PR's own pre-cap run on commit 0b20c82a19 (CircleCI job 5097444, pipeline 126004 sibling).

| Job | Baseline median | Baseline range | n |
|---|---|---|---|
| go-tests-short (parallelism=12) | 392s (6m32s) | n/a | 1 (PR pre-cap) |
| go-tests-full (parallelism=16) | 376s (6m16s) | 369–385s | 5 |
| memory-all-opn-op-geth | 1243s (20m43s) | 1235–1288s | 5 |
| memory-all-opn-op-reth | 1270s (21m10s) | 1258–1280s | 5 |
| memory-all-kona-op-reth | 1243s (20m43s) | 1221–1261s | 5 |

After this PR's main workflow completes, compare the same job durations. Expected direction:

  • go-tests-*: small slowdown on the lighter shards (build/compile serialization at -p=4 vs. previously unbounded), wall-clock dominated by the heaviest shard either way — should be ≤ +30% in the worst case.
  • memory-all-*: a larger wall-clock increase (each shard now runs at most 3 devstacks at a time vs. up to 20). That is the cost we pay for stability.

ajsutton added 3 commits May 22, 2026 09:50
Previous cap of -p=3 / -parallel=10 made memory-all-* jobs run ~40m
(roughly 2x the develop baseline of ~21m). Wall time was dominated by
serializing 77 acceptance test packages through only 3 slots.

Only 13 of 177 acceptance tests call t.Parallel(), so -parallel is
essentially a no-op for this workload and the total concurrency budget
is dominated by -p. Bump -p to 8 (still well below the ~20 crash
threshold) and pin -parallel=2 to keep the few opt-in parallel tests
from compounding the concurrent-devstack count.

Expected wall time: ~15 min, in line with or slightly below the
pre-cap baseline.

Previous -p=8 -parallel=2 (cap 16) was aimed wrong: I had under-counted
acceptance tests as ~7% parallel, but the actual ratio is ~83% parallel
once devstack's ParallelT wrapper (op-devstack/devtest/testing.go:409)
is counted. Each ParallelT test spins its own devstack inside the test
function, so the real concurrency cost is -p * -parallel, not -p alone.
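The multiplication in this paragraph can be checked with a one-line helper. This is a sketch only: `effective_devstacks` is a hypothetical name, and it models the worst case in which every running test is a ParallelT test that starts its own devstack. The first call uses the earlier -p=8, -parallel=2 setting; the second uses the -p=12, -parallel=1 setting from the PR description.

```shell
# Worst-case concurrent devstacks if every running test is a ParallelT test:
# packages in flight (-p) times parallel tests per package (-parallel).
effective_devstacks() {
  echo $(( $1 * $2 ))
}

effective_devstacks 8 2    # earlier setting: prints 16
effective_devstacks 12 1   # with -parallel=1: prints 12, the same as -p
```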

Set -parallel=1 so intra-package parallelism is disabled and the only
knob is -p, which directly equals concurrent test devstacks. Pick -p=12
to land wall time near the ~21min pre-cap baseline while staying well
under the observed ~30-effective-devstack crash threshold.

Apply `parallelism: 2` to all three memory-all-* variants and run each
shard with `-p=8` (down from `-p=12`). With ~205 test-minutes of total
work and a single ~7-minute longest test, a timing-balanced 2-way split
brings per-shard wall to ~13 min (vs ~22-24 min today) while cutting
per-box concurrent devstacks from 12 to 8.
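The commit does not say how the ~13 min figure was derived. One reading consistent with its numbers is an idealized bound: total work divided by the total number of slots, floored by the longest single test. The helper below is a sketch under that assumption (the name `estimate_shard_wall` is hypothetical), and it assumes perfectly balanced shards with every `-p` slot kept busy.

```shell
# Idealized per-shard wall time in minutes:
#   max(longest single test, total work / (shards * slots per shard)).
# Assumes perfectly balanced shards and fully used -p slots.
estimate_shard_wall() {
  awk -v total="$1" -v shards="$2" -v slots="$3" -v longest="$4" 'BEGIN {
    ideal = total / (shards * slots)
    if (longest > ideal) ideal = longest
    printf "%.1f\n", ideal
  }'
}

estimate_shard_wall 205 2 8 7   # prints 12.8
```

Setup, compilation, and imperfect balancing are not modeled, so this is best treated as a floor: the sharded runs reported in this PR's runtime comparison took roughly 17 to 19 minutes.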

Restores the test-level splitting recipe from #19832: enumerate tests
with `go test -list`, split via `circleci tests split --split-by=timings`,
feed the assigned subset back as a `-run=^(...)$` regex. Local runs and
single-node CI behave identically to today.

The new `acceptance_test_jobs` job parameter exports
ACCEPTANCE_TEST_JOBS only when non-empty, so the justfile's
DEFAULT_JOBS=12 stays in place for any caller that doesn't override.
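The recipe described in this commit message might look roughly like the sketch below. It is not the repository's actual justfile or CircleCI config: the function names are hypothetical, and treating `CIRCLE_NODE_TOTAL=1` the same as unset is an assumption. Only `go test -list`, `circleci tests split --split-by=timings`, the `-run=^(...)$` shape, and the export-only-when-non-empty behaviour come from the text above.

```shell
# Turn a newline-separated list of test names on stdin into an anchored -run regex.
build_run_regex() {
  names=$(paste -sd'|' -)
  printf '^(%s)$\n' "$names"
}

# Select this node's tests; pass everything through when not on a sharded CI node.
select_tests() {
  if [ -n "${CIRCLE_NODE_TOTAL:-}" ] && [ "${CIRCLE_NODE_TOTAL}" -gt 1 ]; then
    circleci tests split --split-by=timings
  else
    cat
  fi
}

# Export the override only when the job parameter is non-empty,
# so a caller's default stays in effect otherwise.
export_jobs_override() {
  if [ -n "$1" ]; then
    export ACCEPTANCE_TEST_JOBS="$1"
  fi
}

# Usage sketch (needs go and the circleci CLI, so it is left as a comment):
#   regex=$(go test -list '^Test' ./... | grep '^Test' | select_tests | build_run_regex)
#   go test -p="${ACCEPTANCE_TEST_JOBS:-12}" -parallel=1 -run "$regex" ./...

printf 'TestA\nTestB\n' | build_run_regex   # prints ^(TestA|TestB)$
```

The `grep '^Test'` step in the usage comment is there because `go test -list` also prints a per-package `ok` line after the test names.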
@ajsutton changed the title from "chore(ci): unblock CI — verbose test streaming + cap package concurrency" to "chore(ci): unblock CI — verbose streaming + bound concurrency + shard memory-all" on May 22, 2026
@ajsutton changed the title from "chore(ci): unblock CI — verbose streaming + bound concurrency + shard memory-all" to "chore(ci): stabilize memory-all — bound concurrency, shard across 2 runners, verbose go-test streaming" on May 22, 2026
@ajsutton enabled auto-merge May 22, 2026 01:30
@wwared (Contributor) left a comment:

🤞

@ajsutton added this pull request to the merge queue May 22, 2026
Merged via the queue into develop with commit 008e66f May 22, 2026
119 checks passed
@ajsutton deleted the aj/chore/gotestsum-verbose branch May 22, 2026 02:37