socat CI: run the test suite as parallel shards via parallel-make-check.py #10771

Open
julek-wolfssl wants to merge 6 commits into wolfSSL:master from julek-wolfssl:socat-parallel-shards
@julek-wolfssl


The socat suite runs ~590 tests sequentially in a single job and is
sleep-bound: a handful of tests sit in fixed waits (INTRANETRIPPER alone sleeps
~140s at -t 1.0) that dominate the ~10 min runtime.

This generalizes the shared parallel runner
(.github/scripts/parallel-make-check.py) so any command can ride its worker
pool, not just wolfSSL build configs, and uses it to shard the socat tests
across a single runner.

parallel-make-check.py — three additive config keys

Defaults keep existing build configs behaving exactly as before:

  • build: false — skip configure/make/check; run only the prepare/run
    commands, so an arbitrary command can use the pool.
  • netns: true — run each command under bwrap --unshare-net (its own
    network namespace) so parallel network tests can't collide on ports. Needs
    bubblewrap; warns and falls back to the shared namespace if bwrap is
    missing.
  • shards: N — fan a config out into N instances, each with $SHARD
    (1..N) and $SHARDS=N in its env and its own build-<name>-<k> dir. The
    pool (--threads) bounds how many run at once, so N > threads
    load-balances dynamically. Composes with the existing --shard CI split.
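
The shards fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `fan_out` and the dict keys `name`, `env`, and `build_dir` are assumptions about the config schema, while the `$SHARD`/`$SHARDS` variables and the `build-<name>-<k>` directory come from the description.

```python
import copy


def fan_out(config):
    """Expand one config into `shards` instances (hypothetical schema)."""
    n = int(config.get("shards", 1))
    if n <= 1:
        return [config]
    instances = []
    for k in range(1, n + 1):
        inst = copy.deepcopy(config)
        # Each instance gets a unique name and its own build directory.
        inst["name"] = f"{config['name']}-{k}"
        inst["build_dir"] = f"build-{inst['name']}"
        # SHARD is 1..N and SHARDS is N, both exported as strings.
        env = dict(inst.get("env", {}))
        env.update({"SHARD": str(k), "SHARDS": str(n)})
        inst["env"] = env
        inst["shards"] = 1  # already expanded; do not fan out again
        instances.append(inst)
    return instances
```

Because the instances are queued on the same worker pool, a value of N larger than `--threads` simply leaves more work in the queue for idle workers to pick up.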

socat.yml

One config (build:false, netns:true, shards:12) runs a round-robin slice
of test.sh per shard (seq $SHARD $SHARDS 999), each in its own network
namespace and its own copy of the build dir (their generated certs/temp files
would otherwise race). --no-fail-fast runs every shard so all unexpected
failures are reported, as the unsharded run did. The job timeout drops from 30
to 15 min.
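
To make the round-robin slice concrete, here is a small Python equivalent of `seq $SHARD $SHARDS 999`. The helper name `shard_slice` is hypothetical; the upper bound 999 and the shard count 12 are taken from the description above.

```python
def shard_slice(shard, shards, last=999):
    """Test numbers for one shard, mirroring `seq $SHARD $SHARDS 999`."""
    return list(range(shard, last + 1, shards))


# With 12 shards, shard 1 takes tests 1, 13, 25, ... and shard 2 takes
# 2, 14, 26, ...; together the slices cover every number exactly once.
slices = [shard_slice(k, 12) for k in range(1, 13)]
```

Interleaving the numbers this way spreads neighbouring tests across shards, which is one plausible reason to prefer it over contiguous blocks when slow tests cluster together.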

…ck.py

The socat suite runs ~590 tests sequentially in one job and is
sleep-bound: a few tests sit in fixed waits (INTRANETRIPPER alone sleeps
~140s at -t 1.0) that dominate the ~10 min runtime.

Generalize the parallel runner so any command can ride its worker pool,
not just wolfSSL build configs. Three additive config keys (defaults keep
existing build configs behaving exactly as before):

  build:false  skip configure/make/check; run only the prepare+run commands
  netns:true   run each command under 'bwrap --unshare-net' so parallel
               network tests can't collide on ports
  shards:N     fan a config out into N instances, each with $SHARD/$SHARDS
               in its env and its own build-<name>-<k> dir; the pool bounds
               how many run at once, so N>threads load-balances dynamically

socat.yml uses it: one config (build:false, netns:true, shards:12) runs a
round-robin slice of test.sh per shard, each in its own network namespace
and build-dir copy, with --no-fail-fast so every failure is reported as the
unsharded run did.
Copilot AI review requested due to automatic review settings June 24, 2026 21:06
@julek-wolfssl julek-wolfssl self-assigned this Jun 24, 2026
@dgarske dgarske self-requested a review June 24, 2026 21:07

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@julek-wolfssl julek-wolfssl marked this pull request as ready for review June 24, 2026 21:11
@github-actions

retest this please

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/socat.yml
Comment thread .github/scripts/parallel-make-check.py

socat.yml sets shards to 3*nproc (a few per CPU so the pool always has
queued work to balance the slow tests against), and guards each shard's
slice with ${tests:-0} so a shard that draws no test numbers is a no-op
(test 0 matches nothing) instead of letting test.sh fall back to running
the whole suite.

parallel-make-check.py: guard the summary's occupancy/utilization ratios
against a zero wall time (every job a no-op when shards exceed the work)
so it can't divide by zero.

The first CI run failed because each shard's bwrap --unshare-net netns
differs from the host namespace the expect_fail lists were calibrated for:
 - IPv6: the netns is IPv4-only, so ::1 and dual-stack (v6only) tests fail.
   Re-create IPv6 loopback in each shard (disable_ipv6=0, add ::1,
   bindv6only=0), best-effort (|| true) so IPv4-only runners still work.
 - Timing/port flakiness: one thread per CPU oversubscribes during TLS
   handshakes (each shard runs a server+client socat). Cap --threads at
   half the CPUs.
Also turn on fail-fast (drop --no-fail-fast).
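
The sizing rules from these commits (shards at 3*nproc, --threads capped at half the CPUs) can be written down as a small calculation. The function name `pool_sizes` is hypothetical, and the floor of one thread is an assumption so that single-CPU runners still make progress.

```python
import os


def pool_sizes(cpus=None):
    """Return (shards, threads) for a runner with the given CPU count."""
    cpus = cpus or os.cpu_count() or 1
    shards = 3 * cpus             # a few shards per CPU keeps the queue full
    threads = max(1, cpus // 2)   # half the CPUs; at least one (assumption)
    return shards, threads
```

On a 4-CPU runner this gives 12 shards on 2 threads, so each worker processes several shards in turn.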

Address a review comment: the <name>-<k> instances from shard fan-out
could collide with another config's name and share a build-<name> dir.
Validate after fan-out, matching the duplicate-name check in load_configs.
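
A post-fan-out uniqueness check could look like the sketch below. It is illustrative only: `check_unique_names` and the `name` key are hypothetical and not taken from load_configs.

```python
from collections import Counter


def check_unique_names(instances):
    """Raise if two instances share a name (and so a build-<name> dir)."""
    counts = Counter(inst["name"] for inst in instances)
    dupes = sorted(name for name, count in counts.items() if count > 1)
    if dupes:
        raise ValueError(
            "duplicate config names after shard fan-out: " + ", ".join(dupes)
        )
```

For example, a config named `socat` with two shards yields `socat-1` and `socat-2`, which would collide with a separate config that is itself named `socat-1`.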

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment on lines +60 to +62
    sparse-checkout: |
      .github/actions
      .github/scripts
Comment thread .github/scripts/parallel-make-check.py

Last run's IPv6 re-creation had no effect (netns stayed IPv4-only).
 - parallel-make-check.py: add --cap-add CAP_NET_ADMIN to the netns bwrap
   so a shard can configure its own loopback.
 - socat.yml: bring lo up, add ::1 plus a non-loopback ULA (so the resolver
   treats IPv6 as configured) and set bindv6only=0. Drop 2>/dev/null so the
   setup's success/failure shows in the log.
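
The loopback setup described in this commit might translate to a command sequence like the one below. This is a sketch under assumptions: the workflow itself is shell, the ULA address `fd00::1/64` is an illustrative placeholder, and the helper names are hypothetical. The `ip` and `sysctl` invocations use standard Linux options.

```python
import subprocess


def ipv6_loopback_commands(ula="fd00::1/64"):
    """Commands to re-create IPv6 loopback inside a fresh netns."""
    return [
        ["sysctl", "-w", "net.ipv6.conf.all.disable_ipv6=0"],
        ["ip", "link", "set", "lo", "up"],
        ["ip", "-6", "addr", "add", "::1/128", "dev", "lo"],
        # A non-loopback ULA so the resolver treats IPv6 as configured.
        ["ip", "-6", "addr", "add", ula, "dev", "lo"],
        ["sysctl", "-w", "net.ipv6.bindv6only=0"],
    ]


def run_best_effort(commands):
    """Run each command and keep going on failure, leaving output visible."""
    return [subprocess.run(cmd, check=False).returncode for cmd in commands]
```

Leaving stderr attached, rather than redirecting it to /dev/null, is what lets the CI log show whether each step succeeded.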

The IPv6 re-creation failed with "Operation not permitted": Ubuntu 24.04
restricts unprivileged user namespaces via AppArmor, leaving CAP_NET_ADMIN
ineffective inside bwrap's netns. Add the same
kernel.apparmor_restrict_unprivileged_userns=0 step the other bwrap
workflows use, so each shard can configure its netns loopback (::1,
dual-stack) and the suite's IPv6 tests run.