Symptom (committed benchmark data)
In the committed ferrflow_parallel stat, the parallel variant (ferrflow <cmd>, default jobs) sits at a uniform ~275–300 ms regardless of fixture size: the 1-package "single" and the 100-package "mono-large" fixtures both come in at ~280 ms. The --jobs 1 variant tracks the real work (2–1035 ms). The result reads as a constant ~+275 ms tax on the default-jobs path. The job runs on ubuntu-latest (4 vCPU, runner_cores: 4).
Local reproduction: it does NOT reproduce
Built the CLI and ran a 1-package / 100-commit fixture (the "single" shape) with --timing, single vs default:
COLD (cache cleared)    --jobs 1    default
open_repo               26 ms       26 ms
build TagIndex          1 ms        1 ms
per-package compute     32 ms       30 ms    ← identical
total                   59 ms       58 ms
WARM (cache hit) → "per-package compute" is skipped; both ~25 ms (open_repo dominates).
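As a cross-check on the cold numbers above, the time not attributed to any stage can be computed as the reported total minus the sum of the stages. The sketch below is illustrative only: TimingCapture and its helper functions are hypothetical names, not part of the CLI, and it assumes the "total" row is a wall-clock figure that can exceed the sum of the stages.

```rust
/// One --timing capture: per-stage durations and the reported total, in ms.
/// Hypothetical type for illustration; not part of ferrflow.
struct TimingCapture {
    stages_ms: Vec<(&'static str, u64)>,
    total_ms: u64,
}

impl TimingCapture {
    /// Time in the total that no stage accounts for (never negative).
    fn unaccounted_ms(&self) -> u64 {
        let staged: u64 = self.stages_ms.iter().map(|(_, ms)| ms).sum();
        self.total_ms.saturating_sub(staged)
    }
}

/// Local cold run with --jobs 1, copied from the table above.
fn local_cold_jobs1() -> TimingCapture {
    TimingCapture {
        stages_ms: vec![
            ("open_repo", 26),
            ("build TagIndex", 1),
            ("per-package compute", 32),
        ],
        total_ms: 59,
    }
}

/// Local cold run with default jobs, copied from the table above.
fn local_cold_default() -> TimingCapture {
    TimingCapture {
        stages_ms: vec![
            ("open_repo", 26),
            ("build TagIndex", 1),
            ("per-package compute", 30),
        ],
        total_ms: 58,
    }
}
```

Calling unaccounted_ms() on both local captures gives a residue of at most about a millisecond. The same calculation on a runner-side capture would show whether the extra time sits outside the stages.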
There is no single-vs-default difference locally (Windows), cold or warm. Two earlier hypotheses are ruled out:
- Cache-key asymmetry: cache::compute_key keys on (HEAD, config, format), not argv/--jobs, so both commands share the same entry; not the cause.
- Per-package compute cost: effectively identical (32 ms with --jobs 1 vs 30 ms default) when the par_iter actually runs.
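To make the first ruled-out hypothesis concrete, here is a minimal sketch of a key derived only from (HEAD, config, format). It is not the real cache::compute_key. The KeyInputs struct, its field names, the compute_key_sketch function, and the use of DefaultHasher are all assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical inputs to the cache key; field names are illustrative.
#[derive(Hash)]
struct KeyInputs<'a> {
    head: &'a str,
    config: &'a str,
    format: &'a str,
}

/// Hypothetical stand-in for cache::compute_key.
/// `_jobs` is accepted only to show that it is deliberately not hashed.
fn compute_key_sketch(inputs: &KeyInputs, _jobs: Option<usize>) -> u64 {
    let mut hasher = DefaultHasher::new();
    inputs.hash(&mut hasher);
    hasher.finish()
}
```

With a key of this shape, a --jobs 1 run and a default-jobs run at the same HEAD, config, and format resolve to the same cache entry, so the cache cannot explain a gap between them.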
Interpretation
The ~280 ms is uniform and fixed (independent of package count) and does not reproduce off the runner. That points to a one-time, environment-specific cost on the Linux/ubuntu-latest release build — most plausibly rayon's default global-pool construction on the first par_iter (built lazily only on the default-jobs path; --jobs 1 builds a cheap 1-thread pool eagerly in concurrency::init), interacting with mimalloc/glibc thread spawn on Linux. It is not visible in any --timing stage, which is consistent with it happening at pool-construction time around the par_iter rather than inside it.
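One way to probe the thread-spawn part of this hypothesis in isolation is to time spawning and joining a handful of no-op OS threads on the runner. The sketch below uses only the standard library, and both function names are hypothetical. It does not reproduce rayon's pool construction; it only measures raw spawn-and-join cost, and it assumes a default-sized pool would use roughly the count reported by available_parallelism().

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Spawns `n` no-op threads, joins them all, and returns the elapsed wall time.
fn time_thread_spawn(n: usize) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..n).map(|_| thread::spawn(|| {})).collect();
    for handle in handles {
        handle.join().expect("no-op thread should not panic");
    }
    start.elapsed()
}

/// Thread count reported by the platform; falls back to 1 if it cannot be queried.
/// Assumption: a default-sized pool would use about this many workers.
fn default_thread_count() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

If time_thread_spawn(default_thread_count()) stays far below 280 ms on the runner, raw thread creation alone is unlikely to be the whole story, and the allocator or pool setup would need a closer look.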
It is also possible the committed ferrflow_parallel numbers are a stale/one-off measurement; the next fresh bench run should be checked.
Next step to localize (on the runner, not locally)
Run on ubuntu-latest, cold cache:
ferrflow check --jobs 1 --timing # baseline
ferrflow check --timing # default
If the ~280 ms shows up outside every --timing stage, the cause is pool construction. The fix then belongs in concurrency::init: build and bound the global pool eagerly even when --jobs is unset (defaulting to std::thread::available_parallelism()), and/or skip the per-package par_iter below a package-count threshold so that small and medium repos never construct the pool. If it shows up inside per-package compute instead, profile that stage on Linux.
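The two candidate mitigations can be sketched with the standard library alone. Everything here is hypothetical: resolve_jobs, should_parallelize, and the MIN_PACKAGES_FOR_PARALLEL value are illustrative names and numbers, not the contents of concurrency.rs, and treating an explicit 0 as "unset" is an assumption.

```rust
use std::thread;

/// Hypothetical threshold; the right value would have to come from the bench.
const MIN_PACKAGES_FOR_PARALLEL: usize = 8;

/// Resolve the worker count up front: honor an explicit --jobs value,
/// otherwise fall back to the machine's available parallelism
/// (or 1 if the platform cannot report it).
/// Assumption: an explicit 0 is treated the same as "unset".
fn resolve_jobs(requested: Option<usize>) -> usize {
    match requested {
        Some(n) if n > 0 => n,
        _ => thread::available_parallelism().map(|n| n.get()).unwrap_or(1),
    }
}

/// Decide whether the per-package loop should go parallel at all.
/// Below the threshold, or with a single worker, a plain sequential
/// iterator avoids touching the thread pool.
fn should_parallelize(package_count: usize, jobs: usize) -> bool {
    jobs > 1 && package_count >= MIN_PACKAGES_FOR_PARALLEL
}

// In real code the resolved count would be handed to the pool builder once,
// for example rayon::ThreadPoolBuilder::new().num_threads(jobs).build_global(),
// so that pool construction happens at a known, measurable point.
```

Resolving the count eagerly makes pool construction a visible step that --timing could report. The threshold keeps fixtures such as the 1-package "single" shape off the parallel path entirely.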
Bench-gate any fix: ferrflow_parallel must not regress vs --jobs 1.
Status
Not confirmed as a code bug — local runs show parity. Needs one runner-side --timing capture (or a fresh bench) before changing concurrency.rs.