perf: investigate ~275ms parallel-mode overhead seen only on the bench runner (not reproducible locally) #623

Description

@BryanFRD

Symptom (committed benchmark data)

In the committed ferrflow_parallel stat, the parallel variant (ferrflow <cmd>, default jobs) sits at a uniform ~275–300 ms regardless of fixture size: the 1-package "single" fixture and the 100-package "mono-large" fixture both take ~280 ms. The --jobs 1 variant tracks the real work (2–1035 ms). The result reads as a constant ~275 ms tax on the default-jobs path. The job runs on ubuntu-latest (4 vCPU, runner_cores: 4).

Local reproduction: it does NOT reproduce

Built the CLI and ran a 1-package / 100-commit fixture (the "single" shape) with --timing, comparing --jobs 1 against the default:

COLD (cache cleared)        --jobs 1     default
  open_repo                   26 ms       26 ms
  build TagIndex               1 ms        1 ms
  per-package compute         32 ms       30 ms      ← effectively identical
  total                       59 ms       58 ms

WARM (cache hit) → "per-package compute" is skipped; both ~25 ms (open_repo dominates).

There is no single-vs-default difference locally (Windows), cold or warm. Two earlier hypotheses are ruled out:

  • Cache-key asymmetry: cache::compute_key keys on (HEAD, config, format), not on argv or --jobs, so both commands share the same cache entry. Not the cause.
  • Per-package compute cost: effectively identical (30 vs 32 ms) when the par_iter actually runs.

Interpretation

The ~280 ms is uniform and fixed (independent of package count) and does not reproduce off the runner. That points to a one-time, environment-specific cost in the release build on Linux/ubuntu-latest. The most plausible candidate is construction of rayon's default global pool on the first par_iter, interacting with mimalloc/glibc thread spawning on Linux. That pool is built lazily only on the default-jobs path; --jobs 1 builds a cheap 1-thread pool eagerly in concurrency::init. The cost does not appear in any --timing stage, which is consistent with it being paid at pool-construction time around the par_iter rather than inside it.
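
A rough way to check the thread-spawn half of this hypothesis in isolation, without rayon or mimalloc involved, is to time raw std::thread spawning for as many threads as the machine reports. This is only a proxy, not a measurement of rayon's pool construction, and `time_thread_spawn` is a hypothetical helper written for this sketch:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Spawn `n` no-op threads, join them all, and return the elapsed wall-clock time.
/// Hypothetical helper: it measures OS thread spawn/join only, not rayon.
fn time_thread_spawn(n: usize) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..n).map(|_| thread::spawn(|| {})).collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    start.elapsed()
}

fn main() {
    // Same default the issue proposes for an unset --jobs; fall back to 1 if unknown.
    let n = thread::available_parallelism().map(|p| p.get()).unwrap_or(1);
    // Run twice so a first-call cost can be told apart from the steady state.
    let first = time_thread_spawn(n);
    let second = time_thread_spawn(n);
    println!("threads={n} first={first:?} second={second:?}");
}
```

If the first call is far slower than the second on the runner, thread spawning is implicated. If both stay small, the cost more likely sits in pool setup, the allocator, or the measurement itself.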

It is also possible the committed ferrflow_parallel numbers are a stale/one-off measurement; the next fresh bench run should be checked.

Next step to localize (on the runner, not locally)

Run on ubuntu-latest, cold cache:

ferrflow check --jobs 1 --timing      # baseline
ferrflow check --timing               # default
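
To tell whether the extra time falls inside or outside the stages that --timing reports, one option is to subtract the reported stage times from a wall-clock measurement taken outside the process. The sketch below is only an illustration: `unattributed` is a hypothetical helper, the stage values are copied from the local cold run above, and the 335 ms wall-clock figure is invented to resemble the runner numbers. It is not a measurement.

```rust
use std::time::Duration;

/// Time not attributed to any reported stage: wall-clock minus the sum of stages.
/// Saturates at zero if the stages add up to more than the wall-clock time.
fn unattributed(wall_clock: Duration, stages: &[Duration]) -> Duration {
    let accounted: Duration = stages.iter().sum();
    wall_clock.saturating_sub(accounted)
}

fn main() {
    let ms = Duration::from_millis;
    // open_repo, build TagIndex, per-package compute (local cold run, default jobs).
    let stages = [ms(26), ms(1), ms(30)];
    // Local total as reported by --timing.
    let local = unattributed(ms(58), &stages);
    // Invented wall-clock value standing in for a runner-side capture.
    let runner_like = unattributed(ms(335), &stages);
    println!("local: {local:?}, runner-like: {runner_like:?}");
}
```

A residual close to the ~280 ms seen in the bench data would support the pool-construction reading; a residual near zero with a larger per-package compute stage would point at the stage itself.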

If the ~280 ms shows up outside every --timing stage, it is pool construction. The fix would then go in concurrency::init: build and bound the global pool eagerly even when --jobs is unset (defaulting to std::thread::available_parallelism()), and/or skip the per-package par_iter below a package-count threshold so that small and medium repos never construct the pool. If it shows up inside per-package compute, profile that stage on Linux.
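
A minimal sketch of the two options, assuming nothing about the real concurrency::init beyond what is described here. `resolve_jobs`, `should_parallelize`, and the threshold value of 8 are hypothetical names and numbers chosen for illustration, and the actual pool construction is left as a comment:

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical cutoff: below this many packages, stay on the serial path.
const PARALLEL_THRESHOLD: usize = 8;

/// Resolve the worker count. An explicit --jobs value wins; otherwise use
/// std::thread::available_parallelism(), falling back to 1 if it is unknown.
fn resolve_jobs(requested: Option<NonZeroUsize>) -> usize {
    requested
        .or_else(|| thread::available_parallelism().ok())
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}

/// Decide whether the per-package loop should go parallel at all.
fn should_parallelize(package_count: usize, jobs: usize) -> bool {
    jobs > 1 && package_count >= PARALLEL_THRESHOLD
}

fn main() {
    let jobs = resolve_jobs(None); // --jobs unset
    // With rayon, the eager build would presumably happen here, e.g.
    // rayon::ThreadPoolBuilder::new().num_threads(jobs).build_global(),
    // but only when should_parallelize(..) says the pool will be used.
    for packages in [1, 100] {
        println!(
            "packages={packages} jobs={jobs} parallel={}",
            should_parallelize(packages, jobs)
        );
    }
}
```

Building the pool eagerly moves the cost to a known place where --timing can report it, while the threshold avoids paying it at all for the 1-package "single" shape. Either choice would still need the runner-side capture first to confirm that pool construction is the cause.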

Bench-gate any fix: ferrflow_parallel must not regress vs --jobs 1.

Status

Not confirmed as a code bug — local runs show parity. Needs one runner-side --timing capture (or a fresh bench) before changing concurrency.rs.
