perf: investigate ~275ms parallel-mode overhead seen only on the bench runner (not reproducible locally)

## Symptom (committed benchmark data)

In the committed `ferrflow_parallel` stat, the parallel variant (`ferrflow <cmd>`, default jobs) sits at a **uniform ~275–300 ms regardless of fixture size** (1-package "single" and 100-package "mono-large" both ~280 ms), while `--jobs 1` tracks the real work (2–1035 ms). It therefore reads as a constant ~+275 ms tax on the default-jobs path. The job runs on `ubuntu-latest` (4 vCPU, `runner_cores: 4`).

## Local reproduction: it does NOT reproduce

Built the CLI and ran a 1-package / 100-commit fixture (the "single" shape) with `--timing`, single vs default:

```
COLD (cache cleared)        --jobs 1     default
  open_repo                   26 ms       26 ms
  build TagIndex               1 ms        1 ms
  per-package compute         32 ms       30 ms      ← identical
  total                       59 ms       58 ms

WARM (cache hit) → "per-package compute" is skipped; both ~25 ms (open_repo dominates).
```

There is **no single-vs-default difference locally** (Windows), cold or warm. Two earlier hypotheses are ruled out:
- **Cache-key asymmetry**: `cache::compute_key` keys on `(HEAD, config, format)` — not argv/`--jobs` — so both commands share the same entry; not the cause.
- **Per-package compute cost**: effectively identical (30 vs 32 ms) when the `par_iter` actually runs.
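
To make the cache-key point concrete, here is a minimal sketch of a key derived only from `(HEAD, config, format)`. It is not the real `cache::compute_key`: the `KeyInputs` struct, the `_jobs` parameter, and the use of `DefaultHasher` are all assumptions for illustration. It only shows why two invocations that differ solely in `--jobs` land on the same entry.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the inputs the cache key is described as using.
#[derive(Hash)]
struct KeyInputs<'a> {
    head: &'a str,
    config: &'a str,
    format: &'a str,
}

/// Hash only (HEAD, config, format). `_jobs` is accepted to make the point
/// explicit and is deliberately left out of the hash.
fn compute_key(inputs: &KeyInputs, _jobs: Option<usize>) -> u64 {
    let mut hasher = DefaultHasher::new();
    inputs.hash(&mut hasher);
    hasher.finish()
}
```

With a key of this shape, `--jobs 1` and the default path cannot diverge through the cache, which matches the parity seen in the warm runs.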

## Interpretation

The ~280 ms is **uniform and fixed** (independent of package count) and **does not reproduce off the runner**. That points to a one-time, environment-specific cost in the Linux/`ubuntu-latest` release build. The most plausible candidate is rayon's **default global-pool construction** on the first `par_iter`: the global pool is built lazily only on the default-jobs path, whereas `--jobs 1` builds a cheap 1-thread pool eagerly in `concurrency::init`. That construction may interact with mimalloc/glibc thread spawn on Linux. The cost is not visible in any `--timing` stage, which is consistent with it being paid at pool-construction time around the `par_iter` rather than inside it.
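
One cheap way to test the thread-spawn half of this hypothesis on the runner is a std-only probe that times spawning and joining as many no-op threads as a default-sized pool would likely use. This is a sketch, not rayon's pool construction: `time_thread_spawn` and `default_worker_count` are hypothetical helpers, and the probe does not link mimalloc unless it is built into the same binary.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Spawn `n` no-op threads, join them all, and return the elapsed wall time.
fn time_thread_spawn(n: usize) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..n).map(|_| thread::spawn(|| {})).collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    start.elapsed()
}

/// Worker count a default-sized pool would likely pick on this machine
/// (falls back to 1 if the parallelism cannot be determined).
fn default_worker_count() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

If `time_thread_spawn(default_worker_count())` comes back far below 280 ms on `ubuntu-latest`, raw thread spawn alone is unlikely to explain the gap, and attention should shift to the pool setup itself or to a stale measurement.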

It is also possible the committed `ferrflow_parallel` numbers are a stale/one-off measurement; the next fresh bench run should be checked.

## Next step to localize (on the runner, not locally)

Run on `ubuntu-latest`, cold cache:

```
ferrflow check --jobs 1 --timing      # baseline
ferrflow check --timing               # default
```

If the ~280 ms shows up **outside** every `--timing` stage, it's pool construction, and the fix belongs in `concurrency::init`:
- build/bound the global pool eagerly even when `--jobs` is unset (default to `std::thread::available_parallelism()`), and/or
- skip the per-package `par_iter` below a package-count threshold so small/medium repos never construct the pool.

If it shows up **inside** `per-package compute`, profile that stage on Linux.
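
The two candidate fixes could look roughly like the sketch below. Everything here is an assumption about code that is not shown in this issue: `resolve_jobs`, `should_parallelize`, and the `PARALLEL_THRESHOLD` value of 8 are hypothetical, and the real `concurrency::init` may be structured differently. The resolved count would then presumably be handed to rayon's `ThreadPoolBuilder::new().num_threads(n).build_global()` at startup.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical cutoff: below this many packages, stay sequential.
/// The value 8 is a placeholder and would need to be tuned against the bench.
const PARALLEL_THRESHOLD: usize = 8;

/// Resolve the worker count up front: an explicit `--jobs` wins, otherwise
/// fall back to the machine's available parallelism (1 if it is unknown).
fn resolve_jobs(requested: Option<NonZeroUsize>) -> usize {
    match requested {
        Some(n) => n.get(),
        None => thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(1),
    }
}

/// Use the parallel path only when there is more than one worker and
/// enough packages for the pool to be worth constructing.
fn should_parallelize(package_count: usize, jobs: usize) -> bool {
    jobs > 1 && package_count >= PARALLEL_THRESHOLD
}
```

Under these assumptions the 1-package "single" fixture would never touch the pool, while the 100-package "mono-large" fixture would still run in parallel.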

Bench-gate any fix: `ferrflow_parallel` must not regress vs `--jobs 1`.
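
As a sketch of what that gate could check, the function below flags a regression when the default-jobs time exceeds the `--jobs 1` time by more than a noise tolerance. The name `parallel_regressed` and the tolerance parameter are assumptions; the bench harness may already express this differently.

```rust
/// Hypothetical gate: the default-jobs time may exceed the `--jobs 1` time
/// by at most `tolerance_ms` before the run counts as a regression.
fn parallel_regressed(jobs1_ms: f64, default_ms: f64, tolerance_ms: f64) -> bool {
    default_ms > jobs1_ms + tolerance_ms
}
```

For example, with a 10 ms tolerance, a fixture that takes 2 ms with `--jobs 1` and 280 ms by default would fail the gate, while the local cold run (59 ms vs 58 ms) would pass.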

## Status

Not confirmed as a code bug: local runs show parity. Needs one runner-side `--timing` capture (or a fresh bench) before changing `concurrency.rs`.
