CI: Rust tests job takes 26+ minutes; shard it and shrink real-time test windows #335

## Problem

The PR-gating **Rust tests** job took **26m22s** on #334 (run `27250885594`), a docs/changelog/version-bump PR. That is too long for the inner loop, and the duration profile says most of it is serialized waiting, not useful work.

Data points from that run:

- One test binary alone took **180.14s** (41 tests, includes `test_v031_backfills_queue_storage_failed_done_metric_index`).
- Other binaries: 60.36s (21 tests), 46.23s (18 tests), 32.56s (4 tests).
- Many individual integration tests trip the harness's "has been running for over 60 seconds" warning and then pass: `test_queue_storage_receipt_claims_rescue_after_grace_window`, `test_queue_storage_receipt_claims_retry_successfully`, `test_queue_storage_receipt_deadline_rescue_force_closes_expired_claim`, `test_queue_storage_register_callback_rejects_stale_lease`, `test_queue_storage_retry_from_dlq_surfaces_unique_conflict`, `test_queue_storage_runtime_callback_timeout_moves_to_dlq`, and more.
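As a rough check on what sharding alone could buy, the four binary timings above can be packed into shards greedily (longest binary first, always into the lightest shard). This is only a sketch: `pack` is a hypothetical helper, durations are written in hundredths of a second to avoid float comparisons, and it covers only the four binaries quoted here, not compile time or any binaries missing from this list.

```rust
/// Greedy longest-first packing: sort durations descending, then put each
/// one into whichever shard currently has the smallest total.
/// Returns the total load per shard, in the same unit as the input.
pub fn pack(durations_cs: &[u64], shards: usize) -> Vec<u64> {
    assert!(shards > 0, "need at least one shard");
    let mut sorted = durations_cs.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // longest first
    let mut loads = vec![0u64; shards];
    for d in sorted {
        // Index of the lightest shard so far (first one wins on ties).
        let lightest = (0..shards)
            .min_by_key(|&i| loads[i])
            .expect("shards > 0");
        loads[lightest] += d;
    }
    loads
}

/// The four per-binary timings from the run, in hundredths of a second.
pub const BINARIES_CS: [u64; 4] = [18014, 6036, 4623, 3256];
```

Run serially, these four binaries add up to 319.29s. `pack(&BINARIES_CS, 2)` gives shard loads of 180.14s and 139.15s, and adding more shards cannot push the worst shard below 180.14s, because that single binary is the floor until it is split or its tests get faster.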

A test that needs >60s of wall clock is almost always waiting on a real-time window (grace periods, rescue intervals, rotation cadence, poll intervals) rather than doing 60s of work.
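A minimal sketch of that distinction, assuming a hypothetical `RescueConfig` fixture type (the names and the 60s placeholder are illustrative, not this project's API or defaults): the rule under test is the same at any window size, and only the wall-clock cost of waiting it out changes.

```rust
use std::time::Duration;

/// Hypothetical fixture config; the field name is illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct RescueConfig {
    pub grace_window: Duration,
}

impl RescueConfig {
    /// Placeholder for a production-scale window (the real default is not
    /// stated in this issue; 60s is an assumption for illustration).
    pub fn production_like() -> Self {
        Self { grace_window: Duration::from_secs(60) }
    }

    /// Test-scale window: same rule, hundreds of milliseconds.
    pub fn for_tests() -> Self {
        Self { grace_window: Duration::from_millis(200) }
    }
}

/// The behavior a rescue test pins: a claim becomes rescuable only once
/// its lease has been expired for at least the grace window.
pub fn is_rescuable(expired_for: Duration, cfg: RescueConfig) -> bool {
    expired_for >= cfg.grace_window
}

/// Lower bound on wall-clock time for a test that waits the window out
/// in real time instead of injecting elapsed time.
pub fn min_wall_clock(cfg: RescueConfig) -> Duration {
    cfg.grace_window
}
```

With `for_tests()`, "not rescuable before the window, rescuable after it" is still asserted, but the unavoidable wait drops from the placeholder 60s to 200ms.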

## Proposed work

1. **Shard the job.** Split `Rust tests` into a matrix of concurrent CI jobs — per package or per test-binary group — each with its own Postgres service container so DB contention doesn't serialize across shards. Target: worst shard under ~8 minutes.
2. **Adopt `cargo-nextest`.** Per-test parallelism with per-test timing output, which also gives us a durable list of the slowest tests per run instead of one-off log archaeology.
3. **Audit the >60s tests.** For each, check whether the configured window (lease grace, deadline, rescue cadence, callback timeout, maintenance tick) can be shrunk to hundreds of milliseconds in the test fixture. These windows are configuration, not contract — the test should pin the behavior, not the production default duration.
4. **Keep an eye on per-binary DB setup.** The 180s binary suggests migrations/setup may be re-running per test; worth checking whether schema setup can be done once per binary (or per shard) with per-test schema/database isolation.
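For item 4, one possible shape is a once-per-process guard around the expensive setup, with isolation coming from a unique schema name per test. This is a sketch with hypothetical names (`run_migrations_once`, `test_schema`) and a string standing in for real migration work; whether the current fixtures behave differently is exactly what the audit would need to confirm.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the expensive setup really ran (demonstration only).
static SETUP_RUNS: AtomicUsize = AtomicUsize::new(0);
/// Monotonic counter that gives every test its own schema name.
static NEXT_SCHEMA_ID: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for "apply migrations to a shared template".
/// `OnceLock` guarantees the closure runs at most once per process.
fn run_migrations_once() -> &'static str {
    static TEMPLATE: OnceLock<String> = OnceLock::new();
    TEMPLATE.get_or_init(|| {
        SETUP_RUNS.fetch_add(1, Ordering::SeqCst);
        // A real fixture would run the migrations here.
        String::from("migrated_template")
    })
}

/// Called by each test: shared setup is reused, the schema name is unique.
pub fn test_schema() -> String {
    let template = run_migrations_once();
    let id = NEXT_SCHEMA_ID.fetch_add(1, Ordering::SeqCst);
    format!("{template}_test_{id}")
}

/// How many times the expensive setup has run in this process.
pub fn setup_runs() -> usize {
    SETUP_RUNS.load(Ordering::SeqCst)
}
```

One interaction with item 2: `cargo-nextest` runs each test in its own process, so an in-process guard like this would fire once per test there. Under nextest the "once" has to live outside the test process, for example in a CI step that prepares the database before the tests start.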

Related: `test_weight_proportionality` flaked on the same run for a different reason (drain-race in the assertion) — fixed on the #334 branch. Worth a quick pass over other finite-job-pool tests for the same assert-after-drain shape while doing item 3.
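The issue does not spell out the race, so the following is only one plausible reading of the assert-after-drain shape, using a hypothetical deterministic scheduler rather than the project's code: with a finite pool, weights are only visible while every queue still has work, and counts taken after the pool has drained reflect pool sizes instead.

```rust
/// Serve two finite queues in weighted cycles: up to `weight_a` jobs from
/// A, then up to `weight_b` jobs from B, until both are empty.
/// Returns the order in which jobs were served.
pub fn serve(mut a: u32, mut b: u32, weight_a: u32, weight_b: u32) -> Vec<char> {
    assert!(weight_a > 0 && weight_b > 0, "weights must be positive");
    let mut order = Vec::new();
    while a > 0 || b > 0 {
        for _ in 0..weight_a {
            if a > 0 {
                a -= 1;
                order.push('A');
            }
        }
        for _ in 0..weight_b {
            if b > 0 {
                b -= 1;
                order.push('B');
            }
        }
    }
    order
}

/// Count how many served jobs came from A and from B.
pub fn counts(order: &[char]) -> (usize, usize) {
    let a = order.iter().filter(|&&q| q == 'A').count();
    (a, order.len() - a)
}
```

With 30 jobs per queue and weights 3:1, the first 20 served jobs split 15 to 5, but the full drained run splits 30 to 30. An assertion that samples at or after the drain point can therefore pass or fail depending on timing, which is the kind of shape worth looking for in other finite-job-pool tests.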

