Skip to content

Raise RLIMIT_NOFILE on startup; make worker recycling opt-in#105

Open
thejchap wants to merge 7 commits into
mainfrom
recycle-workers-on-test-count
Open

Raise RLIMIT_NOFILE on startup; make worker recycling opt-in#105
thejchap wants to merge 7 commits into
mainfrom
recycle-workers-on-test-count

Conversation

@thejchap
Copy link
Copy Markdown
Owner

@thejchap thejchap commented May 9, 2026

Summary

Address the macOS 256-FD per-process soft default at its source — the runner's own rlimit — instead of papering over it with mandatory worker recycling.

  • RLIMIT_NOFILE bump on startup (crates/tryke/src/fdlimit.rs). On Unix, tryke now raises its own soft limit toward the inherited hard limit before any worker is spawned. Children inherit the bumped rlimit, so a single call lifts the FD ceiling for every Python interpreter the runner spawns. Follows the systemd convention (soft 1024 / hard 524288) that Home Assistant OS 16 adopted in mid-2025: ship modest soft defaults, generous hard defaults, and let each application raise its own soft limit at startup based on actual needs.

    • macOS quirk: kern.maxfilesperproc caps per-process FDs below the reported hard limit, so unconditional soft = hard returns EINVAL. On rejection we fall back to a 10 240 target — still dwarfs the 256-FD soft default that caused the failure mode. Windows: no-op (no RLIMIT_NOFILE analogue).
    • Failure is non-fatal: a sandbox that pins the limit logs a warning and the user can still raise it manually via ulimit -n before launching.
  • Worker recycling is now fully opt-in (WorkerLimits::default() returns all-None). The same recycle machinery from c20ca3b is retained — peak RSS, open FDs, wall-clock age, in priority order memory > FDs > age, deferred to end-of-unit so per="scope" teardown runs. But the previous 1 GiB / 200 FDs / 600 s defaults are gone: they made the runner's behaviour depend on workload shape rather than user intent, and surprised suites that legitimately ran long or held many FDs. With the FD ceiling raised at startup the recycling motivation is mostly gone; users who still need it on long-lived suites can wire explicit caps through WorkerLimits.

  • Docs: new "Resource limits" section in docs/guides/configuration.md covering both the FD bump (with the Home Assistant link) and the opt-in recycling knobs.

Options summary

Knob Default Behaviour
RLIMIT_NOFILE (process-wide) raise soft → hard at startup; fallback 10 240 on macOS EINVAL; warning on other errors inherited by every worker subprocess
WorkerLimits::max_rss_bytes None recycle when getrusage(RUSAGE_SELF).ru_maxrss (normalised to bytes) crosses cap. POSIX only.
WorkerLimits::max_open_fds None recycle when /proc/self/fd or /dev/fd count crosses cap. Linux + macOS only.
WorkerLimits::max_age None recycle when worker wall-clock age crosses cap. Cross-platform.

None on any worker-side cap means "never recycle on this signal". When multiple signals trip at once the runner reports the strongest in priority order memory > FDs > age, so debug logs attribute the recycle to the most user-visible failure mode.

Test plan

  • fdlimit::tests::raise_returns_meaningful_outcome_on_unix — getrlimit succeeds, raise returns Raised { from <= to } or Skipped { current > 0 }.
  • fdlimit::tests::raise_is_idempotent — a second call never lowers the soft limit.
  • pool::tests::worker_recycles_when_age_exceeds_limit — passes explicit WorkerLimits with a tight age cap; recycle still fires.
  • pool::tests::recycle_does_not_skip_scope_fixture_teardown — explicit limits, end-of-unit recycle still respects per="scope" teardown.
  • worker::tests::evaluate_recycle_* — priority / no-signal / no-reading / unlimited coverage on the pure decision helper.
  • cargo nextest run --workspace --all-features — 629 passed.
  • cargo clippy --workspace --all-targets --all-features -- -D warnings — clean.
  • uvx prek run -a — clean.

References

Copilot AI review requested due to automatic review settings May 9, 2026 04:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces automatic recycling of long-lived Python worker subprocesses after a fixed number of executed tests to bound file-descriptor/state accumulation in large suites, reusing the existing crash-recovery respawn + hook-replay flow.

Changes:

  • Added per-worker test counting (tests_completed) and a hard cap (MAX_TESTS_PER_WORKER = 128) to decide when a worker should be recycled.
  • Implemented recycle logic in the worker pool to kill and respawn workers once they exceed the cap.
  • Added a Rust integration test that runs more than the cap and asserts the worker PID changes exactly as expected.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
crates/tryke_runner/src/worker.rs Adds MAX_TESTS_PER_WORKER, tracks tests_completed, and exposes a should_recycle() helper.
crates/tryke_runner/src/pool.rs Recycles workers after the threshold and adds a test validating PID turnover at the boundary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tryke_runner/src/pool.rs Outdated
Comment thread crates/tryke_runner/src/pool.rs Outdated
Copilot AI review requested due to automatic review settings May 11, 2026 13:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment thread python/tryke/worker.py Outdated
Comment thread python/tryke/worker.py
Comment thread python/tryke/worker.py
Comment thread crates/tryke_runner/src/pool.rs
@thejchap thejchap changed the title Recycle worker process after 128 tests to bound FD accumulation Recycle worker process on memory/FD/age signals to bound FD accumulation May 11, 2026
thejchap and others added 5 commits May 13, 2026 15:41
Long-lived Python workers accumulate module-level state across imports —
logging handlers, sqlite/ssl objects, atexit callbacks — that holds open
file descriptors which `del sys.modules[name]` cannot reclaim because
the FDs are owned by objects living outside the module dict. On macOS
where the per-process FD soft limit is 256, this exhausts FDs when
running tryke against large suites like homeassistant/core
(~5,267 test files).

WorkerProcess now tracks completed tests; the pool recycles a worker
via the existing crash-recovery path (shutdown + ensure_worker respawn
with cached hook replay) once the count reaches 128.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recycling inside `run_single_test` could fire mid-unit, dropping the
live process before `handle_unit`'s `finalize_hooks` loop ran — and
that loop is gated on `state.process.as_mut()`, so per="scope" fixture
teardown was silently skipped (or run on a fresh worker's freshly
re-imported fixtures, double-counting setup). Move the recycle check to
after `finalize_hooks` in `handle_unit`.

Adjusts the existing recycle test to submit one unit per test (so the
end-of-unit check fires at the threshold) and adds a coverage test
that crosses the threshold inside a single unit with a per="scope"
fixture, asserting setup and teardown each run exactly once.

Co-authored-by: Claude <noreply@anthropic.com>
Replaces the hardcoded `MAX_TESTS_PER_WORKER = 128` cap with
self-reported resource snapshots: every `run_test`/`run_doctest`
response carries a `WorkerHealthWire { rss_bytes, open_fds }` that the
runner consults at unit boundaries alongside a wall-clock age check.
Recycle decisions now name the tripped signal via a `RecycleReason`
enum, so debug logs can attribute drops to memory pressure vs FD
exhaustion vs slow drift.

Soft ceilings live on `WorkerLimits` (1 GiB RSS / 200 FDs / 10 min by
default, all `Option<u64>` so platforms without `/proc/self/fd` or
`resource` simply skip that signal). Tests use the new
`WorkerPool::with_python_path_and_limits` constructor with tiny caps
(or `WorkerLimits::unlimited`) to exercise recycle behaviour
deterministically; production call sites in `tryke`, `tryke_server`,
and `tryke_dev` are unchanged because `with_python_path` defaults to
`WorkerLimits::default()`.

Test coverage: end-to-end age-based recycle test (replaces the
test-count one); existing scope-fixture-teardown test now triggers via
age cap; new unit tests for `evaluate_recycle` priority order and
no-signal/no-cap fallbacks.

https://claude.ai/code/session_01PMbxzSuASTEYbDFEu4SQqy

Co-authored-by: Claude <noreply@anthropic.com>
Three CI test failures (ubuntu/macos/windows on slower Python versions)
traced to the same flake: the integration test pinned the recycle count
to exactly 1 (= 2 distinct worker pids), but on slow runners the first
unit's end-of-unit age check already trips the 100 ms cap, producing an
extra recycle and 3 pids. The "age recycle eventually fires" property
still holds in that scenario, so relax the assertion to >= 2 distinct
pids and document why. The "no recycle when under cap" property is
already covered by evaluate_recycle_returns_none_when_no_signals_tripped.

Also address two of the Copilot review comments on c20ca3b:

- _measure_rss_bytes returns peak RSS (ru_maxrss), not current. Update
  the docstring to call that out and explain why peak is the right
  signal for recycling (transient spikes leave fragmentation/state we
  can't free).
- _measure_open_fds was off by one because iterating /proc/self/fd or
  /dev/fd holds an extra dir FD that gets counted. Subtract it so the
  reading reflects FDs held by the worker, not the measurement.

Regenerated reporter doc samples (v0.0.27 -> v0.0.28 + new scheduler
warning) caught up by the generate-reporter-examples hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recycling thresholds (1 GiB RSS / 200 FDs / 600 s) were tuned to
mitigate the macOS 256-FD per-process soft default. That mitigation
has costs: surprise process churn on suites that legitimately run
long or hold many FDs, plus the lost Python startup/import cache.

Address the underlying FD ceiling directly instead. On startup tryke
now raises its own RLIMIT_NOFILE soft limit toward the inherited hard
limit, following the systemd convention (1024/524288) Home Assistant
OS 16 adopted in mid-2025: ship modest soft defaults, generous hard
defaults, and let applications raise their own at startup. Worker
subprocesses inherit the bumped rlimit, so the FD-exhaustion failure
mode disappears without churning interpreters.

`WorkerLimits` defaults all three caps to `None` (opt-in only). The
recycling machinery itself is unchanged — the same code path is still
exercised by tests, just not on every `tryke test` invocation.

macOS quirk: kern.maxfilesperproc caps per-process FDs below the
reported hard limit; on EINVAL we fall back to a 10240 target, which
still dwarfs the 256-FD default.

Docs: new "Resource limits" section in guides/configuration.md
covering both the FD bump and the opt-in recycling knobs.
Copilot AI review requested due to automatic review settings May 13, 2026 15:50
@thejchap thejchap force-pushed the recycle-workers-on-test-count branch from 84c8101 to dbca0ac Compare May 13, 2026 15:50
@thejchap thejchap changed the title Recycle worker process on memory/FD/age signals to bound FD accumulation Raise RLIMIT_NOFILE on startup; make worker recycling opt-in May 13, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

crates/tryke/Cargo.toml:33

  • [target.'cfg(unix)'.dependencies] currently includes tokio and multiple workspace crates. That makes these dependencies unix-only, which will break building the tryke crate on non-Unix targets (e.g. Windows) because the code still imports/uses tokio and the other crates unconditionally. Keep only libc under the unix target table, and move tokio / tryke_* deps back under the main [dependencies] table (or add a new [dependencies] section after the target table).
[target.'cfg(unix)'.dependencies]
libc = "0.2"
tokio = { workspace = true }
tokio-stream = { workspace = true }
tryke_config = { workspace = true }
tryke_discovery = { workspace = true }
tryke_reporter = { workspace = true }
tryke_runner = { workspace = true }
tryke_server = { workspace = true }
tryke_types = { workspace = true }

Comment thread crates/tryke/src/fdlimit.rs Outdated
Comment on lines +35 to +36
//! inside a sandbox that pinned it) we log at `debug` and proceed —
//! the user can still override via `ulimit -n` before launching.
Comment on lines +178 to +183
**Defaults: all caps disabled.** Out of the box no worker is ever
recycled — the runner pairs the FD-limit bump above with the assumption
that most suites do not need process churn, and recycling has costs of
its own (Python interpreter startup latency, cached fixture loss). If
you have a suite that leaks memory or FDs over hours of runtime, the
hooks are still there:
claude and others added 2 commits May 13, 2026 15:59
The previous commit accidentally inserted
`[target.'cfg(unix)'.dependencies]` mid-stream in the `[dependencies]`
table, sweeping tokio, tokio-stream, and the six tryke_* workspace
crates into the unix-only section. `uv sync` (which builds the
maturin wheel) on Windows then failed with "unresolved dependencies"
because none of those crates were available on the windows target.

Move the unix-gated table to the end so it only contains `libc`,
restoring the cross-platform dependency surface.

Also reword the fdlimit module preamble per Copilot review: this
module returns an `io::Result` rather than logging directly; the
caller (main.rs) is what logs at warn level.
Copilot AI review requested due to automatic review settings May 15, 2026 05:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.

@@ -44,6 +158,7 @@ impl WorkerProcess {
python_path: &[&Path],
root: &Path,
log_level: log::LevelFilter,
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants