Skip to content

fix(continuum): singleton-guard auto-save, drop Start-Job#13

Open
CosminRadu wants to merge 1 commit into
psmux:mainfrom
CosminRadu:fix/continuum-auto-save-singleton
Open

fix(continuum): singleton-guard auto-save, drop Start-Job#13
CosminRadu wants to merge 1 commit into
psmux:mainfrom
CosminRadu:fix/continuum-auto-save-singleton

Conversation

@CosminRadu
Copy link
Copy Markdown

@CosminRadu CosminRadu commented May 2, 2026

Summary

The client-attached hook in plugin.conf and the Start-Job spawn in psmux-continuum.ps1 each launched a fresh persistent pwsh loop on every re-attach / plugin load with no de-duplication. Over days this accumulated long-running ~100 MB pwsh processes whose memory continued to grow. The misleading plugin.conf comment even acknowledged the duplication ("multiple loops may start — this is harmless since saves are idempotent") — but each loop is a persistent background process consuming real memory.

After running locally for ~8 days my session had 5 zombie auto_save.ps1 loops totaling ~580 MB; the oldest had been alive for over a week and grown from ~100 MB to 124 MB.

What changed

Singleton enforcement

auto_save.ps1 now claims a PID file at %LOCALAPPDATA%\psmux-continuum\auto_save.pid:

  • On launch, if a live pwsh / powershell PID owns the file, exit 0 immediately. The client-attached hook is now safe to re-fire on every attach.
  • Stale PID detection: if the recorded PID isn't a live pwsh, the new launch claims the slot.
  • The loop releases the slot via try { … } finally { … } and self-exits gracefully if superseded (per-iteration check that the PID file still points at us).
  • Periodic [GC]::Collect() every 4 iterations bounds long-run memory growth.

Single-process spawn

Replaced Start-Job with Start-Process -WindowStyle Hidden. Start-Job spawned an outer scriptblock-host pwsh that itself spawned the inner pwsh -File auto_save.ps1, so every plugin load created two pwsh processes that never reaped (you could see this in the zombie list — the outer was the parent of the inner). Start-Process gives a single detached pwsh; the singleton guard inside the loop makes repeat launches safe.

Diagnosable logging

Start-Process -WindowStyle Hidden has no console attached, so the loop's Write-Host output was previously discarded. Added a Log helper that timestamps each line with [INF] / [WRN] / [ERR] and appends to %LOCALAPPDATA%\psmux-continuum\auto_save.log, with 256 KB rotation to a .old sibling file. Per-write Add-Content keeps file locks brief, so concurrent ephemeral pwshs (the singleton-rejected ones) don't corrupt each other's writes. The & pwsh -File save.ps1 invocation now redirects all six PowerShell streams (*>> \$logFile), so save.ps1 errors land in the log too.

Sample log output:

[2026-05-02 14:30:01] [INF] [PID 1234] claimed singleton slot; interval=15m
[2026-05-02 14:45:01] [INF] [PID 1234] auto-saved.
[2026-05-02 15:00:01] [WRN] [PID 1234] psmux server not running, stopping auto-save.
[2026-05-02 15:00:01] [INF] [PID 1234] exiting; releasing singleton slot.

Conventions

  • All three regenerated helper scripts (`auto_save.ps1`, `auto_restore.ps1`, `boot.ps1`) now carry a `# DO NOT EDIT — regenerated from psmux-continuum.ps1 on plugin load` header so contributors edit the heredoc, not the artifact.
  • The misleading `plugin.conf` comment is corrected to describe the new singleton behavior.
  • Heredoc and standalone artifact are byte-identical (regeneration is a no-op against this checkout).

Test plan

  • Both `psmux-continuum.ps1` and `scripts/auto_save.ps1` parse (verified via `PSParser::Tokenize`).
  • All three heredoc bodies in `psmux-continuum.ps1` match their `scripts/*.ps1` artifact byte-for-byte (verified via AST extraction + string compare).
  • Start a psmux session, reattach a client 3+ times, then check process list — expect exactly one `auto_save.ps1` pwsh.
  • Wait ~16 minutes; verify `%LOCALAPPDATA%\psmux-continuum\auto_save.log` contains an `[INF] auto-saved.` entry.
  • Stop the psmux server; expect the loop to log `[WRN] psmux server not running, stopping auto-save.` and exit, releasing the PID file.
  • Confirm no `auto_save.ps1` zombies remain across logout/login cycles.

🤖 Generated with Claude Code

…h Start-Process

The client-attached hook in plugin.conf and the Start-Job spawn in
psmux-continuum.ps1 each launched a fresh persistent pwsh loop on every
re-attach / plugin load, with no de-duplication. Over days this
accumulated long-running ~100 MB pwsh processes whose memory continued
to grow. plugin.conf even acknowledged this ("multiple loops may
start - this is harmless since saves are idempotent"), which was wrong:
each loop is a persistent background process consuming real memory.

The auto-save loop now enforces a per-user singleton via a PID file at
%LOCALAPPDATA%\psmux-continuum\auto_save.pid:

* Duplicate launches detect the live owner and exit immediately, so
  the client-attached hook is safe to re-fire on every attach.
* The loop releases the slot via try/finally and self-exits gracefully
  if superseded (PID file no longer points at it).
* Periodic [GC]::Collect() bounds memory growth in the long-running
  loop.

The launcher in psmux-continuum.ps1 now uses Start-Process -WindowStyle
Hidden instead of Start-Job. Start-Job spawned an outer scriptblock-host
pwsh that itself spawned the inner pwsh -File auto_save.ps1, so every
plugin load created two pwsh processes that never reaped. Start-Process
gives a single detached pwsh, and the singleton guard inside the loop
makes repeat launches safe.

The three regenerated helper scripts (auto_save.ps1, auto_restore.ps1,
boot.ps1) now carry a "DO NOT EDIT - regenerated from
psmux-continuum.ps1 on plugin load" header so future contributors know
to edit the heredoc, not the artifact. The plugin.conf comment is
updated to describe the new singleton behavior.
MattKotsenas added a commit to MattKotsenas/psmux-plugins that referenced this pull request May 14, 2026
Track upstream PR psmux#13 (CosminRadu): singleton-guard
auto-save loop, replace Start-Job with Start-Process. Branch points
directly at CosminRadu's commit ca99188.
@tarikguney tarikguney requested a review from Copilot May 15, 2026 23:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a memory-leak / process-accumulation bug in psmux-continuum where the client-attached hook and the Start-Job–based spawn in psmux-continuum.ps1 each created a fresh persistent pwsh loop on every re-attach or plugin reload, with no de-duplication. The fix adds a PID-file singleton guard inside auto_save.ps1, replaces Start-Job with a single detached Start-Process, and adds a file-based logger since the hidden process has no console.

Changes:

  • Add per-user singleton guard (PID file at %LOCALAPPDATA%\psmux-continuum\auto_save.pid) with stale-PID detection, periodic supersede check, and graceful slot release in finally.
  • Replace Start-Job with Start-Process -WindowStyle Hidden so only one pwsh is spawned per launch; capture all PowerShell streams from the save invocation into a rotated log file (256 KB → .old).
  • Refresh the misleading plugin.conf comment, add a "DO NOT EDIT — regenerated" header to all three artifact scripts, and bound long-run memory with periodic [GC]::Collect().

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
psmux-continuum/psmux-continuum.ps1 Updates the auto-save heredoc with singleton/logging logic, replaces Start-Job with Start-Process, and adds regeneration headers to all three heredocs.
psmux-continuum/scripts/auto_save.ps1 Regenerated artifact mirroring the new singleton-guarded loop with logging and try/finally slot release.
psmux-continuum/scripts/auto_restore.ps1 Adds the "DO NOT EDIT — regenerated" header to match the new convention.
psmux-continuum/scripts/boot.ps1 Adds the "DO NOT EDIT — regenerated" header to match the new convention.
psmux-continuum/plugin.conf Rewrites the stale "multiple loops may start" comment to describe the new singleton behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +80 to +95
# If a live owner already holds the PID file, defer.
$existingPid = $null
try {
$existingPid = (Get-Content $pidFile -Raw -ErrorAction Stop).Trim()
}
catch {}
if ($existingPid -match '^\d+$') {
$existing = Get-Process -Id ([int]$existingPid) -ErrorAction SilentlyContinue
if ($existing -and ($existing.ProcessName -in @('pwsh','powershell'))) {
Log "auto-save already running (PID $existingPid), exiting."
exit 0
}
}

# Claim the slot
"$PID" | Set-Content -Path $pidFile -Encoding UTF8 -Force
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CosminRadu Can you please review this comment from CoPilot?

@tarikguney
Copy link
Copy Markdown
Collaborator

PR Review Follow-up

Thank you for this thorough fix, @CosminRadu. The transition to a singleton pattern and the move away from Start-Job significantly improves the resource management of the plugin.

Regarding the Copilot comment, it identifies a potential Time-of-Check to Time-of-Use (TOCTOU) race condition during the PID file claim. While the current implementation is a major step forward, please consider the suggested refinement—either performing a secondary check at the top of the loop or utilizing an exclusive file-creation method—to ensure the singleton guarantee is strictly enforced.

tarikguney pushed a commit that referenced this pull request May 15, 2026
…erver start

The `@continuum-restore 'on'` option is documented in the README as
"auto-restore on psmux server start" but was silently ignored.

Root cause
==========

`psmux-continuum` ships two entry points with disjoint feature sets:

* `psmux-continuum.ps1` (lines 163-168) reads `@continuum-restore` and
  fires the restore. But PPM's `Initialize-Plugin` (`ppm.ps1` lines
  391-398) deliberately skips `.ps1` entry points when `plugin.conf`
  exists, because psmux sources `plugin.conf` natively via `@plugin`.

* `plugin.conf` IS sourced, but the auto-restore hook was commented out
  with a "Uncomment to enable" note - meaning the option was never read
  by any active code path.

Empirical verification on a real install:

    $ psmux show-options -g | findstr @continuum
    @continuum-restore "on"

    $ psmux show-hooks -g | findstr -i "session"
    client-attached -> run-shell "...auto_save.ps1 ..."
    # no restore hook registered

After-effect: when restore fails to fire on server start, the auto-save
loop (which IS registered) overwrites the `last` pointer with the empty
post-startup state ~15 min later, pruning the recovery target. Reboots
silently lose the saved session.

Second bug found while fixing this
==================================

The commented-out line in plugin.conf used `after-new-session`. This
hook does NOT fire on the first session creation in a fresh psmux
server - it only fires on subsequent sessions. So uncommenting that
line as-is would still have left auto-restore broken after a reboot.

Verified by hook-probing each candidate against a fresh psmux server:

    after-new-session  => no
    session-created    => FIRED
    client-attached    => FIRED

We use `session-created` because it fires for the very first session
in the server's lifetime, which is the moment auto-restore needs to
trigger.

Fix
===

* `plugin.conf`: register the hook unconditionally on `session-created`.
* `auto_restore.ps1`: read `@continuum-restore` at exec time and exit
  early if not `on`. This keeps `plugin.conf` simple (no nested
  `if-shell` quoting) and evaluates the option at hook-fire time rather
  than at config-source time, which is more robust against option-order
  dependencies.
* Add `@continuum-restore-fired` one-shot marker. Since `restore.ps1`
  itself calls `psmux new-session` for each saved session, the hook
  would otherwise re-enter for every restored session.
* Keep the `psmux-continuum.ps1` heredoc byte-identical to the
  standalone `scripts/auto_restore.ps1` to preserve the existing
  regeneration invariant.

Validation
==========

* `session-created` hook fires on first session - live-probed against
  a fresh psmux server.
* `after-new-session` does NOT fire on first session - live-probed.
* Patched `plugin.conf` correctly registers the hook - verified via
  `psmux show-hooks -g`.
* Negative path: when `@continuum-restore` is unset or `off`,
  `auto_restore.ps1` exits early without setting the fired marker or
  invoking `restore.ps1`. Live-tested.
* Both .ps1 files parse cleanly via `[System.Management.Automation.
  Language.Parser]::ParseFile`.
* Heredoc body and `scripts/auto_restore.ps1` are byte-identical
  (AST-extracted and string-compared).

Full positive E2E (save -> kill-server -> fresh server -> verify
restore) cannot be cleanly run on an isolated `-L pmtest` socket
because `psmux run-shell` doesn't pass socket info to child processes,
so child `psmux` invocations default to the user's real socket and
contaminate the test. The fix is correct for default-socket production
usage; the testability limit is a psmux internal, not a fix bug.

Note: PR #13 (fix(continuum): singleton-guard auto-save, drop
Start-Job) touches the same files. This PR's diff is orthogonal to
that work and trivially rebases either way.
@MattKotsenas
Copy link
Copy Markdown
Contributor

Hi! I hit this race on my machine today. Adding the log evidence in case it's useful for deciding whether to ship the secondary-check / exclusive-create refinement now vs. as a follow-up.

From %LOCALAPPDATA%\psmux-continuum\auto_save.log:

[2026-05-17 03:45:02] [INF] [PID 17692] auto-saved.
[2026-05-17 03:59:28] [INF] [PID 20696] claimed singleton slot; interval=15m
[2026-05-17 03:59:28] [INF] [PID 20552] claimed singleton slot; interval=15m
[2026-05-17 04:03:47] [INF] [PID 25068] claimed singleton slot; interval=15m
[2026-05-17 04:14:29] [INF] [PID 20552] superseded, exiting.
[2026-05-17 04:14:29] [INF] [PID 20552] exiting; releasing singleton slot.
[2026-05-17 04:14:29] [INF] [PID 20696] exiting; releasing singleton slot.
psmux-resurrect: Saved to C:\Users\matt\.psmux\resurrect\psmux_resurrect_20260517_041851.json

Both PIDs (20696, 20552) claimed the slot at the same second (03:59:28) and ran concurrently as two auto-save loops for ~15 minutes until the supersede check at the next cycle (04:14:29) caught it. A third PID (25068) also claimed the slot at 04:03:47 between those two, suggesting the window is wide enough to hit repeatedly under realistic pwsh startup load (this happened during a normal "open new shell, attach to existing session" workflow, nothing pathological).

The save at 04:18:51 succeeded, but during the 15-minute overlap two auto-save loops were running, doubling the work and racing on file output.

Happy to test a follow-up if helpful; otherwise just leaving this here as the deciding evidence.

@tarikguney
Copy link
Copy Markdown
Collaborator

tarikguney commented May 19, 2026

Thanks for the thorough analysis here, @MattKotsenas and @CosminRadu — really appreciate both the fix and the follow-up validation.

Matt’s log evidence appears to confirm the TOCTOU concern from Copilot: multiple processes can still claim the singleton in the same startup window and run concurrently until the next interval check.

@CosminRadu, would you be open to tightening this before merge (for example, an atomic/exclusive PID-file claim or equivalent) so the singleton guarantee is strict?

Given the reproduced race in normal workflow, I’d prefer we land that hardening in this PR if possible.

MattKotsenas added a commit to MattKotsenas/psmux-plugins that referenced this pull request May 19, 2026
…nt doubling

Field-observed bug after a machine reboot:
  save: main with 4 windows [psmux activity, hunk, profile install, psmux bugs]
  live: main with 7 windows [psmux activity, hunk, profile install, hunk,
                              psmux bugs, profile install, psmux bugs]

Two concurrent restore.ps1 invocations interleaved their new-window calls,
producing one shared first window (the reused empty-default) plus six
appended windows (each restore created 3 of the remaining 3 saved windows).

Reproduced exactly with two Start-Job-spawned restore.ps1 against an
isolated test server: 4-window save -> 7 windows in main.

Root cause: psmux-continuum's session-created hook fires for every new-session
during restoration, and its option-based TOCTOU singleton
(@continuum-restore-fired) can let two auto_restore.ps1 invocations through
under tight timing. Even without that, future user hooks / manual bindings
could re-enter restore.ps1 concurrently.

Fix: process-level PID-file singleton on restore.ps1 itself. Matches the
auto_save.ps1 pattern (PR psmux#13). Uses System.IO.File::Open with FileMode.
CreateNew for atomic claim. Stale slots (recorded PID is dead) are reclaimed;
live slots cause the second invocation to exit cleanly without modifying
state. Slot is released in the existing finally block via a script-block
that only deletes the file when we're still the owner.

PSMUX_RESTORE_PID_FILE_OVERRIDE env var lets the Pester suite point at a
tempdir-local PID file so tests don't collide with a real plugin install.

Adds one new Pester test (concurrent-invocation context) that spawns two
restore.ps1 jobs and asserts exactly one set of 4 windows results. Verified
RED -> GREEN: pre-fix produces 7 windows, post-fix produces 4. Full suite
now 5/5 stable.

Defense-in-depth note: auto_restore.ps1's option-based TOCTOU singleton is
not fixed here. Its race becomes a benign inefficiency (extra pwsh spawns
that exit immediately on the PID-file check) rather than a correctness
issue. Worth fixing separately for cleanup; not load-bearing.
MattKotsenas added a commit to MattKotsenas/psmux-plugins that referenced this pull request May 19, 2026
… singleton

Two refinements to the singleton added in 2a584f5:

1. Stale-slot reclaim is now atomic via Remove-Item + retry CreateNew (up
   to 3 attempts), not Set-Content. The Set-Content path had a TOCTOU
   parallel to the one PR psmux#13 fixes in auto_save.ps1: two processes both
   finding a stale slot could both successfully Set-Content and both think
   they own it. CreateNew-only-after-delete makes the reclaim race-safe.

2. PID-reuse guard. The original 'is the recorded PID alive?' check used
   bare Get-Process, which can false-positive when Windows recycles a PID
   to a different process (any process — Explorer, sshd, whatever). Now we
   verify the live owner is a pwsh/powershell process. Narrow but real
   PID-reuse window was reproduced in test runs after frequent pwsh churn.

Plus a new RED -> GREEN test for the stale-PID-file concurrent reclaim case
(planted stale marker, two concurrent restore.ps1 invocations, only one
wins, no doubled windows).

Test infrastructure: RunRestore now sets PSMUX_RESTORE_PID_FILE_OVERRIDE
to a tempdir path so each test gets a fresh slot and no cross-test
interference via the default LOCALAPPDATA path.

Full restore-pane-level suite (6 tests) verified stable across 3 runs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants