feat(watchdog): runner-side self-heal for stale or hung bridge by atoomic · Pull Request #2 · Koan-Bot/koan

atoomic · 2026-05-17T07:44:30Z

Summary

Adds a runner-side watchdog that detects stale sys.modules and hung/dead bridge processes, then escalates recovery through four tiers (request_restart → SIGTERM → SIGKILL + start_awake → circuit-broken alert).
Bridge stamps its git HEAD SHA at startup (.koan-bridge-version); runner compares against git rev-parse HEAD each tick so it catches the frozen-imports case the heartbeat cannot.
New module koan/app/bridge_watchdog.py, ~330 LOC; 12 behavioral tests; full design doc at docs/bridge-watchdog.md.

Why

cb6e927 ("per-process restart markers") fixed the marker race between two new-code processes, but the very /update that lands the fix is still performed by the old code, leaving the bridge stuck on its pre-update sys.modules. Operator-side observable: every /list after the update fails with `cannot import name 'PROJECT_NAME_CHARS' from 'app.utils'`, and neither subsequent /update nor /restart recovers it — the bridge's poll loop has no path back to consistency without an external kick.

The runner is structurally always fresh code (its wrapper relaunches it on every exit), so it's the right place to own bridge supervision.

Two failure modes covered

Failure	Detector
Bridge alive but on stale code (`/update` import-cache divergence)	SHA stamp vs disk HEAD
Bridge hung / dead / no PID	Heartbeat mtime + `kill(pid, 0)`

Tier escalation

Tier 1  request_restart()                        [cooperative]
Tier 2  SIGTERM bridge pid
Tier 3  SIGKILL (after 5 s grace) + start_awake  [last resort]
Tier 4  circuit-broken, alert and stop

State persists in .koan-bridge-heal-state. Per-tier cooldown (45 s) prevents acting before the previous tier has had a chance. Circuit breaker trips after 3 consecutive tier-3 failures.

If the bridge has no PID file at all (crashed long ago, never restarted), tiers 1–2 are skipped and the watchdog goes straight to tier 3 — no process to signal cooperatively.

Notification

Every heal action emits a one-line summary forwarded to:

Telegram via _notify_raw (terse, no Claude reformat — goes through the outbox so a freshly-relaunched bridge flushes it on first poll).
Journal under today's koan.md for after-the-fact audit.

Test plan

pytest koan/tests/ — 12788 passed, 0 new failures.
ruff check on all touched files — clean. (Pre-existing SIM105 in koan/app/cli_exec.py:187 is on main already and unrelated to this PR.)
New unit tests cover: healthy no-op, SHA mismatch tier 1, heartbeat stale tier 1, missing stamp = no-op (defensive), cooldown suppression, tier 2 SIGTERM, tier 3 SIGKILL + start_awake, no-PID cold-start, circuit-breaker alert path.
Recommended manual smoke: deploy to one bridge instance, do /update, verify the watchdog log line bridge_watchdog: … and Telegram 🩹 message land within ~5 min.

Tunables (all in `koan/app/bridge_watchdog.py`)

Constant	Default
`BRIDGE_HEARTBEAT_STALE_S`	90.0
`HEAL_TIER_COOLDOWN_S`	45.0
`SIGTERM_GRACE_S`	5.0
`POST_HEAL_QUIET_S`	60.0
`HEAL_CIRCUIT_BREAKER_LIMIT`	3
`_BRIDGE_WATCHDOG_INTERVAL` (run.py)	5 iterations

Design doc

docs/bridge-watchdog.md — failure modes, escalation diagram, state files, operator notes, rollback.

🤖 Generated with Claude Code

Reduces CI time for pull requests by running tests only against Python 3.14. The full matrix (3.11 + 3.14) still runs on pushes to main and workflow_dispatch, ensuring compatibility is verified before code lands. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tting Adds a new `startup_reflection` config key (default: false) that controls whether the periodic self-reflection check runs at startup. Previously it ran unconditionally on every boot, potentially triggering a Claude CLI call without warning. Fixes Anantys-oss#1135 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Check PR branch prefix before allowing /rebase, /ci_check, or auto-queued rebase from /check. If the head branch doesn't match this instance's configured branch_prefix, refuse with a clear "Not my PR" message instead of attempting to modify another instance's work. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

` — and since we now call it, the test should still pass. The mock just needs to be available. Good. Here's the summary of changes: - **Fixed early return skipping `mark_checked()` in `check_runner.py`** (blocking review issue): Moved the ownership guard before the action logic as a blanket gate. Foreign PRs now call `mark_checked()` before returning, preventing repeated "not mine" notifications on every polling cycle. - **Extended ownership guard to cover all actions in `check_runner.py`** (reviewer request): The ownership check now gates both rebase and review queue paths. Previously only the rebase path was guarded, allowing foreign PRs to still get review missions queued. - **Changed test mocks to patch `is_own_pr` at source** in `test_pr_ownership.py` and `test_rebase_skill.py` (reviewer request): Replaced `patch.object(handler, "is_own_pr", ...)` with `patch("app.github_skill_helpers.is_own_pr", ...)` so mocks remain effective regardless of how the handler imports the function.

Here's a summary of the changes: - **Switched `is_own_pr` mocks from `patch.object(handler, ...)` to `patch("app.github_skill_helpers.is_own_pr", ...)`** in both `test_pr_ownership.py` (2 occurrences in `TestCiCheckOwnership`) and `test_rebase_skill.py` (3 occurrences in `TestPROwnership`). Per reviewer feedback, patching at the source module (`app.github_skill_helpers`) is more robust than patching the handler module's local binding — if the handler's import style ever changes from `from ... import` to attribute access, the old `patch.object` approach would silently stop intercepting the real function. The other two review issues (🔴 `mark_checked` missing on early return, 🟡 ownership guard only protecting rebase path) were already resolved in prior commits on this branch.

Check the most recent CI run before pushing the rebased branch. If CI was failing, fetch the logs via gh CLI, invoke Claude to fix the issues, and include the fix in the push — avoiding a round-trip of push → CI fail → re-push. New function check_existing_ci() in claude_step.py does a single-shot CI status check (no polling). _fix_existing_ci_failures() in rebase_pr.py orchestrates the fetch-analyze-fix cycle as step 5 of the rebase pipeline, between review feedback and push. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

nable change is adding the stale-CI caveat to the prompt. Here's my summary of changes: - **Added stale-CI caveat to `koan/skills/core/rebase/prompts/ci_fix.md`** per reviewer request: the pre-push CI check inspects results from *before* the rebase, so failures may already be resolved. Added an "Important Context" section warning Claude that the logs are from before the rebase, and added step 2 ("Cross-check against the current diff") and step 7 ("If all failures appear to be already resolved, make no changes") to prevent unnecessary or harmful fixes based on stale CI results. - **No changes needed for missing mocks** (reviewer concern #1 and #3): verified that all existing `run_rebase` integration tests already include `@patch('app.rebase_pr._fix_existing_ci_failures', return_value=False)`, and the 10 new test classes (`TestCheckExistingCi` with 5 tests, `TestFixExistingCiFailures` with 5 tests) are complete and present in the test file — the truncation was only in the PR diff view.

- Anantys-oss#1138: pass content string to fallback_extract instead of re-reading file - Anantys-oss#1139: replace TOCTOU exists()+read_text() with try/except FileNotFoundError - Anantys-oss#1140: log GitHub API failures instead of silently swallowing exceptions - Anantys-oss#1141: return escalated missions from recover_missions instead of re-reading historical log - Anantys-oss#1142: log OSError in _reset_failure_count to match _increment_failure_count pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

on `recover_missions()`** (`recover.py:171`): Changed `-> int` to `-> tuple` to match the new `(count, escalated_list)` return value, per reviewer's Important #1. - **Eliminated TOCTOU race in `recover_missions()`** (`recover.py:190-192`): Replaced `if not missions_path.exists()` with `try/except FileNotFoundError` pattern, consistent with the same fix applied to `pick_mission.py` in this PR, per reviewer's Important #2. - **Replaced `print(..., file=sys.stderr)` with `log.warning()` in `_reset_failure_count()`** (`pr_review_learning.py:343-344`): Uses the module's existing logger for consistency with the `log.warning()` calls added in `github_notifications.py` in this same PR, per reviewer's Suggestion #1. Also resolves the quality report's "debug print statement" flag. - **Added return type annotation to `fallback_extract()`** (`pick_mission.py:21`): Changed bare `-> tuple` to `-> tuple[str | None, str | None]` for clarity on the nullable return values, per reviewer's Suggestion #2.

…leanup Merged branches previously had their local refs deleted during git sync, but the corresponding remote refs remained. This adds remote branch deletion (git push origin --delete) after each successful local deletion, and exposes a branch_cleanup config section to enable/disable cleanup and remote deletion independently. - Add `delete_remote` parameter to `cleanup_merged_branches()` (default True); after local deletion, push-deletes the remote ref. Remote deletion failures are tolerated silently (branch may already be gone via GitHub auto-delete). - Add `get_branch_cleanup_config()` to config.py reading `branch_cleanup.enabled` and `branch_cleanup.delete_remote_branches` from config.yaml (both default true). - Update `build_sync_report()` to check config before running cleanup and pass the `delete_remote` flag through. - Document the new `branch_cleanup` section in instance.example/config.yaml. - Add 8 new tests covering remote deletion, failure tolerance, and config flags. Closes Anantys-oss#1086 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lure** — Implemented. Changed `run_git()` call to capture return value and log a debug message when it returns empty (failure), matching the existing logging pattern for local deletions. 2. **Add integration test for build_sync_report config path** — Already covered by the existing `TestBranchCleanupConfig` class (tests at lines 854-901) which verifies `cleanup_disabled_by_config_skips_deletion`, `delete_remote_false_passed_to_cleanup`, and `cleanup_enabled_by_default`. No additional test needed. The "Suggestion" to batch remote deletions was explicitly marked as optional with trade-off notes, so I left the per-branch loop for individual failure tolerance. --- **Summary for commit message:** - Added debug logging when remote branch deletion fails in `cleanup_merged_branches()` (`git_sync.py` L345-348), per reviewer request — failures were completely silent, making auth/network issues impossible to diagnose. Now logs at DEBUG level matching the existing pattern for local deletions.

Introduce a ## CI section in missions.md as a single source of truth for in-flight CI monitoring. Adds _SECTION_MAP entry, updated DEFAULT_SKELETON, and four new public functions: add_ci_item(), remove_ci_item(), get_ci_items(), update_ci_item_attempt(). Each CI entry stores project tag, PR URL, branch, repo, queued timestamp, and attempt counter (attempt N/M) inline for human visibility. parse_sections() now returns a "ci" key alongside pending/in_progress/ done/failed. CI items are not mission sources — count_pending() and extract_next_pending() remain unaffected. Part of Anantys-oss#1132. Co-Authored-By: Claude <noreply@anthropic.com>

Replace the hidden .ci-queue.json with a visible ## CI section in missions.md (issue Anantys-oss#1132). CI items are now human-readable with attempt counters. Changes: - ci_queue_runner: drain_one() reads ## CI via get_ci_items(), updates attempt counter via update_ci_item_attempt(), removes entries via remove_ci_item(), injects /ci_check mission on failure (under max), writes outbox on success/final-failure. _reenqueue_for_monitoring() now uses add_ci_item() instead of ci_queue.enqueue(). - ci_queue_runner: one-time migration in _maybe_migrate_json_queue() moves any existing .ci-queue.json entries to ## CI and deletes the JSON file. - rebase_pr: _enqueue_ci_check() now calls add_ci_item() instead of ci_queue.enqueue(). Reads ci_fix_max_attempts from config (default 5). - instance.example/config.yaml: document ci_fix_max_attempts (default 5). - Tests updated to reflect new ## CI section-based drain_one() behavior. Part of Anantys-oss#1132. Co-Authored-By: Claude <noreply@anthropic.com>

Done. Summary of changes: - **Replaced all 4 `__import__` hacks with direct imports** in `drain_one()` (`ci_queue_runner.py`): Added `remove_ci_item` and `update_ci_item_attempt` to the existing `from app.missions import` statement at function scope, then replaced all `__import__("app.missions", fromlist=[...]).func(...)` lambda calls with simple `func(...)` references. Per reviewer's blocking feedback — the `__import__` pattern was fragile and unnecessary. - **Documented accepted TOCTOU race** (`ci_queue_runner.py`, L86): Added a comment explaining that the unlocked `missions_path.read_text()` before `check_ci_status()` is an accepted race — the slow external call makes locking impractical, and the lambdas passed to `modify_missions_file` re-read content under lock. Per reviewer's important feedback to at minimum document this.

Add a GitHub pre-check guard to contemplative mode's Mission Proposal path and to the autonomous issue-selection logic in agent.md. Before proposing or picking a GitHub issue, the agent must verify the issue is unassigned (or self-assigned) and has no open PR addressing it. - build_contemplative_prompt() accepts github_nickname param; strips the check block from the rendered prompt when no nickname is configured - build_contemplative_command() resolves nickname from config automatically - contemplative.md: Option 2 includes assignment + open-PR gh checks - agent.md: new "GitHub Issue Selection" section under Autonomous Mode Guidance - Tests: 3 new tests for nickname rendering; existing signature tests updated Co-Authored-By: Claude <noreply@anthropic.com>

When start_on_pause is enabled and /resume is sent during the startup sequence, handle_start_on_pause() would (re-)create the pause file after /resume removed it, effectively ignoring the resume command. The fix uses a .koan-skip-start-pause signal file: /resume writes a timestamped file that handle_start_on_pause checks and respects (if fresh, <5min). This covers both race scenarios: - /resume before handle_start_on_pause: skip file prevents pause creation - /resume after handle_start_on_pause: skip file prevents re-creation on subsequent restart Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace raw ISO timestamps (⏳(2026-04-07T20:14)) with human-friendly format: "⏳@ 8:14pm" for today, "⏳Mon @ 9am" for this week, or "⏳Mon 3/31 @ 9am" for older dates. Only affects rendering, not storage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the Claude CLI exits with code 1 and writes nothing to stderr, the error message was an unhelpful "Exit code 1: no stderr". The actual error often appears in stdout (e.g. "context window exceeded"), but this was discarded at two levels: 1. run_claude() in claude_step.py only checked stderr for the error snippet, ignoring stdout entirely. 2. _run_claude_review() in review_runner.py discarded result["output"] (stdout) on failure, losing the only diagnostic information. Fix at the source: when stderr is empty but stdout has content, include a stdout tail in the error message ("no stderr | stdout: <last 500 chars>"). Also add belt-and-suspenders logging in review_runner to log stdout before discarding it on failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace hardcoded max_turns values (15, 20, 25, 30) with get_skill_max_turns() from config.yaml in review_runner, plan_runner, pr_review, rebase_pr, and recreate_pr. Previously each runner used a different hardcoded ceiling, causing failures on large PRs (e.g. "Reached max turns (15)") that the user could not override. Now all runners respect the skill_max_turns config key (default 200), giving a single knob to control turn limits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

If the bridge (awake.py) was running before is_own_pr was added to github_skill_helpers (commit 37baab8), sys.modules caches the old module without the function. The rebase and ci_check handlers then fail at call-time with: ❌ Failed to check PR ownership: module 'app.github_skill_helpers' has no attribute 'is_own_pr' Fix: when is_own_pr is absent on the cached module, call importlib.reload() to refresh it in-place before invoking the function. The reload is a no-op in the normal case (function is present), so there is no performance impact on healthy deployments. Adds a parametrised regression test that simulates the stale-cache scenario for both rebase and ci_check handlers. Fixes Anantys-oss#1162 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…k support The CI fix runner (ci_queue_runner.py) was missed by b56a62a which made max_turns configurable across all other runners. This fixes four issues: - Replace hardcoded max_turns=15 with get_skill_max_turns() (default 200) - Add get_skill_timeout() instead of relying on default 600s - Replace unsafe `git fetch origin branch:branch` + `git checkout branch` with the safe _fetch_branch + `checkout -B` pattern used in rebase_pr - Replace hardcoded `origin` remote with proper remote resolution via _find_remote_for_repo, supporting fork setups where the PR branch may be on upstream or a third-party remote Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…quota check The usage percentages displayed by `make logs` were misleading — they're token-count estimates against arbitrary limits (500K/5M) that don't match real API quotas. This caused confusion when comparing with Claude CLI's actual usage bars (Anantys-oss#1131). Changes: - Prefix all usage percentages with ~ to signal they're estimates - Add "token estimate — may differ from real API quota" header in logs - Display today's actual API cost (from cost_tracker JSONL) alongside estimates - Add disclaimer in /quota skill output with hint to use /quota <N> to correct - Remove bogus `claude usage` subprocess call from ClaudeProvider (`claude usage` is not a real subcommand — it gets treated as a prompt and hangs) - Update usage.md parser regex to accept optional ~ prefix (backward compatible) Closes Anantys-oss#1131 Co-Authored-By: Claude <noreply@anthropic.com>

ot optional)." Since the code explicitly handles `ImportError`, it's treated as optional — the deferred import is the right pattern here. Summary of changes: - **Narrowed exception handling in `_get_cost_today`** (`iteration_manager.py:141`): Changed `except Exception` to `except (ImportError, OSError, ValueError, KeyError)` per reviewer suggestion, matching the pattern used by `_get_usage_decision` above and avoiding silently swallowing unexpected bugs. - **Made cost_today check explicit** (`run.py:1420`): Changed `if plan.get("cost_today"):` to `if plan.get("cost_today", 0.0) > 0:` per reviewer suggestion, making the falsy-zero behavior intentional and explicit rather than accidental. - **Kept deferred import** for `cost_tracker` since the module is treated as optional (ImportError is caught), making the deferred import the correct pattern. - **Skipped check_quota_available cleanup** — reviewer marked it as "not blocking, worth a follow-up" rather than a change request for this PR.

Add sanitize_github_comment() helper in github.py that replaces bare @copilot mentions (case-insensitive) with backtick-escaped variants to prevent triggering the Copilot bot. Apply at all 12 comment-posting call sites across review_runner, rebase_pr, recreate_pr, squash_pr, pr_review, pr_quality, plan_runner, claude_step, github_reply, and github_command_handler. Add prompt guidance to github-reply.md. Closes Anantys-oss#1165 Co-Authored-By: Claude <noreply@anthropic.com>

Here's a summary of the changes: - **Expanded bot mention scope** (`koan/app/github.py`): Replaced single `_COPILOT_RE` regex with a `_BOT_USERNAMES` tuple containing `copilot`, `dependabot`, and `github-actions`, and a `_BOT_MENTION_RE` regex built from that list — per reviewer request to cover all three bots (refs issue Anantys-oss#1165 comment) - **Fixed type hint** (`koan/app/github.py`): Changed `sanitize_github_comment(text: str) -> str` to `Optional[str] -> Optional[str]` to match the actual `None`-passthrough behavior - **Updated prompt guidance** (`koan/system-prompts/github-reply.md`): Added `@dependabot` to the list of bot accounts that must be backtick-escaped - **Added tests** (`koan/tests/test_github.py`): 8 new test cases covering `@dependabot` (bare, capitalized, already-escaped, partial-match) and `@github-actions` (bare, capitalized, already-escaped), plus a mixed-bots test

@bot

…aliases The help reply from @bot help now groups commands by category (Code, Pull Requests, etc.) with section headers, includes skill emoji and alias shortcuts for a more scannable, useful response. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When GitHub's Private Vulnerability Reporting is enabled on a target repo, /security_audit now submits critical/high findings as private security advisories instead of public issues. Lower-severity findings remain public. Configurable per-project via projects.yaml security section (pvrs mode + threshold). Graceful fallback to public issues on any PVRS API failure. Implements: Anantys-oss#1341 Plan: PR Anantys-oss#1269 (docs/plan-pvrs-security-audit.md) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ails publicly

Ensure all remote tracking refs for the base branch are fresh before any mission work begins — both new branches and rebases. Two complementary changes: - git_prep: after primary sync, fetch base from all secondary remotes - claude_step: pre-fetch all relevant remotes before the rebase loop Addresses review feedback on PR Anantys-oss#1333 about keeping main up to date regardless of its name before working on branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a Linting section to CLAUDE.md documenting ruff as the project's linter, and add a `make lint` Makefile target so contributors can check compliance before pushing. Addresses review feedback on Anantys-oss#1345. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds a third messaging provider option (KOAN_MESSAGING_PROVIDER=matrix) implemented as a thin synchronous wrapper around the Matrix Client-Server HTTP API. No new Python dependencies — uses the existing requests package, mirroring the Telegram provider's style. What: - app/messaging/matrix.py — MatrixProvider with send/poll/typing using PUT /_matrix/client/v3/rooms/{roomId}/send/m.room.message/{txnId} and long-polling GET /_matrix/client/v3/sync. - Auto-registered via _PROVIDER_MODULES in app/messaging/__init__.py. - Filters: only m.room.message events with msgtype=m.text, ignores the bot's own user_id and events from rooms other than the configured one. - Initial /sync discards historical events; only the next_batch cursor is kept, so the bot doesn't replay history on startup. - 5xx errors are retried via retry_with_backoff; 4xx fail fast. - Onboarding wizard gains a Matrix option that writes the four env vars. - docs/messaging-matrix.md walks through setup (account, access token, room ID); README, INSTALL.md, and cross-doc links updated. - 29 unit tests cover config validation, send chunking, URL encoding, sync cursor handling, message filtering, and the typing indicator. Why: The codebase already had a clean MessagingProvider abstraction with Telegram and Slack implementations — Matrix slots in without touching any of the bridge code, giving self-hosted / federated users a third option without reaching for slack-sdk or a Telegram bot account. Note: end-to-end encryption (Olm/Megolm) is intentionally out of scope. Operators should use unencrypted rooms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…legacy

…ppress Add SIM105 to the enforced ruff rule set and refactor all 64 existing violations to use `contextlib.suppress(...)`. This catches the try/except/pass pattern automatically so future PRs cannot reintroduce it. Tests untouched by intent — only the suppression idiom changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…trying When a fork PR is opened by a first-time contributor, GitHub Actions gates every workflow run on a maintainer "Approve and run" click. The gh API exposes these runs with status="action_required" (or "waiting" for environment-protected jobs) and no conclusion. Until this change, aggregate_ci_runs() classified them as "pending" and drain_one() left the PR in ## CI indefinitely — every iteration polled again, and retried the /ci_check fix loop on stagnation. See aio-libs/aiohttp#12553 where Kōan retried multiple times against runs that physically could not start until a human clicked Approve. Now aggregate_ci_runs() returns a new "blocked_approval" status when any run is gated this way (failure still wins so a genuinely broken workflow on the same push gets surfaced). drain_one() removes the PR from ## CI and notifies the outbox so the operator can either approve in the GitHub UI or ping the maintainer. run_ci_check_and_fix() and rebase_pr._run_ci_check_and_fix() exit early without attempting fixes when CI is blocked, since pushing more commits triggers the same gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skill runners (rebase_pr, recreate_pr) invoke Claude via run_claude(), which used subprocess.run with capture_output=True. That buffers all stdout until the subprocess exits, so the run.py liveness watchdog (600s no-output kill) fires whenever a single Claude call exceeds the threshold — observed on the aioesphomeapi #1660 rebase. A secondary failure mode: subprocess.run timeout sends SIGKILL to the direct child but not grandchildren. When Claude has spawned helper subprocesses that inherit stdout, communicate()'s post-kill drain can block indefinitely, so the inner 120s/300s/600s timeouts never fire. Switch run_claude to popen_cli with start_new_session=True and stream stdout line-by-line through sys.stdout. Every emitted line resets the parent watchdog and surfaces real-time progress to /live. A threading Timer-backed watchdog kills the entire process group via os.killpg on timeout, breaking the grandchild-pipe hang. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…og race

…eanup

Three file writes used non-atomic operations (write_text / json.dump), risking corruption if the process is killed mid-write: - quarantine cap (_enforce_quarantine_cap): truncated quarantine could lose entries on crash during write - outbox staging (flush): partial staging file causes garbled Telegram messages on recovery - stagnation retry tracker: corrupted JSON resets all retry counters All three now use atomic_write / atomic_write_json (temp + fsync + rename). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace raw "blocked_approval" string literals across ci_queue_runner, claude_step, and rebase_pr with a single exported constant to prevent typos and improve grep-ability. Closes review comment from PR Anantys-oss#1351. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Five pre-existing isolation bugs surfaced when running the suite under pytest-xdist. Each failure was a real test pollution issue that also showed up in stand-alone subset runs — xdist just made it routine. - conftest: give every xdist worker its own KOAN_ROOT directory. `app.utils` snapshots KOAN_ROOT at import time, so workers that inherit the same env race on shared files under that path. - test_github_command_handler / test_github_subscribe: autouse-mock `_is_subject_closed`, which otherwise hits the real GitHub API (network-flaky in serial, parallel-unsafe under xdist). - test_mission_retry: autouse-reset the `_last_mission_*` flags on `app.run`. Other tests leave them set; xdist surfaced the leak. - test_skill_dispatch: patch parse_pr_url on the handler module's local binding instead of on `app.github_url_parser`. The handler does `from app.github_url_parser import parse_pr_url` at module load, so patching the source after that import leaks a stale binding into later tests that import the same handler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The full suite takes ~3m38s serial. Adding pytest-xdist drops it to ~1m58s on 2 workers (1.85x) and ~1m08s with -n auto (3.2x) on this box. Makefile picks the worker count based on environment: - CI / GitHub Actions -> -n auto - Remote SSH session -> -n 2 (be polite on shared hosts) - Local terminal -> -n auto Override via `make test PYTEST_WORKERS=N` (PYTEST_WORKERS=0 disables xdist). All commands use `--dist loadfile` so each test file lands on one worker; that keeps file-internal collection order intact and avoids surprises from class-level state. CI (.github/workflows/tests.yml) installs pytest-xdist and passes `-n auto --dist loadfile` to every pytest invocation.

…eshold Adds two behavioral tests to TestCompactLearnings confirming that the git subprocess (_get_file_tree) is never spawned when compaction is skipped due to the line-count threshold or hash-unchanged guard. Closes Anantys-oss#1322 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When a /rebase, /review, /recreate, /squash, /ci_check, /fix, /check, or /gh_request mission is queued for a URL that already has an identical command pending or in progress, the insert is silently skipped and the user is notified with a warning message. This prevents the queue from filling with redundant missions when the same action is triggered from multiple sources (Telegram, GitHub notifications, check_runner). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…e duplicate handling boilerplate

The stale-module detector in `_refresh_stale_app_modules` only reloaded `app.*` modules whose mtime had changed since a previously cached value. On the very first observation of a module the function silently recorded the current mtime without reloading, on the assumption that the first encounter happens during a fresh boot when in-memory and on-disk content already agree. That assumption breaks when auto-update (or a manual `git pull`) rewrites a file BEFORE the first post-update skill invocation. The new file mtime is then recorded as the baseline while `sys.modules` still holds the pre-update module, and no later refresh ever notices the divergence. Reproduction: a Kōan process booted before `5742248` (`refactor(missions): consolidate project-tag regexes`) pulled the commit, then received `/list`. The handler does `from app.utils import PROJECT_NAME_CHARS` and crashed with: Skill error (core.list): cannot import name 'PROJECT_NAME_CHARS' from 'app.utils' (.../koan/app/utils.py) Because the trap is "any new symbol added to any `app.*` module after the process started", every future addition to `app.utils` (or any sibling) would re-trigger the same failure on first use after update. Fix: capture `_PROCESS_START_TIME = time.time()` when `app.skills` is imported, and treat a first-time observation as stale whenever the source file's mtime is greater than that timestamp. Clean boots stay zero-cost (file mtime ≤ process start ⇒ skip the reload, just record the baseline) while post-update first observations now reload as intended. Tests: update the existing "first encounter, no reload" case to stub `_PROCESS_START_TIME` past the fake file's mtime so it keeps exercising the old-file branch, and add a new regression test asserting that a first observation of a file newer than process start triggers `importlib.reload`. Operators carrying the bad state need a one-time restart; the fix prevents recurrence for any future `app.*` addition.

`request_restart` previously wrote a single shared `.koan-restart` marker that both the bridge (`awake.py`) and the runner (`run.py`) polled. The runner's wrapper-restart cycle is fast: after `request_restart`, the runner exits with code 42, its wrapper relaunches it, and the fresh runner's startup wipe (`run.py:785`) deletes the marker within ~1 s. The bridge polls every `POLL_INTERVAL` (≈3 s) at `awake.py:778`, so the marker was usually gone before the bridge's next tick, and the bridge never re-execed. Observable failure: after `/update`, the runner pulled the new code and the runner process did restart, but the bridge kept its pre-update Python interpreter alive. The bridge's `sys.modules['app.utils']` remained the stale copy, so `/list` raised `cannot import name 'PROJECT_NAME_CHARS' from 'app.utils'` even after the upstream stale-module fix (`75bf3e2`) had landed on disk — the detector itself was in the stale `sys.modules`. Give each consumer its own marker file (`.koan-restart-bridge`, `.koan-restart-run`). `request_restart` writes both, plus the legacy `.koan-restart` for backward compatibility so a pre-upgrade bridge still polling the old name can pick up the first post-upgrade request and re-exec into the new code. `check_restart` and `clear_restart` take an optional `target=` ("bridge" / "run" / None); each consumer only clears its own marker, so the runner's startup wipe can no longer silence the bridge's signal. Unknown targets raise `ValueError` to catch typos at call sites. Bridge calls in `awake.py` (L754, L755, L778) pass `target="bridge"`. Runner calls in `run.py` (L734, L785, L854, L856) pass `target="run"`. The pause-loop check at L734 keeps its existing no-`since` semantics because the startup clear at L785 already handles stale markers from a previous incarnation. Tests: - `TestPerProcessRestartMarkers` in `test_restart_manager.py` adds 5 cases including a direct race regression (`test_runner_wrapper_restart_does_not_silence_bridge`). - `test_uses_atomic_write` now asserts all three markers are written. - `test_run.py` and `test_restart.py` updated to expect the new `target=` kwarg and the per-process marker filenames.

`_ensure_providers_loaded` short-circuited the moment `_providers` was non-empty: def _ensure_providers_loaded(): if _providers: return for module_name in _PROVIDER_MODULES: __import__(module_name) The intent ("skip the loop on repeat calls") was fine; the sentinel was wrong. `_providers` becomes non-empty as a side-effect of *any* code importing one of the provider modules — including the default startup path where the bridge imports `app.messaging.telegram` before anything asks for the full provider list. Once that happens, the very first `_ensure_providers_loaded` call hits the early return, the `matrix` / `slack` imports never run, and `_create_provider("matrix")` SystemExits with `Unknown messaging provider: 'matrix'` even though the module ships on disk. Symptom on CI: `test_matrix_provider.py::TestRegistry::test_matrix_registered` flapped under pytest-xdist depending on whether a sibling test (e.g. `test_telegram_provider.py`'s module-level `from app.messaging.telegram import TelegramProvider`) had already populated `_providers` in the worker before the matrix test ran. Fix: track loop-completion explicitly with a `_modules_loaded` flag rather than inferring it from `bool(_providers)`. Same short-circuit on repeat calls, correct semantics on the first one. Tests: - `TestEnsureProvidersLoaded.test_loads_matrix_when_telegram_imported_first` reproduces the production scenario in a fresh subprocess. Verified the test fails without the fix: AssertionError: missing providers after load: {'matrix', 'slack'} - `test_idempotent_when_called_repeatedly` (also subprocess) guards against future regressions where repeat calls might drift. - `test_matrix_provider.py::TestRegistry::test_matrix_registered` switched to the same subprocess pattern so it no longer flaps based on cross-file test ordering: `_providers` is process-wide module state and the `clean_registry` fixture in `test_messaging_provider.py` can leave the in-process registry out of sync with `sys.modules`, which no in-process call to `_ensure_providers_loaded` can recover from (cached modules don't re-run their `@register_provider` decorators). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace ❌ with 🚦 in the CI-failure outbox notification to avoid confusion with system-level errors. The traffic light emoji signals "CI gate not passing" without implying a crash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The Telegram bridge (`awake.py`) is a long-running process with no wrapper, so a frozen `sys.modules` after `/update` leaves it serving stale imports until someone with shell access kicks it. The recent `cb6e927` ("per-process restart markers") fix the race that drops restart signals once both processes are on the new code, but the transitional `/update` from old to new still wedged the bridge — and no in-Telegram recovery path worked, because every subsequent `/restart` hit the same import-cache state and the bridge's polling loop has no way to bootstrap its `awake.py` globals back to the freshly-reloaded `app.restart_manager`. Add a runner-side watchdog that detects two distinct failure modes and escalates recovery through four tiers. The runner is *always* fresh code (its wrapper relaunches on every exit), so it is the right place to own this responsibility. Detection: - **Stale modules**: bridge stamps its git HEAD SHA at startup (`.koan-bridge-version`); runner compares against `git rev-parse HEAD` on each tick. Heartbeat alone cannot catch this — process is alive and responsive, just running old code. - **Hung / dead bridge**: heartbeat older than `BRIDGE_HEARTBEAT_STALE_S` (90 s) or PID gone / pointing at a dead process. Escalation (one tier per call, persisted in `.koan-bridge-heal-state`): - Tier 1: `request_restart()` — runner is on fresh code so its triple-marker write is correct and not subject to the old race - Tier 2: SIGTERM the bridge PID - Tier 3: SIGKILL (after `SIGTERM_GRACE_S`) + `pid_manager.start_awake` - Tier 4: circuit-broken, alert and stop. Counter trips after `HEAL_CIRCUIT_BREAKER_LIMIT` (3) consecutive tier-3 failures. If the bridge has no PID file at all (crashed long ago, never restarted), tiers 1–2 are skipped and the watchdog goes straight to tier 3 — there is no process to signal cooperatively. The runner throttles probes to one per `_BRIDGE_WATCHDOG_INTERVAL` (5) main-loop iterations. Each iteration is ~60–300 s, so detection latency is ~5–25 min in the worst case, which is fine for "don't let a stuck bridge sit forever." Per-tier cooldown (`HEAL_TIER_COOLDOWN_S` = 45 s) prevents acting again before the previous tier has had a chance to take effect. When the watchdog acts, the message is forwarded both to Telegram via `_notify_raw` (terse, no Claude reformat — goes through the outbox so a freshly-relaunched bridge flushes it on first poll) and to today's journal entry under project "koan" for after-the-fact audit. Files: - `koan/app/bridge_watchdog.py` (new, ~330 LOC): detection, state machine, tier execution, bridge-side version stamp writer. - `koan/app/signals.py`: two new constants (`BRIDGE_VERSION_FILE`, `BRIDGE_HEAL_STATE_FILE`). - `koan/app/awake.py`: call `write_bridge_version_stamp` once at startup, right after the existing heartbeat write. - `koan/app/run.py`: throttled `_maybe_run_bridge_watchdog` helper called once per iteration after the existing `check_restart` block. Forwards heal messages to Telegram + journal. - `koan/tests/test_bridge_watchdog.py` (12 tests): healthy no-op, SHA mismatch tier 1, heartbeat stale tier 1, missing stamp = no-op (defensive), cooldown suppression, tier 2 SIGTERM after tier 1, tier 3 SIGKILL + start_awake, no-PID cold-start skipping straight to tier 3, circuit-breaker alert path. - `docs/bridge-watchdog.md`: design rationale, failure-mode table, escalation diagram, tunables, operator notes, rollback. Verified: `pytest koan/tests/` = 12788 passed, ruff clean on all touched files. (Pre-existing SIM105 in `koan/app/cli_exec.py:187` is unrelated and present on `main`.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lure counter

Koan-Bot and others added 30 commits April 7, 2026 08:56

fix: resolve CI failures on Anantys-oss#1121 (attempt 1)

fc89410

fix: resolve CI failures on Anantys-oss#1121 (attempt 1)

5e3e555

fix: resolve CI failures on Anantys-oss#1121 (attempt 1)

30af609

fix: resolve pre-existing CI failures on Anantys-oss#1166

fc8a7a2

Koan-Bot and others added 29 commits May 16, 2026 22:53

fix: redact PVRS fallback issues to prevent leaking vulnerability det…

04476fc

…ails publicly

fix: address review — remove double-fetch, fix fragile rebase tests

3b39ba7

fix: resolve CI failures on Anantys-oss#1334 (attempt 1)

97fcac3

docs(matrix): advertise instance/config.yaml as primary, env vars as …

2e3fac7

…legacy

refactor(cli_exec): extract stream_with_timeout helper and fix watchd…

d5ff6d8

…og race

refactor(cli_exec): use contextlib.suppress in stream_with_timeout cl…

b21a5ab

…eanup

test: address review feedback on pytest-xdist isolation fixtures

a87abf7

test: mock resolve_pr_location in conftest to fix xdist gh flakes

dbdca5e

fix(missions): handle duplicate return value in all skill handlers

6f6a4d4

refactor(missions): extract queue_github_mission_once helper to reduc…

50e9a89

…e duplicate handling boilerplate

fix(watchdog): apply review feedback — docs path, zombie comment, fai…

9f68a2b

…lure counter

Koan-Bot force-pushed the restart-fixup-3 branch from fe61f4c to 9f68a2b Compare May 17, 2026 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(watchdog): runner-side self-heal for stale or hung bridge#2

feat(watchdog): runner-side self-heal for stale or hung bridge#2
atoomic wants to merge 430 commits into
Koan-Bot:mainfrom
atoomic:restart-fixup-3

atoomic commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

atoomic commented May 17, 2026

Summary

Why

Two failure modes covered

Tier escalation

Notification

Test plan

Tunables (all in koan/app/bridge_watchdog.py)

Design doc

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Tunables (all in `koan/app/bridge_watchdog.py`)