Skip to content

feat(watchdog): runner-side self-heal for stale or hung bridge#2

Open
atoomic wants to merge 430 commits into
Koan-Bot:mainfrom
atoomic:restart-fixup-3
Open

feat(watchdog): runner-side self-heal for stale or hung bridge#2
atoomic wants to merge 430 commits into
Koan-Bot:mainfrom
atoomic:restart-fixup-3

Conversation

@atoomic
Copy link
Copy Markdown

@atoomic atoomic commented May 17, 2026

Summary

  • Adds a runner-side watchdog that detects stale sys.modules and hung/dead bridge processes, then escalates recovery through four tiers (request_restart → SIGTERM → SIGKILL + start_awake → circuit-broken alert).
  • Bridge stamps its git HEAD SHA at startup (.koan-bridge-version); runner compares against git rev-parse HEAD each tick so it catches the frozen-imports case the heartbeat cannot.
  • New module koan/app/bridge_watchdog.py, ~330 LOC; 12 behavioral tests; full design doc at docs/bridge-watchdog.md.

Why

cb6e927 ("per-process restart markers") fixed the marker race between two new-code processes, but the very /update that lands the fix is still performed by the old code, leaving the bridge stuck on its pre-update sys.modules. Operator-side observable: every /list after the update fails with `cannot import name 'PROJECT_NAME_CHARS' from 'app.utils'`, and neither subsequent /update nor /restart recovers it — the bridge's poll loop has no path back to consistency without an external kick.

The runner is structurally always fresh code (its wrapper relaunches it on every exit), so it's the right place to own bridge supervision.

Two failure modes covered

Failure Detector
Bridge alive but on stale code (/update import-cache divergence) SHA stamp vs disk HEAD
Bridge hung / dead / no PID Heartbeat mtime + kill(pid, 0)

Tier escalation

Tier 1  request_restart()                        [cooperative]
Tier 2  SIGTERM bridge pid
Tier 3  SIGKILL (after 5 s grace) + start_awake  [last resort]
Tier 4  circuit-broken, alert and stop

State persists in .koan-bridge-heal-state. Per-tier cooldown (45 s) prevents acting before the previous tier has had a chance. Circuit breaker trips after 3 consecutive tier-3 failures.

If the bridge has no PID file at all (crashed long ago, never restarted), tiers 1–2 are skipped and the watchdog goes straight to tier 3 — no process to signal cooperatively.

Notification

Every heal action emits a one-line summary forwarded to:

  1. Telegram via _notify_raw (terse, no Claude reformat — goes through the outbox so a freshly-relaunched bridge flushes it on first poll).
  2. Journal under today's koan.md for after-the-fact audit.

Test plan

  • pytest koan/tests/12788 passed, 0 new failures.
  • ruff check on all touched files — clean. (Pre-existing SIM105 in koan/app/cli_exec.py:187 is on main already and unrelated to this PR.)
  • New unit tests cover: healthy no-op, SHA mismatch tier 1, heartbeat stale tier 1, missing stamp = no-op (defensive), cooldown suppression, tier 2 SIGTERM, tier 3 SIGKILL + start_awake, no-PID cold-start, circuit-breaker alert path.
  • Recommended manual smoke: deploy to one bridge instance, do /update, verify the watchdog log line bridge_watchdog: … and Telegram 🩹 message land within ~5 min.

Tunables (all in koan/app/bridge_watchdog.py)

Constant Default
BRIDGE_HEARTBEAT_STALE_S 90.0
HEAL_TIER_COOLDOWN_S 45.0
SIGTERM_GRACE_S 5.0
POST_HEAL_QUIET_S 60.0
HEAL_CIRCUIT_BREAKER_LIMIT 3
_BRIDGE_WATCHDOG_INTERVAL (run.py) 5 iterations

Design doc

docs/bridge-watchdog.md — failure modes, escalation diagram, state files, operator notes, rollback.

🤖 Generated with Claude Code

Koan-Bot and others added 30 commits April 7, 2026 08:56
Reduces CI time for pull requests by running tests only against Python 3.14.
The full matrix (3.11 + 3.14) still runs on pushes to main and workflow_dispatch,
ensuring compatibility is verified before code lands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tting

Adds a new `startup_reflection` config key (default: false) that controls
whether the periodic self-reflection check runs at startup. Previously it
ran unconditionally on every boot, potentially triggering a Claude CLI call
without warning.

Fixes Anantys-oss#1135

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Check PR branch prefix before allowing /rebase, /ci_check, or
auto-queued rebase from /check. If the head branch doesn't match
this instance's configured branch_prefix, refuse with a clear
"Not my PR" message instead of attempting to modify another
instance's work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
` — and since we now call it, the test should still pass. The mock just needs to be available. Good.

Here's the summary of changes:

- **Fixed early return skipping `mark_checked()` in `check_runner.py`** (blocking review issue): Moved the ownership guard before the action logic as a blanket gate. Foreign PRs now call `mark_checked()` before returning, preventing repeated "not mine" notifications on every polling cycle.
- **Extended ownership guard to cover all actions in `check_runner.py`** (reviewer request): The ownership check now gates both rebase and review queue paths. Previously only the rebase path was guarded, allowing foreign PRs to still get review missions queued.
- **Changed test mocks to patch `is_own_pr` at source** in `test_pr_ownership.py` and `test_rebase_skill.py` (reviewer request): Replaced `patch.object(handler, "is_own_pr", ...)` with `patch("app.github_skill_helpers.is_own_pr", ...)` so mocks remain effective regardless of how the handler imports the function.
Here's a summary of the changes:

- **Switched `is_own_pr` mocks from `patch.object(handler, ...)` to `patch("app.github_skill_helpers.is_own_pr", ...)`** in both `test_pr_ownership.py` (2 occurrences in `TestCiCheckOwnership`) and `test_rebase_skill.py` (3 occurrences in `TestPROwnership`). Per reviewer feedback, patching at the source module (`app.github_skill_helpers`) is more robust than patching the handler module's local binding — if the handler's import style ever changes from `from ... import` to attribute access, the old `patch.object` approach would silently stop intercepting the real function.

The other two review issues (🔴 `mark_checked` missing on early return, 🟡 ownership guard only protecting rebase path) were already resolved in prior commits on this branch.
Check the most recent CI run before pushing the rebased branch.
If CI was failing, fetch the logs via gh CLI, invoke Claude to fix
the issues, and include the fix in the push — avoiding a round-trip
of push → CI fail → re-push.

New function check_existing_ci() in claude_step.py does a single-shot
CI status check (no polling). _fix_existing_ci_failures() in
rebase_pr.py orchestrates the fetch-analyze-fix cycle as step 5 of
the rebase pipeline, between review feedback and push.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nable change is adding the stale-CI caveat to the prompt.

Here's my summary of changes:

- **Added stale-CI caveat to `koan/skills/core/rebase/prompts/ci_fix.md`** per reviewer request: the pre-push CI check inspects results from *before* the rebase, so failures may already be resolved. Added an "Important Context" section warning Claude that the logs are from before the rebase, and added step 2 ("Cross-check against the current diff") and step 7 ("If all failures appear to be already resolved, make no changes") to prevent unnecessary or harmful fixes based on stale CI results.
- **No changes needed for missing mocks** (reviewer concern #1 and #3): verified that all existing `run_rebase` integration tests already include `@patch('app.rebase_pr._fix_existing_ci_failures', return_value=False)`, and the 10 new test classes (`TestCheckExistingCi` with 5 tests, `TestFixExistingCiFailures` with 5 tests) are complete and present in the test file — the truncation was only in the PR diff view.
- Anantys-oss#1138: pass content string to fallback_extract instead of re-reading file
- Anantys-oss#1139: replace TOCTOU exists()+read_text() with try/except FileNotFoundError
- Anantys-oss#1140: log GitHub API failures instead of silently swallowing exceptions
- Anantys-oss#1141: return escalated missions from recover_missions instead of re-reading historical log
- Anantys-oss#1142: log OSError in _reset_failure_count to match _increment_failure_count pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
 on `recover_missions()`** (`recover.py:171`): Changed `-> int` to `-> tuple` to match the new `(count, escalated_list)` return value, per reviewer's Important #1.
- **Eliminated TOCTOU race in `recover_missions()`** (`recover.py:190-192`): Replaced `if not missions_path.exists()` with `try/except FileNotFoundError` pattern, consistent with the same fix applied to `pick_mission.py` in this PR, per reviewer's Important #2.
- **Replaced `print(..., file=sys.stderr)` with `log.warning()` in `_reset_failure_count()`** (`pr_review_learning.py:343-344`): Uses the module's existing logger for consistency with the `log.warning()` calls added in `github_notifications.py` in this same PR, per reviewer's Suggestion #1. Also resolves the quality report's "debug print statement" flag.
- **Added return type annotation to `fallback_extract()`** (`pick_mission.py:21`): Changed bare `-> tuple` to `-> tuple[str | None, str | None]` for clarity on the nullable return values, per reviewer's Suggestion #2.
…leanup

Merged branches previously had their local refs deleted during git sync,
but the corresponding remote refs remained. This adds remote branch deletion
(git push origin --delete) after each successful local deletion, and exposes
a branch_cleanup config section to enable/disable cleanup and remote deletion
independently.

- Add `delete_remote` parameter to `cleanup_merged_branches()` (default True);
  after local deletion, push-deletes the remote ref. Remote deletion failures
  are tolerated silently (branch may already be gone via GitHub auto-delete).
- Add `get_branch_cleanup_config()` to config.py reading `branch_cleanup.enabled`
  and `branch_cleanup.delete_remote_branches` from config.yaml (both default true).
- Update `build_sync_report()` to check config before running cleanup and pass
  the `delete_remote` flag through.
- Document the new `branch_cleanup` section in instance.example/config.yaml.
- Add 8 new tests covering remote deletion, failure tolerance, and config flags.

Closes Anantys-oss#1086

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lure** — Implemented. Changed `run_git()` call to capture return value and log a debug message when it returns empty (failure), matching the existing logging pattern for local deletions.

2. **Add integration test for build_sync_report config path** — Already covered by the existing `TestBranchCleanupConfig` class (tests at lines 854-901) which verifies `cleanup_disabled_by_config_skips_deletion`, `delete_remote_false_passed_to_cleanup`, and `cleanup_enabled_by_default`. No additional test needed.

The "Suggestion" to batch remote deletions was explicitly marked as optional with trade-off notes, so I left the per-branch loop for individual failure tolerance.

---

**Summary for commit message:**

- Added debug logging when remote branch deletion fails in `cleanup_merged_branches()` (`git_sync.py` L345-348), per reviewer request — failures were completely silent, making auth/network issues impossible to diagnose. Now logs at DEBUG level matching the existing pattern for local deletions.
Introduce a ## CI section in missions.md as a single source of truth
for in-flight CI monitoring. Adds _SECTION_MAP entry, updated
DEFAULT_SKELETON, and four new public functions: add_ci_item(),
remove_ci_item(), get_ci_items(), update_ci_item_attempt(). Each CI
entry stores project tag, PR URL, branch, repo, queued timestamp, and
attempt counter (attempt N/M) inline for human visibility.

parse_sections() now returns a "ci" key alongside pending/in_progress/
done/failed. CI items are not mission sources — count_pending() and
extract_next_pending() remain unaffected.

Part of Anantys-oss#1132.

Co-Authored-By: Claude <noreply@anthropic.com>
Replace the hidden .ci-queue.json with a visible ## CI section in
missions.md (issue Anantys-oss#1132). CI items are now human-readable with
attempt counters.

Changes:
- ci_queue_runner: drain_one() reads ## CI via get_ci_items(), updates
  attempt counter via update_ci_item_attempt(), removes entries via
  remove_ci_item(), injects /ci_check mission on failure (under max),
  writes outbox on success/final-failure. _reenqueue_for_monitoring()
  now uses add_ci_item() instead of ci_queue.enqueue().
- ci_queue_runner: one-time migration in _maybe_migrate_json_queue()
  moves any existing .ci-queue.json entries to ## CI and deletes the
  JSON file.
- rebase_pr: _enqueue_ci_check() now calls add_ci_item() instead of
  ci_queue.enqueue(). Reads ci_fix_max_attempts from config (default 5).
- instance.example/config.yaml: document ci_fix_max_attempts (default 5).
- Tests updated to reflect new ## CI section-based drain_one() behavior.

Part of Anantys-oss#1132.

Co-Authored-By: Claude <noreply@anthropic.com>
Done. Summary of changes:

- **Replaced all 4 `__import__` hacks with direct imports** in `drain_one()` (`ci_queue_runner.py`): Added `remove_ci_item` and `update_ci_item_attempt` to the existing `from app.missions import` statement at function scope, then replaced all `__import__("app.missions", fromlist=[...]).func(...)` lambda calls with simple `func(...)` references. Per reviewer's blocking feedback — the `__import__` pattern was fragile and unnecessary.
- **Documented accepted TOCTOU race** (`ci_queue_runner.py`, L86): Added a comment explaining that the unlocked `missions_path.read_text()` before `check_ci_status()` is an accepted race — the slow external call makes locking impractical, and the lambdas passed to `modify_missions_file` re-read content under lock. Per reviewer's important feedback to at minimum document this.
Add a GitHub pre-check guard to contemplative mode's Mission Proposal path
and to the autonomous issue-selection logic in agent.md. Before proposing or
picking a GitHub issue, the agent must verify the issue is unassigned (or
self-assigned) and has no open PR addressing it.

- build_contemplative_prompt() accepts github_nickname param; strips the
  check block from the rendered prompt when no nickname is configured
- build_contemplative_command() resolves nickname from config automatically
- contemplative.md: Option 2 includes assignment + open-PR gh checks
- agent.md: new "GitHub Issue Selection" section under Autonomous Mode Guidance
- Tests: 3 new tests for nickname rendering; existing signature tests updated

Co-Authored-By: Claude <noreply@anthropic.com>
When start_on_pause is enabled and /resume is sent during the startup
sequence, handle_start_on_pause() would (re-)create the pause file after
/resume removed it, effectively ignoring the resume command.

The fix uses a .koan-skip-start-pause signal file: /resume writes a
timestamped file that handle_start_on_pause checks and respects (if
fresh, <5min). This covers both race scenarios:
- /resume before handle_start_on_pause: skip file prevents pause creation
- /resume after handle_start_on_pause: skip file prevents re-creation on
  subsequent restart

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace raw ISO timestamps (⏳(2026-04-07T20:14)) with human-friendly
format: "⏳@ 8:14pm" for today, "⏳Mon @ 9am" for this week, or
"⏳Mon 3/31 @ 9am" for older dates. Only affects rendering, not storage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the Claude CLI exits with code 1 and writes nothing to stderr,
the error message was an unhelpful "Exit code 1: no stderr". The actual
error often appears in stdout (e.g. "context window exceeded"), but this
was discarded at two levels:

1. run_claude() in claude_step.py only checked stderr for the error
   snippet, ignoring stdout entirely.
2. _run_claude_review() in review_runner.py discarded result["output"]
   (stdout) on failure, losing the only diagnostic information.

Fix at the source: when stderr is empty but stdout has content, include
a stdout tail in the error message ("no stderr | stdout: <last 500
chars>"). Also add belt-and-suspenders logging in review_runner to log
stdout before discarding it on failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded max_turns values (15, 20, 25, 30) with
get_skill_max_turns() from config.yaml in review_runner,
plan_runner, pr_review, rebase_pr, and recreate_pr. Previously
each runner used a different hardcoded ceiling, causing failures
on large PRs (e.g. "Reached max turns (15)") that the user could
not override. Now all runners respect the skill_max_turns config
key (default 200), giving a single knob to control turn limits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If the bridge (awake.py) was running before is_own_pr was added to
github_skill_helpers (commit 37baab8), sys.modules caches the old
module without the function.  The rebase and ci_check handlers then
fail at call-time with:

  ❌ Failed to check PR ownership: module 'app.github_skill_helpers'
  has no attribute 'is_own_pr'

Fix: when is_own_pr is absent on the cached module, call
importlib.reload() to refresh it in-place before invoking the
function.  The reload is a no-op in the normal case (function is
present), so there is no performance impact on healthy deployments.

Adds a parametrised regression test that simulates the stale-cache
scenario for both rebase and ci_check handlers.

Fixes Anantys-oss#1162

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…k support

The CI fix runner (ci_queue_runner.py) was missed by b56a62a which made
max_turns configurable across all other runners. This fixes four issues:

- Replace hardcoded max_turns=15 with get_skill_max_turns() (default 200)
- Add get_skill_timeout() instead of relying on default 600s
- Replace unsafe `git fetch origin branch:branch` + `git checkout branch`
  with the safe _fetch_branch + `checkout -B` pattern used in rebase_pr
- Replace hardcoded `origin` remote with proper remote resolution via
  _find_remote_for_repo, supporting fork setups where the PR branch
  may be on upstream or a third-party remote

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…quota check

The usage percentages displayed by `make logs` were misleading — they're
token-count estimates against arbitrary limits (500K/5M) that don't match
real API quotas. This caused confusion when comparing with Claude CLI's
actual usage bars (Anantys-oss#1131).

Changes:
- Prefix all usage percentages with ~ to signal they're estimates
- Add "token estimate — may differ from real API quota" header in logs
- Display today's actual API cost (from cost_tracker JSONL) alongside estimates
- Add disclaimer in /quota skill output with hint to use /quota <N> to correct
- Remove bogus `claude usage` subprocess call from ClaudeProvider
  (`claude usage` is not a real subcommand — it gets treated as a prompt and hangs)
- Update usage.md parser regex to accept optional ~ prefix (backward compatible)

Closes Anantys-oss#1131

Co-Authored-By: Claude <noreply@anthropic.com>
ot optional)." Since the code explicitly handles `ImportError`, it's treated as optional — the deferred import is the right pattern here.

Summary of changes:

- **Narrowed exception handling in `_get_cost_today`** (`iteration_manager.py:141`): Changed `except Exception` to `except (ImportError, OSError, ValueError, KeyError)` per reviewer suggestion, matching the pattern used by `_get_usage_decision` above and avoiding silently swallowing unexpected bugs.
- **Made cost_today check explicit** (`run.py:1420`): Changed `if plan.get("cost_today"):` to `if plan.get("cost_today", 0.0) > 0:` per reviewer suggestion, making the falsy-zero behavior intentional and explicit rather than accidental.
- **Kept deferred import** for `cost_tracker` since the module is treated as optional (ImportError is caught), making the deferred import the correct pattern.
- **Skipped check_quota_available cleanup** — reviewer marked it as "not blocking, worth a follow-up" rather than a change request for this PR.
Add sanitize_github_comment() helper in github.py that replaces bare
@copilot mentions (case-insensitive) with backtick-escaped variants to
prevent triggering the Copilot bot. Apply at all 12 comment-posting call
sites across review_runner, rebase_pr, recreate_pr, squash_pr, pr_review,
pr_quality, plan_runner, claude_step, github_reply, and
github_command_handler. Add prompt guidance to github-reply.md.

Closes Anantys-oss#1165

Co-Authored-By: Claude <noreply@anthropic.com>
Here's a summary of the changes:

- **Expanded bot mention scope** (`koan/app/github.py`): Replaced single `_COPILOT_RE` regex with a `_BOT_USERNAMES` tuple containing `copilot`, `dependabot`, and `github-actions`, and a `_BOT_MENTION_RE` regex built from that list — per reviewer request to cover all three bots (refs issue Anantys-oss#1165 comment)
- **Fixed type hint** (`koan/app/github.py`): Changed `sanitize_github_comment(text: str) -> str` to `Optional[str] -> Optional[str]` to match the actual `None`-passthrough behavior
- **Updated prompt guidance** (`koan/system-prompts/github-reply.md`): Added `@dependabot` to the list of bot accounts that must be backtick-escaped
- **Added tests** (`koan/tests/test_github.py`): 8 new test cases covering `@dependabot` (bare, capitalized, already-escaped, partial-match) and `@github-actions` (bare, capitalized, already-escaped), plus a mixed-bots test
…aliases

The help reply from @bot help now groups commands by category (Code,
Pull Requests, etc.) with section headers, includes skill emoji and
alias shortcuts for a more scannable, useful response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Koan-Bot and others added 29 commits May 16, 2026 22:53
When GitHub's Private Vulnerability Reporting is enabled on a target
repo, /security_audit now submits critical/high findings as private
security advisories instead of public issues. Lower-severity findings
remain public. Configurable per-project via projects.yaml security
section (pvrs mode + threshold). Graceful fallback to public issues
on any PVRS API failure.

Implements: Anantys-oss#1341
Plan: PR Anantys-oss#1269 (docs/plan-pvrs-security-audit.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensure all remote tracking refs for the base branch are fresh before
any mission work begins — both new branches and rebases.

Two complementary changes:
- git_prep: after primary sync, fetch base from all secondary remotes
- claude_step: pre-fetch all relevant remotes before the rebase loop

Addresses review feedback on PR Anantys-oss#1333 about keeping main up to date
regardless of its name before working on branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a Linting section to CLAUDE.md documenting ruff as the project's
linter, and add a `make lint` Makefile target so contributors can
check compliance before pushing. Addresses review feedback on Anantys-oss#1345.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a third messaging provider option (KOAN_MESSAGING_PROVIDER=matrix)
implemented as a thin synchronous wrapper around the Matrix Client-Server
HTTP API. No new Python dependencies — uses the existing requests
package, mirroring the Telegram provider's style.

What:
- app/messaging/matrix.py — MatrixProvider with send/poll/typing using
  PUT /_matrix/client/v3/rooms/{roomId}/send/m.room.message/{txnId}
  and long-polling GET /_matrix/client/v3/sync.
- Auto-registered via _PROVIDER_MODULES in app/messaging/__init__.py.
- Filters: only m.room.message events with msgtype=m.text, ignores
  the bot's own user_id and events from rooms other than the configured one.
- Initial /sync discards historical events; only the next_batch cursor
  is kept, so the bot doesn't replay history on startup.
- 5xx errors are retried via retry_with_backoff; 4xx fail fast.
- Onboarding wizard gains a Matrix option that writes the four env vars.
- docs/messaging-matrix.md walks through setup (account, access token,
  room ID); README, INSTALL.md, and cross-doc links updated.
- 29 unit tests cover config validation, send chunking, URL encoding,
  sync cursor handling, message filtering, and the typing indicator.

Why:
The codebase already had a clean MessagingProvider abstraction with
Telegram and Slack implementations — Matrix slots in without touching
any of the bridge code, giving self-hosted / federated users a third
option without reaching for slack-sdk or a Telegram bot account.

Note: end-to-end encryption (Olm/Megolm) is intentionally out of scope.
Operators should use unencrypted rooms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ppress

Add SIM105 to the enforced ruff rule set and refactor all 64 existing
violations to use `contextlib.suppress(...)`. This catches the
try/except/pass pattern automatically so future PRs cannot reintroduce it.

Tests untouched by intent — only the suppression idiom changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…trying

When a fork PR is opened by a first-time contributor, GitHub Actions
gates every workflow run on a maintainer "Approve and run" click. The
gh API exposes these runs with status="action_required" (or "waiting"
for environment-protected jobs) and no conclusion.

Until this change, aggregate_ci_runs() classified them as "pending" and
drain_one() left the PR in ## CI indefinitely — every iteration polled
again, and retried the /ci_check fix loop on stagnation. See
aio-libs/aiohttp#12553 where Kōan retried multiple times against runs
that physically could not start until a human clicked Approve.

Now aggregate_ci_runs() returns a new "blocked_approval" status when
any run is gated this way (failure still wins so a genuinely broken
workflow on the same push gets surfaced). drain_one() removes the PR
from ## CI and notifies the outbox so the operator can either approve
in the GitHub UI or ping the maintainer. run_ci_check_and_fix() and
rebase_pr._run_ci_check_and_fix() exit early without attempting fixes
when CI is blocked, since pushing more commits triggers the same gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skill runners (rebase_pr, recreate_pr) invoke Claude via run_claude(),
which used subprocess.run with capture_output=True. That buffers all
stdout until the subprocess exits, so the run.py liveness watchdog
(600s no-output kill) fires whenever a single Claude call exceeds
the threshold — observed on the aioesphomeapi #1660 rebase.

A secondary failure mode: subprocess.run timeout sends SIGKILL to the
direct child but not grandchildren. When Claude has spawned helper
subprocesses that inherit stdout, communicate()'s post-kill drain can
block indefinitely, so the inner 120s/300s/600s timeouts never fire.

Switch run_claude to popen_cli with start_new_session=True and stream
stdout line-by-line through sys.stdout. Every emitted line resets the
parent watchdog and surfaces real-time progress to /live. A threading
Timer-backed watchdog kills the entire process group via os.killpg on
timeout, breaking the grandchild-pipe hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three file writes used non-atomic operations (write_text / json.dump),
risking corruption if the process is killed mid-write:

- quarantine cap (_enforce_quarantine_cap): truncated quarantine could
  lose entries on crash during write
- outbox staging (flush): partial staging file causes garbled Telegram
  messages on recovery
- stagnation retry tracker: corrupted JSON resets all retry counters

All three now use atomic_write / atomic_write_json (temp + fsync + rename).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace raw "blocked_approval" string literals across ci_queue_runner,
claude_step, and rebase_pr with a single exported constant to prevent
typos and improve grep-ability.

Closes review comment from PR Anantys-oss#1351.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Five pre-existing isolation bugs surfaced when running the suite under
pytest-xdist. Each failure was a real test pollution issue that also
showed up in stand-alone subset runs — xdist just made it routine.

- conftest: give every xdist worker its own KOAN_ROOT directory.
  `app.utils` snapshots KOAN_ROOT at import time, so workers that
  inherit the same env race on shared files under that path.
- test_github_command_handler / test_github_subscribe: autouse-mock
  `_is_subject_closed`, which otherwise hits the real GitHub API
  (network-flaky in serial, parallel-unsafe under xdist).
- test_mission_retry: autouse-reset the `_last_mission_*` flags on
  `app.run`. Other tests leave them set; xdist surfaced the leak.
- test_skill_dispatch: patch parse_pr_url on the handler module's
  local binding instead of on `app.github_url_parser`. The handler
  does `from app.github_url_parser import parse_pr_url` at module
  load, so patching the source after that import leaks a stale
  binding into later tests that import the same handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The full suite takes ~3m38s serial. Adding pytest-xdist drops it to
~1m58s on 2 workers (1.85x) and ~1m08s with -n auto (3.2x) on this box.

Makefile picks the worker count based on environment:
- CI / GitHub Actions      -> -n auto
- Remote SSH session       -> -n 2 (be polite on shared hosts)
- Local terminal           -> -n auto

Override via `make test PYTEST_WORKERS=N` (PYTEST_WORKERS=0 disables xdist).
All commands use `--dist loadfile` so each test file lands on one worker;
that keeps file-internal collection order intact and avoids surprises from
class-level state.

CI (.github/workflows/tests.yml) installs pytest-xdist and passes
`-n auto --dist loadfile` to every pytest invocation.
…eshold

Adds two behavioral tests to TestCompactLearnings confirming that the
git subprocess (_get_file_tree) is never spawned when compaction is
skipped due to the line-count threshold or hash-unchanged guard.

Closes Anantys-oss#1322

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a /rebase, /review, /recreate, /squash, /ci_check, /fix, /check,
or /gh_request mission is queued for a URL that already has an identical
command pending or in progress, the insert is silently skipped and the
user is notified with a warning message.

This prevents the queue from filling with redundant missions when the
same action is triggered from multiple sources (Telegram, GitHub
notifications, check_runner).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The stale-module detector in `_refresh_stale_app_modules` only reloaded
`app.*` modules whose mtime had changed since a previously cached value.
On the very first observation of a module the function silently recorded
the current mtime without reloading, on the assumption that the first
encounter happens during a fresh boot when in-memory and on-disk content
already agree.

That assumption breaks when auto-update (or a manual `git pull`) rewrites
a file BEFORE the first post-update skill invocation. The new file mtime
is then recorded as the baseline while `sys.modules` still holds the
pre-update module, and no later refresh ever notices the divergence.

Reproduction: a Kōan process booted before `5742248`
(`refactor(missions): consolidate project-tag regexes`) pulled the
commit, then received `/list`. The handler does
`from app.utils import PROJECT_NAME_CHARS` and crashed with:

    Skill error (core.list): cannot import name 'PROJECT_NAME_CHARS'
    from 'app.utils' (.../koan/app/utils.py)

Because the trap is "any new symbol added to any `app.*` module after
the process started", every future addition to `app.utils` (or any
sibling) would re-trigger the same failure on first use after update.

Fix: capture `_PROCESS_START_TIME = time.time()` when `app.skills` is
imported, and treat a first-time observation as stale whenever the
source file's mtime is greater than that timestamp. Clean boots stay
zero-cost (file mtime ≤ process start ⇒ skip the reload, just record
the baseline) while post-update first observations now reload as
intended.

Tests: update the existing "first encounter, no reload" case to stub
`_PROCESS_START_TIME` past the fake file's mtime so it keeps exercising
the old-file branch, and add a new regression test asserting that a
first observation of a file newer than process start triggers
`importlib.reload`. Operators carrying the bad state need a one-time
restart; the fix prevents recurrence for any future `app.*` addition.
`request_restart` previously wrote a single shared `.koan-restart`
marker that both the bridge (`awake.py`) and the runner (`run.py`)
polled. The runner's wrapper-restart cycle is fast: after
`request_restart`, the runner exits with code 42, its wrapper relaunches
it, and the fresh runner's startup wipe (`run.py:785`) deletes the
marker within ~1 s. The bridge polls every `POLL_INTERVAL` (≈3 s) at
`awake.py:778`, so the marker was usually gone before the bridge's next
tick, and the bridge never re-execed.

Observable failure: after `/update`, the runner pulled the new code and
the runner process did restart, but the bridge kept its pre-update
Python interpreter alive. The bridge's `sys.modules['app.utils']`
remained the stale copy, so `/list` raised
`cannot import name 'PROJECT_NAME_CHARS' from 'app.utils'` even after
the upstream stale-module fix (`75bf3e2`) had landed on disk — the
detector itself was in the stale `sys.modules`.

Give each consumer its own marker file (`.koan-restart-bridge`,
`.koan-restart-run`). `request_restart` writes both, plus the legacy
`.koan-restart` for backward compatibility so a pre-upgrade bridge
still polling the old name can pick up the first post-upgrade request
and re-exec into the new code. `check_restart` and `clear_restart`
take an optional `target=` ("bridge" / "run" / None); each consumer
only clears its own marker, so the runner's startup wipe can no longer
silence the bridge's signal. Unknown targets raise `ValueError` to
catch typos at call sites.

Bridge calls in `awake.py` (L754, L755, L778) pass `target="bridge"`.
Runner calls in `run.py` (L734, L785, L854, L856) pass `target="run"`.
The pause-loop check at L734 keeps its existing no-`since` semantics
because the startup clear at L785 already handles stale markers from
a previous incarnation.

Tests:
- `TestPerProcessRestartMarkers` in `test_restart_manager.py` adds 5
  cases including a direct race regression
  (`test_runner_wrapper_restart_does_not_silence_bridge`).
- `test_uses_atomic_write` now asserts all three markers are written.
- `test_run.py` and `test_restart.py` updated to expect the new
  `target=` kwarg and the per-process marker filenames.
`_ensure_providers_loaded` short-circuited the moment `_providers` was
non-empty:

    def _ensure_providers_loaded():
        if _providers:
            return
        for module_name in _PROVIDER_MODULES:
            __import__(module_name)

The intent ("skip the loop on repeat calls") was fine; the sentinel was
wrong.  `_providers` becomes non-empty as a side-effect of *any* code
importing one of the provider modules — including the default startup
path where the bridge imports `app.messaging.telegram` before anything
asks for the full provider list.  Once that happens, the very first
`_ensure_providers_loaded` call hits the early return, the `matrix` /
`slack` imports never run, and `_create_provider("matrix")` SystemExits
with `Unknown messaging provider: 'matrix'` even though the module
ships on disk.

Symptom on CI: `test_matrix_provider.py::TestRegistry::test_matrix_registered`
flapped under pytest-xdist depending on whether a sibling test (e.g.
`test_telegram_provider.py`'s module-level `from app.messaging.telegram
import TelegramProvider`) had already populated `_providers` in the
worker before the matrix test ran.

Fix: track loop-completion explicitly with a `_modules_loaded` flag
rather than inferring it from `bool(_providers)`.  Same short-circuit
on repeat calls, correct semantics on the first one.

Tests:
- `TestEnsureProvidersLoaded.test_loads_matrix_when_telegram_imported_first`
  reproduces the production scenario in a fresh subprocess.  Verified
  the test fails without the fix:
      AssertionError: missing providers after load: {'matrix', 'slack'}
- `test_idempotent_when_called_repeatedly` (also subprocess) guards
  against future regressions where repeat calls might drift.
- `test_matrix_provider.py::TestRegistry::test_matrix_registered`
  switched to the same subprocess pattern so it no longer flaps
  based on cross-file test ordering: `_providers` is process-wide
  module state and the `clean_registry` fixture in
  `test_messaging_provider.py` can leave the in-process registry out
  of sync with `sys.modules`, which no in-process call to
  `_ensure_providers_loaded` can recover from (cached modules don't
  re-run their `@register_provider` decorators).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace ❌ with 🚦 in the CI-failure outbox notification to avoid
confusion with system-level errors. The traffic light emoji signals
"CI gate not passing" without implying a crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Telegram bridge (`awake.py`) is a long-running process with no
wrapper, so a frozen `sys.modules` after `/update` leaves it serving
stale imports until someone with shell access kicks it.  The recent
`cb6e927` ("per-process restart markers") fix the race that drops
restart signals once both processes are on the new code, but the
transitional `/update` from old to new still wedged the bridge — and
no in-Telegram recovery path worked, because every subsequent
`/restart` hit the same import-cache state and the bridge's polling
loop has no way to bootstrap its `awake.py` globals back to the
freshly-reloaded `app.restart_manager`.

Add a runner-side watchdog that detects two distinct failure modes
and escalates recovery through four tiers.  The runner is *always*
fresh code (its wrapper relaunches on every exit), so it is the
right place to own this responsibility.

Detection:
- **Stale modules**: bridge stamps its git HEAD SHA at startup
  (`.koan-bridge-version`); runner compares against
  `git rev-parse HEAD` on each tick.  Heartbeat alone cannot catch
  this — process is alive and responsive, just running old code.
- **Hung / dead bridge**: heartbeat older than
  `BRIDGE_HEARTBEAT_STALE_S` (90 s) or PID gone / pointing at a
  dead process.

Escalation (one tier per call, persisted in `.koan-bridge-heal-state`):
- Tier 1: `request_restart()` — runner is on fresh code so its
  triple-marker write is correct and not subject to the old race
- Tier 2: SIGTERM the bridge PID
- Tier 3: SIGKILL (after `SIGTERM_GRACE_S`) + `pid_manager.start_awake`
- Tier 4: circuit-broken, alert and stop.  Counter trips after
  `HEAL_CIRCUIT_BREAKER_LIMIT` (3) consecutive tier-3 failures.

If the bridge has no PID file at all (crashed long ago, never
restarted), tiers 1–2 are skipped and the watchdog goes straight to
tier 3 — there is no process to signal cooperatively.

The runner throttles probes to one per `_BRIDGE_WATCHDOG_INTERVAL`
(5) main-loop iterations.  Each iteration is ~60–300 s, so detection
latency is ~5–25 min in the worst case, which is fine for "don't let
a stuck bridge sit forever."  Per-tier cooldown
(`HEAL_TIER_COOLDOWN_S` = 45 s) prevents acting again before the
previous tier has had a chance to take effect.

When the watchdog acts, the message is forwarded both to Telegram
via `_notify_raw` (terse, no Claude reformat — goes through the
outbox so a freshly-relaunched bridge flushes it on first poll) and
to today's journal entry under project "koan" for after-the-fact
audit.

Files:
- `koan/app/bridge_watchdog.py` (new, ~330 LOC): detection, state
  machine, tier execution, bridge-side version stamp writer.
- `koan/app/signals.py`: two new constants
  (`BRIDGE_VERSION_FILE`, `BRIDGE_HEAL_STATE_FILE`).
- `koan/app/awake.py`: call `write_bridge_version_stamp` once at
  startup, right after the existing heartbeat write.
- `koan/app/run.py`: throttled `_maybe_run_bridge_watchdog` helper
  called once per iteration after the existing `check_restart`
  block.  Forwards heal messages to Telegram + journal.
- `koan/tests/test_bridge_watchdog.py` (12 tests): healthy no-op,
  SHA mismatch tier 1, heartbeat stale tier 1, missing stamp =
  no-op (defensive), cooldown suppression, tier 2 SIGTERM after
  tier 1, tier 3 SIGKILL + start_awake, no-PID cold-start
  skipping straight to tier 3, circuit-breaker alert path.
- `docs/bridge-watchdog.md`: design rationale, failure-mode table,
  escalation diagram, tunables, operator notes, rollback.

Verified: `pytest koan/tests/` = 12788 passed, ruff clean on all
touched files.  (Pre-existing SIM105 in `koan/app/cli_exec.py:187`
is unrelated and present on `main`.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants