Skip to content

fix(claude_step): stream stdout + kill process group on run_claude timeout#1

Closed
bluetoothbot wants to merge 62 commits into
mainfrom
koan/run-claude-pgroup-kill
Closed

fix(claude_step): stream stdout + kill process group on run_claude timeout#1
bluetoothbot wants to merge 62 commits into
mainfrom
koan/run-claude-pgroup-kill

Conversation

@bluetoothbot
Copy link
Copy Markdown
Owner

What

Make claude_step.run_claude stream child stdout in real time and kill the entire process group on timeout, instead of silently capturing output via subprocess.run(capture_output=True).

Why

Investigated the aioesphomeapi #1660 rebase stagnation. The mission log showed:

[rebase] Running already-solved check (Claude)
[20:22:56] [error] No output for 600s — killing stuck process (elapsed: 614s)
[20:22:56] [error] Skill runner timed out (liveness: 600s)

Two run_claude functions exist:

  • run.py::run_claude_task (mission loop) — already uses popen_cli + start_new_session=True with line streaming and process-group kill.
  • claude_step.py::run_claude (called by rebase, recreate, review_runner, squash, contemplative, etc.) — used subprocess.run via run_cli_with_retry, which captures the entire stdout silently.

When a rebase skill subprocess invoked _check_if_already_solved or _apply_review_feedback, the inner Claude call could legitimately take many minutes, and during that time the skill's own stdout produced no lines. run.py's first_output_timeout watchdog (default 600s) then killed the whole skill chain, surfacing as a stagnation timeout with no diagnostic.

How

  • Drop run_cli_with_retry for claude_step.run_claude and replace with popen_cli + start_new_session=True.
  • Stream each stdout line through to the parent's stdout (resetting the outer liveness watchdog).
  • Watchdog threading.Timer calls os.killpg(SIGKILL) on the new session so MCP servers / hooks / any grandchildren also die.
  • Preserve the return contract ({success, output, error}) — existing callers (rebase, recreate, review, claudemd_refresh, contemplative, squash, post_mission_reflection) are untouched.
  • Retry layer is dropped here on purpose: it only made sense for short silent probes; with streaming, the caller observes failures live.

Testing

  • KOAN_ROOT=/tmp/test-koan .venv/bin/pytest koan/tests/12712 passed.
  • make lint → clean.
  • New tests in test_claude_step.py::TestRunClaudeStreamsOutput cover: line-by-line streaming with arrival-time assertions, process-group kill (grandchild marker file must not appear after timeout), timeout return shape, and preserved result['output'] contract.
  • Updated existing TestRunClaude (in both test_claude_step.py and test_pr_review.py) to mock popen_cli — caught a pre-existing leak where test_pr_review.py::TestRunClaude had been patching the wrong subprocess path and actually invoking real Claude during tests.

🤖 Generated with Claude Code

bluetoothbot and others added 30 commits May 15, 2026 05:31
When a user provides a PR URL from a different owner (e.g.,
Anantys-oss/pull/171 instead of Anantys-oss/pull/171), the
PR fetch would fail because the PR doesn't exist at the given
owner/repo combination.

Added resolve_pr_location() to claude_step.py that first checks the
given owner/repo, then falls back to all git remotes from the local
project to find where the PR actually lives. Applied to run_rebase,
run_recreate, run_squash, and run_review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… in resolve_pr_location

Done. Here's the summary:

- **Replaced debug `print()` with `logging.info()`** in `resolve_pr_location()` — the `print(..., file=sys.stderr)` was a debug leftover flagged by both the quality report and the reviewer. Converted to standard `logging.info()` with `%s` formatting per Python logging conventions.
- **Made dedup case-insensitive explicitly** — added `.lower()` on `remote_slug` before checking/adding to the `tried` set, so the dedup no longer silently depends on `get_all_github_remotes()` returning lowercase slugs.
extract_tokens_detailed() was called 3 times on the same stdout file
(cost tracking, activity logging, cache line extraction). Similarly,
load_projects_config() was called 3 times with the same instance dir
(quality gate, lint gate, auto-merge). Both are now extracted once at
the top of run_post_mission() and passed through to each consumer.

All new parameters are optional with None defaults so existing callers
(contemplative runner, tests, CLI) continue to work unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Here's what I changed:

- **Extracted `_ensure_tokens()` helper** to DRY up the repeated 4-line fallback pattern (`if tokens is None: import; tokens = extract_tokens_detailed(...)`) that was duplicated across `_extract_cache_line`, `_record_cost_event`, and `_log_activity_usage`. Per reviewer suggestion Anantys-oss#2, this consolidates the fallback logic into a single location while preserving the same behavior.

The other two review points were observations (not change requests): the silent degradation note was flagged as "worth noting" and the quota early-return behavior change was flagged as "worth calling out in the PR description" — neither requested code changes.
…nfig

Remove four functions that are defined and tested but never called in
production code:

- is_project_branch_saturated() in branch_limiter.py: superseded by
  inline logic in iteration_manager._filter_exploration_projects()
  which calls count_pending_branches() directly
- _get_project_by_index() in iteration_manager.py: never called
  anywhere, replaced by _select_random_exploration_project()
- get_tool_flags_for_shell() in config.py: defined and re-exported
  via utils.py but never called
- get_output_flags_for_shell() in config.py: same pattern, never called

Also removes the corresponding test classes and cleans up imports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed both cosmetic nits from the review:

- **`koan/app/config.py`**: Collapsed double blank line before `get_auto_merge_config()` to single blank line (PEP 8)
- **`koan/app/iteration_manager.py`**: Collapsed double blank line before `_get_known_project_names()` to single blank line (PEP 8)

Both were leftover artifacts from the function removals, as noted by the reviewer.
Closes Anantys-oss#1306.

Adds a lightweight Jaccard similarity filter so the agent prompt only carries
learnings relevant to the active mission text. Capped at memory.max_relevant_learnings
lines (default 40) plus a 5-line recency hedge that always keeps freshly captured
lessons. Use [recall:full] in mission text to bypass filtering when needed.

No new dependencies — tokenize/score/select live in koan/app/memory_recall.py
and are wired into prompt_builder._get_learnings_section, which writes a
one-line stderr trace recording kept/dropped counts for tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds _notify_mission_result() to the post-mission pipeline so the Claude
session's final result string reaches Telegram even when the session's
sandbox blocked writes to instance/outbox.md. Activates when the result
contains alert markers (SKIP/FAIL/ERROR/BLOCKED, "permission deadlock",
"hard stop", etc.) or when the mission title matches a customer-facing
skill (/cpfix, wp-bug-resolver).

Idempotent: skipped silently when the session has already written to
outbox.md after the mission start time. Posts at ACTION priority so the
message clears the default min_priority filter; the alert flag only
toggles the icon (⚠️ for alerts, ℹ️ for customer-facing success).

Gated by the new notify_mission_results config key (default: True) so
operators can opt out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The incremental review flow previously required 3 API calls to embed
commit SHAs: (1) post/patch review comment, (2) re-fetch the posted
comment to get its body, (3) PATCH again to add the SHA block. This
also introduced a TOCTOU race where the comment could be modified
between post and re-fetch.

Pass commit_shas directly to _post_review_comment so they are embedded
in the body before the single API call. Removes the now-unused
_patch_comment_body helper. Preserves existing behavior for the no-SHA
case (prior commit IDs from existing comment are carried forward).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nantys-oss#1295)

Adds three layers of optional integration with rtk (https://github.com/rtk-ai/rtk),
a Rust CLI proxy that compresses git/ls/cat/grep/pytest/cargo/gh/docker output
60-90% before Claude reads it. Strictly complementary to caveman (Anantys-oss#1279):
caveman trims what Claude writes, rtk trims what Claude reads.

L1 — Detection
- koan/app/rtk_detector.py: cached `detect_rtk()` probes binary, version,
  jq, the ~/.claude/settings.json hook, and rtk's own config file. All
  probes are read-only and degrade gracefully.
- koan/app/run.py: one-line boot log via the detector at main_loop start.

L2 — Awareness injection
- koan/system-prompts/rtk-awareness.md: 25-line directive listing the
  command surface where rtk has filters.
- koan/app/prompt_builder.py: `_get_rtk_section(project_name)` mirrors
  `_get_caveman_section`. Wired into both build_agent_prompt and
  build_agent_prompt_parts (system-prompt slot, prefix-cache friendly).
- koan/app/config.py: `is_rtk_mode()` / `is_rtk_awareness_enabled()` —
  resolution order is .koan-rtk-override > optimizations.rtk.enabled >
  detector. Default `auto` = on iff binary detected.
- koan/app/projects_config.py: `get_project_rtk_enabled()` for per-project
  opt-out via projects.yaml.
- koan/app/config_validator.py: schema + nested validation for the new
  `optimizations.rtk` block.

L3 — /rtk skill
- koan/skills/core/rtk/: status / setup (preview + confirm) / uninstall /
  gain / discover / on / off. Hook installation only ever via explicit
  `/rtk setup confirm` — Kōan never silently mutates ~/.claude/settings.json.

Tests
- 14 detector tests, 13 skill tests, 10 prompt-builder/config tests.
- Full koan/tests/ suite: 12,142 passed, no regressions.

Docs
- docs/rtk.md (new): full integration reference.
- instance.example/config.yaml: documented `optimizations.rtk` block.
- CLAUDE.md + docs/user-manual.md: skill listing + Quick Reference row.

Closes Anantys-oss#1295

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lated .spec/ gitignore

 mock `load_projects_config` instead of `load_config`.

- **Removed unrelated `.spec/` gitignore entry** (.gitignore): Per reviewer request, removed the `.spec/` gitignore addition which is unrelated to the RTK integration and should be in a separate commit.

**Reviewed but no changes needed (already correct):**

- Shell injection in `/rtk gain` and `/rtk discover`: `_run_rtk()` already uses `subprocess.run(["rtk", *args])` with list form (no `shell=True`). Safe by design.
- `/rtk setup confirm` safety: same `_run_rtk()` list-form subprocess call; `confirm` check uses strict equality.
- Cache staleness: handler already calls `reset_cache()` after mutations (`setup confirm`, `uninstall`) and uses `detect_rtk(force=True)` to re-probe.
- `is_rtk_mode()` auto handling: `coerce_rtk_enabled()` correctly handles bool, string true/false/auto, returning `None` for auto which falls through to binary detection.
- Config validation: `_validate_rtk_nested()` already exists in `config_validator.py`.
When a mission starts, a checkpoint JSON file is created under
instance/journal/checkpoints/. During execution it captures branch
info, progress from pending.md, and CHECKPOINT markers from stdout.
On success the file is cleaned up; on crash, recover.py reads the
checkpoint and injects structured recovery context into pending.md.

- checkpoint_manager.py: create/read/update/delete/list + parsing
- run.py: create checkpoint at mission start, update with branch +
  pending.md + stdout markers before finalization, cleanup on success
- recover.py: checkpoint-aware classification (partial vs dead),
  recovery context injection into pending.md, JSONL audit logging
- 35 new tests (checkpoint_manager) + 6 new tests (recover integration)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
o it's always defined when referenced on the recovery path. This was a blocking bug — if the checkpoint_manager import failed, `clean_text` would be undefined, crashing the entire `recover_missions()` flow.

- **Replaced custom file locking with `atomic_write` from utils** (`checkpoint_manager.py`): `_write_checkpoint` now delegates to the project's standard `atomic_write()` instead of hand-rolling temp file + lock + rename. Removed unused `fcntl` and `sys` imports.

- **Removed cosmetic read-side lock** (`checkpoint_manager.py`): `_read_checkpoint` no longer acquires `LOCK_SH` since it never coordinated with the write-side lock (which locked the temp file, not the target). Atomic rename on write guarantees read consistency without locking.

- **Used `atomic_write` for pending.md injection** (`recover.py`): Checkpoint context injection into `pending.md` now uses `atomic_write()` instead of raw `open()`, consistent with project conventions. Removed debug print statements from this path.
Here's a summary:

**Changes made:**

- **Moved checkpoint update before auth/quota error checks** (`koan/app/run.py`): The checkpoint update block (branch detection, pending.md parsing, stdout marker extraction) was moved from after the auth/quota early-return checks to immediately after stdout parsing. This ensures mission progress is captured in the checkpoint even when auth expiration or quota exhaustion triggers an early return — addressing the reviewer's concern about the checkpoint only containing empty initial state if a crash occurs during the main execution window.

- **Delete checkpoint on any finalized mission, not just success** (`koan/app/run.py`): Changed the checkpoint cleanup condition from `claude_exit == 0` to unconditional (whenever `original_mission_title` is set). Once a mission is finalized (success or failure), the checkpoint serves no purpose — `recover.py` only processes in-progress missions. This prevents orphaned checkpoint files from accumulating over time.
Token extraction logic (JSON field traversal, cache metrics, cost) was
spread across usage_estimator.py with consumers in mission_runner.py and
cost_tracker.py duplicating field access patterns and cache hit rate
calculations independently.

Introduces token_parser.py as single source of truth with:
- TokenResult dataclass replacing raw dicts for structured access
- extract_tokens() consolidating all JSON format handling
- compute_cache_hit_rate() eliminating duplicated rate formulas

Existing callers updated to delegate through the new module. The
usage_estimator.extract_tokens_detailed() remains as a thin dict
adapter for backward compatibility.

Closes Anantys-oss#1323

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _get_file_tree: log timeout/OSError at stderr instead of bare pass,
  making it visible when git ls-files fails or times out
- compact_learnings: include exception type and message in fallback log,
  add "method" (semantic/fallback) and "error" fields to return dict
  so callers can distinguish real compaction from emergency truncation

Closes Anantys-oss#1320, closes Anantys-oss#1321

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read the archive file once with encoding="utf-8" and reuse both the
content string and the line set, instead of reading without encoding
for the set and re-reading with encoding for content append.

Closes Anantys-oss#1319

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a GitHub Actions workflow that automates the release process:
- Manual trigger via workflow_dispatch with version input (vX.Y or vX.Y.Z)
- Runs full test suite (Python 3.11 + 3.14, fast + slow groups) before releasing
- Validates version format, checks tag uniqueness, ensures commits exist
- Creates annotated tag, updates stable branch, publishes GitHub release
- Changelog generated from git log (no Claude CLI needed in CI)

Mirrors the steps in scripts/release.sh but adapted for CI (non-interactive,
bot git identity, GITHUB_TOKEN auth).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean. Here's the summary:

- Removed the entire `test` job (lines 15–69) and its `needs: test` dependency from the `release` job, per reviewer request — the existing `tests.yml` CI pipeline already covers testing, so duplicating it in the release workflow is unnecessary maintenance burden
Changes applied:

- **Replaced `stable` branch with `stable` tag** per reviewer request at line 65: the "Update stable branch" step (which used `git branch` + `git push`) is now "Update stable tag" using `git tag -f stable` + force push
- **Added `update_stable` boolean input** to `workflow_dispatch` per reviewer request: defaults to `true`, controls whether the `stable` tag gets updated via an `if: inputs.update_stable` condition
- **Removed the stable branch logic entirely** per reviewer's "remove" comment at line 72: no more branch creation, fetch, or fast-forward — replaced with a simple tag update
OutboxManager handles the critical message delivery pipeline (read,
format, send, crash recovery) but had zero test coverage. This adds
36 tests covering all public methods and key behaviors:

- parse_outbox_priority: header parsing, priority ranking, stripping
- recover_staged: crash recovery from interrupted flushes
- flush: full lifecycle (scan, format, send, quarantine, requeue)
- flush_async: background thread management and skip-if-busy
- requeue / _write_failed: retry and last-resort persistence
- _format_message: Claude formatting with fallback paths
- _expand_github_refs: project context detection and URL expansion
- _get_last_message_id: provider delegation and error handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The notification poller was silently dropping legitimate @mentions because
fetch_jira_mentions() capped results at the 20 most-recently-updated issues
across all mapped projects. On a multi-project deployment, 24h of activity
routinely produces 50–100 issues, so anything not in the top 20 by
`updated DESC` never had its comments inspected.

Empirically, an issue ranked 46 of 100 in production carried a legitimate
`@<bot> <command>` mention but the polling cycle logged "No new Jira
notifications" and the mention was lost. JQL pre-filtering by mention text
isn't viable either: `text ~ "<nickname>"` and `comment ~ ...` both miss
ADF mention nodes because Jira indexes the accountId reference, not the
displayName.

Changes:
- Bump _MAX_ISSUES_PER_CYCLE from 20 to 200. Cold-start cost rises from
  21 to ~201 API calls (~20s at 10 req/s), comparable to GitHub's "~1 min"
  cold start. Subsequent polls use _last_jira_check_iso (≤60s window) so
  steady-state cost is unchanged.
- Promote the per-cycle search log from DEBUG to INFO and surface the JQL
  since-window + result count, so future "no mentions" mysteries are
  diagnosable from run.log alone.
- When the cap actually clips, emit a WARNING with guidance to tighten
  max_age_hours or shorten check_interval_seconds.
- Add a Telegram cold-start banner "🔍 Scanning Jira notifications..."
  parallel to GitHub's existing one, so the user sees that Jira IS being
  scanned in the same shape and time as GitHub.

CLAUDE.md gains a "Never leak private skill/agent/project names" convention
documenting the policy and the pre-commit grep check (this commit also
scrubs three pre-existing leaks of private project keys in test fixtures
and a docstring).

Regression test: stub a 100-issue result set with the target mention at
index 46 and assert fetch_jira_mentions still returns it. With the old
cap of 20 this test would fail.
review_requested and assign notifications kept re-queueing the same
mission after every restart, because none of the existing dedup layers
covered the case:

  - The in-memory _notif_cache in loop_manager.py is lost on restart.
  - The persistent comment tracker keys on comment IDs, but these
    notifications have no comment object to react to or hash.
  - The fallback missions.md check in _try_assignment_notification
    only scans the Pending section, so once the prior /review moved
    to In Progress/Done/Failed it became invisible.

Add a parallel persistent tracker in github_notification_tracker.py
that records (notification_id, updated_at) composite keys for 7 days
in instance/.koan-github-processed-threads.json. Wire it into
_try_assignment_notification with an early-return guard above the
staleness/closed/repo checks, plus track_thread calls on both the
"already pending" and "successfully inserted" paths so subsequent
restarts short-circuit cleanly.

The composite key naturally invalidates when GitHub bumps updated_at
(re-requested review, new commits pushed), so a renewed request still
queues a fresh mission rather than being permanently silenced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every existing test in TestParseJiraMentionCommand uses a single-token
nickname ("koan-bot"). None exercises the regex against a nickname
containing whitespace, which leaves a gap: a refactor that drops
re.escape() or interpolates the nickname directly into the pattern
would silently break any deployment whose configured nickname has a
space — and such deployments do exist in practice, because Jira's
ADF mention.attrs.text preserves the literal display-name spacing.

Add three parametrized cases exercising:
  - the basic spaced-nickname mention,
  - a mention followed by command-context (ensures the trailing
    capture group isn't disrupted by the whitespace in the nickname),
  - a fully-lowercased mention (locks in the re.IGNORECASE flag,
    which clients sometimes rely on when rendering mentions).

No production code changes.
Lets a custom skill own its full mission lifecycle (e.g. enforce a
JIRA outcome comment after every fix mission) without touching Kōan
core. Previously hooks could only live in instance/hooks/ and were
instance-wide; skill authors had no way to ship pre/post-mission logic
alongside their handler.py.

HookRegistry now also walks instance/skills/<scope>/<name>/ for files
named after a known event (session_start.py, session_end.py,
pre_mission.py, post_mission.py) and registers each module's run(ctx)
as a handler for that event. Instance-wide hooks still fire first;
skill-bound hooks run after. Errors stay isolated per handler.

mission_runner._fire_post_mission_hook now pre-reads the truncated
Claude stdout via _read_stdout_summary and passes it to hooks as
ctx['result_text'], so hooks can parse JIRA keys / PR URLs / RESULT
lines without re-implementing file I/O.

Makefile gains a test-skills target that auto-discovers
instance/skills/**/tests/test_*.py (handles symlinked skill repos via
find -L) and runs them with KOAN_REPO + PYTHONPATH set up. It chains
into make test and make test-strict so skill regressions surface in
the regular suite; skipped cleanly when no skill tests exist.

instance.example/hooks/README.md documents both hook flavors, the new
result_text ctx key, and the convention for shipping tests alongside
a skill (tests/conftest.py template + pytest invocation recipes).

Tested: 224 koan tests pass (52 existing hook + 9 new skill-bound
discovery + 163 mission_runner); 6 pre-existing environment failures
are unrelated (confirmed via git stash). Tests cover discovery,
ordering, error isolation, ctx contents, and the missing-skills-dir
edge case.
Koan-Bot and others added 26 commits May 16, 2026 08:35
Project tag regex `[a-zA-Z0-9_-]+` rejected domain-like names such as
`developers.esphome.io`. `is_skill_mission` then returned False for
`[project:developers.esphome.io] /review …`, missions fell through to
the agent loop, and ran with the wrong working directory.

Add `.` to the character class wherever a `[project:X]` tag is parsed
or stripped (8 modules + dashboard HTML). Same fix for `_PROJECT_WORD_RE`
so raw word prefixes (`developers.esphome.io /plan …`) work too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Claude provider was passing the agent system prompt (~7 KB and
growing) as a positional argv string via `--append-system-prompt`,
which exposes the full prompt text in `ps`, process supervisors, and
any other tool that snapshots argv.  On a shared host or in any
hostile environment, that's a needless leak of operator policies and
instance learnings.

Switch to `--append-system-prompt-file` (documented, print-mode only,
which is the only mode Kōan uses).  A new
`build_full_command_managed()` helper writes the system prompt to a
0600 temp file and returns the command list paired with a cleanup
path that the caller MUST unlink after the subprocess exits — same
lifecycle pattern already used for stdout/stderr captures and plugin
dirs in `run.py` and `session_manager.py`.

- Adds `supports_system_prompt_file()` and
  `build_system_prompt_file_args()` capability methods on the provider
  base class; Claude opts in, the other providers fall through to the
  existing prepend-to-user-prompt behavior.
- `mission_runner.build_mission_command()` now returns
  `Tuple[List[str], List[str]]` — call sites in `run.py` and
  `session_manager.py` unpack and clean up.
- Adds `tests/test_provider_system_prompt_file.py` covering the
  capability flag, file-flag emission, file-mode precedence over
  inline content, cleanup behavior, and an argv-leak regression test
  that proves the prompt text never lands in argv.

The legacy `build_full_command()` and `--append-system-prompt` path
remain available for direct callers that haven't migrated, so this
change is additive at the provider API surface.

Out of scope (follow-up): the user prompt (~20 KB) is still passed via
`-p <text>` in argv.  Claude CLI doesn't currently document a clean
file/stdin substitute for the user prompt — worth a dedicated probe
before refactoring that side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The warning "Claude hit the max turns limit (5). To increase: set
skill_max_turns in instance/config.yaml (current: 5)" was misleading
when fired from chat-style callers (ask, github_reply, github_intent,
spec_generator, deepplan/plan reviewers, implement commit-subject
helper). These callers hardcode max_turns=1/3/5; skill_max_turns
(default 200) does not affect them, so the suggested remedy did
nothing.

Threads a max_turns_source argument through run_command()/
run_command_streaming(). Skill runners keep the default
("skill_max_turns") and are unchanged. Hardcoded-limit callers
pass max_turns_source=None, which produces a clearer message
("This call uses a hardcoded limit and is not configurable.")
instead of pointing the user at an unrelated config key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cold-start notification processing was serial: each notification triggered
several sequential `gh` API calls (fetch comment, find_mention_in_thread,
check subject state, mark read, react). With a 24h lookback returning 10+
notifications, this added 5-20s of wall-clock latency before the first
iteration could plan.

- New `github.parallel_workers` config (default 4, range 1-16) controls
  the worker pool used by `_process_notifications_concurrent`.
- Per-notification work is I/O bound (subprocess + HTTP) so threads scale
  near-linearly. workers=1 keeps the original serial path.
- Existing thread-safe primitives cover the shared state: cache lock,
  atomic mission writes, lock-guarded backoff counters.
classify_cli_error uses loose quota patterns (e.g. "rate limit",
"too many requests") on combined stdout+stderr, causing false QUOTA
classification when Claude's response discusses API rate limiting.

These tests demonstrate the bug before the fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
classify_cli_error() used loose quota patterns (e.g. "rate limit",
"too many requests", "usage limit") on combined stdout+stderr. When
Claude's response discussed API rate limiting, this falsely triggered
QUOTA classification, causing spurious mission requeueing and pauses.

Apply the same split-detection strategy already used by
handle_quota_exhaustion in quota_handler.py: strict patterns only
for stdout, all patterns for stderr.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of changes:

- **Fixed CI regression in `classify_cli_error()`** (`koan/app/cli_errors.py`): The split stdout/stderr quota check broke `test_failure_raises_runtime_error`, which passes a `MagicMock` for stdout. The original combined f-string implicitly coerced non-string values; the new direct `re.search(stdout)` call did not. Added explicit `str()` coercion for both `stdout` and `stderr` before regex matching, restoring the defensive behavior while keeping the false-positive fix intact. Addresses @atoomic's request to view and fix the CI failure.
Skill runners (rebase_pr, recreate_pr, squash_pr) produced zero stdout
during execution — all prints went to stderr. The liveness watchdog
(600s timeout) only resets on stdout lines, so long-running Claude CLI
calls caused the subprocess to be killed before completion.

Add print("[skill] ...", flush=True) heartbeat statements before every
blocking operation (Claude invocations, API calls, git operations) to
keep the watchdog satisfied.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nantys-oss#1307)

Adds a rolling burn-rate estimator that tracks the percentage of session
quota consumed per minute, computes time-to-exhaustion, and lets Koan act
on the projection before the wall.

- `burn_rate.py` maintains a 20-sample circular buffer in
  `instance/.burn-rate.json` plus a `last_warned_at` cursor so warnings
  fire at most once per quota cycle.
- `mission_runner.update_usage` records a sample after every run; the
  cost is the percentage of `session_token_limit` consumed by that run.
- `UsageTracker.decide_mode()` consults the buffer through a new
  `instance_dir` argument; if projected exhaustion is < 30 min, it drops
  one tier (deep→implement→review) and surfaces the downgrade in the
  decision reason.
- `iteration_manager` checks each iteration whether projected exhaustion
  is < 60 min while the next reset is still > 2 h away, and if so emits
  a one-shot Telegram alert via the outbox.
- `/quota` now prints the live burn rate (%/h) and estimated time to
  exhaustion when there is enough history.

Tests cover buffer trimming, persistence, edge cases (no history, zero
span, invalid values), mode multipliers, and the tracker downgrade
integration. Full suite passes (12364 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skills can now declare Python package dependencies via a `requirements:`
field in their SKILL.md frontmatter. Missing packages are auto-installed
via pip before the handler's first execution in a session.

- Parse `requirements: [pkg1, pkg2]` in frontmatter (inline list or single string)
- Check importability before installing (fast path for satisfied deps)
- Support version specifiers (>=, ==, <)
- Cache per-skill per-session to avoid repeated checks
- Return SkillError on install failure (surfaced to Telegram)
- Document the field in koan/skills/README.md

Closes Anantys-oss#1245

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup

All three review items have been addressed. Here's the summary:

- **Pip flag injection validation** (blocking): Added early validation in `ensure_requirements()` that rejects any requirement entry starting with `-`, preventing pip CLI flag injection (e.g. `--index-url`) via crafted SKILL.md files. Added `test_ensure_requirements_rejects_flag_injection` test.
- **Incomplete version specifier stripping** (important): Replaced chained `.split(">=")[0].split("==")[0].split("<")[0]` with `re.split(r'[><=!~]', pkg)[0]` to handle all PEP 440 operators (`~=`, `<=`, `!=`, `===`, etc.). Added `test_ensure_requirements_handles_tilde_specifier` test.
- **Fragile test state management** (important): Added `_reset_requirements_cache()` helper in `skills.py` and an `autouse` pytest fixture in `TestSkillRequirements` that clears the cache before/after each test. Removed all manual `try/finally` blocks and direct `_requirements_satisfied` manipulation from tests.
Adds ruff lint config with the PERF ruleset, excludes tests/ from PERF
checks (fixture loops add little perf value), and rewrites the production
+ skill violations: list append in for-loop → list.extend / comprehension,
dict iterator key→value, dict-build loop → dict comprehension, slice copy
→ list.copy.

All 12633 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`gh run list --branch X --limit 1` was returning the "Dependabot
auto-merge" workflow (conclusion=skipped) on non-Dependabot PRs,
which check_ci_status, wait_for_ci, and check_existing_ci all
treated as a failure. drain_one then kept injecting /ci_check fix
missions against a PR whose real CI was green — see
aio-libs/yarl#1681 for the field report.

Add aggregate_ci_runs(runs) and fetch_branch_ci_runs(branch, repo)
helpers in claude_step. The aggregator filters out skipped,
cancelled, neutral, and action_required conclusions, then reduces
the remaining runs to (status, run_id) with failure > pending >
success priority. All three CI status callers now share the same
filtering, fetching up to 20 runs per branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When GitHub's Private Vulnerability Reporting is enabled on a target
repo, /security_audit now submits critical/high findings as private
security advisories instead of public issues. Lower-severity findings
remain public. Configurable per-project via projects.yaml security
section (pvrs mode + threshold). Graceful fallback to public issues
on any PVRS API failure.

Implements: Anantys-oss#1341
Plan: PR Anantys-oss#1269 (docs/plan-pvrs-security-audit.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensure all remote tracking refs for the base branch are fresh before
any mission work begins — both new branches and rebases.

Two complementary changes:
- git_prep: after primary sync, fetch base from all secondary remotes
- claude_step: pre-fetch all relevant remotes before the rebase loop

Addresses review feedback on PR Anantys-oss#1333 about keeping main up to date
regardless of its name before working on branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a Linting section to CLAUDE.md documenting ruff as the project's
linter, and add a `make lint` Makefile target so contributors can
check compliance before pushing. Addresses review feedback on Anantys-oss#1345.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds three behavioral tests against claude_step.run_claude that capture
the aioesphomeapi #1660 stagnation root cause:

1. test_streams_each_stdout_line_in_real_time — child stdout must reach
   parent stdout as produced, not buffered until exit. Currently fails
   because run_claude uses subprocess.run(capture_output=True), which
   silently buffers, starving the run.py liveness watchdog.

2. test_timeout_kills_process_group — backgrounded grandchildren must die
   when run_claude times out. Currently fails because subprocess.run only
   kills the immediate child; the session group survives.

3. test_collected_output_still_available_in_return — existing return-shape
   contract (result['output'] holds full text) must be preserved by the
   streaming impl.

The pre-existing test_timeout_returns_before_grandchild_dies passes on
Python 3.12, so the original grandchild-pipe-block hypothesis was not
the root cause — silent capture is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_claude in claude_step.py used subprocess.run(capture_output=True)
via run_cli_with_retry. That silently buffered Claude's stdout for the
entire call, so any skill subprocess wrapping it (rebase, recreate,
review_runner, …) produced zero output to its own stdout while Claude
was running. run.py's first_output_timeout watchdog (default 600s)
killed the skill when the wrapped Claude call legitimately took longer
than the watchdog window, leaving the user with a stagnation timeout
and no diagnostic — the exact failure mode seen on aioesphomeapi #1660
during the already-solved check.

Replace the call path with popen_cli + start_new_session=True, stream
each stdout line through to the parent's stdout (resetting any outer
liveness watchdog), and kill the entire process group with SIGKILL on
timeout to reclaim grandchildren (MCP servers, hooks) that would
otherwise hold inherited pipes open.

The retry layer from run_cli_with_retry is intentionally dropped here:
classify-and-retry only makes sense for short probe calls where the
caller can't observe failure live. With streaming, the caller sees
failures immediately and can decide. Existing rebase/recreate/review
callers already handle a single failure result correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdraco
Copy link
Copy Markdown

bdraco commented May 16, 2026

You made the pr to the fork instead of upstream

@bluetoothbot
Copy link
Copy Markdown
Owner Author

landed on the wrong repo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants