fix(claude_step): stream stdout + kill process group on run_claude timeout#1
Closed
bluetoothbot wants to merge 62 commits into
Closed
fix(claude_step): stream stdout + kill process group on run_claude timeout#1bluetoothbot wants to merge 62 commits into
bluetoothbot wants to merge 62 commits into
Conversation
…tle behind threshold
When a user provides a PR URL from a different owner (e.g., Anantys-oss/pull/171 instead of Anantys-oss/pull/171), the PR fetch would fail because the PR doesn't exist at the given owner/repo combination. Added resolve_pr_location() to claude_step.py that first checks the given owner/repo, then falls back to all git remotes from the local project to find where the PR actually lives. Applied to run_rebase, run_recreate, run_squash, and run_review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… in resolve_pr_location Done. Here's the summary: - **Replaced debug `print()` with `logging.info()`** in `resolve_pr_location()` — the `print(..., file=sys.stderr)` was a debug leftover flagged by both the quality report and the reviewer. Converted to standard `logging.info()` with `%s` formatting per Python logging conventions. - **Made dedup case-insensitive explicitly** — added `.lower()` on `remote_slug` before checking/adding to the `tried` set, so the dedup no longer silently depends on `get_all_github_remotes()` returning lowercase slugs.
extract_tokens_detailed() was called 3 times on the same stdout file (cost tracking, activity logging, cache line extraction). Similarly, load_projects_config() was called 3 times with the same instance dir (quality gate, lint gate, auto-merge). Both are now extracted once at the top of run_post_mission() and passed through to each consumer. All new parameters are optional with None defaults so existing callers (contemplative runner, tests, CLI) continue to work unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Here's what I changed: - **Extracted `_ensure_tokens()` helper** to DRY up the repeated 4-line fallback pattern (`if tokens is None: import; tokens = extract_tokens_detailed(...)`) that was duplicated across `_extract_cache_line`, `_record_cost_event`, and `_log_activity_usage`. Per reviewer suggestion Anantys-oss#2, this consolidates the fallback logic into a single location while preserving the same behavior. The other two review points were observations (not change requests): the silent degradation note was flagged as "worth noting" and the quota early-return behavior change was flagged as "worth calling out in the PR description" — neither requested code changes.
…nfig Remove four functions that are defined and tested but never called in production code: - is_project_branch_saturated() in branch_limiter.py: superseded by inline logic in iteration_manager._filter_exploration_projects() which calls count_pending_branches() directly - _get_project_by_index() in iteration_manager.py: never called anywhere, replaced by _select_random_exploration_project() - get_tool_flags_for_shell() in config.py: defined and re-exported via utils.py but never called - get_output_flags_for_shell() in config.py: same pattern, never called Also removes the corresponding test classes and cleans up imports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed both cosmetic nits from the review: - **`koan/app/config.py`**: Collapsed double blank line before `get_auto_merge_config()` to single blank line (PEP 8) - **`koan/app/iteration_manager.py`**: Collapsed double blank line before `_get_known_project_names()` to single blank line (PEP 8) Both were leftover artifacts from the function removals, as noted by the reviewer.
Closes Anantys-oss#1306. Adds a lightweight Jaccard similarity filter so the agent prompt only carries learnings relevant to the active mission text. Capped at memory.max_relevant_learnings lines (default 40) plus a 5-line recency hedge that always keeps freshly captured lessons. Use [recall:full] in mission text to bypass filtering when needed. No new dependencies — tokenize/score/select live in koan/app/memory_recall.py and are wired into prompt_builder._get_learnings_section, which writes a one-line stderr trace recording kept/dropped counts for tuning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds _notify_mission_result() to the post-mission pipeline so the Claude session's final result string reaches Telegram even when the session's sandbox blocked writes to instance/outbox.md. Activates when the result contains alert markers (SKIP/FAIL/ERROR/BLOCKED, "permission deadlock", "hard stop", etc.) or when the mission title matches a customer-facing skill (/cpfix, wp-bug-resolver). Idempotent: skipped silently when the session has already written to outbox.md after the mission start time. Posts at ACTION priority so the message clears the default min_priority filter; the alert flag only toggles the icon (⚠️ for alerts, ℹ️ for customer-facing success). Gated by the new notify_mission_results config key (default: True) so operators can opt out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The incremental review flow previously required 3 API calls to embed commit SHAs: (1) post/patch review comment, (2) re-fetch the posted comment to get its body, (3) PATCH again to add the SHA block. This also introduced a TOCTOU race where the comment could be modified between post and re-fetch. Pass commit_shas directly to _post_review_comment so they are embedded in the body before the single API call. Removes the now-unused _patch_comment_body helper. Preserves existing behavior for the no-SHA case (prior commit IDs from existing comment are carried forward). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nantys-oss#1295) Adds three layers of optional integration with rtk (https://github.com/rtk-ai/rtk), a Rust CLI proxy that compresses git/ls/cat/grep/pytest/cargo/gh/docker output 60-90% before Claude reads it. Strictly complementary to caveman (Anantys-oss#1279): caveman trims what Claude writes, rtk trims what Claude reads. L1 — Detection - koan/app/rtk_detector.py: cached `detect_rtk()` probes binary, version, jq, the ~/.claude/settings.json hook, and rtk's own config file. All probes are read-only and degrade gracefully. - koan/app/run.py: one-line boot log via the detector at main_loop start. L2 — Awareness injection - koan/system-prompts/rtk-awareness.md: 25-line directive listing the command surface where rtk has filters. - koan/app/prompt_builder.py: `_get_rtk_section(project_name)` mirrors `_get_caveman_section`. Wired into both build_agent_prompt and build_agent_prompt_parts (system-prompt slot, prefix-cache friendly). - koan/app/config.py: `is_rtk_mode()` / `is_rtk_awareness_enabled()` — resolution order is .koan-rtk-override > optimizations.rtk.enabled > detector. Default `auto` = on iff binary detected. - koan/app/projects_config.py: `get_project_rtk_enabled()` for per-project opt-out via projects.yaml. - koan/app/config_validator.py: schema + nested validation for the new `optimizations.rtk` block. L3 — /rtk skill - koan/skills/core/rtk/: status / setup (preview + confirm) / uninstall / gain / discover / on / off. Hook installation only ever via explicit `/rtk setup confirm` — Kōan never silently mutates ~/.claude/settings.json. Tests - 14 detector tests, 13 skill tests, 10 prompt-builder/config tests. - Full koan/tests/ suite: 12,142 passed, no regressions. Docs - docs/rtk.md (new): full integration reference. - instance.example/config.yaml: documented `optimizations.rtk` block. - CLAUDE.md + docs/user-manual.md: skill listing + Quick Reference row. Closes Anantys-oss#1295 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lated .spec/ gitignore mock `load_projects_config` instead of `load_config`. - **Removed unrelated `.spec/` gitignore entry** (.gitignore): Per reviewer request, removed the `.spec/` gitignore addition which is unrelated to the RTK integration and should be in a separate commit. **Reviewed but no changes needed (already correct):** - Shell injection in `/rtk gain` and `/rtk discover`: `_run_rtk()` already uses `subprocess.run(["rtk", *args])` with list form (no `shell=True`). Safe by design. - `/rtk setup confirm` safety: same `_run_rtk()` list-form subprocess call; `confirm` check uses strict equality. - Cache staleness: handler already calls `reset_cache()` after mutations (`setup confirm`, `uninstall`) and uses `detect_rtk(force=True)` to re-probe. - `is_rtk_mode()` auto handling: `coerce_rtk_enabled()` correctly handles bool, string true/false/auto, returning `None` for auto which falls through to binary detection. - Config validation: `_validate_rtk_nested()` already exists in `config_validator.py`.
When a mission starts, a checkpoint JSON file is created under instance/journal/checkpoints/. During execution it captures branch info, progress from pending.md, and CHECKPOINT markers from stdout. On success the file is cleaned up; on crash, recover.py reads the checkpoint and injects structured recovery context into pending.md. - checkpoint_manager.py: create/read/update/delete/list + parsing - run.py: create checkpoint at mission start, update with branch + pending.md + stdout markers before finalization, cleanup on success - recover.py: checkpoint-aware classification (partial vs dead), recovery context injection into pending.md, JSONL audit logging - 35 new tests (checkpoint_manager) + 6 new tests (recover integration) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
o it's always defined when referenced on the recovery path. This was a blocking bug — if the checkpoint_manager import failed, `clean_text` would be undefined, crashing the entire `recover_missions()` flow. - **Replaced custom file locking with `atomic_write` from utils** (`checkpoint_manager.py`): `_write_checkpoint` now delegates to the project's standard `atomic_write()` instead of hand-rolling temp file + lock + rename. Removed unused `fcntl` and `sys` imports. - **Removed cosmetic read-side lock** (`checkpoint_manager.py`): `_read_checkpoint` no longer acquires `LOCK_SH` since it never coordinated with the write-side lock (which locked the temp file, not the target). Atomic rename on write guarantees read consistency without locking. - **Used `atomic_write` for pending.md injection** (`recover.py`): Checkpoint context injection into `pending.md` now uses `atomic_write()` instead of raw `open()`, consistent with project conventions. Removed debug print statements from this path.
Here's a summary: **Changes made:** - **Moved checkpoint update before auth/quota error checks** (`koan/app/run.py`): The checkpoint update block (branch detection, pending.md parsing, stdout marker extraction) was moved from after the auth/quota early-return checks to immediately after stdout parsing. This ensures mission progress is captured in the checkpoint even when auth expiration or quota exhaustion triggers an early return — addressing the reviewer's concern about the checkpoint only containing empty initial state if a crash occurs during the main execution window. - **Delete checkpoint on any finalized mission, not just success** (`koan/app/run.py`): Changed the checkpoint cleanup condition from `claude_exit == 0` to unconditional (whenever `original_mission_title` is set). Once a mission is finalized (success or failure), the checkpoint serves no purpose — `recover.py` only processes in-progress missions. This prevents orphaned checkpoint files from accumulating over time.
Token extraction logic (JSON field traversal, cache metrics, cost) was spread across usage_estimator.py with consumers in mission_runner.py and cost_tracker.py duplicating field access patterns and cache hit rate calculations independently. Introduces token_parser.py as single source of truth with: - TokenResult dataclass replacing raw dicts for structured access - extract_tokens() consolidating all JSON format handling - compute_cache_hit_rate() eliminating duplicated rate formulas Existing callers updated to delegate through the new module. The usage_estimator.extract_tokens_detailed() remains as a thin dict adapter for backward compatibility. Closes Anantys-oss#1323 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _get_file_tree: log timeout/OSError at stderr instead of bare pass, making it visible when git ls-files fails or times out - compact_learnings: include exception type and message in fallback log, add "method" (semantic/fallback) and "error" fields to return dict so callers can distinguish real compaction from emergency truncation Closes Anantys-oss#1320, closes Anantys-oss#1321 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read the archive file once with encoding="utf-8" and reuse both the content string and the line set, instead of reading without encoding for the set and re-reading with encoding for content append. Closes Anantys-oss#1319 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a GitHub Actions workflow that automates the release process: - Manual trigger via workflow_dispatch with version input (vX.Y or vX.Y.Z) - Runs full test suite (Python 3.11 + 3.14, fast + slow groups) before releasing - Validates version format, checks tag uniqueness, ensures commits exist - Creates annotated tag, updates stable branch, publishes GitHub release - Changelog generated from git log (no Claude CLI needed in CI) Mirrors the steps in scripts/release.sh but adapted for CI (non-interactive, bot git identity, GITHUB_TOKEN auth). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean. Here's the summary: - Removed the entire `test` job (lines 15–69) and its `needs: test` dependency from the `release` job, per reviewer request — the existing `tests.yml` CI pipeline already covers testing, so duplicating it in the release workflow is unnecessary maintenance burden
Changes applied: - **Replaced `stable` branch with `stable` tag** per reviewer request at line 65: the "Update stable branch" step (which used `git branch` + `git push`) is now "Update stable tag" using `git tag -f stable` + force push - **Added `update_stable` boolean input** to `workflow_dispatch` per reviewer request: defaults to `true`, controls whether the `stable` tag gets updated via an `if: inputs.update_stable` condition - **Removed the stable branch logic entirely** per reviewer's "remove" comment at line 72: no more branch creation, fetch, or fast-forward — replaced with a simple tag update
OutboxManager handles the critical message delivery pipeline (read, format, send, crash recovery) but had zero test coverage. This adds 36 tests covering all public methods and key behaviors: - parse_outbox_priority: header parsing, priority ranking, stripping - recover_staged: crash recovery from interrupted flushes - flush: full lifecycle (scan, format, send, quarantine, requeue) - flush_async: background thread management and skip-if-busy - requeue / _write_failed: retry and last-resort persistence - _format_message: Claude formatting with fallback paths - _expand_github_refs: project context detection and URL expansion - _get_last_message_id: provider delegation and error handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The notification poller was silently dropping legitimate @mentions because fetch_jira_mentions() capped results at the 20 most-recently-updated issues across all mapped projects. On a multi-project deployment, 24h of activity routinely produces 50–100 issues, so anything not in the top 20 by `updated DESC` never had its comments inspected. Empirically, an issue ranked 46 of 100 in production carried a legitimate `@<bot> <command>` mention but the polling cycle logged "No new Jira notifications" and the mention was lost. JQL pre-filtering by mention text isn't viable either: `text ~ "<nickname>"` and `comment ~ ...` both miss ADF mention nodes because Jira indexes the accountId reference, not the displayName. Changes: - Bump _MAX_ISSUES_PER_CYCLE from 20 to 200. Cold-start cost rises from 21 to ~201 API calls (~20s at 10 req/s), comparable to GitHub's "~1 min" cold start. Subsequent polls use _last_jira_check_iso (≤60s window) so steady-state cost is unchanged. - Promote the per-cycle search log from DEBUG to INFO and surface the JQL since-window + result count, so future "no mentions" mysteries are diagnosable from run.log alone. - When the cap actually clips, emit a WARNING with guidance to tighten max_age_hours or shorten check_interval_seconds. - Add a Telegram cold-start banner "🔍 Scanning Jira notifications..." parallel to GitHub's existing one, so the user sees that Jira IS being scanned in the same shape and time as GitHub. CLAUDE.md gains a "Never leak private skill/agent/project names" convention documenting the policy and the pre-commit grep check (this commit also scrubs three pre-existing leaks of private project keys in test fixtures and a docstring). Regression test: stub a 100-issue result set with the target mention at index 46 and assert fetch_jira_mentions still returns it. With the old cap of 20 this test would fail.
review_requested and assign notifications kept re-queueing the same
mission after every restart, because none of the existing dedup layers
covered the case:
- The in-memory _notif_cache in loop_manager.py is lost on restart.
- The persistent comment tracker keys on comment IDs, but these
notifications have no comment object to react to or hash.
- The fallback missions.md check in _try_assignment_notification
only scans the Pending section, so once the prior /review moved
to In Progress/Done/Failed it became invisible.
Add a parallel persistent tracker in github_notification_tracker.py
that records (notification_id, updated_at) composite keys for 7 days
in instance/.koan-github-processed-threads.json. Wire it into
_try_assignment_notification with an early-return guard above the
staleness/closed/repo checks, plus track_thread calls on both the
"already pending" and "successfully inserted" paths so subsequent
restarts short-circuit cleanly.
The composite key naturally invalidates when GitHub bumps updated_at
(re-requested review, new commits pushed), so a renewed request still
queues a fresh mission rather than being permanently silenced.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every existing test in TestParseJiraMentionCommand uses a single-token
nickname ("koan-bot"). None exercises the regex against a nickname
containing whitespace, which leaves a gap: a refactor that drops
re.escape() or interpolates the nickname directly into the pattern
would silently break any deployment whose configured nickname has a
space — and such deployments do exist in practice, because Jira's
ADF mention.attrs.text preserves the literal display-name spacing.
Add three parametrized cases exercising:
- the basic spaced-nickname mention,
- a mention followed by command-context (ensures the trailing
capture group isn't disrupted by the whitespace in the nickname),
- a fully-lowercased mention (locks in the re.IGNORECASE flag,
which clients sometimes rely on when rendering mentions).
No production code changes.
Lets a custom skill own its full mission lifecycle (e.g. enforce a JIRA outcome comment after every fix mission) without touching Kōan core. Previously hooks could only live in instance/hooks/ and were instance-wide; skill authors had no way to ship pre/post-mission logic alongside their handler.py. HookRegistry now also walks instance/skills/<scope>/<name>/ for files named after a known event (session_start.py, session_end.py, pre_mission.py, post_mission.py) and registers each module's run(ctx) as a handler for that event. Instance-wide hooks still fire first; skill-bound hooks run after. Errors stay isolated per handler. mission_runner._fire_post_mission_hook now pre-reads the truncated Claude stdout via _read_stdout_summary and passes it to hooks as ctx['result_text'], so hooks can parse JIRA keys / PR URLs / RESULT lines without re-implementing file I/O. Makefile gains a test-skills target that auto-discovers instance/skills/**/tests/test_*.py (handles symlinked skill repos via find -L) and runs them with KOAN_REPO + PYTHONPATH set up. It chains into make test and make test-strict so skill regressions surface in the regular suite; skipped cleanly when no skill tests exist. instance.example/hooks/README.md documents both hook flavors, the new result_text ctx key, and the convention for shipping tests alongside a skill (tests/conftest.py template + pytest invocation recipes). Tested: 224 koan tests pass (52 existing hook + 9 new skill-bound discovery + 163 mission_runner); 6 pre-existing environment failures are unrelated (confirmed via git stash). Tests cover discovery, ordering, error isolation, ctx contents, and the missing-skills-dir edge case.
Project tag regex `[a-zA-Z0-9_-]+` rejected domain-like names such as `developers.esphome.io`. `is_skill_mission` then returned False for `[project:developers.esphome.io] /review …`, missions fell through to the agent loop, and ran with the wrong working directory. Add `.` to the character class wherever a `[project:X]` tag is parsed or stripped (8 modules + dashboard HTML). Same fix for `_PROJECT_WORD_RE` so raw word prefixes (`developers.esphome.io /plan …`) work too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Claude provider was passing the agent system prompt (~7 KB and growing) as a positional argv string via `--append-system-prompt`, which exposes the full prompt text in `ps`, process supervisors, and any other tool that snapshots argv. On a shared host or in any hostile environment, that's a needless leak of operator policies and instance learnings. Switch to `--append-system-prompt-file` (documented, print-mode only, which is the only mode Kōan uses). A new `build_full_command_managed()` helper writes the system prompt to a 0600 temp file and returns the command list paired with a cleanup path that the caller MUST unlink after the subprocess exits — same lifecycle pattern already used for stdout/stderr captures and plugin dirs in `run.py` and `session_manager.py`. - Adds `supports_system_prompt_file()` and `build_system_prompt_file_args()` capability methods on the provider base class; Claude opts in, the other providers fall through to the existing prepend-to-user-prompt behavior. - `mission_runner.build_mission_command()` now returns `Tuple[List[str], List[str]]` — call sites in `run.py` and `session_manager.py` unpack and clean up. - Adds `tests/test_provider_system_prompt_file.py` covering the capability flag, file-flag emission, file-mode precedence over inline content, cleanup behavior, and an argv-leak regression test that proves the prompt text never lands in argv. The legacy `build_full_command()` and `--append-system-prompt` path remain available for direct callers that haven't migrated, so this change is additive at the provider API surface. Out of scope (follow-up): the user prompt (~20 KB) is still passed via `-p <text>` in argv. Claude CLI doesn't currently document a clean file/stdin substitute for the user prompt — worth a dedicated probe before refactoring that side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The warning "Claude hit the max turns limit (5). To increase: set
skill_max_turns in instance/config.yaml (current: 5)" was misleading
when fired from chat-style callers (ask, github_reply, github_intent,
spec_generator, deepplan/plan reviewers, implement commit-subject
helper). These callers hardcode max_turns=1/3/5; skill_max_turns
(default 200) does not affect them, so the suggested remedy did
nothing.
Threads a max_turns_source argument through run_command()/
run_command_streaming(). Skill runners keep the default
("skill_max_turns") and are unchanged. Hardcoded-limit callers
pass max_turns_source=None, which produces a clearer message
("This call uses a hardcoded limit and is not configurable.")
instead of pointing the user at an unrelated config key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cold-start notification processing was serial: each notification triggered several sequential `gh` API calls (fetch comment, find_mention_in_thread, check subject state, mark read, react). With a 24h lookback returning 10+ notifications, this added 5-20s of wall-clock latency before the first iteration could plan. - New `github.parallel_workers` config (default 4, range 1-16) controls the worker pool used by `_process_notifications_concurrent`. - Per-notification work is I/O bound (subprocess + HTTP) so threads scale near-linearly. workers=1 keeps the original serial path. - Existing thread-safe primitives cover the shared state: cache lock, atomic mission writes, lock-guarded backoff counters.
classify_cli_error uses loose quota patterns (e.g. "rate limit", "too many requests") on combined stdout+stderr, causing false QUOTA classification when Claude's response discusses API rate limiting. These tests demonstrate the bug before the fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
classify_cli_error() used loose quota patterns (e.g. "rate limit", "too many requests", "usage limit") on combined stdout+stderr. When Claude's response discussed API rate limiting, this falsely triggered QUOTA classification, causing spurious mission requeueing and pauses. Apply the same split-detection strategy already used by handle_quota_exhaustion in quota_handler.py: strict patterns only for stdout, all patterns for stderr. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of changes: - **Fixed CI regression in `classify_cli_error()`** (`koan/app/cli_errors.py`): The split stdout/stderr quota check broke `test_failure_raises_runtime_error`, which passes a `MagicMock` for stdout. The original combined f-string implicitly coerced non-string values; the new direct `re.search(stdout)` call did not. Added explicit `str()` coercion for both `stdout` and `stderr` before regex matching, restoring the defensive behavior while keeping the false-positive fix intact. Addresses @atoomic's request to view and fix the CI failure.
Skill runners (rebase_pr, recreate_pr, squash_pr) produced zero stdout
during execution — all prints went to stderr. The liveness watchdog
(600s timeout) only resets on stdout lines, so long-running Claude CLI
calls caused the subprocess to be killed before completion.
Add print("[skill] ...", flush=True) heartbeat statements before every
blocking operation (Claude invocations, API calls, git operations) to
keep the watchdog satisfied.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nantys-oss#1307) Adds a rolling burn-rate estimator that tracks the percentage of session quota consumed per minute, computes time-to-exhaustion, and lets Koan act on the projection before the wall. - `burn_rate.py` maintains a 20-sample circular buffer in `instance/.burn-rate.json` plus a `last_warned_at` cursor so warnings fire at most once per quota cycle. - `mission_runner.update_usage` records a sample after every run; the cost is the percentage of `session_token_limit` consumed by that run. - `UsageTracker.decide_mode()` consults the buffer through a new `instance_dir` argument; if projected exhaustion is < 30 min, it drops one tier (deep→implement→review) and surfaces the downgrade in the decision reason. - `iteration_manager` checks each iteration whether projected exhaustion is < 60 min while the next reset is still > 2 h away, and if so emits a one-shot Telegram alert via the outbox. - `/quota` now prints the live burn rate (%/h) and estimated time to exhaustion when there is enough history. Tests cover buffer trimming, persistence, edge cases (no history, zero span, invalid values), mode multipliers, and the tracker downgrade integration. Full suite passes (12364 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skills can now declare Python package dependencies via a `requirements:` field in their SKILL.md frontmatter. Missing packages are auto-installed via pip before the handler's first execution in a session. - Parse `requirements: [pkg1, pkg2]` in frontmatter (inline list or single string) - Check importability before installing (fast path for satisfied deps) - Support version specifiers (>=, ==, <) - Cache per-skill per-session to avoid repeated checks - Return SkillError on install failure (surfaced to Telegram) - Document the field in koan/skills/README.md Closes Anantys-oss#1245 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup
All three review items have been addressed. Here's the summary:
- **Pip flag injection validation** (blocking): Added early validation in `ensure_requirements()` that rejects any requirement entry starting with `-`, preventing pip CLI flag injection (e.g. `--index-url`) via crafted SKILL.md files. Added `test_ensure_requirements_rejects_flag_injection` test.
- **Incomplete version specifier stripping** (important): Replaced chained `.split(">=")[0].split("==")[0].split("<")[0]` with `re.split(r'[><=!~]', pkg)[0]` to handle all PEP 440 operators (`~=`, `<=`, `!=`, `===`, etc.). Added `test_ensure_requirements_handles_tilde_specifier` test.
- **Fragile test state management** (important): Added `_reset_requirements_cache()` helper in `skills.py` and an `autouse` pytest fixture in `TestSkillRequirements` that clears the cache before/after each test. Removed all manual `try/finally` blocks and direct `_requirements_satisfied` manipulation from tests.
Adds ruff lint config with the PERF ruleset, excludes tests/ from PERF checks (fixture loops add little perf value), and rewrites the production + skill violations: list append in for-loop → list.extend / comprehension, dict iterator key→value, dict-build loop → dict comprehension, slice copy → list.copy. All 12633 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`gh run list --branch X --limit 1` was returning the "Dependabot auto-merge" workflow (conclusion=skipped) on non-Dependabot PRs, which check_ci_status, wait_for_ci, and check_existing_ci all treated as a failure. drain_one then kept injecting /ci_check fix missions against a PR whose real CI was green — see aio-libs/yarl#1681 for the field report. Add aggregate_ci_runs(runs) and fetch_branch_ci_runs(branch, repo) helpers in claude_step. The aggregator filters out skipped, cancelled, neutral, and action_required conclusions, then reduces the remaining runs to (status, run_id) with failure > pending > success priority. All three CI status callers now share the same filtering, fetching up to 20 runs per branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When GitHub's Private Vulnerability Reporting is enabled on a target repo, /security_audit now submits critical/high findings as private security advisories instead of public issues. Lower-severity findings remain public. Configurable per-project via projects.yaml security section (pvrs mode + threshold). Graceful fallback to public issues on any PVRS API failure. Implements: Anantys-oss#1341 Plan: PR Anantys-oss#1269 (docs/plan-pvrs-security-audit.md) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensure all remote tracking refs for the base branch are fresh before any mission work begins — both new branches and rebases. Two complementary changes: - git_prep: after primary sync, fetch base from all secondary remotes - claude_step: pre-fetch all relevant remotes before the rebase loop Addresses review feedback on PR Anantys-oss#1333 about keeping main up to date regardless of its name before working on branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a Linting section to CLAUDE.md documenting ruff as the project's linter, and add a `make lint` Makefile target so contributors can check compliance before pushing. Addresses review feedback on Anantys-oss#1345. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds three behavioral tests against claude_step.run_claude that capture the aioesphomeapi #1660 stagnation root cause: 1. test_streams_each_stdout_line_in_real_time — child stdout must reach parent stdout as produced, not buffered until exit. Currently fails because run_claude uses subprocess.run(capture_output=True), which silently buffers, starving the run.py liveness watchdog. 2. test_timeout_kills_process_group — backgrounded grandchildren must die when run_claude times out. Currently fails because subprocess.run only kills the immediate child; the session group survives. 3. test_collected_output_still_available_in_return — existing return-shape contract (result['output'] holds full text) must be preserved by the streaming impl. The pre-existing test_timeout_returns_before_grandchild_dies passes on Python 3.12, so the original grandchild-pipe-block hypothesis was not the root cause — silent capture is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_claude in claude_step.py used subprocess.run(capture_output=True) via run_cli_with_retry. That silently buffered Claude's stdout for the entire call, so any skill subprocess wrapping it (rebase, recreate, review_runner, …) produced zero output to its own stdout while Claude was running. run.py's first_output_timeout watchdog (default 600s) killed the skill when the wrapped Claude call legitimately took longer than the watchdog window, leaving the user with a stagnation timeout and no diagnostic — the exact failure mode seen on aioesphomeapi #1660 during the already-solved check. Replace the call path with popen_cli + start_new_session=True, stream each stdout line through to the parent's stdout (resetting any outer liveness watchdog), and kill the entire process group with SIGKILL on timeout to reclaim grandchildren (MCP servers, hooks) that would otherwise hold inherited pipes open. The retry layer from run_cli_with_retry is intentionally dropped here: classify-and-retry only makes sense for short probe calls where the caller can't observe failure live. With streaming, the caller sees failures immediately and can decide. Existing rebase/recreate/review callers already handle a single failure result correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
You made the pr to the fork instead of upstream |
Owner
Author
|
landed on the wrong repo |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Make
claude_step.run_claudestream child stdout in real time and kill the entire process group on timeout, instead of silently capturing output viasubprocess.run(capture_output=True).Why
Investigated the aioesphomeapi #1660 rebase stagnation. The mission log showed:
Two
run_claudefunctions exist:run.py::run_claude_task(mission loop) — already usespopen_cli + start_new_session=Truewith line streaming and process-group kill.claude_step.py::run_claude(called by rebase, recreate, review_runner, squash, contemplative, etc.) — usedsubprocess.runviarun_cli_with_retry, which captures the entire stdout silently.When a rebase skill subprocess invoked
_check_if_already_solvedor_apply_review_feedback, the inner Claude call could legitimately take many minutes, and during that time the skill's own stdout produced no lines.run.py'sfirst_output_timeoutwatchdog (default 600s) then killed the whole skill chain, surfacing as a stagnation timeout with no diagnostic.How
run_cli_with_retryforclaude_step.run_claudeand replace withpopen_cli + start_new_session=True.threading.Timercallsos.killpg(SIGKILL)on the new session so MCP servers / hooks / any grandchildren also die.{success, output, error}) — existing callers (rebase, recreate, review, claudemd_refresh, contemplative, squash, post_mission_reflection) are untouched.Testing
KOAN_ROOT=/tmp/test-koan .venv/bin/pytest koan/tests/→ 12712 passed.make lint→ clean.test_claude_step.py::TestRunClaudeStreamsOutputcover: line-by-line streaming with arrival-time assertions, process-group kill (grandchild marker file must not appear after timeout), timeout return shape, and preservedresult['output']contract.TestRunClaude(in bothtest_claude_step.pyandtest_pr_review.py) to mockpopen_cli— caught a pre-existing leak wheretest_pr_review.py::TestRunClaudehad been patching the wrong subprocess path and actually invoking real Claude during tests.🤖 Generated with Claude Code