5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -4,8 +4,13 @@ All notable user-visible changes to CASCADE are documented here. The format is l

## Unreleased

### Added

- **Alerting agent now investigates Sentry alerts and files bug investigation work items** (spec 018, plan 1 of 2). The `alerting` agent had been wired end-to-end except for its system prompt template — definition YAML, capabilities, trigger handlers, context pipeline, and Sentry integration were all in place, but `src/agents/prompts/templates/alerting.eta` was missing, so the worker crashed at agent boot with `ENOENT` when the first prod-traffic Sentry alert arrived (cascade project, 2026-05-06). This plan ships the prompt: a three-phase investigator (parse pre-loaded event → confirm root cause via source reads → file or comment) with an explicit `INVESTIGATE-AND-FILE-ONLY` guardrail. The agent does not edit source, commit, push, or open PRs — that property is enforced at the capability layer (no `fs:write`, no `scm:*`), pinned by a static test that asserts the resolved gadget allowlist excludes `WriteFile`, `CreatePR`, and `CreatePRReview`. When the trigger context provides an existing work item, the agent comments on it; otherwise it creates a new bug investigation work item in the configured backlog. Output structure is predictable: `Investigate: <ErrorType> in <Function> (<file>:<line>)` title and a 4-6 sentence + bullets description. Engine-agnostic prose; reuses `partials/environment` for the shared preamble. See [spec 018](docs/specs/018-alerting-agent-and-worker-boot-visibility.md.done). Plan 2 of 2 closes the silent-failure path that masked this gap (worker boot failures will produce visible failed run rows, exit code 2, Sentry capture under `worker_boot_failure`).
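
The file-or-comment decision and the title format described in the entry above can be sketched roughly as follows. This is an illustration only, not the shipped prompt or agent code: `ParsedSentryEvent`, `TriggerContext`, `AlertingAction`, `formatInvestigationTitle`, and `chooseAction` are hypothetical names, and the field layout is assumed.

```typescript
// Hypothetical shapes; none of these names are taken from the CASCADE source.
interface ParsedSentryEvent {
  errorType: string;    // e.g. "TypeError"
  functionName: string; // function of the top in-app stack frame
  file: string;
  line: number;
}

interface TriggerContext {
  event: ParsedSentryEvent;
  existingWorkItemId?: string; // present when the trigger already maps to a work item
}

type AlertingAction =
  | { kind: "comment"; workItemId: string; body: string }
  | { kind: "create"; title: string; body: string };

// Builds the title shape named in the entry:
// Investigate: <ErrorType> in <Function> (<file>:<line>)
function formatInvestigationTitle(e: ParsedSentryEvent): string {
  return `Investigate: ${e.errorType} in ${e.functionName} (${e.file}:${e.line})`;
}

// Comment on the existing work item when one is provided, otherwise file a new one.
// Neither branch touches source files, commits, or PRs.
function chooseAction(ctx: TriggerContext, findings: string): AlertingAction {
  if (ctx.existingWorkItemId !== undefined) {
    return { kind: "comment", workItemId: ctx.existingWorkItemId, body: findings };
  }
  return { kind: "create", title: formatInvestigationTitle(ctx.event), body: findings };
}
```

Returning a tagged union keeps the two outcomes explicit, which is one way a test could assert that no third, write-capable action exists.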

### Changed

- **Worker boot-failure visibility: boot-time agent failures now produce visible failed runs** (spec 018, plan 2 of 2). The worker now creates the `agent_runs` row before plan resolution so template-load, model-resolution, context-pipeline, definition-lookup, and run-record failures are not silently converted into invisible successful jobs. Boot failures are marked `failed` with the structured cause in the run row, captured to Sentry under `worker_boot_failure`, and re-thrown so the worker exits with code `2`; ordinary in-execution crashes keep the existing exit/result semantics. The router crash-reason formatter labels exit code `2` as `Worker boot failed`, Sentry-driven alerting runs receive stable synthesized `workItemId`s (`sentry:issue:<id>` / `sentry:metric:<org>:<title>`), and a conformance test now fails CI when a YAML-registered agent type has no matching prompt template. See [spec 018](docs/specs/018-alerting-agent-and-worker-boot-visibility.md.done).
- **Pipeline-capacity gate now enforces `maxInFlightItems` for PM `status-changed` triggers** (spec 017, plan 2 of 3). The gate at `src/triggers/shared/pipeline-capacity-gate.ts` is the hard cap on the active pipeline (TODO + IN_PROGRESS + IN_REVIEW work items) introduced after a prior incident where a human moved three cards into TODO simultaneously and three concurrent implementation runs fired against a project pinned to `maxInFlightItems: 1`. The gate calls `getPMProvider()` to count in-flight items, but for every PM `status-changed` trigger the call threw `No PMProvider in scope` because the three PM router adapters (`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in their per-PM-type credential `AsyncLocalStorage` scope but NOT in PM-provider scope (the GitHub adapter at `src/router/adapters/github.ts:280` already had both wrappings). The gate fell through to its conservative branch (`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` and `return false`) — silently no-op for the only triggers that actually need it. 32 occurrences/day on cascade-router (verified 2026-04-29). The fix introduces a shared helper `withPMScopeForDispatch(project, dispatch)` at `src/router/adapters/_shared.ts` that the three PM router adapters consume, mirroring the GitHub adapter's correct shape. The gate's "PM provider unavailable" branch is converted from `WARN + return false` (allow) to ERROR-level + Sentry capture under stable tag `pipeline_capacity_gate_no_pm_provider` + `return true` (block) — once the routine path establishes scope, hitting that branch is a real `AsyncLocalStorage` scope leak operators need to investigate. A static-guard test at `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts` enforces the wrapping invariant per adapter; CLAUDE.md gains a "Capacity-gate invariant" passage in the Architecture section. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
- **PM-ack dispatch consolidation: Linear-based PM-focused agents now post their PM-side ack comment** (spec 017, plan 1 of 3). PM-focused agents (e.g. `backlog-manager`) triggered from a GitHub webhook used to silently skip their PM-side ack on Linear projects: the router-adapter's local `postPMAck` helper had `if (pmType === 'trello')` / `if (pmType === 'jira')` branches but no Linear branch, so Linear-based projects fell through to a `WARN: Unknown PM type for PM-focused agent ack, skipping` and never saw the "🔧 On it" comment that Trello/JIRA projects got (24 silent skips per day on cascade-router, all from `ucho`, verified 2026-04-29). A near-identical helper at `src/triggers/shared/pm-ack.ts` already had the Linear branch — pure parallel-path drift. The fix introduces a single consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts` that indexes the manifest registry directly and invokes `manifest.platformClientFactory(projectId).postComment(...)` — no per-PM-type literal branching anywhere on the dispatch surface. Both legacy call sites delegate. The PM manifest conformance harness gains a per-provider `dispatchPMAck reaches this provider without throwing` assertion, and a static-guard test pins "no `pmType === '<literal>'` branching" against all three call sites; adding a future PM provider to the registry lands the dispatch path for free. Genuinely-unknown PM types (configuration error: project pinned to a deleted provider) now log at ERROR + capture to Sentry under stable tag `pm_ack_unknown_pm_type` instead of a silent WARN. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
- **Progress-comment lifecycle: post-agent cleanup hook now skips when an in-run gadget already deleted the comment** (spec 017, plan 3 of 3). The post-agent `deleteProgressCommentOnSuccess` hook used to read `sessionState.initialCommentId`, fall back to `result.agentInput.ackCommentId` when session state was empty, and issue a redundant DELETE — but "session state cleared by a gadget" was indistinguishable from "session state never populated", so the fallback fired and re-deleted comments that were already gone. GitHub returned 404 and `WARN: Failed to delete progress comment after agent success` was logged 72 times per day on cascade-router (live audit on 2026-04-29). Adds an explicit `initialCommentIdConsumed: boolean` flag on `SessionStateData`. Both `deleteInitialComment` (gadget-driven) and `clearInitialComment` (sidecar-driven) now set the flag to `true` after disposing of the comment. The post-agent hook checks the flag first and skips the entire deletion path — including the legacy `agentInput.ackCommentId` fallback — when consumed. As defense in depth, `githubClient.deletePRComment` now treats HTTP 404 as success (RFC-7231 idempotency) and logs at DEBUG instead of letting the error bubble as a WARN; other HTTP errors (5xx, 401, network) continue to throw. The legacy fallback to `agentInput.ackCommentId` continues to work for code paths that never populate session state. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
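
The progress-comment behaviour in the last entry above can be sketched as below. This is a simplified illustration under assumptions: the names `SessionStateData`, `initialCommentIdConsumed`, and `deleteProgressCommentOnSuccess` come from the entry, but the signatures, the `hasStatus` and `markConsumed` helpers, the return values, and the error shape (an object with a numeric `status`) are hypothetical.

```typescript
// Simplified stand-in for the session state named in the entry; field layout is assumed.
interface SessionStateData {
  initialCommentId?: number;
  initialCommentIdConsumed: boolean;
}

type DeleteFn = (commentId: number) => Promise<void>;
type HookResult = "skipped" | "deleted" | "already-gone" | "nothing-to-delete";

// Assumption: HTTP failures surface as objects carrying a numeric `status`.
function hasStatus(err: unknown, status: number): boolean {
  return typeof err === "object" && err !== null &&
    (err as { status?: unknown }).status === status;
}

// Idempotent delete: a 404 means the comment is already gone, so report success.
// Any other failure (5xx, 401, network) is re-thrown.
async function deleteCommentIdempotent(
  rawDelete: DeleteFn,
  commentId: number,
): Promise<"deleted" | "already-gone"> {
  try {
    await rawDelete(commentId);
    return "deleted";
  } catch (err) {
    if (hasStatus(err, 404)) return "already-gone";
    throw err;
  }
}

// What a gadget or sidecar would do after disposing of the comment mid-run.
function markConsumed(state: SessionStateData): void {
  delete state.initialCommentId;
  state.initialCommentIdConsumed = true;
}

// Post-agent hook: check the flag first, then fall back to the legacy ack id.
async function deleteProgressCommentOnSuccess(
  state: SessionStateData,
  ackCommentId: number | undefined,
  rawDelete: DeleteFn,
): Promise<HookResult> {
  if (state.initialCommentIdConsumed) return "skipped";
  const id = state.initialCommentId ?? ackCommentId;
  if (id === undefined) return "nothing-to-delete";
  return deleteCommentIdempotent(rawDelete, id);
}
```

The explicit flag is what separates "cleared by a gadget" from "never populated"; an empty `initialCommentId` alone cannot carry that distinction.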
3 changes: 2 additions & 1 deletion README.md
@@ -39,7 +39,7 @@ For the full setup walkthrough — projects, credentials, webhooks, and triggers
## ⚡ Features

- **Multi-PM support** — Works with Trello, JIRA, and Linear out of the box
- **11 agent types** — Splitting, planning, implementation, review, debug, respond-to-review, respond-to-CI, and more
- **12 agent types** — Splitting, planning, implementation, review, debug, respond-to-review, respond-to-CI, alerting, and more
- **Dual-persona GitHub model** — Separate implementer and reviewer bot accounts to prevent feedback loops
- **Web dashboard + CLI** — Monitor runs, manage projects, configure triggers
- **Extensible trigger system** — Add new events without touching core logic
@@ -78,6 +78,7 @@ Cascade runs as three independent services:
| `debug` | Session log uploaded | Analyzes agent session logs and creates a debug card |
| `resolve-conflicts` | Merge conflict detected | Resolves git merge conflicts |
| `backlog-manager` | Scheduled / manual | Manages and prioritizes the backlog |
| `alerting` | Sentry alert webhook | Investigates the alert (parses stacktrace, reads source) and files a bug investigation work item or comments on an existing one. Read-only — never edits source, opens PRs, or pushes commits. |

---
