mongrel-intelligence · zbigniewsobiecki · May 6, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,8 +4,13 @@ All notable user-visible changes to CASCADE are documented here. The format is l
 
 ## Unreleased
 
+### Added
+
+- **Alerting agent now investigates Sentry alerts and files bug investigation work items** (spec 018, plan 1 of 2). The `alerting` agent had been wired end-to-end except for its system prompt template: definition YAML, capabilities, trigger handlers, context pipeline, and Sentry integration were all in place, but `src/agents/prompts/templates/alerting.eta` was missing, so the worker crashed at agent boot with `ENOENT` when the first prod-traffic Sentry alert arrived (cascade project, 2026-05-06). This plan ships the prompt: a three-phase investigator (parse pre-loaded event → confirm root cause via source reads → file or comment) with an explicit `INVESTIGATE-AND-FILE-ONLY` guardrail. The agent does not edit source, commit, push, or open PRs; that property is enforced at the capability layer (no `fs:write`, no `scm:*`) and pinned by a static test that asserts the resolved gadget allowlist excludes `WriteFile`, `CreatePR`, and `CreatePRReview`. When the trigger context provides an existing work item, the agent comments on it; otherwise it creates a new bug investigation work item in the configured backlog. Output structure is predictable: a title of the form `Investigate: <ErrorType> in <Function> (<file>:<line>)` and a description of 4-6 sentences plus bullets. The prompt prose is engine-agnostic and reuses `partials/environment` for the shared preamble. See [spec 018](docs/specs/018-alerting-agent-and-worker-boot-visibility.md.done). Plan 2 of 2 closes the silent-failure path that masked this gap (worker boot failures will produce visible failed run rows, exit code 2, and Sentry capture under `worker_boot_failure`).
+
 ### Changed
 
+- **Worker boot-failure visibility: boot-time agent failures now produce visible failed runs** (spec 018, plan 2 of 2). The worker now creates the `agent_runs` row before plan resolution so template-load, model-resolution, context-pipeline, definition-lookup, and run-record failures are not silently converted into invisible successful jobs. Boot failures are marked `failed` with the structured cause in the run row, captured to Sentry under `worker_boot_failure`, and re-thrown so the worker exits with code `2`; ordinary in-execution crashes keep the existing exit/result semantics. The router crash-reason formatter labels exit code `2` as `Worker boot failed`, Sentry-driven alerting runs receive stable synthesized `workItemId`s (`sentry:issue:<id>` / `sentry:metric:<org>:<title>`), and a conformance test now fails CI when a YAML-registered agent type has no matching prompt template. See [spec 018](docs/specs/018-alerting-agent-and-worker-boot-visibility.md.done).
 - **Pipeline-capacity gate now enforces `maxInFlightItems` for PM `status-changed` triggers** (spec 017, plan 2 of 3). The gate at `src/triggers/shared/pipeline-capacity-gate.ts` is the hard cap on the active pipeline (TODO + IN_PROGRESS + IN_REVIEW work items) introduced after a prior incident where a human moved three cards into TODO simultaneously and three concurrent implementation runs fired against a project pinned to `maxInFlightItems: 1`. The gate calls `getPMProvider()` to count in-flight items, but for every PM `status-changed` trigger the call threw `No PMProvider in scope` because the three PM router adapters (`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in their per-PM-type credential `AsyncLocalStorage` scope but NOT in PM-provider scope (the GitHub adapter at `src/router/adapters/github.ts:280` already had both wrappings). The gate fell through to its conservative branch (`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` and `return false`) — silently no-op for the only triggers that actually need it. 32 occurrences/day on cascade-router (verified 2026-04-29). The fix introduces a shared helper `withPMScopeForDispatch(project, dispatch)` at `src/router/adapters/_shared.ts` that the three PM router adapters consume, mirroring the GitHub adapter's correct shape. The gate's "PM provider unavailable" branch is converted from `WARN + return false` (allow) to ERROR-level + Sentry capture under stable tag `pipeline_capacity_gate_no_pm_provider` + `return true` (block) — once the routine path establishes scope, hitting that branch is a real `AsyncLocalStorage` scope leak operators need to investigate. A static-guard test at `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts` enforces the wrapping invariant per adapter; CLAUDE.md gains a "Capacity-gate invariant" passage in the Architecture section. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
 - **PM-ack dispatch consolidation: Linear-based PM-focused agents now post their PM-side ack comment** (spec 017, plan 1 of 3). PM-focused agents (e.g. `backlog-manager`) triggered from a GitHub webhook used to silently skip their PM-side ack on Linear projects: the router-adapter's local `postPMAck` helper had `if (pmType === 'trello')` / `if (pmType === 'jira')` branches but no Linear branch, so Linear-based projects fell through to a `WARN: Unknown PM type for PM-focused agent ack, skipping` and never saw the "🔧 On it" comment that Trello/JIRA projects got (24 silent skips per day on cascade-router, all from `ucho`, verified 2026-04-29). A near-identical helper at `src/triggers/shared/pm-ack.ts` already had the Linear branch — pure parallel-path drift. The fix introduces a single consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts` that indexes the manifest registry directly and invokes `manifest.platformClientFactory(projectId).postComment(...)` — no per-PM-type literal branching anywhere on the dispatch surface. Both legacy call sites delegate. The PM manifest conformance harness gains a per-provider `dispatchPMAck reaches this provider without throwing` assertion, and a static-guard test pins "no `pmType === '<literal>'` branching" against all three call sites; adding a future PM provider to the registry lands the dispatch path for free. Genuinely-unknown PM types (configuration error: project pinned to a deleted provider) now log at ERROR + capture to Sentry under stable tag `pm_ack_unknown_pm_type` instead of a silent WARN. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
 - **Progress-comment lifecycle: post-agent cleanup hook now skips when an in-run gadget already deleted the comment** (spec 017, plan 3 of 3). The post-agent `deleteProgressCommentOnSuccess` hook used to read `sessionState.initialCommentId`, fall back to `result.agentInput.ackCommentId` when session state was empty, and issue a redundant DELETE — but "session state cleared by a gadget" was indistinguishable from "session state never populated", so the fallback fired and re-deleted comments that were already gone. GitHub returned 404 and `WARN: Failed to delete progress comment after agent success` was logged 72 times per day on cascade-router (live audit on 2026-04-29). Adds an explicit `initialCommentIdConsumed: boolean` flag on `SessionStateData`. Both `deleteInitialComment` (gadget-driven) and `clearInitialComment` (sidecar-driven) now set the flag to `true` after disposing of the comment. The post-agent hook checks the flag first and skips the entire deletion path — including the legacy `agentInput.ackCommentId` fallback — when consumed. As defense in depth, `githubClient.deletePRComment` now treats HTTP 404 as success (RFC-7231 idempotency) and logs at DEBUG instead of letting the error bubble as a WARN; other HTTP errors (5xx, 401, network) continue to throw. The legacy fallback to `agentInput.ackCommentId` continues to work for code paths that never populate session state. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).

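The progress-comment lifecycle entry above (spec 017, plan 3 of 3) rests on two small decisions: skip cleanup entirely once the comment has been consumed, and treat a 404 on DELETE as success. The following is a minimal TypeScript sketch of that logic, not the project's actual code. Only `initialCommentId`, `initialCommentIdConsumed`, and `ackCommentId` come from the changelog; the field types, the `AgentInput` shape, and the function names `commentToDelete` and `classifyDeleteStatus` are hypothetical.

```typescript
// Hypothetical shapes; field types are assumptions made for illustration.
interface SessionStateData {
  initialCommentId: number | null;
  initialCommentIdConsumed: boolean;
}

interface AgentInput {
  ackCommentId?: number;
}

// Decide which comment (if any) the post-agent hook should still delete.
function commentToDelete(
  session: SessionStateData,
  input: AgentInput,
): number | null {
  // A gadget or sidecar already disposed of the comment: skip the whole
  // deletion path, including the legacy ackCommentId fallback.
  if (session.initialCommentIdConsumed) return null;
  // Otherwise prefer session state, then fall back to the legacy field.
  return session.initialCommentId ?? input.ackCommentId ?? null;
}

// Map a DELETE response status to an outcome; 404 counts as success
// because the comment is gone either way.
function classifyDeleteStatus(status: number): "deleted" | "already-gone" {
  if (status >= 200 && status < 300) return "deleted";
  if (status === 404) return "already-gone";
  // 401, 5xx, and other statuses still surface as errors.
  throw new Error(`Failed to delete progress comment: HTTP ${status}`);
}
```

With the flag set, `commentToDelete({ initialCommentId: null, initialCommentIdConsumed: true }, { ackCommentId: 42 })` returns `null`, so no redundant DELETE is issued. With the flag unset and session state empty, the same call returns `42`, preserving the legacy fallback for code paths that never populate session state.
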
diff --git a/README.md b/README.md
@@ -39,7 +39,7 @@ For the full setup walkthrough — projects, credentials, webhooks, and triggers
 ## ⚡ Features
 
 - **Multi-PM support** — Works with Trello, JIRA, and Linear out of the box
-- **11 agent types** — Splitting, planning, implementation, review, debug, respond-to-review, respond-to-CI, and more
+- **12 agent types** — Splitting, planning, implementation, review, debug, respond-to-review, respond-to-CI, alerting, and more
 - **Dual-persona GitHub model** — Separate implementer and reviewer bot accounts to prevent feedback loops
 - **Web dashboard + CLI** — Monitor runs, manage projects, configure triggers
 - **Extensible trigger system** — Add new events without touching core logic
@@ -78,6 +78,7 @@ Cascade runs as three independent services:
 | `debug` | Session log uploaded | Analyzes agent session logs and creates a debug card |
 | `resolve-conflicts` | Merge conflict detected | Resolves git merge conflicts |
 | `backlog-manager` | Scheduled / manual | Manages and prioritizes the backlog |
+| `alerting` | Sentry alert webhook | Investigates the alert (parses stacktrace, reads source) and files a bug investigation work item or comments on an existing one. Read-only — never edits source, opens PRs, or pushes commits. |
 
 ---