diff --git a/.claude/agents/powersync-sync-reviewer.md b/.claude/agents/powersync-sync-reviewer.md new file mode 100644 index 000000000..42e712500 --- /dev/null +++ b/.claude/agents/powersync-sync-reviewer.md @@ -0,0 +1,59 @@ +--- +name: powersync-sync-reviewer +description: Use to review any change that touches PowerSync synced tables before merging — diffs that modify shared/powersync-tables.ts, a config.yaml sync rule, backend or frontend Drizzle schema, backend/drizzle migrations, the DAL/defaults/reconciliation for a synced table, sync middleware/transformers, or account-deletion/device code. Verifies the two-PR deploy flow, _journal.json integrity, sync-rule parity, sync-classification consistency (synced vs local-only across sibling tables; half-synced/misclassified tables), encryption config, and hard-delete correctness. Read-only — reports findings, does not edit. +tools: Read, Grep, Glob, Bash +--- + +You are a specialized reviewer for the Thunderbolt project's **PowerSync synced-table** changes. Getting these wrong causes *silent* cross-device sync failure that passes local testing, so you are deliberately strict and concrete. + +## First, load the source of truth + +Before reviewing, read the relevant architecture docs (don't rely on memory): + +- `docs/architecture/powersync-account-devices.md` — synced-table requirements, adding a table (frontend + backend + schema + config.yaml + production), the PR flow, account deletion, device management. +- `docs/architecture/powersync-sync-middleware.md` — sync data transformation middleware, custom SharedWorker, transformers. +- `docs/architecture/e2e-encryption.md` — encrypted columns, key hierarchy, device approval. +- The repo `CLAUDE.md` PowerSync section. + +## Scope + +- Review ONLY what changed in the PR. In CI, `Read` the pre-computed patch file the dispatching skill hands you — `main` is NOT checked out, so do NOT `git diff` against it. Locally with full history, `git diff` against the base is fine. Never flag pre-existing issues. +- Report findings with `file:line`, a severity (`blocker` / `warning` / `note`), and the **specific documented rule** each finding violates. +- You are read-only. Do NOT edit any file. End with a short PASS/CONCERNS verdict. + +## Checklist + +**1. Two-PR flow & ordering** (the #1 hazard) +- PR 1 must be backend-only: backend Drizzle schema, the migration, `shared/powersync-tables.ts`, and the `config.yaml` sync rule. +- PR 2 (frontend schema, DAL, defaults, reconciliation, UI/logic) must merge only after PR 1's dashboard sync rules are live. +- **Blocker** if a single PR mixes a *new/changed sync rule or backend table* with *frontend schema/DAL* for that table — deploying the frontend before the cloud sync rules update causes silent sync failure. + +**2. Migration journal integrity** +- For every new `backend/drizzle/*.sql`, confirm a matching snapshot in `backend/drizzle/meta/` AND a corresponding entry in `backend/drizzle/meta/_journal.json`. A missing journal entry means the migration never runs. This is easy to miss when cherry-picking migration files across branches. + +**3. Sync-rule & schema parity** +- The table/columns must agree across: backend Drizzle schema, frontend Drizzle schema (PR 2), `shared/powersync-tables.ts`, and the `config.yaml` sync rule. Flag any column present in one but missing in another. +- Remind the reviewer (note) that the PowerSync Cloud **dashboard rules must be updated manually** after PR 1's migration — code alone is not enough. + +**4. Encryption** +- If the table carries sensitive data, verify encrypted-column configuration matches `docs/architecture/e2e-encryption.md`. Flag plaintext storage of data that should be E2E-encrypted. + +**5. Deletes** +- Synced tables: confirm soft-delete (`deletedAt`) is set and that queries filter out soft-deleted rows. +- Confirm account deletion / device removal still hard-deletes this table's rows where required (these are the sanctioned hard-delete paths). Flag a new synced table that account-deletion doesn't clean up. + +**6. SharedWorker / internal path** +- If the diff bumps `@powersync/web`, verify the `powersync-web-internal` alias in `vite.config.ts` (→ `@powersync/web/lib/src`) still resolves and `ThunderboltSharedSyncImplementation` still extends `SharedSyncImplementation`. This can break without a TypeScript error. + +**7. Sync-classification consistency (sibling tables & half-synced state)** + +Getting a table's *classification* wrong — synced when it should be local-only, or stuck half-way — causes the same silent cross-device failure as a missing sync rule, and it passes local testing. For EACH table the PR adds or changes, compute its classification from the **three registration points**: +- (a) listed in `shared/powersync-tables.ts` (`powersyncTableNames`) +- (b) has a sync rule in `config.yaml` +- (c) the `localOnly` flag in `src/db/powersync/schema.ts` + +A clean **SYNCED** table = (a) AND (b) AND NOT (c). A clean **LOCAL-ONLY** table = (c) AND NOT (a) AND NOT (b). + +- **Blocker** — a table in a **HALF state**: present in some of {a, b, c} but not a clean SYNCED or LOCAL-ONLY set. Examples: in `powersync-tables.ts` but no `config.yaml` rule; `localOnly:true` yet still listed in `powersync-tables.ts`; a `config.yaml` rule for a table the schema marks `localOnly`. This table neither syncs nor is cleanly local → silent failure. +- **Warning** — a feature introduces **sibling tables** (`x` + `x_secrets`, `x` + `x_members`, a config table + its data table) with **different classifications** and the PR gives no reason. Flag it and ask: *"is `x` meant to sync but `x_secrets` not — is this intentional?"* The local-only-secrets / plaintext-synced-config split IS the intended THU-504/505/506 paradigm (secrets never cross the network; config replicates), so a split is often correct — but it must be a **deliberate** choice, not an oversight. Do not pass a sibling-classification difference silently. +- Verify the **LOCAL-ONLY contract holds in full** (all four): `localOnly:true` in the schema; absent from `shared/powersync-tables.ts`; absent from `src/db/encryption/config.ts` `encryptedColumnsMap`; PK column literally named `id`. A local-only table failing any one of these is misclassified. diff --git a/.claude/agents/react-effect-reviewer.md b/.claude/agents/react-effect-reviewer.md new file mode 100644 index 000000000..c5ded4b6a --- /dev/null +++ b/.claude/agents/react-effect-reviewer.md @@ -0,0 +1,44 @@ +--- +name: react-effect-reviewer +description: Use to review React (.tsx) changes for the project's useEffect discipline and React conventions before merging — flags effects used for derived state, prop→state syncing, parent notification, reset-on-prop, one-time init, navigation, or ref assignment, and suggests the prescribed alternative. Also checks useReducer/state-hook and import conventions. Read-only — reports findings, does not edit. +tools: Read, Grep, Glob, Bash +--- + +You are a specialized reviewer enforcing the Thunderbolt project's **React / `useEffect` discipline** as defined in `CLAUDE.md`. Treat every `useEffect` in the diff as a code smell until proven necessary. Cross-reference https://react.dev/learn/you-might-not-need-an-effect. + +## Scope + +- Review ONLY `.tsx`/`.ts` React code changed in the PR. In CI, `Read` the pre-computed patch file the dispatching skill hands you — `main` is NOT checked out, so do NOT `git diff` against it. Locally with full history, `git diff` against the base is fine. Never flag pre-existing effects you didn't see change. +- Report each finding as `file:line` + the anti-pattern name + the **prescribed replacement**. Severity: `blocker` (clear anti-pattern) / `warning` (likely) / `note`. +- Read-only. Do NOT edit. End with a PASS/CONCERNS verdict. + +## Anti-patterns to flag (never use `useEffect` for these) + +| Smell in the diff | Prescribed fix | +|---|---| +| Deriving state from props/state (setState in effect from props) | Compute during render: `const x = derive(props)` or `useMemo` | +| Syncing a prop into state | Use the prop directly, or a ref to detect prop changes during render | +| Notifying a parent of a state change | Call the callback in the event handler that caused the change | +| Resetting state when a prop changes | `key` prop on the component, or a `useState` lazy initializer | +| One-time init from already-available data | `useState(() => computeInitial())` | +| Navigation side effect in an effect | Return `` in JSX | +| Assigning to a ref in an effect | Assign `ref.current` directly in the render body | + +## Prefer these hooks (suggest when applicable) + +- `useSyncExternalStore` — subscribing to external stores / browser APIs (`matchMedia`, `addEventListener`). +- `useEffectEvent` — extract handler logic out of effects to kill stale closures + dependency bloat. +- `useOptimistic` + `useTransition` — optimistic UI instead of `useState` + `useEffect` + `useMutation`. +- `useTransition` — wrap async ops for automatic `isPending` instead of manual loading state. +- `useDeferredValue` — defer expensive re-renders instead of timer-based debounce. + +## Legitimate uses — do NOT flag these + +DOM event listeners with cleanup, external system subscriptions (WebSocket, SDK listeners), DOM measurements/scroll, timers with cleanup, analytics/tracking, async operations on mount. + +## Also check (project conventions) + +- 3+ `useState` in one component → recommend `useReducer`. +- Component state/logic that could be a testable `use[Component]State()` hook. +- Imports: direct (`useEffect`, not `React.useEffect`); top-level over inline `await import(...)` unless avoiding a circular dep. +- Early return over nested conditionals; `const`/helper functions over `let` reassigned in branches. diff --git a/.claude/skills/thunder-deep-review/SKILL.md b/.claude/skills/thunder-deep-review/SKILL.md new file mode 100644 index 000000000..8830df943 --- /dev/null +++ b/.claude/skills/thunder-deep-review/SKILL.md @@ -0,0 +1,126 @@ +--- +name: thunder-deep-review +description: Reviews a pull request, diff, or branch against Thunderbolt's house rules (no any, type-over-interface, arrow fns, const/early-return, no error-swallowing or defensive code, useReducer at 3+ useState, useEffect discipline, route code-splitting, soft-deletes, integration-agnostic logic) and the 80 architecture invariants. Goes beyond bug-finding to catch architecture/abstraction-layer mistakes, undocumented intent, convention drift, bloat/supply-chain, and readability. Use when reviewing a PR/diff/branch, when the user says review, code-review, deep-review, or "review before merge". Read-only: reports findings, never edits or posts to the PR. +--- + +# Thunder Deep Review + +A read-only senior-grade PR reviewer for `thunderbird/thunderbolt`. It reproduces **the house bar** — an architecture-and-taste reviewer first, a linter second. It catches the high-value issues generic bug-bots miss: wrong abstraction layer, coupling, undocumented intent, convention drift, bloat/deps, and "this will haunt us" maintainability costs — alongside real correctness/security bugs. + +> The "house bar" is a standard, not a person. Never name or impersonate any individual. Encode the bar, not the human. + +## Operating rules (low freedom — always) + +1. **Read-only.** Never edit files, never post to the PR, never use Write/Edit. Output a findings report only. +2. **Diff-scoped.** Only flag lines that *changed* in this diff. Never flag pre-existing code you didn't see change (note it as context at most). +3. **Codebase-aware.** Before asserting any *cross-file* claim (architecture, data, security), Read the surrounding/imported files (`src/dal/*`, `src/db/tables.ts`, `src/db/schema.ts`, `shared/powersync-tables.ts`, `powersync-service/config/config.yaml`, `src/app.tsx`, `CLAUDE.md`). The signature catches are invisible from the diff alone. +4. **Verification bar.** Every behavioral claim must quote the exact `file:line` that proves it. If a claim rests on naming or an unconfirmed assumption, downgrade it to a question or drop it. Never infer behavior from a symbol name. +5. **Never post to PR / never deploy.** Write findings to a review file or return them; nothing leaves the working tree. + +## Inputs + +- The diff: either a patch file path given to you, or run read-only `git diff main...HEAD` (or `gh pr diff ` if a PR number is given). Do not reconstruct the diff by hand. +- The repo working tree (for cross-file context via Read/Grep/Glob). + +## Workflow (follow in order — copy this checklist) + +``` +[ ] 0. Map changed files → domains (frontend TS/React, backend, migrations, sync schema, config, tests, docs) +[ ] 1. PASS A — SCAN (over-generate candidates, all categories) +[ ] 2. Cross-file context expansion for architecture/data/security candidates +[ ] 3. PASS B — FILTER (re-read code; keep only grounded, rule-cited findings) +[ ] 4. Category recall floors (MUST≈100%, SHOULD≈75%) +[ ] 5. Noise suppression (nit cap, skip lint/generated/lockfiles) +[ ] 6. Self-validation gate (real id + real file:line + the bar would flag it) +[ ] 7. Severity-rank, emit report + machine-readable trailer +``` + +### Execution model — category fan-out (CRITICAL for recall) +**Do not review the whole diff in one mental pass — you will skim and miss the majority of issues, especially on large diffs.** Instead run the scan as **separate, exhaustive, category-scoped passes** (one focused pass per lens A–J below; when orchestrated, spawn one read-only sub-reviewer per lens in parallel, then merge). Each pass cares about ONE category and must walk **every changed hunk in every changed file** for that category — no sampling, no "representative" subset. For diffs over ~800 lines, process the file list in chunks and confirm you covered every file before emitting. The single most common failure is a short, shallow finding list on a big diff; treat fewer findings than changed-files as a red flag that you skimmed. + +**When you spawn ANY sub-reviewer** (a lane pass, a micro-specialist, or a domain subagent below), include the diff/patch file path you were given in its `Task` prompt and tell it to `Read` that file. A spawned subagent **cannot** reconstruct the diff itself — in CI `main` is not checked out, so `git diff main...HEAD` fails. The patch file you were handed is the single source of truth every worker reviews. + +### Enumerate-then-assess (CRITICAL for recall on mechanical checks) +For the mechanical, countable patterns, **do not eyeball — first ENUMERATE every occurrence in the diff, then judge each one individually.** Holistic scanning silently drops the boring ones. Before emitting a lane, build the relevant list(s) and assess each entry: +- Correctness: every changed `if (...)`/guard (is it always-true → dead?), every `||` used for a default (should it be `??`?), every `!`/`as` cast (value possibly undefined?), every `arr[0]`/index (array possibly empty?), every `?.` (is the LHS always defined → redundant?), every `.then(`/`.catch(` (should be async/await?), every `try/catch` (swallowing? empty?), every `=== 'literal'` (should be a range/set?), every numeric/`.length` used as a count. +- Conventions: every **new identifier** (const/var/function/boolean) — check each against naming rules AND naming-quality (vague like `variables`/`data`? misleading like `bypassed` when it means disabled? boolean missing `is/should/has`? a name implying the wrong flow?); every `any`/`as any`; every ALL_CAPS frontend const. +- Docs-intent: scan **every** changed comment, user-facing string, and doc line for typos, undocumented markers/magic constants, and stale/duplicated/incomplete docs. +Enumeration is the difference between ~45% and ~80% recall — it is mandatory, not optional. + +### High-recall mode (multi-run union — optional, for important PRs) +A single fan-out catches a strong but incomplete slice; independent runs miss *different* items. For high-stakes PRs, run the full category fan-out **2–3 times** (the specialists are stochastic) and **union** the findings, deduping only exact root-cause+location duplicates. Measured effect: a 2-run union lifts recall ~7–8 points over a single run. Default (cheap) mode is one run; reach for multi-run union when completeness matters more than cost. + +### Deep mode (micro-specialists — for the highest recall) +The broad lenses (A–J) skim the *boring, enumerable* gaps. To close them, add **six narrow micro-specialists**, each a single exhaustive job, and **union** their findings onto the broad-lane + multi-run results. Measured effect: micro-specialists lift recall from ~52% to **~63%** (e.g. they newly catch dead guards, `.then/.catch`, capture-await, drop-unused-export, split-component, restructure-to-flat-tree). Each runs as its own read-only sub-reviewer: +1. **dead-code** — every `if`-guard (provably always-true → dead?), unreachable branch, empty block, redundant `?.`, `||`-should-be-`??`, leftover no-op. +2. **async-hygiene** — every `.then/.catch` (→async/await?), unawaited fire-and-forget, ignored resolved `{error}` (Better-Auth-style APIs don't throw), empty/swallowing catch, un-awaited capture/track. +3. **naming-casing** — every NEW identifier: vague/misleading/wrong-flow/boolean-prefix + frontend ALL_CAPS↔camelCase and JSON-wire-value UPPER_CASE. +4. **typo-docs** — spell-check every changed comment/string/doc line; JSDoc on every new exported util; stale/duplicated/incomplete docs; undocumented magic markers. +5. **simplify-restructure** — hard-to-follow → flat decision tree; one-off guard/hook → fold into existing gate; repeated condition → named const (`isFullUser`); `deps`-object → flat; presentation embedded in fetch → split; over-engineered → simple lookup. +6. **module-surface-dry** — export used only in-module → drop; cross-file helper not exported → export; per-file client/`from`-address/const → single shared source; list duplicated in two places. +Narrow attention budgets force the enumeration that broad lanes skip. The remaining ~37% are mostly **subjective taste** (rename this var, restructure this block) and small-n volatility — diminishing returns beyond here. + +### Specialized subagent workers (dispatch by domain) +For narrow, high-stakes domains the repo ships dedicated read-only reviewer subagents — **dispatch them in addition to the lanes when the diff touches their area** (they hold deeper domain rules than a general lane can): +- **`powersync-sync-reviewer`** — invoke whenever the diff touches `shared/powersync-tables.ts`, `config.yaml` sync rules, backend/frontend Drizzle schema, `backend/drizzle/**` migrations, `src/db/powersync/**`, or a synced-table DAL/defaults/reconciliation. It verifies the two-PR deploy flow, `_journal.json` integrity, sync-rule/column parity, **sync-classification consistency** (synced vs local-only across sibling tables; half-synced/misclassified tables — a silent cross-device-failure class general lanes miss), encryption config, and hard-delete correctness. +- **`react-effect-reviewer`** — invoke on `.tsx`/`.ts` React diffs for the full `useEffect`-discipline catalogue. +Treat their `blocker` findings as blocking. These are the "narrow + durable + reusable" checks that correctly live as their own agent files, not as prose lanes. + +### 1. PASS A — SCAN (recall pass, over-generate) +Walk **every changed hunk** (per the execution model). List **every** candidate concern — aim to surface the real issues a meticulous senior reviewer would raise, which is typically **one finding per ~60–120 changed lines**, not 3–5 for the whole PR. **Over-generate — do not self-censor; a missed issue costs far more than a candidate dropped in Pass B.** For each candidate record: `file:line`, one-line concern, suspected category, suspected severity. Run these category lenses over the diff (use `references/review-heuristics.md` trigger table + IF–THEN rules + the deep correctness checklist): + +- **A. Correctness / logic** *(largest issue bucket — go deep, trace data flow line-by-line)* — real bugs; nullable-sync issues; error-before-first-message; off-by-one / wrong-variable in a branch or payload (e.g. `new_position` and `old_position` both set to the same source); unreachable/dead branches & dead defensive guards (`if (x)` where `x` cannot be falsy here); missing throwing `default` on enum/db-type switches; `||` used for defaulting where `??` is needed (clobbers valid `0`/`''`/`false`); non-null `!` / unsafe cast on a possibly-undefined value; unguarded array index (`dns.lookup(...)[0]`, `.find()[0]`) on a possibly-empty array; redundant optional chaining on a value that's always defined (`.all()?.`, a `= []`-defaulted destructure); exact-equality where a range/set is meant (`=== '127.0.0.1'` misses the rest of `127.0.0.0/8`); recursion that can blow the stack / N+1 query where an iterative BFS or single query fits; non-deterministic ordering (sort on a non-unique key without a tiebreak); a function whose return type/contract changed (e.g. now returns a query builder instead of `Promise`) breaking callers; `.then/.catch` where the codebase uses async/await; fire-and-forget async whose rejection is silently dropped (should be awaited). **See the deep correctness checklist in `references/review-heuristics.md` §3.** +- **B. Architecture / abstraction & coupling** *(highest value)* — second route paralleling a canonical one (fold into the single `/chat`/`/inference` path so token-tracking/auth/observability centralize); logic branching on a specific model/provider/integration (push into a data column/config); DB guard logic outside the DAL; cross-cutting concern enforced per-route instead of at one point; reinventing an existing primitive; premature abstraction (hook/wrapper/util/config-object for 1–2 call sites) **and** missing abstraction (pattern recurs 3+ times); placeholder/temporary code not named as such. +- **C. Conventions / house rules** — `references/house-rules.md` (TS + React + data). camelCase even for constants, no `let`, early-return over nested ifs, `@/` imports, `.test`+`bun:test`, direct imports. +- **D. React patterns** — `useReducer` at 3+ `useState`; `useEffect` used for derived state / prop-sync / parent-notify / reset-on-prop / one-time-init / navigation / ref-assignment (see `references/house-rules.md` effect catalogue); reducer actions modeled as events not setters. +- **E. Error handling** — caught error downgraded to `console.warn`/silent return on a path that shouldn't fail → "let it throw"; error/early-return branch with **no** logging → "at least `console.error`"; over-defensive guard on trusted data; `setTimeout`/`requestAnimationFrame` papering over a race/ordering bug. **Test exception:** a test file's `spyOn(console,'error')` for an intentionally-triggered error is the prescribed `R-SUPPRESSCONSOLE` pattern, NOT error-swallowing — never flag it here. +- **F. Testability** *(see `references/testing-rules.md` for the full test standard)* — `mock.module`/`vi.mock` of a **shared** module = **blocker** (`R-NOMOCKSHARED`: leaks globally → #1 CI flake; use real impls + test DB/provider, don't just mock it better); mocking an internal collaborator = modularity smell → DI (`R-NOMOCK`/`R-DITEST`); backend tests should inject `database`/`fetchFn` via `createApp` + `createTestDb()` (`R-DITEST`); real waits / `vi.useFakeTimers()` instead of `getClock()` (`R-FAKETIMERS`); `.spec`/`vitest` (`R-TEST`); `as any`/`as unknown as`; branchy logic without unit tests. Note: a test's `spyOn(console,'error')` for an expected error is the prescribed pattern (`R-SUPPRESSCONSOLE`) — do NOT flag it as error-swallowing. +- **G. Bundle / bloat / supply-chain** — non-lazy settings/admin/enterprise/OIDC route in `app.tsx`; new dependency (esp. parsers) / unpinned `^`; hardcoded list that will grow → derive it. +- **H. Security / privacy** — endpoint without auth/session; denylist where allowlist fits; per-IP rate limit on an authed route; PII/owned-domain/real-IP/seeded-UUID in code; unescaped model output in HTML/widget attrs; hard-delete of user data (must soft-delete); non-nullable column on a synced table (+ two-PR deploy); migration missing `_journal.json` entry. +- **I. Docs-intent / completeness / scope** — undocumented magic reference / missing rationale ("what is this constant/flag?"); incompleteness ("should the sibling cases be handled too?"); inconsistency with a sibling path; unrelated changes mixed in (split PR); leftover/dead code; no-value comments restating the next line. +- **J. Readability / simplification** *(dedicated pass — high-frequency, easy to under-call)* — hard-to-follow branching that should be a **flat decision tree** ("has session → allowed; no session → branch on mode"); deep nesting that should be **early-return**; a repeated condition that should become a **named const** (`const isFullUser = isAuthenticated && !isAnonymous`); a `deps`-object that reads cleaner as **flat params**; an unnecessary wrapper/indirection; a presentation/display component that should be **split out** of its data-fetching component; over-engineered logic (e.g. dot-notation nested-path parsing) that should be a **simple flat lookup**; 5 lines that collapse to 1. Frame as "this is hard to follow — restructure as …". + +### 2. Cross-file context expansion +For every A/B/G/H candidate, Read the relevant cross-file targets (DAL, schema, `config.yaml`, `shared/powersync-tables.ts`, `app.tsx`, `CLAUDE.md`, the importing/imported modules) and confirm the claim against actual code. Under-contextualized long functions are where false positives spike — expand context before asserting. + +### 3. PASS B — FILTER (precision pass, fresh re-read) +For each Pass-A candidate, KEEP it only if it can **(1) quote the exact offending line** AND **(2) cite a specific house-rule id (`references/house-rules.md`) or invariant id (`references/architecture-invariants.md`)** — OR it is a concrete correctness/security/privacy bug with `file:line` evidence. Otherwise DROP. Do **not** run an open-ended "are there more bugs?" self-critique loop on already-clean code (it invents issues) — only validate the candidates from Pass A. + +### 4. Category recall floors (asymmetric) +- **MUST / MUST-NOT rules → ≈100% recall** — never miss: `any`, error-swallowing, frontend hard-delete, `useEffect` for derived-state/prop-sync, PII in logs, unscoped/cross-tenant query, non-nullable synced column. +- **SHOULD rules → ≈75% recall** — `useReducer` at 3+, helper extraction, JSDoc: report when clear, skip when marginal. +- Prefer **one extra question over one missed blocker.** + +### 5. Noise suppression (mandatory — but never at the cost of a real rule-cited finding) +- The cap applies **only to purely cosmetic nits that carry no `R-*`/`INV-*` id** (e.g. spacing/wording taste): surface at most 5, append "plus N similar". **Any finding that cites a real rule or invariant id is ALWAYS surfaced — never capped, collapsed, or dropped for volume.** A convention violation (camelCase, `let`, naming, JSDoc, deps-object, `.then/.catch`) and a docs-intent catch (undocumented marker/constant) are rule-grounded findings, not cosmetic nits. +- **Skip** only what `/thundercheck` (eslint/prettier/tsc) mechanically auto-fixes (pure formatting/import-order), generated files, `*.lock`/`bun.lockb`, auto-generated Drizzle SQL, vendored deps. Do NOT skip naming, typing, error-handling, or structural conventions — those are not auto-fixed. +- Lead with **"No blocking issues"** when true (but only after the full per-hunk scan — an empty list on a non-trivial diff almost always means you skimmed). + +### 6. Self-validation gate +For each surviving finding verify: (1) cites a real rule/invariant id, (2) points to a real `file:line` in the diff, (3) the house bar would genuinely flag it (not a preference dressed as a rule). Drop any failing (1)–(3). + +### 7. Emit +Rank by severity then `file:line`. Use the exact format in `assets/finding-template.md`, including the machine-readable trailer. + +## Severity & confidence (summary — full ladder in `references/severity-rubric.md`) + +Severity ⊥ confidence (two independent axes). Down-weight low confidence, **never silently drop it**: +- **blocker / high-confidence** → direct statement + prescribed fix. +- **warning / medium** → frame as a question ("intended?"). +- **note / low** → "flagging, no strong feelings." + +| Tier | Examples | Block? | +|---|---|---| +| **Blocker** | real correctness bug; error-swallowing on trusted path; frontend hard-delete; `useEffect` anti-pattern; PII in logs; unscoped query; non-nullable synced column; migration w/o journal entry; **"this will haunt us" maintainability/architecture cost** | Yes | +| **Convention** | `any`; `interface` over `type`; `function` kw; `let` where const+early-return works; ALL_CAPS const; 3+ `useState`; non-lazy route; `.spec`/vitest | Soft | +| **Nit / note** | cosmetic, no-value comment, numeric separators | No | +| **Pre-existing** | issue already in base — context only, not a new blocker | No | + +## Reference files (read on demand — progressive disclosure) +- `references/review-heuristics.md` — IF–THEN heuristics + the diff-signal → comment trigger table (the core review intelligence). **Read this for Pass A.** +- `references/house-rules.md` — TS / React / data conventions with rule ids (`R-*`), incl. the full `useEffect` anti-pattern catalogue. +- `references/testing-rules.md` — the test-file standard with rule ids (`R-NOMOCKSHARED`, `R-DITEST`, `R-FAKETIMERS`, `R-SUPPRESSCONSOLE`, `R-BUNTESTCWD`). **Read this whenever the diff changes a `*.test.ts(x)` file.** +- `references/architecture-invariants.md` — all 80 invariants (`INV-01..INV-80`), titles indexed up top. +- `references/severity-rubric.md` — the severity ladder, tone calibration, confidence rules. +- `references/style-exemplars.md` — anonymized few-shot exemplars; match this register. +- `assets/finding-template.md` — exact output format. diff --git a/.claude/skills/thunder-deep-review/assets/finding-template.md b/.claude/skills/thunder-deep-review/assets/finding-template.md new file mode 100644 index 000000000..d18db0e91 --- /dev/null +++ b/.claude/skills/thunder-deep-review/assets/finding-template.md @@ -0,0 +1,46 @@ +# Output Contract (exact format — low freedom) + +Emit findings in this shape. Never post to the PR; return the report or write it to a review file only. + +**Voice:** write every finding for a **teammate**, not a linter — warm, collaborative, curious. Lead with a question or first-person framing where it fits ("I wonder if…", "Could we…?", "Heads up —"). Be specific and kind; never robotic, terse, or scolding. + +**Rule/invariant ids are INTERNAL.** Cite a real `R-*`/`INV-*` id while grounding/keeping a finding (the verification bar + dedup use it), but it lives in the internal `rule` field — **never write rule-ids, section refs (`§`), or scaffolding (`INV-01`, `R-REDUCER`, "H mass-assignment…") into the human-facing problem/fix text.** They're meaningless to a person reading a PR comment. + +## Header (always) +First line is a tally, then a one-line verdict: + +``` +Thunder Deep Review — blocking, convention, nits +Verdict: PASS | CONCERNS | BLOCK +``` +Lead with **"No blocking issues."** when there are none. + +## Findings (grouped by severity: Blocking → Convention → Nits) +Each finding is one tight unit: + +``` +- `path:line` — (confidence: high|med|low) +``` +(No rule-id in this line — it's human-facing; the id stays in the internal `rule` field.) + +- Blocking findings: clear, warm statement + the fix. +- Convention findings: friendly and specific (the `R-*` id grounds it internally; don't show it). +- Medium/low confidence: phrase the problem as a question. +- Nit cap: at most 5 nits shown; if more, end the Nits group with `…plus similar`. + +Optionally add a collapsed rationale for non-obvious findings: +``` +
why: ; verification: quoted offending line
+``` + +## Machine-readable trailer (always last line) +For CI/orchestrator gating: + +``` + +``` + +## Hard constraints +- Read-only. No Write/Edit, no PR posting, no deploy. +- Only changed lines. Pre-existing issues are context, never new blockers. +- Every finding cites a real `file:line` and (for convention/architecture) a real `R-*`/`INV-*` id, or it is dropped. diff --git a/.claude/skills/thunder-deep-review/references/architecture-invariants.md b/.claude/skills/thunder-deep-review/references/architecture-invariants.md new file mode 100644 index 000000000..954d1b5cc --- /dev/null +++ b/.claude/skills/thunder-deep-review/references/architecture-invariants.md @@ -0,0 +1,112 @@ +# Architecture Invariants (INV-01..INV-80) + +Global properties that survive across subsystems. Verify any change against the relevant ones; cite `INV-NN` when a finding rests on one. Condensed from the architecture deep-dive — read the full `01..10-*.md` docs for detail. + +## Index (titles) +- Identity & Tenancy: INV-01..05 +- Trust & Auth: INV-06..13 +- E2E Encryption: INV-14..29 +- Sync Pipeline: INV-30..37 +- Persistence & Soft-Delete: INV-38..48 +- Backend Skeleton: INV-49..61 +- AI / Chat / MCP: INV-62..70 +- Frontend Devices & Reset: INV-71..74 +- Tauri Shell: INV-75..80 + +## Identity & Tenancy +- **INV-01** `user_id` always from JWT, never client payload; sync rules scope every row by `user_id = request.user_id()`. Cross-tenant leak structurally impossible. +- **INV-02** `X-Device-ID` mandatory on every PowerSync token request + upload (400 `DEVICE_ID_REQUIRED`). +- **INV-03** Device id is client-generated UUID v4 in `localStorage[thunderbolt_device_id]`; backend never mints. +- **INV-04** Email stored lowercase (`normalizeEmail`); lookups must normalize. +- **INV-05** PR-preview envs bypass auth (`shouldBypassWaitlist`). + +## Trust & Authentication +- **INV-06** Bearer = `rawToken.HMAC_b64`; verified with `timingSafeEqual`; PowerSync path-2 verifies signature independently. +- **INV-07** `tauri://localhost` always trusted origin; backend's own origin auto-added. +- **INV-08** `accountLinking.trustedProviders=['sso']` only when authMode ∈ {oidc, saml}. +- **INV-09** OTP sign-in requires `x-challenge-token` + existing-user-or-approved-waitlist (defense in depth). +- **INV-10** OAuth integration secrets never leave backend (confidential-client proxy). +- **INV-11** OAuth redirect URIs: exact CORS match + hardcoded app URL + loopback any-port; HTTPS-on-localhost rejected. +- **INV-12** Tauri loopback ports fixed `[17421,17422,17423]`, shared Rust↔backend; PKCE per RFC 8252. +- **INV-13** Post-update redirect flag set before relaunch; consumed once. + +## E2E Encryption (E2EE_ENABLED) +- **INV-14** CK is AES-256-GCM, single algo/length everywhere. +- **INV-15** Hybrid envelope V1 (`0x01`), exactly 1194 bytes. +- **INV-16** Wrapping key one-shot, HKDF-SHA-256 over `ssEcdh||ssMlkem`. +- **INV-17** CK never leaves device extractable (transient only at setup/recovery). +- **INV-18** CK invalidation broadcasts via `BroadcastChannel('thunderbolt-ck-invalidation')`. +- **INV-19** Codec encode fail-closed in steady state (`e2eeSetupComplete && !ck` throws). +- **INV-20** Codec decode tolerant — passes through on missing CK / decrypt error (warns only; never breaks sync). +- **INV-21** Trust transition atomic — upsertEnvelope + markDeviceTrusted in one tx. +- **INV-22** Revoke atomic — deleteEnvelope + revokeDevice + revokeDeviceSessions in one tx. +- **INV-23** Trusted device envelope overwrite-protected (only the device re-keys). +- **INV-24** Approval requires canary proof AND trusted-caller. +- **INV-25** Re-bootstrap requires old canary secret. +- **INV-26** Canary hash compare constant-time. +- **INV-27** Device limit 10 active/user, enforced in-tx. +- **INV-28** First-device flow returns recovery key exactly once (in-memory). +- **INV-29** Approve-device never touches local non-extractable CK. + +## Sync Pipeline +- **INV-30** Empty transformer pipeline ⇒ byte-identical to `SqliteBucketStorage`. +- **INV-31** Re-encoded payload always same wire format (JSON/BSON in = out). +- **INV-32** Two sync paths — custom SharedWorker (Chrome/Edge/FF) vs main-thread transformer (Safari/iOS/Tauri). +- **INV-33** `@powersync/web` `@internal` override via `powersync-web-internal` alias — brittle across upgrades; re-verify `generateStreamingImplementation()`. +- **INV-34** HTTP streaming, not WebSockets (`SyncStreamConnectionMethod.HTTP`). +- **INV-35** `fetchCredentials` returns `null` (not throw) on auth failure. +- **INV-36** `uploadData` throws on failure, never `transaction.complete()` — PowerSync retries. +- **INV-37** Error→reset mapping: 410→account_deleted; 403 DEVICE_DISCONNECTED→device_revoked; 409 DEVICE_ID_TAKEN→reset; 400 DEVICE_ID_REQUIRED→reset; 401→session_expired. + +## Persistence & Soft-Delete +- **INV-38** Frontend NEVER hard-deletes — `deletedAt = nowIso()`. Hard delete only in account/device flows. +- **INV-39** Backend prefers soft-delete; hard delete only for account deletion, PowerSync DELETE, device revoke. +- **INV-40** `settings` is the only frontend table without `deletedAt` (only hard-delete in frontend DAL). +- **INV-41** `deletedAt IS NULL` filtered at WHERE + partial indexes. +- **INV-42** `clearNullableColumns` nulls every nullable non-PK/FK/unique col except `deletedAt`/`userId`. +- **INV-43** Cascades hand-rolled in DAL (backend FKs intentionally omitted). +- **INV-44** Insert-then-update upsert (PowerSync views lack `ON CONFLICT`). +- **INV-45** `isInsertConflictError` unwraps `error.cause` (Drizzle wraps). +- **INV-46** Composite-PK `(id, user_id)`/`(key, user_id)` for default-data tables. +- **INV-47** `reconcileDefaults` runs AFTER `waitForInitialSync`. +- **INV-48** `defaultHash` is the modification tracker — `update*` strips it; `reset*ToDefault` recomputes. + +## Backend Skeleton +- **INV-49** Single Elysia tree at `/v1`; Better Auth uses `.all('/*')` not `.mount()` (mount bypasses rate limiting). +- **INV-50** `safeErrorHandler` re-applied per plugin (Elysia confines `onError`). +- **INV-51** Error envelope always `{success:false, data:null, error:reasonPhrase}`; internals never leak. +- **INV-52** Domain error codes returned, not thrown (bypass `safeErrorHandler`). +- **INV-53** PGLite default driver; Postgres only when `DATABASE_DRIVER=postgres`. +- **INV-54** `runMigrations()` before `createApp`; discovered via `process.cwd()/drizzle`. +- **INV-55** Migration `_journal.json` MUST include new entries (else migration never runs). +- **INV-56** CORS allowed headers must include any new custom request header (else preflight fails silently). Browser-readable response headers go in `corsExposeHeaders`. +- **INV-57** Settings parsed once + cached (`clearSettingsCache()` for tests). +- **INV-58** Trusted proxy headers: `cf-connecting-ip`/`true-client-ip` only; XFF/Forwarded never trusted. +- **INV-59** Rate-limit table shared across tiers; key prefix discriminates. +- **INV-60** User-keyed rate limit must be `.use()`d INSIDE `guard({auth:true})`. +- **INV-61** IP-keyed rate limit skips `'unknown'`. + +## AI / Chat / MCP +- **INV-62** User's outbound message persisted before model invoked (crash-safe). +- **INV-63** Partial assistant messages persisted ≤ every 200ms while streaming. +- **INV-64** `stepCountIs(maxSteps)` is the only stop condition; final-step nudge disables tools. +- **INV-65** Backend `/v1/inference/*` only accepts `stream:true`; privileged roles after `messages[0]` downgraded to user. +- **INV-66** MCP traffic rides universal proxy `/v1/proxy`; proxy hop authed with session bearer, upstream MCP cred rides as `X-Proxy-Passthrough-Authorization`. +- **INV-67** PostHog `privacy_mode===true` when configured; conversation content never in `$ai_input`/`$ai_output`. +- **INV-68** MCP server config AND credentials are local-only (never synced/E2EE'd); `mcp_secrets` is bearer|oauth union. +- **INV-69** Encryption mismatch fatal — `chatThread.isEncrypted !== selectedModel.isConfidential` throws synchronously. +- **INV-70** Widget tags must be self-closing ``; parser silently drops invalid widgets. + +## Frontend Devices & Reset +- **INV-71** Single canonical reset: `clearLocalData()` → `window.location.replace(...)`. +- **INV-72** Current-device-revoked dual-tracked (`powersyncCredentialsInvalid` event + React Query watcher). +- **INV-73** Sign-in modal suppressed if revoked-modal already fired. +- **INV-74** `hadDeviceOnceRef` distinguishes first-load from post-deletion sync wipe. + +## Tauri Shell +- **INV-75** Window hidden until React mounts (`visible:false`). +- **INV-76** Single instance per OS (focuses existing window). +- **INV-77** `tauri-plugin-http` capability gated by `native_fetch` flag. +- **INV-78** CSP `connect-src` explicit allow (backend 8000, Ollama 11434); `script-src 'self'`. +- **INV-79** Backend proxy CSP `sandbox` — every proxied response sets sandbox + `Content-Disposition: attachment` + `nosniff` + CORP. +- **INV-80** `createSafeFetch` IP-pins every redirect hop (DNS before connect; max 5 redirects). diff --git a/.claude/skills/thunder-deep-review/references/house-rules.md b/.claude/skills/thunder-deep-review/references/house-rules.md new file mode 100644 index 000000000..c74cad501 --- /dev/null +++ b/.claude/skills/thunder-deep-review/references/house-rules.md @@ -0,0 +1,49 @@ +# House Rules (CLAUDE.md / AGENTS.md) — with rule ids + +Cite the `R-*` id when surfacing a finding. Source of truth is `CLAUDE.md` (symlinked `AGENTS.md`) at repo root — read it if in doubt. + +## TypeScript & style +- **R-NOANY** — never use `any` (incl. `as any`, `as unknown as`). Unsafe non-null `!` on possibly-undefined also flagged. +- **R-TYPE** — prefer `type` over `interface`. +- **R-ARROW** — prefer arrow functions over the `function` keyword. +- **R-NOLET** — prefer `const` over `let`; extract a helper with early return instead of setting a `let` inside conditionals. +- **R-CAMEL** — camelCase for consts and variables (yes, even module constants in frontend/TS). Backend JSON string values in responses may be ALL_CAPS (they are wire values, not TS identifiers) — do not flag those. +- **R-EARLY** — prefer early return over long if / nested code. +- **R-IMPORT** — direct imports (`useEffect`, not `React.useEffect`); `@/...` over deep relative paths; top-level imports over inline `await import(...)` unless a circular dep requires it. +- **R-ASYNC** — prefer async/await over `.then/.catch`. +- **R-JSDOC** — add JSDoc to new utility functions. +- **R-COMMENT** — only comment non-obvious code; remove comments that restate the next line. +- **R-NUMSEP** — numeric separators on large literals (`16_000`). +- **R-ONEFILE** — loosely one React component per file. + +## React patterns +- **R-REDUCER** — use `useReducer` when a component needs 3+ `useState`. Model reducer actions as **events** (`SEARCH_STARTED`), not setters (`SET_FOO`). +- **R-STATEHOOK** — abstract state/logic into a `use[Component]State()` hook to separate computation from display and enable unit testing. +- **R-EFFECT** — treat every `useEffect` as a smell until proven necessary. **Never** use an effect for: + - deriving state from props/state → compute during render or `useMemo` + - syncing props into state → use the prop directly, or a ref to detect prop change during render + - notifying parents of state changes → call the callback in the event handler + - resetting state when a prop changes → `key` prop, or `useState` lazy initializer + - one-time init from already-available data → `useState(() => compute())` + - navigation side-effects → return `` in JSX + - assigning to refs → assign `ref.current` in the render body + - **Prefer** `useSyncExternalStore` (external stores / browser APIs), `useEffectEvent` (extract handler logic), `useOptimistic`+`useTransition`, `useTransition`, `useDeferredValue`. + - **Legitimate** (keep): DOM listeners w/ cleanup, external subscriptions (WebSocket/SDK), DOM measurement/scroll, timers w/ cleanup, analytics, async-on-mount. +- **R-LAZY** — keep the entry bundle small. New top-level routes default to `React.lazy(() => import(...))` unless on the chat/landing critical path. Static: Chat, layouts, small auth/error pages. Lazy: all settings/admin pages, secondary features, waitlist, SSO flows. Pair lazy imports with a content-area ``. +- **R-VARCSS** — use standard Tailwind classes for properties with responsive theme overrides (`rounded-*`, spacing). Only use `var()` syntax for the custom variables without a Tailwind equivalent (`text-[length:var(--font-size-*)]`, `h-[var(--touch-height-*)]`, `size-[var(--icon-size-*)]`, `min-h-[var(--min-touch-height)]`). + +## Data, errors, architecture +- **R-SOFTDEL** — **Frontend never hard-deletes.** Always soft-delete (`deletedAt = nowIso()`; call update APIs). Only exception: explicit account/device removal flows. Backend prefers soft-delete; hard delete only for account deletion, PowerSync DELETE ops, device revocation. +- **R-ERRSWALLOW** — prefer optimistic over defensive code. Let errors surface loudly in development; don't wrap trusted calls in try/catch that swallows. Handle errors architecturally at higher levels. Distinguish: (a) swallowing a real error → let it throw; (b) an error branch with *no* log → add `console.error`. +- **R-NODEFENSIVE** — don't add null-checks / guards against conditions that can't occur on trusted data. +- **R-DAL** — DB logic lives in the DAL (`src/dal/*`), not inline in components/settings. +- **R-HTTP** — use the app's `HttpClient` (`src/lib/http.ts`): `getHttpClient()` for authed backend calls, `http` for external APIs. No bare `fetch()`. +- **R-MIGRATION** — generate Drizzle migrations with `bun db generate`, never hand-write SQL. Always verify `backend/drizzle/meta/_journal.json` includes the new entry (else it never runs). Never `bun db push` against prod. +- **R-SYNCNULL** — new synced PowerSync columns must be nullable; adding a synced table is a two-PR deploy (backend schema + `config.yaml` sync rule + dashboard rules FIRST, frontend SECOND). +- **R-CORS** — a new custom request header needs no CORS change (echo-back), but a browser-readable response header must be added to `corsExposeHeaders`. +- **R-BUN** — use `bun` (not npm); `bun test` (not vitest); install latest (`bun add @latest`). +- **R-SIMPLE** — bias to tasteful simplicity; avoid over-engineering, premature optimization, and defensive patterns that obscure intent. Question and recommend alternatives. + +## Tests (summary — full standard in `references/testing-rules.md`) +- **R-TEST** — tests live as `.test.ts` next to source and use `bun:test`; never `.spec` files, never `vitest`. +- **R-NOMOCK** — `mock.module()` of an internal/shared module is a modularity smell → prefer dependency injection (inject `httpClient`/`fetch`/`database`). Mocking is OK only for truly-external boundaries: external/auth/third-party APIs, browser APIs absent in the test env, and React Router hooks. Its stronger sibling is **`R-NOMOCKSHARED`** (a `mock.module` of a *shared* module is a blocker-tier CI-flake leak) — see `references/testing-rules.md`. diff --git a/.claude/skills/thunder-deep-review/references/review-heuristics.md b/.claude/skills/thunder-deep-review/references/review-heuristics.md new file mode 100644 index 000000000..34ae44a2b --- /dev/null +++ b/.claude/skills/thunder-deep-review/references/review-heuristics.md @@ -0,0 +1,138 @@ +# Review Heuristics — the core review intelligence + +Apply these in **Pass A (Scan)**. Two parts: (1) a fast diff-signal → comment trigger table, (2) IF–THEN heuristics by category. Each heuristic, when it fires, must still pass the Pass-B verification bar (quote the line + cite a rule/invariant id) before it is surfaced. + +## 1. Diff-signal → comment trigger table (scan for these literal signals) + +| Diff signal | Likely finding | Rule/Cat | +|---|---|---| +| `const ALL_CAPS =` / `FOO_BAR:` keys in TS/frontend (incl. `*.config.ts`, `scripts/`, `e2e/`, `*.test.ts`) | "camelCase even for constants" | R-CAMEL | +| `let x` then reassignment; `let isX=false; if(...) isX=true` | "avoid `let` — early-return + `const`" | R-NOLET | +| `import … from '../../types'` (deep relative) | "use `@/types`" | R-IMPORT | +| `.spec.ts` file; `from 'vitest'` | "use `.test` + `bun:test`" | R-TEST | +| `mock.module(`/`vi.mock`/`jest.mock` of a SHARED module (`@/hooks/*`, `@/components/ui/*`, any app module other tests import) | "global mock leaks across files → #1 CI flake — DON'T mock shared modules; use real impls + test DB/provider" | R-NOMOCKSHARED (blocker) | +| `mock.module(`/`vi.mock`/`jest.mock` of an internal collaborator | "mocking smell → DI; or justify" | R-NOMOCK | +| partial `mock.module` of a shared module (missing exports) | "include EVERY export or it breaks the next test file" | R-NOMOCKSHARED | +| real wait in a test (`await sleep`, real `setTimeout`, `vi.useFakeTimers()`) | "fake timers are global — advance via `getClock()` in `act()`" | R-FAKETIMERS | +| `spyOn(console,'error')` in a test that triggers an expected error | OK — prescribed pattern, NOT error-swallowing (do not flag) | R-SUPPRESSCONSOLE | +| backend route test that module-mocks the db/`fetch` instead of `createApp({ database, fetchFn })` + `createTestDb()` | "inject via DI; roll back in `afterEach` `cleanup()`" | R-DITEST | +| Component with 3+ `useState(` | "switch to `useReducer`" | R-REDUCER | +| useState + useEffect that syncs/derives from props/state | "derive during render, don't effect" | R-EFFECT | +| `useEffect(` with no adjacent justifying comment | "justify the effect or remove" | R-EFFECT | +| reducer with `SET_FOO`/`SET_BAR` actions | "actions = events (SEARCH_STARTED), not setters" | R-REDUCER | +| Second backend route duplicating `/chat`/`/inference` per provider | "fold into the one canonical endpoint" | architecture | +| `if (model === 'mistral')`, hardcoded `google`/`microsoft` in generic code | "stay integration-agnostic — data column/config" | architecture | +| Hardcoded array listing every model / every table name | "list will grow & bloat — derive it" | bloat | +| `React.lazy` missing on a settings/admin/enterprise/OIDC route in `app.tsx` | "lazy-load — bundle + attack surface" | R-LAZY | +| New `dependencies` entry (esp. docx/pdf parsers); unpinned `^` | "supply-chain + bundle risk; pin" | bloat | +| Backend route handler with no auth/session guard; denylist arrays | "require auth; allowlist > denylist" | security | +| `rateLimit` keyed by IP on an authenticated route | "per-user, not per-IP" | security | +| Literal IPs, owned domains, home paths, seeded UUIDs in migrations/tests | "use example.com / 1.1.1.1; regen UUIDs" | security | +| Template-string building HTML/widget tags w/o escape; `sandbox="… allow-popups"` | "escape attrs; tighten sandbox" | security | +| `db.delete()` / hard delete on user-data tables (frontend) | "must soft-delete (`deletedAt`)" | R-SOFTDEL | +| Non-nullable column added to a synced table schema | "synced cols must be nullable; confirm two-PR deploy" | INV / sync | +| Migration SQL added without `_journal.json` entry | "journal entry missing — migration never runs" | INV-55 | +| DB query/mutation inline in a component/settings file | "move into `dal.ts`" | architecture | +| `catch (e) { console.warn(...) }` and continue; OR error branch with no logging | "let it throw" / "at least `console.error`" | R-ERRSWALLOW | +| `setTimeout`/`requestAnimationFrame` near init/race/flicker | "don't paper over the race; fix the branching" | error-handling | +| New `use[X]()`/wrapper with 1–2 call sites | "premature abstraction — inline it" | simplicity | +| Comment restating the next line (`// Save X` above `saveX()`) | "remove — self-explanatory" | R-COMMENT | +| `as any` / `as unknown as`; non-null `!` on possibly-undefined | "avoid the cast / unsafe assertion" | R-NOANY | +| switch/if over enum/db-type with no throwing `default` | "add throwing `default` — would silently fail" | error-handling | +| Prompt strings the code then string-matches on exact wording | "brittle; breaks across languages/providers" | llm | +| dev-env/CI/playwright/prettier churn intermixed with feature | "separate PR" | scope | +| `bun.lock`/`package.json`/workflow churn unrelated to PR title | "intentional?" | scope | +| Recomputed value a hook already provides (`?? 'http://localhost:8000/v1'` next to `useSettings()`) | "the hook already does this" | simplicity | +| Numeric literal `16000` without separators | "use `16_000`" | R-NUMSEP | +| Single-row setter/getter called in a loop | "plural batch variant — one SQL query" | perf | +| `useQuery` result assumed complete (no pagination) | "anticipate scale; lazy single SQL lookup" | perf | +| Hardcoded config (domain/cloudUrl/CORS) where app self-hosts | "should be env-driven" | architecture | +| Verbose user copy ("This action cannot be undone.") | "trim to minimal clear phrasing" | ux | +| Same list duplicated in a `.md` doc and source | "don't store twice — link" | docs | + +## 2. IF–THEN heuristics by category + +### Architecture & coupling (highest value) +- IF a diff adds a route/endpoint paralleling an existing canonical one THEN ask if it folds into the existing endpoint so cross-cutting hooks (token tracking, auth, observability) stay centralized — flag the second path. +- IF new logic branches on a specific model/provider/integration THEN push the knowledge into data (a models-table column, config object, or tool/system-prompt instruction) so it survives those becoming plugins. +- IF DB validation/guard logic (soft-delete, `isSystem`) lives in a bespoke helper THEN move it into the DAL. +- IF a cross-cutting concern (auth, prompt-injection sanitization) is enforced per-route THEN push it to the single enforcement point. +- IF a general-purpose function gains feature-specific logic THEN flag the coupling; generalize or move it out. +- IF new code reimplements an existing primitive (memoize, platform helpers, debounce, `useQuery`, dayjs, a localStorage hook) THEN redirect to the existing one. +- IF placeholder/temporary code lands THEN require it be named temporary and point at the eventual home. + +### Abstraction altitude +- IF a new hook/wrapper/util/config-object wraps a single call site or trivial values THEN challenge as premature — inline it ("the value doesn't justify the complexity"). +- IF the same expression/logic recurs 3+ times THEN extract a named helper — but only on genuine recurrence. +- IF code is "janky/funky/hacky" (mutating a function after it runs, re-throwing to control flow) THEN flag it as a smell and ask for the clean structure. + +### Error handling +- IF an error is caught and downgraded to `console.warn`/silent return on a path that shouldn't fail THEN "let it throw so we can fix it" (anti-defensive). +- IF an error/early-return branch has NO logging THEN require at least `console.error`/`console.warn` ("some future debugger gets burned"). (Distinguish from the swallow case.) +- IF a switch/if over an enum/db-type has no throwing `default` THEN flag silent-failure on a future variant. +- IF a race/ordering/flicker bug is "fixed" with `setTimeout`/`rAF`/extra state THEN reject as masking the real branching bug. +- IF code is defensive against an impossible condition on trusted data THEN call it overly defensive; prefer loud failure, fix upstream. + +### Testability — full standard in `references/testing-rules.md` +- IF a test `mock.module()`s a **SHARED** module (`@/hooks/*`, `@/components/ui/*`, or any app module other tests import) THEN it is a **blocker** (`R-NOMOCKSHARED`): the mock leaks globally across every test file in the worker — the #1 cause of CI failures that pass alone but fail together (`Export named 'X' not found`, `undefined is not an object`). Don't tell them to mock it better — tell them to **stop mocking shared modules** and use real impls + a test DB/provider. This is NOT just a DI taste nit; press it as a CI-flake blocker. +- IF a test partial-mocks a shared module (omits some exports) THEN flag it (`R-NOMOCKSHARED`): a missing export breaks the next test file — include EVERY export if a shared-module mock is truly unavoidable. +- IF a test relies on heavy mocking (esp. `mock.module` of internal collaborators) THEN modularity smell → DI (`R-NOMOCK`/`R-DITEST`), or at minimum a comment why the mock is unavoidable. Mocking is OK only for truly-external things: external/auth/third-party APIs, browser APIs absent in the test env, and React Router hooks — do NOT flag those. +- IF a backend route test module-mocks the db or `fetch` THEN redirect to DI (`R-DITEST`): inject via `createApp({ database, fetchFn })` + `createTestDb()`, and roll back in `afterEach` via `cleanup()`. +- IF a test uses a real wait (`await sleep`, real `setTimeout`) or re-installs timers (`vi.useFakeTimers()`) THEN flag (`R-FAKETIMERS`): fake timers are installed globally — advance time via `getClock()` (`tickAsync`/`runAllAsync`) inside `act()`. +- IF a test intentionally triggers an error THEN it should `spyOn(console,'error').mockImplementation(() => {})` in `beforeAll` (`R-SUPPRESSCONSOLE`). **Do NOT flag this as error-swallowing** — it is the OPPOSITE of the production `R-ERRSWALLOW` rule; in a test it is the prescribed pattern, not a smell. +- IF branchy logic (parsers, `prepareStep`, reconciliation, `getOrCreate`) lacks unit tests THEN request tests; suggest extracting a pure function. +- IF a test uses `.spec`/`vitest` or `as any`/`as unknown as` THEN flag (`R-TEST`/`R-NOANY`). + +### Correctness +- IF you find a genuine bug (nullable-sync, error-before-first-assistant-message, off-by-one branch) THEN label it explicitly "real bug" and distinguish from nits. +- IF logic depends on the LLM emitting exact wording THEN flag brittleness (non-English/cross-provider); prefer a structural fix. + +### Security / privacy / data / perf +- IF a backend endpoint is added without auth/session THEN ask whether it needs a session; prefer allowlist over denylist. +- IF rate-limiting is per-IP on an authed route THEN suggest per-user. +- IF code embeds PII, an owned domain, a real IP, or seeded UUIDs THEN flag (use `example.com`, `1.1.1.1`, regenerate UUIDs). +- IF model/external output is interpolated into HTML/widget attrs THEN require escaping; tighten over-broad `sandbox`. +- IF a delete touches user data THEN require soft-delete unless it is explicit account/device removal. +- IF a new synced PowerSync column is non-nullable THEN flag sync-breaking; confirm the two-PR deploy. +- IF a single-row setter/getter is called in a loop THEN suggest a plural batch variant. +- IF a `useQuery` result is assumed complete THEN anticipate scale; flag missing pagination. +- IF config is hardcoded but the app self-hosts/enterprise/standalone THEN ask whether it should be env-driven. + +### Docs-intent / completeness / scope +- IF a magic constant/flag/reference appears with no explanation THEN ask what it is / require a justifying comment (the "what is this?" catch). +- IF a change handles one case but sibling cases exist THEN ask "should the others be handled too?" (completeness). +- IF a path is inconsistent with a sibling path THEN flag the inconsistency. +- IF a PR mixes unrelated changes THEN recommend splitting; call out incidental lockfile/version churn ("intentional?"). +- IF a real-but-out-of-scope concern surfaces THEN defer with a follow-up (Linear) rather than blocking. + +## 3. Deep correctness checklist (run line-by-line on every changed hunk) + +Correctness is the largest bucket of real findings and the easiest to skim past. For each changed hunk, trace the data flow and check **every** item: + +- **Wrong/duplicated variable in a payload or branch** — e.g. an event sends `new_position` and `old_position` both from `updates[0].order` (delta always 0); a copy-pasted object reuses the wrong field. +- **Defaulting operator** — `a || b` where `a` can be a *valid* `0`/`''`/`false` → should be `a ?? b`. +- **Unsafe assertion** — `x!` or `x as T` on a value that can be `undefined`/`null` here. +- **Unguarded index** — `arr[0]` / `lookup(...)[0]` without checking the array is non-empty (`dns.lookup`, `.filter()[0]`, query `.rows[0]`). +- **Redundant optional chaining** — `?.` on something always defined (`.all()` always returns an array; a `= []`-defaulted destructure) → dead/ misleading code. +- **Exact-equality where a range/set is meant** — `=== '127.0.0.1'` misses `127.0.0.0/8`; a single host check that should be CIDR/range; an enum compared to one literal where several apply. +- **Dead / unreachable branch & dead defensive guard** — `if (value)` / `if (x)` where `x` cannot be falsy at that point; a branch whose condition is impossible (`isAnonymous && !isAuthenticated`). +- **Missing throwing `default`** — `switch`/if-chain over an enum or db-type with no `else`/`default` that throws → silent failure on a future variant. +- **Missing-guard on external/IO result** — DNS, fetch, db row, JSON parse used without handling the empty/error case. +- **Body/size/pagination limits** — a proxy/handler that reads a request body or `useQuery` result with no size cap / pagination → unbounded. +- **Return-contract change** — a function that changed from `Promise` to a raw query builder (or dropped an `await`) → callers now mishandle it. +- **Non-deterministic ordering** — `ORDER BY` / `.sort()` on a non-unique key with no tiebreak. +- **Recursion / N+1** — recursive id/descendant collection that can blow the stack or fire N queries → prefer iterative BFS / one query. +- **Async hygiene** — `.then/.catch` where the codebase uses async/await; fire-and-forget async whose rejection is silently dropped (await it, or it's an error-swallow). +- **Brittle string/keyword matching** — logic that depends on the LLM/user emitting exact wording (breaks across languages/providers); an over-broad regex (a generic self-closing-tag matcher that also eats real HTML). +- **Hardcoded-true / redundant property** — a field set to a constant (`has_coordinates: true`) that's always that value → redundant; flag it. +- **Token/length confusion** — a `token_count` computed from `.length` (characters), not a tokenizer → wrong; rename or remove. + +## 4. Docs-intent, naming & micro-quality (high-frequency, easy to skim past) + +- **Undocumented marker / magic reference** — an internal tracking code or magic string leaked into a comment or constant with no explanation (`external-6`, `external-14`, `M3`, `M4`, a bare flag). Ask "what is this?" — one finding per distinct marker. +- **Typos in user-facing or doc strings** — e.g. "privacy police" → "privacy policy". +- **Docs placement / completeness** — telemetry/feature docs buried in README that belong in a dedicated file; a "how to add X" checklist that omits a now-required step (update the doc, add the telemetry); a list duplicated in a doc and source. +- **Naming** — boolean not prefixed `is/should/has`; a vague name (`variables` → `updated`); a misleading word (`bypassed` where "disabled" is meant); a name implying the wrong flow (`confirmation` email vs `joined-waitlist`). +- **Unnecessary wrapper / indirection** — a one-line function wrapping `import.meta.env.X` → reference it directly; a `deps` object bundling args that read cleaner flat. +- **Safety defaults** — a permissive flag (`AUTH_ALLOW_ANONYMOUS`) that should default to `false` in code, not just in `.env.example`. +- **Single-source config** — a hardcoded `from` address / client constructed per-file that should live once in a shared module and be imported. diff --git a/.claude/skills/thunder-deep-review/references/severity-rubric.md b/.claude/skills/thunder-deep-review/references/severity-rubric.md new file mode 100644 index 000000000..c8317862a --- /dev/null +++ b/.claude/skills/thunder-deep-review/references/severity-rubric.md @@ -0,0 +1,33 @@ +# Severity Rubric, Tone & Confidence + +Two **independent** axes: **severity** (how much it matters) ⊥ **confidence** (how sure you are). Map them to register; never silently drop a low-confidence item — down-weight it to a question. + +## Severity ladder (hand the reader the tier in plain words) + +| Tier | When | Signature register | Block? | +|---|---|---|---| +| **Praise** | genuine taste signal | "Nice, very clean", "Good call" | — | +| **Nit** | style/cosmetic | "Nitpick: …", "minor", "this might be nitpicky" | No | +| **Non-blocking idea** | suggestion, no strong opinion | "just an idea", "thinking out loud", "safe to ignore I think?", "ok merging as-is if not trivial" | No | +| **Convention** | house rule, terse | "These should be camelCase", "no `let` here", "use `{ }`" | Soft-firm | +| **Real bug / concern** | genuine correctness | "This is a real bug", "this looks real" | Yes | +| **Future-pain (architectural)** | silent maintainability cost | "this will come back to haunt us", "painful someday", "some future debugger gets burned", "this is a code smell" | Yes (even if nothing breaks today) | +| **Hard block** | won't hold up | "I don't think this approach holds up", "we need these columns nullable" (imperative + *need*) | Yes | + +## Confidence → register +- **High** → direct statement + prescribed fix. +- **Medium** → question form ("intended?", "shouldn't we…?", "was this intentional?"). +- **Low** → "flagging, no strong feelings — drop if I'm wrong." + +## Emission rules +1. Lead with a question or first-person framing; questions do real work (Socratic challenge, info-seek, polite directive, scope/consistency probe). +2. Attach an explicit severity word so the author knows act-now vs later. +3. For non-blockers, grant explicit permission to merge as-is. +4. For out-of-scope concerns, defer with a follow-up rather than block. +5. Distinguish nit from real bug explicitly — never let a cosmetic note and a correctness bug read at the same volume. +6. Convention nits stay terse; may repeat across files (consistency over padding) but count once toward the nit cap. +7. Reserve "haunt us / painful someday / code smell" for **architectural & maintainability** blockers — the strongest non-correctness register; don't spend it on cosmetics. +8. Anti-defensive (let-it-throw) and anti-mocking (DI) are pressed hard even when nothing is broken — the two strongest standing positions. + +## High-confidence AI-authored "tells" (pounce on these) +Setter-style reducer actions; `setTimeout`/`rAF` race "fixes"; auto-added obvious comments; `as any`; randomly-generated UUIDs/domains possibly from training data; growing hardcoded enumerations; defensive try/catch swallowing; reinventing an existing primitive. diff --git a/.claude/skills/thunder-deep-review/references/style-exemplars.md b/.claude/skills/thunder-deep-review/references/style-exemplars.md new file mode 100644 index 000000000..6d2576339 --- /dev/null +++ b/.claude/skills/thunder-deep-review/references/style-exemplars.md @@ -0,0 +1,51 @@ +# Style Exemplars (anonymized) — match this register + +Few-shot exemplars distilled from real held-out reviews, paraphrased and de-identified. They encode the **bar**, not any person. Format: signal → finding (in-register) → severity. Match the voice: lead with a question/first-person framing, name the tier, prescribe the fix. + +1. **Signal:** a new backend route proxies one provider in parallel to the existing `/chat`/`/inference` path. + **Finding:** "I'd think of this as another model provider on the existing chat endpoint rather than a separate backend integration — that's where we'd centralize token tracking later. Can the encrypted body ride the one canonical endpoint instead? Prefer a single chat path for all providers." → *Future-pain / architecture (block).* + +2. **Signal:** code references `M4` / `external-6` / a bare flag with no explanation. + **Finding:** "What is `M4` here? There's no rationale in the code or nearby — can we name it or add a comment so the intent survives?" → *Docs-intent (warning, question).* + +3. **Signal:** a hardcoded array enumerating every model ever shipped, used to filter defaults. + **Finding:** "Over time this list gets really long and bloats the bundle — hate to grow it. Could we derive it instead (e.g. remove any unedited system model not in the current defaults) rather than enumerating removed ones?" → *Bloat (block-ish, architectural).* + +4. **Signal:** `const FOO_BAR = …` / ALL_CAPS keys in a frontend `.ts`/test file. + **Finding:** "These should be camelCase to match the codebase — the ALL_CAPS convention only applies to wire/JSON string values, not our TS identifiers." → *Convention (R-CAMEL).* + +5. **Signal:** `try { ok() } catch (e) { console.warn(e) }` around a trusted call. + **Finding:** "Let's not swallow this — prefer letting it throw so we catch it in dev. Swallowed errors tend to come back to haunt us." → *Error-handling (R-ERRSWALLOW, block).* + +6. **Signal:** a component with 4+ `useState`. + **Finding:** "This is a lot of `useState`s — would a `useReducer` make it less bug-prone and easier to follow?" → *React (R-REDUCER, convention).* + +7. **Signal:** a test leans on `mock.module(...)` of an internal collaborator. + **Finding:** "I'd love to avoid mocking our own modules here — it's usually a sign the code wants dependency injection. Could we inject the dependency instead, or at least note why the mock is unavoidable?" → *Testability (R-NOMOCK).* + +7a. **Signal:** a new test file does `mock.module('@/hooks/use-settings', () => ({ useSettings: () => ({ … }) }))` (a shared module). + **Finding:** "This is a blocker — `mock.module()` on a shared module like `@/hooks/use-settings` leaks globally to every test file in the worker. It'll pass alone and then fail in CI when run together (`undefined is not an object`, `Export named 'X' not found`) — the #1 CI-flake here. Don't mock it better; drop the mock and use the real hook with a test DB/provider (`setupTestDatabase` + `createTestProvider`)." → *Testability (R-NOMOCKSHARED, block).* + +7b. **Signal:** a test mocks `@/components/ui/dialog` but lists only `Dialog`/`DialogContent`, omitting `DialogFooter`. + **Finding:** "If we truly must mock a shared module, it has to include EVERY export — a missing `DialogFooter` here will crash the next test file that imports it (`Export named 'DialogFooter' not found`). Better: don't mock the shared UI module at all." → *Testability (R-NOMOCKSHARED, block).* + +8. **Signal:** a one-off `useAnonymousSessionGuard` hook gates auth alongside the existing `AuthGate`. + **Finding:** "This feels like a footgun — anonymous auth is just another auth type, so I'd fold it into `AuthGate` rather than a separate guard/hook. If someone later removed the other gate they might not realize this silently creates users. Could we do it imperatively in the auth logic?" → *Architecture / footgun (block).* + +9. **Signal:** a branch handles `isAnonymous && !isAuthenticated`. + **Finding:** "How is it possible for a user to be anonymous *and* not authenticated? This branch looks unreachable — can we remove it or correct the condition?" → *Correctness / dead code (warning).* + +10. **Signal:** a switch adds one new model/provider case; siblings exist nearby. + **Finding:** "Should the other three models be added here too? Want to make sure this isn't half-applied." → *Completeness (question).* + +11. **Signal:** a flicker/ordering bug is patched with `setTimeout(() => …, 0)`. + **Finding:** "I'm skeptical of the `setTimeout` here — it papers over an ordering bug rather than fixing it. Does fixing the branching remove the need for the timer?" → *Error-handling (block).* + +12. **Signal:** a frontend flow calls `db.delete()` on a user-data row. + **Finding:** "We always soft-delete on the frontend — set `deletedAt` instead of hard-deleting, unless this is an explicit account/device removal." → *Data (R-SOFTDEL, block).* + +13. **Signal:** a `deps` object bundles two collaborators passed into a function/hook. + **Finding:** "The `deps` object feels heavier than needed — deps are just inputs; I'd keep them flat as plain args, it reads easier." → *Readability (nit/idea).* + +14. **Signal:** a non-nullable column is added to a synced PowerSync table. + **Finding:** "We need these synced columns nullable — a non-nullable add breaks sync for existing rows. Also confirm the two-PR deploy (backend + `config.yaml` rules before the frontend)." → *Hard block (R-SYNCNULL).* diff --git a/.claude/skills/thunder-deep-review/references/testing-rules.md b/.claude/skills/thunder-deep-review/references/testing-rules.md new file mode 100644 index 000000000..8cf1c948e --- /dev/null +++ b/.claude/skills/thunder-deep-review/references/testing-rules.md @@ -0,0 +1,19 @@ +# Testing Rules — with rule ids + +Cite the `R-*` id when surfacing a test-file finding. Source of truth is the two testing docs: `docs/development/testing.md` (frontend/general) and `backend/docs/testing.md` (backend). Read them if in doubt. + +These rules apply to **test files** (`*.test.ts`, `*.test.tsx`). They are the standard a senior reviewer holds tests to — distinct from, and sometimes the **opposite of**, the production house rules (`house-rules.md`). The most important inversion is `R-SUPPRESSCONSOLE` vs `R-ERRSWALLOW` — see that rule. + +## The rules + +- **R-NOMOCKSHARED** *(blocker)* — **Never `mock.module()` a SHARED module.** Bun's `mock.module()` creates **global, persistent mocks that leak across test files** for ALL tests in the same worker — the **#1 cause of CI failures that pass individually but fail when run together**. Shared modules include `@/hooks/*`, `@/components/ui/*`, and any app module imported by other tests. Symptom errors in CI (not locally): `SyntaxError: Export named 'X' not found in module`, `TypeError: undefined is not an object`, `TypeError: X is not a function`. The fix is **not** "mock it better" — it's **don't mock shared modules at all**: use real implementations with a test DB/provider (`setupTestDatabase`/`createTestProvider` on the frontend; see `R-DITEST`). Mock ONLY truly-external things: external/auth/third-party APIs, browser APIs absent in the test env (e.g. `window.location.reload`), and React Router hooks when testing navigation. **If you absolutely must mock a shared module, you must include EVERY export** (a partial mock drops an export and breaks the next test file that imports it). This is a **blocker**, not a taste nit — a shared-module `mock.module` is a latent CI flake even if it passes today. + +- **R-DITEST** *(should)* — **Prefer dependency injection over mocking.** Inject collaborators instead of `mock.module`-ing them. + - **Frontend:** inject `httpClient`/`fetch` as a defaulted parameter. ✅ `export const checkInbox = async (params, httpClient: HttpClient = ky) => { … httpClient.get(…) }` — ❌ `mock.module('ky', () => ({ … }))`. For component tests, use real impls behind a test provider (`createTestProvider()`) and a test DB (`setupTestDatabase`/`resetTestDatabase`/`teardownTestDatabase` from `@/dal/test-utils`). + - **Backend:** use the PGlite in-memory DB. `createTestDb()` (from `@/test-utils/db`) returns `{ client, db, cleanup }`; the single PGlite instance is reused and migrations run once in preload (`bunfig.toml`). Inject the test DB via `createApp({ database })` and external calls via `createApp({ fetchFn })` (the `AppDeps` type). Each test runs in a transaction that you **roll back via `cleanup()` in `afterEach`** for isolation. `globalThis.fetch` is mocked in preload to **throw** if a test calls it without DI — so never module-mock the db or fetch; inject them. + +- **R-FAKETIMERS** *(should)* — **Fake timers are installed globally** (auto-installed before each test, uninstalled after) so tests run fast and deterministically. **Never use real waits** (`await sleep(...)`, real `setTimeout`) and **never re-install timers** (`vi.useFakeTimers()`). To advance time, use `getClock()` from `@/testing-library` inside `act()`: `await act(async () => { await getClock().tickAsync(300) })` for a specific duration, or `getClock().runAllAsync()` to settle all pending timers (also speeds up HTTP libs with retry logic like `ky`). + +- **R-SUPPRESSCONSOLE** *(should)* — **Suppress EXPECTED console errors in tests.** For a test that intentionally triggers an error, silence the noise with `spyOn(console, 'error').mockImplementation(() => {})` in `beforeAll`. **CRITICAL — this is the OPPOSITE of the production `R-ERRSWALLOW` rule, and must NOT be flagged as error-swallowing.** In production code an error branch with no log earns a "add `console.error`" finding (`R-ERRSWALLOW`); in a **test** that deliberately provokes an error, `spyOn(console,'error')` is the prescribed, correct pattern. The error-handling lane must never treat a test's `spyOn(console,'error')` as a swallow. (It IS still worth a note if a test spies on `console.error` but does **not** actually trigger an expected error — that hides real failures.) + +- **R-BUNTESTCWD** *(nit)* — **Don't run `bun test` from the repo root.** Bun's positional args are **substring filters, not paths**, so a filter like `src/` also matches `backend/src/...` and pulls in backend tests. Use the `test` script (`bun run test`), which scopes discovery via `bun test --cwd=src` and runs `shared/*.test.ts` and other out-of-tree paths by explicit path. (Tests live as `.test.ts` next to source and use `bun:test`, not `vitest` — see `R-TEST`/`R-BUN`.) diff --git a/.github/scripts/review-orchestrator.mjs b/.github/scripts/review-orchestrator.mjs new file mode 100644 index 000000000..53f20b5e3 --- /dev/null +++ b/.github/scripts/review-orchestrator.mjs @@ -0,0 +1,1171 @@ +#!/usr/bin/env node + +/* This Source Code Form is subject to the terms of the Mozilla Public + * License, v. 2.0. If a copy of the MPL was not distributed with this + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ + +// ============================================================================= +// review-orchestrator.mjs +// +// Deterministic control loop for the advisory "thunder-deep-review" bot. +// +// WHY THIS EXISTS: +// The workflow must NOT delegate orchestration to the model's tool-calls. +// All GitHub I/O — polling A's bots, building the skip-list, computing the +// resolved-diff, and posting the INLINE PR REVIEW — happens HERE, in +// deterministic code. The model is invoked separately and emits STRUCTURED +// JSON FINDINGS ONLY. This script reads that JSON, validates it, maps each +// finding to a diff position, and posts it as one PR review with N inline +// comments anchored to file:line (Bugbot-style). +// +// EXECUTION MODEL (two phases, see thunder-deep-review.yml): +// Phase "pre": poll A's bots -> write skip-list + diff to files for the model step. +// Phase "post": read model JSON -> dedup vs our own open threads -> resolve +// our own fixed threads -> POST ONE PR REVIEW (event=COMMENT) +// with the NEW findings as inline comments. +// Run as: node review-orchestrator.mjs +// +// FAIL-SOFT (advisory): on ANY unrecoverable error we log and +// process.exit(0) WITHOUT posting. A missing review is acceptable; a broken +// or duplicate review is not. Next push retries. +// +// ZERO npm DEPENDENCIES: Node >=18 built-in global `fetch` only. No Octokit. +// +// ─────────────────────────────────────────────────────────────────────────── +// IDENTITY: the orchestrator posts via the default GITHUB_TOKEN, so its review +// comments are authored as `github-actions[bot]` — THUNDER_BOT_LOGIN resolves +// to that (set in thunder-deep-review.yml). No separate GitHub App is needed. +// The model-findings JSON path/contract is fixed via the THUNDER_*_FILE env +// vars set in the workflow (see below). +// RESOLVED: EXPECTED_BOTS Cursor Bugbot app.id (1210556). +// ============================================================================= + +import { readFile, writeFile } from 'node:fs/promises'; +import { createHash } from 'node:crypto'; + +// ----------------------------------------------------------------------------- +// CONFIG / CONSTANTS +// ----------------------------------------------------------------------------- + +/** + * EXPECTED_BOTS — the external "Job A" SaaS reviewers we sequence behind. + * + * Identify bots by NUMERIC `app.id`, NOT by login strings or + * `app.slug` — logins ("cursor[bot]") and slugs drift, and a single app can + * spawn multiple check-runs. The numeric id is stable. + * + * To find an app id: open a check-run that bot produced and GET its + * `/repos/{owner}/{repo}/check-runs/{id}`, then read `.app.id`. Or hit + * `/repos/{owner}/{repo}/commits/{sha}/check-runs` and read each run's `.app.id`. + * + * `loginHint` is ONLY used as a secondary, best-effort filter for review + * COMMENTS (the comment author is a login, not an app id — see buildSkipList). + * + * Scope (Job A): Cursor Bugbot only. GitHub Advanced Security and CodeRabbit + * are out of scope; add either later by registering its app.id below. + */ +const EXPECTED_BOTS = [ + // Cursor Bugbot — real numeric GitHub App id (slug "cursor", name "Cursor Bugbot"). + { name: 'cursor-bugbot', appId: 1210556, loginHint: 'cursor[bot]' }, +]; + +// Identity of OUR OWN review comments, so we can dedup against / resolve only +// our own threads (never touch Bugbot's or a human's). The orchestrator runs +// under the default GITHUB_TOKEN, so the author login is github-actions[bot] +// (THUNDER_BOT_LOGIN, see env below). We also stamp every comment body with a +// hidden marker (SELF_COMMENT_MARKER) as the robust secondary signal. +const SELF_COMMENT_MARKER = ''; + +// Severity enum the model output is validated against. Anything else is dropped. +const SEVERITY_ENUM = ['blocking', 'convention', 'nit']; + +// Deterministic deep-mode gate. Diff bigger than this => deep mode. +const DEEP_MODE_CHANGED_LINES = 600; // tunable for the team's PR sizes. +const DEEP_MODE_FILE_COUNT = 40; // tunable for the team's PR sizes. +// Hard GitHub cliffs: /commits/{sha}/check-runs caps at 1000 check-suites, +// PR files list caps at 3000 files. Past the file cap we switch to bounded mode. +const PR_FILES_TRUNCATION_CAP = 3000; +// In bounded mode (PR exceeds the file cap) we deterministically clip the diff +// handed to the model to the first N per-file sections so "honor bounded mode" +// is a REAL scope limit, not just an unenforced prompt hint. +const BOUNDED_MODE_FILE_LIMIT = 200; + +// Per-comment body length cap so a single inline comment never balloons. +const MAX_COMMENT_BODY = 60_000; // GitHub hard limit is 65536; leave headroom. +// Cap on inline comments per review so we never post a wall of nits. +const MAX_INLINE_COMMENTS = 50; // overflow rolls into the review summary body. + +// Polling timing: jittered exponential backoff, honor Retry-After. +const POLL_BASE_MS = 15_000; // start at ~15s +const POLL_CAP_MS = 60_000; // cap at ~60s +const DISCOVERY_TIMEOUT_MS = 60_000; // ~60s: "never appeared" => treat bot as disabled +const COMPLETION_TIMEOUT_MS = 5 * 60_000; // ~5min: "exists but not completed yet" + +// File handoff between the two phases / the model step. +// Paths come from the THUNDER_*_FILE env vars set in thunder-deep-review.yml. +const SKIPLIST_FILE = process.env.THUNDER_SKIPLIST_FILE ?? '/tmp/thunder-skiplist.json'; +const DIFF_FILE = process.env.THUNDER_DIFF_FILE ?? '/tmp/thunder-diff.patch'; +const FINDINGS_FILE = process.env.THUNDER_FINDINGS_FILE ?? '/tmp/thunder-findings.json'; +const DEEPMODE_FILE = process.env.THUNDER_DEEPMODE_FILE ?? '/tmp/thunder-deepmode.json'; + +// ----------------------------------------------------------------------------- +// ENVIRONMENT (read once, validated in main) +// ----------------------------------------------------------------------------- + +const env = { + token: process.env.GITHUB_TOKEN ?? process.env.GH_TOKEN ?? '', + repoFull: process.env.GITHUB_REPOSITORY ?? '', // "owner/repo" + prNumber: process.env.PR_NUMBER ?? '', + // Reconcile EVERYTHING to the PR head SHA, never the synthetic merge ref. + headSha: process.env.PR_HEAD_SHA ?? '', + // GitHub PR action event name: "reopened" suppresses thread resolution. + eventAction: process.env.PR_ACTION ?? '', + apiBase: process.env.GITHUB_API_URL ?? 'https://api.github.com', + graphqlBase: process.env.GITHUB_GRAPHQL_URL ?? 'https://api.github.com/graphql', + // Login OUR review comments are authored as. The orchestrator posts via the + // default GITHUB_TOKEN, so the author login is github-actions[bot]. The hidden + // marker (SELF_COMMENT_MARKER) is the robust secondary match. + selfLogin: process.env.THUNDER_BOT_LOGIN ?? 'github-actions[bot]', + // Allow a `review:deep` label to force deep mode in addition to the size gate. + hasDeepLabel: (process.env.PR_LABELS ?? '').split(',').map((s) => s.trim()).includes('review:deep'), +}; + +const [owner, repo] = env.repoFull.split('/'); + +// ----------------------------------------------------------------------------- +// SMALL UTILITIES +// ----------------------------------------------------------------------------- + +const log = (...args) => console.log('[thunder-orchestrator]', ...args); +const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)); +const jitter = (ms) => Math.round(ms * (0.8 + Math.random() * 0.4)); // ±20% + +/** Fail-soft exit: log the reason and exit 0 so the advisory bot never breaks CI. */ +const failSoft = (reason, err) => { + log('FAIL-SOFT (exiting 0 without posting):', reason, err?.message ?? ''); + process.exit(0); +}; + +/** + * githubFetch — single choke-point for every GitHub REST call. + * - Sends auth + API version headers. + * - 4xx => throw immediately (bail; retrying won't help) EXCEPT 403/429 with a + * rate-limit signal, which we treat as retryable backoff. + * - 5xx => signal retryable so callers can back off. + * - Honors `Retry-After` and `x-ratelimit-reset` when rate-limited. + * Returns { ok, status, json, linkHeader, retryAfterMs }. + */ +const githubFetch = async (url, init = {}) => { + const res = await fetch(url, { + ...init, + headers: { + Authorization: `Bearer ${env.token}`, + Accept: 'application/vnd.github+json', + 'X-GitHub-Api-Version': '2022-11-28', + 'User-Agent': 'thunder-deep-review-orchestrator', + // Declare the JSON body explicitly (matches graphqlFetch). Harmless on the + // bodyless GETs; required so POSTs that send JSON.stringify(...) — the + // create-review call — are parsed as JSON rather than mis-sniffed. + 'Content-Type': 'application/json', + ...(init.headers ?? {}), + }, + }); + + const retryAfterMs = computeRetryAfterMs(res); + const linkHeader = res.headers.get('link') ?? ''; + + // Rate-limited: surface as retryable regardless of 403/429 framing. GitHub's + // SECONDARY rate limit returns 403 with a `retry-after` header and a + // remaining quota > 0 — treat that as retryable too, not a permanent 4xx. + const remaining = res.headers.get('x-ratelimit-remaining'); + const isRateLimited = + res.status === 429 || + (res.status === 403 && (remaining === '0' || res.headers.get('retry-after') !== null)); + + if (res.status >= 500 || isRateLimited) { + return { ok: false, retryable: true, status: res.status, json: null, linkHeader, retryAfterMs }; + } + if (res.status >= 400) { + // Non-rate-limit 4xx: permanent for this request. Caller decides bail vs ignore. + const body = await res.text().catch(() => ''); + return { ok: false, retryable: false, status: res.status, json: null, linkHeader, retryAfterMs, body }; + } + const json = res.status === 204 ? null : await res.json().catch(() => null); + return { ok: true, retryable: false, status: res.status, json, linkHeader, retryAfterMs }; +}; + +/** Parse Retry-After (seconds) or x-ratelimit-reset (epoch secs) into a ms delay. */ +const computeRetryAfterMs = (res) => { + const retryAfter = res.headers.get('retry-after'); + if (retryAfter) { + const secs = Number(retryAfter); + if (Number.isFinite(secs)) return secs * 1000; + } + const reset = res.headers.get('x-ratelimit-reset'); + const remaining = res.headers.get('x-ratelimit-remaining'); + if (reset && remaining === '0') { + const deltaMs = Number(reset) * 1000 - Date.now(); + if (Number.isFinite(deltaMs) && deltaMs > 0) return Math.min(deltaMs, POLL_CAP_MS); + } + return 0; +}; + +/** Follow RFC-5988 `Link: ; rel="next"` to paginate. */ +const nextLink = (linkHeader) => { + // e.g. ; rel="next", ; rel="last" + const match = linkHeader.split(',').find((p) => p.includes('rel="next"')); + if (!match) return null; + const urlMatch = match.match(/<([^>]+)>/); + return urlMatch ? urlMatch[1] : null; +}; + +/** + * paginateAll — GET every page of a list endpoint, following Link headers. + * Retries 5xx / rate-limit with jittered backoff; bails on hard 4xx. + * `cap` guards against pathological pagination (e.g. the 3000-file cliff). + */ +const paginateAll = async (firstUrl, { cap = Infinity } = {}) => { + const out = []; + let url = firstUrl; + let truncated = false; + while (url) { + // Retry budget is PER PAGE, not shared across the whole pagination — + // otherwise a few flaky early pages exhaust the budget for all later ones. + let attempt = 0; + let r = await githubFetch(url); + while (!r.ok && r.retryable && attempt < 5) { + attempt += 1; + await sleep(Math.max(r.retryAfterMs, jitter(Math.min(POLL_BASE_MS * 2 ** attempt, POLL_CAP_MS)))); + r = await githubFetch(url); + } + if (!r.ok) { + throw new Error(`GET ${url} -> ${r.status} ${r.body ?? ''}`); + } + const page = Array.isArray(r.json) ? r.json : (r.json?.check_runs ?? []); + out.push(...page); + if (out.length >= cap) { + truncated = true; + break; + } + url = nextLink(r.linkHeader); + } + return { items: out, truncated }; +}; + +const apiUrl = (path) => `${env.apiBase}/repos/${owner}/${repo}${path}`; + +/** + * graphqlFetch — single choke-point for the GraphQL calls we need (review-thread + * resolution state + resolveReviewThread). Retries 5xx / rate-limit; returns + * `{ ok, data, errors }`. Mutations that fail are fail-soft at the caller. + */ +const graphqlFetch = async (query, variables) => { + let attempt = 0; + for (;;) { + const res = await fetch(env.graphqlBase, { + method: 'POST', + headers: { + Authorization: `Bearer ${env.token}`, + 'Content-Type': 'application/json', + 'User-Agent': 'thunder-deep-review-orchestrator', + }, + body: JSON.stringify({ query, variables }), + }); + // Retry 5xx, 429, and GitHub's secondary rate limit (403 + retry-after). + const secondaryLimited = res.status === 403 && res.headers.get('retry-after') !== null; + if (res.status >= 500 || res.status === 429 || secondaryLimited) { + if (attempt < 5) { + attempt += 1; + await sleep(Math.max(computeRetryAfterMs(res), jitter(Math.min(POLL_BASE_MS * 2 ** attempt, POLL_CAP_MS)))); + continue; + } + return { ok: false, data: null, errors: [{ message: `graphql ${res.status}` }] }; + } + const body = await res.json().catch(() => null); + if (!res.ok || !body) { + return { ok: false, data: null, errors: body?.errors ?? [{ message: `graphql ${res.status}` }] }; + } + return { ok: !body.errors?.length, data: body.data ?? null, errors: body.errors ?? null }; + } +}; + +// ============================================================================= +// POLLER +// ============================================================================= +// +// Enumerate the head-SHA check-runs ONCE, find the EXPECTED_BOTS runs by +// numeric app.id, then poll each existing run BY ID until status===completed +// (accept ANY conclusion). A short discovery sub-timeout treats "never +// appeared" as "bot disabled on this PR" (empty contribution) vs the full +// completion timeout for "run exists but hasn't finished yet". +// +// UNCHANGED by the inline-review redesign — only the comment SINK changed. +// ============================================================================= + +/** One enumeration of all check-runs for the head SHA (paginated). */ +const listCheckRuns = async () => { + // `filter=all` so we don't only see the latest run per app. + const url = apiUrl(`/commits/${env.headSha}/check-runs?per_page=100&filter=all`); + const { items } = await paginateAll(url); + return items; +}; + +/** + * pollForBots — block until all DISCOVERABLE expected bots reach completed, + * or the relevant timeouts elapse. Returns the set of bot app.ids that + * actually completed (used only for observability; the skip-list is built + * from comments + outputs separately). + */ +const pollForBots = async () => { + const start = Date.now(); + const expectedIds = new Set(EXPECTED_BOTS.map((b) => b.appId).filter((id) => id > 0)); + if (expectedIds.size === 0) { + log('no real EXPECTED_BOTS app.ids configured — skipping poll, skip-list will be best-effort.'); + return; + } + + let backoff = POLL_BASE_MS; + // Track, per bot id: have we ever SEEN a run (discovery), and is it completed? + const seen = new Set(); + const completed = new Set(); + + for (;;) { + let runs; + try { + runs = await listCheckRuns(); + } catch (err) { + // Transient list failure: back off and retry within the overall window. + if (Date.now() - start > COMPLETION_TIMEOUT_MS) { + log('poll: completion timeout during list error — proceeding.', err.message); + return; + } + await sleep(jitter(backoff)); + backoff = Math.min(backoff * 2, POLL_CAP_MS); + continue; + } + // Successful list: reset the backoff so a few transient errors don't leave + // us polling at the 60s cap for the rest of the discovery window. + backoff = POLL_BASE_MS; + + for (const run of runs) { + const appId = run.app?.id; + if (!expectedIds.has(appId)) continue; + seen.add(appId); + if (run.status === 'completed') completed.add(appId); // ANY conclusion is terminal. + } + + const elapsed = Date.now() - start; + + // Discovery sub-timeout: any expected bot we've NEVER seen after ~60s is + // treated as disabled and dropped from the wait-set. + const stillWaiting = [...expectedIds].filter((id) => { + if (completed.has(id)) return false; // done + if (!seen.has(id) && elapsed > DISCOVERY_TIMEOUT_MS) return false; // never appeared => disabled + return true; // either seen-but-not-complete, or not-yet-seen within discovery window + }); + + if (stillWaiting.length === 0) { + log(`poll: all discoverable bots terminal (completed=${[...completed].join(',') || 'none'}).`); + return; + } + if (elapsed > COMPLETION_TIMEOUT_MS) { + log(`poll: completion timeout (${Math.round(elapsed / 1000)}s) — proceeding with whatever posted.`); + return; + } + + await sleep(jitter(backoff)); + backoff = Math.min(backoff * 2, POLL_CAP_MS); + } +}; + +// ============================================================================= +// SKIP-LIST +// ============================================================================= +// +// Build the "already reported by A" skip-list so the model surfaces only what +// A missed. Sources, in priority order: +// 1. PR review (inline) comments authored by the bots — AUTHORITATIVE. +// 2. The bots' check-run `output.text/summary` — BEST-EFFORT (free text). +// Dedup key is normalized on file + diff_hunk + side/original_line, NOT raw +// line numbers (lines drift on force-push). +// +// UNCHANGED by the inline-review redesign. +// ============================================================================= + +/** Stable dedup key for an inline comment. */ +const dedupKeyFromReviewComment = (c) => { + const parts = [ + c.path ?? '', + // diff_hunk anchors the comment to surrounding code, resilient to line drift. + (c.diff_hunk ?? '').trim().slice(0, 200), + c.side ?? c.original_side ?? '', + String(c.original_line ?? c.line ?? ''), + ]; + return normalizeKey(parts.join(' ')); +}; + +const normalizeKey = (s) => createHash('sha1').update(s).digest('hex').slice(0, 16); + +/** + * Is this review comment from one of our expected Job-A bots (Cursor Bugbot + * et al.)? The skip-list MUST stay scoped to those external reviewers: pulling + * in unrelated bots — especially THIS workflow's own `github-actions[bot]` + * comments — would nudge the model to suppress its own prior findings, which + * then trips the "missing hash ⇒ resolve thread" convergence on the next run. + * Match an expected bot by its (drift-prone, best-effort) loginHint, and NEVER + * count our own comments (selfLogin or the self marker). + */ +const isBotComment = (c) => { + const login = c.user?.login ?? ''; + if (login === env.selfLogin || (c.body ?? '').includes(SELF_COMMENT_MARKER)) return false; + return EXPECTED_BOTS.some((b) => b.loginHint && b.loginHint === login); +}; + +const buildSkipList = async () => { + const entries = []; + + // 1) Inline review comments (authoritative). Paginate. + try { + const url = apiUrl(`/pulls/${env.prNumber}/comments?per_page=100`); + const { items } = await paginateAll(url); + for (const c of items) { + if (!isBotComment(c)) continue; + entries.push({ + key: dedupKeyFromReviewComment(c), + source: 'inline', + path: c.path, + hunk: (c.diff_hunk ?? '').trim().slice(0, 200), + summary: (c.body ?? '').slice(0, 280), + }); + } + } catch (err) { + // Best-effort: a skip-list failure must not abort the whole review. + log('skip-list: inline-comment fetch failed (continuing best-effort):', err.message); + } + + // 2) Check-run outputs (best-effort, free text). We can't anchor these to a + // hunk, so we only record a coarse summary the model can defer to. + try { + const runs = await listCheckRuns(); + const expectedIds = new Set(EXPECTED_BOTS.map((b) => b.appId).filter((id) => id > 0)); + for (const run of runs) { + if (!expectedIds.has(run.app?.id)) continue; + const text = `${run.output?.title ?? ''}\n${run.output?.summary ?? ''}`.trim(); + if (!text) continue; + entries.push({ + key: normalizeKey(`checkrun ${run.id}`), + source: 'checkrun-summary', + path: null, + hunk: null, + summary: text.slice(0, 600), + }); + } + } catch (err) { + log('skip-list: check-run output fetch failed (continuing best-effort):', err.message); + } + + // De-dup the skip-list itself by key. + const byKey = new Map(entries.map((e) => [e.key, e])); + return [...byKey.values()]; +}; + +// ============================================================================= +// DEEP-MODE GATE +// ============================================================================= +// +// Compute diff size DETERMINISTICALLY in code; do NOT let the model decide +// when to spend on deep mode. Detect the 3000-file truncation cliff and fall +// back to a bounded mode flag the model honors. +// +// UNCHANGED by the inline-review redesign. +// ============================================================================= + +const computeDeepMode = async () => { + let fileCount = 0; + let changedLines = 0; + let truncated = false; + try { + const url = apiUrl(`/pulls/${env.prNumber}/files?per_page=100`); + const { items, truncated: t } = await paginateAll(url, { cap: PR_FILES_TRUNCATION_CAP }); + truncated = t; + fileCount = items.length; + for (const f of items) changedLines += (f.additions ?? 0) + (f.deletions ?? 0); + } catch (err) { + log('deep-mode: file list failed, defaulting to non-deep bounded:', err.message); + } + + const sizeTriggered = changedLines >= DEEP_MODE_CHANGED_LINES || fileCount >= DEEP_MODE_FILE_COUNT; + return { + deepMode: env.hasDeepLabel || sizeTriggered, + boundedMode: truncated, // past the file cap, review a bounded subset only. + fileCount, + changedLines, + truncated, + reason: env.hasDeepLabel ? 'review:deep label' : sizeTriggered ? 'size gate' : 'default single fan-out', + }; +}; + +// ============================================================================= +// OUTPUT CONTRACT +// ============================================================================= +// +// The model writes STRUCTURED JSON findings to FINDINGS_FILE. We NEVER trust +// raw model markdown. Validate severities against the enum, drop anything +// malformed, then render each finding into an inline comment body here. +// +// Expected JSON shape (the action step's prompt must enforce this): +// { "findings": [ { "severity": "blocking"|"convention"|"nit", +// "file": "src/x.ts", "line": 42, "side": "RIGHT"|"LEFT", +// "title": "…", "body": "…", "rule": "optional-invariant-id" } ] } +// `side` is optional and defaults to RIGHT (the added/changed side). A finding +// whose line is not in the diff falls back to a file-level / summary comment. +// ============================================================================= + +/** + * Stable per-finding hash, used to match a finding to an EXISTING open thread + * across pushes (convergence). Anchored on file + normalized diff context + rule + * + title — NOT raw line numbers (they drift on force-push). The `title` is what + * separates two DISTINCT findings that share the same file, hunk, rule, and + * severity; without it they collapse to one hash, so the second finding gets + * dropped as "already open" and its feedback silently disappears. The model does + * not emit a hunk, so we derive a coarse code anchor from the diff (see + * hunkAnchorFor). + */ +const findingHash = (f) => + normalizeKey( + [f.file ?? '', f.rule ?? '', (f.anchor ?? '').trim().slice(0, 200), f.severity, (f.title ?? '').trim()].join(' '), + ); + +const readModelFindings = async () => { + const raw = await readFile(FINDINGS_FILE, 'utf8'); + const parsed = JSON.parse(raw); // throws on malformed => fail-soft upstream. + const list = Array.isArray(parsed?.findings) ? parsed.findings : []; + // Validate + normalize. Drop anything with an out-of-enum severity. + const valid = []; + for (const f of list) { + const severity = String(f.severity ?? '').toLowerCase(); + if (!SEVERITY_ENUM.includes(severity)) { + log(`dropping finding with invalid severity: ${JSON.stringify(f.severity)}`); + continue; + } + const side = String(f.side ?? 'RIGHT').toUpperCase() === 'LEFT' ? 'LEFT' : 'RIGHT'; + valid.push({ + severity, + file: typeof f.file === 'string' ? f.file : null, + // Valid diff lines are 1-based; reject <= 0 so a bad value falls back to + // a file-level / summary placement rather than a never-matching line. + line: Number.isFinite(f.line) && f.line >= 1 ? f.line : null, + side, + title: String(f.title ?? '').slice(0, 300), + body: String(f.body ?? '').slice(0, 4000), + rule: f.rule ? String(f.rule).slice(0, 80) : null, + }); + } + return valid; +}; + +// ============================================================================= +// DIFF POSITION MAPPING +// ============================================================================= +// +// The "create a review" REST API anchors inline comments with { path, line, +// side } (the line in the file at the head SHA). A comment can only attach to a +// line that is PART OF THE DIFF. We parse the unified diff (already produced +// deterministically into DIFF_FILE) to learn, per file, which (side, line) +// positions are commentable, and to capture a small code anchor per hunk so a +// finding can be matched across pushes even as line numbers drift. +// A finding whose line is NOT in the diff falls back to a file-level comment +// (subject_type=file) or, if even the file isn't in the diff, the review body +// summary — it NEVER crashes the post. +// ============================================================================= + +/** + * parseUnifiedDiff — minimal unified-diff parser. + * Returns Map, leftLines:Set, + * hunks: [{ rightStart, leftStart, text }] }>. + * We only need membership ("is this line commentable?") and a per-hunk text + * anchor; we do NOT need GitHub's legacy `position` integer (line+side is the + * modern, drift-resilient anchor). + */ +const parseUnifiedDiff = (patch) => { + const files = new Map(); + let current = null; + let rightLine = 0; + let leftLine = 0; + let hunk = null; + + // Split on CRLF or LF — a trailing \r would otherwise get baked into the + // hunk anchor text and destabilize the finding hash across pushes. + const lines = patch.split(/\r?\n/); + for (const raw of lines) { + if (raw.startsWith('diff --git ')) { + current = null; + hunk = null; + continue; + } + if (raw.startsWith('+++ ')) { + // "+++ b/path" — the head-side path. "/dev/null" => deletion (no RIGHT side). + const p = raw.slice(4).replace(/^b\//, '').trim(); + if (p === '/dev/null') { + current = null; + continue; + } + current = { rightLines: new Set(), leftLines: new Set(), hunks: [] }; + files.set(p, current); + continue; + } + if (raw.startsWith('@@')) { + // @@ -l,s +l,s @@ optional section heading + const m = raw.match(/@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@/); + if (!m || !current) { + hunk = null; + continue; + } + leftLine = Number(m[1]); + rightLine = Number(m[2]); + hunk = { rightStart: rightLine, leftStart: leftLine, text: raw }; + current.hunks.push(hunk); + continue; + } + if (!current) continue; + const tag = raw[0]; + if (tag === '+') { + current.rightLines.add(rightLine); + if (hunk) hunk.text += `\n${raw}`; + rightLine += 1; + } else if (tag === '-') { + current.leftLines.add(leftLine); + if (hunk) hunk.text += `\n${raw}`; + leftLine += 1; + } else if (tag === ' ') { + // Context line (leading space): unchanged, but it IS part of the hunk and + // GitHub accepts review comments on it ("unchanged lines shown for + // context"). Record it on BOTH sides so a finding on a context line still + // anchors inline instead of being needlessly demoted to a file comment. + // ONLY a leading-space line counts — an empty token (the trailing split + // artifact, or stray text outside a hunk) must NOT advance the counters or + // be marked commentable, or it would falsely anchor a non-diff line. + current.rightLines.add(rightLine); + current.leftLines.add(leftLine); + if (hunk) hunk.text += `\n${raw}`; + rightLine += 1; + leftLine += 1; + } + // tag === '\\' ("\ No newline at end of file") and empty/other tokens are + // not content lines — skip without touching the line counters. + } + return files; +}; + +/** Load + parse the pre-computed diff. Best-effort: missing diff => empty map. */ +const loadDiffIndex = async () => { + try { + const patch = await readFile(DIFF_FILE, 'utf8'); + return parseUnifiedDiff(patch); + } catch (err) { + log('diff: could not read/parse DIFF_FILE (all findings fall back to summary):', err.message); + return new Map(); + } +}; + +/** + * Deterministically clip a unified diff to its first `limit` per-file sections. + * Each file section starts at a `diff --git ` line, so cutting on that boundary + * always yields a still-valid patch (no half-file/half-hunk). Returns the + * original patch unchanged when it has `limit` files or fewer. + */ +const clipDiffToFileLimit = (patch, limit) => { + const lines = patch.split('\n'); + let fileCount = 0; + const kept = []; + for (const line of lines) { + if (line.startsWith('diff --git ')) { + fileCount += 1; + if (fileCount > limit) break; + } + kept.push(line); + } + return { clipped: fileCount > limit, fileCount: Math.min(fileCount, limit), text: kept.join('\n') }; +}; + +/** + * In bounded mode, rewrite DIFF_FILE in place to its first + * BOUNDED_MODE_FILE_LIMIT file sections so the model gets a REAL, deterministic + * scope — not just a prompt asking it to self-limit. No-op when not bounded or + * the diff already fits. Best-effort: a read/write failure leaves the full diff. + */ +const enforceBoundedDiff = async (boundedMode) => { + if (!boundedMode) return; + try { + const patch = await readFile(DIFF_FILE, 'utf8'); + const { clipped, fileCount, text } = clipDiffToFileLimit(patch, BOUNDED_MODE_FILE_LIMIT); + if (!clipped) return; + await writeFile(DIFF_FILE, text); + log(`pre: bounded mode — clipped diff to first ${fileCount} files (${BOUNDED_MODE_FILE_LIMIT} cap).`); + } catch (err) { + log('pre: bounded-diff clip failed (continuing with full diff):', err.message); + } +}; + +/** + * Find the hunk in `fileEntry` that contains (side, line). Used to derive a + * stable code anchor for the finding hash so convergence survives line drift. + */ +const stripHunkHeader = (text) => text.replace(/^@@.*@@.*/, '').trim().slice(0, 200); + +const hunkAnchorFor = (fileEntry, line, side) => { + if (!fileEntry) return ''; + // Pick the LAST hunk whose start is at/before the target line — that's the + // one actually containing it. Picking the first (any start ≤ line) would + // anchor a late-file finding to an early hunk and weaken cross-push + // convergence. Coarse (first ~200 chars) on purpose — stability over exactness. + let containing = null; + for (const h of fileEntry.hunks) { + const start = side === 'LEFT' ? h.leftStart : h.rightStart; + if (line >= start) containing = h; + } + if (containing) { + const anchor = stripHunkHeader(containing.text); + if (anchor) return anchor; + } + return stripHunkHeader(fileEntry.hunks[0]?.text ?? ''); +}; + +/** + * Classify each finding against the diff index: + * - `inline` : (side, line) is commentable → real anchored inline comment. + * - `file` : file is in the diff but the line isn't → file-level comment. + * - `summary` : file not in the diff at all → rolled into the review body. + * Also attaches a stable `anchor` (+ `hash`) for cross-push convergence. + */ +const classifyFindings = (findings, diffIndex) => + findings.map((f) => { + const fileEntry = f.file ? diffIndex.get(f.file) : null; + const lineSet = f.side === 'LEFT' ? fileEntry?.leftLines : fileEntry?.rightLines; + const anchor = hunkAnchorFor(fileEntry, f.line ?? 0, f.side); + const withAnchor = { ...f, anchor }; + const hash = findingHash(withAnchor); + + if (f.file && fileEntry && f.line != null && lineSet?.has(f.line)) { + return { ...withAnchor, hash, placement: 'inline' }; + } + if (f.file && fileEntry) { + return { ...withAnchor, hash, placement: 'file' }; + } + return { ...withAnchor, hash, placement: 'summary' }; + }); + +// ============================================================================= +// OWN-THREAD STATE (dedup + resolution targets) — via GraphQL +// ============================================================================= +// +// To converge across pushes we need, for OUR OWN review comments only: +// - the stable finding-hash we stamped (read from a hidden HTML comment in +// each comment body), and +// - the review-thread node id + isResolved flag (REST review comments don't +// expose thread/resolution; GraphQL `reviewThreads` does). +// We then: (a) skip posting a finding whose hash already has an OPEN own-thread +// (dedup), and (b) `resolveReviewThread` our own OPEN threads whose finding is +// gone from the current set (fixed/changed code). +// ============================================================================= + +/** Pull the stamped finding-hash out of one of our comment bodies, if present. */ +const hashFromOwnBody = (body) => { + const m = (body ?? '').match(//); + return m ? m[1] : null; +}; + +/** Pull EVERY stamped finding-hash out of a body (a review body lists many). */ +const hashesFromBody = (body) => [...(body ?? '').matchAll(//g)].map((m) => m[1]); + +/** + * Is this login OURS? GraphQL returns a Bot login WITHOUT the "[bot]" suffix + * ("github-actions"), while the REST-derived selfLogin carries it + * ("github-actions[bot]") — normalize both before comparing. + */ +const isSelfLogin = (login) => login.replace(/\[bot\]$/, '') === env.selfLogin.replace(/\[bot\]$/, ''); + +/** + * Did WE author this review/comment? Login match, with the hidden + * SELF_COMMENT_MARKER as the canonical fallback. + */ +const isOwnAuthor = (login, body) => isSelfLogin(login) || (body ?? '').includes(SELF_COMMENT_MARKER); + +/** + * Fetch the finding-hashes we already SUMMARIZED in our own prior review bodies. + * Summary/overflow findings never become review THREADS (they're body bullets, + * not inline comments), so reviewThreads can't dedup them — without this each + * push reposts the same "Additional notes". We stamp every summary bullet with a + * hidden hash (see buildReviewPayload) and read them back here. Best-effort: a + * GraphQL failure just means we may repost a summary line, never a crash. + */ +const fetchOwnSummarizedHashes = async () => { + const query = ` + query($owner:String!, $repo:String!, $pr:Int!, $cursor:String) { + repository(owner:$owner, name:$repo) { + pullRequest(number:$pr) { + reviews(first:100, after:$cursor) { + pageInfo { hasNextPage endCursor } + nodes { body author { login } } + } + } + } + }`; + + const out = new Set(); + let cursor = null; + for (let page = 0; page < 10; page += 1) { + // eslint-disable-next-line no-await-in-loop -- sequential cursor pagination + const r = await graphqlFetch(query, { owner, repo, pr: Number(env.prNumber), cursor }); + if (!r.ok || !r.data) { + log('own-reviews: GraphQL fetch failed (summary dedup best-effort):', r.errors?.[0]?.message ?? ''); + return out; + } + const conn = r.data.repository?.pullRequest?.reviews; + for (const node of conn?.nodes ?? []) { + if (!isOwnAuthor(node.author?.login ?? '', node.body)) continue; + for (const h of hashesFromBody(node.body)) out.add(h); + } + if (!conn?.pageInfo?.hasNextPage) break; + cursor = conn.pageInfo.endCursor; + } + return out; +}; + +/** + * Fetch OUR OWN review threads on this PR via GraphQL, paginated. + * Returns [{ threadId, isResolved, resolvedByLogin, hash }] where hash is the + * stamped finding hash on the FIRST (root) comment of a thread we authored. + * `resolvedByLogin` is the login that resolved the thread (null if open), used + * to tell a human resolution apart from the bot's own auto-resolution. + */ +const fetchOwnThreads = async () => { + const query = ` + query($owner:String!, $repo:String!, $pr:Int!, $cursor:String) { + repository(owner:$owner, name:$repo) { + pullRequest(number:$pr) { + reviewThreads(first:100, after:$cursor) { + pageInfo { hasNextPage endCursor } + nodes { + id + isResolved + isOutdated + resolvedBy { login } + comments(first:1) { + nodes { body author { login } } + } + } + } + } + } + }`; + + const out = []; + let cursor = null; + // Hard page cap (~1000 threads) so a stuck/looping endCursor can't spin forever. + for (let page = 0; page < 10; page += 1) { + const r = await graphqlFetch(query, { + owner, + repo, + pr: Number(env.prNumber), + cursor, + }); + if (!r.ok || !r.data) { + log('own-threads: GraphQL fetch failed (treating as no prior threads):', r.errors?.[0]?.message ?? ''); + return out; + } + const conn = r.data.repository?.pullRequest?.reviewThreads; + for (const node of conn?.nodes ?? []) { + const root = node.comments?.nodes?.[0]; + if (!root) continue; + if (!isOwnAuthor(root.author?.login ?? '', root.body)) continue; + out.push({ + threadId: node.id, + isResolved: Boolean(node.isResolved), + resolvedByLogin: node.resolvedBy?.login ?? null, + hash: hashFromOwnBody(root.body), + }); + } + if (!conn?.pageInfo?.hasNextPage) break; + cursor = conn.pageInfo.endCursor; + } + return out; +}; + +/** Mark one of OUR threads resolved. Fail-soft per call. */ +const resolveThread = async (threadId) => { + const mutation = ` + mutation($threadId:ID!) { + resolveReviewThread(input:{threadId:$threadId}) { + thread { id isResolved } + } + }`; + const r = await graphqlFetch(mutation, { threadId }); + if (!r.ok) { + log('resolve-thread: failed (continuing):', r.errors?.[0]?.message ?? threadId); + return false; + } + return true; +}; + +// ============================================================================= +// INLINE REVIEW RENDERING + POST +// ============================================================================= +// +// Build ONE PR review: event=COMMENT (advisory — never approve/request-changes), +// commit_id=head SHA, comments=[{path, line, side, body}, ...]. Findings that +// can't anchor to a diff line are file-level comments (subject_type=file) or, if +// even the file isn't in the diff, summarized in the review body. +// ============================================================================= + +const SEVERITY_LABEL = { blocking: '🚫 Blocking', convention: '📐 Convention', nit: '🔧 Nit' }; + +// Heading for body-summarized findings (summary placements + inline overflow + +// the 422 fallback). Single source so the skip-post check and the fallback's +// dedup-guard match the heading we actually emit. +const SUMMARY_HEADING = `### Additional notes (couldn't anchor to a diff line)`; + +/** Render one finding into an inline-comment markdown body (with hidden hash). */ +const renderCommentBody = (f) => { + // Rule/invariant ids (f.rule) are INTERNAL grounding only — used for the + // dedup hash and the model's self-validation. They're meaningless to a human + // in a PR comment, so they are never rendered into the comment body. + const head = `**${SEVERITY_LABEL[f.severity]} — ${f.title}**`; + const stamp = `\n\n${SELF_COMMENT_MARKER}`; + const body = `${head}\n\n${f.body}${stamp}`; + // Unreachable while readModelFindings caps title@300 + body@4000 (well under + // MAX_COMMENT_BODY); kept intentionally in case those caps rise — not dead code. + return body.length > MAX_COMMENT_BODY ? `${body.slice(0, MAX_COMMENT_BODY - stamp.length - 16)}\n…${stamp}` : body; +}; + +/** + * Is this finding actually anchorable as an inline/file comment in the diff we + * parsed? This is the SAME test GitHub applies server-side when it validates + * `comments[]` on `POST /pulls/{n}/reviews`, so mirroring it here lets us demote + * anything unanchorable to the summary body instead of risking a 422 that drops + * the whole review. + */ +const isAnchorable = (f, diffIndex) => { + const fileEntry = f.file ? diffIndex.get(f.file) : null; + if (!fileEntry) return false; // file not in the real PR diff at all + if (f.placement === 'file') return true; // file is in the diff → file-level ok + const lineSet = f.side === 'LEFT' ? fileEntry.leftLines : fileEntry.rightLines; + return f.line != null && Boolean(lineSet?.has(f.line)); +}; + +/** + * Build the review payload from the findings we decided to POST (already + * deduped against open own-threads). Inline + file-level findings become + * `comments[]`; summary-only findings (and inline overflow past the cap) roll + * into the review `body`. + * + * RESILIENCE: an inline/file comment is added to `comments[]` ONLY if its + * (path, line, side) is actually anchorable in the diff we parsed — re-verified + * here against `diffIndex`, not just trusting the `placement` tag — so a stray + * path/line can never reach GitHub as an inline comment and 422 the review. + */ +const buildReviewPayload = ({ toPost, diffIndex, deepInfo, skipCount }) => { + const comments = []; + const overflow = []; + + for (const f of toPost) { + // Summary placements and anything not anchorable in the REAL diff roll into + // the body so the POST can never 422 ("Path could not be resolved"). + if (f.placement === 'summary' || !isAnchorable(f, diffIndex)) { + overflow.push(f); + continue; + } + if (comments.length >= MAX_INLINE_COMMENTS) { + overflow.push(f); + continue; + } + const base = { path: f.file, body: renderCommentBody(f) }; + // file-level comment: subject_type=file, no line. Otherwise anchor by line+side. + comments.push(f.placement === 'file' ? { ...base, subject_type: 'file' } : { ...base, line: f.line, side: f.side }); + } + + const summaryLines = []; + for (const f of overflow) { + const loc = f.file ? `\`${f.file}${f.line ? `:${f.line}` : ''}\`` : ''; + // Trailing hidden hash so the NEXT run can see this summary finding was + // already reported (summary findings have no thread, so reviewThreads can't + // dedup them — fetchOwnSummarizedHashes reads these markers back). + summaryLines.push(`- **${SEVERITY_LABEL[f.severity]}** ${loc} — ${f.title} `); + } + + const header = + `## 🔭 thunder-deep-review (advisory)\n` + + `Complements the other bots — surfaces only what they did not flag. ` + + `Never approves, never requests changes, never gates merge.\n` + + `head: \`${env.headSha.slice(0, 12)}\` · mode: ${deepInfo.deepMode ? 'deep' : 'single'}` + + `${deepInfo.boundedMode ? ' (bounded: diff exceeded file cap)' : ''} · ` + + `deferred ${skipCount} item(s) already reported by other bots (best-effort dedup)\n`; + + const summaryBlock = + summaryLines.length === 0 ? '' : `\n${SUMMARY_HEADING}\n${summaryLines.join('\n')}\n`; + + const body = `${header}${summaryBlock}`; + return { event: 'COMMENT', commit_id: env.headSha, body, comments }; +}; + +/** POST the review. event=COMMENT only — NEVER approve/request-changes. */ +const postReview = async (payload) => { + if (payload.comments.length === 0 && !payload.body.includes(SUMMARY_HEADING)) { + // Nothing to anchor inline and nothing to summarize beyond the header — + // i.e. zero new findings. Skip posting entirely rather than leave a + // contentless review on the PR. + log('post: no new inline comments and no summary items — skipping review post.'); + return; + } + const r = await githubFetch(apiUrl(`/pulls/${env.prNumber}/reviews`), { + method: 'POST', + body: JSON.stringify(payload), + }); + if (r.ok) { + log(`posted inline review: ${payload.comments.length} inline comment(s), event=COMMENT.`); + return; + } + // Last-resort resilience: GitHub rejects the WHOLE review if ANY inline + // comment can't be anchored ("Path could not be resolved"). Rather than lose + // every good finding, retry ONCE with all inline comments dropped, folding + // their locations into the summary body so the findings still reach the human. + if (r.status === 422 && payload.comments.length > 0) { + log(`post: 422 on inline anchors (${r.body ?? ''}) — retrying comment-free with findings in the body.`); + const demoted = payload.comments.map((c) => { + const lineRef = c.line ? `:${c.line}` : ''; + // Carry the hidden finding-hash from the inline body onto the demoted + // bullet so the next run's summary-dedup (fetchOwnSummarizedHashes) sees + // it and does not repost — otherwise this fallback path reintroduces the + // "summary findings never dedupe" bug it is rescuing. + const hash = hashFromOwnBody(c.body); + const stamp = hash ? ` ` : ''; + return `- \`${c.path}${lineRef}\` — ${c.body.split('\n')[0].replace(/\*\*/g, '')}${stamp}`; + }); + // payload.body may ALREADY carry a SUMMARY_HEADING block (mixed-placement + // reviews with summary findings) — append the demoted bullets under it + // rather than emitting a duplicate header. + const demotedBlock = demoted.join('\n'); + const fallbackBody = payload.body.includes(SUMMARY_HEADING) + ? `${payload.body}${demotedBlock}\n` + : `${payload.body}\n${SUMMARY_HEADING}\n${demotedBlock}\n`; + const retry = await githubFetch(apiUrl(`/pulls/${env.prNumber}/reviews`), { + method: 'POST', + body: JSON.stringify({ event: 'COMMENT', commit_id: env.headSha, body: fallbackBody, comments: [] }), + }); + if (!retry.ok) throw new Error(`create review retry failed: ${retry.status} ${retry.body ?? ''}`); + log(`posted comment-free fallback review: ${demoted.length} finding(s) in body, event=COMMENT.`); + return; + } + throw new Error(`create review failed: ${r.status} ${r.body ?? ''}`); +}; + +// ============================================================================= +// PHASES +// ============================================================================= + +/** PRE phase: poll A's bots, then write the skip-list + deep-mode flag for the model step. */ +const runPre = async () => { + await pollForBots(); + const [skipList, deepInfo] = await Promise.all([buildSkipList(), computeDeepMode()]); + await writeFile(SKIPLIST_FILE, JSON.stringify({ headSha: env.headSha, skipList }, null, 2)); + await writeFile(DEEPMODE_FILE, JSON.stringify(deepInfo, null, 2)); + // The DIFF_FILE itself is produced by a separate deterministic workflow step + // (the GitHub API's PR diff, Accept: v3.diff — the exact merge-base..head set + // GitHub validates review anchors against). This script never shells out to + // git; it only does GitHub REST I/O. But when the PR overran the file cap we + // clip that diff here so bounded mode is an ENFORCED scope, not a prompt hint. + await enforceBoundedDiff(deepInfo.boundedMode); + log(`pre: wrote ${skipList.length} skip-list entries; deepMode=${deepInfo.deepMode} (${deepInfo.reason}).`); +}; + +/** + * POST phase: read validated model findings, map them to diff positions, dedup + * against our own OPEN review threads, resolve our own FIXED threads, then post + * ONE PR review (event=COMMENT) carrying only the NEW findings as inline + * comments. + */ +const runPost = async () => { + const rawFindings = await readModelFindings(); // throws => fail-soft + let skipCount = 0; + let deepInfo = { deepMode: false, boundedMode: false }; + try { + skipCount = JSON.parse(await readFile(SKIPLIST_FILE, 'utf8')).skipList?.length ?? 0; + deepInfo = JSON.parse(await readFile(DEEPMODE_FILE, 'utf8')); + } catch { + /* best-effort metadata only */ + } + + // 1) Map findings to diff positions (inline / file / summary) + stable hash. + const diffIndex = await loadDiffIndex(); + const findings = classifyFindings(rawFindings, diffIndex); + const currentHashes = new Set(findings.map((f) => f.hash)); + + // 2) Fetch our own prior state: open inline THREADS (hash + resolution) and + // the hashes we already SUMMARIZED in prior review bodies. Summary findings + // have no thread, so without the latter each push reposts the same notes. + const [ownThreads, summarizedHashes] = await Promise.all([fetchOwnThreads(), fetchOwnSummarizedHashes()]); + const openHashes = new Set(ownThreads.filter((t) => !t.isResolved && t.hash).map((t) => t.hash)); + // Threads a HUMAN resolved (decided the finding is intentional). We must NOT + // re-post these even though the underlying code is unchanged, or the bot would + // fight the human's resolve on every push. We scope to human resolutions only: + // a thread WE auto-resolved (step 4) is resolved precisely because its code was + // fixed, so if that code regresses the finding genuinely reappears and SHOULD + // re-post — hence bot-self-resolved hashes are deliberately left out. + const humanResolvedHashes = new Set( + ownThreads.filter((t) => t.isResolved && t.hash && t.resolvedByLogin && !isSelfLogin(t.resolvedByLogin)).map((t) => t.hash), + ); + + // 3) Dedup: post only findings NOT already present as an OPEN own-thread, NOT + // human-resolved (respect the human's decision), and — for summary-placement + // findings, which have no thread — not already listed in a prior review's + // "Additional notes" body. + const toPost = findings.filter( + (f) => !openHashes.has(f.hash) && !humanResolvedHashes.has(f.hash) && !summarizedHashes.has(f.hash), + ); + + // 4) Convergence: resolve our OWN open threads whose finding is gone now. + // Suppressed on `reopened` (the prior threads may be from another base). + // ALSO suppressed when the model emitted ZERO findings this run: an empty + // result is the model saying "nothing new to add" (the prompt tells it to + // return {"findings":[]} in that case), NOT "every prior finding is fixed". + // Without this guard an empty run would mark every open own-thread stale and + // resolve it even though the flagged code was never touched. + if (env.eventAction === 'reopened') { + log('post: reopened event — skipping thread resolution.'); + } else if (findings.length === 0) { + log('post: model returned no findings — skipping thread resolution (no fix signal).'); + } else { + const stale = ownThreads.filter((t) => !t.isResolved && t.hash && !currentHashes.has(t.hash)); + for (const t of stale) { + // eslint-disable-next-line no-await-in-loop -- small N, ordered for clear logs + await resolveThread(t.threadId); + } + if (stale.length) log(`post: resolved ${stale.length} own thread(s) whose finding is fixed.`); + } + + // 5) Post the new findings as one inline review. + const payload = buildReviewPayload({ toPost, diffIndex, deepInfo, skipCount }); + await postReview(payload); +}; + +// ============================================================================= +// MAIN +// ============================================================================= + +const main = async () => { + const phase = process.argv[2]; + if (!['pre', 'post'].includes(phase)) failSoft(`usage: review-orchestrator.mjs (got "${phase}")`); + if (!env.token) failSoft('missing GITHUB_TOKEN/GH_TOKEN'); + if (!owner || !repo) failSoft(`missing/invalid GITHUB_REPOSITORY: "${env.repoFull}"`); + if (!env.prNumber) failSoft('missing PR_NUMBER'); + if (!env.headSha) failSoft('missing PR_HEAD_SHA'); // must reconcile to head, not merge ref. + + try { + if (phase === 'pre') await runPre(); + else await runPost(); + } catch (err) { + failSoft(`unrecoverable error in ${phase} phase`, err); + } +}; + +main(); diff --git a/.github/workflows/thunder-deep-review.yml b/.github/workflows/thunder-deep-review.yml new file mode 100644 index 000000000..7ae478f18 --- /dev/null +++ b/.github/workflows/thunder-deep-review.yml @@ -0,0 +1,319 @@ +# ============================================================================= +# thunder-deep-review — advisory, comment-only AI review. Never approves, never +# requests changes, never merges. Human approval + merge stay exactly as today. +# +# Input-name verification: +# - The `anthropics/claude-code-action` input names below were verified +# against the live @v1 `action.yml` (2026-06): `prompt`, `anthropic_api_key`, +# `claude_args`, and `use_sticky_comment` all EXIST. `model:` / +# `allowed_tools:` / `review_event:` do NOT exist in @v1 — do NOT +# reintroduce them (pass --model/--allowedTools via `claude_args`). +# Re-confirm only if the action major version changes. +# - Action commit SHAs are PINNED below (checkout / claude-code-action). +# - The only secret the team sets is the ANTHROPIC_API_KEY repo secret. +# - Recommended: CODEOWNERS-protect this workflow + .github/scripts/review-orchestrator.mjs +# (they hold the review logic + the API-key reference). +# +# Orchestration is deterministic CODE, not model tool-calls: .github/scripts/ +# review-orchestrator.mjs does ALL GitHub I/O (poll A's bots, build the +# skip-list, post the inline PR review). The model step is READ-ONLY and ONLY +# emits structured JSON findings as the action's structured_output (via +# --json-schema); a separate non-model step materializes that to a file. +# ============================================================================= + +name: thunder-deep-review + +on: + pull_request: + # Include ready_for_review so un-drafting triggers a review; drafts + # are skipped by the job `if:` below. + types: [opened, synchronize, reopened, ready_for_review] + # NOTE: fork PRs are GATED OFF entirely by the job `if:` — do NOT switch + # to `pull_request_target` + checkout of fork head (the key-exposure + # pwn-request). Fork coverage, if ever wanted, is a separate `workflow_run` + # job that reads metadata only and never executes fork code. + +# Least privilege + the three scopes the mechanism actually needs. +permissions: + contents: read # read the diff/files + pull-requests: write # post the inline PR review (POST /pulls/{n}/reviews) + checks: read # poll A's bot check-runs (without this the poller 403s) + +concurrency: + # Only review the latest push. cancel-in-progress during the debounce + # `sleep` (below) is free. No cross-run lock: a run posts a review only when it + # has new findings (it skips otherwise) and converges via open-thread hash dedup + # (the post phase resolves/skips the threads it already owns), so a cancelled run is safe. + group: thunder-deep-review-${{ github.event.pull_request.number }} + cancel-in-progress: true + +jobs: + review: + # Hard-gate on same-repo PRs only (no forks), skip Dependabot, skip + # drafts. This is the primary defense — fork code must never run with + # access to repo secrets. + if: >- + github.event.pull_request.head.repo.full_name == github.repository && + github.actor != 'dependabot[bot]' && + github.event.pull_request.draft == false + + # GitHub-hosted ephemeral runner: a clean VM per job. Safe now that the + # job only runs on same-repo PRs (fork `if:` gate above) and uses the + # public Anthropic API (no cloud creds on the runner). + runs-on: ubuntu-latest + + # Cap total runner time so a hung model API call can't hold the + # runner indefinitely. + timeout-minutes: 20 + + env: + # Reconcile EVERYTHING to the PR HEAD sha, never the synthetic merge ref. + # Both the poller and the diff must describe the same code. + PR_NUMBER: ${{ github.event.pull_request.number }} + PR_HEAD_SHA: ${{ github.event.pull_request.head.sha }} + PR_ACTION: ${{ github.event.action }} + PR_LABELS: ${{ join(github.event.pull_request.labels.*.name, ',') }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # The orchestrator posts the review under the default GITHUB_TOKEN, so the + # author login is github-actions[bot]. The script matches/resolves its own + # threads by this login (+ a hidden marker). No separate GitHub App. + THUNDER_BOT_LOGIN: github-actions[bot] + # File handoff between steps (must match .github/scripts/review-orchestrator.mjs). + THUNDER_DIFF_FILE: ${{ github.workspace }}/thunder-diff.patch + THUNDER_SKIPLIST_FILE: ${{ github.workspace }}/thunder-skiplist.json + THUNDER_DEEPMODE_FILE: ${{ github.workspace }}/thunder-deepmode.json + THUNDER_FINDINGS_FILE: ${{ github.workspace }}/thunder-findings.json + + steps: + # ───────────────────────────────────────────────────────────────────── + # DEBOUNCE. Rapid pushes (rebases/typo fixes) cancel during this + # sleep at zero model cost. Only survivors proceed to spend on the model. + # ───────────────────────────────────────────────────────────────────── + - name: Debounce rapid pushes + run: sleep 60 + + # ───────────────────────────────────────────────────────────────────── + # Checkout. persist-credentials:false (don't leave a usable token on disk + # for the model step), check out the HEAD sha shallow. The diff itself is + # fetched from the GitHub API (next step), so we need NO base/head fetch + # and NOT fetch-depth:0 — the working tree is only here for the model's + # Read/Grep/Glob tools and the .claude symlink-prune step. SHA pinned to + # v7.0.0. + # ───────────────────────────────────────────────────────────────────── + - name: Checkout (head, shallow) + uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0 + with: + persist-credentials: false + ref: ${{ github.event.pull_request.head.sha }} # review the head sha + fetch-depth: 1 # shallow; the diff comes from the API, not local git + + # ───────────────────────────────────────────────────────────────────── + # Pre-compute the diff DETERMINISTICALLY (no model, no Bash tool in + # the model step). The model reads this file via `Read`. + # + # It MUST equal GitHub's REAL PR diff — the same merge-base..head ("Files + # changed") set that `POST /pulls/{n}/reviews` validates inline `comments[]` + # against — so we fetch it straight from the API (Accept: v3.diff) instead + # of a local `git diff`. A local TWO-DOT `git diff base..head` is WRONG: + # base.sha is the CURRENT tip of the base branch, not the merge-base, so + # when the PR branch is behind base it injects base-only commits as REVERSE + # diffs. The model then anchors findings to files NOT in the real PR diff, + # and `POST /pulls/{n}/reviews` 422s ("Path could not be resolved"), + # dropping the whole review. + # + # GITHUB_TOKEN is in the job env; this is NOT the read-only model step, so + # using it here is fine. `set -o pipefail` so an API error fails loudly + # instead of writing a truncated/empty diff that silently reviews nothing. + # ───────────────────────────────────────────────────────────────────── + - name: Compute diff to a bounded file + run: | + set -o pipefail + gh api "repos/${GITHUB_REPOSITORY}/pulls/${PR_NUMBER}" \ + -H "Accept: application/vnd.github.v3.diff" > "$THUNDER_DIFF_FILE" + echo "diff bytes: $(wc -c < "$THUNDER_DIFF_FILE")" + + # ───────────────────────────────────────────────────────────────────── + # The repo commits `.claude/{commands,agents}/thunderbot*` as symlinks into + # a local-only `.thunderbot/` dir that is NOT checked out in CI, so they are + # dangling links here. The Claude SDK scans `.claude/` (the project setting + # source) and crashes (`ENOENT ... statx`) on a dangling link. Remove ONLY + # broken symlinks; the real skill/agent files are left untouched. + # ───────────────────────────────────────────────────────────────────── + - name: Prune dangling .claude symlinks (local-only tooling) + run: | + find .claude -type l | while read -r l; do + [ -e "$l" ] || { echo "pruning dangling symlink: $l"; rm -f "$l"; } + done + + # ───────────────────────────────────────────────────────────────────── + # (PRE phase): poll A's bots, build the skip-list, compute the + # deterministic deep-mode flag. ALL GitHub I/O is in this script. + # This step touches untrusted PR metadata but holds no model credentials. + # ───────────────────────────────────────────────────────────────────── + - name: Orchestrator — poll bots + build skip-list (pre) + run: node .github/scripts/review-orchestrator.mjs pre + # NOTE: .github/scripts/review-orchestrator.mjs is checked in at the + # repo-relative path; recommend CODEOWNERS-protecting it. + + # ───────────────────────────────────────────────────────────────────── + # The MODEL step. READ-ONLY: emits STRUCTURED JSON findings ONLY as the + # action's structured_output (via --json-schema), NOT to a file (it has no + # Write/Edit/Bash tool). The next step materializes that output to the + # findings file. It does NO GitHub I/O. + # + # Input names below were verified against the live @v1 `action.yml` + # (2026-06); re-confirm only if the action major version changes: + # - auth via `anthropic_api_key` (the Anthropic public API directly; + # NOT Bedrock/Vertex). The only secret the team provisions. + # - sticky/consolidated comment via `use_sticky_comment`. We render+post + # the inline review ourselves in the post step, so it stays "false" + # here — the post step is the single owner of the inline review. + # - pass model + tools through `claude_args`, NOT `model:`/`allowed_tools:`. + # - there is NO `review_event:` input in @v1. + # - allowedTools = Read,Grep,Glob,Skill,Task,Agent. NO Bash(...). NO write tools. + # `Skill` and the subagent-spawn tool REQUIRE permission (per the tools + # reference); the action runs the Agent SDK in DEFAULT permission mode + # with NO canUseTool callback (headless CI), so any permission-required + # tool that is NOT pre-approved by an allow rule is denied with no prompt. + # The whole review is invoked via the `Skill` tool ("Use the + # thunder-deep-review skill"), and the skill fans out to read-only + # sub-reviewers via the subagent-spawn tool — that tool is named `Task` + # in some SDK versions and `Agent` in others, so BOTH are allow-listed + # for forward/backward compatibility (the unused name is harmless). If + # only the wrong name were listed, every spawn (the per-lens sub- + # reviewers + the dispatched powersync-sync/react-effect reviewers) + # would be silently denied, collapsing the fan-out to a single pass. + # Read/Grep/Glob need no permission but are listed for clarity. The + # skill + its subagents under `.claude/{skills,agents}/` are auto- + # discovered because the SDK loads the `project` setting source by default. + # SHA pinned to v1. + # PIN the model id LITERALLY here — do NOT source from vars.* (a + # mutable repo/org var changes review behavior with no PR/audit trail). + # ───────────────────────────────────────────────────────────────────── + - name: thunder-deep-review (model — structured findings only) + id: review_model + uses: anthropics/claude-code-action@806af32823ef69c8ef357086c573a902af641307 # v1 + with: + anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} + # Pass the workflow token so the action authenticates to GitHub WITH it + # instead of minting a Claude App token via OIDC — which would require + # `id-token: write` (we keep least-privilege without it). The action's + # OIDC path only runs when github_token is empty; providing it skips OIDC. + github_token: ${{ secrets.GITHUB_TOKEN }} + use_sticky_comment: "false" # we own the inline review via the post step + # Model id pinned LITERALLY (a mutable vars.* would change review + # behavior with no PR/audit trail). claude-opus-4-8 is a pinned + # snapshot id (the 4.6+ generation uses a dateless-but-pinned format). + # --allowedTools: Skill (invokes the review skill — permission-required, + # so it MUST be allow-listed or the headless run denies it and no-ops), + # Task AND Agent (the skill's read-only sub-reviewer fan-out — also + # permission-required; the subagent-spawn tool is `Task` in some SDK + # versions and `Agent` in others, so both names are listed), and + # Read,Grep,Glob (read-only working-tree access). NO Bash, + # NO write tools — the model is fully READ-ONLY and CANNOT write a file. + # --max-turns caps the agent loop. + # --json-schema: the model is read-only, so it can't write the findings + # file directly. Instead it returns findings as the action's STRUCTURED + # OUTPUT (action.yml `outputs.structured_output`, populated only when a + # --json-schema is provided in claude_args). The next step materializes + # that output into ${THUNDER_FINDINGS_FILE}. The schema MUST stay aligned + # with the orchestrator's readModelFindings parser: + # findings[].severity ∈ {blocking,convention,nit}; file (string|null); + # line (integer|null, parser keeps only >= 1); side ∈ {RIGHT,LEFT} + # (optional, defaults RIGHT); title; body; rule (string|null). + claude_args: >- + --model claude-opus-4-8 + --allowedTools Read,Grep,Glob,Skill,Task,Agent + --max-turns 20 + --json-schema '{"type":"object","additionalProperties":false,"required":["findings"],"properties":{"findings":{"type":"array","items":{"type":"object","additionalProperties":false,"required":["severity","file","line","title","body","rule"],"properties":{"severity":{"type":"string","enum":["blocking","convention","nit"]},"file":{"type":["string","null"]},"line":{"type":["integer","null"],"minimum":1},"side":{"type":"string","enum":["RIGHT","LEFT"]},"title":{"type":"string"},"body":{"type":"string"},"rule":{"type":["string","null"]}}}}}}' + prompt: | + Use the thunder-deep-review skill to review this pull request. + The unified diff to review is in the file: ${{ env.THUNDER_DIFF_FILE }} (read it with Read). + A skip-list of issues ALREADY reported by other bots is in: + ${{ env.THUNDER_SKIPLIST_FILE }} + Do NOT repeat anything on the skip-list. Surface ONLY what the other + bots missed (architecture, conventions, docs-intent, readability). + Deep-mode / bounded-mode flags are in: ${{ env.THUNDER_DEEPMODE_FILE }} — + honor them; do NOT decide spend yourself. + + OUTPUT CONTRACT: you are READ-ONLY — do NOT attempt to write any file. + Return your findings as STRUCTURED OUTPUT matching the provided JSON + schema — a single object: + { "findings": [ + { "severity": "blocking" | "convention" | "nit", + "file": "", "line": , + "side": "RIGHT" | "LEFT", + "title": "", + "body": "", + "rule": "" } ] } + VOICE — write `title` and `body` for a TEAMMATE, not a linter: warm, + collaborative, curious. Lead with a question or first-person framing + where it fits ("I wonder if…", "Could we…?", "Heads up —"). Be specific + and kind; never robotic, terse, or scolding. + HUMAN-FACING TEXT — `title` and `body` are shown verbatim to a person, + so write ONLY plain prose + the concrete fix. Do NOT put rule/invariant + ids, section refs, or internal scaffolding in them — NO "R-REDUCER", + NO "INV-01", NO "(review-heuristics §2)", NO "H mass-assignment …". + Put the id ONLY in the `rule` field; it is internal grounding and is + NEVER shown to the human. + Emit NO markdown headers, NO severity trailer, NO PR comment — the + orchestrator renders and posts. If nothing to add, return {"findings":[]}. + + # ───────────────────────────────────────────────────────────────────── + # Persist the model's STRUCTURED OUTPUT to the findings file the post phase + # reads. The model step is READ-ONLY (no Write/Edit/Bash tool), so it cannot + # create ${THUNDER_FINDINGS_FILE} itself — the action's structured_output + # (populated by the --json-schema above) is the handoff channel. This step + # has legitimate shell access and materializes the file. + # + # Security: structured_output is model-derived and can contain quotes, shell + # metacharacters, or hostile PR content. We pass it via `env:` (NEVER inlined + # into the run: body) and read it with Node from process.env, so there is no + # script-generation injection surface (GitHub's recommended pattern for + # untrusted expression values). + # + # Fail posture: an EMPTY structured_output (model step skipped/cancelled/ + # errored — this step is `if: always()`) falls back to an empty findings set + # so the post phase stays a clean no-op instead of crashing on ENOENT. A + # NON-empty but malformed payload throws here, surfacing a real schema + # regression instead of silently posting nothing. + # ───────────────────────────────────────────────────────────────────── + - name: Persist structured findings to file + if: always() + env: + STRUCTURED_OUTPUT: ${{ steps.review_model.outputs.structured_output }} + run: | + node <<'NODE' + const fs = require('node:fs'); + const file = process.env.THUNDER_FINDINGS_FILE; + if (!file) throw new Error('THUNDER_FINDINGS_FILE is unset'); + const raw = process.env.STRUCTURED_OUTPUT; + if (!raw) { + // Empty structured_output means the model step produced NOTHING + // (skipped/cancelled/errored) — distinct from a genuine empty result, + // which arrives as the non-empty string '{"findings":[]}'. Fail loudly + // so a broken model step is a red check, never a silent green no-op + // (codex second opinion). + console.error('::error::model step produced no structured_output — review did not run'); + process.exit(1); + } + const parsed = JSON.parse(raw); // throws on malformed -> surfaces a real regression + if (!parsed || !Array.isArray(parsed.findings)) { + throw new Error('structured_output must be an object with a findings[] array'); + } + fs.writeFileSync(file, `${JSON.stringify(parsed)}\n`, 'utf8'); + console.log(`wrote ${parsed.findings.length} finding(s); bytes: ${fs.statSync(file).size}`); + NODE + + # ───────────────────────────────────────────────────────────────────── + # (POST phase): read+validate the model's JSON, compute the + # resolved-diff structurally, render markdown deterministically, and post + # ONE PR review (POST /pulls/{n}/reviews, event=COMMENT) with anchored inline + # comments — no issue comment, no PATCH/upsert, no lock. No model creds here; + # pure GitHub I/O. + # `if: always()` so a non-zero model step still lets us post (or skip + # fail-soft) — but the script itself exits 0 on any error. + # ───────────────────────────────────────────────────────────────────── + - name: Orchestrator — render + post inline review (post) + if: always() + run: node .github/scripts/review-orchestrator.mjs post diff --git a/.gitignore b/.gitignore index 2cc1b5941..007c8c6f1 100644 --- a/.gitignore +++ b/.gitignore @@ -54,6 +54,9 @@ mise.toml !.claude/agents/** !.claude/rules/ !.claude/rules/** +!.claude/skills/ +!.claude/skills/thunder-deep-review/ +!.claude/skills/thunder-deep-review/** .zed/** .zedscratches/** .mcp.json