feat(jobs+reports): comparative-audit jobs API + AI SDK report tool surface #83
Merged
…ture

src/design/audit/v2/types.ts: comprehensive TypeScript interfaces covering all 8 layers from the RFC (docs/rfc/design-audit-world-class.md):

Layer 1 - DimensionScore, RollupScore, ClassifierSignal, EnsembleClassification, DomHeuristics
Layer 2 - Patch, PatchTarget, PatchDiff, PatchTest, PatchRollback, DesignFinding (extended with id, dimension, patches, kind)
Layer 3 - NovelPatternObservation
Layer 4 - PatchApplication, PatchReliability
Layer 5 - Pattern, PatternScaffold, PatternFleetEvidence, PatternQuery, PatternMatch
Layer 6 - AppliesWhen extended (audience, modality, regulatoryContext, audienceVulnerability), tag enums
Layer 7 - EthicsRule, EthicsDetector, EthicsViolation, EthicsCategory
Layer 8 - Modality, ModalityInput, SurfaceMeasurements, SurfaceRecord, Evidence, ModalityAdapter
Top-level - AuditResult_v2, AuditRuntimeHints

Phase 0 is the stable contract that lets Wave 1 + Wave 2 implementation work proceed in parallel without diverging interfaces. Editing this file mid-build is a coordinated change; layers must update in lockstep.

Invariants enforced:
- Every score is DimensionScore with range + confidence
- Every major/critical finding MUST have >=1 Patch
- Every patch has both target (what changes) and testThatProves (how we verify)
- Every classification carries explicit ensembleConfidence + signalsAgreed
- Pattern, ethics, modality types compose via shared AppliesWhen
…Layer 8 scaffold

Full implementation of RFC-002: World-Class Design Audit. Primary consumer is coding agents (Claude Code, Codex, OpenCode, Pi); architecture is JSON-first, tool-callable, and self-explaining when uncertain.

Layer 1 - Multi-dimensional scoring: ensemble classifier (URL + DOM heuristic + LLM tiebreaker), 5 universal dimensions, per-page-type rollup weights and calibration anchors, AuditResult_v2 shape.

Layer 2 - Patch primitives: every major/critical finding ships patches[] with target, diff.before/after, testThatProves, rollback, estimatedDelta. Severity enforcement downgrades major/critical without valid patches to minor.

Layer 3 - First-principles fallback: fires when ensembleConfidence < 0.6 or signals disagree; scores against 5 universal product principles only; emits NovelPatternObservation to ~/.bad/novel-patterns/.

Layer 4 - Outcome attribution: append-only JSONL store, ack-patch + --post-patch close-the-loop, patchHash cross-tenant grouping, aggregatePatchReliability.

Layer 5 - Pattern library (scaffold): types/store/mine/match + CLI query/show. Cold start until ~6 weeks fleet data; mine threshold N≥30, ≥5 tenants, rate≥0.7.

Layer 6 - Composable predicates: AppliesWhen extended with audience/modality/regulatoryContext/audienceVulnerability; 9 new rubric fragments; loader matches on context flags --audience/--modality/--regulatory/--audience-vulnerability.

Layer 7 - Domain ethics gate: 4 rule files (medical/kids/finance/legal) with citation-backed rules; hard rollup floor critical→4, major→6; preEthicsScore preserved; --skip-ethics bypass (test-only, logged).

Layer 8 - Modality adapters (scaffold): HTML adapter wraps existing Playwright pipeline; iOS/Android throw NotImplementedError; --modality dispatch.

+40 new tests across patch-parse, patch-validate, first-principles, attribution. Total: 1393 passing.
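Two of the rules in this commit message are simple enough to sketch as pure functions: the Layer 2 severity downgrade and the Layer 3 fallback trigger. The types below are simplified stand-ins, not the real `types.ts` interfaces; in particular, treating a patch as "valid" when `target` and `testThatProves` are non-empty strings, and treating `signalsAgreed` as a boolean, are assumptions made for illustration.

```typescript
type Severity = "critical" | "major" | "minor";

// Hypothetical, flattened stand-in for the real Patch interface.
interface Patch {
  target: string;         // what changes
  testThatProves: string; // how the change is verified
}

interface Finding {
  id: string;
  severity: Severity;
  patches: Patch[];
}

// Assumption: a patch is valid when both required fields are non-empty.
function isValidPatch(p: Patch): boolean {
  return p.target.trim() !== "" && p.testThatProves.trim() !== "";
}

// Layer 2: a major/critical finding without at least one valid patch
// is downgraded to minor; everything else passes through unchanged.
function enforceSeverity(f: Finding): Finding {
  const needsPatch = f.severity === "critical" || f.severity === "major";
  if (needsPatch && !f.patches.some(isValidPatch)) {
    return { ...f, severity: "minor" };
  }
  return f;
}

// Layer 3: fall back to first-principles scoring when the ensemble is
// unsure (confidence below 0.6) or its signals disagree.
function shouldUseFirstPrinciples(
  ensembleConfidence: number,
  signalsAgreed: boolean,
): boolean {
  return ensembleConfidence < 0.6 || !signalsAgreed;
}
```

Keeping both checks as pure functions means they can be unit-tested without running an audit.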
… blob, health domain in medical rules

Three fixes discovered during smoke testing:

1. ethics/rules/*.yaml were not being copied to dist/ at build time: copy-static-assets.mjs only copied rubric fragments and anchors. Added ethics rules entry so the gate actually loads its rules at runtime.
2. pageTextBlob included the request URL in the content blob, causing false negatives on pattern-absent rules: a URL like medical-no-dosage.html contains "dosage" and suppressed the dosage-warning-required rule. URL is now excluded from the blob; URL-based classification uses the ensemble classifier's own URL heuristic.
3. Medical ethics rules matched domain: [medical, clinical, pharmacy] but the LLM classifies pharmacy-style ecommerce pages as domain "health". Added "health" to the domain list so the rules apply correctly.
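The second fix is easiest to see with a toy reproduction. The sketch below assumes a pattern-absent rule is modelled as "violated when a required pattern never matches the text blob"; the `PatternAbsentRule` shape, the `violates` helper, and the example host are hypothetical and do not mirror the real rule schema.

```typescript
// Hypothetical rule shape: the rule is violated when the pattern is absent.
interface PatternAbsentRule {
  id: string;
  requiredPattern: RegExp;
}

function violates(rule: PatternAbsentRule, blob: string): boolean {
  return !rule.requiredPattern.test(blob);
}

const dosageRule: PatternAbsentRule = {
  id: "dosage-warning-required",
  requiredPattern: /dosage/i,
};

const url = "https://example.test/medical-no-dosage.html";
const visibleText = "Buy pain relief tablets. Free shipping on all orders.";

// Old behaviour: the URL leaks into the blob, "dosage" matches, and the
// rule is wrongly treated as satisfied (a false negative).
const blobWithUrl = `${url}\n${visibleText}`;
const missedViolation = violates(dosageRule, blobWithUrl); // false

// Fixed behaviour: only page content is scanned, so the violation is caught.
const blobWithoutUrl = visibleText;
const caughtViolation = violates(dosageRule, blobWithoutUrl); // true
```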
…try rollup tests

PR #79 added two telemetry-rollup-remote tests that spawn `node --experimental-strip-types ROLLUP_PATH`. That flag is fully supported only on Node 22+; on Node 18 and 20 (both in our CI matrix), Node exits 9 (invalid argument) before the rollup script runs, so the tests assert exit 2 but get exit 9.

Replace with `tsx` (added as a devDependency) which works identically across all Node versions. The behavior under test is unchanged: rollup --remote without BAD_TELEMETRY_API / BAD_TELEMETRY_ADMIN_BEARER must exit 2 with a clear stderr message.
…urface
Three new modules layered on top of the existing audit pipeline:
src/jobs/      declarative comparative-audit jobs (JobSpec → discover → fan-out → persist).
               Crash-safe JSONL store at ~/.bad/jobs/, bounded-concurrency runner,
               pre-flight cost estimate. CLI: bad jobs {create,list,status,estimate}.
src/discover/  wayback (CDX API) + list sources turn a DiscoverSpec into audit targets.
               Pluggable fetch for tests, status-200-only filter on by default.
src/reports/   deterministic aggregation (leaderboard / tierBuckets / compareRuns /
               longitudinal) + markdown templates + AI SDK tool surface so a
               browser-side agent can interrogate jobs without re-implementing
               aggregation. CLI: bad reports generate --job <id> --template <name>.
The contract: every number in any report flows through pure aggregate.ts functions —
never an LLM. Optional narrate() prepends an exec summary, same pattern as the
audit-patches layer (agent narrates, code computes).
+55 tests across jobs-store, jobs-queue, jobs-cost-estimate, discover-wayback,
reports-aggregate, reports-templates, reports-tools. Total: 1448 passing.
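The "agent narrates, code computes" contract can be sketched roughly as follows. This is an illustration only: the `AggregateRow` fields, the `renderLeaderboard` helper, and the `Brain` callback type are assumptions, not the shapes in `src/reports/`.

```typescript
// Hypothetical minimal row; the real AggregateRow likely carries more fields.
interface AggregateRow {
  url: string;
  score: number;
}

// Pure and deterministic: sort by score descending, break ties by URL.
function leaderboard(rows: readonly AggregateRow[], top: number): AggregateRow[] {
  return [...rows]
    .sort((a, b) => b.score - a.score || (a.url < b.url ? -1 : a.url > b.url ? 1 : 0))
    .slice(0, top);
}

// Deterministic markdown: every number is formatted from the rows.
function renderLeaderboard(rows: readonly AggregateRow[]): string {
  const lines = rows.map((r, i) => `${i + 1}. ${r.url} (${r.score.toFixed(1)})`);
  return ["# Leaderboard", "", ...lines].join("\n");
}

// Stand-in for an LLM call that writes a summary of an already-rendered body.
type Brain = (body: string) => Promise<string>;

// Optional narration: with no brain the body is returned unchanged;
// with one, the summary is prepended and the body is left untouched.
async function narrate(body: string, brain?: Brain): Promise<string> {
  if (!brain) return body;
  const summary = await brain(body);
  return `${summary}\n\n${body}`;
}
```

Because the narrated text is only ever prepended, the numeric body of a report is byte-identical with or without narration.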
This was referenced Apr 26, 2026
Summary
Three new modules that turn the design-audit pipeline into a content-engine surface: declaratively audit N URLs (optionally fanned out into M historical wayback snapshots each), aggregate the results, and emit markdown leaderboards / longitudinal evolutions / batch comparisons. Same data is exposed as AI SDK tools so a browser-side chat agent can answer ad-hoc questions without re-implementing aggregation.
```
JobSpec (JSON)
↓ discoverTargets() (list | wayback CDX)
Job (queued, persisted)
↓ runJob({ auditFn }) (bounded concurrency, crash-safe)
audit-results/jobs// (per-target report.json)
↓ aggregateJob() → AggregateRow[]
markdown templates ─OR─ buildReportTools() (7 AI SDK tools)
```
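The `runJob` step in the diagram combines bounded concurrency with persisting each result as it arrives. A minimal sketch of that pattern, assuming an injected `persist` callback in place of the real JSONL store (the names `runBounded`, `TargetRecord`, and `persist` are hypothetical):

```typescript
type AuditFn<R> = (target: string) => Promise<R>;

// One record per target, written whether the audit succeeded or failed.
interface TargetRecord<R> {
  target: string;
  ok: boolean;
  result?: R;
  error?: string;
}

async function runBounded<R>(
  targets: readonly string[],
  auditFn: AuditFn<R>,
  concurrency: number,
  persist: (rec: TargetRecord<R>) => Promise<void>,
): Promise<void> {
  let next = 0; // shared cursor; safe because JS runs this code single-threaded

  async function worker(): Promise<void> {
    while (next < targets.length) {
      const target = targets[next++];
      let rec: TargetRecord<R>;
      try {
        rec = { target, ok: true, result: await auditFn(target) };
      } catch (err) {
        rec = { target, ok: false, error: String(err) };
      }
      // Persist immediately, so a kill mid-run leaves every finished
      // target on disk instead of losing the whole batch.
      await persist(rec);
    }
  }

  // Never start more workers than there are targets, and at least one.
  const workerCount = Math.max(1, Math.min(concurrency, targets.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
}
```

A worker pool pulling from a shared cursor keeps at most `concurrency` audits in flight without needing a queue library, and a failed audit is recorded rather than aborting the remaining targets.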
What's in it
`src/jobs/` — `JobSpec` + `createJob` + `runJob` + cost-estimate. JSONL store at `~/.bad/jobs/`. Bounded-concurrency runner persists every result as it lands so a kill mid-run leaves a coherent partial record on disk.
`src/discover/` — `wayback` source uses archive.org's CDX API, samples `count` evenly across the time range, filters to HTTP 200. `list` source is a pass-through. Pluggable fetch for tests.
`src/reports/` — `aggregateJob` reads each per-target `report.json` and projects to `AggregateRow`. Pure functions: `leaderboard`, `tierBuckets`, `compareRuns`, `longitudinalFor`. Templates render deterministic markdown. `narrateReport` optionally prepends an LLM exec summary — without a brain it returns the deterministic body unchanged.
`buildReportTools()` — 7 AI SDK tools (`queryJob`, `fetchAudit`, `compareRuns`, `longitudinal`, `tierBuckets`, `renderTemplate`, `runFreshAudit`). The contract matches the audit-patches layer: every number comes from a tool result, the agent narrates around it. Plug into any `streamText({ tools })` call to get an interactive report agent.
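For the `wayback` source described above, one plausible reading of "samples `count` evenly across the time range, filters to HTTP 200" is: drop non-200 captures, place `count` evenly spaced target instants between the range bounds, and keep the nearest capture to each. The sketch below implements that reading; the `Snapshot` shape (with a pre-parsed `timeMs`) and the nearest-capture strategy are assumptions, and no CDX request is made.

```typescript
// Hypothetical, already-parsed capture record (not the raw CDX row format).
interface Snapshot {
  url: string;
  timeMs: number;     // capture time as epoch milliseconds
  statusCode: number; // HTTP status recorded for the capture
}

function sampleAcrossRange(
  snapshots: readonly Snapshot[],
  sinceMs: number,
  untilMs: number,
  count: number,
): Snapshot[] {
  // Status-200-only filter, then chronological order.
  const ok = snapshots
    .filter((s) => s.statusCode === 200)
    .sort((a, b) => a.timeMs - b.timeMs);
  if (count <= 0 || ok.length === 0) return [];

  const picked = new Set<Snapshot>(); // dedupes when captures are sparse
  for (let i = 0; i < count; i++) {
    // Evenly spaced target instants; a single sample aims at the midpoint.
    const t =
      count === 1
        ? (sinceMs + untilMs) / 2
        : sinceMs + (i * (untilMs - sinceMs)) / (count - 1);
    let best = ok[0];
    for (const s of ok) {
      if (Math.abs(s.timeMs - t) < Math.abs(best.timeMs - t)) best = s;
    }
    picked.add(best);
  }
  return Array.from(picked).sort((a, b) => a.timeMs - b.timeMs);
}
```

When captures are sparse the result can hold fewer than `count` snapshots, since the same capture may be nearest to several target instants.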
CLI
```
bad jobs create --spec <file.json> # mints + runs a job
bad jobs estimate --spec <file.json> # pre-flight cost
bad jobs status [--json]
bad jobs list [--json]
bad reports generate --job <id> --template <leaderboard|longitudinal|batch-comparison>
[--top N --by-type X --buckets 10,100,200 --narrate --context "YC W25 cohort" --out report.md]
```
Example spec:
```json
{
"kind": "comparative-audit",
"discover": {
"source": "wayback",
"urls": ["https://stripe.com/", "https://linear.app/"],
"snapshotsPerUrl": 6,
"since": "2012-01-01",
"until": "2026-01-01"
},
"audit": { "pages": 1 },
"concurrency": 3,
"maxCostUSD": 25,
"label": "Stripe vs Linear longitudinal"
}
```
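A spec like the one above could be checked before a job is created. The sketch below is an assumption-heavy illustration: the `JobSpec` interface is inferred only from the fields in the example, which fields are optional is a guess, and `validateJobSpec` is a hypothetical helper rather than part of `src/jobs/`.

```typescript
// Inferred from the example spec only; optionality is an assumption.
interface JobSpec {
  kind: "comparative-audit";
  discover: {
    source: "wayback" | "list";
    urls: string[];
    snapshotsPerUrl?: number;
    since?: string;
    until?: string;
  };
  audit?: { pages: number };
  concurrency?: number;
  maxCostUSD?: number;
  label?: string;
}

type ValidationResult =
  | { ok: true; spec: JobSpec }
  | { ok: false; errors: string[] };

function validateJobSpec(raw: unknown): ValidationResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["spec must be a JSON object"] };
  }
  const spec = raw as Record<string, unknown>;
  const errors: string[] = [];

  if (spec["kind"] !== "comparative-audit") {
    errors.push('kind must be "comparative-audit"');
  }

  const discover = spec["discover"] as Record<string, unknown> | null | undefined;
  if (typeof discover !== "object" || discover === null) {
    errors.push("discover is required");
  } else {
    if (discover["source"] !== "wayback" && discover["source"] !== "list") {
      errors.push('discover.source must be "wayback" or "list"');
    }
    const urls = discover["urls"];
    if (!Array.isArray(urls) || urls.length === 0 || !urls.every((u) => typeof u === "string")) {
      errors.push("discover.urls must be a non-empty array of strings");
    }
  }

  // Numeric limits are treated as optional but, if present, must be positive.
  for (const key of ["concurrency", "maxCostUSD"]) {
    const v = spec[key];
    if (v !== undefined && (typeof v !== "number" || !(v > 0))) {
      errors.push(`${key} must be a positive number`);
    }
  }

  return errors.length === 0
    ? { ok: true, spec: raw as JobSpec }
    : { ok: false, errors };
}
```

Collecting every error instead of stopping at the first gives a spec author one complete report per run.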
Tests
+55 across `jobs-store`, `jobs-queue`, `jobs-cost-estimate`, `discover-wayback`, `reports-aggregate`, `reports-templates`, `reports-tools`. Total: 1448 passing.
Test plan