feat(jobs+reports): comparative-audit jobs API + AI SDK report tool surface#83

Merged
drewstone merged 5 commits into main from feat/jobs-reports-content-engine
Apr 26, 2026
Conversation

@drewstone
Contributor

Summary

Three new modules that turn the design-audit pipeline into a content-engine surface: declaratively audit N URLs (optionally fanned out into M historical wayback snapshots each), aggregate the results, and emit markdown leaderboards / longitudinal evolutions / batch comparisons. Same data is exposed as AI SDK tools so a browser-side chat agent can answer ad-hoc questions without re-implementing aggregation.

```
JobSpec (JSON)
↓ discoverTargets() (list | wayback CDX)
Job (queued, persisted)
↓ runJob({ auditFn }) (bounded concurrency, crash-safe)
audit-results/jobs// (per-target report.json)
↓ aggregateJob() → AggregateRow[]
markdown templates ─OR─ buildReportTools() (7 AI SDK tools)
```

What's in it

`src/jobs/` — `JobSpec` + `createJob` + `runJob` + cost-estimate. JSONL store at `~/.bad/jobs/`. Bounded-concurrency runner persists every result as it lands so a kill mid-run leaves a coherent partial record on disk.
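The crash-safe runner idea can be sketched roughly like this (illustrative names, not the real `src/jobs/` API — `runBounded`, `persist`, and the `Target`/`Result` shapes are stand-ins): at most `limit` workers pull from a shared cursor, and each result is persisted the moment it lands rather than batched at the end.

```typescript
type Target = { url: string };
type Result = { url: string; ok: boolean };

async function runBounded(
  targets: Target[],
  limit: number,
  auditFn: (t: Target) => Promise<Result>,
  persist: (r: Result) => void, // e.g. append one JSONL line under ~/.bad/jobs/
): Promise<Result[]> {
  const results: Result[] = [];
  let next = 0; // shared cursor; safe because JS workers only interleave at await points
  async function worker() {
    while (next < targets.length) {
      const target = targets[next++];
      const result = await auditFn(target);
      persist(result); // written immediately — a kill mid-run leaves a coherent partial record
      results.push(result);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, targets.length) }, worker),
  );
  return results;
}
```

Persisting inside the loop, not after `Promise.all`, is what makes the partial record on disk coherent after a crash.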

`src/discover/` — `wayback` source uses archive.org's CDX API, samples `count` evenly across the time range, filters to HTTP 200. `list` source is a pass-through. Pluggable fetch for tests.
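Sampling `count` snapshots evenly across the time range might look like the following sketch (not the actual `src/discover/` code — assume the CDX rows are already sorted chronologically and filtered to HTTP 200):

```typescript
// Pick `count` items spread evenly across a sorted list, always
// including the first and last snapshot when count > 1.
function sampleEvenly<T>(snapshots: T[], count: number): T[] {
  if (count >= snapshots.length) return [...snapshots];
  if (count <= 0) return [];
  const picked: T[] = [];
  for (let i = 0; i < count; i++) {
    // Map i in [0, count-1] onto an index in [0, length-1].
    const idx = Math.round((i * (snapshots.length - 1)) / (count - 1 || 1));
    picked.push(snapshots[idx]);
  }
  return picked;
}
```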

`src/reports/` — `aggregateJob` reads each per-target `report.json` and projects to `AggregateRow`. Pure functions: `leaderboard`, `tierBuckets`, `compareRuns`, `longitudinalFor`. Templates render deterministic markdown. `narrateReport` optionally prepends an LLM exec summary — without a brain it returns the deterministic body unchanged.
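A pure `leaderboard`-style function over `AggregateRow`-like data could be sketched as follows (the `Row` fields here are illustrative, not the real `src/reports/` types): sort descending by score with a deterministic tie-break, take the top N, and attach ranks.

```typescript
type Row = { url: string; score: number };

// Deterministic: never mutates input, ties broken lexicographically by url
// so repeated renders of the same job produce identical markdown.
function leaderboard(rows: Row[], top: number): Array<Row & { rank: number }> {
  return [...rows]
    .sort((a, b) => b.score - a.score || a.url.localeCompare(b.url))
    .slice(0, top)
    .map((row, i) => ({ ...row, rank: i + 1 }));
}
```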

`buildReportTools()` — 7 AI SDK tools (`queryJob`, `fetchAudit`, `compareRuns`, `longitudinal`, `tierBuckets`, `renderTemplate`, `runFreshAudit`). The contract matches the audit-patches layer: every number comes from a tool result, the agent narrates around it. Plug into any `streamText({ tools })` call to get an interactive report agent.
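The tool contract can be illustrated with a hand-rolled shape (this is not the AI SDK's `tool()` helper or the real `buildReportTools` signature — names and shapes here are stand-ins): each tool is a described async function, and every number the agent reports comes out of an `execute()` result, never the model.

```typescript
type ReportTool<In, Out> = {
  description: string;
  execute: (input: In) => Promise<Out>;
};

function buildExampleTools(rows: Array<{ url: string; score: number }>) {
  const queryJob: ReportTool<{ minScore: number }, { matches: number }> = {
    description: 'Count audited targets at or above a score threshold',
    execute: async ({ minScore }) => ({
      // The agent narrates around this number; it never computes it itself.
      matches: rows.filter((r) => r.score >= minScore).length,
    }),
  };
  return { queryJob };
}
```

In the real surface these would presumably be wrapped with the AI SDK's schema-carrying `tool()` helper before being handed to `streamText({ tools })`.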

CLI

```
bad jobs create --spec <file.json> # mints + runs a job
bad jobs estimate --spec <file.json> # pre-flight cost
bad jobs status [--json]
bad jobs list [--json]

bad reports generate --job <id> --template <leaderboard|longitudinal|batch-comparison>
[--top N --by-type X --buckets 10,100,200 --narrate --context "YC W25 cohort" --out report.md]
```

Example spec:

```json
{
  "kind": "comparative-audit",
  "discover": {
    "source": "wayback",
    "urls": ["https://stripe.com/", "https://linear.app/"],
    "snapshotsPerUrl": 6,
    "since": "2012-01-01",
    "until": "2026-01-01"
  },
  "audit": { "pages": 1 },
  "concurrency": 3,
  "maxCostUSD": 25,
  "label": "Stripe vs Linear longitudinal"
}
```

Tests

+55 tests across `jobs-store`, `jobs-queue`, `jobs-cost-estimate`, `discover-wayback`, `reports-aggregate`, `reports-templates`, `reports-tools`. Total: 1448 passing.

Test plan

  • `pnpm lint` — clean
  • `pnpm check:boundaries` — clean (151 files)
  • `pnpm test` — 1448/1448 passing
  • CLI smoke: `bad jobs estimate --spec /tmp/test-job.json` → correct cost output
  • CLI smoke: `bad jobs list` → "No jobs yet" when empty
  • Tier1/Tier2 gates green in CI
  • Real end-to-end run: `bad jobs create --spec` against ~3 URLs, verify `bad reports generate` produces a markdown leaderboard

…ture

src/design/audit/v2/types.ts — comprehensive TypeScript interfaces covering
all 8 layers from the RFC (docs/rfc/design-audit-world-class.md):

  Layer 1 — DimensionScore, RollupScore, ClassifierSignal, EnsembleClassification, DomHeuristics
  Layer 2 — Patch, PatchTarget, PatchDiff, PatchTest, PatchRollback, DesignFinding (extended with id, dimension, patches, kind)
  Layer 3 — NovelPatternObservation
  Layer 4 — PatchApplication, PatchReliability
  Layer 5 — Pattern, PatternScaffold, PatternFleetEvidence, PatternQuery, PatternMatch
  Layer 6 — AppliesWhen extended (audience, modality, regulatoryContext, audienceVulnerability), tag enums
  Layer 7 — EthicsRule, EthicsDetector, EthicsViolation, EthicsCategory
  Layer 8 — Modality, ModalityInput, SurfaceMeasurements, SurfaceRecord, Evidence, ModalityAdapter
  Top-level — AuditResult_v2, AuditRuntimeHints

Phase 0 is the stable contract that lets Wave 1 + Wave 2 implementation work
proceed in parallel without diverging interfaces. Editing this file mid-build
is a coordinated change; layers must update in lockstep.

Invariants enforced:
  - Every score is DimensionScore with range + confidence
  - Every major/critical finding MUST have >=1 Patch
  - Every patch has both target (what changes) and testThatProves (how we verify)
  - Every classification carries explicit ensembleConfidence + signalsAgreed
  - Pattern, ethics, modality types compose via shared AppliesWhen
…Layer 8 scaffold

Full implementation of RFC-002: World-Class Design Audit. Primary consumer is coding
agents (Claude Code, Codex, OpenCode, Pi); architecture is JSON-first, tool-callable,
and self-explaining when uncertain.

Layer 1 — Multi-dimensional scoring: ensemble classifier (URL + DOM heuristic + LLM
tiebreaker), 5 universal dimensions, per-page-type rollup weights and calibration
anchors, AuditResult_v2 shape.

Layer 2 — Patch primitives: every major/critical finding ships patches[] with
target, diff.before/after, testThatProves, rollback, estimatedDelta. Severity
enforcement downgrades major/critical without valid patches to minor.
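The severity-enforcement rule above can be sketched like this (illustrative `Finding` shape, not the real Layer 2 types): a major/critical finding that ships no patches is downgraded to minor rather than emitted unpatchable.

```typescript
type Severity = 'minor' | 'major' | 'critical';
type Finding = { id: string; severity: Severity; patches: unknown[] };

// Enforce "every major/critical finding MUST have >=1 patch" by
// downgrading violators instead of dropping them.
function enforcePatchRequirement(findings: Finding[]): Finding[] {
  return findings.map((f) =>
    (f.severity === 'major' || f.severity === 'critical') && f.patches.length === 0
      ? { ...f, severity: 'minor' as const }
      : f,
  );
}
```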

Layer 3 — First-principles fallback: fires when ensembleConfidence < 0.6 or signals
disagree; scores against 5 universal product principles only; emits
NovelPatternObservation to ~/.bad/novel-patterns/.
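The fallback trigger described above reduces to a small predicate (a sketch with illustrative field names, not the real classifier types):

```typescript
type Classification = { ensembleConfidence: number; signalsAgreed: boolean };

// Fall back to first-principles scoring when the ensemble is
// unconfident or its signals disagree on the page type.
function shouldFallBack(c: Classification): boolean {
  return c.ensembleConfidence < 0.6 || !c.signalsAgreed;
}
```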

Layer 4 — Outcome attribution: append-only JSONL store, ack-patch + --post-patch
close-the-loop, patchHash cross-tenant grouping, aggregatePatchReliability.
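The cross-tenant grouping step might be sketched like this (illustrative `Outcome` shape and function name, not the real Layer 4 store): fold the append-only outcome records by `patchHash` and compute a per-patch success rate.

```typescript
type Outcome = { patchHash: string; succeeded: boolean };

function aggregatePatchReliabilitySketch(
  outcomes: Outcome[],
): Map<string, { applications: number; successRate: number }> {
  const byHash = new Map<string, { applications: number; successes: number }>();
  for (const o of outcomes) {
    const row = byHash.get(o.patchHash) ?? { applications: 0, successes: 0 };
    row.applications++;
    if (o.succeeded) row.successes++;
    byHash.set(o.patchHash, row);
  }
  // Project counts into a rate per patchHash.
  return new Map(
    [...byHash].map(([hash, r]) => [
      hash,
      { applications: r.applications, successRate: r.successes / r.applications },
    ]),
  );
}
```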

Layer 5 — Pattern library (scaffold): types/store/mine/match + CLI query/show. Cold
start until ~6 weeks fleet data; mine threshold N≥30, ≥5 tenants, rate≥0.7.
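The promotion gate those thresholds describe is a simple predicate (sketch only; the real mining code surely does more):

```typescript
type Candidate = { applications: number; tenants: number; successRate: number };

// A candidate pattern is mined into the library only with N>=30
// applications across >=5 tenants at a success rate >=0.7.
function meetsMiningThreshold(c: Candidate): boolean {
  return c.applications >= 30 && c.tenants >= 5 && c.successRate >= 0.7;
}
```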

Layer 6 — Composable predicates: AppliesWhen extended with audience/modality/
regulatoryContext/audienceVulnerability; 9 new rubric fragments; loader matches on
context flags --audience/--modality/--regulatory/--audience-vulnerability.

Layer 7 — Domain ethics gate: 4 rule files (medical/kids/finance/legal) with
citation-backed rules; hard rollup floor critical→4, major→6; preEthicsScore
preserved; --skip-ethics bypass (test-only, logged).
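The hard rollup floor might work like the following sketch (an assumption: this reads "critical→4, major→6" as caps on a 0-10 rollup score, with the pre-cap value kept as `preEthicsScore`; shapes and names are illustrative):

```typescript
type ViolationSeverity = 'critical' | 'major';

function applyEthicsFloor(
  rollupScore: number,
  violations: ViolationSeverity[],
): { score: number; preEthicsScore: number } {
  let cap = Infinity;
  if (violations.includes('major')) cap = 6;
  if (violations.includes('critical')) cap = 4; // critical wins over major
  // preEthicsScore preserves what the rollup would have been without the gate.
  return { score: Math.min(rollupScore, cap), preEthicsScore: rollupScore };
}
```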

Layer 8 — Modality adapters (scaffold): HTML adapter wraps existing Playwright
pipeline; iOS/Android throw NotImplementedError; --modality dispatch.

+40 new tests across patch-parse, patch-validate, first-principles, attribution.
Total: 1393 passing.
… blob, health domain in medical rules

Three fixes discovered during smoke testing:

1. ethics/rules/*.yaml were not being copied to dist/ at build time — copy-static-assets.mjs
   only copied rubric fragments and anchors. Added ethics rules entry so the gate
   actually loads its rules at runtime.

2. pageTextBlob included the request URL in the content blob, causing false negatives
   on pattern-absent rules: a URL like medical-no-dosage.html contains "dosage" and
   suppressed the dosage-warning-required rule. URL is now excluded from the blob;
   URL-based classification uses the ensemble classifier's own URL heuristic.
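An illustrative reconstruction of that fix (not the actual code — `buildTextBlob` and `patternAbsent` are stand-in names): the blob is built from page content only, so a pattern-absent rule is not suppressed by the word appearing in the URL.

```typescript
function buildTextBlob(page: { url: string; title: string; bodyText: string }): string {
  // Deliberately exclude page.url: URL-based signals belong to the
  // ensemble classifier's URL heuristic, not to content matching.
  return [page.title, page.bodyText].join('\n').toLowerCase();
}

// True when the required pattern is missing from page content —
// the condition pattern-absent ethics rules fire on.
function patternAbsent(blob: string, pattern: string): boolean {
  return !blob.includes(pattern.toLowerCase());
}
```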

3. Medical ethics rules matched domain: [medical, clinical, pharmacy] but the LLM
   classifies pharmacy-style ecommerce pages as domain "health". Added "health" to the
   domain list so the rules apply correctly.
…try rollup tests

PR #79 added two telemetry-rollup-remote tests that spawn `node --experimental-strip-types
ROLLUP_PATH`. That flag is fully supported only on Node 22+; on Node 18 and 20 (both in
our CI matrix), Node exits 9 (invalid argument) before the rollup script runs, so the
tests assert exit 2 but get exit 9.

Replace with `tsx` (added as a devDependency) which works identically across all Node
versions. The behavior under test is unchanged: rollup --remote without
BAD_TELEMETRY_API / BAD_TELEMETRY_ADMIN_BEARER must exit 2 with a clear stderr message.
…urface

Three new modules layered on top of the existing audit pipeline:

src/jobs/         declarative comparative-audit jobs (JobSpec → discover → fan-out → persist).
                  Crash-safe JSONL store at ~/.bad/jobs/, bounded-concurrency runner,
                  pre-flight cost estimate. CLI: bad jobs {create,list,status,estimate}.

src/discover/     wayback (CDX API) + list sources turn a DiscoverSpec into audit targets.
                  Pluggable fetch for tests, status-200-only filter on by default.

src/reports/      deterministic aggregation (leaderboard / tierBuckets / compareRuns /
                  longitudinal) + markdown templates + AI SDK tool surface so a
                  browser-side agent can interrogate jobs without re-implementing
                  aggregation. CLI: bad reports generate --job <id> --template <name>.

The contract: every number in any report flows through pure aggregate.ts functions —
never an LLM. Optional narrate() prepends an exec summary, same pattern as the
audit-patches layer (agent narrates, code computes).

+55 tests across jobs-store, jobs-queue, jobs-cost-estimate, discover-wayback,
reports-aggregate, reports-templates, reports-tools. Total: 1448 passing.
@drewstone drewstone merged commit aec48b5 into main Apr 26, 2026
5 checks passed