feat(jobs+reports): comparative-audit jobs API + AI SDK report tool surface#83

Merged
drewstone merged 5 commits into main from feat/jobs-reports-content-engine
Apr 26, 2026
Conversation

@drewstone
Contributor

Summary

Three new modules that turn the design-audit pipeline into a content-engine surface: declaratively audit N URLs (optionally fanned out into M historical wayback snapshots each), aggregate the results, and emit markdown leaderboards / longitudinal evolutions / batch comparisons. Same data is exposed as AI SDK tools so a browser-side chat agent can answer ad-hoc questions without re-implementing aggregation.

```
JobSpec (JSON)
↓ discoverTargets() (list | wayback CDX)
Job (queued, persisted)
↓ runJob({ auditFn }) (bounded concurrency, crash-safe)
audit-results/jobs// (per-target report.json)
↓ aggregateJob() → AggregateRow[]
markdown templates ─OR─ buildReportTools() (7 AI SDK tools)
```

What's in it

`src/jobs/` — `JobSpec` + `createJob` + `runJob` + cost-estimate. JSONL store at `~/.bad/jobs/`. Bounded-concurrency runner persists every result as it lands so a kill mid-run leaves a coherent partial record on disk.
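The crash-safe runner idea can be sketched roughly like this (illustrative names, not the real `src/jobs/` API — `runBounded`, `persist`, and the `Target`/`Result` shapes are stand-ins): at most `limit` workers pull from a shared cursor, and each result is persisted the moment it lands rather than batched at the end.

```typescript
type Target = { url: string };
type Result = { url: string; ok: boolean };

async function runBounded(
  targets: Target[],
  limit: number,
  auditFn: (t: Target) => Promise<Result>,
  persist: (r: Result) => void, // e.g. append one JSONL line under ~/.bad/jobs/
): Promise<Result[]> {
  const results: Result[] = [];
  let next = 0; // shared cursor; safe because JS workers only interleave at await points
  async function worker() {
    while (next < targets.length) {
      const target = targets[next++];
      const result = await auditFn(target);
      persist(result); // written immediately — a kill mid-run leaves a coherent partial record
      results.push(result);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, targets.length) }, worker),
  );
  return results;
}
```

Persisting inside the loop, not after `Promise.all`, is what makes the partial record on disk coherent after a crash.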

`src/discover/` — `wayback` source uses archive.org's CDX API, samples `count` evenly across the time range, filters to HTTP 200. `list` source is a pass-through. Pluggable fetch for tests.
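Sampling `count` snapshots evenly across the time range might look like the following sketch (not the actual `src/discover/` code — assume the CDX rows are already sorted chronologically and filtered to HTTP 200):

```typescript
// Pick `count` items spread evenly across a sorted list, always
// including the first and last snapshot when count > 1.
function sampleEvenly<T>(snapshots: T[], count: number): T[] {
  if (count >= snapshots.length) return [...snapshots];
  if (count <= 0) return [];
  const picked: T[] = [];
  for (let i = 0; i < count; i++) {
    // Map i in [0, count-1] onto an index in [0, length-1].
    const idx = Math.round((i * (snapshots.length - 1)) / (count - 1 || 1));
    picked.push(snapshots[idx]);
  }
  return picked;
}
```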

`src/reports/` — `aggregateJob` reads each per-target `report.json` and projects to `AggregateRow`. Pure functions: `leaderboard`, `tierBuckets`, `compareRuns`, `longitudinalFor`. Templates render deterministic markdown. `narrateReport` optionally prepends an LLM exec summary — without a brain it returns the deterministic body unchanged.
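A pure `leaderboard`-style function over `AggregateRow`-like data could be sketched as follows (the `Row` fields here are illustrative, not the real `src/reports/` types): sort descending by score with a deterministic tie-break, take the top N, and attach ranks.

```typescript
type Row = { url: string; score: number };

// Deterministic: never mutates input, ties broken lexicographically by url
// so repeated renders of the same job produce identical markdown.
function leaderboard(rows: Row[], top: number): Array<Row & { rank: number }> {
  return [...rows]
    .sort((a, b) => b.score - a.score || a.url.localeCompare(b.url))
    .slice(0, top)
    .map((row, i) => ({ ...row, rank: i + 1 }));
}
```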

`buildReportTools()` — 7 AI SDK tools (`queryJob`, `fetchAudit`, `compareRuns`, `longitudinal`, `tierBuckets`, `renderTemplate`, `runFreshAudit`). The contract matches the audit-patches layer: every number comes from a tool result, the agent narrates around it. Plug into any `streamText({ tools })` call to get an interactive report agent.
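The tool contract can be illustrated with a hand-rolled shape (this is not the AI SDK's `tool()` helper or the real `buildReportTools` signature — names and shapes here are stand-ins): each tool is a described async function, and every number the agent reports comes out of an `execute()` result, never the model.

```typescript
type ReportTool<In, Out> = {
  description: string;
  execute: (input: In) => Promise<Out>;
};

function buildExampleTools(rows: Array<{ url: string; score: number }>) {
  const queryJob: ReportTool<{ minScore: number }, { matches: number }> = {
    description: 'Count audited targets at or above a score threshold',
    execute: async ({ minScore }) => ({
      // The agent narrates around this number; it never computes it itself.
      matches: rows.filter((r) => r.score >= minScore).length,
    }),
  };
  return { queryJob };
}
```

In the real surface these would presumably be wrapped with the AI SDK's schema-carrying `tool()` helper before being handed to `streamText({ tools })`.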

CLI

```
bad jobs create --spec <file.json> # mints + runs a job
bad jobs estimate --spec <file.json> # pre-flight cost
bad jobs status [--json]
bad jobs list [--json]

bad reports generate --job <id> --template <leaderboard|longitudinal|batch-comparison>
[--top N --by-type X --buckets 10,100,200 --narrate --context "YC W25 cohort" --out report.md]
```

Example spec:

```json
{
  "kind": "comparative-audit",
  "discover": {
    "source": "wayback",
    "urls": ["https://stripe.com/", "https://linear.app/"],
    "snapshotsPerUrl": 6,
    "since": "2012-01-01",
    "until": "2026-01-01"
  },
  "audit": { "pages": 1 },
  "concurrency": 3,
  "maxCostUSD": 25,
  "label": "Stripe vs Linear longitudinal"
}
```

Tests

+55 tests across `jobs-store`, `jobs-queue`, `jobs-cost-estimate`, `discover-wayback`, `reports-aggregate`, `reports-templates`, `reports-tools`. Total: 1448 passing.

Test plan

  • `pnpm lint` — clean
  • `pnpm check:boundaries` — clean (151 files)
  • `pnpm test` — 1448/1448 passing
  • CLI smoke: `bad jobs estimate --spec /tmp/test-job.json` → correct cost output
  • CLI smoke: `bad jobs list` → "No jobs yet" when empty
  • Tier1/Tier2 gates green in CI
  • Real end-to-end run: `bad jobs create --spec` against ~3 URLs, verify `bad reports generate` produces a markdown leaderboard

…ture

src/design/audit/v2/types.ts — comprehensive TypeScript interfaces covering
all 8 layers from the RFC (docs/rfc/design-audit-world-class.md):

  Layer 1 — DimensionScore, RollupScore, ClassifierSignal, EnsembleClassification, DomHeuristics
  Layer 2 — Patch, PatchTarget, PatchDiff, PatchTest, PatchRollback, DesignFinding (extended with id, dimension, patches, kind)
  Layer 3 — NovelPatternObservation
  Layer 4 — PatchApplication, PatchReliability
  Layer 5 — Pattern, PatternScaffold, PatternFleetEvidence, PatternQuery, PatternMatch
  Layer 6 — AppliesWhen extended (audience, modality, regulatoryContext, audienceVulnerability), tag enums
  Layer 7 — EthicsRule, EthicsDetector, EthicsViolation, EthicsCategory
  Layer 8 — Modality, ModalityInput, SurfaceMeasurements, SurfaceRecord, Evidence, ModalityAdapter
  Top-level — AuditResult_v2, AuditRuntimeHints

Phase 0 is the stable contract that lets Wave 1 + Wave 2 implementation work
proceed in parallel without diverging interfaces. Editing this file mid-build
is a coordinated change; layers must update in lockstep.

Invariants enforced:
  - Every score is DimensionScore with range + confidence
  - Every major/critical finding MUST have >=1 Patch
  - Every patch has both target (what changes) and testThatProves (how we verify)
  - Every classification carries explicit ensembleConfidence + signalsAgreed
  - Pattern, ethics, modality types compose via shared AppliesWhen
…Layer 8 scaffold

Full implementation of RFC-002: World-Class Design Audit. Primary consumer is coding
agents (Claude Code, Codex, OpenCode, Pi); architecture is JSON-first, tool-callable,
and self-explaining when uncertain.

Layer 1 — Multi-dimensional scoring: ensemble classifier (URL + DOM heuristic + LLM
tiebreaker), 5 universal dimensions, per-page-type rollup weights and calibration
anchors, AuditResult_v2 shape.

Layer 2 — Patch primitives: every major/critical finding ships patches[] with
target, diff.before/after, testThatProves, rollback, estimatedDelta. Severity
enforcement downgrades major/critical without valid patches to minor.
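The severity-enforcement rule above can be sketched like this (illustrative `Finding` shape, not the real Layer 2 types): a major/critical finding that ships no patches is downgraded to minor rather than emitted unpatchable.

```typescript
type Severity = 'minor' | 'major' | 'critical';
type Finding = { id: string; severity: Severity; patches: unknown[] };

// Enforce "every major/critical finding MUST have >=1 patch" by
// downgrading violators instead of dropping them.
function enforcePatchRequirement(findings: Finding[]): Finding[] {
  return findings.map((f) =>
    (f.severity === 'major' || f.severity === 'critical') && f.patches.length === 0
      ? { ...f, severity: 'minor' as const }
      : f,
  );
}
```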

Layer 3 — First-principles fallback: fires when ensembleConfidence < 0.6 or signals
disagree; scores against 5 universal product principles only; emits
NovelPatternObservation to ~/.bad/novel-patterns/.
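The fallback trigger described above reduces to a small predicate (a sketch with illustrative field names, not the real classifier types):

```typescript
type Classification = { ensembleConfidence: number; signalsAgreed: boolean };

// Fall back to first-principles scoring when the ensemble is
// unconfident or its signals disagree on the page type.
function shouldFallBack(c: Classification): boolean {
  return c.ensembleConfidence < 0.6 || !c.signalsAgreed;
}
```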

Layer 4 — Outcome attribution: append-only JSONL store, ack-patch + --post-patch
close-the-loop, patchHash cross-tenant grouping, aggregatePatchReliability.
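The cross-tenant grouping step might be sketched like this (illustrative `Outcome` shape and function name, not the real Layer 4 store): fold the append-only outcome records by `patchHash` and compute a per-patch success rate.

```typescript
type Outcome = { patchHash: string; succeeded: boolean };

function aggregatePatchReliabilitySketch(
  outcomes: Outcome[],
): Map<string, { applications: number; successRate: number }> {
  const byHash = new Map<string, { applications: number; successes: number }>();
  for (const o of outcomes) {
    const row = byHash.get(o.patchHash) ?? { applications: 0, successes: 0 };
    row.applications++;
    if (o.succeeded) row.successes++;
    byHash.set(o.patchHash, row);
  }
  // Project counts into a rate per patchHash.
  return new Map(
    [...byHash].map(([hash, r]) => [
      hash,
      { applications: r.applications, successRate: r.successes / r.applications },
    ]),
  );
}
```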

Layer 5 — Pattern library (scaffold): types/store/mine/match + CLI query/show. Cold
start until ~6 weeks fleet data; mine threshold N≥30, ≥5 tenants, rate≥0.7.
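The promotion gate those thresholds describe is a simple predicate (sketch only; the real mining code surely does more):

```typescript
type Candidate = { applications: number; tenants: number; successRate: number };

// A candidate pattern is mined into the library only with N>=30
// applications across >=5 tenants at a success rate >=0.7.
function meetsMiningThreshold(c: Candidate): boolean {
  return c.applications >= 30 && c.tenants >= 5 && c.successRate >= 0.7;
}
```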

Layer 6 — Composable predicates: AppliesWhen extended with audience/modality/
regulatoryContext/audienceVulnerability; 9 new rubric fragments; loader matches on
context flags --audience/--modality/--regulatory/--audience-vulnerability.

Layer 7 — Domain ethics gate: 4 rule files (medical/kids/finance/legal) with
citation-backed rules; hard rollup floor critical→4, major→6; preEthicsScore
preserved; --skip-ethics bypass (test-only, logged).
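The hard rollup floor might work like the following sketch (an assumption: this reads "critical→4, major→6" as caps on a 0-10 rollup score, with the pre-cap value kept as `preEthicsScore`; shapes and names are illustrative):

```typescript
type ViolationSeverity = 'critical' | 'major';

function applyEthicsFloor(
  rollupScore: number,
  violations: ViolationSeverity[],
): { score: number; preEthicsScore: number } {
  let cap = Infinity;
  if (violations.includes('major')) cap = 6;
  if (violations.includes('critical')) cap = 4; // critical wins over major
  // preEthicsScore preserves what the rollup would have been without the gate.
  return { score: Math.min(rollupScore, cap), preEthicsScore: rollupScore };
}
```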

Layer 8 — Modality adapters (scaffold): HTML adapter wraps existing Playwright
pipeline; iOS/Android throw NotImplementedError; --modality dispatch.

+40 new tests across patch-parse, patch-validate, first-principles, attribution.
Total: 1393 passing.
… blob, health domain in medical rules

Three fixes discovered during smoke testing:

1. ethics/rules/*.yaml were not being copied to dist/ at build time — copy-static-assets.mjs
   only copied rubric fragments and anchors. Added ethics rules entry so the gate
   actually loads its rules at runtime.

2. pageTextBlob included the request URL in the content blob, causing false negatives
   on pattern-absent rules: a URL like medical-no-dosage.html contains "dosage" and
   suppressed the dosage-warning-required rule. URL is now excluded from the blob;
   URL-based classification uses the ensemble classifier's own URL heuristic.
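An illustrative reconstruction of that fix (not the actual code — `buildTextBlob` and `patternAbsent` are stand-in names): the blob is built from page content only, so a pattern-absent rule is not suppressed by the word appearing in the URL.

```typescript
function buildTextBlob(page: { url: string; title: string; bodyText: string }): string {
  // Deliberately exclude page.url: URL-based signals belong to the
  // ensemble classifier's URL heuristic, not to content matching.
  return [page.title, page.bodyText].join('\n').toLowerCase();
}

// True when the required pattern is missing from page content —
// the condition pattern-absent ethics rules fire on.
function patternAbsent(blob: string, pattern: string): boolean {
  return !blob.includes(pattern.toLowerCase());
}
```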

3. Medical ethics rules matched domain: [medical, clinical, pharmacy] but the LLM
   classifies pharmacy-style ecommerce pages as domain "health". Added "health" to the
   domain list so the rules apply correctly.
…try rollup tests

PR #79 added two telemetry-rollup-remote tests that spawn `node --experimental-strip-types
ROLLUP_PATH`. That flag is fully supported only on Node 22+; on Node 18 and 20 (both in
our CI matrix), Node exits 9 (invalid argument) before the rollup script runs, so the
tests assert exit 2 but get exit 9.

Replace with `tsx` (added as a devDependency) which works identically across all Node
versions. The behavior under test is unchanged: rollup --remote without
BAD_TELEMETRY_API / BAD_TELEMETRY_ADMIN_BEARER must exit 2 with a clear stderr message.
…urface

Three new modules layered on top of the existing audit pipeline:

src/jobs/         declarative comparative-audit jobs (JobSpec → discover → fan-out → persist).
                  Crash-safe JSONL store at ~/.bad/jobs/, bounded-concurrency runner,
                  pre-flight cost estimate. CLI: bad jobs {create,list,status,estimate}.

src/discover/     wayback (CDX API) + list sources turn a DiscoverSpec into audit targets.
                  Pluggable fetch for tests, status-200-only filter on by default.

src/reports/      deterministic aggregation (leaderboard / tierBuckets / compareRuns /
                  longitudinal) + markdown templates + AI SDK tool surface so a
                  browser-side agent can interrogate jobs without re-implementing
                  aggregation. CLI: bad reports generate --job <id> --template <name>.

The contract: every number in any report flows through pure aggregate.ts functions —
never an LLM. Optional narrate() prepends an exec summary, same pattern as the
audit-patches layer (agent narrates, code computes).

+55 tests across jobs-store, jobs-queue, jobs-cost-estimate, discover-wayback,
reports-aggregate, reports-templates, reports-tools. Total: 1448 passing.
@drewstone drewstone merged commit aec48b5 into main Apr 26, 2026
5 checks passed