56 changes: 56 additions & 0 deletions .changeset/design-audit-8-layer-architecture.md
@@ -0,0 +1,56 @@
---
'@tangle-network/browser-agent-driver': minor
---

feat(design-audit): 8-layer architecture (Layers 1-4 and 6-7 shipped; Layers 5 and 8 scaffolded)

Full implementation of RFC-002: World-Class Design Audit. The primary consumers are coding agents (Claude Code, Codex, OpenCode, Pi); the architecture is JSON-first, tool-callable, and explains itself when uncertain.

**Layer 1 — Multi-dimensional scoring** _(shipped)_
- Ensemble classifier (URL pattern + DOM heuristic + LLM tiebreaker) with `ensembleConfidence`, `signalsAgreed`, `dissent`.
- Five universal dimensions: `product_intent / visual_craft / trust_clarity / workflow / content_ia`.
- Per-page-type rollup weights (saas-app, marketing, dashboard, docs, ecommerce, social, tool, blog, utility).
- Per-page-type calibration anchors (`rubric/anchors/*.yaml`) so app surfaces aren't judged against marketing-site polish.
- `AuditResult_v2` is emitted alongside the v1 shape; v1 is deprecated and will be removed after a one-release window.
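
The ensemble vote can be sketched roughly as follows. This is a minimal illustration, not the shipped classifier: the `Signal` shape, the `classifyEnsemble` signature, the 0.7 agreement cutoff passed as a default parameter, and the way confidences are combined are all assumptions made for the example.

```typescript
// Hypothetical types for illustration; the real module may differ.
type PageType = "saas-app" | "marketing" | "docs" | "unknown";

interface Signal {
  source: "url" | "dom" | "llm";
  pageType: PageType;
  confidence: number; // 0..1
}

interface EnsembleResult {
  pageType: PageType;
  ensembleConfidence: number;
  signalsAgreed: boolean;
  dissent: Signal[];
}

function classifyEnsemble(
  url: Signal,
  dom: Signal,
  llmTiebreaker: () => Signal,
  agreeThreshold = 0.7, // assumed cutoff for skipping the LLM call
): EnsembleResult {
  // Cheap path: URL and DOM agree with enough confidence, so no LLM call.
  const cheapConfidence = Math.min(url.confidence, dom.confidence);
  if (url.pageType === dom.pageType && cheapConfidence > agreeThreshold) {
    return {
      pageType: url.pageType,
      ensembleConfidence: cheapConfidence,
      signalsAgreed: true,
      dissent: [],
    };
  }
  // Otherwise the LLM breaks the tie; signals that disagree with it are recorded.
  const llm = llmTiebreaker();
  const all = [url, dom, llm];
  const agreeing = all.filter((s) => s.pageType === llm.pageType);
  const dissent = all.filter((s) => s.pageType !== llm.pageType);
  // Assumed combination rule: agreeing confidence averaged over all signals.
  const ensembleConfidence =
    agreeing.reduce((sum, s) => sum + s.confidence, 0) / all.length;
  return {
    pageType: llm.pageType,
    ensembleConfidence,
    signalsAgreed: dissent.length === 0,
    dissent,
  };
}
```

Passing the tiebreaker as a callback keeps the expensive call lazy, so the cheap path never pays for it.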

**Layer 2 — Patch primitives** _(shipped)_
- Every major/critical finding now ships `patches[]` with `target`, `diff.before`/`after`, `testThatProves`, `rollback`, `estimatedDelta`, and `estimatedDeltaConfidence`.
- `diff.before` is validated as a substring of the page snapshot at parse time — agents apply patches literally without re-authoring.
- Severity enforcement: findings without valid patches are downgraded from major/critical to minor.
- `patches/render.ts`: renders `unifiedDiff` from before/after when `target.filePath` is known (`git apply`-able).
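
A rough sketch of the substring check and the severity downgrade, assuming simplified `Patch` and `Finding` shapes (the real types carry more fields, and the helper names here are hypothetical):

```typescript
// Simplified, hypothetical shapes; only the fields needed for the check.
type Severity = "critical" | "major" | "minor";

interface Patch {
  diff: { before: string; after: string };
}

interface Finding {
  id: string;
  severity: Severity;
  patches: Patch[];
}

// A patch is usable only if its `before` text appears verbatim in the snapshot.
function isPatchValid(patch: Patch, snapshot: string): boolean {
  return patch.diff.before.length > 0 && snapshot.includes(patch.diff.before);
}

// Drop invalid patches; a major/critical finding left with none becomes minor.
function enforceSeverity(finding: Finding, snapshot: string): Finding {
  const patches = finding.patches.filter((p) => isPatchValid(p, snapshot));
  const needsPatch =
    finding.severity === "major" || finding.severity === "critical";
  const severity: Severity =
    needsPatch && patches.length === 0 ? "minor" : finding.severity;
  return { ...finding, patches, severity };
}
```

Validating against the snapshot is what lets an agent apply `diff.before` to `diff.after` as a literal replacement without re-reading the page.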

**Layer 3 — First-principles fallback** _(shipped)_
- Fires when `ensembleConfidence < 0.6`, signals disagree, or page type is `unknown`.
- Scores against 5 universal product principles only (primary-job clarity, action obviousness, state preview, trust-before-commitment, recovery-from-failure).
- Sets `rollup.confidence = 'low'`; emits `NovelPatternObservation` to `~/.bad/novel-patterns/` for fleet mining.
- New rubric fragment `first-principles.md` carries the exact prompt that fires in this mode.
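
The trigger condition reads naturally as a single predicate. The sketch below assumes a hypothetical `ClassificationSketch` shape and a `planScoring` helper; only the 0.6 threshold, the `unknown` page type, the five principles, and the `low` confidence come from the notes above.

```typescript
// Hypothetical shape; field names follow the ensemble output described above.
interface ClassificationSketch {
  pageType: string;
  ensembleConfidence: number;
  signalsAgreed: boolean;
}

const FIRST_PRINCIPLES = [
  "primary-job clarity",
  "action obviousness",
  "state preview",
  "trust-before-commitment",
  "recovery-from-failure",
] as const;

function shouldUseFirstPrinciples(c: ClassificationSketch): boolean {
  return (
    c.ensembleConfidence < 0.6 || !c.signalsAgreed || c.pageType === "unknown"
  );
}

type ScoringPlan =
  | { mode: "first-principles"; rollupConfidence: "low"; principles: readonly string[] }
  | { mode: "typed-rubric" }; // hypothetical label for the normal path

function planScoring(c: ClassificationSketch): ScoringPlan {
  return shouldUseFirstPrinciples(c)
    ? { mode: "first-principles", rollupConfidence: "low", principles: FIRST_PRINCIPLES }
    : { mode: "typed-rubric" };
}
```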

**Layer 4 — Outcome attribution** _(shipped)_
- `bad design-audit ack-patch <patchId> --pre-run-id <runId>` — records that an agent applied a patch.
- `bad design-audit --post-patch <patchId>` on re-audit — computes observed delta vs predicted, writes `agreementScore`.
- JSONL store at `~/.bad/attribution/applications/`. Append-only — outcomes are new events, not mutations.
- `aggregatePatchReliability()` cross-tenant rollup: groups applications by `patchHash = sha256(before+after+scope).slice(0,16)`. A patch is marked `recommendation: 'recommended'` once N≥30, tenants≥5, and replicationRate≥0.7.
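
A minimal sketch of the hash and the recommendation thresholds. It assumes that `before+after+scope` means plain string concatenation, that N counts recorded applications, and that `replicationRate` is the share of applications whose observed delta matched the prediction; the `ApplicationEvent` shape, `reliabilityFor`, and the `insufficient-evidence` label are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumes plain concatenation of the three strings before hashing.
function patchHash(before: string, after: string, scope: string): string {
  return createHash("sha256")
    .update(before + after + scope)
    .digest("hex")
    .slice(0, 16);
}

// Hypothetical event shape; one record per applied patch.
interface ApplicationEvent {
  patchHash: string;
  tenantId: string;
  replicated: boolean; // assumed: observed delta matched the prediction
}

function reliabilityFor(hash: string, events: ApplicationEvent[]) {
  const matching = events.filter((e) => e.patchHash === hash);
  const n = matching.length;
  const tenants = new Set(matching.map((e) => e.tenantId)).size;
  const replicationRate =
    n === 0 ? 0 : matching.filter((e) => e.replicated).length / n;
  const recommended = n >= 30 && tenants >= 5 && replicationRate >= 0.7;
  return {
    n,
    tenants,
    replicationRate,
    // "insufficient-evidence" is a placeholder label for this sketch.
    recommendation: recommended ? "recommended" : "insufficient-evidence",
  };
}
```

Because the store is append-only, a rollup like this can be recomputed from the event log at any time without migrating earlier records.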

**Layer 5 — Pattern library** _(scaffold)_
- `patterns/{store,mine,match}.ts` + `cli-patterns.ts` (`bad patterns query|show`).
- Cold-start: library is empty until ~6 weeks of attribution data accumulates. Mine threshold: N≥30, ≥5 tenants, replicationRate≥0.7. Mining impl is a TODO; the query API and types are stable.

**Layer 6 — Composable predicates** _(shipped)_
- `AppliesWhen` extended with `audience`, `modality`, `regulatoryContext`, `audienceVulnerability`.
- 9 new rubric fragments: `audience-{clinician,kids,developer}.md`, `regulatory-{hipaa,gdpr,coppa}.md`, `modality-{mobile,tablet}.md`, `audience-vulnerability-minor-facing.md`.
- Rubric loader matches new predicates when context provided via `--audience`, `--modality`, `--regulatory`, `--audience-vulnerability` CLI flags.
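
One plausible reading of predicate matching is sketched below: every predicate a fragment declares must be satisfied, and a predicate is satisfied when any of its tags appears in the audit context. That AND-across, OR-within rule, the array-valued fields, and the `fragmentApplies` helper are assumptions; only the four predicate names come from the notes above.

```typescript
// Field names follow the changeset; array-of-tags values are an assumption.
interface AppliesWhenSketch {
  audience?: string[];
  modality?: string[];
  regulatoryContext?: string[];
  audienceVulnerability?: string[];
}

const PREDICATE_KEYS = [
  "audience",
  "modality",
  "regulatoryContext",
  "audienceVulnerability",
] as const;

// Assumed semantics: AND across declared predicates, OR within each tag list.
function fragmentApplies(
  rule: AppliesWhenSketch,
  context: AppliesWhenSketch,
): boolean {
  return PREDICATE_KEYS.every((key) => {
    const wanted = rule[key];
    if (!wanted || wanted.length === 0) return true; // unconstrained predicate
    const have = context[key] ?? [];
    return wanted.some((tag) => have.includes(tag));
  });
}

// Example context: a clinician-facing tablet app under HIPAA.
const context: AppliesWhenSketch = {
  audience: ["clinician"],
  modality: ["tablet"],
  regulatoryContext: ["hipaa"],
};

const matched = [
  { file: "audience-clinician.md", when: { audience: ["clinician"] } },
  { file: "modality-mobile.md", when: { modality: ["mobile"] } },
  { file: "regulatory-hipaa.md", when: { regulatoryContext: ["hipaa"] } },
]
  .filter((f) => fragmentApplies(f.when, context))
  .map((f) => f.file);
```

Under these assumptions several fragments can apply to one page at once, which is the point of composing predicates instead of forcing a single classification.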

**Layer 7 — Domain ethics gate** _(shipped)_
- 4 rule files (medical, kids, finance, legal) with citation-backed rules (FDA 21 CFR 201.57, COPPA 16 CFR 312.5, TILA/Reg Z, GDPR).
- Hard rollup cap: a `critical-floor` violation caps the rollup at 4, a `major-floor` violation at 6. `preEthicsScore` preserves the LLM's uncapped score.
- `--skip-ethics` bypass (test-only, logged + warned), `--ethics-rules-dir` override.
- 8 pass/fail fixtures (4 pairs: finance, GDPR, kids, medical) in `bench/design/ethics-fixtures/`.
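
The cap itself is a small computation. A sketch, assuming a simplified violation shape and a hypothetical `applyEthicsGate` helper; the cap values 4 and 6 and the `preEthicsScore` field are taken from the notes above.

```typescript
type FloorKind = "critical-floor" | "major-floor";

// Cap values from the changeset: critical caps at 4, major at 6.
const CAP_BY_FLOOR: Record<FloorKind, number> = {
  "critical-floor": 4,
  "major-floor": 6,
};

// Simplified, hypothetical violation shape.
interface ViolationSketch {
  ruleId: string;
  floor: FloorKind;
}

interface GatedScore {
  score: number;
  preEthicsScore?: number; // set only when at least one violation fired
}

function applyEthicsGate(
  score: number,
  violations: ViolationSketch[],
): GatedScore {
  if (violations.length === 0) return { score };
  // The strictest (lowest) cap among all violations wins.
  const cap = Math.min(...violations.map((v) => CAP_BY_FLOOR[v.floor]));
  return { score: Math.min(score, cap), preEthicsScore: score };
}
```

A page scored 8 with one critical violation comes back as `{ score: 4, preEthicsScore: 8 }`, which is enough for a report to say what the score would have been without the violation.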

**Layer 8 — Modality adapters** _(scaffold)_
- `modality/{types,html,ios,android,index}.ts`. HTML adapter wraps existing Playwright pipeline. iOS and Android throw `NotImplementedError` with clear message. `--modality html|ios|android` dispatches to the right adapter.
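
A sketch of the dispatch shape, under the assumption that the stub adapters throw when used. The `ModalityAdapter` interface and the synchronous `describeCapture` method are stand-ins invented for the example; the real HTML adapter wraps the Playwright pipeline and is presumably asynchronous.

```typescript
type Modality = "html" | "ios" | "android";

class NotImplementedError extends Error {
  constructor(modality: Modality) {
    super(`Modality "${modality}" is not implemented yet; only "html" is supported.`);
    this.name = "NotImplementedError";
  }
}

// Hypothetical adapter interface; a synchronous stand-in for a real capture.
interface ModalityAdapter {
  readonly modality: Modality;
  describeCapture(target: string): string;
}

const htmlAdapter: ModalityAdapter = {
  modality: "html",
  describeCapture: (target) => `playwright:${target}`,
};

// Scaffold adapters exist so dispatch is total, but fail loudly when used.
function stubAdapter(modality: Modality): ModalityAdapter {
  return {
    modality,
    describeCapture: () => {
      throw new NotImplementedError(modality);
    },
  };
}

const ADAPTERS: Record<Modality, ModalityAdapter> = {
  html: htmlAdapter,
  ios: stubAdapter("ios"),
  android: stubAdapter("android"),
};

function adapterFor(modality: Modality): ModalityAdapter {
  return ADAPTERS[modality];
}
```

Typing the registry as `Record<Modality, ModalityAdapter>` means adding a new modality to the union fails to compile until an adapter, even a stub, is registered for it.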

**Skill contract updates:**
- `~/code/dotfiles/claude/skills/bad/SKILL.md`: patch consumption loop, Layer 3-8 contract, ack-patch / --post-patch close-the-loop, ethics floor priority rule.
- `skills/design-evolve/SKILL.md`: Phase 3 (apply fixes) now patch-first; Phase 4 includes attribution close-the-loop.

**Tests:** +40 new tests across `design-audit-patch-{parse,validate}`, `design-audit-first-principles`, `design-audit-attribution`. Total: 1393 passing.
19 changes: 19 additions & 0 deletions .changeset/design-audit-layer-1-foundation.md
@@ -0,0 +1,19 @@
---
'@tangle-network/browser-agent-driver': minor
---

feat(design-audit): Layer 1 — multi-dim scoring foundation

Land the first layer of the world-class 8-layer design-audit architecture (RFC `docs/rfc/design-audit-world-class.md`). This release ships:

- **Ensemble classifier** (`src/design/audit/classify-ensemble.ts`) — three-signal vote (URL pattern + DOM heuristic + LLM tiebreaker) with explicit `ensembleConfidence`, `signalsAgreed`, and `dissent` records. URL+DOM agreement above the 0.7 threshold skips the LLM call entirely.
- **Per-page-type rollup weights** (`src/design/audit/rubric/rollup-weights.ts`) — saas-app, marketing, dashboard, docs, ecommerce, social, tool, blog, utility, plus `default`/`unknown` fallbacks. Module-load invariant: every weight set sums to 1.0 ± 1e-6.
- **Per-page-type calibration anchors** (`src/design/audit/rubric/anchors/*.yaml`): 9 anchor files, each citing real products that exemplify a 9-10 score (Linear's app, Figma, Notion, Stripe, MDN, Apple Store, Threads, Stratechery, Vercel deploys, etc.), so saas-app surfaces are no longer judged against marketing-site polish.
- **Multi-dim scoring** (`src/design/audit/v2/score.ts`) — five universal dimensions (product_intent / visual_craft / trust_clarity / workflow / content_ia) each with `score`, `range`, `confidence`. Rollup is a weighted aggregate with conservative confidence (any dim `low` → rollup `low`).
- **`AuditResult_v2`** — emitted alongside the v1 shape in `report.json` under a top-level `v2` block. One-release deprecation window before v1 is removed.
- **`--audit-passes auto`** — new default that runs the ensemble classifier first, then picks the focused pass bundle for that classification.
- **CLI summary** — per-page console output now prints the 5-dimension breakdown plus rollup formula.
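
The rollup described above can be sketched as a weighted sum plus a "weakest confidence wins" rule. The weight values below are made up for illustration, the `medium` confidence level is an assumption, and the helper names are hypothetical; the five dimensions and the sum-to-1.0 invariant come from the list above.

```typescript
const DIMENSIONS = [
  "product_intent",
  "visual_craft",
  "trust_clarity",
  "workflow",
  "content_ia",
] as const;
type Dimension = (typeof DIMENSIONS)[number];
type Confidence = "low" | "medium" | "high"; // "medium" is assumed
type Weights = Record<Dimension, number>;

// Illustrative weights only; the shipped values live in rollup-weights.ts.
const WEIGHTS: Record<string, Weights> = {
  "saas-app": { product_intent: 0.25, visual_craft: 0.15, trust_clarity: 0.15, workflow: 0.3, content_ia: 0.15 },
  default: { product_intent: 0.2, visual_craft: 0.2, trust_clarity: 0.2, workflow: 0.2, content_ia: 0.2 },
};

// Module-load invariant: every weight set sums to 1.0 within 1e-6.
function assertSumsToOne(name: string, w: Weights, eps = 1e-6): void {
  const total = DIMENSIONS.reduce((sum, d) => sum + w[d], 0);
  if (Math.abs(total - 1) > eps) {
    throw new Error(`weights for "${name}" sum to ${total}, expected 1.0`);
  }
}
for (const [name, w] of Object.entries(WEIGHTS)) assertSumsToOne(name, w);

interface DimScore {
  score: number;
  confidence: Confidence;
}

const CONFIDENCE_ORDER: Confidence[] = ["low", "medium", "high"];

function rollup(scores: Record<Dimension, DimScore>, weights: Weights) {
  const score = DIMENSIONS.reduce((sum, d) => sum + scores[d].score * weights[d], 0);
  // Conservative: the rollup is only as confident as its weakest dimension.
  const weakest = Math.min(
    ...DIMENSIONS.map((d) => CONFIDENCE_ORDER.indexOf(scores[d].confidence)),
  );
  return { score, confidence: CONFIDENCE_ORDER[weakest] };
}
```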

Backwards compat: all existing v1 fields (`score`, `findings`, `summary`, `strengths`, etc.) remain on `PageAuditResult` and `report.json`. Consumers should migrate to `report.v2.pages[].scores` over the next release.

Skill update: `skills/bad/SKILL.md` documents the new JSON shape with an agent-side worked example for choosing which dimension to invest in based on `score × weight` leverage.
16 changes: 16 additions & 0 deletions .changeset/design-audit-layer-7-ethics-gate.md
@@ -0,0 +1,16 @@
---
'@tangle-network/browser-agent-driver': minor
---

feat(design-audit): Layer 7 — domain ethics gate (+ Layer 6 composable predicates)

Adds a hard cap on the rollup score (the ethics "floor") for pages that fail domain-specific ethics rules, plus the predicate vocabulary that lets those rules target the right audience, modality, and regulatory context. RFC: `docs/rfc/design-audit-world-class.md`.

- **Ethics rule set** (`src/design/audit/ethics/rules/{medical,kids,finance,legal}.yaml`) — curated, citation-backed rules covering medication dosage disclosure (FDA 21 CFR 201.57), kid-facing dark-pattern guards (COPPA, FTC Endorsement Guides), finance fee disclosure (TILA / Reg Z), and legal disclaimer presence.
- **Detector kinds** (`src/design/audit/ethics/check.ts`) — `pattern-absent`, `pattern-present`, `llm-classifier`. Pattern checks are case-insensitive against page text; the LLM classifier asks for a single yes/no token to keep latency + cost predictable.
- **Hard rollup floor** — a `critical-floor` violation caps the rollup at 4; `major-floor` caps at 6. `PageAuditResult.preEthicsScore` preserves the LLM's pre-cap score so reports can show "would have scored 8, capped at 4 — fix the dosage disclosure".
- **Composable predicates (Layer 6)** — extends `AppliesWhen` with `audience`, `modality`, `regulatoryContext`, and `audienceVulnerability`. A pediatric medical app on tablet for clinicians now matches the medical *and* kids rule sets simultaneously instead of forcing one classification.
- **CLI flags**: `--skip-ethics` (test-only bypass, audited + warned), `--ethics-rules-dir <path>` (override the builtin yaml), `--audience`, `--modality`, `--audience-vulnerability` (comma-separated tag lists threaded into rule matching).
- **Fixtures** (`bench/design/ethics-fixtures/`) — paired pass/fail HTML for each rule category, used by `tests/design-audit-ethics-{rules,check}.test.ts`.
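
The two pattern detector kinds can be sketched as below. The semantics are an assumption: `pattern-absent` is read as "a required disclosure is missing" and `pattern-present` as "a forbidden pattern was found". The `DetectorSketch` shape and the example patterns are invented for illustration and are not the shipped rules; the sample text is quoted from the fixtures in this change.

```typescript
// Hypothetical detector shape covering the two pattern kinds.
type DetectorSketch =
  | { kind: "pattern-absent"; pattern: string }
  | { kind: "pattern-present"; pattern: string };

// Returns true when the detector fires, i.e. the page violates the rule.
function detectorFires(detector: DetectorSketch, pageText: string): boolean {
  // The "i" flag gives the case-insensitive match against page text.
  const found = new RegExp(detector.pattern, "i").test(pageText);
  return detector.kind === "pattern-absent" ? !found : found;
}

// Made-up example rules, not the curated rule set.
const dosageRequired: DetectorSketch = {
  kind: "pattern-absent",
  pattern: "dosage and administration",
};
const fakeScarcity: DetectorSketch = {
  kind: "pattern-present",
  pattern: "only \\d+ left",
};

// Text taken from the paired fixtures.
const noDosage = "Amoxicillin 500mg Take this medication as your doctor recommends.";
const withDosage = "Dosage and administration Adults: 500 mg orally every 8 hours.";
const darkPattern = "HURRY! Only 3 left!";

const results = {
  noDosage: detectorFires(dosageRequired, noDosage),
  withDosage: detectorFires(dosageRequired, withDosage),
  darkPattern: detectorFires(fakeScarcity, darkPattern),
};
```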

Backwards compat: rules ship empty by default for any classification not on the curated list, so existing audits see no change unless they opt in via `--audience`/`--modality` or land on a covered domain. `EthicsViolation` is exported from both `src/design/audit/types.ts` and `v2/types.ts`; `PageAuditResult.ethicsViolations` is optional.
28 changes: 28 additions & 0 deletions .changeset/jobs-reports-content-engine.md
@@ -0,0 +1,28 @@
---
'@tangle-network/browser-agent-driver': minor
---

feat(jobs+reports): comparative-audit jobs API + AI SDK report tool surface

Three new modules are layered on top of the existing audit pipeline. Together they let you declaratively audit N URLs (optionally expanded into M historical wayback snapshots each), aggregate the results, and emit shareable markdown reports, or expose the same data as AI SDK tools so a browser-side agent can answer ad-hoc questions.

**`src/jobs/`** — declarative comparative-audit jobs.
- `JobSpec` JSON describes targets + audit options + cost cap; `createJob` mints and persists; `runJob` fans out with bounded concurrency and crash-safe per-result writes to `~/.bad/jobs/`.
- Pre-flight cost estimate (`estimateCost`) refuses jobs that would silently spend more than `maxCostUSD`.
- `AuditFn` injection keeps the queue decoupled from Playwright/LLM for tests.
- CLI: `bad jobs create --spec <file.json>`, `bad jobs status <id>`, `bad jobs list`, `bad jobs estimate --spec <file.json>`.
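
The pre-flight check amounts to multiplying out the fan-out and comparing against the cap before any work starts. In the sketch below the spec shape, the flat per-audit price, and the helper names are all hypothetical; only `maxCostUSD` and the refuse-before-running behavior come from the notes above.

```typescript
// Hypothetical, reduced spec shape for the example.
interface JobSpecSketch {
  targets: string[];
  snapshotsPerTarget?: number; // e.g. wayback captures per URL
  maxCostUSD: number;
}

interface CostEstimate {
  audits: number;
  estimatedUSD: number;
}

// Assumes a flat per-audit price; a real estimate would depend on the options.
function estimateCostSketch(
  spec: JobSpecSketch,
  perAuditUSD = 0.05,
): CostEstimate {
  const audits = spec.targets.length * (spec.snapshotsPerTarget ?? 1);
  return { audits, estimatedUSD: audits * perAuditUSD };
}

// Refuse up front instead of discovering the overspend mid-run.
function assertWithinBudget(spec: JobSpecSketch): CostEstimate {
  const estimate = estimateCostSketch(spec);
  if (estimate.estimatedUSD > spec.maxCostUSD) {
    throw new Error(
      `estimated $${estimate.estimatedUSD.toFixed(2)} for ${estimate.audits} audits ` +
        `exceeds maxCostUSD $${spec.maxCostUSD.toFixed(2)}`,
    );
  }
  return estimate;
}
```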

**`src/discover/`** — turn a `DiscoverSpec` into audit targets.
- `wayback` source uses archive.org's CDX API to list captures, then samples `count` evenly across the time range.
- `list` source is a pass-through.
- Pluggable `fetch` for tests; status-200-only filter on by default so 4xx snapshots don't poison the job.
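
Even sampling over a capture list might look like the sketch below. It samples evenly by position in the time-ordered list, which is an assumption (sampling by equal time intervals is another reasonable reading). The `CaptureSketch` shape borrows the `timestamp` and `statuscode` field names from the CDX API; the helper names are hypothetical.

```typescript
// Reduced capture record; field names mirror CDX output columns.
interface CaptureSketch {
  timestamp: string; // e.g. "20200101000000"
  statuscode: string;
}

// Pick `count` items spread evenly by index, always including both ends.
function sampleEvenly<T>(items: T[], count: number): T[] {
  if (count <= 0 || items.length === 0) return [];
  if (count >= items.length) return [...items];
  if (count === 1) return [items[0]];
  const step = (items.length - 1) / (count - 1);
  return Array.from({ length: count }, (_, i) => items[Math.round(i * step)]);
}

function pickSnapshots(
  captures: CaptureSketch[],
  count: number,
): CaptureSketch[] {
  const usable = captures
    .filter((c) => c.statuscode === "200") // drop 3xx/4xx/5xx captures
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  return sampleEvenly(usable, count);
}
```

Filtering before sampling matters: sampling first and filtering afterwards could return fewer than `count` snapshots even when enough good captures exist.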

**`src/reports/`** — turn a job into an artifact.
- `aggregateJob` reads each per-target `report.json`, projects to `AggregateRow` (rollup, dimensions, ethics count). All numbers in any report flow through this — never an LLM.
- `leaderboard`, `longitudinalFor`, `compareRuns`, `tierBuckets` are pure functions over rows.
- `renderLeaderboard` / `renderLongitudinal` / `renderBatchComparison` produce deterministic markdown.
- `narrateReport(brain, body)` optionally prepends an LLM exec-summary; without `brain`, returns the deterministic body unchanged. Same contract as the audit-patches layer: agent narrates, code computes.
- `buildReportTools()` exposes a 7-tool AI SDK surface (`queryJob`, `fetchAudit`, `compareRuns`, `longitudinal`, `tierBuckets`, `renderTemplate`, `runFreshAudit`) so a browser-side agent can interrogate jobs without re-implementing aggregation.
- CLI: `bad reports generate --job <id> --template <leaderboard|longitudinal|batch-comparison> [--top N --by-type X --buckets 10,100 --narrate --out file.md]`.
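
The "code computes, agent narrates" split depends on the aggregation helpers being pure and the renderers being deterministic. A sketch of that idea, with a reduced row shape and hypothetical function names (the shipped `leaderboard` and `renderLeaderboard` signatures are not shown in this change):

```typescript
// Reduced, hypothetical row; the real AggregateRow also carries dimensions.
interface RowSketch {
  url: string;
  rollup: number;
  ethicsCount: number;
}

// Pure: no I/O, no mutation of the input, stable order on ties.
function leaderboardSketch(rows: RowSketch[], top: number): RowSketch[] {
  return [...rows]
    .sort((a, b) => b.rollup - a.rollup || a.url.localeCompare(b.url))
    .slice(0, top);
}

// Deterministic markdown: the same rows always produce the same string.
function renderLeaderboardSketch(rows: RowSketch[]): string {
  const header = "| Rank | URL | Rollup | Ethics violations |\n| --- | --- | --- | --- |";
  const lines = rows.map(
    (r, i) => `| ${i + 1} | ${r.url} | ${r.rollup.toFixed(1)} | ${r.ethicsCount} |`,
  );
  return [header, ...lines].join("\n");
}

const rows: RowSketch[] = [
  { url: "https://b.example", rollup: 7.5, ethicsCount: 0 },
  { url: "https://a.example", rollup: 7.5, ethicsCount: 1 },
  { url: "https://c.example", rollup: 9, ethicsCount: 0 },
];
const markdown = renderLeaderboardSketch(leaderboardSketch(rows, 2));
```

Since every number in the table comes from the rows, an optional LLM summary placed above it can be regenerated or dropped without changing any reported figure.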

**Tests:** +55 across `jobs-store`, `jobs-queue`, `jobs-cost-estimate`, `discover-wayback`, `reports-aggregate`, `reports-templates`, `reports-tools`. Total: 1448 passing.
24 changes: 24 additions & 0 deletions bench/design/ethics-fixtures/finance-disclosed-fees.html
@@ -0,0 +1,24 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>SwiftPay — review and confirm</title>
</head>
<body>
<main>
<h1>Review your payment</h1>
<dl>
<dt>You send</dt><dd>$1,000.00 USD</dd>
<dt>Network fee</dt><dd>$2.50</dd>
<dt>FX rate</dt><dd>1 USD = 0.92 EUR</dd>
<dt>Conversion fee (1.5%)</dt><dd>$15.00</dd>
<dt>Tax (VAT)</dt><dd>$3.40</dd>
<dt>Total charged</dt><dd>$1,020.90 USD</dd>
<dt>Recipient gets</dt><dd>€905.30 EUR</dd>
<dt>APR (effective annual percentage rate, if held)</dt><dd>0.0%</dd>
</dl>
<p>All fees, taxes, and FX are disclosed above. By tapping Pay you authorize SwiftPay to charge $1,020.90.</p>
<button>Pay now</button>
</main>
</body>
</html>
16 changes: 16 additions & 0 deletions bench/design/ethics-fixtures/finance-hidden-fees.html
@@ -0,0 +1,16 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>SwiftPay — Send money</title>
</head>
<body>
<main>
<h1>Send $1,000 to Alex</h1>
<p>You send: $1,000.00</p>
<p>Recipient gets: $1,000.00</p>
<button>Pay now</button>
<p style="font-size:6px;color:#fff">Network, FX, conversion, and processing fees may apply. See terms.</p>
</main>
</body>
</html>
16 changes: 16 additions & 0 deletions bench/design/ethics-fixtures/gdpr-no-consent.html
@@ -0,0 +1,16 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>EuroShop — Home</title>
</head>
<body>
<header><h1>EuroShop</h1></header>
<main>
<h2>Best deals in Berlin today</h2>
<p>Trending products curated for the EU market.</p>
<button>Shop now</button>
</main>
<footer><p>EuroShop GmbH 2026</p></footer>
</body>
</html>
31 changes: 31 additions & 0 deletions bench/design/ethics-fixtures/gdpr-with-consent.html
@@ -0,0 +1,31 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>EuroShop — Home</title>
</head>
<body>
<div role="dialog" aria-label="Cookie consent">
<h2>We use cookies</h2>
<p>We use necessary and optional cookies. Manage your preferences below.</p>
<fieldset>
<legend>Cookie consent preferences</legend>
<label><input type="checkbox" checked disabled /> Necessary</label>
<label><input type="checkbox" /> Analytics</label>
<label><input type="checkbox" /> Marketing</label>
</fieldset>
<button>Accept selected</button>
<button>Reject all</button>
</div>
<header><h1>EuroShop</h1></header>
<main>
<h2>Best deals in Berlin today</h2>
<p>Trending products curated for the EU market.</p>
<button>Shop now</button>
</main>
<footer>
<a href="/privacy">Privacy policy</a> ·
<a href="/privacy/choices">Your privacy choices (Do Not Sell or Share)</a>
</footer>
</body>
</html>
19 changes: 19 additions & 0 deletions bench/design/ethics-fixtures/kids-age-gated.html
@@ -0,0 +1,19 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>SuperKids — verify your age</title>
</head>
<body>
<main>
<h1>Welcome to SuperKids</h1>
<form>
<label for="dob">Date of birth (we verify your age before letting you play):</label>
<input id="dob" name="dob" type="date" required />
<p>Enter your age so we can keep things appropriate. Parents will be notified.</p>
<button type="submit">Continue</button>
</form>
<p>We only collect what is strictly necessary for the activity. Optional fields are clearly marked.</p>
</main>
</body>
</html>
17 changes: 17 additions & 0 deletions bench/design/ethics-fixtures/kids-dark-pattern.html
@@ -0,0 +1,17 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>SuperKids Coins!</title>
</head>
<body>
<main>
<h1>Hi friend! Get 100 SuperCoins NOW!</h1>
<p style="color:red;font-size:32px"><strong>HURRY! Only 3 left!</strong></p>
<p>Tap YES or your unicorn will be sad forever 🥺</p>
<button style="background:lime;font-size:48px">YES, give me coins!</button>
<a href="#" style="font-size:8px;color:#ccc">no thanks, I hate fun</a>
<p style="font-size:8px;color:#ddd">Cost: $4.99 charged to mom's card. Auto-renews monthly. Cancel by writing a letter.</p>
</main>
</body>
</html>
18 changes: 18 additions & 0 deletions bench/design/ethics-fixtures/medical-no-dosage.html
@@ -0,0 +1,18 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>RxMed — Order amoxicillin</title>
</head>
<body>
<header><h1>RxMed Pharmacy</h1></header>
<main>
<h2>Amoxicillin 500mg</h2>
<p>Take this medication as your doctor recommends.</p>
<p>Available in 30-tablet packs.</p>
<button>Add to cart</button>
<button>Refill prescription</button>
</main>
<footer><p>RxMed &copy; 2026</p></footer>
</body>
</html>
24 changes: 24 additions & 0 deletions bench/design/ethics-fixtures/medical-with-dosage.html
@@ -0,0 +1,24 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>RxMed — Order amoxicillin (Rx)</title>
</head>
<body>
<header><h1>RxMed Pharmacy</h1></header>
<main>
<h2>Amoxicillin 500mg — Prescribing Information</h2>
<section aria-labelledby="dose-h">
<h3 id="dose-h">Dosage and administration</h3>
<p>Adults: 500 mg orally every 8 hours. Adjust dosage for renal impairment.</p>
</section>
<section aria-labelledby="warn-h">
<h3 id="warn-h">Warnings and contraindications</h3>
<p>Contraindication: hypersensitivity to penicillin.</p>
<p>Adverse effects: nausea, diarrhea, rare anaphylaxis. Report any side effect to MedWatch (FDA 1088).</p>
</section>
<button>Add to cart</button>
<p><a href="/medwatch">Report a side effect</a> (MedWatch).</p>
</main>
</body>
</html>
1 change: 1 addition & 0 deletions package.json
@@ -133,6 +133,7 @@
"pixelmatch": "^7.1.0",
"playwright": "^1.40.0",
"pngjs": "^7.0.0",
"tsx": "^4.21.0",
"typescript": "^5.3.0",
"vitest": "^4.0.18"
}