This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.
Releases
@tangle-network/browser-agent-driver@0.31.0
Minor Changes
#84 a679190 Thanks @drewstone! - feat(jobs+reports): brand-kit / design-system extraction at every audit target

Comparative-audit jobs can now extract the full deterministic design-token bundle (colors, font families, type scale, logos, font files, brand metadata, detected libraries) at every target, including every wayback snapshot. New `brand-evolution` report template renders a per-URL chronological view of palette and typography drift, with snapshot-to-snapshot deltas (colors added/removed, font family swaps, brand-meta changes, library adoption).

Spec: add `audit.extractTokens: true` to a `JobSpec`. Each per-target output dir gets a `tokens.json` alongside `report.json`.

CLI: `bad reports generate --template brand-evolution --job <id>`

AI SDK tools: two new tools, `fetchTokens` (returns the per-target token summaries, optionally filtered to one URL's chronological series) and `diffTokens` (deterministic delta between two token summaries in the same job). `renderTemplate` now accepts `template: 'brand-evolution'`.

The token extractor is the existing `extractDesignTokens` (no LLM, ~10s per target). Same deterministic-data / LLM-narrates contract as the rest of the reports surface: every callout in the brand-evolution report comes from a pure function of `tokens.json`.

Verified end-to-end on `https://stripe.com/` 2014 → 2019 → 2024 wayback snapshots: pulled out the Whitney → Camphor → sohne-var typeface progression and the matching primary-color shifts (`#008cdd` → `#6772e5` → `#635bff`).

+12 new tests across `reports-tokens` and the queue/tools touch-ups. Total: 1460 passing.

#81
36b6e63 Thanks @drewstone! - feat(design-audit): 8-layer architecture - Layers 1-7 fully shipped, Layer 8 scaffold

Full implementation of RFC-002: World-Class Design Audit. Primary consumer is coding agents (Claude Code, Codex, OpenCode, Pi); the architecture is JSON-first, tool-callable, and self-explaining when uncertain.
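To make one of the layers in this entry concrete: Layer 7 caps the rollup score when an ethics rule fails (`critical-floor → 4`, `major-floor → 6`) while keeping the uncapped value in `preEthicsScore`. The following is a minimal sketch of that rule, not the package's implementation; `FloorSeverity`, `ViolationSketch`, and `applyEthicsFloor` are hypothetical names chosen for illustration.

```typescript
// Sketch only: the type and function names here are hypothetical,
// not exports of @tangle-network/browser-agent-driver.
type FloorSeverity = "critical-floor" | "major-floor";

interface ViolationSketch {
  ruleId: string;
  severity: FloorSeverity;
}

// Caps as stated in the release notes: critical-floor -> 4, major-floor -> 6.
const FLOOR_CAP: Record<FloorSeverity, number> = {
  "critical-floor": 4,
  "major-floor": 6,
};

// Returns the capped score plus the original score, so a report can say
// "would have scored 8, capped at 4".
function applyEthicsFloor(
  rollupScore: number,
  violations: ViolationSketch[],
): { score: number; preEthicsScore: number } {
  let capped = rollupScore;
  for (const v of violations) {
    capped = Math.min(capped, FLOOR_CAP[v.severity]);
  }
  return { score: capped, preEthicsScore: rollupScore };
}
```

A cap only ever lowers a score, so a page that already scores below the cap is left unchanged.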
Layer 1 - Multi-dimensional scoring (shipped)
- Ensemble classifier records `ensembleConfidence`, `signalsAgreed`, `dissent`.
- Five universal dimensions: `product_intent / visual_craft / trust_clarity / workflow / content_ia`.
- Anchor files (`rubric/anchors/*.yaml`) so app surfaces aren't judged against marketing-site polish.
- `AuditResult_v2` emitted alongside v1 shape; v1 deprecated with one-release lag.

Layer 2 - Patch primitives (shipped)
- `patches[]` with `target`, `diff.before/after`, `testThatProves`, `rollback`, `estimatedDelta`, and `estimatedDeltaConfidence`.
- `diff.before` is validated as a substring of the page snapshot at parse time; agents apply patches literally without re-authoring.
- `patches/render.ts`: renders `unifiedDiff` from before/after when `target.filePath` is known (`git apply`-able).

Layer 3 - First-principles fallback (shipped)
- Fires when `ensembleConfidence < 0.6`, signals disagree, or page type is `unknown`.
- `rollup.confidence = 'low'`; emits `NovelPatternObservation` to `~/.bad/novel-patterns/` for fleet mining.
- `first-principles.md` carries the exact prompt that fires in this mode.

Layer 4 - Outcome attribution (shipped)
- `bad design-audit ack-patch <patchId> --pre-run-id <runId>`: records that an agent applied a patch.
- `bad design-audit --post-patch <patchId>` on re-audit: computes observed delta vs predicted, writes `agreementScore`.
- `~/.bad/attribution/applications/`: append-only; outcomes are new events, not mutations.
- `aggregatePatchReliability()` cross-tenant rollup: groups by `patchHash = sha256(before+after+scope).slice(0,16)`. After N≥30 / ≥5 tenants / replicationRate≥0.7 → `recommendation: 'recommended'`.

Layer 5 - Pattern library (scaffold)
- `patterns/{store,mine,match}.ts` + `cli-patterns.ts` (`bad patterns query|show`).

Layer 6 - Composable predicates (shipped)
- `AppliesWhen` extended with `audience`, `modality`, `regulatoryContext`, `audienceVulnerability`.
- `audience-{clinician,kids,developer}.md`, `regulatory-{hipaa,gdpr,coppa}.md`, `modality-{mobile,tablet}.md`, `audience-vulnerability-minor-facing.md`.
- `--audience`, `--modality`, `--regulatory`, `--audience-vulnerability` CLI flags.

Layer 7 - Domain ethics gate (shipped)
- Rollup caps: `critical-floor → 4`, `major-floor → 6`.
- `preEthicsScore` preserves the LLM's uncapped score.
- `--skip-ethics` bypass (test-only, logged + warned), `--ethics-rules-dir` override.
- Fixtures: `bench/design/ethics-fixtures/`.

Layer 8 - Modality adapters (scaffold)
- `modality/{types,html,ios,android,index}.ts`. HTML adapter wraps existing Playwright pipeline. iOS and Android throw `NotImplementedError` with clear message.
- `--modality html|ios|android` dispatches to the right adapter.

Skill contract updates:
- `~/code/dotfiles/claude/skills/bad/SKILL.md`: patch consumption loop, Layer 3-8 contract, ack-patch / --post-patch close-the-loop, ethics floor priority rule.
- `skills/design-evolve/SKILL.md`: Phase 3 (apply fixes) now patch-first; Phase 4 includes attribution close-the-loop.

Tests: +40 new tests across `design-audit-patch-{parse,validate}`, `design-audit-first-principles`, `design-audit-attribution`. Total: 1393 passing.

#81
36b6e63 Thanks @drewstone! - feat(design-audit): Layer 1 - multi-dim scoring foundation

Land the first layer of the world-class 8-layer design-audit architecture (RFC `docs/rfc/design-audit-world-class.md`). This release ships:
- Ensemble classifier (`src/design/audit/classify-ensemble.ts`): three-signal vote (URL pattern + DOM heuristic + LLM tiebreaker) with explicit `ensembleConfidence`, `signalsAgreed`, and `dissent` records. URL+DOM agreement above the 0.7 threshold skips the LLM call entirely.
- Rollup weights (`src/design/audit/rubric/rollup-weights.ts`): saas-app, marketing, dashboard, docs, ecommerce, social, tool, blog, utility, plus `default` / `unknown` fallbacks. Module-load invariant: every weight set sums to 1.0 ± 1e-6.
- Anchors (`src/design/audit/rubric/anchors/*.yaml`): 9 anchor files referencing real product 9-10 examples (Linear's app, Figma, Notion, Stripe, MDN, Apple Store, Threads, Stratechery, Vercel deploys, etc.) so saas-app surfaces are no longer judged against marketing-site polish.
- Scoring (`src/design/audit/v2/score.ts`): five universal dimensions (`product_intent / visual_craft / trust_clarity / workflow / content_ia`), each with `score`, `range`, `confidence`. Rollup is a weighted aggregate with conservative confidence (any dim `low` → rollup `low`).
- `AuditResult_v2`: emitted alongside the v1 shape in `report.json` under a top-level `v2` block. One-release deprecation window before v1 is removed.
- `--audit-passes auto`: new default that runs the ensemble classifier first, then picks the focused pass bundle for that classification.

Backwards compat: all existing v1 fields (`score`, `findings`, `summary`, `strengths`, etc.) remain on `PageAuditResult` and `report.json`. Consumers should migrate to `report.v2.pages[].scores` over the next release.

Skill update: `skills/bad/SKILL.md` documents the new JSON shape with an agent-side worked example for choosing which dimension to invest in based on `score × weight` leverage.

#81
36b6e63 Thanks @drewstone! - feat(design-audit): Layer 7 - domain ethics gate (+ Layer 6 composable predicates)

Adds a hard score floor for pages that fail domain-specific ethics rules and the predicate vocabulary that lets those rules target the right audience/modality/regulatory context. RFC: `docs/rfc/design-audit-world-class.md`.
- Rule sets (`src/design/audit/ethics/rules/{medical,kids,finance,legal}.yaml`): curated, citation-backed rules covering medication dosage disclosure (FDA 21 CFR 201.57), kid-facing dark-pattern guards (COPPA, FTC Endorsement Guides), finance fee disclosure (TILA / Reg Z), and legal disclaimer presence.
- Checks (`src/design/audit/ethics/check.ts`): `pattern-absent`, `pattern-present`, `llm-classifier`. Pattern checks are case-insensitive against page text; the LLM classifier asks for a single yes/no token to keep latency + cost predictable.
- A `critical-floor` violation caps the rollup at 4; `major-floor` caps at 6. `PageAuditResult.preEthicsScore` preserves the LLM's pre-cap score so reports can show "would have scored 8, capped at 4; fix the dosage disclosure".
- `AppliesWhen` extended with `audience`, `modality`, `regulatoryContext`, and `audienceVulnerability`. A pediatric medical app on tablet for clinicians now matches the medical and kids rule sets simultaneously instead of forcing one classification.
- CLI flags: `--skip-ethics` (test-only bypass, audited + warned), `--ethics-rules-dir <path>` (override the builtin yaml), `--audience`, `--modality`, `--audience-vulnerability` (comma-separated tag lists threaded into rule matching).
- Fixtures (`bench/design/ethics-fixtures/`): paired pass/fail HTML for each rule category, used by `tests/design-audit-ethics-{rules,check}.test.ts`.

Backwards compat: rules ship empty by default for any classification not on the curated list, so existing audits see no change unless they opt in via `--audience` / `--modality` or land on a covered domain. `EthicsViolation` is exported from both `src/design/audit/types.ts` and `v2/types.ts`; `PageAuditResult.ethicsViolations` is optional.

#83
aec48b5 Thanks @drewstone! - feat(jobs+reports): comparative-audit jobs API + AI SDK report tool surface

Three new modules layered cleanly on top of the existing audit pipeline. Lets you declaratively audit N URLs (optionally expanded into M historical wayback snapshots each), aggregate the results, and emit shareable markdown reports, or expose the same data as AI SDK tools so a browser-side agent can answer ad-hoc questions.
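Because each URL can fan out into several snapshots, the number of audit targets (and so the spend) grows multiplicatively, which is why this entry pairs the jobs API with a `maxCostUSD` cap. A rough sketch of such a pre-flight gate follows. It is illustrative only: `JobSpecSketch`, its `urls` field, the flat `perTargetUSD` rate, and both helper functions are assumptions, not the package's actual `JobSpec` or `estimateCost`.

```typescript
// Hypothetical spec shape for illustration; the real JobSpec is not shown
// in these release notes. Only maxCostUSD and snapshotsPerUrl are names
// that appear in the notes.
interface JobSpecSketch {
  urls: string[];
  snapshotsPerUrl?: number; // assumed to default to 1 (no wayback expansion)
  maxCostUSD: number;
}

// N URLs, each expanded into M snapshots, yields N * M audit targets.
function countTargets(spec: JobSpecSketch): number {
  const perUrl = Math.max(1, spec.snapshotsPerUrl ?? 1);
  return spec.urls.length * perUrl;
}

// Assumes a flat per-target cost; a real estimator may be more nuanced.
function estimateCostUSD(spec: JobSpecSketch, perTargetUSD: number): number {
  return countTargets(spec) * perTargetUSD;
}

// Refuse the job up front rather than discovering the overspend mid-run.
function assertWithinBudget(spec: JobSpecSketch, perTargetUSD: number): number {
  const estimate = estimateCostUSD(spec, perTargetUSD);
  if (estimate > spec.maxCostUSD) {
    throw new Error(
      `Estimated $${estimate.toFixed(2)} exceeds cap $${spec.maxCostUSD.toFixed(2)}`,
    );
  }
  return estimate;
}
```

Checking before the fan-out starts keeps the refusal cheap: nothing has been audited yet, so nothing has been spent.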
`src/jobs/`: declarative comparative-audit jobs.
- `JobSpec` JSON describes targets + audit options + cost cap; `createJob` mints and persists; `runJob` fans out with bounded concurrency and crash-safe per-result writes to `~/.bad/jobs/`.
- `estimateCost` refuses jobs that would silently spend more than `maxCostUSD`.
- `AuditFn` injection keeps the queue decoupled from Playwright/LLM for tests.
- CLI: `bad jobs create --spec <file.json>`, `bad jobs status <id>`, `bad jobs list`, `bad jobs estimate --spec <file.json>`.

`src/discover/`: turn a `DiscoverSpec` into audit targets.
- `wayback` source uses archive.org's CDX API to list captures, then samples `count` evenly across the time range.
- `list` source is a pass-through.
- Injectable `fetch` for tests; status-200-only filter on by default so 4xx snapshots don't poison the job.

`src/reports/`: turn a job into an artifact.
- `aggregateJob` reads each per-target `report.json`, projects to `AggregateRow` (rollup, dimensions, ethics count). All numbers in any report flow through this, never an LLM.
- `leaderboard`, `longitudinalFor`, `compareRuns`, `tierBuckets` are pure functions over rows.
- `renderLeaderboard` / `renderLongitudinal` / `renderBatchComparison` produce deterministic markdown.
- `narrateReport(brain, body)` optionally prepends an LLM exec-summary; without `brain`, returns the deterministic body unchanged. Same contract as the audit-patches layer: agent narrates, code computes.
- `buildReportTools()` exposes a 7-tool AI SDK surface (`queryJob`, `fetchAudit`, `compareRuns`, `longitudinal`, `tierBuckets`, `renderTemplate`, `runFreshAudit`) so a browser-side agent can interrogate jobs without re-implementing aggregation.
- CLI: `bad reports generate --job <id> --template <leaderboard|longitudinal|batch-comparison> [--top N --by-type X --buckets 10,100 --narrate --out file.md]`.

Tests: +55 across `jobs-store`, `jobs-queue`, `jobs-cost-estimate`, `discover-wayback`, `reports-aggregate`, `reports-templates`, `reports-tools`. Total: 1448 passing.

#85
3451a43 Thanks @drewstone! - feat(jobs): robustness layer + agentic orchestrator

Five hardening additions plus an LLM-driven control loop that wraps the runner. The architectural rule: protocols are deterministic (retry, anti-bot detection, schema gating) and judgment is agentic (when to re-sample broken wayback snapshots, retry vs. skip, conclude). Mixing those lines is how you end up paying LLM tax on exponential backoff.
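The deterministic side of that rule can be illustrated with the retry protocol itself. Below is a minimal sketch of whitelist-based retry with capped exponential backoff, using the defaults this entry lists for `src/jobs/retry.ts` (3 attempts, 500ms base, 5s cap). The `ErrorClass` union, `RetryPolicy` shape, `backoffDelayMs`, and `withRetry` are hypothetical names, and "full jitter" (a uniform draw below the ceiling) is one common choice assumed here; the notes only say "jitter".

```typescript
// Hypothetical error classes. The retryable set mirrors the release notes:
// 429 / 5xx / network / timeout are retried; everything else is not.
type ErrorClass =
  | "rate-limit" | "server" | "network" | "timeout"
  | "client" | "anti-bot" | "schema" | "unknown";

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  isRetryable: (cls: ErrorClass) => boolean;
}

const RETRYABLE: ErrorClass[] = ["rate-limit", "server", "network", "timeout"];

const DEFAULT_POLICY: RetryPolicy = {
  maxAttempts: 3,
  baseDelayMs: 500,
  maxDelayMs: 5000,
  isRetryable: (cls) => RETRYABLE.indexOf(cls) >= 0,
};

// Delay before the retry that follows failed attempt `attempt` (1-based).
// The ceiling doubles per attempt and is clamped to maxDelayMs; the actual
// delay is a uniform draw below that ceiling (full jitter, an assumption).
function backoffDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(
    policy.maxDelayMs,
    policy.baseDelayMs * Math.pow(2, attempt - 1),
  );
  return Math.floor(random() * ceiling);
}

// Runs fn, retrying only whitelisted error classes, up to maxAttempts total.
async function withRetry<T>(
  fn: () => Promise<T>,
  classify: (err: unknown) => ErrorClass,
  policy: RetryPolicy = DEFAULT_POLICY,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const giveUp =
        attempt >= policy.maxAttempts || !policy.isRetryable(classify(err));
      if (giveUp) throw err;
      await sleep(backoffDelayMs(attempt, policy));
    }
  }
}
```

Nothing in this loop needs judgment: whether to retry is a table lookup and how long to wait is arithmetic, so there is no reason to route it through a model.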
Deterministic foundation
- `src/jobs/retry.ts`: whitelist-based retry with exponential backoff + jitter. Retries 429 / 5xx / network / timeout / fetch failures; everything else (4xx, anti-bot, schema, unknown) is treated as deterministic and not retried. Configurable per-error-class via `isRetryable`. Default: 3 attempts, 500ms base, 5s cap. Wired into `runJob` via `RunJobOptions.retryPolicy`.
- `src/jobs/anti-bot.ts`: pure pattern match against an audit's `report.json`. Title patterns (Cloudflare interstitial, "Just a moment...", "Access denied", etc.) and intent patterns plus a last-resort heuristic (zero findings + low classifier confidence + unknown type). When fired, the runner records `status: 'skipped'` with a reason instead of putting a bogus score on the leaderboard.
- `src/jobs/cost-history.ts`: adaptive cost estimate from prior job records. Uses static default until N≥3 completed jobs exist; afterward averages per-target cost from the last 20. Floors at 50% of the static default to prevent runaway optimism on a stretch of zero-cost claude-code jobs.
- `tokens.json` is now stamped with `schemaVersion: 1` at write time; the aggregator refuses files older than `MIN_TOKENS_SCHEMA`.
- `bad jobs resume <jobId>` re-runs only targets that aren't already `ok` / `skipped`. `RunJobOptions.resume` exposes the same on the API.

Agentic orchestrator
- `src/jobs/orchestrator.ts`: `orchestrateJob(job, opts)` runs the deterministic fan-out via `runJob`, then enters a control loop only if intervention is warranted.
- `needsIntervention` is the gate: any failures, missing entries, or zero-scored wayback snapshots (broken archive captures) trigger the agent.
- `getJobState`, `resampleWayback`, `retryTarget`, `markSkipped`, `concludeJob`. Hard caps: 2 retries per target, 1 resample per URL, cost ≤ `spec.maxCostUSD * 0.9`.
- Same `claude-code` provider as the audit pipeline (subscription-based, no API key required).
- `bad jobs orchestrate --spec <file.json>` runs the spec end-to-end with the agent layer. Same JSON spec as `create`.

Tests: +34 across `jobs-retry`, `jobs-anti-bot`, `jobs-cost-history`, `jobs-orchestrator` (deterministic gate), and `jobs-orchestrator-agent` (LLM path with `MockLanguageModelV3`). Total: 1494 passing.

Patch Changes
#84 a679190 Thanks @drewstone! - fix(discover/wayback): use CDX `collapse=timestamp:6` instead of `limit` so longitudinal jobs span the requested window

Symptom: a job with `since: 2012-01-01, until: 2024-01-01, snapshotsPerUrl: 4` against a popular site returned four snapshots all clustered in 2012-2013 instead of evenly across 2012-2024.

Cause: the CDX call passed `limit: max(count*4, 50)`, which caps how many captures CDX returns before `sampleEvenly` runs. For sites with thousands of captures (Stripe, Linear, GitHub, etc.) the first 50 in chronological order are all from the start of the window, so even sampling could only produce early-window snapshots.

Fix: drop `limit`, use `collapse=timestamp:6` (one capture per month). The row count is now bounded by the window length in months, which keeps payloads sane while ensuring captures are spread across the whole window.

Verified: `discoverWaybackSnapshots('https://stripe.com/', { count: 5, since: '2012-01-01', until: '2024-01-01' })` now returns snapshots at 2012-02, 2015-03, 2018-03, 2021-02, 2024-01.
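The cause and fix above can be reproduced with a small sketch. `buildCdxUrl` assembles a Wayback CDX query using the documented `from`, `to`, `output`, `filter`, and `collapse` parameters, and `sampleEvenly` here is an illustrative re-implementation (index-based, endpoints included), not the package's own function. Both signatures are assumptions.

```typescript
// Build a CDX query that returns at most one capture per month
// (timestamp:6 collapses on the YYYYMM prefix). Hypothetical helper.
function buildCdxUrl(url: string, since: string, until: string): string {
  const compact = (d: string): string => d.replace(/-/g, ""); // "2012-01-01" -> "20120101"
  return (
    "https://web.archive.org/cdx/search/cdx" +
    "?url=" + encodeURIComponent(url) +
    "&from=" + compact(since) +
    "&to=" + compact(until) +
    "&output=json" +
    "&filter=statuscode:200" +
    "&collapse=timestamp:6"
  );
}

// Pick `count` items spread across the whole list, first and last included.
// Illustrative sketch of even sampling by index.
function sampleEvenly<T>(items: readonly T[], count: number): T[] {
  if (count <= 0 || items.length === 0) return [];
  if (count >= items.length) return items.slice();
  if (count === 1) return [items[0]];
  const picked: T[] = [];
  for (let i = 0; i < count; i++) {
    const index = Math.round((i * (items.length - 1)) / (count - 1));
    picked.push(items[index]);
  }
  return picked;
}

// A 2012-01 .. 2024-01 window collapsed to months has 145 rows (indices 0..144).
const monthIndexes: number[] = [];
for (let m = 0; m <= 144; m++) monthIndexes.push(m);

// Sampling the full window reaches the end of the range...
const spread = sampleEvenly(monthIndexes, 5);
// ...while sampling only the first 50 rows (the old `limit` behaviour)
// can never pick anything past row 49.
const clustered = sampleEvenly(monthIndexes.slice(0, 50), 5);
```

The sampler was never the problem: given a truncated input it spreads picks evenly across the truncated range, which is why the fix belongs in the query and not in `sampleEvenly`.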