docs(rule-engine-poc): single-page HTML report reference#526

Open
Luis85 wants to merge 48 commits into develop from claude/rule-engine-poc-gO5yq

Conversation

Owner

Luis85 commented May 17, 2026

Summary

Adds experiments/rule-engine-poc/docs/report-reference.md — a single-page overview of the HTML report the POC produces.

The original POC PR (#525) merged with five separate research artifacts about the report (research/17–21) and with the report's design scattered across architecture.md, workflow.md, and audit-trail.md. This doc consolidates all of that into one place for someone who wants to understand the report end-to-end without chasing references.

What's in it

  • Section-by-section walkthrough — every section of the rendered HTML mapped to its research source.
  • The five perspectives that shaped the v3 rebuild — UX (research/17), stakeholder strategy (18), brand (19), auditor readability (20), misread risks (21), each with the agent's top-line finding.
  • The 12 wave-4 changes that landed in the implementer pass + the Codex round 11–14 hardenings on top.
  • What's still open, bucketed by where each item lives: strategy slice / governance / ADR / production prep / discovery RATs.
  • How to generate and how to read the report (the 4-step skim path).

docs/README.md updated to index the new doc.

Test plan

  • Doc renders correctly in the GitHub markdown preview.
  • All cross-references resolve to real files.
  • typos --config _typos.toml passes.
  • (optional) Reviewer skims the doc and confirms it actually consolidates rather than duplicates.

Generated by Claude Code

claude added 30 commits May 17, 2026 11:30
Terminal-only TypeScript POC of the "LLM extracts, rules decide" pattern
from the AI fact-checking community: the LLM is constrained to producing
structured flags from raw signals, and a deterministic rule engine maps
those flags to a verdict tier with a fully replayable audit trail.

Lives under experiments/rule-engine-poc/ as a sandbox (not formal Stage
1-7) and demonstrates the pattern against the repo's own quality
framework. Each rule encodes a Definition of Done item from
docs/quality-framework.md.

What's included:
- src/ - hand-rolled engine (~250 LOC): types, hash, engine, loader,
  cli, html-report. Pure functions; severity-first verdict; canonical
  JSON + SHA-256 provenance hashes for replay.
- rules/quality-gates.yaml - DoD-as-rules example set.
- fixtures/*.json - 5 mock Orient-quadrant extractions covering ready,
  blocked, and needs-attention verdicts.
- test/ - 24 passing tests (vitest), including dedicated reproducibility
  suite (strategist-recommended North Star: byte-identical replay).
- HTML reporter - self-contained, inline CSS, no JS, no external assets.
- docs/ - architecture, DSL reference, audit trail + EU AI Act mapping,
  extension guide, OODA integration.
- research/ - five-angle research wave (technical landscape, regulatory
  auditability, positioning/JTBD, design alternatives, risks/critique).
- Validate that then.verdict is one of the four known tiers; previously
  a typo like 'blokced' would load successfully and silently degrade
  the rule into a no-op via tally[<unknown-key>] (#525 P1).
- Validate that when.all / when.any / when.not are arrays at load time;
  previously 'any: true' would load and crash at evaluation with
  TypeError on .map (#525 P2).
- Export VERDICTS as a runtime constant from types.ts so the schema
  check has one source of truth alongside the type.
- Three new loader tests cover the two failure modes plus a typo case.
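The verdict check described above can be sketched roughly as follows. This is a minimal illustration, not the code in types.ts or loader.ts: the commit says there are four tiers but does not list them all, so the fourth name and the `assertVerdict` helper are assumptions made for the example.

```typescript
// Assumption: tier names are illustrative. 'informational' in particular is a
// placeholder; the source only confirms that four tiers exist.
const VERDICTS = ["blocked", "needs-attention", "ready-to-progress", "informational"] as const;
type Verdict = (typeof VERDICTS)[number];

// Hypothetical helper: fail at load time instead of letting a typo become a no-op.
function assertVerdict(ruleId: string, value: unknown): Verdict {
  const known = VERDICTS as readonly string[];
  if (typeof value !== "string" || known.indexOf(value) === -1) {
    throw new Error(
      `Rule '${ruleId}': then.verdict must be one of ${VERDICTS.join(", ")}; got ${JSON.stringify(value)}`
    );
  }
  return value as Verdict;
}
```

Deriving the type from the runtime constant keeps one source of truth, which is the point the commit makes about exporting VERDICTS.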
- engine: 'exists' now participates in the AND-chain instead of
  short-circuiting, so 'exists: true' combined with 'eq'/'ne'/'gt'/'lt'/
  'in'/'regex' correctly requires every operator to match (#525 P2).
  'exists: false' still tolerates flag absence without surfacing the
  'flag missing' reason; this is the only short-circuit retained.
- loader: empty 'when.all' / 'when.any' / 'when.not' arrays are now
  rejected at load time. Previously 'any: []' was vacuously satisfied
  by the length>0 guard in evaluateWhen, allowing a typo to fire a
  blocking rule unintentionally (#525 P1).
- ENGINE_VERSION bumped 0.1.0 -> 0.2.0 because the exists+value-op
  interaction is a semantic change. Per docs/extending.md, a version
  bump is the auditor's signal that prior verdicts may not replay.
- Six new tests: empty when.any / when.all rejection, four exists-AND
  cases including exists:false standalone.
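A rough sketch of the AND-chain semantics described in this commit. The `Condition` shape is trimmed to three operators and the function body is an assumption, not the engine's actual evaluateCondition; it only illustrates that 'exists: true' no longer short-circuits while 'exists: false' still does.

```typescript
type Flags = Record<string, unknown>;

// Trimmed, hypothetical condition shape (the real DSL has more operators).
interface Condition { flag: string; exists?: boolean; eq?: unknown; gt?: number; }
interface ConditionResult { matched: boolean; reason?: string; }

function evaluateCondition(c: Condition, flags: Flags): ConditionResult {
  // null is treated as missing, as the later validate.ts commit notes.
  const present = Object.prototype.hasOwnProperty.call(flags, c.flag) && flags[c.flag] != null;

  // The only retained short-circuit: absence is the expected state, so no reason is surfaced.
  if (c.exists === false) return { matched: !present };

  if (!present) return { matched: false, reason: "flag missing in extraction" };

  // 'exists: true' is satisfied here; every remaining operator must also match.
  const value = flags[c.flag];
  if (c.eq !== undefined && value !== c.eq) {
    return { matched: false, reason: `eq: expected ${JSON.stringify(c.eq)}` };
  }
  if (c.gt !== undefined) {
    if (typeof value !== "number") {
      return { matched: false, reason: `expected number for gt, got ${typeof value}` };
    }
    if (!(value > c.gt)) return { matched: false, reason: `gt: ${value} is not above ${c.gt}` };
  }
  return { matched: true };
}
```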
- loader: each condition must declare at least one supported operator.
  Previously a typo like { flag: 'x', eqq: true } would load and then
  always-match at runtime, silently flipping verdicts (#525 round 3 P2).
- loader: 'exists: false' combined with value operators is rejected at
  load time. The combination has no meaningful semantics — an absent
  flag has no value for eq/ne/gt/lt/in/regex to apply to (#525 round 3 P1).
- loader: condition objects without a 'flag' string are rejected.
- engine: code comment clarified to reflect the now-enforced invariant
  that exists:false is only valid alone.
- Three new loader tests cover the three rejection paths.
- Reject 'then.weight' values that are <=0, infinite, or NaN. A
  'blocked' rule with weight 0 would contribute nothing to the tally
  and silently bypass the gate (#525 round 4 P1).
- Reject non-array 'in' operators at load time. Previously a typo like
  'in: foo' would load and crash at evaluation when .some() is called
  on a non-array (#525 round 4 P2).
- Validate regex patterns at load time. Previously a malformed regex
  like 'regex: "["' would load and abort the entire decision run with
  a SyntaxError when new RegExp() throws during evaluation (#525 round
  4 P2).
- Updated existing 'missing then.weight' test to match the new
  'invalid then.weight' error message.
- Four new loader tests cover the three rejection paths plus a
  negative-weight case.
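The weight and regex checks in this commit could look something like the sketch below. The helper names `validateWeight` and `validateRegex` and the exact error wording are assumptions for illustration; only the rejection rules come from the commit message.

```typescript
// Hypothetical helper: a weight must be a finite number strictly above zero,
// otherwise a 'blocked' rule could contribute nothing to the tally.
function validateWeight(ruleId: string, weight: unknown): number {
  if (typeof weight !== "number" || !Number.isFinite(weight) || weight <= 0) {
    throw new Error(`Rule '${ruleId}': invalid then.weight ${String(weight)}`);
  }
  return weight;
}

// Hypothetical helper: compile the pattern once at load time so a malformed
// regex fails here rather than aborting a whole decision run later.
function validateRegex(ruleId: string, pattern: unknown): RegExp {
  if (typeof pattern !== "string") {
    throw new Error(`Rule '${ruleId}': regex must be a string`);
  }
  try {
    return new RegExp(pattern);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    throw new Error(`Rule '${ruleId}': invalid regex ${JSON.stringify(pattern)} (${detail})`);
  }
}
```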
End-to-end flow now drives the POC: user adds content to the project,
runs npm run plan to generate AI extraction prompts, pastes a prompt
into Claude/ChatGPT, saves the JSON to extractions/, runs npm run
report to render HTML and open it in the browser.

Architecture:
- rule-engine.config.json declares targets, each with id + label +
  paths (files or directories, walked deterministically).
- rules/flag-schema.yaml documents every flag the rule set may
  reference (type + description + example); the contract between AI
  extractor and engine.
- src/plan.ts walks target paths, collects file contents with 8 KB
  truncation per file, bundles role + schema + rules + source into
  a single prompt per target.
- src/report.ts loads extractions per target, runs the engine, renders
  the existing HTML reporter, best-effort opens the first report in
  the OS default browser. Exit 0/1/2 = ok/blocked/missing.
- Prompt-builder follows analyst research (research/10): XML-tag
  structure with markdown redundancy, explicit forbidden-fields list
  (verdict, assessment, conclusion, summary, recommendation,
  rationale, analysis), open <output> tag as a forcing function.
- Original single-shot src/cli.ts preserved as a fixture-testing
  escape hatch.

20 new tests cover the new modules (config, flag-schema, context,
prompt-builder). Suite total: 60 tests, all passing.

Research wave 2 (5 background agents) wrote 5 new artifacts under
research/ covering independent review, workflow failure modes,
workflow architecture, UX friction, and extraction prompt patterns.

POC stays sandbox-scoped under experiments/rule-engine-poc/. No wiring
into specs/, /spec:status, plugins/, or the main repo.
…Codex round 5)

- open-browser: openInBrowser is now async and waits briefly for the
  spawn or error event before resolving. Previously it returned true
  immediately, so report.ts printed "opened in browser" even when
  xdg-open was missing in a headless container — misleading users
  during the primary plan->report flow (#525 round 5 P2).
- report.ts: awaits openInBrowser and prints the correct status line.
  Verified in this sandbox: now prints "could not spawn a browser;
  open manually: file://..." when no browser is installed.
- loader: 'exists' operator must be a boolean. Previously a typo like
  'exists: "false"' would load and then silently never match because
  evaluateCondition compares boolean to string (#525 round 5 P2).
- loader: 'gt' and 'lt' operators must be numbers at load time, for
  consistency with the other type checks (engine already failed
  matching at runtime, but failing at load is preferred).
- Two new loader tests cover the exists-boolean and gt-number paths.
…und 6)

- open-browser: spawn cmd /c start "" on Windows. 'start' is a cmd.exe
  built-in, not a standalone exe, so spawn('start', ...) raised ENOENT
  and browser open always failed for Windows users (#525 round 6 P2).
- report: validate the parsed extraction is a plain object before
  passing to evaluate. Previously valid JSON like null / [] / "text"
  would crash inside hasOwnProperty.call on null instead of producing
  a controlled error (#525 round 6 P2).
- prompt-builder: pick a fence length longer than any backtick run in
  the source content. Many repo markdown files contain ``` blocks
  which would prematurely close the prompt's outer fence and corrupt
  the AI extraction prompt (#525 round 6 P2).
- New pickFence helper is exported and unit-tested; prompt-builder
  test asserts a 5-tick fence is emitted for content with a 4-tick
  run.
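One plausible shape for the fence picker, assuming it scans for the longest backtick run and returns one more tick than that. The name pickFence comes from the commit; the signature and the default minimum of three are assumptions.

```typescript
// Sketch: return a fence made of backticks that is strictly longer than any
// backtick run inside `content`, and never shorter than `minLength`.
function pickFence(content: string, minLength: number = 3): string {
  let longest = 0;
  let current = 0;
  for (let i = 0; i < content.length; i++) {
    if (content[i] === "`") {
      current += 1;
      if (current > longest) longest = current;
    } else {
      current = 0;
    }
  }
  return "`".repeat(Math.max(minLength, longest + 1));
}
```

With this shape, content containing a 4-tick run yields a 5-tick fence, which matches the assertion the commit describes.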
Closes the schema-miss laundering failure mode flagged by the
critic (research/07) and analyst (research/10): bad LLM output now
fails loudly instead of becoming a reproducible-looking verdict.

What's new:
- src/validate.ts: validateExtraction(flags, schema, options) returns
  errors + warnings. Checks: forbidden fields (verdict, assessment,
  conclusion, summary, recommendation, rationale, analysis), unknown
  fields (warning), type mismatches (boolean/number/string/string[]),
  non-finite numbers, disallowed_values violations, prompt-hash
  mismatch (when expectedPromptHash provided — wired up in the next
  commit).
- src/validate-cli.ts: 'npm run validate' surfaces issues per target,
  exits 0 (clean) / 1 (errors) / 2 (missing/unparseable extraction).
- src/report.ts: validates each extraction before evaluating. Refuses
  to render when validation fails. --skip-validate flag for escape.
- prompt-builder imports FORBIDDEN_FIELDS from validate.ts so the
  forbidden list lives in exactly one place.
- 12 new validate tests; suite total 77 passing.

Verified end-to-end: a polluted extraction with verdict+type-mismatch
+unknown-flag is caught by both validate and report.
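To make the gate concrete, here is a heavily simplified sketch of the checks listed above. It is not the real validateExtraction: the issue codes 'forbidden-field', 'unknown-field', 'type-mismatch', and 'non-finite-number', the flat schema shape, and the flag names in the usage note are all assumptions. Only the forbidden-field list is taken from the commit.

```typescript
const FORBIDDEN_FIELDS = [
  "verdict", "assessment", "conclusion", "summary",
  "recommendation", "rationale", "analysis",
];

type FlagType = "boolean" | "number" | "string" | "string[]";
interface Issue { level: "error" | "warning"; code: string; message: string; }

// Classify a runtime value using the schema's type vocabulary.
function observedType(value: unknown): string {
  if (Array.isArray(value)) {
    return value.every((v) => typeof v === "string") ? "string[]" : "array";
  }
  return value === null ? "null" : typeof value;
}

function validateExtraction(flags: Record<string, unknown>, schema: Record<string, FlagType>): Issue[] {
  const issues: Issue[] = [];
  for (const key of Object.keys(flags)) {
    if (key === "__prompt_hash") continue; // binding metadata, not a flag
    if (FORBIDDEN_FIELDS.indexOf(key) !== -1) {
      issues.push({ level: "error", code: "forbidden-field", message: `Field '${key}' is a judgement, not a flag.` });
      continue;
    }
    if (!Object.prototype.hasOwnProperty.call(schema, key)) {
      issues.push({ level: "warning", code: "unknown-field", message: `Flag '${key}' is not in the schema.` });
      continue;
    }
    const value = flags[key];
    if (typeof value === "number" && !Number.isFinite(value)) {
      issues.push({ level: "error", code: "non-finite-number", message: `Flag '${key}' is not a finite number.` });
      continue;
    }
    const expected = schema[key];
    const observed = observedType(value);
    if (observed !== expected) {
      issues.push({ level: "error", code: "type-mismatch", message: `Flag '${key}' expected '${expected}', got ${observed}.` });
    }
  }
  return issues;
}
```

A polluted extraction such as `{ verdict: "ready", ci_passing: "yes", mystery: 1 }` checked against a schema that declares `ci_passing` as boolean would produce two errors and one warning under this sketch.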
…ard)

Closes the stale-extraction failure mode flagged by the critic
(research/07): users edit source files between plan and report and
the old JSON still renders a confident verdict. The report now
refuses extractions produced against a different prompt.

What's new:
- src/prompt-hash.ts: computePromptHash hashes the LOAD-BEARING
  inputs (target id, per-file sha, rule hashes, schema content),
  not the rendered prompt text. Cosmetic edits to the prompt
  template don't invalidate extractions; real source changes do.
- src/plan.ts: emits sidecar prompts/<id>.hash.txt and embeds the
  hash into the prompt as (a) a top-of-file HTML comment, (b) an
  explicit rule asking the LLM to copy it into __prompt_hash, and
  (c) the response template's first key.
- src/validate.ts: enforces expectedPromptHash when provided.
  Surfaces missing-prompt-hash and stale-extraction error codes
  with re-run instructions.
- src/report.ts + src/validate-cli.ts: read the sidecar hash if it
  exists; absence falls back to the pre-binding behaviour for
  backwards-compat with fixtures.
- 6 new prompt-hash tests; suite total 83 passing.

End-to-end verified in this sandbox:
  matching hash    -> exit 0 (ready-to-progress)
  stale hash       -> exit 2 with explicit error
  missing field    -> exit 2 with explicit error
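A minimal sketch of hashing the load-bearing inputs rather than the rendered prompt text. The `PromptInputs` field names and the `canonicalJson` helper are assumptions; the commit only states which inputs are hashed, and an earlier commit mentions canonical JSON plus SHA-256 for provenance.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape covering the four inputs the commit names.
interface PromptInputs {
  targetId: string;
  fileShas: Record<string, string>; // path -> per-file sha
  ruleHashes: string[];
  schemaContent: string;
}

// Serialise with sorted object keys so insertion order cannot change the hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalJson).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k]))
      .join(",");
    return "{" + body + "}";
  }
  return JSON.stringify(value);
}

function computePromptHash(inputs: PromptInputs): string {
  return createHash("sha256").update(canonicalJson(inputs), "utf8").digest("hex");
}
```

Because the prompt template text is not an input, cosmetic template edits leave the hash unchanged, while a changed per-file sha produces a different hash.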
Closes a cluster of small findings from Codex round 7 (cli, context)
and reviewer S2/S3 (engine, loader audit-trail honesty):

- cli.ts: validate JSON root is a plain object before evaluate. Same
  guard as report.ts; previously valid-JSON-but-not-object input would
  crash inside hasOwnProperty.call (#525 round 7 P2).
- context.ts: use lstat instead of stat and skip symlinks entirely.
  Previously a symlink cycle (a/sub/loop -> a/) would recurse until
  stack overflow during plan (#525 round 7 P2).
- engine.ts evaluateCondition: gt/lt against a non-number and regex
  against a non-string now set an explicit reason ('expected number
  for gt, got string') so the audit trail explains *why* the
  condition didn't match. Reviewer S2 — previously these set
  matched=false with no reason.
- engine.ts evaluateWhen: when.not against a missing flag no longer
  silently fires. The inner condition's 'flag missing in extraction'
  reason is preserved through the not clause; the rule fails to
  match rather than inverting absence into success. Reviewer S2.
- loader.ts: duplicate rule ids in a single rule file are rejected
  at load time. Previously a second rule with the same id loaded
  silently and the engine evaluated it independently. Reviewer S3.
- Five new tests: gt-non-number reason, regex-non-string reason,
  not-missing flag, duplicate ids, symlink cycle handling.

Suite: 88/88 passing.
Closes the 'paste the sidecar to bypass staleness' cheat the critic
flagged as the highest-leverage fix in the post-validate workflow.

Previously: report.ts and validate-cli.ts read prompts/<id>.hash.txt
(plain text) and trusted its value. An operator under deadline
pressure could open the sidecar, copy the hash into the extraction's
__prompt_hash field, and silently re-render a stale verdict.

Now: report and validate-cli recompute the hash from current source
files + rules + schema (same code path as plan.ts). The sidecar still
gets written for diagnostic / debugging purposes, but it is never the
authority for whether an extraction is stale. A real change to any
source file invalidates the extraction automatically.

Smoke-tested in this sandbox:
- Source unchanged, paste-the-sidecar cheat -> exit 0 (correct;
  extraction is still valid against current source).
- Source mutated, same paste-the-sidecar cheat -> stale-extraction
  error with both the pasted hash and the recomputed hash printed.
- research/14 (critic): three new failure modes the validate gate
  opened; ranked --skip-validate, sidecar-paste cheat, and
  reproducibility theatre. Highest-leverage fix already landed in
  the previous commit.
- research/15 (sre): CI integration sketch with concrete cost math
  (~$0.56/target, $1,700/month at 20 PRs/day on Opus 4.7) and a
  Day-1/30/90 operational milestones path.
- research/16 (user-researcher): 5-segment JTBD switch interview
  plan with sequencing (mine demand signal first, S1 indie devs
  next, fail fast before S2-S5), full sample script, RAT integration.
- research/12 (reviewer): independent re-review at HEAD. Verdict
  pass-with-findings. S2-1 (docs drift: workflow.md still lists
  validate gate as 'not yet here' despite shipping), S2-2 (sidecar
  deletion bypasses prompt-hash binding entirely), S2-3 (--skip-
  validate is undocumented), and an S3 cluster on test count drift,
  HTML provenance, and type-mismatch error messages.
- Fix two typos caught by CI spell check (typos v1.46.0):
  research/16 'pre-empted' -> 'confirmed' (reads more clearly anyway),
  research/12 'ci_passsing' -> 'ci_passingx' (illustrative typo
  recast to avoid typos-tool false positive).
…grams

Replaces the engine-internals-focused architecture.md with a
comprehensive system view covering:
- System overview (component flowchart)
- User flow (sequence diagram across plan/AI/validate/report)
- Data flow (annotated with data shapes at each seam)
- Engine internals (evaluate algorithm + per-condition + severity picker)
- Validate gate + prompt-hash binding (sequence)
- OODA mapping (Observe/Orient/Decide/Act with stochasticity boundary)
- Module dependency graph (16 src/ modules)
- Why these shapes (design choices + research refs)

Seven Mermaid diagrams. docs/README.md now points to architecture.md
as the start-here entry.
- config.ts: target ids must match /^[A-Za-z0-9][A-Za-z0-9_-]*$/.
  Previously a target id like '../escape' or 'foo/bar' was accepted and
  later interpolated into prompts/<id>.md, extractions/<id>.json,
  reports/<id>.html — at best ENOENT, at worst write outside the
  workspace (#525 round 8 P2).
- context.ts: extract truncateToBytes() that walks back to a UTF-8
  codepoint boundary. Previously slice(0, maxBytes) counted UTF-16
  code units, so CJK / emoji-heavy markdown could emit prompt blocks
  4x the advertised 8 KB cap (#525 round 8 P2).
- 7 new tests cover the three slug rejection paths and the multibyte
  truncation invariant.
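Both fixes can be illustrated in a few lines. The regex is quoted from the commit; the `truncateToBytes` body is a sketch of the described behaviour (walk back to a UTF-8 code point boundary), not the code in context.ts.

```typescript
// Pattern quoted from the commit: rejects ids such as '../escape' or 'foo/bar'.
const TARGET_ID = /^[A-Za-z0-9][A-Za-z0-9_-]*$/;

// Sketch: cap the UTF-8 byte length without splitting a multibyte code point.
function truncateToBytes(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let end = Math.max(0, maxBytes);
  // UTF-8 continuation bytes have the form 10xxxxxx. If the first excluded byte
  // is one, the cut is mid code point, so step back to that code point's start.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end -= 1;
  return new TextDecoder().decode(bytes.subarray(0, end));
}
```

Counting encoded bytes instead of UTF-16 code units is what keeps CJK or emoji-heavy content within the advertised cap.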

Suite: 95/95 passing.
Three implementer subagents ran in parallel; each verified with
npm test + tsc + typos before reporting. 98/98 tests passing.

Agent A — Safety: sidecar refusal + --skip-validate warning
- report.ts + validate-cli.ts: prompt-hash binding now triggers on
  prompts/<id>.md (the prompt file) existing, NOT on the sidecar
  prompts/<id>.hash.txt. Deleting the sidecar can no longer bypass
  the staleness check (reviewer research/12 S2-2).
- report.ts: --skip-validate now prints a loud stderr warning per
  target ('validation gate disabled. This is for debugging only.')
  closing reviewer S2-3 silent-flag finding.

Agent B — validate.ts polish
- Type-mismatch errors now include the observed value via
  formatObserved() with an 80-char cap and ellipsis (reviewer S3).
  Example: "Flag 'X' expected 'boolean', got string (\"yes\")."
- null flag values now warn with code 'null-value-omit-instead'
  rather than being silently accepted as 'unknown'. Engine semantics
  unchanged (null still treated as missing); validate just surfaces
  the discrepancy with the prompt's 'omit unknowns' instruction.

Agent C — HTML report v2
- RenderContext gains an optional promptHash field. When set, the
  Provenance section shows the 12-char prefix; when the extraction's
  __prompt_hash matches, a 'verified' badge appears.
- Audit-trail rows with reason='flag missing in extraction' now use
  a distinct cond--missing CSS class (yellow/warning palette) to
  differentiate from cond--miss (red/error). UX research/09 finding.
Agent A — integration tests (was missed from 4e54c0e):
- test/report-flow.test.ts: 8 spawnSync integration tests covering
  prompt-extraction binding (sidecar deletion, missing __prompt_hash,
  stale hash, fixture flow with no prompt file, parity for both CLIs)
  and --skip-validate stderr warning. Brings suite to 106 tests.

Agent D — docs drift sync:
- docs/workflow.md: rewrote "What's not yet here" (validate gate and
  stale-extraction detection BOTH shipped — now lists API extractor,
  rule governance, fairness audit, drift dashboards). Added the
  __prompt_hash paragraph in "Paste into an AI tool" + a new
  "--skip-validate flag" subsection (debugging-only, never-in-CI).
- README.md: test count 60 -> 100+, file map updated to include
  prompt-hash / validate / validate-cli modules, research table
  extended from 10 to 16 artifacts.
- docs/README.md: "five briefs" -> "16 research artifacts" with the
  expanded angle list.

Suite: 106/106 passing.
…dex round 9 P1)

When the prompt file exists but collectFiles() throws (target paths
deleted / renamed / unreadable), the previous behaviour was to set
expectedPromptHash = undefined and continue — silently disabling the
stale-extraction check. A renamed source folder could then let report
render an old extraction as if it were current.

Now both report.ts and validate-cli.ts fail closed: print an explicit
error and skip the target with exit code 2. The integrity invariant
('an extraction is checked against the current source') is preserved.

Backwards-compat preserved for the fixture / single-shot flow: when
prompts/<id>.md doesn't exist, no hash check is attempted, and the
catch path above doesn't fire.

Suite: 106/106 still passing (no test exercised the silent-downgrade
path).
… (Codex round 10)

- config.ts: target-id duplicate check now case-insensitive. Target
  ids are interpolated into <id>.md / <id>.json / <id>.html
  filenames, and default macOS / Windows filesystems are
  case-insensitive — so 'Alpha' and 'alpha' would collide on disk
  without a config error and one target's artifacts would overwrite
  another's silently (#525 round 10 P2).
- prompt-builder.ts: __prompt_hash is now actually the FIRST key in
  the response template, not just claimed-first-then-appended-last.
  Object.fromEntries used to insert the schema keys before the
  promptHash assignment; JS object key order follows insertion for
  string keys, so the hash was rendered last. Now we build the
  object with the hash first, then loop the schema (#525 round 10 P3).
- Two new tests cover both behaviours.
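The two behaviours can be sketched as below. Both function names and the `null` placeholder values are hypothetical; the commit only says that the hash key is assigned first and that duplicate detection ignores case.

```typescript
// Sketch: return the first pair of ids that collide when case is ignored.
function findCaseCollision(ids: string[]): [string, string] | undefined {
  const seen = new Map<string, string>();
  for (const id of ids) {
    const key = id.toLowerCase();
    const earlier = seen.get(key);
    if (earlier !== undefined) return [earlier, id];
    seen.set(key, id);
  }
  return undefined;
}

// Sketch: assign the hash before looping the schema so it is the first key.
// Non-integer string keys keep insertion order in JS objects.
function buildResponseTemplate(promptHash: string, schemaKeys: string[]): Record<string, unknown> {
  const template: Record<string, unknown> = { __prompt_hash: promptHash };
  for (const key of schemaKeys) {
    template[key] = null; // placeholder value, an assumption for this example
  }
  return template;
}
```

One caveat worth keeping in mind: integer-like keys are ordered before string keys regardless of insertion, so this only holds while flag names are not purely numeric.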

Note: src/cli-shared.ts and src/validate-cli.ts are dirty in the
worktree from an in-flight CLI scaffolding refactor; those changes
will land separately when the agent reports back.
Three CLIs (plan, validate, report) previously duplicated argv
parsing, config + rules + schema loading, target filtering, extraction
IO, prompt-hash recompute, and exit-code handling. The duplication
was the documented cause of two repeated bugs: Codex caught the same
JSON-root-validation defect in report.ts (round 6) and cli.ts (round
7), and the fail-closed fix for unrecomputable prompt hashes (round 9
P1) had to be applied to both report.ts and validate-cli.ts.

What changed:
- src/cli-shared.ts (new, 270 LOC): exports takeOpt / takeFlag /
  parseStandardArgs, loadCliBaseContext, selectTargets, plus a
  discriminated-union loadExtractionForTarget that returns
  { kind: 'ok' | 'missing' | 'invalid-json' | 'non-object' |
  'hash-unrecomputable' } and a logExtractionError formatter that
  preserves the existing stderr text byte-for-byte (test/report-flow
  asserts those strings).
- src/plan.ts: 90 LOC -> 79 LOC. Uses parseStandardArgs +
  loadCliBaseContext + selectTargets. Schema coverage diff stays here.
- src/report.ts: 210 LOC -> 120 LOC. Per-target handler is now
  evaluate -> validate -> render HTML; the defensive-IO scaffolding
  is gone. --skip-validate and --no-open remain command-specific.
- src/validate-cli.ts: 110 LOC -> 62 LOC. Per-target handler is now
  just validateExtraction + log results + Summary line.
- test/cli-shared.test.ts (new): 14 unit tests for takeOpt/takeFlag
  argv mutation, parseStandardArgs, selectTargets filtering, and
  loadExtractionForTarget across all five discriminated-union cases.
- src/cli.ts, src/engine.ts, src/loader.ts, src/validate.ts, and the
  prompt + html-report layers are untouched.

Verified: 123/123 tests passing, tsc clean, typos clean. All eight
report-flow integration tests still pass — message text and exit
codes preserved.
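The discriminated-union idea can be shown without any file IO. The five `kind` values are taken from the commit; the payload fields, the `classifyExtraction` name, and the messages are assumptions, and the real loadExtractionForTarget also reads files and recomputes the prompt hash.

```typescript
type ExtractionResult =
  | { kind: "ok"; flags: Record<string, unknown> }
  | { kind: "missing" }
  | { kind: "invalid-json"; message: string }
  | { kind: "non-object"; observed: string }
  | { kind: "hash-unrecomputable"; message: string };

// Sketch: classify raw file contents (undefined stands for 'file not found').
function classifyExtraction(raw: string | undefined): ExtractionResult {
  if (raw === undefined) return { kind: "missing" };
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return { kind: "invalid-json", message: err instanceof Error ? err.message : String(err) };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    const observed = parsed === null ? "null" : Array.isArray(parsed) ? "array" : typeof parsed;
    return { kind: "non-object", observed };
  }
  return { kind: "ok", flags: parsed as Record<string, unknown> };
}

// The switch is exhaustive: adding a new kind makes the compiler flag this function.
function describeResult(result: ExtractionResult): string {
  switch (result.kind) {
    case "ok": return "extraction loaded";
    case "missing": return "no extraction file found";
    case "invalid-json": return `extraction is not valid JSON: ${result.message}`;
    case "non-object": return `extraction root must be an object, got ${result.observed}`;
    case "hash-unrecomputable": return `cannot recompute prompt hash: ${result.message}`;
  }
}
```

Centralising the classification is what lets each CLI handle every failure case once instead of re-implementing the same guards.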
Three rendered reports (ready, blocked, needs-attention) for the agents to inspect when reviewing report readability.
The sample reports under
experiments/rule-engine-poc/research/sample-reports/ embed 12-char
rule content-hash prefixes that randomly trip typos rules (e.g.,
'afe...' -> 'safe'). Excluding the folder is consistent with the
existing pattern that allow-lists specific commit-SHA fragments.

Also commits research/17 (ux-designer pass on the rendered HTML
reports) — top finding is that the audit trail buries the matched
rules among ~21 'did not match' siblings; recommended a 'What fired'
section + collapse-by-default for skipped rules.
Product-strategist pass on the HTML report as a downstream-shared
artifact. Three findings:
- The report is one artifact serving six first-fields (PR reviewer,
  PM, EM, QA, compliance, auditor). Recommend one HTML with
  re-stacked sections rather than reader-specific exports — keep
  the 'one artifact, many destinations' moat.
- Highest-leverage change: expand action slugs ('kick-ci',
  'request-reviewer') to human sentences via an actions[].human
  field on the rule schema. Promote the 'verified' prompt-hash
  badge next to the verdict.
- Introduce label_set config (default 'dev'; 'pm', 'qa',
  'compliance' as presentational overrides) so headline labels
  match the reader's vocabulary.
Brand-reviewer pass on the rendered HTML report. Verdict:
pass-with-findings; not S1-blocking while the POC stays under
experiments/, but would block on the promotion-to-skill step
flagged in research/13.

Findings:
- On-temperament (no emoji / gradients / icons; ASCII [+]/[-]/[?]
  markers are correctly monospace-as-iconography; restrained density).
- Off-token: 18 distinct literal hex values, literal -apple-system /
  SFMono-Regular font stacks, page background near-white instead of
  Specorator cream var(--paper).
- Voice close but section headers are bare labels rather than
  sentence-case-with-period declaratives; 'Suggested actions' is
  passive against Specorator's imperative voice.
- Open decision: Specorator has no red token. blocked tier currently
  uses literal #fdecea / #d8281b / #7a160d. ADR-shaped choice before
  graduation: extend colors_and_type.css, repurpose --soft-orange and
  rename the tier 'at-risk', or stay literal until packaged.
…h/21)

Critic pass on the rendered HTML report as a communication artifact.
Three findings:

- Visual hierarchy contradicts semantic model: severity-first is
  invisible in the weighted-tally widget (reads as a horse race),
  alphabetically-sorted action list silently asserts a priority the
  engine refuses to give, cond--missing vs cond--miss are visually
  distinguished but never named (colour-blind readers lose the signal).
- 'verified' badge is a trust-calibration trap — green pill reads as
  'extraction verified' to an auditor when it only means 'bound to
  current inputs'. Compounded by --skip-validate runs producing HTML
  indistinguishable from validated ones (research/14 risk 1 leaks
  into the report layer).
- Most dangerous skim path: blocker-by-absence. A high-priority
  blocker rule whose input flag is missing from the extraction simply
  doesn't fire; neither verdict tile nor any header-level summary
  tells the reader 'N higher-priority rules were un-evaluable'.
- 3 RATs proposed (verdict-tile-alone, action-list-as-priority,
  'verified' interpretation). Default no-go if any fail.
Analyst pass on the HTML report from a regulator's reading perspective + 2026 benchmark against LangSmith / Inspect / W&B Weave / sklearn / model-card conventions. Closes the open item from research/02 (human-readable rationale presentation).
New sidecar mapping action slugs to imperative human sentences so
the HTML report can render readable guidance instead of bare slugs.

- rules/action-glossary.yaml: 28 entries covering every action used
  in rules/quality-gates.yaml, with optional urgency + category
  metadata. Imperative voice per Specorator brand.
- src/action-glossary.ts: loader + diff-coverage helper, mirroring
  src/flag-schema.ts conventions.

Wiring into config.ts and the HTML renderer happens in subsequent
commits when Agents A2/B finish their slices. 123/123 tests still
passing — no behaviour change yet.
… A complete)

Agent A's RALPH loop completed. Action glossary is now reachable via
the config (still optional — no behaviour change for callers that
don't set actionGlossary):

- src/config.ts: optional 'actionGlossary' string field on RawConfig
  resolved to 'actionGlossaryPath' on ResolvedConfig, same pattern as
  flagSchema.
- rule-engine.config.json: points at rules/action-glossary.yaml.
- test/action-glossary.test.ts: 12 tests covering loader validation,
  diff-coverage, real-file coverage of rules/quality-gates.yaml, and
  sentence-shape invariants.

Also picks up Agent C's in-flight architecture.md updates: system
overview + data flow diagrams now show the glossary node (read only
by the renderer) and the new HTML report sections (system-identity
header, what fired, reproduce block, audit trail with non-matched
collapsed). The renderer itself (Agent B) is still in flight.

Suite: 135/135 passing (123 baseline + 12 new). 21 unique action
slugs in rules/quality-gates.yaml — all mapped in the glossary.
claude and others added 12 commits May 17, 2026 13:08
…xtending

- audit-trail.md: Mapping to EU AI Act table updated to credit the
  HTML report (what-fired with human sentences, system-identity
  header, tier glossary, reproduce block) as the Art. 13
  human-readable rationale surface. Closes research/02 open item
  about explainability presentation.
- workflow.md: still in flight by agent C — current commit picks
  up partial edits.
- extending.md: new 'Authoring action human sentences' section
  documents rules/action-glossary.yaml as a render-only sidecar
  (engine never reads it; editing sentences cannot change a verdict).
Agent B (HTML report rebuild) and Agent C (README sync) are still
running their RALPH loops. This commit snapshots the current
on-disk state so the working tree stays clean between iterations:

- src/html-report.ts: partial changes from agent B (rebuild for
  research wave 4 findings). 135/135 tests still passing — the
  partial state is internally consistent even if not yet feature
  complete.
- src/report.ts: corresponding plumbing changes from agent B.
- README.md: agent C in-flight test-count + file-map sync.

Will be superseded by the next commit when both agents report
final.
…t B complete)

Agent B's RALPH loop completed. Twelve convergent findings from
research wave 4 now realised in the renderer:

1. 'What fired' section above the full audit trail (UX/17 + critic/21 +
   auditor/20). Verdict-card stats line now reads 'N rule(s) fired ·
   M action(s) to take'.
2. Non-matched rules collapsed via <details class='rule-collapsed'>;
   matched rules stay inline (UX/17).
3. Blocker-by-absence banner adjacent to the verdict card when any
   rule's condition reports 'flag missing in extraction' (critic/21 +
   UX/17). Yellow palette, names the missing flags.
4. Suggested actions now sorted by priority-of-cause (walk
   evaluations in priority-desc order, dedup preserving first-seen)
   instead of alphabetic. result.actions unchanged for machine
   consumers (UX/17).
5. Action human-sentence rendering via rules/action-glossary.yaml;
   falls back to bare slug when entry missing (stakeholder/18).
6. Provenance section: preamble explaining the hashes + 'How to
   reproduce' block + 12-char hash truncation (UX/17 + auditor/20 +
   stakeholder/18).
7. System-identity header above the verdict card: engine version +
   prominent timestamp moved out of the footer (auditor/20).
8. Verdict-tier glossary + [+]/[-]/[?] glyph legend in a collapsed
   <details class='legend'> block (auditor/20 + UX/17).
9. cond--miss now has a faint red row-wash matching cond--missing's
   amber, so the visual distinction isn't glyph-color-only (UX/17).
10. @media (max-width: 540px) single-column fallback (UX/17).
11. Trust calibration: --skip-validate banner shown prominently when
    validationStatus='skipped'; verified-badge tooltip explains it
    only means 'extraction is bound to current inputs', not 'flags
    are correct' (stakeholder/18 + critic/21).
12. Section headers in sentence-case with periods, imperative voice:
    'Take these actions.' not 'Suggested actions' (brand/19).

28 new tests in test/html-report.test.ts; suite total 163/163.
Three sample reports regenerated under research/sample-reports/ so
reviewers see the new shape.
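Item 4's ordering can be sketched as follows, assuming a simplified `Evaluation` shape and assuming only matched rules contribute actions. The slugs 'kick-ci' and 'request-reviewer' appear in research/18; everything else here is illustrative.

```typescript
// Hypothetical, trimmed evaluation record.
interface Evaluation {
  ruleId: string;
  priority: number;
  matched: boolean;
  actions: string[];
}

// Walk matched evaluations from highest to lowest priority and keep the first
// occurrence of each action, so the list reflects priority of cause.
function orderActionsByCause(evaluations: Evaluation[]): string[] {
  const byPriority = evaluations
    .filter((e) => e.matched)
    .slice() // copy so the caller's array is never reordered
    .sort((a, b) => b.priority - a.priority);
  const seen = new Set<string>();
  const ordered: string[] = [];
  for (const evaluation of byPriority) {
    for (const action of evaluation.actions) {
      if (!seen.has(action)) {
        seen.add(action);
        ordered.push(action);
      }
    }
  }
  return ordered;
}
```

Returning a new list leaves the machine-facing result untouched, in line with the note that result.actions is unchanged.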
…x round 11)

- open-browser.ts: success now requires the opener process to exit
  with code 0 (or null, signal-terminated) — not just spawn. On Linux
  headless / CI, xdg-open spawns successfully and then immediately
  exits non-zero because no browser handler is registered; previously
  report.ts printed 'opened in browser' even though the file was
  never opened. Safety-net timeout bumped to 1s for opener daemons
  that never deliver an exit event (#525 round 11 P2).
- cli-shared.ts: takeOpt now throws 'Missing value for option <flag>'
  when the flag is the last argv entry or is followed by another
  flag. Previously 'npm run report -- --target' would silently fall
  back to 'all targets' and 'npm run report -- --target --quiet'
  would interpret '--quiet' as the target id (#525 round 11 P2).
- Two new takeOpt tests cover the missing-value rejection path.
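
A minimal sketch of the missing-value check described for takeOpt, assuming flags are recognised by a leading `--` and that the helper only reads argv. The body below is a hypothetical re-creation of the described contract, not the source of cli-shared.ts.

```typescript
// Hypothetical re-creation of the described behaviour.
function takeOpt(argv: string[], flag: string): string | undefined {
  const index = argv.indexOf(flag);
  if (index === -1) return undefined; // flag absent: caller applies its default
  const value = argv[index + 1];
  // Missing value: the flag is the last entry, or the next token is another flag.
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`Missing value for option ${flag}`);
  }
  return value;
}
```

With this shape, `--target --quiet` is rejected instead of treating `--quiet` as the target id.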

Suite: 165/165 passing.

…odex round 12)

- html-report.ts missingFlagNames: only count rules whose final
  outcome was determined by the missing flag (matched === false).
  Previously a 'when.any' rule with one matched branch + one missing
  branch counted as un-evaluable in the banner, even though it
  contributed to the verdict (#525 round 12 P2).
- html-report.ts reproCmd: paths are now single-quoted via a small
  shellQuote helper that escapes an embedded ' as the standard
  four-character sequence '\''. Paths with spaces (e.g.,
  'My Projects/rules.yaml') no longer break the copy-pasted
  reproduce command (#525 round 12 P2).
- Two new html-report tests: when.any-with-missing-branch is NOT
  counted in the banner, and reproCmd contains HTML-escaped quoted
  paths.
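
The banner filter from the first bullet might look like the sketch below. The `ConditionResult` and `RuleEvaluation` shapes and the `status` values are assumptions made for illustration; only the `matched === false` gate comes from the commit message.

```typescript
// Hypothetical shapes for illustration.
interface ConditionResult {
  flag: string;
  status: "hit" | "miss" | "missing"; // "missing" = flag absent from the extraction
}

interface RuleEvaluation {
  ruleId: string;
  matched: boolean;
  conditions: ConditionResult[];
}

// Collect flags to name in the blocker-by-absence banner.
function missingFlagNames(evaluations: RuleEvaluation[]): string[] {
  const names = new Set<string>();
  for (const evaluation of evaluations) {
    // A when.any rule that matched through another branch still
    // contributed to the verdict, so its missing branch is not counted.
    if (evaluation.matched) continue;
    for (const condition of evaluation.conditions) {
      if (condition.status === "missing") names.add(condition.flag);
    }
  }
  return Array.from(names).sort();
}
```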

Suite: 167/167 passing.

…odex round 13)

src/cli.ts has its own takeOption() (the single-shot fixture flow
doesn't use src/cli-shared.ts). Same bug as round 11 P2 in cli-shared:
when --html had no value the helper returned undefined and the CLI
silently proceeded with no HTML output, breaking automation that
relies on the artifact being written.

Now fail fast with a clear stderr message and exit code 2 when the
option is the last argv entry or is followed by another flag.

Suite: 167/167 still passing (no test exercised the silent-skip path).

… 14)

- loader.ts: every entry of 'then.actions' must be a non-empty string
  slug. Previously numbers / objects / empty strings passed load-time
  validation and flowed into the HTML reporter as unrecognised tokens
  that couldn't map to a glossary entry, breaking the remediation
  guidance the verdict is meant to provide (#525 round 14 P2).
- loader.ts: 'gt' and 'lt' now reject NaN and Infinity at load time.
  Both have type 'number' but silently corrupt comparisons at
  runtime (any comparison against NaN is false), so a typo could
  make a gating rule unexpectedly never fire (#525 round 14 P2).
- Four new loader tests cover non-string action elements,
  empty-string action elements, NaN gt, and Infinity lt.
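
The two load-time checks above could be expressed along these lines. The function names, parameter shapes, and error wording are hypothetical; the sketch only assumes that `Number.isFinite` is an acceptable way to reject both NaN and Infinity.

```typescript
// Hypothetical helper: every 'then.actions' entry must be a non-empty string slug.
function assertActionSlugs(ruleId: string, actions: unknown): string[] {
  if (!Array.isArray(actions)) {
    throw new Error(`Rule '${ruleId}': 'then.actions' must be an array`);
  }
  return actions.map((entry: unknown, index: number): string => {
    if (typeof entry !== "string" || entry.length === 0) {
      throw new Error(`Rule '${ruleId}': 'then.actions[${index}]' must be a non-empty string`);
    }
    return entry;
  });
}

// Hypothetical helper: 'gt' / 'lt' operands must be finite numbers.
function assertFiniteOperand(ruleId: string, op: "gt" | "lt", value: unknown): number {
  // typeof NaN and typeof Infinity are both "number", so typeof alone is not enough.
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new Error(`Rule '${ruleId}': '${op}' requires a finite number`);
  }
  return value;
}
```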

Suite: 171/171 passing (167 + 4).

New docs/compliance.md walks the standards/regulations an adopter is
most likely to be asked about in 2026 and catalogues what the POC
contributes vs what stays the adopter's job. Synthesises the two
prior research passes (research/02 regulatory + research/20 auditor
readability) into reference material for scoping conversations.

Covered:
- EU AI Act Art. 11-14 + Art. 72 with a per-article table.
- ISO/IEC 42001 AIMS clauses 6-10.
- ISO/IEC 23894 AI risk management.
- NIST AI RMF Govern / Map / Measure / Manage.
- GDPR Art. 22 (when it applies vs when it doesn't).
- OECD AI Principles (1-paragraph summary).

Plus:
- 'What the POC ticks natively' — per-artifact provenance.
- 'What is NOT in this POC' — honest gap analysis.
- Maturity checklist before production with rough effort estimates.

Leads with a disclaimer that this is engineering reference material,
not legal advice or certification. docs/README.md updated to index
the new doc.

…the HTML report

Synthesises the five wave-4 research artifacts (research/17-21) plus
the relevant sections of architecture.md, workflow.md, and
audit-trail.md into one place for someone who wants to understand the
report end-to-end without chasing across files.

Sections:
- What the report is + the three committed sample renders
- Section-by-section walkthrough mapped to each section's research source
- The five perspectives that shaped the v3 rebuild (UX, stakeholder,
  brand, auditor, critic) with each agent's top-line finding
- The 12 wave-4 changes + the Codex round 11-14 hardenings that
  landed on top
- What is still open, bucketed (strategy slice / governance / ADR /
  production prep / discovery RATs)
- How to generate one
- How to read one (the 4-step skim path)

docs/README.md indexes it.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d203b154a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

import { join, basename } from "node:path";
import { spawnSync } from "node:child_process";

const fixturesDir = new URL("../fixtures/", import.meta.url).pathname;

P2 Badge Convert file URL to fs path before reading fixtures

new URL("../fixtures/", import.meta.url).pathname is not a safe filesystem path: it keeps percent-encoding (e.g. %20) and is malformed on Windows drive paths, so readdirSync(fixturesDir) can fail with ENOENT when the repo lives in a path with spaces or on Windows. Use fileURLToPath(new URL(...)) before passing it to fs APIs.
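
The difference between `.pathname` and `fileURLToPath` can be seen in a small sketch. The directory name "my repo" and the use of the OS temp directory are illustrative assumptions; nothing is read from disk.

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";
import { fileURLToPath, pathToFileURL } from "node:url";

// An illustrative directory whose name contains a space.
const dirWithSpace = join(tmpdir(), "my repo", "fixtures");
const fixtureUrl = pathToFileURL(dirWithSpace);

// .pathname keeps the percent-encoding, so fs APIs would look for "my%20repo".
const rawPathname = fixtureUrl.pathname;

// fileURLToPath decodes the URL into a platform-appropriate filesystem path.
const fsPath = fileURLToPath(fixtureUrl);

console.log(rawPathname);
console.log(fsPath);
```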

Comment on lines +137 to +139
if (typeof rule.priority !== "number") {
  throw new Error(`Rule '${rule.id}' missing numeric 'priority'`);
}

P2 Badge Reject non-finite rule priorities at load time

priority is only checked with typeof === "number", so YAML values like .nan are accepted. In evaluate, sorting then does b.priority - a.priority; with NaN that comparator returns NaN (treated like 0), which silently breaks the documented priority desc, id asc ordering and can reorder audit/action output unpredictably. This loader should reject non-finite priorities the same way it already rejects non-finite numeric operators and weights.
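
A sketch of the documented ordering with the suggested guard in front of it. The `PrioritisedRule` shape and the `sortRules` name are hypothetical; the comparator follows the `priority desc, id asc` ordering quoted above.

```typescript
// Hypothetical shape for illustration.
interface PrioritisedRule {
  id: string;
  priority: number;
}

// Sort by priority desc, then id asc, refusing non-finite priorities up front.
function sortRules(rules: PrioritisedRule[]): PrioritisedRule[] {
  for (const rule of rules) {
    // typeof NaN === "number", so a typeof check alone lets .nan through.
    if (!Number.isFinite(rule.priority)) {
      throw new Error(`Rule '${rule.id}' has non-finite 'priority'`);
    }
  }
  return [...rules].sort((a, b) => {
    const byPriority = b.priority - a.priority; // would be NaN if either side were NaN
    if (byPriority !== 0) return byPriority;
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  });
}
```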

- scripts/run-all-html.mjs + scripts/run-all-fixtures.mjs: use
  fileURLToPath(new URL(...)) instead of .pathname. The bare
  .pathname keeps percent-encoding (paths with spaces break) and
  produces malformed Windows drive paths. Same fix in both scripts
  since they share the import.meta.url -> fixturesDir pattern
  (#526 P2).
- loader.ts: reject non-finite 'priority' (NaN, Infinity) at load
  time. Same fail-fast discipline as weight + gt + lt. NaN priorities
  silently break the sort comparator (b.priority - a.priority returns
  NaN, treated as 0), reordering the audit trail unpredictably
  (#526 P2).
- Two new loader tests cover .nan and .inf priority rejection.

Suite: 173/173 passing.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a09bee96e1

Comment on lines +106 to +115
if (value === null) {
  warnings.push({
    severity: "warning",
    code: "null-value-omit-instead",
    path: key,
    message:
      `Flag '${key}' is null; prefer omitting unknowns over emitting null. ` +
      `The engine will treat null and missing identically.`,
  });
  continue;

P1 Badge Treat null extraction values as validation errors

The null special-case downgrades schema violations to a warning and skips type enforcement, so report/validate can pass even when the extraction is not type-correct. This can change verdicts silently: the engine treats null as a present flag (hasOwnProperty), so rules using presence-sensitive logic (for example exists or ne) evaluate differently than if the flag were omitted, despite the validator message claiming null is equivalent to missing. In practice, a model emitting {"some_flag": null} can produce an accepted but semantically different decision path.

… (Codex round 16)

PR #526 Codex P1: the previous null handling was a warning with the
message 'the engine will treat null and missing identically' — but
that claim was false. The engine's hasOwnProperty presence check
treats {flag: null} as PRESENT, so rules using 'exists' or 'ne'
evaluate differently against {flag: null} than against {} despite
the validator's reassurance.

Two ways to make the claim true: (a) refuse null at the gate, or
(b) make the engine treat null as missing. (a) is simpler and
preserves the LLM prompt's 'omit unknowns' contract. Engine
semantics stay unchanged; null just never reaches the engine on
the workflow path. Single-shot cli.ts bypasses validate but its
fixtures don't use null.

- src/validate.ts: null flag values now produce a hard error
  (code: 'null-value-not-allowed') with a clear message explaining
  why null is not 'missing-equivalent'.
- test/validate.test.ts: updated existing test to expect the error
  shape.
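
Why null is not equivalent to a missing flag, and what refusing it at the gate might look like, in a short sketch. The `ValidationIssue` shape, the helper names, and the message wording are assumptions; only the own-property presence check and the 'null-value-not-allowed' code come from the text above.

```typescript
// Hypothetical issue shape for illustration.
interface ValidationIssue {
  severity: "error" | "warning";
  code: string;
  path: string;
  message: string;
}

// Own-property presence check: a key holding null still counts as present.
function isPresent(flags: Record<string, unknown>, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(flags, key);
}

// Refuse null values at the validation gate.
function checkNulls(flags: Record<string, unknown>): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const [key, value] of Object.entries(flags)) {
    if (value === null) {
      issues.push({
        severity: "error",
        code: "null-value-not-allowed",
        path: key,
        message: `Flag '${key}' is null; omit unknown flags instead of emitting null.`,
      });
    }
  }
  return issues;
}
```

Under these assumptions `isPresent({ some_flag: null }, "some_flag")` is true while `isPresent({}, "some_flag")` is false, which is the divergence that presence-sensitive operators such as exists would observe.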

Suite: 173/173 passing. Single test renamed; no count change.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90f58716f5

Comment on lines +276 to +277
const shellQuote = (s: string): string => `'${s.replace(/'/g, "'\\''")}'`;
const reproCmd = `npx tsx src/cli.ts ${shellQuote(ctx.rulesPath)} ${shellQuote(ctx.flagsPath)} --html <out.html> --quiet`;

P2 Badge Render Windows-safe repro command in HTML report

The generated reproCmd always uses POSIX single-quote escaping, which is not valid quoting in cmd.exe and is interpreted differently in PowerShell. On Windows (especially when rulesPath or flagsPath contains spaces), the report’s suggested reproduction command can fail even though evaluation succeeded, undermining the “How to reproduce” path for auditors/operators on supported Windows environments.

PR #526 round 17 P2: the reproCmd block emitted only POSIX
single-quote-escaped paths. cmd.exe doesn't recognise single quotes
as path delimiters, and PowerShell interprets them differently from
bash — so an auditor on Windows copying the 'How to reproduce'
command got a syntax error even when the report itself was valid.

Now render both flavours side-by-side in the provenance section:
- POSIX (macOS / Linux / WSL / Git Bash): single-quote escape,
  embedded ' becomes '\''.
- Windows (cmd.exe / PowerShell): double-quote escape, embedded "
  becomes "" (Windows escape convention).

The reader picks the right one. Both are labelled.

Updated the existing reproCmd test to assert both flavours appear
and that the POSIX/Windows labels are present.

Suite: 173/173 passing.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4929c08d5f

Comment on lines +279 to +280
const reproCmdPosix = `npx tsx src/cli.ts ${posixQuote(ctx.rulesPath)} ${posixQuote(ctx.flagsPath)} --html <out.html> --quiet`;
const reproCmdWindows = `npx tsx src/cli.ts ${windowsQuote(ctx.rulesPath)} ${windowsQuote(ctx.flagsPath)} --html <out.html> --quiet`;

P2 Badge Remove shell-redirection token from repro command

The generated reproduction command embeds --html <out.html>, but <...> is parsed by both POSIX shells and cmd.exe/PowerShell as input redirection, not a placeholder path. If an auditor copy-pastes this command as instructed, --html receives no value and the command fails (or behaves unexpectedly), so the report’s primary replay path is broken across environments. Use a literal sample filename (e.g. out.html) or quote the placeholder text so it is not treated as shell syntax.

claude added 2 commits May 17, 2026 14:47
…round 18)

PR #526 round 18 P2: the reproCmd block printed '--html <out.html>',
intending the angle brackets as a placeholder, but POSIX shells and
cmd.exe / PowerShell parse <...> as input redirection. A copy-paste
would silently send no value to --html and the command would fail.

Replaced with a literal 'out.html' filename. The reader edits it if
they want a different path, but the copy-paste path works on every
shell without modification.

Suite: 173/173 still passing (existing repro tests check for the
quoted paths, not the filename — both flavours intact).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51a3356ccf

Comment on lines +278 to +284
const windowsQuote = (s: string): string => `"${s.replace(/"/g, '""')}"`;
// Use a literal filename, NOT `<out.html>`: angle brackets are shell
// I/O redirection on both POSIX and cmd.exe / PowerShell, so a
// copy-paste would silently consume the placeholder as a redirection
// target and `--html` would receive no value (Codex round 18 P2).
const reproCmdPosix = `npx tsx src/cli.ts ${posixQuote(ctx.rulesPath)} ${posixQuote(ctx.flagsPath)} --html out.html --quiet`;
const reproCmdWindows = `npx tsx src/cli.ts ${windowsQuote(ctx.rulesPath)} ${windowsQuote(ctx.flagsPath)} --html out.html --quiet`;

P2 Badge Emit a PowerShell-literal repro command

The new Windows repro command is labeled for both cmd.exe and PowerShell, but it always wraps paths in double quotes (windowsQuote). In PowerShell, double-quoted strings are expandable, so path segments containing $... or $() are interpreted instead of passed literally, which can mutate the path or break copy-paste reproduction for affected checkouts. This is a regression from the prior single-quote style for PowerShell; emit a dedicated PowerShell form (single-quoted with escaped ') or escape PowerShell expansions before rendering.

Round 17's Windows form used double quotes for both cmd.exe and
PowerShell, but PowerShell double-quoted strings expand $var and
$(...). A path like 'src/$something/x.json' would be interpreted
in PowerShell, a regression vs the prior POSIX form, which used
single quotes.

Split Windows into two flavours:
- cmd.exe: double-quote escape (" -> ""). cmd doesn't expand $.
- PowerShell: single-quote escape (' -> ''). Single quotes suppress
  PowerShell expansion.

The HTML provenance block now renders three labelled forms instead
of two. POSIX still uses POSIX-style single-quote escape ('\'').

One new test asserts the PowerShell block uses single quotes; the
existing repro test updated to match three-form layout.
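
The three quoting flavours described in this commit can be sketched as three small helpers. `posixQuote` mirrors the helper name in the reviewed diff; `cmdQuote` and `powershellQuote` are hypothetical names chosen for this illustration, and the real html-report.ts may structure this differently.

```typescript
// POSIX sh / bash: wrap in single quotes; an embedded ' becomes '\''.
const posixQuote = (s: string): string => `'${s.replace(/'/g, "'\\''")}'`;

// cmd.exe: wrap in double quotes; an embedded " is doubled.
const cmdQuote = (s: string): string => `"${s.replace(/"/g, '""')}"`;

// PowerShell: wrap in single quotes so $var and $(...) are not expanded;
// an embedded ' is doubled.
const powershellQuote = (s: string): string => `'${s.replace(/'/g, "''")}'`;

console.log(posixQuote("My Projects/rules.yaml"));
console.log(cmdQuote("My Projects/rules.yaml"));
console.log(powershellQuote("src/$something/x.json"));
```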

Suite: 174/174 passing.