docs(rule-engine-poc): single-page HTML report reference #526
Terminal-only TypeScript POC of the "LLM extracts, rules decide" pattern from the AI fact-checking community: the LLM is constrained to producing structured flags from raw signals, and a deterministic rule engine maps those flags to a verdict tier with a fully replayable audit trail. Lives under experiments/rule-engine-poc/ as a sandbox (not formal Stage 1-7) and demonstrates the pattern against the repo's own quality framework. Each rule encodes a Definition of Done item from docs/quality-framework.md.
What's included:
- src/ - hand-rolled engine (~250 LOC): types, hash, engine, loader, cli, html-report. Pure functions; severity-first verdict; canonical JSON + SHA-256 provenance hashes for replay.
- rules/quality-gates.yaml - DoD-as-rules example set.
- fixtures/*.json - 5 mock Orient-quadrant extractions covering ready, blocked, and needs-attention verdicts.
- test/ - 24 passing tests (vitest), including dedicated reproducibility suite (strategist-recommended North Star: byte-identical replay).
- HTML reporter - self-contained, inline CSS, no JS, no external assets.
- docs/ - architecture, DSL reference, audit trail + EU AI Act mapping, extension guide, OODA integration.
- research/ - five-angle research wave (technical landscape, regulatory auditability, positioning/JTBD, design alternatives, risks/critique).
- Validate that then.verdict is one of the four known tiers; previously a typo like 'blokced' would load successfully and silently degrade the rule into a no-op via tally[<unknown-key>] (#525 P1).
- Validate that when.all / when.any / when.not are arrays at load time; previously 'any: true' would load and crash at evaluation with TypeError on .map (#525 P2).
- Export VERDICTS as a runtime constant from types.ts so the schema check has one source of truth alongside the type.
- Three new loader tests cover the two failure modes plus a typo case.
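The "one source of truth" idea in the VERDICTS bullet can be sketched as follows. This is a minimal illustration, not the code in types.ts: the four tier names are placeholders (the commit message only says there are four), and isVerdict / assertVerdict are hypothetical helper names.

```typescript
// Placeholder tier names: the commit only states that four tiers exist.
const VERDICTS = ["ready", "needs-attention", "at-risk", "blocked"] as const;

// The type is derived from the runtime constant, so the two cannot drift apart.
type Verdict = (typeof VERDICTS)[number];

// Hypothetical type guard used by the load-time schema check.
function isVerdict(value: unknown): value is Verdict {
  return typeof value === "string" && (VERDICTS as readonly string[]).includes(value);
}

// Hypothetical loader helper: rejects typos such as 'blokced' at load time
// instead of letting them become a no-op tally key at evaluation time.
function assertVerdict(ruleId: string, value: unknown): Verdict {
  if (!isVerdict(value)) {
    throw new Error(`rule '${ruleId}': unknown then.verdict ${JSON.stringify(value)}`);
  }
  return value;
}
```

Deriving the type from the array means adding a tier is a one-line change that updates both the compile-time type and the runtime check.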
- engine: 'exists' now participates in the AND-chain instead of short-circuiting, so 'exists: true' combined with 'eq'/'ne'/'gt'/'lt'/'in'/'regex' correctly requires every operator to match (#525 P2). 'exists: false' still tolerates flag absence without surfacing the 'flag missing' reason; this is the only short-circuit retained.
- loader: empty 'when.all' / 'when.any' / 'when.not' arrays are now rejected at load time. Previously 'any: []' was vacuously satisfied by the length>0 guard in evaluateWhen, allowing a typo to fire a blocking rule unintentionally (#525 P1).
- ENGINE_VERSION bumped 0.1.0 -> 0.2.0 because the exists+value-op interaction is a semantic change. Per docs/extending.md, a version bump is the auditor's signal that prior verdicts may not replay.
- Six new tests: empty when.any / when.all rejection, four exists-AND cases including exists:false standalone.
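The exists-in-the-AND-chain behaviour described above can be sketched like this. It is a reduced illustration under stated assumptions, not the engine's evaluateCondition: only exists, eq, and gt are modelled, the Condition and ConditionResult shapes are hypothetical, and every reason string except 'flag missing in extraction' is invented for the example.

```typescript
type FlagValue = boolean | number | string | string[] | null;

// Hypothetical, reduced condition shape (the real DSL has more operators).
interface Condition {
  flag: string;
  exists?: boolean;
  eq?: boolean | number | string;
  gt?: number;
}

interface ConditionResult {
  matched: boolean;
  reason?: string;
}

function evaluateCondition(
  cond: Condition,
  flags: Record<string, FlagValue | undefined>,
): ConditionResult {
  const value = flags[cond.flag];
  const present = value !== undefined && value !== null;

  // The one retained short-circuit: exists:false tolerates absence
  // and does not surface the 'flag missing' reason.
  if (cond.exists === false) {
    return present
      ? { matched: false, reason: "flag present but exists: false" }
      : { matched: true };
  }
  if (!present) {
    return { matched: false, reason: "flag missing in extraction" };
  }
  // exists:true is satisfied here, but we fall through instead of returning,
  // so every remaining operator must also match (the AND-chain).
  if (cond.eq !== undefined && value !== cond.eq) {
    return { matched: false, reason: "eq did not match" };
  }
  if (cond.gt !== undefined && !(typeof value === "number" && value > cond.gt)) {
    return { matched: false, reason: "gt did not match" };
  }
  return { matched: true };
}
```

With this shape, { flag: 'ci', exists: true, eq: true } no longer matches an extraction where ci is false, which is the case the fix targets.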
- loader: each condition must declare at least one supported operator.
Previously a typo like { flag: 'x', eqq: true } would load and then
always-match at runtime, silently flipping verdicts (#525 round 3 P2).
- loader: 'exists: false' combined with value operators is rejected at
load time. The combination has no meaningful semantics — an absent
flag has no value for eq/ne/gt/lt/in/regex to apply to (#525 round 3 P1).
- loader: condition objects without a 'flag' string are rejected.
- engine: code comment clarified to reflect the now-enforced invariant
that exists:false is only valid alone.
- Three new loader tests cover the three rejection paths.
- Reject 'then.weight' values that are <=0, infinite, or NaN. A 'blocked' rule with weight 0 would contribute nothing to the tally and silently bypass the gate (#525 round 4 P1).
- Reject non-array 'in' operators at load time. Previously a typo like 'in: foo' would load and crash at evaluation when .some() is called on a non-array (#525 round 4 P2).
- Validate regex patterns at load time. Previously a malformed regex like 'regex: "["' would load and abort the entire decision run with a SyntaxError when new RegExp() throws during evaluation (#525 round 4 P2).
- Updated existing 'missing then.weight' test to match the new 'invalid then.weight' error message.
- Four new loader tests cover the three rejection paths plus a negative-weight case.
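The three load-time rejections above might look roughly like the following. The function names and error wording are assumptions made for illustration; only the rejected inputs (non-positive, infinite, or NaN weights, non-array 'in', malformed regex) come from the commit message.

```typescript
// Hypothetical helper: a weight must be a finite number greater than zero.
function validateWeight(ruleId: string, weight: unknown): number {
  if (typeof weight !== "number" || !Number.isFinite(weight) || weight <= 0) {
    throw new Error(`rule '${ruleId}': invalid then.weight ${String(weight)}`);
  }
  return weight;
}

// Hypothetical helper: 'in' must be an array so .some() is safe at evaluation.
function validateIn(ruleId: string, value: unknown): unknown[] {
  if (!Array.isArray(value)) {
    throw new Error(`rule '${ruleId}': 'in' must be an array`);
  }
  return value;
}

// Hypothetical helper: compile the pattern once at load time, so a malformed
// regex fails here rather than aborting a decision run later.
function validateRegex(ruleId: string, pattern: unknown): RegExp {
  if (typeof pattern !== "string") {
    throw new Error(`rule '${ruleId}': 'regex' must be a string`);
  }
  try {
    return new RegExp(pattern);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    throw new Error(`rule '${ruleId}': invalid regex (${detail})`);
  }
}
```

Number.isFinite covers both NaN and the infinities in one check, which is why the weight guard needs only three clauses.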
End-to-end flow now drives the POC: user adds content to the project, runs npm run plan to generate AI extraction prompts, pastes a prompt into Claude/ChatGPT, saves the JSON to extractions/, runs npm run report to render HTML and open it in the browser.
Architecture:
- rule-engine.config.json declares targets, each with id + label + paths (files or directories, walked deterministically).
- rules/flag-schema.yaml documents every flag the rule set may reference (type + description + example); the contract between AI extractor and engine.
- src/plan.ts walks target paths, collects file contents with 8 KB truncation per file, bundles role + schema + rules + source into a single prompt per target.
- src/report.ts loads extractions per target, runs the engine, renders the existing HTML reporter, best-effort opens the first report in the OS default browser. Exit 0/1/2 = ok/blocked/missing.
- Prompt-builder follows analyst research (research/10): XML-tag structure with markdown redundancy, explicit forbidden-fields list (verdict, assessment, conclusion, summary, recommendation, rationale, analysis), open <output> tag as a forcing function.
- Original single-shot src/cli.ts preserved as a fixture-testing escape hatch.
20 new tests cover the new modules (config, flag-schema, context, prompt-builder). Suite total: 60 tests, all passing.
Research wave 2 (5 background agents) wrote 5 new artifacts under research/ covering independent review, workflow failure modes, workflow architecture, UX friction, and extraction prompt patterns.
POC stays sandbox-scoped under experiments/rule-engine-poc/. No wiring into specs/, /spec:status, plugins/, or the main repo.
…Codex round 5)
- open-browser: openInBrowser is now async and waits briefly for the spawn or error event before resolving. Previously it returned true immediately, so report.ts printed "opened in browser" even when xdg-open was missing in a headless container, misleading users during the primary plan->report flow (#525 round 5 P2).
- report.ts: awaits openInBrowser and prints the correct status line. Verified in this sandbox: now prints "could not spawn a browser; open manually: file://..." when no browser is installed.
- loader: 'exists' operator must be a boolean. Previously a typo like 'exists: "false"' would load and then silently never match because evaluateCondition compares boolean to string (#525 round 5 P2).
- loader: 'gt' and 'lt' operators must be numbers at load time, for consistency with the other type checks (engine already failed matching at runtime, but failing at load is preferred).
- Two new loader tests cover the exists-boolean and gt-number paths.
…und 6)
- open-browser: spawn cmd /c start "" on Windows. 'start' is a cmd.exe
built-in, not a standalone exe, so spawn('start', ...) raised ENOENT
and browser open always failed for Windows users (#525 round 6 P2).
- report: validate the parsed extraction is a plain object before
passing to evaluate. Previously valid JSON like null / [] / "text"
would crash inside hasOwnProperty.call on null instead of producing
a controlled error (#525 round 6 P2).
- prompt-builder: pick a fence length longer than any backtick run in
the source content. Many repo markdown files contain ``` blocks
which would prematurely close the prompt's outer fence and corrupt
the AI extraction prompt (#525 round 6 P2).
- New pickFence helper is exported and unit-tested; prompt-builder
test asserts a 5-tick fence is emitted for content with a 4-tick
run.
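The fence-length rule from the prompt-builder bullet can be sketched as below. pickFence is the helper name given in the commit; the signature, the minLength parameter, and the body are assumptions written for illustration.

```typescript
// Return a backtick fence strictly longer than the longest backtick run in
// the content, and never shorter than the usual three ticks.
function pickFence(content: string, minLength = 3): string {
  let longest = 0;
  for (const run of content.match(/`+/g) ?? []) {
    longest = Math.max(longest, run.length);
  }
  return "`".repeat(Math.max(minLength, longest + 1));
}

// Usage: content containing a 4-tick run needs a 5-tick outer fence.
const sample = "intro " + "`".repeat(4) + " outro";
console.log(pickFence(sample).length); // 5
```

Choosing "longest run plus one" rather than a fixed long fence keeps prompts compact for ordinary files while staying safe for markdown that nests fences.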
Closes the schema-miss laundering failure mode flagged by the critic (research/07) and analyst (research/10): bad LLM output now fails loudly instead of becoming a reproducible-looking verdict.
What's new:
- src/validate.ts: validateExtraction(flags, schema, options) returns errors + warnings. Checks: forbidden fields (verdict, assessment, conclusion, summary, recommendation, rationale, analysis), unknown fields (warning), type mismatches (boolean/number/string/string[]), non-finite numbers, disallowed_values violations, prompt-hash mismatch (when expectedPromptHash is provided; wired up in the next commit).
- src/validate-cli.ts: 'npm run validate' surfaces issues per target, exits 0 (clean) / 1 (errors) / 2 (missing/unparseable extraction).
- src/report.ts: validates each extraction before evaluating. Refuses to render when validation fails. --skip-validate flag for escape.
- prompt-builder imports FORBIDDEN_FIELDS from validate.ts so the forbidden list lives in exactly one place.
- 12 new validate tests; suite total 77 passing.
Verified end-to-end: a polluted extraction with verdict + type-mismatch + unknown-flag is caught by both validate and report.
…ard)
Closes the stale-extraction failure mode flagged by the critic (research/07): users edit source files between plan and report and the old JSON still renders a confident verdict. The report now refuses extractions produced against a different prompt.
What's new:
- src/prompt-hash.ts: computePromptHash hashes the LOAD-BEARING inputs (target id, per-file sha, rule hashes, schema content), not the rendered prompt text. Cosmetic edits to the prompt template don't invalidate extractions; real source changes do.
- src/plan.ts: emits sidecar prompts/<id>.hash.txt and embeds the hash into the prompt as (a) a top-of-file HTML comment, (b) an explicit rule asking the LLM to copy it into __prompt_hash, and (c) the response template's first key.
- src/validate.ts: enforces expectedPromptHash when provided. Surfaces missing-prompt-hash and stale-extraction error codes with re-run instructions.
- src/report.ts + src/validate-cli.ts: read the sidecar hash if it exists; absence falls back to the pre-binding behaviour for backwards-compat with fixtures.
- 6 new prompt-hash tests; suite total 83 passing.
End-to-end verified in this sandbox:
- matching hash -> exit 0 (ready-to-progress)
- stale hash -> exit 2 with explicit error
- missing field -> exit 2 with explicit error
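Hashing the load-bearing inputs rather than the rendered prompt could be sketched along these lines. computePromptHash is the name from the commit, but the PromptHashInputs shape, the canonicalJson helper, and the choice of hex-encoded SHA-256 over key-sorted JSON are assumptions for illustration (the first commit mentions canonical JSON + SHA-256 for provenance hashes, so the same approach is assumed here).

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape covering the four inputs the commit lists.
interface PromptHashInputs {
  targetId: string;
  fileShas: Record<string, string>; // path -> content sha
  ruleHashes: string[];
  schemaContent: string;
}

// Minimal canonical JSON: object keys are sorted so insertion order
// never affects the digest. Arrays keep their order.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

function computePromptHash(inputs: PromptHashInputs): string {
  return createHash("sha256").update(canonicalJson(inputs)).digest("hex");
}
```

Because the prompt template text is not an input, rewording the prompt leaves the hash unchanged, while any change to a file sha, rule hash, or the schema produces a different digest.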
Closes a cluster of small findings from Codex round 7 (cli, context) and reviewer S2/S3 (engine, loader audit-trail honesty):
- cli.ts: validate JSON root is a plain object before evaluate. Same guard as report.ts; previously valid-JSON-but-not-object input would crash inside hasOwnProperty.call (#525 round 7 P2).
- context.ts: use lstat instead of stat and skip symlinks entirely. Previously a symlink cycle (a/sub/loop -> a/) would recurse until stack overflow during plan (#525 round 7 P2).
- engine.ts evaluateCondition: gt/lt against a non-number and regex against a non-string now set an explicit reason ('expected number for gt, got string') so the audit trail explains *why* the condition didn't match. Reviewer S2: previously these set matched=false with no reason.
- engine.ts evaluateWhen: when.not against a missing flag no longer silently fires. The inner condition's 'flag missing in extraction' reason is preserved through the not clause; the rule fails to match rather than inverting absence into success. Reviewer S2.
- loader.ts: duplicate rule ids in a single rule file are rejected at load time. Previously a second rule with the same id loaded silently and the engine evaluated it independently. Reviewer S3.
- Five new tests: gt-non-number reason, regex-non-string reason, not-missing flag, duplicate ids, symlink cycle handling.
Suite: 88/88 passing.
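The JSON-root guard shared by cli.ts and report.ts might be sketched as follows. isPlainObject and parseExtraction are hypothetical names, and the error wording is invented; only the rejected inputs (null, arrays, bare strings) come from the commit messages.

```typescript
// True only for object literals (or null-prototype objects), not for
// null, arrays, or primitives, all of which are valid JSON roots.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return false;
  }
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

// Hypothetical wrapper: turn a raw extraction file into flags, failing with
// a controlled error instead of crashing later inside hasOwnProperty.call.
function parseExtraction(raw: string): Record<string, unknown> {
  const parsed: unknown = JSON.parse(raw);
  if (!isPlainObject(parsed)) {
    throw new Error("extraction root must be a JSON object");
  }
  return parsed;
}
```

Keeping the guard in one function is also what lets two CLIs share it, which is the duplication problem a later commit in this PR addresses.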
Closes the 'paste the sidecar to bypass staleness' cheat the critic flagged as the highest-leverage fix in the post-validate workflow.
Previously: report.ts and validate-cli.ts read prompts/<id>.hash.txt (plain text) and trusted its value. An operator under deadline pressure could open the sidecar, copy the hash into the extraction's __prompt_hash field, and silently re-render a stale verdict.
Now: report and validate-cli recompute the hash from current source files + rules + schema (same code path as plan.ts). The sidecar still gets written for diagnostic / debugging purposes, but it is never the authority for whether an extraction is stale. A real change to any source file invalidates the extraction automatically.
Smoke-tested in this sandbox:
- Source unchanged, paste-the-sidecar cheat -> exit 0 (correct; extraction is still valid against current source).
- Source mutated, same paste-the-sidecar cheat -> stale-extraction error with both the pasted hash and the recomputed hash printed.
- research/14 (critic): three new failure modes the validate gate opened; ranked --skip-validate, sidecar-paste cheat, and reproducibility theatre. Highest-leverage fix already landed in the previous commit.
- research/15 (sre): CI integration sketch with concrete cost math (~$0.56/target, $1,700/month at 20 PRs/day on Opus 4.7) and a Day-1/30/90 operational milestones path.
- research/16 (user-researcher): 5-segment JTBD switch interview plan with sequencing (mine demand signal first, S1 indie devs next, fail fast before S2-S5), full sample script, RAT integration.
- research/12 (reviewer): independent re-review at HEAD. Verdict pass-with-findings. S2-1 (docs drift: workflow.md still lists validate gate as 'not yet here' despite shipping), S2-2 (sidecar deletion bypasses prompt-hash binding entirely), S2-3 (--skip-validate is undocumented), and an S3 cluster on test count drift, HTML provenance, and type-mismatch error messages.
- Fix two typos caught by CI spell check (typos v1.46.0): research/16 'pre-empted' -> 'confirmed' (reads more clearly anyway), research/12 'ci_passsing' -> 'ci_passingx' (illustrative typo recast to avoid typos-tool false positive).
…grams
Replaces the engine-internals-focused architecture.md with a comprehensive system view covering:
- System overview (component flowchart)
- User flow (sequence diagram across plan/AI/validate/report)
- Data flow (annotated with data shapes at each seam)
- Engine internals (evaluate algorithm + per-condition + severity picker)
- Validate gate + prompt-hash binding (sequence)
- OODA mapping (Observe/Orient/Decide/Act with stochasticity boundary)
- Module dependency graph (16 src/ modules)
- Why these shapes (design choices + research refs)
Seven Mermaid diagrams. docs/README.md now points to architecture.md as the start-here entry.
- config.ts: target ids must match /^[A-Za-z0-9][A-Za-z0-9_-]*$/. Previously a target id like '../escape' or 'foo/bar' was accepted and later interpolated into prompts/<id>.md, extractions/<id>.json, reports/<id>.html; at best ENOENT, at worst write outside the workspace (#525 round 8 P2).
- context.ts: extract truncateToBytes() that walks back to a UTF-8 codepoint boundary. Previously slice(0, maxBytes) counted UTF-16 code units, so CJK / emoji-heavy markdown could emit prompt blocks 4x the advertised 8 KB cap (#525 round 8 P2).
- 7 new tests cover the three slug rejection paths and the multibyte truncation invariant.
Suite: 95/95 passing.
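Both fixes above are small enough to sketch. The regular expression is the one quoted in the commit; truncateToBytes is the helper name from the commit, but its signature and body here are assumptions, and isValidTargetId is a hypothetical wrapper.

```typescript
// Pattern quoted in the commit: no slashes or dots, so an id cannot
// escape the prompts/, extractions/, or reports/ directories.
const TARGET_ID = /^[A-Za-z0-9][A-Za-z0-9_-]*$/;

function isValidTargetId(id: string): boolean {
  return TARGET_ID.test(id);
}

// Truncate to at most maxBytes of UTF-8 without splitting a codepoint.
function truncateToBytes(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let cut = Math.max(0, maxBytes);
  // UTF-8 continuation bytes look like 10xxxxxx. If the first excluded byte
  // is a continuation byte, the cut is mid-codepoint, so walk back to a
  // lead byte.
  while (cut > 0 && ((bytes[cut] ?? 0) & 0xc0) === 0x80) {
    cut--;
  }
  return new TextDecoder().decode(bytes.subarray(0, cut));
}

// Usage: each CJK character below is 3 bytes, so a 4-byte cap keeps one.
console.log(truncateToBytes("日本語", 4)); // 日
```

Measuring in encoded bytes rather than string length is what makes the advertised cap hold for multibyte text.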
Three implementer subagents ran in parallel; each verified with
npm test + tsc + typos before reporting. 98/98 tests passing.
Agent A — Safety: sidecar refusal + --skip-validate warning
- report.ts + validate-cli.ts: prompt-hash binding now triggers on
prompts/<id>.md (the prompt file) existing, NOT on the sidecar
prompts/<id>.hash.txt. Deleting the sidecar can no longer bypass
the staleness check (reviewer research/12 S2-2).
- report.ts: --skip-validate now prints a loud stderr warning per
target ('validation gate disabled. This is for debugging only.')
closing reviewer S2-3 silent-flag finding.
Agent B — validate.ts polish
- Type-mismatch errors now include the observed value via
formatObserved() with an 80-char cap and ellipsis (reviewer S3).
Example: "Flag 'X' expected 'boolean', got string (\"yes\")."
- null flag values now warn with code 'null-value-omit-instead'
rather than being silently accepted as 'unknown'. Engine semantics
unchanged (null still treated as missing); validate just surfaces
the discrepancy with the prompt's 'omit unknowns' instruction.
Agent C — HTML report v2
- RenderContext gains an optional promptHash field. When set, the
Provenance section shows the 12-char prefix; when the extraction's
__prompt_hash matches, a 'verified' badge appears.
- Audit-trail rows with reason='flag missing in extraction' now use
a distinct cond--missing CSS class (yellow/warning palette) to
differentiate from cond--miss (red/error). UX research/09 finding.
Agent A - integration tests (was missed from 4e54c0e):
- test/report-flow.test.ts: 8 spawnSync integration tests covering prompt-extraction binding (sidecar deletion, missing __prompt_hash, stale hash, fixture flow with no prompt file, parity for both CLIs) and --skip-validate stderr warning. Brings suite to 106 tests.
Agent D - docs drift sync:
- docs/workflow.md: rewrote "What's not yet here" (validate gate and stale-extraction detection BOTH shipped; now lists API extractor, rule governance, fairness audit, drift dashboards). Added the __prompt_hash paragraph in "Paste into an AI tool" + a new "--skip-validate flag" subsection (debugging-only, never-in-CI).
- README.md: test count 60 -> 100+, file map updated to include prompt-hash / validate / validate-cli modules, research table extended from 10 to 16 artifacts.
- docs/README.md: "five briefs" -> "16 research artifacts" with the expanded angle list.
Suite: 106/106 passing.
…dex round 9 P1)
When the prompt file exists but collectFiles() throws (target paths
deleted / renamed / unreadable), the previous behaviour was to set
expectedPromptHash = undefined and continue — silently disabling the
stale-extraction check. A renamed source folder could then let report
render an old extraction as if it were current.
Now both report.ts and validate-cli.ts fail closed: print an explicit
error and skip the target with exit code 2. The integrity invariant
('an extraction is checked against the current source') is preserved.
Backwards-compat preserved for the fixture / single-shot flow: when
prompts/<id>.md doesn't exist, no hash check is attempted, and the
catch path above doesn't fire.
Suite: 106/106 still passing (no test exercised the silent-downgrade
path).
… (Codex round 10)
- config.ts: target-id duplicate check now case-insensitive. Target ids are interpolated into <id>.md / <id>.json / <id>.html filenames, and default macOS / Windows filesystems are case-insensitive, so 'Alpha' and 'alpha' would collide on disk without a config error and one target's artifacts would overwrite another's silently (#525 round 10 P2).
- prompt-builder.ts: __prompt_hash is now actually the FIRST key in the response template, not just claimed-first-then-appended-last. Object.fromEntries used to insert the schema keys before the promptHash assignment; JS object key order follows insertion for string keys, so the hash was rendered last. Now we build the object with the hash first, then loop the schema (#525 round 10 P3).
- Two new tests cover both behaviours.
Note: src/cli-shared.ts and src/validate-cli.ts are dirty in the worktree from an in-flight CLI scaffolding refactor; those changes will land separately when the agent reports back.
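The key-order fix relies on JavaScript preserving insertion order for non-integer string keys. A minimal sketch, assuming a hypothetical buildResponseTemplate function and a placeholder value for each flag (neither is taken from prompt-builder.ts):

```typescript
// Build the template with the hash inserted first, then append schema keys.
// Insertion order is preserved for string keys that are not integer-like,
// so JSON.stringify renders __prompt_hash before every flag.
function buildResponseTemplate(
  promptHash: string,
  flagNames: string[],
): Record<string, unknown> {
  const template: Record<string, unknown> = { __prompt_hash: promptHash };
  for (const name of flagNames) {
    template[name] = "<fill in>"; // placeholder value, illustrative only
  }
  return template;
}

// Usage with made-up flag names:
const template = buildResponseTemplate("abc123", ["ci_passing", "review_count"]);
console.log(Object.keys(template)[0]); // __prompt_hash
```

One caveat of this approach: a flag whose name is an integer-like string (for example "1") would still be ordered ahead of __prompt_hash, so the guarantee holds only for ordinary identifier-style flag names.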
Three CLIs (plan, validate, report) previously duplicated argv
parsing, config + rules + schema loading, target filtering, extraction
IO, prompt-hash recompute, and exit-code handling. The duplication
was the documented cause of two repeated bugs: Codex caught the same
JSON-root-validation defect in report.ts (round 6) and cli.ts (round
7), and the fail-closed fix for unrecomputable prompt hashes (round 9
P1) had to be applied to both report.ts and validate-cli.ts.
What changed:
- src/cli-shared.ts (new, 270 LOC): exports takeOpt / takeFlag /
parseStandardArgs, loadCliBaseContext, selectTargets, plus a
discriminated-union loadExtractionForTarget that returns
{ kind: 'ok' | 'missing' | 'invalid-json' | 'non-object' |
'hash-unrecomputable' } and a logExtractionError formatter that
preserves the existing stderr text byte-for-byte (test/report-flow
asserts those strings).
- src/plan.ts: 90 LOC -> 79 LOC. Uses parseStandardArgs +
loadCliBaseContext + selectTargets. Schema coverage diff stays here.
- src/report.ts: 210 LOC -> 120 LOC. Per-target handler is now
evaluate -> validate -> render HTML; the defensive-IO scaffolding
is gone. --skip-validate and --no-open remain command-specific.
- src/validate-cli.ts: 110 LOC -> 62 LOC. Per-target handler is now
just validateExtraction + log results + Summary line.
- test/cli-shared.test.ts (new): 14 unit tests for takeOpt/takeFlag
argv mutation, parseStandardArgs, selectTargets filtering, and
loadExtractionForTarget across all five discriminated-union cases.
- src/cli.ts, src/engine.ts, src/loader.ts, src/validate.ts, and the
prompt + html-report layers are untouched.
Verified: 123/123 tests passing, tsc clean, typos clean. All eight
report-flow integration tests still pass — message text and exit
codes preserved.
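The discriminated-union return type described above can be sketched as follows. The five kind values are from the commit message; the payload fields, the classifyExtraction function, and the exit-code mapping are assumptions for illustration ('hash-unrecomputable' needs source files on disk, so it is typed here but not produced by the pure classifier).

```typescript
type ExtractionLoad =
  | { kind: "ok"; flags: Record<string, unknown> }
  | { kind: "missing"; path: string }
  | { kind: "invalid-json"; path: string; message: string }
  | { kind: "non-object"; path: string }
  | { kind: "hash-unrecomputable"; message: string };

// Hypothetical pure classifier: raw is the file content, or undefined
// when the extraction file does not exist.
function classifyExtraction(path: string, raw: string | undefined): ExtractionLoad {
  if (raw === undefined) return { kind: "missing", path };
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { kind: "invalid-json", path, message };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { kind: "non-object", path };
  }
  return { kind: "ok", flags: parsed as Record<string, unknown> };
}

// Assumed mapping: every failure kind exits 2. The never-typed default makes
// the compiler flag any new kind that is added without being handled.
function exitCodeFor(result: ExtractionLoad): number {
  switch (result.kind) {
    case "ok":
      return 0;
    case "missing":
    case "invalid-json":
    case "non-object":
    case "hash-unrecomputable":
      return 2;
    default: {
      const unreachable: never = result;
      return unreachable;
    }
  }
}
```

Returning a tagged result instead of throwing lets each CLI decide how to log and exit while the failure cases stay enumerated in one type.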
Three rendered reports (ready, blocked, needs-attention) for the agents to inspect when reviewing report readability.
The sample reports under experiments/rule-engine-poc/research/sample-reports/ embed 12-char rule content-hash prefixes that randomly trip typos rules (e.g., 'afe...' -> 'safe'). Excluding the folder is consistent with the existing pattern that allow-lists specific commit-SHA fragments. Also commits research/17 (ux-designer pass on the rendered HTML reports) — top finding is that the audit trail buries the matched rules among ~21 'did not match' siblings; recommended a 'What fired' section + collapse-by-default for skipped rules.
Product-strategist pass on the HTML report as a downstream-shared
artifact. Three findings:
- The report is one artifact serving six first-fields (PR reviewer,
PM, EM, QA, compliance, auditor). Recommend one HTML with
re-stacked sections rather than reader-specific exports — keep
the 'one artifact, many destinations' moat.
- Highest-leverage change: expand action slugs ('kick-ci',
'request-reviewer') to human sentences via an actions[].human
field on the rule schema. Promote the 'verified' prompt-hash
badge next to the verdict.
- Introduce label_set config (default 'dev'; 'pm', 'qa',
'compliance' as presentational overrides) so headline labels
match the reader's vocabulary.
Brand-reviewer pass on the rendered HTML report. Verdict: pass-with-findings; not S1-blocking while the POC stays under experiments/, but would block on the promotion-to-skill step flagged in research/13.
Findings:
- On-temperament (no emoji / gradients / icons; ASCII [+]/[-]/[?] markers are correctly monospace-as-iconography; restrained density).
- Off-token: 18 distinct literal hex values, literal -apple-system / SFMono-Regular font stacks, page background near-white instead of Specorator cream var(--paper).
- Voice close but section headers are bare labels rather than sentence-case-with-period declaratives; 'Suggested actions' is passive against Specorator's imperative voice.
- Open decision: Specorator has no red token. blocked tier currently uses literal #fdecea / #d8281b / #7a160d. ADR-shaped choice before graduation: extend colors_and_type.css, repurpose --soft-orange and rename the tier 'at-risk', or stay literal until packaged.
…h/21)
Critic pass on the rendered HTML report as a communication artifact. Three findings:
- Visual hierarchy contradicts semantic model: severity-first is invisible in the weighted-tally widget (reads as a horse race), alphabetically-sorted action list silently asserts a priority the engine refuses to give, cond--missing vs cond--miss are visually distinguished but never named (colour-blind readers lose the signal).
- 'verified' badge is a trust-calibration trap: green pill reads as 'extraction verified' to an auditor when it only means 'bound to current inputs'. Compounded by --skip-validate runs producing HTML indistinguishable from validated ones (research/14 risk 1 leaks into the report layer).
- Most dangerous skim path: blocker-by-absence. A high-priority blocker rule whose input flag is missing from the extraction simply doesn't fire; neither verdict tile nor any header-level summary tells the reader 'N higher-priority rules were un-evaluable'.
- 3 RATs proposed (verdict-tile-alone, action-list-as-priority, 'verified' interpretation). Default no-go if any fail.
Analyst pass on the HTML report from a regulator's reading perspective + 2026 benchmark against LangSmith / Inspect / W&B Weave / sklearn / model-card conventions. Closes the open item from research/02 (human-readable rationale presentation).
New sidecar mapping action slugs to imperative human sentences so the HTML report can render readable guidance instead of bare slugs.
- rules/action-glossary.yaml: 28 entries covering every action used in rules/quality-gates.yaml, with optional urgency + category metadata. Imperative voice per Specorator brand.
- src/action-glossary.ts: loader + diff-coverage helper, mirroring src/flag-schema.ts conventions.
Wiring into config.ts and the HTML renderer happens in subsequent commits when Agents A2/B finish their slices. 123/123 tests still passing; no behaviour change yet.
… A complete)
Agent A's RALPH loop completed. Action glossary is now reachable via the config (still optional; no behaviour change for callers that don't set actionGlossary):
- src/config.ts: optional 'actionGlossary' string field on RawConfig resolved to 'actionGlossaryPath' on ResolvedConfig, same pattern as flagSchema.
- rule-engine.config.json: points at rules/action-glossary.yaml.
- test/action-glossary.test.ts: 12 tests covering loader validation, diff-coverage, real-file coverage of rules/quality-gates.yaml, and sentence-shape invariants.
Also picks up Agent C's in-flight architecture.md updates: system overview + data flow diagrams now show the glossary node (read only by the renderer) and the new HTML report sections (system-identity header, what fired, reproduce block, audit trail with non-matched collapsed). The renderer itself (Agent B) is still in flight.
Suite: 135/135 passing (123 baseline + 12 new). 21 unique action slugs in rules/quality-gates.yaml, all mapped in the glossary.
…xtending
- audit-trail.md: Mapping to EU AI Act table updated to credit the HTML report (what-fired with human sentences, system-identity header, tier glossary, reproduce block) as the Art. 13 human-readable rationale surface. Closes research/02 open item about explainability presentation.
- workflow.md: still in flight by agent C; current commit picks up partial edits.
- extending.md: new 'Authoring action human sentences' section documents rules/action-glossary.yaml as a render-only sidecar (engine never reads it; editing sentences cannot change a verdict).
Agent B (HTML report rebuild) and Agent C (README sync) are still running their RALPH loops. This commit snapshots the current on-disk state so the working tree stays clean between iterations:
- src/html-report.ts: partial changes from agent B (rebuild for research wave 4 findings). 135/135 tests still passing; the partial state is internally consistent even if not yet feature complete.
- src/report.ts: corresponding plumbing changes from agent B.
- README.md: agent C in-flight test-count + file-map sync.
Will be superseded by the next commit when both agents report final.
…t B complete)
Agent B's RALPH loop completed. Twelve convergent findings from research wave 4 now realised in the renderer:
1. 'What fired' section above the full audit trail (UX/17 + critic/21 + auditor/20). Verdict-card stats line now reads 'N rule(s) fired · M action(s) to take'.
2. Non-matched rules collapsed via <details class='rule-collapsed'>; matched rules stay inline (UX/17).
3. Blocker-by-absence banner adjacent to the verdict card when any rule's condition reports 'flag missing in extraction' (critic/21 + UX/17). Yellow palette, names the missing flags.
4. Suggested actions now sorted by priority-of-cause (walk evaluations in priority-desc order, dedup preserving first-seen) instead of alphabetic. result.actions unchanged for machine consumers (UX/17).
5. Action human-sentence rendering via rules/action-glossary.yaml; falls back to bare slug when entry missing (stakeholder/18).
6. Provenance section: preamble explaining the hashes + 'How to reproduce' block + 12-char hash truncation (UX/17 + auditor/20 + stakeholder/18).
7. System-identity header above the verdict card: engine version + prominent timestamp moved out of the footer (auditor/20).
8. Verdict-tier glossary + [+]/[-]/[?] glyph legend in a collapsed <details class='legend'> block (auditor/20 + UX/17).
9. cond--miss now has a faint red row-wash matching cond--missing's amber, so the visual distinction isn't glyph-color-only (UX/17).
10. @media (max-width: 540px) single-column fallback (UX/17).
11. Trust calibration: --skip-validate banner shown prominently when validationStatus='skipped'; verified-badge tooltip explains it only means 'extraction is bound to current inputs', not 'flags are correct' (stakeholder/18 + critic/21).
12. Section headers in sentence-case with periods, imperative voice: 'Take these actions.' not 'Suggested actions' (brand/19).
28 new tests in test/html-report.test.ts; suite total 163/163.
Three sample reports regenerated under research/sample-reports/ so reviewers see the new shape.
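Item 4 of the list above (actions ordered by priority-of-cause) can be sketched as below. The Evaluation shape and function name are hypothetical, and restricting the walk to matched rules is an assumption; the commit only specifies priority-descending order with first-seen deduplication.

```typescript
// Hypothetical evaluation record; the real engine's result type may differ.
interface Evaluation {
  ruleId: string;
  priority: number;
  matched: boolean;
  actions: string[];
}

// Walk matched evaluations from highest to lowest priority and keep each
// action the first time it is seen, so an action inherits the rank of the
// most important rule that asked for it.
function actionsByPriorityOfCause(evaluations: Evaluation[]): string[] {
  const ordered = evaluations
    .filter((evaluation) => evaluation.matched)
    .sort((a, b) => b.priority - a.priority); // filter() returned a copy
  const seen = new Set<string>();
  const result: string[] = [];
  for (const evaluation of ordered) {
    for (const action of evaluation.actions) {
      if (!seen.has(action)) {
        seen.add(action);
        result.push(action);
      }
    }
  }
  return result;
}
```

Because the input array is not mutated, a machine-facing list such as result.actions can keep its original order while only the rendered report uses this ordering.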
…x round 11)
- open-browser.ts: success now requires the opener process to exit with code 0 (or null, signal-terminated), not just spawn. On Linux headless / CI, xdg-open spawns successfully and then immediately exits non-zero because no browser handler is registered; previously report.ts printed 'opened in browser' even though the file was never opened. Safety-net timeout bumped to 1s for opener daemons that never deliver an exit event (#525 round 11 P2).
- cli-shared.ts: takeOpt now throws 'Missing value for option <flag>' when the flag is the last argv entry or is followed by another flag. Previously 'npm run report -- --target' would silently fall back to 'all targets' and 'npm run report -- --target --quiet' would interpret '--quiet' as the target id (#525 round 11 P2).
- Two new takeOpt tests cover the missing-value rejection path.
Suite: 165/165 passing.
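The takeOpt rejection path could look roughly like this. The error text and the argv-mutating behaviour are from the commit messages; the signature and the "starts with --" test for what counts as another flag are assumptions.

```typescript
// Remove `flag` and its value from argv and return the value.
// Returns undefined when the flag is absent; throws when it has no value.
function takeOpt(argv: string[], flag: string): string | undefined {
  const index = argv.indexOf(flag);
  if (index === -1) return undefined;
  const value = argv[index + 1];
  // Assumption: anything starting with "--" is another flag, not a value.
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`Missing value for option ${flag}`);
  }
  argv.splice(index, 2); // mutate argv so later parsing does not see the pair
  return value;
}

// Usage:
const args = ["--target", "alpha", "--quiet"];
console.log(takeOpt(args, "--target")); // alpha
console.log(args); // [ '--quiet' ]
```

Throwing here, rather than returning undefined, is what prevents '--target' with no value from being read as "all targets".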
…odex round 12)
- html-report.ts missingFlagNames: only count rules whose final outcome was determined by the missing flag (matched === false). Previously a 'when.any' rule with one matched branch + one missing branch counted as un-evaluable in the banner, even though it contributed to the verdict (#525 round 12 P2).
- html-report.ts reproCmd: paths are now single-quoted via a small shellQuote helper that escapes embedded ' as the standard '\'' four-char sequence. Paths with spaces (e.g., 'My Projects/rules.yaml') no longer break the copy-pasted reproduce command (#525 round 12 P2).
- Two new html-report tests: when.any-with-missing-branch is NOT counted in the banner, and reproCmd contains HTML-escaped quoted paths.
Suite: 167/167 passing.
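The single-quoting technique named in the reproCmd bullet is a standard POSIX shell idiom and can be sketched in a few lines. shellQuote is the helper name from the commit; this body is an illustrative assumption, not the code in html-report.ts.

```typescript
// Wrap the argument in single quotes. An embedded single quote is written as
// the four characters '\'' : close the quote, emit an escaped quote, reopen.
function shellQuote(arg: string): string {
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Usage: a path with a space survives copy-paste into a POSIX shell.
console.log(shellQuote("My Projects/rules.yaml")); // 'My Projects/rules.yaml'
```

Single quotes are used because nothing inside them is interpreted by a POSIX shell, so only the quote character itself needs special handling. The HTML escaping mentioned in the tests is a separate step applied when the command is rendered into the report.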
…odex round 13) src/cli.ts has its own takeOption() (the single-shot fixture flow doesn't use src/cli-shared.ts). Same bug as round 11 P2 in cli-shared: when --html had no value the helper returned undefined and the CLI silently proceeded with no HTML output, breaking automation that relies on the artifact being written. Now fail fast with a clear stderr message and exit code 2 when the option is the last argv entry or is followed by another flag. Suite: 167/167 still passing (no test exercised the silent-skip path).
… 14)
- loader.ts: every entry of 'then.actions' must be a non-empty string slug. Previously numbers / objects / empty strings passed load-time validation and flowed into the HTML reporter as unrecognised tokens that couldn't map to a glossary entry, breaking the remediation guidance the verdict is meant to provide (#525 round 14 P2).
- loader.ts: 'gt' and 'lt' now reject NaN and Infinity at load time. Both are technically 'number' but silently corrupt comparisons at runtime (NaN comparisons always false), so a typo could make a gating rule unexpectedly never fire (#525 round 14 P2).
- Four new loader tests cover non-string action elements, empty-string action elements, NaN gt, and Infinity lt.
Suite: 171/171 passing (167 + 4).
New docs/compliance.md walks the standards/regulations an adopter is most likely to be asked about in 2026 and catalogues what the POC contributes vs what stays the adopter's job. Synthesises the two prior research passes (research/02 regulatory + research/20 auditor readability) into reference material for scoping conversations.
Covered:
- EU AI Act Art. 11-14 + Art. 72 with a per-article table.
- ISO/IEC 42001 AIMS clauses 6-10.
- ISO/IEC 23894 AI risk management.
- NIST AI RMF Govern / Map / Measure / Manage.
- GDPR Art. 22 (when it applies vs when it doesn't).
- OECD AI Principles (1-paragraph summary).
Plus:
- 'What the POC ticks natively': per-artifact provenance.
- 'What is NOT in this POC': honest gap analysis.
- Maturity checklist before production with rough effort estimates.
Leads with a disclaimer that this is engineering reference material, not legal advice or certification. docs/README.md updated to index the new doc.
…/git/Luis85/agentic-workflow into claude/rule-engine-poc-gO5yq
…the HTML report
Synthesises the five wave-4 research artifacts (research/17-21) plus the relevant sections of architecture.md, workflow.md, and audit-trail.md into one place for someone who wants to understand the report end-to-end without chasing across files.
Sections:
- What the report is + the three committed sample renders
- Section-by-section walkthrough mapped to each section's research source
- The five perspectives that shaped the v3 rebuild (UX, stakeholder, brand, auditor, critic) with each agent's top-line finding
- The 12 wave-4 changes + the Codex round 11-14 hardenings that landed on top
- What is still open, bucketed (strategy slice / governance / ADR / production prep / discovery RATs)
- How to generate one
- How to read one (the 4-step skim path)
docs/README.md indexes it.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d203b154a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    import { join, basename } from "node:path";
    import { spawnSync } from "node:child_process";

    const fixturesDir = new URL("../fixtures/", import.meta.url).pathname;
Convert file URL to fs path before reading fixtures
new URL("../fixtures/", import.meta.url).pathname is not a safe filesystem path: it keeps percent-encoding (e.g. %20) and is malformed on Windows drive paths, so readdirSync(fixturesDir) can fail with ENOENT when the repo lives in a path with spaces or on Windows. Use fileURLToPath(new URL(...)) before passing it to fs APIs.
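A minimal illustration of the difference, assuming a POSIX host. The module URL below is a hypothetical stand-in for `import.meta.url`, chosen so the path contains a space.

```typescript
import { fileURLToPath } from "node:url";

// Hypothetical stand-in for import.meta.url of a script under "My Projects".
const moduleUrl = "file:///tmp/My%20Projects/scripts/run-all-fixtures.mjs";

const fixturesUrl = new URL("../fixtures/", moduleUrl);

// .pathname keeps the percent-encoding, so it is not a usable filesystem path.
const rawPathname = fixturesUrl.pathname;

// fileURLToPath decodes the URL into a platform filesystem path.
const fixturesDir = fileURLToPath(fixturesUrl);

console.log(rawPathname); // still contains %20
console.log(fixturesDir); // contains a literal space, safe to hand to fs APIs
```

Passing `fixturesDir` rather than `rawPathname` to `readdirSync` is the fix the review asks for.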
    if (typeof rule.priority !== "number") {
      throw new Error(`Rule '${rule.id}' missing numeric 'priority'`);
    }
Reject non-finite rule priorities at load time
priority is only checked with typeof === "number", so YAML values like .nan are accepted. In evaluate, sorting then does b.priority - a.priority; with NaN that comparator returns NaN (treated like 0), which silently breaks the documented priority desc, id asc ordering and can reorder audit/action output unpredictably. This loader should reject non-finite priorities the same way it already rejects non-finite numeric operators and weights.
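A sketch of the stricter check, under stated assumptions: the `RuleLike` shape and the `assertFinitePriority` name are hypothetical, and the comparator only reproduces the `b.priority - a.priority` expression quoted in the review, not the engine's full sort.

```typescript
// Hypothetical rule shape for illustration; the real type lives in types.ts.
interface RuleLike {
  id: string;
  priority: unknown;
}

// typeof NaN === "number", so a typeof check alone lets .nan / .inf through.
function assertFinitePriority(rule: RuleLike): number {
  if (typeof rule.priority !== "number" || !Number.isFinite(rule.priority)) {
    throw new Error(`Rule '${rule.id}' requires a finite numeric 'priority'`);
  }
  return rule.priority;
}

// The comparator from the review: with a NaN priority the result is NaN,
// so the sort can no longer establish a meaningful order.
const byPriorityDesc = (a: { priority: number }, b: { priority: number }): number =>
  b.priority - a.priority;

const comparison = byPriorityDesc({ priority: NaN }, { priority: 5 });
console.log(Number.isNaN(comparison)); // true
```

Rejecting the value at load time means the comparator never sees a non-finite operand.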
- scripts/run-all-html.mjs + scripts/run-all-fixtures.mjs: use fileURLToPath(new URL(...)) instead of .pathname. The bare .pathname keeps percent-encoding (paths with spaces break) and produces malformed Windows drive paths. Same fix in both scripts since they share the import.meta.url -> fixturesDir pattern (#526 P2).
- loader.ts: reject non-finite 'priority' (NaN, Infinity) at load time. Same fail-fast discipline as weight + gt + lt. NaN priorities silently break the sort comparator (b.priority - a.priority returns NaN, treated as 0), reordering the audit trail unpredictably (#526 P2).
- Two new loader tests cover .nan and .inf priority rejection.
Suite: 173/173 passing.
💡 Codex Review
Reviewed commit: a09bee96e1
    if (value === null) {
      warnings.push({
        severity: "warning",
        code: "null-value-omit-instead",
        path: key,
        message:
          `Flag '${key}' is null; prefer omitting unknowns over emitting null. ` +
          `The engine will treat null and missing identically.`,
      });
      continue;
Treat null extraction values as validation errors
The null special-case downgrades schema violations to a warning and skips type enforcement, so report/validate can pass even when the extraction is not type-correct. This can change verdicts silently: the engine treats null as a present flag (hasOwnProperty), so rules using presence-sensitive logic (for example exists or ne) evaluate differently than if the flag were omitted, despite the validator message claiming null is equivalent to missing. In practice, a model emitting {"some_flag": null} can produce an accepted but semantically different decision path.
… (Codex round 16)
PR #526 Codex P1: the previous null handling was a warning with the message 'the engine will treat null and missing identically', but that claim was false. The engine's hasOwnProperty presence check treats {flag: null} as PRESENT, so rules using 'exists' or 'ne' evaluate differently against {flag: null} than against {} despite the validator's reassurance.
Two ways to make the claim true: (a) refuse null at the gate, or (b) make the engine treat null as missing. (a) is simpler and preserves the LLM prompt's 'omit unknowns' contract. Engine semantics stay unchanged; null just never reaches the engine on the workflow path. Single-shot cli.ts bypasses validate but its fixtures don't use null.
- src/validate.ts: null flag values now produce a hard error (code: 'null-value-not-allowed') with a clear message explaining why null is not 'missing-equivalent'.
- test/validate.test.ts: updated existing test to expect the error shape.
Suite: 173/173 passing. Single test renamed; no count change.
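The mismatch and the chosen fix, option (a), can be sketched as below. The `Flags` and `ValidationIssue` types, the `isPresent` and `validateNoNulls` names, and the message text are assumptions; only the hasOwnProperty presence check and the 'null-value-not-allowed' code come from the discussion above.

```typescript
// Hypothetical types and helpers; the real ones live in src/validate.ts and the engine.
type Flags = Record<string, unknown>;

interface ValidationIssue {
  severity: "error" | "warning";
  code: string;
  path: string;
  message: string;
}

// Presence as the engine is described to check it: own-property existence only.
const isPresent = (flags: Flags, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(flags, key);

// Option (a): refuse null at the gate so it never reaches the engine.
function validateNoNulls(flags: Flags): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const [key, value] of Object.entries(flags)) {
    if (value === null) {
      issues.push({
        severity: "error",
        code: "null-value-not-allowed",
        path: key,
        message: `Flag '${key}' is null; omit unknown flags instead of emitting null.`,
      });
    }
  }
  return issues;
}

console.log(isPresent({ some_flag: null }, "some_flag")); // true: null counts as present
console.log(isPresent({}, "some_flag")); // false: omitted flag is missing
```

Because a null flag is present to the engine, an 'exists' or 'ne' rule would take a different path than it would for an omitted flag, which is why a warning was not enough.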
💡 Codex Review
Reviewed commit: 90f58716f5
    const shellQuote = (s: string): string => `'${s.replace(/'/g, "'\\''")}'`;
    const reproCmd = `npx tsx src/cli.ts ${shellQuote(ctx.rulesPath)} ${shellQuote(ctx.flagsPath)} --html <out.html> --quiet`;
Render Windows-safe repro command in HTML report
The generated reproCmd always uses POSIX single-quote escaping, which is not valid quoting in cmd.exe and is interpreted differently in PowerShell. On Windows (especially when rulesPath or flagsPath contains spaces), the report’s suggested reproduction command can fail even though evaluation succeeded, undermining the “How to reproduce” path for auditors/operators on supported Windows environments.
PR #526 round 17 P2: the reproCmd block emitted only POSIX single-quote-escaped paths. cmd.exe doesn't recognise single quotes as path delimiters, and PowerShell interprets them differently from bash, so an auditor on Windows copying the 'How to reproduce' command got a syntax error even when the report itself was valid.
Now render both flavours side-by-side in the provenance section:
- POSIX (macOS / Linux / WSL / Git Bash): single-quote escape, embedded ' becomes '\''.
- Windows (cmd.exe / PowerShell): double-quote escape, embedded " becomes "" (Windows escape convention).
The reader picks the right one. Both are labelled. Updated the existing reproCmd test to assert both flavours appear and that the POSIX/Windows labels are present.
Suite: 173/173 passing.
💡 Codex Review
Reviewed commit: 4929c08d5f
    const reproCmdPosix = `npx tsx src/cli.ts ${posixQuote(ctx.rulesPath)} ${posixQuote(ctx.flagsPath)} --html <out.html> --quiet`;
    const reproCmdWindows = `npx tsx src/cli.ts ${windowsQuote(ctx.rulesPath)} ${windowsQuote(ctx.flagsPath)} --html <out.html> --quiet`;
Remove shell-redirection token from repro command
The generated reproduction command embeds --html <out.html>, but <...> is parsed by both POSIX shells and cmd.exe/PowerShell as input redirection, not a placeholder path. If an auditor copy-pastes this command as instructed, --html receives no value and the command fails (or behaves unexpectedly), so the report’s primary replay path is broken across environments. Use a literal sample filename (e.g. out.html) or quote the placeholder text so it is not treated as shell syntax.
…round 18) PR #526 round 18 P2: the reproCmd block printed '--html <out.html>' meaning the angle brackets as a placeholder, but POSIX shells and cmd.exe / PowerShell parse <...> as input redirection. A copy-paste would silently send no value to --html and the command would fail. Replaced with a literal 'out.html' filename — the reader edits it if they want a different path, but the copy-paste path works on every shell without modification. Suite: 173/173 still passing (existing repro tests check for the quoted paths, not the filename — both flavours intact).
💡 Codex Review
Reviewed commit: 51a3356ccf
    const windowsQuote = (s: string): string => `"${s.replace(/"/g, '""')}"`;
    // Use a literal filename, NOT `<out.html>`: angle brackets are shell
    // I/O redirection on both POSIX and cmd.exe / PowerShell, so a
    // copy-paste would silently consume the placeholder as a redirection
    // target and `--html` would receive no value (Codex round 18 P2).
    const reproCmdPosix = `npx tsx src/cli.ts ${posixQuote(ctx.rulesPath)} ${posixQuote(ctx.flagsPath)} --html out.html --quiet`;
    const reproCmdWindows = `npx tsx src/cli.ts ${windowsQuote(ctx.rulesPath)} ${windowsQuote(ctx.flagsPath)} --html out.html --quiet`;
Emit a PowerShell-literal repro command
The new Windows repro command is labeled for both cmd.exe and PowerShell, but it always wraps paths in double quotes (windowsQuote). In PowerShell, double-quoted strings are expandable, so path segments containing $... or $() are interpreted instead of passed literally, which can mutate the path or break copy-paste reproduction for affected checkouts. This is a regression from the prior single-quote style for PowerShell; emit a dedicated PowerShell form (single-quoted with escaped ') or escape PowerShell expansions before rendering.
Round 17's Windows form used double quotes for both cmd.exe and
PowerShell, but PowerShell double-quoted strings EXPAND $var and
$(...). A path like 'src/$something/x.json' would be interpreted
in PowerShell, a regression vs the prior POSIX form, which used
single quotes.
Split Windows into two flavours:
- cmd.exe: double-quote escape (" -> ""). cmd doesn't expand $.
- PowerShell: single-quote escape (' -> ''). Single quotes suppress
PowerShell expansion.
The HTML provenance block now renders three labelled forms instead
of two. POSIX still uses POSIX-style single-quote escape ('\'').
One new test asserts the PowerShell block uses single quotes; the
existing repro test updated to match three-form layout.
Suite: 174/174 passing.
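The three quoting flavours can be sketched as below. `posixQuote` follows the helper quoted earlier in this conversation. `cmdQuote` applies the same double-quote rule as the quoted `windowsQuote` under a hypothetical name, and `powershellQuote` is a hypothetical name for the new single-quote form; the committed helper names are not shown here.

```typescript
// POSIX shells: wrap in single quotes; an embedded ' becomes '\'' (close, escaped quote, reopen).
const posixQuote = (s: string): string => `'${s.replace(/'/g, "'\\''")}'`;

// cmd.exe: wrap in double quotes; an embedded " is doubled. (Hypothetical name.)
const cmdQuote = (s: string): string => `"${s.replace(/"/g, '""')}"`;

// PowerShell: single quotes are literal, so $var and $(...) are not expanded;
// an embedded ' is doubled. (Hypothetical name.)
const powershellQuote = (s: string): string => `'${s.replace(/'/g, "''")}'`;

// Builds the three labelled reproduce commands with a literal output filename.
function reproCommands(rulesPath: string, flagsPath: string): Record<string, string> {
  const build = (quote: (s: string) => string): string =>
    `npx tsx src/cli.ts ${quote(rulesPath)} ${quote(flagsPath)} --html out.html --quiet`;
  return {
    posix: build(posixQuote),
    cmd: build(cmdQuote),
    powershell: build(powershellQuote),
  };
}
```

For example, `powershellQuote("src/$something/x.json")` yields the path wrapped in single quotes with the `$` left untouched, which is the property the round 17 double-quoted form lacked.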
Summary
Adds experiments/rule-engine-poc/docs/report-reference.md, a single-page overview of the HTML report the POC produces.
The original POC PR (#525) merged with 5 separate research artifacts about the report (research/17–21) and the report's design scattered across architecture.md, workflow.md, and audit-trail.md. This doc consolidates all of that into one place for someone who wants to understand the report end-to-end without chasing references.
What's in it
… research/17), stakeholder strategy (18), brand (19), auditor readability (20), misread risks (21), each with the agent's top-line finding.
docs/README.md updated to index the new doc.
Test plan
typos --config _typos.toml passes.
Generated by Claude Code