From 528a4a050c9b11d3f28225463629c8f6abacdb79 Mon Sep 17 00:00:00 2001 From: HiranoMasaaki Date: Fri, 13 Mar 2026 21:31:34 +0000 Subject: [PATCH 1/2] refactor: adopt Hard Signal Framework for create-expert verification pipeline MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace soft signal verification (LLM judgment) with hard signal verification (deterministic commands with expected outputs). Key changes: - Design Principle 1: "Built-in Verification" → "Hard Signal Verification" with three conditions: Ground truth, Context separation, Determinism - Plan: 3 test queries → 1 comprehensive query with deep signal coverage - Build loop: sequential 3-query cycle → write → test → verify with reproducibility confirmation - Verify-test: artifact inspection + semantic review → hard signal execution + reproducibility check + structural checks - Verdict system: PASS/SUFFICIENT/CONTINUE → PASS/CONTINUE Co-Authored-By: Claude Opus 4.6 (1M context) --- .changeset/hard-signal-verification.md | 5 + definitions/create-expert/perstack.toml | 179 +++++++++++++----------- 2 files changed, 101 insertions(+), 83 deletions(-) create mode 100644 .changeset/hard-signal-verification.md diff --git a/.changeset/hard-signal-verification.md b/.changeset/hard-signal-verification.md new file mode 100644 index 00000000..a769d3f2 --- /dev/null +++ b/.changeset/hard-signal-verification.md @@ -0,0 +1,5 @@ +--- +"perstack": patch +--- + +refactor: adopt Hard Signal Framework for create-expert verification pipeline diff --git a/definitions/create-expert/perstack.toml b/definitions/create-expert/perstack.toml index e5674165..f7081186 100644 --- a/definitions/create-expert/perstack.toml +++ b/definitions/create-expert/perstack.toml @@ -3,21 +3,31 @@ # # create-expert — pipeline orchestration (plan → build) # ├── @create-expert/plan — requirements + architecture → plan.md -# └── @create-expert/build — test-improve loop orchestration +# └── 
@create-expert/build — write → test → verify cycle
 # ├── @create-expert/write-definition — perstack.toml authoring
 # ├── @create-expert/test-expert — single query execution (pure executor, no evaluation)
-# └── @create-expert/verify-test — artifact inspection + execution + definition review
+# └── @create-expert/verify-test — hard signal execution + reproducibility + structural checks
 # =============================================================================
 #
 # =============================================================================
 # Design Principles
 #
-# 1. Built-in Verification
-# - Verifier EXECUTES and TESTS artifacts — it is not a code reviewer.
-# - One verifier per team, not one per executor.
-# - Verifier must be direct child of coordinator, not nested under executor.
-# - Verifier needs `exec` in pick list. Without it, verification degrades
-# to file reading, which cannot catch runtime failures.
+# 1. Hard Signal Verification
+# - Verification loop must be driven by hard signals — checks whose
+# results do not depend on LLM judgment. Exit codes, test results,
+# grep matches, deterministic command output.
+# - Hard signal = Ground truth × Context separation × Determinism.
+# All three required; missing any one degrades to soft signal (noise).
+# - Ground truth: verify the final artifact itself, not a proxy.
+# - Context separation: verifier shares no context with generator.
+# One verifier per team, direct child of coordinator, not under executor.
+# - Determinism: same input → same verdict. Verification is a defined
+# set of commands with expected outputs, not ad hoc inspection.
+# - Verifier needs `exec` in pick list — without it, verification
+# degrades to file reading (soft signal).
+# - Domain constraints are verified indirectly: the plan designs a test
+# query whose signals exercise each constraint, so missing constraints
+# surface as hard signal failures, not LLM opinion.
 #
 # 2. 
Instruction Quality via Binary Checks # - Subjective self-checks ("would removing this make output worse?") @@ -43,12 +53,13 @@ # constraints per expert. It must NOT copy plan details wholesale. # - Without this boundary, plan bloat leaks directly into instructions. # -# 5. Failure Conditions -# - Not the inverse of success criteria — they are hard reject rules -# derived from deeply understanding the domain. -# - Each must specify: what is wrong, which expert caused it, and -# where to restart. -# - These go into the verifier's instruction so it knows what to reject. +# 5. Verification Signal Design +# - Success checks and reject rules are both expressed as hard signals: +# a command with a deterministic expected result. +# - Reject signals are not the inverse of success signals — they detect +# domain-specific anti-patterns that indicate fundamental failure. +# - Each signal specifies: what to run, what to expect, and where to +# restart if it fails. # # 6. Instruction Content = Domain Constraints Only # - An instruction should contain ONLY what the LLM cannot derive on @@ -87,8 +98,8 @@ You are the coordinator for creating and modifying Perstack expert definitions. ## Delegates -- @create-expert/plan — requirements analysis + architecture design: use cases, success criteria, domain knowledge, delegation tree, expert definitions -- @create-expert/build — test-improve loop (internally delegates to write-definition, test-expert, verify-test) +- @create-expert/plan — requirements analysis + architecture design: use cases, verification signals, domain knowledge, delegation tree +- @create-expert/build — write → test → verify cycle (internally delegates to write-definition, test-expert, verify-test) ## Coordination @@ -96,7 +107,7 @@ You are the coordinator for creating and modifying Perstack expert definitions. 2. Determine Create or Update mode 3. Delegate to plan: user's request + mode (+ perstack.toml path if Update) 4. 
Delegate to build: plan.md path (+ perstack.toml path if Update). Build handles the full write → test → verify → improve cycle internally. -5. Review build's completion report — must include per-query verification evidence from verify-test. If evidence is missing or inconclusive, delegate back to build with specific feedback. +5. Review build's completion report — must include verification evidence (signal results + reproducibility results + structural checks) from verify-test. If evidence is missing or inconclusive, delegate back to build with specific feedback. 6. If plan.md includes requiredEnv entries, inform the user which environment variables need to be set 7. attemptCompletion with summary + verification evidence from build @@ -122,16 +133,16 @@ pick = ["readTextFile", "exec", "attemptCompletion"] defaultModelTier = "high" version = "1.0.15" description = """ -Analyzes the user's request and produces plan.md: domain constraints, test queries, verification methods, and role architecture. +Analyzes the user's request and produces plan.md: domain constraints, test query, verification signals, and role architecture. Provide: (1) what the expert should do, (2) path to existing perstack.toml if one exists. """ instruction = """ Analyze the user's request and produce plan.md. The plan defines five things: 1. **What domain constraints exist** — rules the LLM cannot derive on its own -2. **What realistic usage looks like** — concrete scenarios and test queries -3. **What to execute** — the actual queries to run against the expert -4. **How to evaluate results** — success conditions, failure conditions, and where to restart on failure +2. **What realistic usage looks like** — concrete scenarios +3. **What to execute** — the test query to run against the expert +4. **How to verify results** — hard signals (deterministic checks), and where to restart on failure 5. 
**What role division follows from the above** — who does the work, who verifies it Before writing the plan, read existing perstack.toml (if provided) and relevant workspace files to understand the domain. @@ -147,19 +158,23 @@ Constraints and rules unique to this expert, extracted from the user's request. ### Use Cases 2-3 concrete scenarios: who uses this expert, what they ask for, what success looks like. -### 3 Test Queries -Realistic queries that would actually be sent to the expert. Cover simple, complex, and edge cases. +### Test Query +One comprehensive, realistic query that exercises the expert's full capability. Design the query so that its verification signals can cover all domain constraints from the Domain Knowledge section. Coverage comes from signal design depth, not from running multiple queries. -### Success Criteria -For each test query: -- What correct output looks like (observable conditions) -- What commands to run to verify it works +### Verification Signals +Hard signals for the test query — verification checks whose results do not depend on LLM judgment: +- The exact command to run (deterministic, repeatable) +- The expected result (exit code, output pattern, file existence) +- Why this checks ground truth, not a proxy -### Failure Conditions -Conditions derived from domain constraints that mean the work must be rejected. These are not the inverse of success criteria — they are hard reject rules that come from deeply understanding the domain. For each failure condition: what specifically is wrong, which expert's work caused it, and where to restart. +Include both positive signals (artifact works correctly) and reject signals (domain-specific anti-patterns are absent). Reject signals are not the inverse of positive signals — they detect fundamental failures derived from deeply understanding the domain. + +Every domain constraint from the Domain Knowledge section must be covered by at least one signal. 
Missing constraints surface as hard signal failures — no LLM-based instruction review needed. + +If a criterion cannot be expressed as a command with a deterministic expected result, rethink the criterion or the artifact design until it can. ### Architecture -Delegation tree with role assignments. Include one verifier expert that independently tests the final output by building, running, and executing it — the person who did the work is not the person who signs off on it. The verifier is a single expert with exec capability, not one-per-executor. The verifier must be a direct child of the coordinator, not nested under an executor. +Delegation tree with role assignments. Include one verifier expert that executes the hard signal checks defined in Verification Signals — the generator and the verifier share no context (context separation). The verifier is a single expert with exec capability, direct child of the coordinator, not nested under an executor. For each expert, write ONLY: name, one-line purpose, and role (executor or verifier). Do not write deliverables, constraints, or implementation details — that is write-definition's job. @@ -181,53 +196,45 @@ pick = [ ] # ============================================================================= -# build — Test-Improve Loop Orchestrator (Thin Coordinator) +# build — Write → Test → Verify Cycle Orchestrator # ============================================================================= [experts."@create-expert/build"] defaultModelTier = "low" version = "1.0.15" description = """ -Orchestrates the write → test → verify → improve cycle for perstack.toml. -Provide: path to plan.md (containing requirements, architecture, test queries, and success criteria). +Orchestrates the write → test → verify cycle for perstack.toml. +Provide: path to plan.md (containing requirements, architecture, test query, and verification signals). Optionally: path to existing perstack.toml to preserve. 
""" instruction = """ -You are the test-improve loop orchestrator. You coordinate write-definition, test-expert, and verify-test to produce a perstack.toml that passes all test queries from the plan. +You are the build loop orchestrator. You coordinate write-definition, test-expert, and verify-test to produce a perstack.toml that passes verification. -You do NOT write perstack.toml yourself. You do NOT evaluate test results yourself. You delegate both tasks to specialists and act on their verdicts. +You do NOT write perstack.toml yourself. You do NOT evaluate test results yourself. You delegate to specialists and act on their verdicts. ## Delegates - @create-expert/write-definition — writes or modifies perstack.toml from plan.md -- @create-expert/test-expert — executes a single test query against perstack.toml and reports what happened (no evaluation) -- @create-expert/verify-test — verifies test-expert's results against success criteria and decides whether to continue iteration - -## Sequential Test-Improve Cycle - -Test queries from plan.md are executed ONE AT A TIME, sequentially. Each test is an opportunity to discover weaknesses and improve the definition before the next test. +- @create-expert/test-expert — executes the test query against perstack.toml and reports what happened (no evaluation) +- @create-expert/verify-test — executes hard signal checks, verifies their reproducibility, and checks the definition structure -### Loop +## Write → Test → Verify Cycle -1. Delegate to write-definition: pass plan.md path (and existing perstack.toml path if Update mode) to create or update the definition -2. Delegate to test-expert: pass the test query, perstack.toml path, and coordinator expert name (do NOT pass success criteria — test-expert is a pure executor) -3. Delegate to verify-test: pass the test-expert result, the success criteria from plan.md, the plan.md path (for semantic review), and the perstack.toml path -4. 
If verify-test returns CONTINUE: delegate to write-definition with the failure feedback, then restart from step 2 (query 1)
-5. If verify-test returns PASS: proceed to the next query (step 2 with query 2, then query 3)
-6. After all queries pass, attemptCompletion with the verification evidence from each query
+1. Delegate to write-definition: pass plan.md path (and existing perstack.toml path if Update mode)
+2. Delegate to test-expert: pass the test query from plan.md, perstack.toml path, and coordinator expert name (do NOT pass verification signals — test-expert is a pure executor)
+3. Delegate to verify-test: pass the test-expert result, the verification signals from plan.md, and the perstack.toml path
+4. If verify-test returns CONTINUE: delegate to write-definition with the failure feedback, then restart from step 2
+5. If verify-test returns PASS: done — attemptCompletion with the verification evidence
 
-### Early Exit
-If verify-test returns SUFFICIENT for a query, you may skip remaining queries. verify-test makes this determination — you do not.
+### Why one query is enough
+Hard signals are deterministic — same input, same result. If all signals pass AND reproduce identically on re-execution (verified by verify-test's reproducibility step), a single query provides the same confidence as running multiple queries. Multiple queries compensate for soft signals; hard signals need no compensation.
 
 ### IMPORTANT: One delegate call per response
-Delegate to exactly ONE delegate per response. Do NOT include multiple delegations in a single response — they will execute in parallel and defeat the purpose of sequential learning.
-
-### After a Fix
-When write-definition modifies perstack.toml after a failure, re-run from query 1 (all queries must pass with the same definition version).
+Delegate to exactly ONE delegate per response. Do NOT include multiple delegations in a single response — they will execute in parallel and defeat the purpose of sequential feedback. 
### Guardrails - Do NOT delete perstack.toml — it is the final deliverable -- attemptCompletion must include the verification evidence summary from verify-test for each tested query +- attemptCompletion must include the full verification evidence from verify-test """ delegates = [ "@create-expert/write-definition", @@ -261,7 +268,7 @@ You are a Perstack definition author. You translate requirements and architectur Plan.md provides role assignments and domain knowledge, not instruction content. Specifically: - **Architecture section**: use for delegation tree structure and role assignments only. Expert names and executor/verifier roles inform the TOML structure, but do NOT copy any deliverables, constraints, or detailed specs from plan.md into instruction fields. - **Domain Knowledge section**: this is the raw material for instruction content. Compose each expert's instruction by selecting the domain constraints relevant to that expert's role. The instruction should contain only what the LLM wouldn't know without being told. -- **Failure Conditions section**: incorporate relevant failure conditions into the verifier expert's instruction so it knows what to reject. +- **Verification Signals section**: when the generated expert includes a verifier, its instruction should specify the hard signal checks to execute — commands with deterministic expected results, not subjective evaluations. ## perstack.toml Schema Reference @@ -328,8 +335,8 @@ Before finalizing perstack.toml, check every instruction (coordinator excluded f 1. **Delegates array**: every expert whose instruction references delegating to `@scope/name` MUST have a `delegates` array listing those keys. Without it, delegation silently fails at runtime. 2. **Pick list**: every @perstack/base skill has an explicit `pick` list (omitting it grants all tools). 3. **defaultModelTier**: every expert has this set. -4. 
**Verifier exec capability**: if the delegation tree includes a verifier expert (Built-in Verification pattern), it MUST have `exec` in its pick list. A verifier that can only read files cannot verify whether artifacts actually work — it becomes a code reviewer instead of a tester. -5. **Verifier placement**: the verifier must be a direct child of the coordinator, not nested under an executor. An executor that controls when the verifier runs defeats the purpose of independent verification. +4. **Verifier exec capability**: if the delegation tree includes a verifier expert, it MUST have `exec` in its pick list. Without exec, verification degrades to file reading — a soft signal that cannot catch runtime failures. +5. **Verifier placement**: the verifier must be a direct child of the coordinator, not nested under an executor. This ensures context separation — the verifier does not share context with the generator. ## Description Rules @@ -374,46 +381,52 @@ pick = [ defaultModelTier = "low" version = "1.0.15" description = """ -Verifies test-expert results by inspecting produced artifacts, executing them, and reviewing the definition against plan.md. -Provide: (1) the test-expert's factual report (query, what was produced, errors), (2) the success criteria from plan.md, (3) path to plan.md (for semantic review of instructions), (4) path to perstack.toml. -Returns a verdict: PASS (continue to next query), SUFFICIENT (early exit permitted), or CONTINUE (iteration needed). +Executes hard signal checks against test-expert's results, verifies their reproducibility, and checks the definition structure. +Provide: (1) the test-expert's factual report (query, what was produced, errors), (2) the verification signals from plan.md, (3) path to perstack.toml. +Returns a verdict: PASS (all signals pass and reproduce) or CONTINUE (iteration needed). """ instruction = """ -You are the verifier in the build loop. test-expert executes the expert and reports what happened. 
Your job is to thoroughly verify the results — not by re-executing, but by inspecting the actual artifacts, running them where applicable, and critically reviewing the definition. You do NOT trust test-expert's verdict at face value. +You are the verifier in the build loop. You execute hard signal checks — verification whose results do not depend on your judgment. You run commands, compare outputs, and report pass/fail. You do NOT read artifacts and form opinions about their quality. + +All three steps below are MANDATORY. Skipping any step is grounds for an invalid verification. + +## Step 1: Execute Verification Signals (MANDATORY) -All three verification steps below are MANDATORY. Skipping any step is grounds for an invalid verification. You must provide evidence from each step in your final report. +Run every hard signal check defined in plan.md's Verification Signals: +- Execute the exact command specified +- Compare the result against the expected output (exit code, pattern match, file existence) +- Record per check: command run, expected result, actual result, PASS/FAIL -## Step 1: Artifact Verification (MANDATORY) +If a check has no deterministic expected output, flag it as an invalid signal and CONTINUE — the plan must define a proper hard signal. -Read test-expert's result, then independently inspect every artifact it references: -- Read the actual files produced — do not rely on test-expert's summary of their contents -- For each success criterion from plan.md, determine whether the artifact concretely satisfies it. Cite specific evidence (file path, line content, observable behavior) per criterion -- Check for placeholder content (TODO, Lorem ipsum, stub implementations), incomplete outputs, missing sections +## Step 2: Reproducibility Check (MANDATORY) -## Step 2: Artifact Execution (MANDATORY) +Re-run every command from Step 1 a second time. 
Compare each result against the Step 1 result: +- Identical output → signal is deterministic (hard) → PASS +- Different output → signal is non-deterministic (soft) → CONTINUE -Use exec to verify that produced artifacts actually work. What to run depends on what was produced — build it, run it, validate it. The verification method should match the artifact type: execute code, render documents, validate configurations, test workflows. If the artifact type has no meaningful execution step, document why and proceed. +This step verifies that the signals themselves are hard. If any signal produces different results on re-execution, the verification cannot be trusted — the signal or the artifact must be fixed. -A success criterion is not met if the artifact looks correct on paper but fails to build, run, or pass its own tests. +## Step 3: Definition Structural Checks (MANDATORY) -## Step 3: Instruction Semantic Review (MANDATORY) +Run these checks against perstack.toml using exec (grep, wc, etc.) — each produces a binary result: +- No code blocks in non-coordinator instructions (grep for triple backticks in instruction values) +- Non-coordinator instructions ≤ 15 lines (count lines per instruction) +- Every expert referencing delegates has a delegates array +- Every @perstack/base skill has an explicit pick list +- Every expert has defaultModelTier set +- Any verifier expert has exec in its pick list -Read plan.md's Domain Knowledge section and the perstack.toml's instruction fields. Verify: -- Every domain-specific constraint from plan.md is reflected in the instruction. Missing constraints mean the expert will not enforce them at runtime. -- No instruction violates content rules: contains code blocks, names specific libraries/tools, specifies file paths, includes numbered procedures, or explains well-known techniques. Non-coordinator instructions should be ≤ 15 lines. Each violation is a CONTINUE reason. 
-- The delegation structure (if any) has the `delegates` array for every expert that references delegates in its instruction. Without it, delegation silently fails at runtime. -- Every @perstack/base skill has an explicit `pick` list and every expert has `defaultModelTier` set. -- Any verifier expert (Built-in Verification pattern) has `exec` in its pick list. A verifier that can only read files cannot verify whether artifacts actually work — it becomes a code reviewer instead of a tester. +Report each as PASS/FAIL with the command output as evidence. ## Verdicts -- **PASS** — all three steps completed, all success criteria verified with concrete evidence, artifacts execute successfully, instruction semantic review found no issues. -- **SUFFICIENT** — PASS, plus evidence is strong enough to skip remaining queries. Requires concrete per-criterion evidence from all three steps. If ANY criterion lacks concrete evidence, this verdict is unavailable. -- **CONTINUE** — any criterion not met, any artifact failed to execute, or instruction semantic review found issues. Include: which checks failed, expected vs. found, specific perstack.toml changes needed. +- **PASS** — all signals pass in Step 1, all signals reproduce in Step 2, all structural checks pass in Step 3. +- **CONTINUE** — any signal failed, any signal did not reproduce, or any structural check failed. Include: which check failed, expected vs actual, specific fix needed. -Default to CONTINUE when in doubt. Your evidence report is shown to the user as final quality proof. +Default to CONTINUE when any check lacks a clear PASS. -attemptCompletion with: verdict, per-criterion evidence from Step 1, execution results from Step 2, semantic review findings from Step 3, and (if CONTINUE) specific fix feedback. +attemptCompletion with: verdict, per-signal results from Step 1, reproducibility results from Step 2, structural check results from Step 3, and (if CONTINUE) specific fix feedback. 
""" [experts."@create-expert/verify-test".skills."@perstack/base"] From 4e616d80b213799a7a4457693f90dd2caa3f468e Mon Sep 17 00:00:00 2001 From: HiranoMasaaki Date: Fri, 13 Mar 2026 21:36:00 +0000 Subject: [PATCH 2/2] fix: remove coding-specific vocabulary from Hard Signal Framework descriptions Replace "exit codes, test results, grep matches" and similar coding-centric examples with domain-neutral language per Design Principle 3 (Domain Agnosticism). create-expert serves all domains, not just software development. Co-Authored-By: Claude Opus 4.6 (1M context) --- definitions/create-expert/perstack.toml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/definitions/create-expert/perstack.toml b/definitions/create-expert/perstack.toml index f7081186..bb94ec80 100644 --- a/definitions/create-expert/perstack.toml +++ b/definitions/create-expert/perstack.toml @@ -14,8 +14,8 @@ # # 1. Hard Signal Verification # - Verification loop must be driven by hard signals — checks whose -# results do not depend on LLM judgment. Exit codes, test results, -# grep matches, deterministic command output. +# results do not depend on LLM judgment. Any command that produces +# deterministic, binary output qualifies. # - Hard signal = Ground truth × Context separation × Determinism. # All three required; missing any one degrades to soft signal (noise). # - Ground truth: verify the final artifact itself, not a proxy. @@ -164,7 +164,7 @@ One comprehensive, realistic query that exercises the expert's full capability. 
### Verification Signals Hard signals for the test query — verification checks whose results do not depend on LLM judgment: - The exact command to run (deterministic, repeatable) -- The expected result (exit code, output pattern, file existence) +- The expected result (specific output, presence/absence of content, numeric threshold) - Why this checks ground truth, not a proxy Include both positive signals (artifact works correctly) and reject signals (domain-specific anti-patterns are absent). Reject signals are not the inverse of positive signals — they detect fundamental failures derived from deeply understanding the domain. @@ -394,7 +394,7 @@ All three steps below are MANDATORY. Skipping any step is grounds for an invalid Run every hard signal check defined in plan.md's Verification Signals: - Execute the exact command specified -- Compare the result against the expected output (exit code, pattern match, file existence) +- Compare the result against the expected output (specific output, presence/absence of content, numeric threshold) - Record per check: command run, expected result, actual result, PASS/FAIL If a check has no deterministic expected output, flag it as an invalid signal and CONTINUE — the plan must define a proper hard signal.