From 0a3857e7f7ec2e3f546a7f6b06a94e52559d94f0 Mon Sep 17 00:00:00 2001
From: Linh Ngo <thlinh.ngo@gmail.com>
Date: Sat, 30 May 2026 23:14:18 +0700
Subject: [PATCH 1/3] feat(skills): integrate sk harness quality gates into
 workflow skills (#735)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- tentacle-orchestration/SKILL.md: add Harness gate row to Phase 3
  verification table + note to run sk harness check before Phase 4
  when harness.yaml exists
- task-step-generator/SKILL.md: extend TEST phase template to include
  sk harness check when harness.yaml exists
- project-onboarding/SKILL.md: add step 0.4 Harness Engineering —
  sk harness init, verify harness.yaml, sk harness doctor
- karpathy-guidelines/SKILL.md: link harness.yaml + sk harness check
  as the permanent mechanism for verifiable success criteria
- workflow-creator/SKILL.md: add HARNESS gate row to gate table example

Closes #735

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/skills/karpathy-guidelines/SKILL.md   | 139 ++++++++
 .github/skills/project-onboarding/SKILL.md    | 309 ++++++++++++++++++
 .github/skills/task-step-generator/SKILL.md   | 232 +++++++++++++
 .../references/step-file-template.md          | 174 ++++++++++
 .../skills/tentacle-orchestration/SKILL.md    |   3 +-
 .github/skills/workflow-creator/SKILL.md      | 159 +++++++++
 .../references/workflow-template.md           |  93 ++++++
 7 files changed, 1108 insertions(+), 1 deletion(-)
 create mode 100644 .github/skills/karpathy-guidelines/SKILL.md
 create mode 100644 .github/skills/project-onboarding/SKILL.md
 create mode 100644 .github/skills/task-step-generator/SKILL.md
 create mode 100644 .github/skills/task-step-generator/references/step-file-template.md
 create mode 100644 .github/skills/workflow-creator/SKILL.md
 create mode 100644 .github/skills/workflow-creator/references/workflow-template.md

diff --git a/.github/skills/karpathy-guidelines/SKILL.md b/.github/skills/karpathy-guidelines/SKILL.md
new file mode 100644
index 00000000..9d6f90bb
--- /dev/null
+++ b/.github/skills/karpathy-guidelines/SKILL.md
@@ -0,0 +1,139 @@
+---
+name: karpathy-guidelines
+description: Behavioral guidelines to reduce common LLM coding mistakes. Use when writing, reviewing, or refactoring code to avoid overcomplication, make surgical changes, surface assumptions, and define verifiable success criteria.
+license: MIT
+vendored-from: https://github.com/forrestchang/andrej-karpathy-skills
+vendored-commit: main
+supported-hosts: Copilot CLI, Claude Code
+---
+
+# Karpathy Guidelines
+
+Behavioral guidelines to reduce common LLM coding mistakes, derived from [Andrej Karpathy's observations](https://x.com/karpathy/status/2015883857489522876) on LLM coding pitfalls.
+
+**Tradeoff:** These guidelines bias toward caution over speed. For trivial tasks, use judgment.
+
+## When to Use
+
+Invoke this skill when:
+- Writing new code and at risk of over-engineering
+- Reviewing or refactoring existing code
+- Debugging a bug (need verifiable fix criteria)
+- Starting a multi-step task (need a scoped plan)
+- Feeling tempted to "improve" adjacent unrelated code
+
+## Workflow
+
+Apply the four guidelines in order for any non-trivial coding task:
+
+### 1. Think Before Coding
+
+**Don't assume. Don't hide confusion. Surface tradeoffs.**
+
+Before implementing:
+- State your assumptions explicitly. If uncertain, ask.
+- If multiple interpretations exist, present them - don't pick silently.
+- If a simpler approach exists, say so. Push back when warranted.
+- If something is unclear, stop. Name what's confusing. Ask.
+
+### 2. Simplicity First
+
+**Minimum code that solves the problem. Nothing speculative.**
+
+- No features beyond what was asked.
+- No abstractions for single-use code.
+- No "flexibility" or "configurability" that wasn't requested.
+- No error handling for impossible scenarios.
+- If you write 200 lines and it could be 50, rewrite it.
+- Mandatory Project Rule 10, **Minimum Footprint** (canonical text:
+  `docs/AGENT-RULES.md#rule-10--minimum-footprint`): no unjustified new
+  files, no speculative abstractions, reuse existing patterns first, decompose
+  functions over 50 lines or explain why not, flag files over 400 lines, and
+  keep every changed line traceable to the task.
+
+Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.
+
+### 3. Surgical Changes
+
+**Touch only what you must. Clean up only your own mess.**
+
+When editing existing code:
+- Don't "improve" adjacent code, comments, or formatting.
+- Don't refactor things that aren't broken.
+- Match existing style, even if you'd do it differently.
+- If you notice unrelated dead code, mention it - don't delete it.
+
+When your changes create orphans:
+- Remove imports/variables/functions that YOUR changes made unused.
+- Don't remove pre-existing dead code unless asked.
+
+The test: Every changed line should trace directly to the user's request.
+
+Project Rule 11, **New File Justification** (canonical text:
+`docs/AGENT-RULES.md#rule-11--new-file-justification`): before adding a file,
+search for an existing home, state the new file's responsibility, wire it into
+the relevant lint/test/docs/CI surface, and add or update tests for the behavior
+it owns.
+
+### 4. Goal-Driven Execution
+
+**Define success criteria. Loop until verified.**
+
+Transform tasks into verifiable goals:
+- "Add validation" → "Write tests for invalid inputs, then make them pass"
+- "Fix the bug" → "Write a test that reproduces it, then make it pass"
+- "Refactor X" → "Ensure tests pass before and after"
+
+For multi-step tasks, state a brief plan:
+```
+1. [Step] → verify: [check]
+2. [Step] → verify: [check]
+3. [Step] → verify: [check]
+```
+
+Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.
+
+Use `harness.yaml` to encode success criteria permanently: run `sk harness init` to scaffold the template, edit `success_criteria:` to match your goals, then run `sk harness check` after every implementation step. This turns the verify loop into a single command.
+
+**Within orchestrated workflows (multi-tentacle goals):** Apply this principle at the goal level, not just the task level. Before dispatching any tentacle:
+1. Write the goal's success criteria explicitly (e.g., in `CONTEXT.md` or the first tentacle's description).
+2. After all verification gates pass, evaluate the goal against those criteria.
+3. If unmet, loop: plan new tentacles scoped to remaining gaps. If met, proceed to commit and close.
+
+The loop-until-verified semantic applies at every level — individual task, tentacle, and orchestrated goal — and produces the same result: no "done" claim without evidence.
+
+## Examples
+
+<example>
+User asks: "Add dark mode support to the dashboard"
+
+✅ Karpathy-aligned response:
+1. Clarify scope: "Do you want CSS variables only, or also a toggle UI element?"
+2. After clarification: implement exactly what was asked — no additional themes, no config system
+3. Success criteria: "Dark mode renders correctly on dashboard; no other pages changed"
+</example>
+
+<example>
+User asks: "Fix the login bug"
+
+✅ Karpathy-aligned response:
+1. State assumption: "I assume the bug is the 401 on expired tokens based on recent context"
+2. Write a failing test that reproduces the bug
+3. Make the minimal change to pass the test
+4. Verify: no other tests broken, no adjacent refactoring done
+</example>
+
+<example>
+User asks: "Refactor the data processing module"
+
+✅ Karpathy-aligned response:
+1. Ask: "What's the goal — readability, performance, or testability?"
+2. After answer: change only what serves the stated goal
+3. Match existing code style even if you'd write it differently
+4. Ensure all existing tests still pass before and after
+</example>
+
+## Attribution
+
+Vendored from [forrestchang/andrej-karpathy-skills](https://github.com/forrestchang/andrej-karpathy-skills) under MIT license.
+Original guidelines derived from Andrej Karpathy's public observations on LLM coding pitfalls.
diff --git a/.github/skills/project-onboarding/SKILL.md b/.github/skills/project-onboarding/SKILL.md
new file mode 100644
index 00000000..80cc8a65
--- /dev/null
+++ b/.github/skills/project-onboarding/SKILL.md
@@ -0,0 +1,309 @@
+---
+name: project-onboarding
+description: Complete guide to set up the full AI-assisted development ecosystem for any project. Use when joining a new project, bootstrapping AI tools, initializing Copilot, or onboarding a codebase so no creator, hook, workflow, or routing layer is missed.
+---
+
+# Project Onboarding
+
+Set up the complete AI-assisted development ecosystem for any project.
+Each phase builds on the previous one. **Skipping a phase leaves a gap
+that compounds downstream.** Follow the phases in order.
+
+## When to Use
+
+- Joining a new codebase and you want the full Copilot/agent setup done in the right order
+- User mentions "setup AI tools", "bootstrap project", "initialize copilot", or "onboard project"
+- You want to ensure memory, hooks, workflows, tentacles, and conductor are all installed
+- You need one canonical checklist instead of running creators piecemeal
+
+## Why this matters
+
+Without a structured onboarding, teams cherry-pick tools and miss critical
+infrastructure. The AI has no memory across sessions, no guardrails against
+dangerous commands, no workflow phases, and picks different skills each time.
+This guide ensures every layer is in place before you write the first line of code.
+
+Equally important: **deploying too much at once is its own failure mode.**
+Over-broad instructions (`applyTo: "**/*"`) injected at every tool call, duplicate
+skills across global and project surfaces, and routing rules that drift from what
+is actually installed all compound into a bloated context that slows the AI and
+masks real errors. This guide teaches both what to install and how to keep it lean.
+
+## Staged Rollout Principles
+
+Follow these principles throughout every phase. Violating them is how projects
+end up with 100+ skills loaded simultaneously and 9 instructions firing on every
+tool call.
+
+1. **Minimal first.** Deploy only what the current phase needs. Add more only after
+   the previous layer is verified clean. Resist installing "just in case" extras.
+
+2. **Progressive escalation.** Each creator output is an input to the next. Do not
+   run conductor-creator before you have agents and workflows to route — the rules
+   it generates will be empty or wrong.
+
+3. **Verify before advancing.** Each phase has an explicit verify step. Do not move
+   on if the verify command returns errors or warnings. Fix the gap now; it compounds.
+
+4. **Load budget awareness.** Every `applyTo: "**/*"` instruction fires on every tool
+   call. Every global skill adds to the catalog the AI must scan. After setup, audit:
+   - Instruction count: aim for ≤6 always-loaded instructions.
+   - `applyTo` scope: narrow to the file types or paths that actually need the rule.
+   - Skills: remove project-local copies of any skill that exists in `~/.copilot/skills/`.
+   - Duplicates: if a skill name appears in both global and project surfaces, keep only one.
+
+5. **Single source of truth.** If a skill is deployed globally (in `~/.copilot/skills/`),
+   do not re-deploy it in `.github/skills/`. The project copy silently duplicates context
+   without adding value.
+
+## Overview
+
+```
+Phase 0: FOUNDATION  ->  Memory + Agents + Safety + Harness
+Phase 1: PROCESS     ->  Workflows + Orchestration
+Phase 2: ROUTING     ->  Conductor ties everything together
+Phase 3: VERIFY      ->  Confirm zero gaps and healthy load budget
+```
+
+## Phase 0: Foundation
+
+Run these three creators first, then initialize the harness (0.4). Everything else
+depends on agents, memory, guardrails, and success criteria being in place.
+
+### 0.1 Session Knowledge
+
+**Invoke:** `session-knowledge-creator` skill
+
+**Output:** `briefing.py`, `learn.py`, `.instructions.md`
+
+**Why it matters:** Without session knowledge, every session starts from zero.
+The AI repeats mistakes, forgets past decisions, and has no institutional memory.
+Briefing gives pre-task context; learn records post-task insights.
+
+**Verify:** `sk briefing --wakeup` returns output (fallback: `python3 ~/.copilot/tools/briefing.py --wakeup`).
+
+### 0.2 Agent Creator
+
+**Invoke:** `agent-creator` skill
+
+**Output:** `.github/agents/*.agent.md`
+
+**Why it matters:** Generic agents produce generic output. Specialized agents
+encode your architecture, test framework, and domain knowledge.
+
+**Verify:** `ls .github/agents/` shows 5-8 agent files.
+
+### 0.3 Hook Creator
+
+**Invoke:** `hook-creator` skill
+
+**Output:** `.github/hooks/hooks.json` + `scripts/`
+
+**Why it matters:** Hooks are the **strongest enforcement** mechanism.
+They physically intercept and block violations before they happen. Unlike
+skills (AI can ignore) or instructions (AI can rationalize skipping), hooks
+run on every tool call. They prevent commits to protected branches, block
+credential leaks, and guard auto-generated files.
+
+**Verify:** `cat .github/hooks/hooks.json` lists preToolUse/postToolUse hooks.
+
+### 0.4 Harness Engineering
+
+**Invoke:** `sk harness init`
+
+**Output:** `harness.yaml` + `.harness/` directory
+
+**Why it matters:** Harness configuration defines verifiable success criteria and produces a documented 36% agent performance improvement (CORE benchmark, arXiv 2412.04524). Without a harness, "done" has no evidence.
+
+**Verify:** `sk harness doctor` returns no errors, and `cat harness.yaml` shows a `success_criteria:` block.
+
+## Phase 1: Process
+
+With agents and safety in place, define **how work gets done.**
+
+### 1.1 Workflow Creator
+
+**Invoke:** `workflow-creator` skill
+
+**Output:** `WORKFLOW.md` or a `strict-tdd-workflow` skill
+
+**Why it matters:** Without phases, the AI jumps straight to coding before
+understanding requirements, skips testing, and produces unreviewed output.
+Workflows add blocking quality gates between phases.
+
+**Verify:** A `WORKFLOW.md` or `.github/skills/*workflow*/SKILL.md` exists.
+
+### 1.2 Tentacle Creator
+
+**Invoke:** `tentacle-creator` skill
+
+**Output:** `.github/skills/tentacle-orchestration/SKILL.md`
+
+**Why it matters:** Tasks spanning multiple modules run serially without
+orchestration. Tentacle breaks them into parallel work units with clear file
+ownership so agents do not overwrite each other.
+
+**Verify:** `.github/skills/tentacle-orchestration/SKILL.md` exists.
+
+## Phase 2: Routing
+
+The conductor connects **everything from Phase 0 and Phase 1** into a single
+deterministic router.
+
+### 2.1 Conductor Creator
+
+**Invoke:** `conductor-creator` skill
+
+**Output:**
+- `.github/skills/conductor/scripts/conductor.py` (engine)
+- `.github/skills/conductor/scripts/conductor-rules.json` (rules)
+- `.github/instructions/conductor-routing.instructions.md` (auto-load)
+
+**Why it matters:** Without a conductor, the AI re-derives routing every time,
+picking different skills and workflows for the same task across sessions.
+The conductor makes this deterministic: same input = same plan.
+
+**Verify:**
+
+```bash
+python3 .github/skills/conductor/scripts/conductor.py --sync
+# Expect: 0 new unrouted, 0 stale, 100% coverage
+```
+
+## Phase 3: Verify
+
+Run these checks to confirm **zero gaps** in the setup AND a healthy load budget.
+
+```bash
+# 3.1 Sync check — routing rules match installed skills
+python3 .github/skills/conductor/scripts/conductor.py --sync
+
+# 3.2 Rule audit — no orphan or stale references
+python3 .github/skills/conductor/scripts/conductor.py --audit
+
+# 3.3 Test suite
+python3 .github/skills/conductor/scripts/test-conductor.py
+
+# 3.4 Smoke test
+python3 .github/skills/conductor/scripts/conductor.py "implement user login" --verbose
+python3 .github/skills/conductor/scripts/conductor.py "fix crash on startup" --verbose
+```
+
+### 3.5 Load Budget Audit
+
+After conductor passes, audit the instruction and skill surfaces to prevent
+context bloat from accumulating silently.
+
+```bash
+# List always-loaded instructions (applyTo: **/*); append `| wc -l` to count. Files with no applyTo filter are not matched.
+grep -rl 'applyTo.*\*\*/\*' .github/instructions/ ~/.github/instructions/ 2>/dev/null
+
+# Find duplicate skill names across global and project
+comm -12 \
+  <(ls ~/.copilot/skills/ 2>/dev/null | sort) \
+  <(ls .github/skills/ 2>/dev/null | sort)
+# Any name printed here = duplicate. Remove the project copy if the global copy exists.
+
+# List skills that are no longer referenced in conductor routing
+# --sync reports coverage gaps; use it as the authoritative check:
+python3 .github/skills/conductor/scripts/conductor.py --sync
+# Any skill on disk not covered by a routing rule is flagged by --sync.
+# To review all rules for stale references, run --audit and inspect the output manually.
+```
+
+**Healthy targets after onboarding:**
+
+| Surface | Target |
+|---------|--------|
+| Always-loaded instructions (`applyTo: **/*`) | ≤ 6 total across global + project |
+| Duplicate skills (same name in global + project) | 0 |
+| Conductor orphans (skill on disk, not in rules) | 0 (or explicitly listed in `_meta.intentionally_unrouted`) |
+| Conductor stale refs (rule references missing skill) | 0 |
+
+## Quick Reference
+
+| Phase | Creator | Output | Verify |
+|-------|---------|--------|--------|
+| 0.1 | `session-knowledge-creator` | briefing.py, learn.py | `sk briefing --wakeup` |
+| 0.2 | `agent-creator` | .github/agents/*.agent.md | `ls .github/agents/` |
+| 0.3 | `hook-creator` | .github/hooks/ | `cat .github/hooks/hooks.json` |
+| 1.1 | `workflow-creator` | WORKFLOW.md | File exists with phases |
+| 1.2 | `tentacle-creator` | tentacle-orchestration | SKILL.md exists |
+| 2.1 | `conductor-creator` | conductor-rules.json | `--sync` reports clean |
+| 3 | Load audit | — | ≤6 always-loaded instr., 0 duplicate skills |
+
+## Dependency Map
+
+```
+session-knowledge-creator ---+
+                             |
+agent-creator ---------------+
+                             |
+hook-creator ----------------+---> conductor-creator ---> READY
+                             |
+workflow-creator ------------+
+                             |
+tentacle-creator ------------+
+```
+
+All five creators feed into conductor-creator. The conductor is the
+integration point that ties the ecosystem together.
+
+## Ongoing Maintenance
+
+| Event | Action |
+|-------|--------|
+| Added/removed a skill | `conductor.py --sync --fix` then re-run load audit |
+| Added a new agent | Update `agent_routing` in conductor-rules.json |
+| Changed workflow phases | Update `workflows` in conductor-rules.json |
+| New session starts | `sk briefing --auto --compact` |
+| After fixing a bug | `sk learn --mistake "Title" "Details" --tags "tags"` |
+| After completing feature | `sk learn --feature "Title" "Details" --tags "tags"` |
+| Skill installed globally | Remove project-local copy if it exists in `.github/skills/` |
+| Instruction added | Check that `applyTo` is as narrow as possible; re-run load audit |
+
+## Platform Notes
+
+- **macOS/Linux:** Use `python3`. Paths use `/`.
+- **Windows:** Use `python` instead. All scripts are cross-platform Python.
+- **Copilot CLI:** Skills auto-discover via `.skill-meta.json`.
+  Instructions auto-inject via `.instructions.md` with `applyTo` frontmatter.
+
+## Troubleshooting
+
+| Problem | Cause | Fix |
+|---------|-------|-----|
+| `--sync` shows orphans | New skill added after setup | `--sync --fix` |
+| Agent uses wrong model | Model not specified | Check `.instructions.md` model rules |
+| Hook blocks valid action | Overly strict regex | Edit `.github/hooks/scripts/` |
+| Briefing returns empty | No entries recorded | Start using `sk learn` after tasks |
+| Context feels slow / bloated | Too many `applyTo: **/*` instructions | Narrow `applyTo` on each instruction file |
+| Same skill name in global + project | Old project copy not removed after global rollout | Delete `.github/skills/<name>/` when `~/.copilot/skills/<name>/` exists |
+| Conductor rules reference missing skill | Skill removed but rule not updated | `conductor.py --audit`, then remove stale rule or restore skill |
+
+<example>
+**Project:** Existing Python backend with no AI scaffolding yet
+
+**User asks:** "Onboard this project so future Copilot sessions have memory, hooks, workflows, and routing"
+
+**Recommended order:**
+1. `session-knowledge-creator`
+2. `agent-creator`
+3. `hook-creator`
+4. `workflow-creator`
+5. `tentacle-creator`
+6. `conductor-creator`
+
+**Expected result:**
+- session memory tools installed
+- project agents created
+- hooks deployed
+- workflow defined
+- tentacle orchestration available
+- conductor routing synced with zero gaps
+- load audit: ≤6 always-loaded instructions, 0 duplicate skill names, 0 conductor orphans
+
+**Propagation note:** If any of the above meta-skills are already installed globally
+(in `~/.copilot/skills/`), skip re-deploying them to `.github/skills/`. The project
+should extend, not duplicate, the global layer.
+</example>
diff --git a/.github/skills/task-step-generator/SKILL.md b/.github/skills/task-step-generator/SKILL.md
new file mode 100644
index 00000000..f7c8665e
--- /dev/null
+++ b/.github/skills/task-step-generator/SKILL.md
@@ -0,0 +1,232 @@
+---
+name: task-step-generator
+description: >
+  Generate a structured STEPS.md file that breaks a specific task into concrete, ordered
+  steps grounded in the project's phased workflow. Use when a task is too complex for a
+  single prompt, when tentacle-orchestration needs a reviewed planning scaffold before
+  dispatch, or for single-module features, multi-stage bug fixes, and scripted migrations.
+  Trigger phrases: "create step file", "generate steps", "make a task plan",
+  "write out the steps", "break this into steps", "step-by-step plan", "task steps",
+  "execution plan".
+---
+
+# Task Step Generator
+
+Generate a `STEPS.md` file that breaks a specific task into concrete, ordered steps an
+agent can follow without re-reading the full specification. Steps are grounded in the
+project's existing phases (from `WORKFLOW.md` or the standard phased lifecycle) and
+are normally scoped to complete in a single agent session. When invoked by
+`tentacle-orchestration`, the step file is a top-level scaffold that must be reviewed
+and split into scoped tentacles before dispatch.
+
+## When to Use
+
+- Task is too complex to fit in one prompt response but touches only 1–3 files or modules
+- Tentacle orchestration needs a first-pass step plan to review before creating parallel work units
+- You want a traceable, reviewable execution plan before starting implementation
+- A task has non-obvious ordering constraints (e.g., schema migration before code change)
+- User says "create step file", "break this into steps", "make a task plan"
+
+**Not for:**
+- Tasks spanning 3+ independent modules as the final execution plan → use `tentacle-orchestration` to review this scaffold and split it into tentacles
+- Project-level process templates → use `workflow-creator` to generate `WORKFLOW.md`
+- Pure research or exploration tasks (no implementation deliverable)
+
+## Why Step Files Help
+
+Without explicit steps, agents make predictable mistakes:
+- Start coding before understanding requirements
+- Skip verification steps when they "seem" done
+- Lose track of intermediate deliverables across session boundaries
+- Make irreversible changes (DB migrations, file deletions) before testing
+
+A step file makes the execution plan visible and checkable — each step has a concrete
+done condition, so the agent knows when to proceed and a human can audit progress.
+
+## How to Generate
+
+### Phase 1: Understand the task
+
+Before writing any steps, investigate:
+
+1. Read the task description and clarify any ambiguities. If confidence in any scope,
+   dependency, or acceptance criterion is below `1.0`, prepend a RESEARCH step that splits
+   the uncertainty and dispatches independent validation on the strongest available model
+   (`claude-opus-4.7` when available).
+2. Identify the implementation target: which files change? What is the entry point?
+3. Check if a `WORKFLOW.md` exists (`cat .github/WORKFLOW.md 2>/dev/null`) — use its phases
+   as the skeleton; if not, use the standard phases below
+4. Identify ordering constraints: what must happen before what?
+
+### Phase 2: Map to phases
+
+Assign each piece of work to a phase. Use only the phases the task actually needs.
+
+**Standard phases** (skip phases the task doesn't need):
+
+| Phase | Purpose | Gate artifact |
+|-------|---------|--------------|
+| CLARIFY | Confirm requirements are implementation-ready | Spec health report or confirmed spec |
+| RESEARCH | Resolve confidence `< 1.0` ambiguities before decisions | Research evidence, rejected alternatives, confidence `1.0` |
+| DESIGN | Produce a technical design or interface sketch | Design doc or interface definition |
+| VERIFY | Review the design before touching code | Explicit approval (PASS/FAIL) |
+| BUILD | Implement the code change | Compiling code with no regressions |
+| TEST | Run and write tests | All tests green; if `harness.yaml` exists, also run `sk harness check` and attach its output showing all criteria green |
+| REVIEW | Check correctness, security, contracts | Code review findings addressed |
+| LOOP-EVAL | Evaluate whether the overarching goal is met; decide to iterate or close | Goal met (proceed to COMMIT) or remaining gaps identified (loop to BUILD/TEST) |
+| COMMIT | Package and ship | Clean git commit |
+
+Each phase produces one concrete artifact. A step only advances when that artifact exists.
+
+Include a **LOOP-EVAL** step when:
+- The task has an explicit overarching goal (e.g., "all benchmarks pass", "all tests green")
+- The task may require multiple iterations to reach the goal (e.g., a fix that reveals follow-on failures)
+- The agent will be operating semi-autonomously and must decide whether to continue or stop
+
+Omit LOOP-EVAL for strictly bounded tasks with a single deliverable (e.g., "add this one column", "rename this function").
+
+### Phase 3: Write concrete steps
+
+Each step must answer: **What exactly do I do, and how do I know I'm done?**
+
+Step format:
+```markdown
+## Step N: <Phase> — <Action>
+
+**Goal:** One sentence describing what this step produces.
+
+**Actions:**
+1. <Concrete command or action>
+2. <Concrete command or action>
+
+**Done when:** <Observable, verifiable condition — not "seems right">
+**Confidence:** `1.0` required; if lower, this step is blocked by a RESEARCH step.
+```
+
+Use real commands from the project's toolchain. Avoid vague verbs like "check" or
+"ensure" — replace them with the exact command that confirms the condition.
+
+### Phase 4: Add a self-check table
+
+After all steps, add a phase-gate table for quick progress tracking:
+
+```markdown
+## Phase Gates
+
+| Phase | Artifact | Status |
+|-------|---------|--------|
+| CLARIFY | Confirmed spec | ☐ |
+| RESEARCH | Confidence `< 1.0` concerns resolved or explicitly not needed | ☐ |
+| BUILD | `npx tsc --noEmit` passes | ☐ |
+| TEST | `yarn test` passes | ☐ |
+| LOOP-EVAL | Goal criteria met (or single-pass task: skip) | ☐ |
+| COMMIT | Clean `git diff --stat` | ☐ |
+```
+
+## Output Format
+
+Write to the path the user specifies, or to `.github/steps/<task-slug>.md` by default.
+If no path was given and `.github/steps/` does not exist, write to `STEPS.md` in the project root.
+
+See `references/step-file-template.md` for the full annotated template.
+
+## Anti-Patterns
+
+| Anti-Pattern | Why It Fails |
+|-------------|-------------|
+| Steps without done conditions | Agent can't tell when to proceed |
+| Mixing phases (build + test in one step) | Gate is ambiguous; errors mix together |
+| Vague actions ("verify it works") | Not actionable; agent guesses |
+| No ordering constraints | Agent skips steps that depend on earlier output |
+| Treating a 3+ module scaffold as final | Step file becomes too large; review it, then split into tentacles |
+| Low-confidence done condition accepted | Agent guesses success; add RESEARCH and block downstream steps |
+
+<example>
+**Task:** Add a `created_at` timestamp column to the `orders` table and expose it in the API response.
+
+**Project:** Python + SQLite + FastAPI, no existing WORKFLOW.md.
+
+**Generated step file:**
+
+```markdown
+# STEPS: Add created_at to orders
+
+**Task:** Add `created_at` timestamp to `orders` table and return it in GET /orders.
+**Scope:** `migrations/`, `models/order.py`, `routers/orders.py`, `tests/test_orders.py`
+
+## Step 1: CLARIFY — Confirm scope
+
+**Goal:** Confirm there are no ambiguities before touching the schema.
+
+**Actions:**
+1. `grep -r "orders" migrations/` — confirm existing migrations are in order
+2. `grep -r "created_at" models/` — check if pattern exists elsewhere for consistency
+
+**Done when:** Migration baseline known; no contradicting existing column found.
+
+## Step 2: BUILD — Add migration
+
+**Goal:** Add migration file that adds `created_at` with a non-null default.
+
+**Actions:**
+1. Create `migrations/003_add_orders_created_at.sql`:
+   ```sql
+   ALTER TABLE orders ADD COLUMN created_at TEXT NOT NULL DEFAULT (datetime('now'));  -- caution: SQLite rejects a non-constant default here if orders already has rows
+   ```
+2. Run migration: `sqlite3 app.db < migrations/003_add_orders_created_at.sql`
+3. `python -c "import ast; ast.parse(open('models/order.py').read())"` — syntax check
+
+**Done when:** `sqlite3 app.db ".schema orders"` shows `created_at` column.
+
+## Step 3: BUILD — Update model and router
+
+**Goal:** Surface `created_at` in the Pydantic model and API response.
+
+**Actions:**
+1. Add `created_at: str` field to `OrderResponse` in `models/order.py`
+2. Map DB column to field in `routers/orders.py` query result
+
+**Done when:** `python -m py_compile models/order.py routers/orders.py` exits 0.
+
+## Step 4: TEST — Run and extend tests
+
+**Goal:** Verify the field appears in the API response and no existing tests broke.
+
+**Actions:**
+1. `pytest tests/test_orders.py -v` — confirm existing tests still pass
+2. Add assertion: `assert "created_at" in response.json()[0]`
+3. `pytest tests/test_orders.py -v` — confirm new assertion passes
+
+**Done when:** All tests green; new assertion present and passing.
+
+## Step 5: REVIEW — Quick correctness check
+
+**Goal:** Verify no injection vectors, no null risks, no contract breaks.
+
+**Actions:**
+1. Confirm `created_at` default is server-side (not user-supplied)
+2. Confirm migration is reversible (document rollback in a comment)
+
+**Done when:** No critical findings from the review checklist.
+
+## Step 6: COMMIT — Ship
+
+**Actions:**
+1. `git add migrations/ models/ routers/ tests/`
+2. `git diff --stat` — confirm only expected files changed
+3. `git commit -m "feat(orders): add created_at timestamp column and API field"`
+
+**Done when:** `git log --oneline -1` shows the commit.
+
+## Phase Gates
+
+| Phase | Artifact | Status |
+|-------|---------|--------|
+| CLARIFY | Migration baseline confirmed | ☐ |
+| BUILD (migration) | Schema shows `created_at` | ☐ |
+| BUILD (code) | `py_compile` exits 0 | ☐ |
+| TEST | All tests green | ☐ |
+| REVIEW | No critical findings | ☐ |
+| COMMIT | Clean commit | ☐ |
+```
+</example>
diff --git a/.github/skills/task-step-generator/references/step-file-template.md b/.github/skills/task-step-generator/references/step-file-template.md
new file mode 100644
index 00000000..a5d732f4
--- /dev/null
+++ b/.github/skills/task-step-generator/references/step-file-template.md
@@ -0,0 +1,174 @@
+# Step File Template
+
+Annotated template for `STEPS.md` files generated by the `task-step-generator` skill.
+Copy this template, fill in each section, and remove annotation comments.
+
+---
+
+```markdown
+# STEPS: <task title>
+
+<!--
+  One-paragraph summary of the task. Include:
+  - What changes (which files, which modules)
+  - Why (what user-visible behavior changes)
+  - Constraints (backward compat, DB state, API versioning)
+-->
+
+**Task:** <one sentence description>
+**Scope:** `<file1>`, `<file2>`, `<dir/**>`
+**Estimated phases:** CLARIFY → BUILD → TEST → REVIEW → COMMIT
+
+---
+
+## Step 1: CLARIFY — <what you're confirming>
+
+<!--
+  Investigate before touching code. Read relevant files, check existing patterns,
+  confirm the spec is implementation-ready.
+-->
+
+**Goal:** <what this step produces — one sentence>
+
+**Actions:**
+1. `<exact command>` — <what it shows>
+2. `<exact command>` — <what it shows>
+
+**Done when:** <observable, verifiable condition>
+
+---
+
+## Step 2: DESIGN — <what you're designing> (skip if not needed)
+
+<!--
+  Produce a technical design, interface sketch, or data model before implementing.
+  For small tasks, this may just be "write the function signature and docstring first."
+-->
+
+**Goal:** Interface or design defined before implementation begins.
+
+**Actions:**
+1. Write function signatures / interface definitions in `<file>`
+2. Confirm with a quick review before filling in bodies
+
+**Done when:** Signatures are written and consistent with callers.
+
+---
+
+## Step 3: BUILD — <what you're implementing>
+
+<!--
+  One implementation step per logical unit. Keep steps small enough that a build check
+  between them is meaningful (not: "implement everything").
+-->
+
+**Goal:** <what this step produces>
+
+**Actions:**
+1. `<edit file>` — <what changes>
+2. `<build check command>` — confirm no syntax errors
+
+**Done when:** `<build command>` exits 0 / `<specific observable>` is true.
+
+---
+
+## Step 4: TEST — Run and extend tests
+
+<!--
+  Always run existing tests before adding new ones — establish the baseline.
+  Then add assertions that cover the new behavior.
+-->
+
+**Goal:** Existing tests still pass; new behavior is covered.
+
+**Actions:**
+1. `<test command>` — confirm baseline
+2. Add test case for `<new behavior>` in `<test file>`
+3. `<test command>` — confirm new assertion passes
+
+**Done when:** All tests green; new assertion present and passing.
+
+---
+
+## Step 5: REVIEW — Correctness check
+
+<!--
+  Apply the code-reviewer skill to the changes in this task.
+  Focus on: security (injection, traversal), contracts (null safety, error paths),
+  and correctness (all paths traced).
+-->
+
+**Goal:** No critical or high findings remain unresolved.
+
+**Actions:**
+1. Trace each changed function with a concrete failure-case input
+2. Check error paths: are all errors propagated or handled?
+3. For DB/file/network changes: confirm no injection vectors
+
+**Done when:** Review checklist complete; any critical findings resolved.
+
+---
+
+## Step 6: COMMIT — Ship
+
+**Actions:**
+1. `git add <files>`
+2. `git diff --stat` — verify only expected files changed
+3. `git commit -m "<type>(<scope>): <description>"`
+
+**Done when:** `git log --oneline -1` shows the commit.
+
+---
+
+## Phase Gates
+
+<!--
+  Fill in the "Artifact" column with the actual command or evidence needed.
+  Check each box (☑) as the phase completes.
+-->
+
+| Phase | Artifact | Status |
+|-------|---------|--------|
+| CLARIFY | <specific evidence> | ☐ |
+| BUILD | `<build command>` exits 0 | ☐ |
+| TEST | `<test command>` all green | ☐ |
+| REVIEW | No critical findings | ☐ |
+| COMMIT | Clean git commit | ☐ |
+```
+
+---
+
+## Annotation Guide
+
+### Done conditions
+
+Good done conditions are **observable without running the agent again**:
+
+| ✅ Good | ❌ Bad |
+|--------|--------|
+| `` `npx tsc --noEmit` exits 0 `` | "TypeScript compiles correctly" |
+| `` `pytest` shows 0 failures `` | "Tests pass" |
+| "Migration column visible in `.schema`" | "Migration applied" |
+| "PR has no 🔴 findings in review" | "Code looks good" |
+
+### Phase selection
+
+Include only the phases your task actually needs. A bug fix often skips DESIGN and REVIEW.
+A schema migration always includes CLARIFY (check existing state) and TEST.
+
+| Task type | Typical phases |
+|-----------|---------------|
+| Bug fix | CLARIFY → BUILD → TEST → COMMIT |
+| New feature | CLARIFY → DESIGN → BUILD → TEST → REVIEW → COMMIT |
+| Schema migration | CLARIFY → BUILD → TEST → REVIEW → COMMIT |
+| Refactor | CLARIFY → BUILD → TEST → REVIEW → COMMIT |
+| Config change | CLARIFY → BUILD → COMMIT |
+
+### Naming
+
+Name the step file after the task slug:
+- `feat-add-created-at.md`
+- `fix-null-pointer-in-handler.md`
+- `refactor-auth-module.md`
+
+Store at `.github/steps/<slug>.md` so it's tracked with the repo.
diff --git a/.github/skills/tentacle-orchestration/SKILL.md b/.github/skills/tentacle-orchestration/SKILL.md
index 1d458df0..0ac03140 100644
--- a/.github/skills/tentacle-orchestration/SKILL.md
+++ b/.github/skills/tentacle-orchestration/SKILL.md
@@ -327,8 +327,9 @@ Summary:
 | **Review** | Security issues, design flaws, scope creep | Never skip |
 | **Docs** | Stale README, outdated JSDoc, missing CHANGELOG | Internal refactors only |
 | **QA audit** | Hallucinated tests, spec mismatches, blind spots | Low-risk changes only |
+| **Harness** | Missing/unreachable success criteria | When `harness.yaml` is absent |
 
-The first 4 gates are mandatory. Skipping any of them means you don't know if the agent output is correct.
+The first 4 gates are mandatory; skipping any of them means you don't know if the agent output is correct. When `harness.yaml` exists, also run `sk harness check [--json]` as an additional gate before Phase 4 to verify that all success-criteria commands are reachable and passing.
 
 **Evidence requirement:** Each gate must produce concrete, recorded output before being marked as passed. Do not rely on agent claims that "lint is clean" or "tests pass" — run the commands yourself and attach or reference the output. A gate is only passed when you hold the proof, not when the sub-agent says it is. See Rule 9 (Claims Require Evidence) in `docs/AGENT-RULES.md`.
 
diff --git a/.github/skills/workflow-creator/SKILL.md b/.github/skills/workflow-creator/SKILL.md
new file mode 100644
index 00000000..e983d6f8
--- /dev/null
+++ b/.github/skills/workflow-creator/SKILL.md
@@ -0,0 +1,159 @@
+---
+name: workflow-creator
+description: >
+  Create a phased development workflow (WORKFLOW.md) with quality gates for any project.
+  Use when setting up a new project, improving development process, or when the user mentions
+  "create workflow", "setup phases", "quality gates", "development process", "CI pipeline",
+  or wants a structured multi-phase approach to UI/feature changes.
+---
+
+# Workflow Creator
+
+Generate a `WORKFLOW.md` file that defines a phased development lifecycle with quality gates,
+phase dependencies, and evidence requirements.
+
+## When to Use
+
+- Setting up a new project that needs a structured development process
+- User mentions "create workflow", "quality gates", "development phases", or "CI pipeline"
+- AI agents are skipping steps (testing, review) or working out of order
+- A feature involves multiple stages (design → build → test → QA) that must be gated
+
+## Why Phased Workflows Matter
+
+Without phases, AI agents make common mistakes:
+- Code before understanding requirements → rework
+- Skip design verification → ship broken UI
+- Skip testing → broken in production
+- No visual QA → pixel-level bugs users notice
+- No review gate → architecture drift
+
+A phased workflow with **blocking gates** prevents these by enforcing order.
+
+## Workflow Template
+
+Every workflow follows this pattern:
+
+```
+Phase 0 → Phase 1 → Phase 2 → ... → Phase N
+         ↑ gate ↑  ↑ gate ↑        ↑ gate ↑
+```
+
+**Gates are BLOCKING**: a phase cannot start until the previous phase produces its required artifact.
+
+### Base Phases (adapt for your project)
+
+| Phase | Name | Purpose | Artifact |
+|-------|------|---------|----------|
+| 0 | CLARIFY | Make requirements implementation-ready | Spec Health Report |
+| 0.5 | RESEARCH | Resolve any CLARIFY/DESIGN/ROUTING confidence below `1.0` | Independent research evidence + confidence `1.0` |
+| 1 | DESIGN | Generate visual/technical design | Design files or specs |
+| 2 | VERIFY | Review design before coding | Review verdicts (PASS/FAIL) |
+| 3 | BUILD | Implement code | Compiling code + passing tests |
+| 4 | TEST | Functional verification | Test results (all pass) |
+| 5 | REVIEW | Code quality check | Review approval |
+| 6 | QA | Visual/manual verification | Screenshots/evidence |
+| 7 | COMMIT | Ship it | Clean git commit |
+
+The RESEARCH phase is blocking whenever confidence is `< 1.0`. Split ambiguous/noisy
+questions into independent research tasks and use the strongest available model
+(`claude-opus-4.7` when available) for validation before proceeding to DESIGN or BUILD.
+
+### Customization by Project Type
+
+- **Backend/API**: Drop DESIGN + QA, strengthen TEST with integration + load tests.
+- **Mobile/Desktop**: Keep all phases, add per-platform QA.
+- **Libraries**: Drop DESIGN/VERIFY, strengthen REVIEW (API surface), add DOCS phase.
+- **Data pipelines**: Replace DESIGN with SCHEMA REVIEW, QA with DATA VALIDATION.
+
+## Creating a Workflow
+
+### Step 1: Understand the Project
+
+Examine project type, existing CI/CD, test infrastructure, and agents.
+
+### Step 2: Select Phases
+
+Each phase needs: purpose, input, activities, gate artifact, owner, skip conditions.
+
+### Step 3: Define Blocking Wait Rule
+
+```markdown
+### ⛔ BLOCKING WAIT Rule
+NEVER start Phase N+1 while Phase N is running.
+Parallelism ONLY within a single phase (e.g., 3 test suites in parallel).
+```
+
+### Step 4: Define Phase Gate Evidence Table
+
+Map each phase to its required evidence, verification method, and when skipping is blocked.
+
+### Step 5: Add Self-Check Protocol
+
+At every phase transition, verify artifacts exist and meet quality criteria.
+
+## Integration
+
+- Store as `.github/WORKFLOW.md`
+- Reference from `AGENTS.md` and project instructions
+- Hooks enforce phases (e.g., `commit-gate.py`)
+- Conductor agent uses workflow as playbook
+
+### Deploying via project profiles
+
+Instead of building a workflow from scratch, use `setup-project.py --profile` or `install-project-hooks.py --profile` to install a pre-built hook bundle and starter `WORKFLOW.md`:
+
+```bash
+python3 ~/.copilot/tools/setup-project.py --profile python      # Python: TDD, test-reminder, commit-gate
+python3 ~/.copilot/tools/setup-project.py --profile typescript  # TypeScript: coding-standards, test-reminder
+python3 ~/.copilot/tools/setup-project.py --profile mobile      # Mobile: architecture-guard, QA phase
+python3 ~/.copilot/tools/setup-project.py --profile fullstack   # Full-stack: architecture-guard, session-banner
+
+# Hooks only (no full project setup)
+python3 ~/.copilot/tools/install-project-hooks.py --profile python --workflow  # hooks + WORKFLOW.md
+python3 ~/.copilot/tools/install-project-hooks.py --list-profiles              # show available profiles
+
+# Build a custom profile and deploy it
+python3 ~/.copilot/tools/profile-builder.py --name myteam \
+  --hooks dangerous-blocker.py commit-gate.py \
+  --phases CLARIFY BUILD TEST COMMIT                             # creates presets/myteam.json
+python3 ~/.copilot/tools/profile-export.py --profile myteam --output myteam.json   # export to share
+python3 ~/.copilot/tools/profile-import.py --file myteam.json                      # import on another machine
+python3 ~/.copilot/tools/setup-project.py --profile myteam                         # deploy
+```
+
+Profile bundles are defined in `presets/` (`default`, `python`, `typescript`, `mobile`, `fullstack`).
+Use `--workflow` with `install-project-hooks.py` to also generate a starter `WORKFLOW.md`.
+
+## Anti-Patterns
+
+| Anti-Pattern | Why It Fails |
+|-------------|-------------|
+| Too many phases (>9) | Overhead kills velocity |
+| No skip conditions | Trivial changes take forever |
+| Soft gates ("should") | AI rationalizes skipping |
+| No evidence requirements | "Done" without proof |
+| Phase overlap allowed | Defeats gate purpose |
+| Proceeding past CLARIFY with confidence `< 1.0` | Implementation starts from guesses; run RESEARCH first |
+
+<example>
+**Project:** React dashboard (TypeScript + Jest + Playwright)
+
+**Selected phases:** CLARIFY → DESIGN → VERIFY → BUILD → TEST → REVIEW → QA → COMMIT
+
+**Customizations:**
+- Skipped DESIGN for bug-fix tasks (`skip_if: bug_fix: true`)
+- TEST requires both Jest (unit) and Playwright (e2e) passing
+- QA captures Playwright screenshots as visual evidence
+- COMMIT blocked by `commit-gate.py` until QA artifact exists
+
+**Gate table (excerpt):**
+| Phase | Evidence | Command |
+|-------|---------|---------|
+| BUILD | TypeScript compiles | `npx tsc --noEmit` |
+| TEST | All tests green | `yarn test --ci` |
+| HARNESS | All criteria green | `sk harness check` |
+| QA | Screenshot in `qa-evidence/` | Playwright visual run |
+
+**Output:** `.github/WORKFLOW.md` (referenced from `AGENTS.md`)
+</example>
diff --git a/.github/skills/workflow-creator/references/workflow-template.md b/.github/skills/workflow-creator/references/workflow-template.md
new file mode 100644
index 00000000..b513c4bb
--- /dev/null
+++ b/.github/skills/workflow-creator/references/workflow-template.md
@@ -0,0 +1,93 @@
+# WORKFLOW.md Template
+
+Copy and customize for your project.
+
+```markdown
+# Development Workflow
+
+> This document defines the phased development lifecycle with quality gates.
+> Every phase has a BLOCKING gate — cannot proceed until the previous phase's artifact exists.
+
+## Phase Overview
+
+| Phase | Name | Owner | Gate Artifact | Skip When |
+|-------|------|-------|---------------|-----------|
+| 0 | CLARIFY | spec-clarifier | Spec Health Report (verdict=CLEAN) | Trivial bugfix (<3 files, clear repro) |
+| 1 | DESIGN | designer | Design files (HTML/PNG/Figma) | Non-UI changes |
+| 2 | VERIFY | 3 reviewers (parallel) | All 3 verdicts = PASS | Phase 1 skipped |
+| 3 | BUILD | builder | Compiling code + passing unit tests | — |
+| 4 | TEST | test runners (parallel) | All test suites pass | — |
+| 5 | REVIEW | code-reviewer | Review approval | — |
+| 6 | QA | qa-verifier | Screenshots + OCR evidence | Non-UI changes |
+| 7 | COMMIT | conductor | Clean git commit | — |
+
+## ⛔ BLOCKING WAIT Rule
+
+Start Phase N+1 ONLY after Phase N artifacts exist.
+Parallelism is allowed WITHIN a single phase (e.g., 3 test suites in parallel).
+
+Starting the next phase early means working blind — code written without verification
+results almost always needs rewriting.
+
+## Phase Gate Evidence
+
+Each gate requires specific artifacts before the phase can be marked "done":
+
+| Phase | Required Evidence | Verification |
+|-------|-------------------|-------------|
+| 0: CLARIFY | Spec Health Report with verdict | Check verdict = ✅ CLEAN |
+| 1: DESIGN | Design files exist | File existence check |
+| 2: VERIFY | All reviewer verdicts | Count PASS verdicts = total reviewers |
+| 3: BUILD | Build output + test output | Exit code = 0 |
+| 4: TEST | Test results per platform | 0 failures across all suites |
+| 5: REVIEW | Review comments | No blocking issues |
+| 6: QA | Screenshots + text verification | OCR confirms expected elements |
+| 7: COMMIT | Git commit hash | Commit exists in log |
+
+## Self-Check Protocol
+
+At every phase transition, run this check:
+
+~~~
+□ Previous phase artifact exists (not just "I think it passed")
+□ Artifact meets quality criteria (not empty, not trivially correct)
+□ No blocking issues from previous phase remain unresolved
+□ Current phase has clear input from previous phase's output
+~~~
+
+## Phase Details
+
+### Phase 0: CLARIFY
+<!-- Customize: what constitutes a "clear spec" for your project -->
+
+### Phase 1: DESIGN
+<!-- Customize: design tool, output format, ad slot requirements -->
+
+### Phase 2: VERIFY
+<!-- Customize: how many reviewers, what they check -->
+
+### Phase 3: BUILD
+<!-- Customize: build command, test command, E2E impact check -->
+
+### Phase 4: TEST
+<!-- Customize: test suites, platforms, parallel execution -->
+
+### Phase 5: REVIEW
+<!-- Customize: review checklist, independence rule -->
+
+### Phase 6: QA
+<!-- Customize: platforms, themes, screenshot process, OCR -->
+
+### Phase 7: COMMIT
+<!-- Customize: commit message format, trailers -->
+
+## Anti-Patterns
+
+| Anti-Pattern | Why It Fails |
+|-------------|-------------|
+| Skipping phases for "simple changes" | Simple changes still break things |
+| Starting Phase N+1 while N runs | Working blind = rework guaranteed |
+| Marking phases done without artifacts | "Trust me" doesn't catch bugs |
+| Using same agent for build and review | Self-review misses builder's blind spots |
+| Sequential test suites when parallel is possible | Wastes time proportional to suite count |
+```

From 32f5a306ffb59126705b899ff77bb32d20760ed6 Mon Sep 17 00:00:00 2001
From: Linh Ngo <thlinh.ngo@gmail.com>
Date: Sat, 30 May 2026 23:14:28 +0700
Subject: [PATCH 2/3] feat(agents/instructions): add harness integration to all
 agents and copilot-instructions (#736)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- All 10 .github/agents/*.agent.md: append Harness Integration section
  with sk harness check, SK_HARNESS=1, sk harness init, quality-over-speed
- .github/copilot-instructions.md: add ## Harness Engineering section
  after Quality Checklist — 7-principle table with executable commands
- AGENTS.md: add Command column to harness principles table with
  sk harness init, SK_HARNESS=1, sk harness check, sk tentacle etc.

Closes #736

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/agents/browse-ui-host-state.agent.md  | 20 +++++++++++++++++++
 .../agents/browser-security-reviewer.agent.md | 20 +++++++++++++++++++
 .github/agents/dev-leader.agent.md            | 20 +++++++++++++++++++
 .../agents/hosted-shell-bootstrap.agent.md    | 20 +++++++++++++++++++
 .github/agents/python-browse-backend.agent.md | 20 +++++++++++++++++++
 .github/agents/qa-leader.agent.md             | 20 +++++++++++++++++++
 .github/agents/research-planner.agent.md      | 20 +++++++++++++++++++
 .github/agents/test-leader.agent.md           | 20 +++++++++++++++++++
 .github/agents/verification-gate.agent.md     | 20 +++++++++++++++++++
 .../agents/whole-app-impact-auditor.agent.md  | 20 +++++++++++++++++++
 .github/copilot-instructions.md               | 16 +++++++++++++++
 AGENTS.md                                     | 18 ++++++++---------
 12 files changed, 225 insertions(+), 9 deletions(-)

diff --git a/.github/agents/browse-ui-host-state.agent.md b/.github/agents/browse-ui-host-state.agent.md
index 76ca9133..4a6a1726 100644
--- a/.github/agents/browse-ui-host-state.agent.md
+++ b/.github/agents/browse-ui-host-state.agent.md
@@ -95,3 +95,23 @@ Use Playwright or browser smoke only when runtime behavior cannot be proven by u
 ## Output
 
 Summarize the state machine, affected UI surfaces, and validation evidence. Call out any browser-dependent behavior explicitly.
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/browser-security-reviewer.agent.md b/.github/agents/browser-security-reviewer.agent.md
index a51aeeea..045f66c6 100644
--- a/.github/agents/browser-security-reviewer.agent.md
+++ b/.github/agents/browser-security-reviewer.agent.md
@@ -70,3 +70,23 @@ Return:
 - **Safe-to-merge conditions:** exact checks that must pass
 
 If there are no genuine issues, say so directly and do not invent style feedback.
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/dev-leader.agent.md b/.github/agents/dev-leader.agent.md
index 11eeca48..c9ab09e7 100644
--- a/.github/agents/dev-leader.agent.md
+++ b/.github/agents/dev-leader.agent.md
@@ -96,3 +96,23 @@ When you encounter a problem that spans domains:
 Primary: `*.py` (root tools), `hooks/**/*`, `browse/**/*.py`, `migrate.py`, `sk.py`, `install.py`
 
 Out of scope: `browse-ui/src/**/*` (belongs to browse-leader), `crates/**/*` (flag to orchestrator)
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/hosted-shell-bootstrap.agent.md b/.github/agents/hosted-shell-bootstrap.agent.md
index e031e970..5dc74bc6 100644
--- a/.github/agents/hosted-shell-bootstrap.agent.md
+++ b/.github/agents/hosted-shell-bootstrap.agent.md
@@ -124,3 +124,23 @@ Open a PR with:
 - Test output
 - Browser smoke evidence if possible
 - Any follow-up issue needed for HTTPS companion or richer pairing
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/python-browse-backend.agent.md b/.github/agents/python-browse-backend.agent.md
index c6de77a5..6cbf606a 100644
--- a/.github/agents/python-browse-backend.agent.md
+++ b/.github/agents/python-browse-backend.agent.md
@@ -103,3 +103,23 @@ If `watch-sessions.py` changes, also run or confirm coverage for `tests/test_wat
 ## Output
 
 Report the changed routes, security decisions, synchronization decisions, and test evidence. If any behavior is deferred, open or reference a follow-up issue instead of leaving silent TODOs.
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/qa-leader.agent.md b/.github/agents/qa-leader.agent.md
index 01d53066..f4862e37 100644
--- a/.github/agents/qa-leader.agent.md
+++ b/.github/agents/qa-leader.agent.md
@@ -139,3 +139,23 @@ Security: no SQL interpolation, no pickle, no secrets
 Primary: All changed surfaces (read-only audit + gate execution)
 
 Out of scope: Implementing fixes (that belongs to dev-leader or browse-leader)
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/research-planner.agent.md b/.github/agents/research-planner.agent.md
index 88eff5a0..730f7cd5 100644
--- a/.github/agents/research-planner.agent.md
+++ b/.github/agents/research-planner.agent.md
@@ -77,3 +77,23 @@ When creating an implementation issue, include:
 ## Output
 
 Produce concise research that is implementation-ready. If asked to create GitHub issues, make each issue specific enough that a cloud agent can implement it without hidden context.
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/test-leader.agent.md b/.github/agents/test-leader.agent.md
index 0cc775ab..2ad976ef 100644
--- a/.github/agents/test-leader.agent.md
+++ b/.github/agents/test-leader.agent.md
@@ -113,3 +113,23 @@ Handoff must include: test suite output (pass/fail counts), coverage gaps identi
 Primary: `test_*.py`, `run_all_tests.py`, `tests/**/*`, `browse-ui/src/**/*.test.*`, `browse-ui/e2e/**/*`
 
 Review scope (read-only audit): all files changed by dev-leader or browse-leader
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/verification-gate.agent.md b/.github/agents/verification-gate.agent.md
index cc9bd062..3184b9fc 100644
--- a/.github/agents/verification-gate.agent.md
+++ b/.github/agents/verification-gate.agent.md
@@ -90,3 +90,23 @@ Return:
 - **Sync coverage:** watcher, auto-update/install, docs, hooks, skills/agents, conventions, CI/deploy status
 - **Failures:** root cause and affected file/test when known
 - **Next action:** exact command or issue/PR comment to run next
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/agents/whole-app-impact-auditor.agent.md b/.github/agents/whole-app-impact-auditor.agent.md
index da8d985b..a3a8a157 100644
--- a/.github/agents/whole-app-impact-auditor.agent.md
+++ b/.github/agents/whole-app-impact-auditor.agent.md
@@ -91,3 +91,23 @@ Return:
 - **Verification matrix:** command or check per impacted surface
 - **Risks if skipped:** concrete breakage scenario
 - **Follow-up issues:** only when deferring work is safe and explicit
+
+## Harness Integration
+
+Quality over speed — always run harness gates before marking work done:
+
+- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green
+- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront
+- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk <command>`
+- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524)
+
+```bash
+# Check harness (when harness.yaml present)
+sk harness check
+
+# Enable middleware for agent task
+SK_HARNESS=1 sk briefing "my task"
+
+# Set up harness on new project
+sk harness init --yes
+```
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 24e04773..c1e3df0e 100755
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -206,6 +206,22 @@ On Windows (PowerShell), apply these rules to reduce token consumption:
 
 **Closeout:** attach command output (not just assertions) · `sk learn` before `task_complete` · subagents handoff with `--status DONE --changed-file <file> --learn`.
 
+## Harness Engineering
+
+The 🛡️ 7 harness principles from `AGENTS.md` apply at runtime. Executable commands:
+
+| Principle | When | Command |
+|-----------|------|---------|
+| No-Ship-Bugs | Before every commit | `python3 test_security.py && python3 test_fixes.py` |
+| Follow-Workflow | New project setup | `sk harness init` |
+| Quality-Over-Speed | After implementation | `sk harness check` |
+| Tentacle-Orchestration | ≥3 files changed | `sk tentacle create <name> --briefing` |
+| No-Abandon | confidence < 1.0 | `sk briefing "<topic>"` → research loop |
+| Rules-First | Before every task | `sk briefing --auto --compact` |
+| Knowledge-Recording | After bug fix/pattern | `sk learn --mistake "Title" "Details"` |
+
+Enable harness dispatch middleware: `SK_HARNESS=1 sk <command>`
+
 ## Testing
 
 ```bash
diff --git a/AGENTS.md b/AGENTS.md
index 0b68920e..344616d8 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -78,15 +78,15 @@ For `browse-ui/` changes: `cd browse-ui && pnpm typecheck && pnpm lint && pnpm f
 
 <!-- AI: Read this table before every task. Apply each principle as a checklist gate. -->
 
-| # | Principle | Rule | <!-- AI: enforcement note --> |
-|---|-----------|------|-------------------------------|
-| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | <!-- AI: run test_security.py && test_fixes.py before any commit --> |
-| 2 | **Follow-Workflow** | Clarify→Plan→Execute→Verify→Close. No skipping phases. | <!-- AI: check WORKFLOW.md or use PLAN→BUILD→TEST→VERIFY→COMMIT if absent --> |
-| 3 | **Quality-Over-Speed** | Multi-platform = no shortcuts. Verify on all surfaces. | <!-- AI: run all surface gates (Python + browse-ui + Rust) before closeout --> |
-| 4 | **Tentacle-Orchestration** | ≥3 files or ≥2 modules → tentacle required. | <!-- AI: count changed files; if ≥3, create tentacle before editing --> |
-| 5 | **No-Abandon** | confidence < 1.0 = research loop, never BLOCKED. Fix or delegate. | <!-- AI: never write BLOCKED; create research-<topic> tentacle instead --> |
-| 6 | **Rules-First** | Read AGENTS.md before every task. | <!-- AI: this table IS the rules — re-read on each new task --> |
-| 7 | **Knowledge-Recording** | `sk learn` after every bug fix or new pattern. | <!-- AI: call sk learn --mistake or --pattern before task_complete --> |
+| # | Principle | Rule | Command | <!-- AI: enforcement note --> |
+|---|-----------|------|---------|-------------------------------|
+| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | `python3 test_security.py && test_fixes.py` | <!-- AI: run test_security.py && test_fixes.py before any commit --> |
+| 2 | **Follow-Workflow** | Clarify→Plan→Execute→Verify→Close. No skipping phases. | `sk harness init` (new) · `sk harness check` (verify) | <!-- AI: check WORKFLOW.md or use PLAN→BUILD→TEST→VERIFY→COMMIT if absent --> |
+| 3 | **Quality-Over-Speed** | Multi-platform = no shortcuts. Verify on all surfaces. | `SK_HARNESS=1 sk <cmd>` | <!-- AI: run all surface gates (Python + browse-ui + Rust) before closeout --> |
+| 4 | **Tentacle-Orchestration** | ≥3 files or ≥2 modules → tentacle required. | `sk tentacle create <name> --briefing` | <!-- AI: count changed files; if ≥3, create tentacle before editing --> |
+| 5 | **No-Abandon** | confidence < 1.0 = research loop, never BLOCKED. Fix or delegate. | `sk briefing "<topic>"` | <!-- AI: never write BLOCKED; create research-<topic> tentacle instead --> |
+| 6 | **Rules-First** | Read AGENTS.md before every task. | `sk briefing --auto --compact` | <!-- AI: this table IS the rules — re-read on each new task --> |
+| 7 | **Knowledge-Recording** | `sk learn` after every bug fix or new pattern. | `sk learn --mistake/--pattern/--feature` | <!-- AI: call sk learn --mistake or --pattern before task_complete --> |
 
 > Canonical source: `templates/copilot-instructions.md § 🛡️ Harness Engineering — 7 Nguyên tắc`  
 > Full rule details: [docs/AGENT-RULES.md](docs/AGENT-RULES.md)

From f70c19dd8419be39193ff11099ce86bcd30418f0 Mon Sep 17 00:00:00 2001
From: Linh Ngo <thlinh.ngo@gmail.com>
Date: Sat, 30 May 2026 23:37:05 +0700
Subject: [PATCH 3/3] fix: add python3 prefix to test_fixes.py command in
 AGENTS.md harness table

Address reviewer comment: after `&&`, a bare 'test_fixes.py' is run by the
shell as a command name and fails. Use explicit 'python3 test_fixes.py'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 AGENTS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/AGENTS.md b/AGENTS.md
index 344616d8..95c4bb65 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -80,7 +80,7 @@ For `browse-ui/` changes: `cd browse-ui && pnpm typecheck && pnpm lint && pnpm f
 
 | # | Principle | Rule | Command | <!-- AI: enforcement note --> |
 |---|-----------|------|---------|-------------------------------|
-| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | `python3 test_security.py && test_fixes.py` | <!-- AI: run test_security.py && test_fixes.py before any commit --> |
+| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | `python3 test_security.py && python3 test_fixes.py` | <!-- AI: run python3 test_security.py && python3 test_fixes.py before any commit --> |
 | 2 | **Follow-Workflow** | Clarify→Plan→Execute→Verify→Close. No skipping phases. | `sk harness init` (new) · `sk harness check` (verify) | <!-- AI: check WORKFLOW.md or use PLAN→BUILD→TEST→VERIFY→COMMIT if absent --> |
 | 3 | **Quality-Over-Speed** | Multi-platform = no shortcuts. Verify on all surfaces. | `SK_HARNESS=1 sk <cmd>` | <!-- AI: run all surface gates (Python + browse-ui + Rust) before closeout --> |
 | 4 | **Tentacle-Orchestration** | ≥3 files or ≥2 modules → tentacle required. | `sk tentacle create <name> --briefing` | <!-- AI: count changed files; if ≥3, create tentacle before editing --> |