From 0a3857e7f7ec2e3f546a7f6b06a94e52559d94f0 Mon Sep 17 00:00:00 2001 From: Linh Ngo Date: Sat, 30 May 2026 23:14:18 +0700 Subject: [PATCH 1/3] feat(skills): integrate sk harness quality gates into workflow skills (#735) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - tentacle-orchestration/SKILL.md: add Harness gate row to Phase 3 verification table + note to run sk harness check before Phase 4 when harness.yaml exists - task-step-generator/SKILL.md: extend TEST phase template to include sk harness check when harness.yaml exists - project-onboarding/SKILL.md: add step 0.4 Harness Engineering — sk harness init, verify harness.yaml, sk harness doctor - karpathy-guidelines/SKILL.md: link harness.yaml + sk harness check as the permanent mechanism for verifiable success criteria - workflow-creator/SKILL.md: add HARNESS gate row to gate table example Closes #735 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/skills/karpathy-guidelines/SKILL.md | 139 ++++++++ .github/skills/project-onboarding/SKILL.md | 309 ++++++++++++++++++ .github/skills/task-step-generator/SKILL.md | 232 +++++++++++++ .../references/step-file-template.md | 174 ++++++++++ .../skills/tentacle-orchestration/SKILL.md | 3 +- .github/skills/workflow-creator/SKILL.md | 159 +++++++++ .../references/workflow-template.md | 93 ++++++ 7 files changed, 1108 insertions(+), 1 deletion(-) create mode 100644 .github/skills/karpathy-guidelines/SKILL.md create mode 100644 .github/skills/project-onboarding/SKILL.md create mode 100644 .github/skills/task-step-generator/SKILL.md create mode 100644 .github/skills/task-step-generator/references/step-file-template.md create mode 100644 .github/skills/workflow-creator/SKILL.md create mode 100644 .github/skills/workflow-creator/references/workflow-template.md diff --git a/.github/skills/karpathy-guidelines/SKILL.md b/.github/skills/karpathy-guidelines/SKILL.md new file mode 100644 index 
00000000..9d6f90bb --- /dev/null +++ b/.github/skills/karpathy-guidelines/SKILL.md @@ -0,0 +1,139 @@ +--- +name: karpathy-guidelines +description: Behavioral guidelines to reduce common LLM coding mistakes. Use when writing, reviewing, or refactoring code to avoid overcomplication, make surgical changes, surface assumptions, and define verifiable success criteria. +license: MIT +vendored-from: https://github.com/forrestchang/andrej-karpathy-skills +vendored-commit: main +supported-hosts: Copilot CLI, Claude Code +--- + +# Karpathy Guidelines + +Behavioral guidelines to reduce common LLM coding mistakes, derived from [Andrej Karpathy's observations](https://x.com/karpathy/status/2015883857489522876) on LLM coding pitfalls. + +**Tradeoff:** These guidelines bias toward caution over speed. For trivial tasks, use judgment. + +## When to Use + +Invoke this skill when: +- Writing new code and at risk of over-engineering +- Reviewing or refactoring existing code +- Debugging a bug (need verifiable fix criteria) +- Starting a multi-step task (need a scoped plan) +- Feeling tempted to "improve" adjacent unrelated code + +## Workflow + +Apply the four guidelines in order for any non-trivial coding task: + +### 1. Think Before Coding + +**Don't assume. Don't hide confusion. Surface tradeoffs.** + +Before implementing: +- State your assumptions explicitly. If uncertain, ask. +- If multiple interpretations exist, present them - don't pick silently. +- If a simpler approach exists, say so. Push back when warranted. +- If something is unclear, stop. Name what's confusing. Ask. + +### 2. Simplicity First + +**Minimum code that solves the problem. Nothing speculative.** + +- No features beyond what was asked. +- No abstractions for single-use code. +- No "flexibility" or "configurability" that wasn't requested. +- No error handling for impossible scenarios. +- If you write 200 lines and it could be 50, rewrite it. 
+- Mandatory Project Rule 10, **Minimum Footprint** (canonical text: + `docs/AGENT-RULES.md#rule-10--minimum-footprint`): no unjustified new + files, no speculative abstractions, reuse existing patterns first, decompose + functions over 50 lines or explain why not, flag files over 400 lines, and + keep every changed line traceable to the task. + +Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify. + +### 3. Surgical Changes + +**Touch only what you must. Clean up only your own mess.** + +When editing existing code: +- Don't "improve" adjacent code, comments, or formatting. +- Don't refactor things that aren't broken. +- Match existing style, even if you'd do it differently. +- If you notice unrelated dead code, mention it - don't delete it. + +When your changes create orphans: +- Remove imports/variables/functions that YOUR changes made unused. +- Don't remove pre-existing dead code unless asked. + +The test: Every changed line should trace directly to the user's request. + +Project Rule 11, **New File Justification** (canonical text: +`docs/AGENT-RULES.md#rule-11--new-file-justification`): before adding a file, +search for an existing home, state the new file's responsibility, wire it into +the relevant lint/test/docs/CI surface, and add or update tests for the behavior +it owns. + +### 4. Goal-Driven Execution + +**Define success criteria. Loop until verified.** + +Transform tasks into verifiable goals: +- "Add validation" → "Write tests for invalid inputs, then make them pass" +- "Fix the bug" → "Write a test that reproduces it, then make it pass" +- "Refactor X" → "Ensure tests pass before and after" + +For multi-step tasks, state a brief plan: +``` +1. [Step] → verify: [check] +2. [Step] → verify: [check] +3. [Step] → verify: [check] +``` + +Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification. 
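The `[Step] → verify: [check]` plan format can be sketched as a small loop. This is a minimal illustration under stated assumptions, not part of the vendored guidelines: `Step`, `run_plan`, and the `max_attempts` limit are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One plan entry: an action plus the check that proves it worked."""
    name: str
    action: Callable[[], None]   # performs the change (hypothetical callback)
    verify: Callable[[], bool]   # returns True only when the step is verifiably done


def run_plan(steps: List[Step], max_attempts: int = 3) -> List[str]:
    """Run each step, repeating it until its verify check passes or attempts run out."""
    log = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            step.action()
            if step.verify():
                log.append(f"{step.name}: verified on attempt {attempt}")
                break
        else:
            # No attempt passed verification: stop rather than claim "done".
            raise RuntimeError(f"{step.name}: not verified after {max_attempts} attempts")
    return log
```

A weak criterion such as "make it work" has no `verify` callable to write, which is one way to notice that the goal still needs clarification.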
+ +Use `harness.yaml` to encode success criteria permanently: run `sk harness init` to scaffold the template, edit `success_criteria:` to match your goals, then run `sk harness check` after every implementation step. This turns the verify loop into a single command. + +**Within orchestrated workflows (multi-tentacle goals):** Apply this principle at the goal level, not just the task level. Before dispatching any tentacle: +1. Write the goal's success criteria explicitly (e.g., in `CONTEXT.md` or the first tentacle's description). +2. After all verification gates pass, evaluate the goal against those criteria. +3. If unmet, loop: plan new tentacles scoped to remaining gaps. If met, proceed to commit and close. + +The loop-until-verified semantic applies at every level — individual task, tentacle, and orchestrated goal — and produces the same result: no "done" claim without evidence. + +## Examples + + +User asks: "Add dark mode support to the dashboard" + +✅ Karpathy-aligned response: +1. Clarify scope: "Do you want CSS variables only, or also a toggle UI element?" +2. After clarification: implement exactly what was asked — no additional themes, no config system +3. Success criteria: "Dark mode renders correctly on dashboard; no other pages changed" + + + +User asks: "Fix the login bug" + +✅ Karpathy-aligned response: +1. State assumption: "I assume the bug is the 401 on expired tokens based on recent context" +2. Write a failing test that reproduces the bug +3. Make the minimal change to pass the test +4. Verify: no other tests broken, no adjacent refactoring done + + + +User asks: "Refactor the data processing module" + +✅ Karpathy-aligned response: +1. Ask: "What's the goal — readability, performance, or testability?" +2. After answer: change only what serves the stated goal +3. Match existing code style even if you'd write it differently +4. 
Ensure all existing tests still pass before and after + + +## Attribution + +Vendored from [forrestchang/andrej-karpathy-skills](https://github.com/forrestchang/andrej-karpathy-skills) under MIT license. +Original guidelines derived from Andrej Karpathy's public observations on LLM coding pitfalls. diff --git a/.github/skills/project-onboarding/SKILL.md b/.github/skills/project-onboarding/SKILL.md new file mode 100644 index 00000000..80cc8a65 --- /dev/null +++ b/.github/skills/project-onboarding/SKILL.md @@ -0,0 +1,309 @@ +--- +name: project-onboarding +description: Complete guide to set up the full AI-assisted development ecosystem for any project. Use when joining a new project, bootstrapping AI tools, initializing Copilot, or onboarding a codebase so no creator, hook, workflow, or routing layer is missed. +--- + +# Project Onboarding + +Set up the complete AI-assisted development ecosystem for any project. +Each phase builds on the previous one. **Skipping a phase leaves a gap +that compounds downstream.** Follow the phases in order. + +## When to Use + +- Joining a new codebase and you want the full Copilot/agent setup done in the right order +- User mentions "setup AI tools", "bootstrap project", "initialize copilot", or "onboard project" +- You want to ensure memory, hooks, workflows, tentacles, and conductor are all installed +- You need one canonical checklist instead of running creators piecemeal + +## Why this matters + +Without a structured onboarding, teams cherry-pick tools and miss critical +infrastructure. The AI has no memory across sessions, no guardrails against +dangerous commands, no workflow phases, and picks different skills each time. +This guide ensures every layer is in place before you write the first line of code. 
+ +Equally important: **deploying too much at once is its own failure mode.** +Over-broad instructions (`applyTo: "**/*"`) injected at every tool call, duplicate +skills across global and project surfaces, and routing rules that drift from what +is actually installed all compound into a bloated context that slows the AI and +masks real errors. This guide teaches both what to install and how to keep it lean. + +## Staged Rollout Principles + +Follow these principles throughout every phase. Violating them is how projects +end up with 100+ skills loaded simultaneously and 9 instructions firing on every +tool call. + +1. **Minimal first.** Deploy only what the current phase needs. Add more only after + the previous layer is verified clean. Resist installing "just in case" extras. + +2. **Progressive escalation.** Each creator output is an input to the next. Do not + run conductor-creator before you have agents and workflows to route — the rules + it generates will be empty or wrong. + +3. **Verify before advancing.** Each phase has an explicit verify step. Do not move + on if the verify command returns errors or warnings. Fix the gap now; it compounds. + +4. **Load budget awareness.** Every `applyTo: "**/*"` instruction fires on every tool + call. Every global skill adds to the catalog the AI must scan. After setup, audit: + - Instruction count: aim for ≤6 always-loaded instructions. + - `applyTo` scope: narrow to the file types or paths that actually need the rule. + - Skills: remove project-local copies of any skill that exists in `~/.copilot/skills/`. + - Duplicates: if a skill name appears in both global and project surfaces, keep only one. + +5. **Single source of truth.** If a skill is deployed globally (in `~/.copilot/skills/`), + do not re-deploy it in `.github/skills/`. The project copy silently duplicates context + without adding value. 
+ +## Overview + +``` +Phase 0: FOUNDATION -> Memory + Agents + Safety +Phase 1: PROCESS -> Workflows + Orchestration +Phase 2: ROUTING -> Conductor ties everything together +Phase 3: VERIFY -> Confirm zero gaps and healthy load budget +``` + +## Phase 0: Foundation + +Run these three creators first. Everything else depends on agents, memory, +and guardrails being in place. + +### 0.1 Session Knowledge + +**Invoke:** `session-knowledge-creator` skill + +**Output:** `briefing.py`, `learn.py`, `.instructions.md` + +**Why it matters:** Without session knowledge, every session starts from zero. +The AI repeats mistakes, forgets past decisions, and has no institutional memory. +Briefing gives pre-task context; learn records post-task insights. + +**Verify:** `sk briefing --wakeup` returns output. (fallback: `python3 ~/.copilot/tools/briefing.py --wakeup`) + +### 0.2 Agent Creator + +**Invoke:** `agent-creator` skill + +**Output:** `.github/agents/*.agent.md` + +**Why it matters:** Generic agents produce generic output. Specialized agents +encode your architecture, test framework, and domain knowledge. + +**Verify:** `ls .github/agents/` shows 5-8 agent files. + +### 0.3 Hook Creator + +**Invoke:** `hook-creator` skill + +**Output:** `.github/hooks/hooks.json` + `scripts/` + +**Why it matters:** Hooks are the **strongest enforcement** mechanism. +They physically intercept and block violations before they happen. Unlike +skills (AI can ignore) or instructions (AI can rationalize skipping), hooks +run on every tool call. They prevent commits to protected branches, block +credential leaks, and guard auto-generated files. + +**Verify:** `cat .github/hooks/hooks.json` lists preToolUse/postToolUse hooks. 
+ +### 0.4 Harness Engineering + +**Invoke:** `sk harness init` + +**Output:** `harness.yaml` + `.harness/` directory + +**Why it matters:** Harness configuration defines verifiable success criteria and produces a documented 36% agent performance improvement (CORE benchmark, arXiv 2412.04524). Without harness, "done" has no evidence. + +**Verify:** `sk harness doctor` returns no errors. `cat harness.yaml` shows success_criteria block. + +## Phase 1: Process + +With agents and safety in place, define **how work gets done.** + +### 1.1 Workflow Creator + +**Invoke:** `workflow-creator` skill + +**Output:** `WORKFLOW.md` or a `strict-tdd-workflow` skill + +**Why it matters:** Without phases, the AI jumps straight to coding before +understanding requirements, skips testing, and produces unreviewed output. +Workflows add blocking quality gates between phases. + +**Verify:** A `WORKFLOW.md` or `.github/skills/*workflow*/SKILL.md` exists. + +### 1.2 Tentacle Creator + +**Invoke:** `tentacle-creator` skill + +**Output:** `.github/skills/tentacle-orchestration/SKILL.md` + +**Why it matters:** Tasks spanning multiple modules run serially without +orchestration. Tentacle breaks them into parallel work units with clear file +ownership so agents do not overwrite each other. + +**Verify:** `.github/skills/tentacle-orchestration/SKILL.md` exists. + +## Phase 2: Routing + +The conductor connects **everything from Phase 0 and Phase 1** into a single +deterministic router. + +### 2.1 Conductor Creator + +**Invoke:** `conductor-creator` skill + +**Output:** +- `.github/skills/conductor/scripts/conductor.py` (engine) +- `.github/skills/conductor/scripts/conductor-rules.json` (rules) +- `.github/instructions/conductor-routing.instructions.md` (auto-load) + +**Why it matters:** Without a conductor, the AI re-derives routing every time, +picking different skills and workflows for the same task across sessions. +The conductor makes this deterministic: same input = same plan. 
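Deterministic routing comes down to selecting skills by fixed rules rather than by model judgment. The sketch below illustrates the idea only; it is not the actual `conductor.py`, and the keyword-to-skill table is a hypothetical stand-in that does not reflect the real schema of `conductor-rules.json`.

```python
from typing import Dict, List

# Hypothetical rule table: keyword -> skill name. Illustrative only.
RULES: Dict[str, str] = {
    "implement": "strict-tdd-workflow",
    "fix": "karpathy-guidelines",
    "onboard": "project-onboarding",
}


def route(task: str, rules: Dict[str, str] = RULES) -> List[str]:
    """Return the skills matched by a task description, in a stable order."""
    words = set(task.lower().split())
    # Iterating over sorted keys keeps the plan independent of dict or set ordering.
    return [rules[key] for key in sorted(rules) if key in words]
```

Because the function depends only on its input and a static table, calling it twice with the same task description yields the same plan.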
+ +**Verify:** + +```bash +python3 .github/skills/conductor/scripts/conductor.py --sync +# Expect: 0 new unrouted, 0 stale, 100% coverage +``` + +## Phase 3: Verify + +Run these checks to confirm **zero gaps** in the setup AND a healthy load budget. + +```bash +# 3.1 Sync check — routing rules match installed skills +python3 .github/skills/conductor/scripts/conductor.py --sync + +# 3.2 Rule audit — no orphan or stale references +python3 .github/skills/conductor/scripts/conductor.py --audit + +# 3.3 Test suite +python3 .github/skills/conductor/scripts/test-conductor.py + +# 3.4 Smoke test +python3 .github/skills/conductor/scripts/conductor.py "implement user login" --verbose +python3 .github/skills/conductor/scripts/conductor.py "fix crash on startup" --verbose +``` + +### 3.5 Load Budget Audit + +After conductor passes, audit the instruction and skill surfaces to prevent +context bloat from accumulating silently. + +```bash +# Count always-loaded instructions (applyTo: **/* or no filter) +grep -rl 'applyTo.*\*\*/\*' .github/instructions/ ~/.github/instructions/ 2>/dev/null + +# Find duplicate skill names across global and project +comm -12 \ + <(ls ~/.copilot/skills/ 2>/dev/null | sort) \ + <(ls .github/skills/ 2>/dev/null | sort) +# Any name printed here = duplicate. Remove the project copy if the global copy exists. + +# List skills that are no longer referenced in conductor routing +# --sync reports coverage gaps; use it as the authoritative check: +python3 .github/skills/conductor/scripts/conductor.py --sync +# Any skill on disk not covered by a routing rule is flagged by --sync. +# To review all rules for stale references, run --audit and inspect the output manually. 
+``` + +**Healthy targets after onboarding:** + +| Surface | Target | +|---------|--------| +| Always-loaded instructions (`applyTo: **/*`) | ≤ 6 total across global + project | +| Duplicate skills (same name in global + project) | 0 | +| Conductor orphans (skill on disk, not in rules) | 0 (or explicitly listed in `_meta.intentionally_unrouted`) | +| Conductor stale refs (rule references missing skill) | 0 | + +## Quick Reference + +| Phase | Creator | Output | Verify | +|-------|---------|--------|--------| +| 0.1 | `session-knowledge-creator` | briefing.py, learn.py | `sk briefing --wakeup` | +| 0.2 | `agent-creator` | .github/agents/*.agent.md | `ls .github/agents/` | +| 0.3 | `hook-creator` | .github/hooks/ | `cat hooks.json` | +| 1.1 | `workflow-creator` | WORKFLOW.md | File exists with phases | +| 1.2 | `tentacle-creator` | tentacle-orchestration | SKILL.md exists | +| 2.1 | `conductor-creator` | conductor-rules.json | `--sync` reports clean | +| 3 | Load audit | — | ≤6 always-loaded instr., 0 duplicate skills | + +## Dependency Map + +``` +session-knowledge-creator ---+ + | +agent-creator ---------------+ + | +hook-creator ----------------+---> conductor-creator ---> READY + | +workflow-creator ------------+ + | +tentacle-creator ------------+ +``` + +All five creators feed into conductor-creator. The conductor is the +integration point that ties the ecosystem together. 
+ +## Ongoing Maintenance + +| Event | Action | +|-------|--------| +| Added/removed a skill | `conductor.py --sync --fix` then re-run load audit | +| Added a new agent | Update `agent_routing` in conductor-rules.json | +| Changed workflow phases | Update `workflows` in conductor-rules.json | +| New session starts | `sk briefing --auto --compact` | +| After fixing a bug | `sk learn --mistake "Title" "Details" --tags "tags"` | +| After completing feature | `sk learn --feature "Title" "Details" --tags "tags"` | +| Skill installed globally | Remove project-local copy if it exists in `.github/skills/` | +| Instruction added | Check that `applyTo` is as narrow as possible; re-run load audit | + +## Platform Notes + +- **macOS/Linux:** Use `python3`. Paths use `/`. +- **Windows:** Use `python` instead. All scripts are cross-platform Python. +- **Copilot CLI:** Skills auto-discover via `.skill-meta.json`. + Instructions auto-inject via `.instructions.md` with `applyTo` frontmatter. + +## Troubleshooting + +| Problem | Cause | Fix | +|---------|-------|-----| +| `--sync` shows orphans | New skill added after setup | `--sync --fix` | +| Agent uses wrong model | Model not specified | Check `.instructions.md` model rules | +| Hook blocks valid action | Overly strict regex | Edit `.github/hooks/scripts/` | +| Briefing returns empty | No entries recorded | Start using `sk learn` after tasks | +| Context feels slow / bloated | Too many `applyTo: **/*` instructions | Narrow `applyTo` on each instruction file | +| Same skill name in global + project | Old project copy not removed after global rollout | Delete `.github/skills//` when `~/.copilot/skills//` exists | +| Conductor rules reference missing skill | Skill removed but rule not updated | `conductor.py --audit`, then remove stale rule or restore skill | + + +**Project:** Existing Python backend with no AI scaffolding yet + +**User asks:** "Onboard this project so future Copilot sessions have memory, hooks, workflows, and 
routing" + +**Recommended order:** +1. `session-knowledge-creator` +2. `agent-creator` +3. `hook-creator` +4. `workflow-creator` +5. `tentacle-creator` +6. `conductor-creator` + +**Expected result:** +- session memory tools installed +- project agents created +- hooks deployed +- workflow defined +- tentacle orchestration available +- conductor routing synced with zero gaps +- load audit: ≤6 always-loaded instructions, 0 duplicate skill names, 0 conductor orphans + +**Propagation note:** If any of the above meta-skills are already installed globally +(in `~/.copilot/skills/`), skip re-deploying them to `.github/skills/`. The project +should extend, not duplicate, the global layer. + diff --git a/.github/skills/task-step-generator/SKILL.md b/.github/skills/task-step-generator/SKILL.md new file mode 100644 index 00000000..f7c8665e --- /dev/null +++ b/.github/skills/task-step-generator/SKILL.md @@ -0,0 +1,232 @@ +--- +name: task-step-generator +description: > + Generate a structured STEPS.md file that breaks a specific task into concrete, ordered + steps grounded in the project's phased workflow. Use when a task is too complex for a + single prompt, when tentacle-orchestration needs a reviewed planning scaffold before + dispatch, or for single-module features, multi-stage bug fixes, and scripted migrations. + Trigger phrases: "create step + file", "generate steps", "make a task plan", "write out the steps", "break this into + steps", "step-by-step plan", "task steps", "execution plan". +--- + +# Task Step Generator + +Generate a `STEPS.md` file that breaks a specific task into concrete, ordered steps an +agent can follow without re-reading the full specification. Steps are grounded in the +project's existing phases (from `WORKFLOW.md` or the standard phased lifecycle) and +are normally scoped to complete in a single agent session. 
When invoked by +`tentacle-orchestration`, the step file is a top-level scaffold that must be reviewed +and split into scoped tentacles before dispatch. + +## When to Use + +- Task is too complex to fit in one prompt response but touches only 1–3 files or modules +- Tentacle orchestration needs a first-pass step plan to review before creating parallel work units +- You want a traceable, reviewable execution plan before starting implementation +- A task has non-obvious ordering constraints (e.g., schema migration before code change) +- User says "create step file", "break this into steps", "make a task plan" + +**Not for:** +- Tasks spanning 3+ independent modules as the final execution plan → use `tentacle-orchestration` to review this scaffold and split it into tentacles +- Project-level process templates → use `workflow-creator` to generate `WORKFLOW.md` +- Pure research or exploration tasks (no implementation deliverable) + +## Why Step Files Help + +Without explicit steps, agents make predictable mistakes: +- Start coding before understanding requirements +- Skip verification steps when they "seem" done +- Lose track of intermediate deliverables between session boundaries +- Make irreversible changes (DB migrations, file deletions) before testing + +A step file makes the execution plan visible and checkable — each step has a concrete +done condition, so the agent knows when to proceed and a human can audit progress. + +## How to Generate + +### Phase 1: Understand the task + +Before writing any steps, investigate: + +1. Read the task description and clarify any ambiguities. If any scope, dependency, or + acceptance criterion is below confidence `1.0`, prepend a RESEARCH step that splits the + uncertainty and dispatches independent validation on the strongest available model + (`claude-opus-4.7` when available). +2. Identify the implementation target: which files change? What is the entry point? +3. 
Check if a `WORKFLOW.md` exists (`cat .github/WORKFLOW.md 2>/dev/null`) — use its phases + as the skeleton; if not, use the standard phases below +4. Identify ordering constraints: what must happen before what? + +### Phase 2: Map to phases + +Assign each piece of work to a phase. Use only the phases the task actually needs. + +**Standard phases** (skip phases the task doesn't need): + +| Phase | Purpose | Gate artifact | +|-------|---------|--------------| +| CLARIFY | Confirm requirements are implementation-ready | Spec health report or confirmed spec | +| RESEARCH | Resolve confidence `< 1.0` ambiguities before decisions | Research evidence, rejected alternatives, confidence `1.0` | +| DESIGN | Produce a technical design or interface sketch | Design doc or interface definition | +| VERIFY | Review the design before touching code | Explicit approval (PASS/FAIL) | +| BUILD | Implement the code change | Compiling code with no regressions | +| TEST | Run and write tests | All tests green; if `harness.yaml` exists, include `sk harness check` — artifact: harness criteria output (all criteria green) | +| REVIEW | Check correctness, security, contracts | Code review findings addressed | +| LOOP-EVAL | Evaluate whether the overarching goal is met; decide to iterate or close | Goal met (proceed to COMMIT) or remaining gaps identified (loop to BUILD/TEST) | +| COMMIT | Package and ship | Clean git commit | + +Each phase produces one concrete artifact. A step only advances when that artifact exists. + +Include a **LOOP-EVAL** step when: +- The task has an explicit overarching goal (e.g., "all benchmarks pass", "all tests green") +- The task may require multiple iterations to reach the goal (e.g., a fix that reveals follow-on failures) +- The agent will be operating semi-autonomously and must decide whether to continue or stop + +Omit LOOP-EVAL for strictly bounded tasks with a single deliverable (e.g., "add this one column", "rename this function"). 
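The LOOP-EVAL decision can be sketched as a pure function over the goal criteria. This is a minimal sketch under assumptions: the `loop_eval` name, the `STOP` outcome, and the iteration budget are hypothetical additions for illustration, not part of the skill.

```python
from typing import Dict, List, Tuple


def loop_eval(
    criteria: Dict[str, bool], iteration: int, max_iterations: int = 5
) -> Tuple[str, List[str]]:
    """Decide the next phase from goal criteria results.

    Returns ("COMMIT", []) when every criterion is met, ("BUILD", gaps) when
    gaps remain and iterations are left, and ("STOP", gaps) once the assumed
    iteration budget is spent so a human can decide.
    """
    gaps = sorted(name for name, met in criteria.items() if not met)
    if not gaps:
        return "COMMIT", []
    if iteration >= max_iterations:
        return "STOP", gaps
    return "BUILD", gaps
```

Returning the remaining gaps alongside the decision gives the next BUILD/TEST pass a concrete scope instead of a general instruction to try again.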
+ +### Phase 3: Write concrete steps + +Each step must answer: **What exactly do I do, and how do I know I'm done?** + +Step format: +```markdown +## Step N: + +**Goal:** One sentence describing what this step produces. + +**Actions:** +1. +2. + +**Done when:** +**Confidence:** `1.0` required; if lower, this step is blocked by a RESEARCH step. +``` + +Use real commands from the project's toolchain. Avoid vague verbs like "check" or +"ensure" — replace them with the exact command that confirms the condition. + +### Phase 4: Add a self-check table + +After all steps, add a phase-gate table for quick progress tracking: + +```markdown +## Phase Gates + +| Phase | Artifact | Status | +|-------|---------|--------| +| CLARIFY | Confirmed spec | ☐ | +| RESEARCH | Confidence `< 1.0` concerns resolved or explicitly not needed | ☐ | +| BUILD | `npx tsc --noEmit` passes | ☐ | +| TEST | `yarn test` passes | ☐ | +| LOOP-EVAL | Goal criteria met (or single-pass task: skip) | ☐ | +| COMMIT | Clean `git diff --stat` | ☐ | +``` + +## Output Format + +Write to `.github/steps/.md` or to the path the user specifies. +If neither exists, write to `STEPS.md` in the project root. + +See `references/step-file-template.md` for the full annotated template. + +## Anti-Patterns + +| Anti-Pattern | Why It Fails | +|-------------|-------------| +| Steps without done conditions | Agent can't tell when to proceed | +| Mixing phases (build + test in one step) | Gate is ambiguous; errors mix together | +| Vague actions ("verify it works") | Not actionable; agent guesses | +| No ordering constraints | Agent skips steps that depend on earlier output | +| Treating a 3+ module scaffold as final | Step file becomes too large; review it, then split into tentacles | +| Low-confidence done condition accepted | Agent guesses success; add RESEARCH and block downstream steps | + + +**Task:** Add a `created_at` timestamp column to the `orders` table and expose it in the API response. 
+ +**Project:** Python + SQLite + FastAPI, no existing WORKFLOW.md. + +**Generated step file:** + +```markdown +# STEPS: Add created_at to orders + +**Task:** Add `created_at` timestamp to `orders` table and return it in GET /orders. +**Scope:** `migrations/`, `models/order.py`, `routers/orders.py`, `tests/test_orders.py` + +## Step 1: CLARIFY — Confirm scope + +**Goal:** Confirm there are no ambiguities before touching the schema. + +**Actions:** +1. `grep -r "orders" migrations/` — confirm existing migrations are in order +2. `grep -r "created_at" models/` — check if pattern exists elsewhere for consistency + +**Done when:** Migration baseline known; no contradicting existing column found. + +## Step 2: BUILD — Add migration + +**Goal:** Add migration file that adds `created_at` with a non-null default. + +**Actions:** +1. Create `migrations/003_add_orders_created_at.sql`: + ```sql + ALTER TABLE orders ADD COLUMN created_at TEXT NOT NULL DEFAULT (datetime('now')); + ``` +2. Run migration: `sqlite3 app.db < migrations/003_add_orders_created_at.sql` +3. `python -c "import ast; ast.parse(open('models/order.py').read())"` — syntax check + +**Done when:** `sqlite3 app.db ".schema orders"` shows `created_at` column. + +## Step 3: BUILD — Update model and router + +**Goal:** Surface `created_at` in the Pydantic model and API response. + +**Actions:** +1. Add `created_at: str` field to `OrderResponse` in `models/order.py` +2. Map DB column to field in `routers/orders.py` query result + +**Done when:** `python -m py_compile models/order.py routers/orders.py` exits 0. + +## Step 4: TEST — Run and extend tests + +**Goal:** Verify the field appears in the API response and no existing tests broke. + +**Actions:** +1. `pytest tests/test_orders.py -v` — confirm existing tests still pass +2. Add assertion: `assert "created_at" in response.json()[0]` +3. 
`pytest tests/test_orders.py -v` — confirm new assertion passes + +**Done when:** All tests green; new assertion present and passing. + +## Step 5: REVIEW — Quick correctness check + +**Goal:** Verify no injection vectors, no null risks, no contract breaks. + +**Actions:** +1. Confirm `created_at` default is server-side (not user-supplied) +2. Confirm migration is reversible (document rollback in a comment) + +**Done when:** No critical findings from the review checklist. + +## Step 6: COMMIT — Ship + +**Actions:** +1. `git add migrations/ models/ routers/ tests/` +2. `git diff --stat` — confirm only expected files changed +3. `git commit -m "feat(orders): add created_at timestamp column and API field"` + +**Done when:** `git log --oneline -1` shows the commit. + +## Phase Gates + +| Phase | Artifact | Status | +|-------|---------|--------| +| CLARIFY | Migration baseline confirmed | ☐ | +| BUILD (migration) | Schema shows `created_at` | ☐ | +| BUILD (code) | `py_compile` exits 0 | ☐ | +| TEST | All tests green | ☐ | +| REVIEW | No critical findings | ☐ | +| COMMIT | Clean commit | ☐ | +``` + diff --git a/.github/skills/task-step-generator/references/step-file-template.md b/.github/skills/task-step-generator/references/step-file-template.md new file mode 100644 index 00000000..a5d732f4 --- /dev/null +++ b/.github/skills/task-step-generator/references/step-file-template.md @@ -0,0 +1,174 @@ +# Step File Template + +Annotated template for `STEPS.md` files generated by the `task-step-generator` skill. +Copy this template, fill in each section, and remove annotation comments. + +--- + +```markdown +# STEPS: + + + +**Task:** +**Scope:** ``, ``, `` +**Estimated phases:** CLARIFY → BUILD → TEST → REVIEW → COMMIT + +--- + +## Step 1: CLARIFY — + + + +**Goal:** + +**Actions:** +1. `` — +2. `` — + +**Done when:** + +--- + +## Step 2: DESIGN — (skip if not needed) + + + +**Goal:** Interface or design defined before implementation begins. + +**Actions:** +1. 
Write function signatures / interface definitions in `` +2. Confirm with a quick review before filling in bodies + +**Done when:** Signatures are written and consistent with callers. + +--- + +## Step 3: BUILD — + + + +**Goal:** + +**Actions:** +1. `` — +2. `` — confirm no syntax errors + +**Done when:** `` exits 0 / `` is true. + +--- + +## Step 4: TEST — Run and extend tests + + + +**Goal:** Existing tests still pass; new behavior is covered. + +**Actions:** +1. `` — confirm baseline +2. Add test case for `` in `` +3. `` — confirm new assertion passes + +**Done when:** All tests green; new assertion present and passing. + +--- + +## Step 5: REVIEW — Correctness check + + + +**Goal:** No critical or high findings remain unresolved. + +**Actions:** +1. Trace each changed function with a concrete failure-case input +2. Check error paths: are all errors propagated or handled? +3. For DB/file/network changes: confirm no injection vectors + +**Done when:** Review checklist complete; any critical findings resolved. + +--- + +## Step 6: COMMIT — Ship + +**Actions:** +1. `git add ` +2. `git diff --stat` — verify only expected files changed +3. `git commit -m "(): "` + +**Done when:** `git log --oneline -1` shows the commit. + +--- + +## Phase Gates + + + +| Phase | Artifact | Status | +|-------|---------|--------| +| CLARIFY | | ☐ | +| BUILD | `` exits 0 | ☐ | +| TEST | `` all green | ☐ | +| REVIEW | No critical findings | ☐ | +| COMMIT | Clean git commit | ☐ | +``` + +--- + +## Annotation Guide + +### Done conditions + +Good done conditions are **observable without running the agent again**: + +| ✅ Good | ❌ Bad | +|--------|--------| +| `` `npx tsc --noEmit` exits 0 `` | "TypeScript compiles correctly" | +| `` `pytest` shows 0 failures `` | "Tests pass" | +| "Migration column visible in `.schema`" | "Migration applied" | +| "PR has no 🔴 findings in review" | "Code looks good" | + +### Phase selection + +Include only the phases your task actually needs. 
A bug fix often skips DESIGN and REVIEW. +A schema migration always includes CLARIFY (check existing state) and TEST. + +| Task type | Typical phases | +|-----------|---------------| +| Bug fix | CLARIFY → BUILD → TEST → COMMIT | +| New feature | CLARIFY → DESIGN → BUILD → TEST → REVIEW → COMMIT | +| Schema migration | CLARIFY → BUILD → TEST → REVIEW → COMMIT | +| Refactor | CLARIFY → BUILD → TEST → REVIEW → COMMIT | +| Config change | CLARIFY → BUILD → COMMIT | + +### Naming + +Name the step file after the task slug: +- `feat-add-created-at.md` +- `fix-null-pointer-in-handler.md` +- `refactor-auth-module.md` + +Store at `.github/steps/.md` so it's tracked with the repo. diff --git a/.github/skills/tentacle-orchestration/SKILL.md b/.github/skills/tentacle-orchestration/SKILL.md index 1d458df0..0ac03140 100644 --- a/.github/skills/tentacle-orchestration/SKILL.md +++ b/.github/skills/tentacle-orchestration/SKILL.md @@ -327,8 +327,9 @@ Summary: | **Review** | Security issues, design flaws, scope creep | Never skip | | **Docs** | Stale README, outdated JSDoc, missing CHANGELOG | Internal refactors only | | **QA audit** | Hallucinated tests, spec mismatches, blind spots | Low-risk changes only | +| **Harness** | Missing/unreachable success criteria | When `harness.yaml` is absent | -The first 4 gates are mandatory. Skipping any of them means you don't know if the agent output is correct. +The first 4 gates are mandatory. Skipping any of them means you don't know if the agent output is correct. When `harness.yaml` exists, also run `sk harness check [--json]` as an additional gate before Phase 4 to verify that every success-criteria command is reachable and passing. **Evidence requirement:** Each gate must produce concrete, recorded output before being marked as passed. Do not rely on agent claims that "lint is clean" or "tests pass" — run the commands yourself and attach or reference the output. A gate is only passed when you hold the proof, not when the sub-agent says it is. 
See Rule 9 (Claims Require Evidence) in `docs/AGENT-RULES.md`. diff --git a/.github/skills/workflow-creator/SKILL.md b/.github/skills/workflow-creator/SKILL.md new file mode 100644 index 00000000..e983d6f8 --- /dev/null +++ b/.github/skills/workflow-creator/SKILL.md @@ -0,0 +1,159 @@ +--- +name: workflow-creator +description: > + Create a phased development workflow (WORKFLOW.md) with quality gates for any project. + Use when setting up a new project, improving development process, or when the user mentions + "create workflow", "setup phases", "quality gates", "development process", "CI pipeline", + or wants a structured multi-phase approach to UI/feature changes. +--- + +# Workflow Creator + +Generate a `WORKFLOW.md` file that defines a phased development lifecycle with quality gates, +phase dependencies, and evidence requirements. + +## When to Use + +- Setting up a new project that needs a structured development process +- User mentions "create workflow", "quality gates", "development phases", or "CI pipeline" +- AI agents are skipping steps (testing, review) or working out of order +- A feature involves multiple stages (design → build → test → QA) that must be gated + +## Why Phased Workflows Matter + +Without phases, AI agents make common mistakes: +- Code before understanding requirements → rework +- Skip design verification → ship broken UI +- Skip testing → broken in production +- No visual QA → pixel-level bugs users notice +- No review gate → architecture drift + +A phased workflow with **blocking gates** prevents these by enforcing order. + +## Workflow Template + +Every workflow follows this pattern: + +``` +Phase 0 → Phase 1 → Phase 2 → ... → Phase N + ↑ gate ↑ ↑ gate ↑ ↑ gate ↑ +``` + +**Gates are BLOCKING** — cannot proceed until previous phase produces its required artifact. 
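The blocking-gate rule above can be sketched in a few lines of Python. This is only an illustration, not part of the skill: the phase list and the helper names `next_allowed_phase` and `can_start` are hypothetical, and "artifact exists" is reduced to membership in a set.

```python
from typing import List, Optional, Set

# Hypothetical phase order for illustration; a real workflow supplies its own.
PHASES: List[str] = ["CLARIFY", "BUILD", "TEST", "COMMIT"]


def next_allowed_phase(phases: List[str], artifacts: Set[str]) -> Optional[str]:
    """Return the first phase that has no artifact yet, or None if all are done.

    Because gates are blocking, this is the only phase allowed to run next.
    """
    for phase in phases:
        if phase not in artifacts:
            return phase
    return None


def can_start(phase: str, phases: List[str], artifacts: Set[str]) -> bool:
    """True only when every phase before `phase` has produced its artifact."""
    index = phases.index(phase)
    return all(earlier in artifacts for earlier in phases[:index])
```

With only the CLARIFY artifact recorded, `can_start("TEST", PHASES, {"CLARIFY"})` is false because BUILD has not produced anything yet, which is the behavior a blocking gate is meant to enforce.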
+ +### Base Phases (adapt for your project) + +| Phase | Name | Purpose | Artifact | +|-------|------|---------|----------| +| 0 | CLARIFY | Make requirements implementation-ready | Spec Health Report | +| 0.5 | RESEARCH | Resolve any CLARIFY/DESIGN/ROUTING confidence below `1.0` | Independent research evidence + confidence `1.0` | +| 1 | DESIGN | Generate visual/technical design | Design files or specs | +| 2 | VERIFY | Review design before coding | Review verdicts (PASS/FAIL) | +| 3 | BUILD | Implement code | Compiling code + passing tests | +| 4 | TEST | Functional verification | Test results (all pass) | +| 5 | REVIEW | Code quality check | Review approval | +| 6 | QA | Visual/manual verification | Screenshots/evidence | +| 7 | COMMIT | Ship it | Clean git commit | + +The RESEARCH phase is blocking whenever confidence is `< 1.0`. Split ambiguous/noisy +questions into independent research tasks and use the strongest available model +(`claude-opus-4.7` when available) for validation before proceeding to DESIGN or BUILD. + +### Customization by Project Type + +**Backend/API**: Drop DESIGN + QA, strengthen TEST with integration + load tests. +**Mobile/Desktop**: Keep all phases, add per-platform QA. +**Libraries**: Drop DESIGN/VERIFY, strengthen REVIEW (API surface), add DOCS phase. +**Data pipelines**: Replace DESIGN with SCHEMA REVIEW, QA with DATA VALIDATION. + +## Creating a Workflow + +### Step 1: Understand the Project + +Examine project type, existing CI/CD, test infrastructure, and agents. + +### Step 2: Select Phases + +Each phase needs: purpose, input, activities, gate artifact, owner, skip conditions. + +### Step 3: Define Blocking Wait Rule + +```markdown +### ⛔ BLOCKING WAIT Rule +NEVER start Phase N+1 while Phase N is running. +Parallelism ONLY within a single phase (e.g., 3 test suites in parallel). +``` + +### Step 4: Define Phase Gate Evidence Table + +Map each phase to its required evidence, verification method, and when skipping is blocked. 
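One way to keep such an evidence table consistent is to generate it from structured rows. The sketch below assumes a hypothetical `GateEvidence` record and `render_gate_table` helper; the column names and sample rows are illustrative, not prescribed by this skill.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateEvidence:
    """One row of the gate table (hypothetical structure for illustration)."""
    phase: str
    evidence: str
    verification: str
    skip_blocked_when: str


def render_gate_table(rows: List[GateEvidence]) -> str:
    """Render the rows as a Markdown table, one line per phase."""
    header = "| Phase | Required Evidence | Verification | Skip Blocked When |"
    divider = "|-------|-------------------|--------------|-------------------|"
    body = [
        f"| {r.phase} | {r.evidence} | {r.verification} | {r.skip_blocked_when} |"
        for r in rows
    ]
    return "\n".join([header, divider, *body])


# Illustrative rows only; real evidence and commands depend on the project.
EXAMPLE_ROWS = [
    GateEvidence("BUILD", "TypeScript compiles", "`npx tsc --noEmit` exits 0", "Always"),
    GateEvidence("TEST", "All tests green", "`yarn test --ci` exits 0", "Always"),
]
```

Calling `render_gate_table(EXAMPLE_ROWS)` yields a table that can be pasted into `WORKFLOW.md`; keeping the rows in code makes it harder for a phase to end up without a verification method.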
+ +### Step 5: Add Self-Check Protocol + +At every phase transition, verify artifacts exist and meet quality criteria. + +## Integration + +- Store as `.github/WORKFLOW.md` +- Reference from `AGENTS.md` and project instructions +- Hooks enforce phases (e.g., `commit-gate.py`) +- Conductor agent uses workflow as playbook + +### Deploying via project profiles + +Instead of building a workflow from scratch, use `setup-project.py --profile` or `install-project-hooks.py --profile` to install a pre-built hook bundle and starter `WORKFLOW.md`: + +```bash +python3 ~/.copilot/tools/setup-project.py --profile python # Python: TDD, test-reminder, commit-gate +python3 ~/.copilot/tools/setup-project.py --profile typescript # TypeScript: coding-standards, test-reminder +python3 ~/.copilot/tools/setup-project.py --profile mobile # Mobile: architecture-guard, QA phase +python3 ~/.copilot/tools/setup-project.py --profile fullstack # Full-stack: architecture-guard, session-banner + +# Hooks only (no full project setup) +python3 ~/.copilot/tools/install-project-hooks.py --profile python --workflow # hooks + WORKFLOW.md +python3 ~/.copilot/tools/install-project-hooks.py --list-profiles # show available profiles + +# Build a custom profile and deploy it +python3 ~/.copilot/tools/profile-builder.py --name myteam \ + --hooks dangerous-blocker.py commit-gate.py \ + --phases CLARIFY BUILD TEST COMMIT # creates presets/myteam.json +python3 ~/.copilot/tools/profile-export.py --profile myteam --output myteam.json # export to share +python3 ~/.copilot/tools/profile-import.py --file myteam.json # import on another machine +python3 ~/.copilot/tools/setup-project.py --profile myteam # deploy +``` + +Profile bundles are defined in `presets/` (`default`, `python`, `typescript`, `mobile`, `fullstack`). +Use `--workflow` with `install-project-hooks.py` to also generate a starter `WORKFLOW.md`. 
+ +## Anti-Patterns + +| Anti-Pattern | Why It Fails | +|-------------|-------------| +| Too many phases (>9) | Overhead kills velocity | +| No skip conditions | Trivial changes take forever | +| Soft gates ("should") | AI rationalizes skipping | +| No evidence requirements | "Done" without proof | +| Phase overlap allowed | Defeats gate purpose | +| Proceeding past CLARIFY with confidence `< 1.0` | Implementation starts from guesses; run RESEARCH first | + + +**Project:** React dashboard (TypeScript + Jest + Playwright) + +**Selected phases:** CLARIFY → DESIGN → VERIFY → BUILD → TEST → REVIEW → QA → COMMIT + +**Customizations:** +- Skipped DESIGN for bug-fix tasks (`skip_if: bug_fix: true`) +- TEST requires both Jest (unit) and Playwright (e2e) passing +- QA captures Playwright screenshots as visual evidence +- COMMIT blocked by `commit-gate.py` until QA artifact exists + +**Gate table (excerpt):** +| Phase | Evidence | Command | +|-------|---------|---------| +| BUILD | TypeScript compiles | `npx tsc --noEmit` | +| TEST | All tests green | `yarn test --ci` | +| HARNESS | All criteria green | `sk harness check` | +| QA | Screenshot in `qa-evidence/` | Playwright visual run | + +**Output:** `.github/WORKFLOW.md` (referenced from `AGENTS.md`) + diff --git a/.github/skills/workflow-creator/references/workflow-template.md b/.github/skills/workflow-creator/references/workflow-template.md new file mode 100644 index 00000000..b513c4bb --- /dev/null +++ b/.github/skills/workflow-creator/references/workflow-template.md @@ -0,0 +1,93 @@ +# WORKFLOW.md Template + +Copy and customize for your project. + +```markdown +# Development Workflow + +> This document defines the phased development lifecycle with quality gates. +> Every phase has a BLOCKING gate — cannot proceed until the previous phase's artifact exists. 
+ +## Phase Overview + +| Phase | Name | Owner | Gate Artifact | Skip When | +|-------|------|-------|---------------|-----------| +| 0 | CLARIFY | spec-clarifier | Spec Health Report (verdict=CLEAN) | Trivial bugfix (<3 files, clear repro) | +| 1 | DESIGN | designer | Design files (HTML/PNG/Figma) | Non-UI changes | +| 2 | VERIFY | 3 reviewers (parallel) | All 3 verdicts = PASS | Phase 1 skipped | +| 3 | BUILD | builder | Compiling code + passing unit tests | — | +| 4 | TEST | test runners (parallel) | All test suites pass | — | +| 5 | REVIEW | code-reviewer | Review approval | — | +| 6 | QA | qa-verifier | Screenshots + OCR evidence | Non-UI changes | +| 7 | COMMIT | conductor | Clean git commit | — | + +## ⛔ BLOCKING WAIT Rule + +Start Phase N+1 ONLY after Phase N artifacts exist. +Parallelism is allowed WITHIN a single phase (e.g., 3 test suites in parallel). + +Starting the next phase early means working blind — code written without verification +results almost always needs rewriting. 
+ +## Phase Gate Evidence + +Each gate requires specific artifacts before the phase can be marked "done": + +| Phase | Required Evidence | Verification | +|-------|-------------------|-------------| +| 0: CLARIFY | Spec Health Report with verdict | Check verdict = ✅ CLEAN | +| 1: DESIGN | Design files exist | File existence check | +| 2: VERIFY | All reviewer verdicts | Count PASS verdicts = total reviewers | +| 3: BUILD | Build output + test output | Exit code = 0 | +| 4: TEST | Test results per platform | 0 failures across all suites | +| 5: REVIEW | Review comments | No blocking issues | +| 6: QA | Screenshots + text verification | OCR confirms expected elements | +| 7: COMMIT | Git commit hash | Commit exists in log | + +## Self-Check Protocol + +At every phase transition, run this check: + +``` +□ Previous phase artifact exists (not just "I think it passed") +□ Artifact meets quality criteria (not empty, not trivially correct) +□ No blocking issues from previous phase remain unresolved +□ Current phase has clear input from previous phase's output +``` + +## Phase Details + +### Phase 0: CLARIFY + + +### Phase 1: DESIGN + + +### Phase 2: VERIFY + + +### Phase 3: BUILD + + +### Phase 4: TEST + + +### Phase 5: REVIEW + + +### Phase 6: QA + + +### Phase 7: COMMIT + + +## Anti-Patterns + +| Anti-Pattern | Why It Fails | +|-------------|-------------| +| Skipping phases for "simple changes" | Simple changes still break things | +| Starting Phase N+1 while N runs | Working blind = rework guaranteed | +| Marking phases done without artifacts | "Trust me" doesn't catch bugs | +| Using same agent for build and review | Self-review misses builder's blind spots | +| Sequential test suites when parallel is possible | Wastes time proportional to suite count | +``` From 32f5a306ffb59126705b899ff77bb32d20760ed6 Mon Sep 17 00:00:00 2001 From: Linh Ngo Date: Sat, 30 May 2026 23:14:28 +0700 Subject: [PATCH 2/3] feat(agents/instructions): add harness integration to all agents and 
copilot-instructions (#736) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - All 10 .github/agents/*.agent.md: append Harness Integration section with sk harness check, SK_HARNESS=1, sk harness init, quality-over-speed - .github/copilot-instructions.md: add ## Harness Engineering section after Quality Checklist — 7-principle table with executable commands - AGENTS.md: add Command column to harness principles table with sk harness init, SK_HARNESS=1, sk harness check, sk tentacle etc. Closes #736 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/browse-ui-host-state.agent.md | 20 +++++++++++++++++++ .../agents/browser-security-reviewer.agent.md | 20 +++++++++++++++++++ .github/agents/dev-leader.agent.md | 20 +++++++++++++++++++ .../agents/hosted-shell-bootstrap.agent.md | 20 +++++++++++++++++++ .github/agents/python-browse-backend.agent.md | 20 +++++++++++++++++++ .github/agents/qa-leader.agent.md | 20 +++++++++++++++++++ .github/agents/research-planner.agent.md | 20 +++++++++++++++++++ .github/agents/test-leader.agent.md | 20 +++++++++++++++++++ .github/agents/verification-gate.agent.md | 20 +++++++++++++++++++ .../agents/whole-app-impact-auditor.agent.md | 20 +++++++++++++++++++ .github/copilot-instructions.md | 16 +++++++++++++++ AGENTS.md | 18 ++++++++--------- 12 files changed, 225 insertions(+), 9 deletions(-) diff --git a/.github/agents/browse-ui-host-state.agent.md b/.github/agents/browse-ui-host-state.agent.md index 76ca9133..4a6a1726 100644 --- a/.github/agents/browse-ui-host-state.agent.md +++ b/.github/agents/browse-ui-host-state.agent.md @@ -95,3 +95,23 @@ Use Playwright or browser smoke only when runtime behavior cannot be proven by u ## Output Summarize the state machine, affected UI surfaces, and validation evidence. Call out any browser-dependent behavior explicitly. 
+ +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/browser-security-reviewer.agent.md b/.github/agents/browser-security-reviewer.agent.md index a51aeeea..045f66c6 100644 --- a/.github/agents/browser-security-reviewer.agent.md +++ b/.github/agents/browser-security-reviewer.agent.md @@ -70,3 +70,23 @@ Return: - **Safe-to-merge conditions:** exact checks that must pass If there are no genuine issues, say so directly and do not invent style feedback. 
+ +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/dev-leader.agent.md b/.github/agents/dev-leader.agent.md index 11eeca48..c9ab09e7 100644 --- a/.github/agents/dev-leader.agent.md +++ b/.github/agents/dev-leader.agent.md @@ -96,3 +96,23 @@ When you encounter a problem that spans domains: Primary: `*.py` (root tools), `hooks/**/*`, `browse/**/*.py`, `migrate.py`, `sk.py`, `install.py` Out of scope: `browse-ui/src/**/*` (belongs to browse-leader), `crates/**/*` (flag to orchestrator) + +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` 
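The conditional rule in these Harness Integration sections (run `sk harness check` only when `harness.yaml` exists) can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name `harness_check_command` is hypothetical, and it assumes `harness.yaml` sits at the project root, which the patch does not state.

```python
from pathlib import Path
from typing import List, Optional


def harness_check_command(project_root: Path, json_output: bool = False) -> Optional[List[str]]:
    """Return the harness gate command, or None when no harness.yaml is found.

    Assumption: harness.yaml lives directly under project_root.
    """
    if not (project_root / "harness.yaml").is_file():
        return None  # No harness configured, so there is no harness gate to run.
    command = ["sk", "harness", "check"]
    if json_output:
        # The --json flag is the one mentioned in the tentacle-orchestration skill.
        command.append("--json")
    return command
```

A caller would typically pass the returned list to `subprocess.run` and keep the captured output as gate evidence. How `sk` reports failure (for example through its exit code) is not described in this patch, so that part is left to the caller.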
diff --git a/.github/agents/hosted-shell-bootstrap.agent.md b/.github/agents/hosted-shell-bootstrap.agent.md index e031e970..5dc74bc6 100644 --- a/.github/agents/hosted-shell-bootstrap.agent.md +++ b/.github/agents/hosted-shell-bootstrap.agent.md @@ -124,3 +124,23 @@ Open a PR with: - Test output - Browser smoke evidence if possible - Any follow-up issue needed for HTTPS companion or richer pairing + +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/python-browse-backend.agent.md b/.github/agents/python-browse-backend.agent.md index c6de77a5..6cbf606a 100644 --- a/.github/agents/python-browse-backend.agent.md +++ b/.github/agents/python-browse-backend.agent.md @@ -103,3 +103,23 @@ If `watch-sessions.py` changes, also run or confirm coverage for `tests/test_wat ## Output Report the changed routes, security decisions, synchronization decisions, and test evidence. If any behavior is deferred, open or reference a follow-up issue instead of leaving silent TODOs. 
+ +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/qa-leader.agent.md b/.github/agents/qa-leader.agent.md index 01d53066..f4862e37 100644 --- a/.github/agents/qa-leader.agent.md +++ b/.github/agents/qa-leader.agent.md @@ -139,3 +139,23 @@ Security: no SQL interpolation, no pickle, no secrets Primary: All changed surfaces (read-only audit + gate execution) Out of scope: Implementing fixes (that belongs to dev-leader or browse-leader) + +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/research-planner.agent.md 
b/.github/agents/research-planner.agent.md index 88eff5a0..730f7cd5 100644 --- a/.github/agents/research-planner.agent.md +++ b/.github/agents/research-planner.agent.md @@ -77,3 +77,23 @@ When creating an implementation issue, include: ## Output Produce concise research that is implementation-ready. If asked to create GitHub issues, make each issue specific enough that a cloud agent can implement it without hidden context. + +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/test-leader.agent.md b/.github/agents/test-leader.agent.md index 0cc775ab..2ad976ef 100644 --- a/.github/agents/test-leader.agent.md +++ b/.github/agents/test-leader.agent.md @@ -113,3 +113,23 @@ Handoff must include: test suite output (pass/fail counts), coverage gaps identi Primary: `test_*.py`, `run_all_tests.py`, `tests/**/*`, `browse-ui/src/**/*.test.*`, `browse-ui/e2e/**/*` Review scope (read-only audit): all files changed by dev-leader or browse-leader + +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define 
success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/verification-gate.agent.md b/.github/agents/verification-gate.agent.md index cc9bd062..3184b9fc 100644 --- a/.github/agents/verification-gate.agent.md +++ b/.github/agents/verification-gate.agent.md @@ -90,3 +90,23 @@ Return: - **Sync coverage:** watcher, auto-update/install, docs, hooks, skills/agents, conventions, CI/deploy status - **Failures:** root cause and affected file/test when known - **Next action:** exact command or issue/PR comment to run next + +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/agents/whole-app-impact-auditor.agent.md b/.github/agents/whole-app-impact-auditor.agent.md index da8d985b..a3a8a157 100644 --- a/.github/agents/whole-app-impact-auditor.agent.md +++ b/.github/agents/whole-app-impact-auditor.agent.md @@ -91,3 +91,23 @@ Return: - 
**Verification matrix:** command or check per impacted surface - **Risks if skipped:** concrete breakage scenario - **Follow-up issues:** only when deferring work is safe and explicit + +## Harness Integration + +Quality over speed — always run harness gates before marking work done: + +- **If `harness.yaml` exists**: run `sk harness check` before writing handoff; verify all success criteria are green +- **For new project setup tasks**: include `sk harness init` as part of setup to define success criteria upfront +- **For complex agent dispatch tasks**: enable telemetry with `SK_HARNESS=1 sk ` +- **Never skip gates to save time**: a 36% performance gap separates harnessed from un-harnessed work (CORE benchmark, arXiv 2412.04524) + +```bash +# Check harness (when harness.yaml present) +sk harness check + +# Enable middleware for agent task +SK_HARNESS=1 sk briefing "my task" + +# Set up harness on new project +sk harness init --yes +``` diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 24e04773..c1e3df0e 100755 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -206,6 +206,22 @@ On Windows (PowerShell), apply these rules to reduce token consumption: **Closeout:** attach command output (not just assertions) · `sk learn` before `task_complete` · subagents handoff with `--status DONE --changed-file --learn`. +## Harness Engineering + +The 🛡️ 7 harness principles from `AGENTS.md` apply at runtime. 
Executable commands: + +| Principle | When | Command | +|-----------|------|---------| +| No-Ship-Bugs | Before every commit | `python3 test_security.py && python3 test_fixes.py` | +| Follow-Workflow | New project setup | `sk harness init` | +| Quality-Over-Speed | After implementation | `sk harness check` | +| Tentacle-Orchestration | ≥3 files changed | `sk tentacle create --briefing` | +| No-Abandon | confidence < 1.0 | `sk briefing ""` → research loop | +| Rules-First | Before every task | `sk briefing --auto --compact` | +| Knowledge-Recording | After bug fix/pattern | `sk learn --mistake "Title" "Details"` | + +Enable harness dispatch middleware: `SK_HARNESS=1 sk ` + ## Testing ```bash diff --git a/AGENTS.md b/AGENTS.md index 0b68920e..344616d8 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -78,15 +78,15 @@ For `browse-ui/` changes: `cd browse-ui && pnpm typecheck && pnpm lint && pnpm f -| # | Principle | Rule | | -|---|-----------|------|-------------------------------| -| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | | -| 2 | **Follow-Workflow** | Clarify→Plan→Execute→Verify→Close. No skipping phases. | | -| 3 | **Quality-Over-Speed** | Multi-platform = no shortcuts. Verify on all surfaces. | | -| 4 | **Tentacle-Orchestration** | ≥3 files or ≥2 modules → tentacle required. | | -| 5 | **No-Abandon** | confidence < 1.0 = research loop, never BLOCKED. Fix or delegate. | | -| 6 | **Rules-First** | Read AGENTS.md before every task. | | -| 7 | **Knowledge-Recording** | `sk learn` after every bug fix or new pattern. | | +| # | Principle | Rule | Command | | +|---|-----------|------|---------|-------------------------------| +| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | `python3 test_security.py && test_fixes.py` | | +| 2 | **Follow-Workflow** | Clarify→Plan→Execute→Verify→Close. No skipping phases. 
| `sk harness init` (new) · `sk harness check` (verify) | | +| 3 | **Quality-Over-Speed** | Multi-platform = no shortcuts. Verify on all surfaces. | `SK_HARNESS=1 sk ` | | +| 4 | **Tentacle-Orchestration** | ≥3 files or ≥2 modules → tentacle required. | `sk tentacle create --briefing` | | +| 5 | **No-Abandon** | confidence < 1.0 = research loop, never BLOCKED. Fix or delegate. | `sk briefing ""` | | +| 6 | **Rules-First** | Read AGENTS.md before every task. | `sk briefing --auto --compact` | | +| 7 | **Knowledge-Recording** | `sk learn` after every bug fix or new pattern. | `sk learn --mistake/--pattern/--feature` | | > Canonical source: `templates/copilot-instructions.md § 🛡️ Harness Engineering — 7 Nguyên tắc` > Full rule details: [docs/AGENT-RULES.md](docs/AGENT-RULES.md) From f70c19dd8419be39193ff11099ce86bcd30418f0 Mon Sep 17 00:00:00 2001 From: Linh Ngo Date: Sat, 30 May 2026 23:37:05 +0700 Subject: [PATCH 3/3] fix: add python3 prefix to test_fixes.py command in AGENTS.md harness table Address reviewer comment: 'test_fixes.py' without python3 would try to execute as a shell command and fail. Use explicit 'python3 test_fixes.py'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- AGENTS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md b/AGENTS.md index 344616d8..95c4bb65 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -80,7 +80,7 @@ For `browse-ui/` changes: `cd browse-ui && pnpm typecheck && pnpm lint && pnpm f | # | Principle | Rule | Command | | |---|-----------|------|---------|-------------------------------| -| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | `python3 test_security.py && test_fixes.py` | | +| 1 | **No-Ship-Bugs** | CODE→COMPILE→TEST→VERIFY→COMMIT. Never commit without passing tests. | `python3 test_security.py && python3 test_fixes.py` | | | 2 | **Follow-Workflow** | Clarify→Plan→Execute→Verify→Close. No skipping phases. 
| `sk harness init` (new) · `sk harness check` (verify) | | | 3 | **Quality-Over-Speed** | Multi-platform = no shortcuts. Verify on all surfaces. | `SK_HARNESS=1 sk ` | | | 4 | **Tentacle-Orchestration** | ≥3 files or ≥2 modules → tentacle required. | `sk tentacle create --briefing` | |