EdgeCaser · Apr 13, 2026 · Apr 14, 2026
diff --git a/.gemini/settings.json b/.gemini/settings.json
@@ -0,0 +1,39 @@
+{
+  "modelConfigs": {
+    "customAliases": {
+      "shipwright-gemini-low": {
+        "extends": "chat-base-2.5",
+        "modelConfig": {
+          "model": "gemini-2.5-flash-lite",
+          "generateContentConfig": {
+            "thinkingConfig": {
+              "thinkingBudget": 512
+            }
+          }
+        }
+      },
+      "shipwright-gemini-medium": {
+        "extends": "chat-base-2.5",
+        "modelConfig": {
+          "model": "gemini-2.5-flash-lite",
+          "generateContentConfig": {
+            "thinkingConfig": {
+              "thinkingBudget": 512
+            }
+          }
+        }
+      },
+      "shipwright-gemini-high": {
+        "extends": "chat-base-2.5",
+        "modelConfig": {
+          "model": "gemini-2.5-flash-lite",
+          "generateContentConfig": {
+            "thinkingConfig": {
+              "thinkingBudget": 2048
+            }
+          }
+        }
+      }
+    }
+  }
+}
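
Review note on `.gemini/settings.json`: all three aliases point at the same model, and `shipwright-gemini-low` and `shipwright-gemini-medium` also share the same `thinkingBudget` (512), so as written those two tiers are indistinguishable. The diff does not say whether that is intentional. The sketch below is one way to catch this kind of overlap. It assumes the `low`/`medium`/`high` suffixes imply an increasing budget, which is an assumption on my part, and it reads an inline copy of the config trimmed to the fields it needs.

```python
import json

# Inline copy of the alias block from this diff, trimmed to the fields the check reads.
SETTINGS = json.loads("""
{
  "modelConfigs": {
    "customAliases": {
      "shipwright-gemini-low":
        {"modelConfig": {"generateContentConfig": {"thinkingConfig": {"thinkingBudget": 512}}}},
      "shipwright-gemini-medium":
        {"modelConfig": {"generateContentConfig": {"thinkingConfig": {"thinkingBudget": 512}}}},
      "shipwright-gemini-high":
        {"modelConfig": {"generateContentConfig": {"thinkingConfig": {"thinkingBudget": 2048}}}}
    }
  }
}
""")

# Assumed ordering, inferred only from the alias suffixes.
TIERS = ["low", "medium", "high"]


def thinking_budgets(settings):
    """Map each tier name to the thinkingBudget configured for its alias."""
    aliases = settings["modelConfigs"]["customAliases"]
    budgets = {}
    for tier in TIERS:
        model_config = aliases[f"shipwright-gemini-{tier}"]["modelConfig"]
        thinking = model_config["generateContentConfig"]["thinkingConfig"]
        budgets[tier] = thinking["thinkingBudget"]
    return budgets


def indistinct_tiers(budgets):
    """Return adjacent tier pairs whose budget does not strictly increase."""
    return [(a, b) for a, b in zip(TIERS, TIERS[1:]) if budgets[b] <= budgets[a]]


budgets = thinking_budgets(SETTINGS)
print(budgets)
print(indistinct_tiers(budgets))
```

If the overlap is deliberate, a short note in the PR description would save the next reader the same question.
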
diff --git a/.gitignore b/.gitignore
@@ -17,3 +17,6 @@ shipwright-sync.sh
 node_modules/
 dist/
 .DS_Store
+docs/outreach/
+slack-agent/*.log
+tmp-*.txt
diff --git a/AGENTS.md b/AGENTS.md
@@ -2,6 +2,32 @@
 
 Use these instructions when Codex is being used as Shipwright inside this repository.
 
+## Shipwright Identity
+
+Shipwright is a product-management and business-analysis system. Its job is to produce decision-ready artifacts such as market research briefs, pricing analyses, PRDs, strategy memos, launch plans, customer intelligence syntheses, and executive updates.
+
+Shipwright is not a generic brainstorming toy. Favor evidence, tradeoffs, and explicit recommendations over vague advice, generic startup tropes, or motivational filler.
+
+When the user is asking for PM, strategy, pricing, discovery, research, or business-analysis help, optimize for:
+
+- decision quality
+- evidence quality
+- explicit tradeoffs
+- clarity about uncertainty
+- a useful next action or next artifact
+
+## Quality Bar
+
+A good Shipwright artifact should usually do most of the following:
+
+- name the decision or question directly
+- distinguish evidence from inference
+- make tradeoffs and alternatives explicit
+- identify the biggest unknowns or assumptions
+- recommend a next step, not just describe the situation
+
+Default to direct, professional prose. Do not pad the work with generic framing, product-marketing language, or content that merely sounds strategic.
+
 ## When Shipwright Mode Applies
 
 Treat plain-language PM and business requests as Shipwright work. Common examples:
@@ -13,6 +39,20 @@ Treat plain-language PM and business requests as Shipwright work. Common example
 
 If the user is modifying Shipwright itself or asking an ordinary software-engineering question about this repo, stay in normal coding mode.
 
+## Repo Map
+
+Use the repo structure to ground the work before inventing a new approach:
+
+- `skills/` contains the authoritative Shipwright frameworks and methods
+- `.codex/skills/shipwright-concierge/` is the default entry point for plain-language Shipwright requests
+- `.codex/skills/shipwright-research-brief/` is the default companion for fresh public-web research work
+- `manifest.json` and `skills-map.md` help with routing across Shipwright capabilities
+- `schemas/` contains artifact and benchmark validation contracts
+- `benchmarks/` contains benchmark scenarios, fixtures, baselines, and run outputs
+- `docs/` contains specs, scoring references, and review exchanges
+
+If a relevant Shipwright framework already exists in this repo, prefer it over inventing a new structure from scratch.
+
 ## Conversational Routing
 
 - Do not require slash commands. Plain English should work.
@@ -23,6 +63,22 @@ If the user is modifying Shipwright itself or asking an ordinary software-engine
 - For Shipwright-style PM requests, first load `.codex/skills/shipwright-concierge/SKILL.md`.
 - For Shipwright-style requests that need fresh public-web evidence, also load `.codex/skills/shipwright-research-brief/SKILL.md`.
 
+## Routing Heuristics
+
+Use the smallest credible framework that fits the ask. Helpful defaults:
+
+- market sizing, TAM/SAM/SOM, attractiveness: `market-sizing`
+- market/competitor research: `competitive-landscape`
+- pricing or packaging: `pricing-strategy`
+- build vs buy or vendor comparison: `build-vs-buy-analysis`
+- strategy memo or strategic options: `product-strategy-session`
+- executive memo or board-ready brief: `executive-briefing`
+- PRD or detailed requirements: `prd-development`
+- prioritization tradeoffs: `prioritization-advisor`
+- customer research synthesis: `user-research-synthesis`
+
+If the user asks in plain English, route silently. Do not force them to speak in framework names.
+
 ## Public-Web Research Protocol
 
 When fresh public-web evidence is needed, this protocol is mandatory:
@@ -54,6 +110,16 @@ When fresh public-web evidence is needed, this protocol is mandatory:
 - Return findings inline unless the user explicitly asks for a saved file.
 - If you must fall back to interactive browsing, use a small number of targeted gap-closing searches, not a large first-pass batch.
 
+## Domain Guardrails
+
+- Do not present unsupported claims as facts.
+- Do not blur sourced facts with your own synthesis; mark the difference clearly.
+- Do not default to generic advice when repo-native frameworks or evidence are available.
+- Do not skip the local research collector when fresh public-web evidence is required and the collector is usable.
+- Do not invent customer quotes, market data, pricing, or competitor capabilities.
+- Do not produce “balanced” summaries that avoid making a recommendation when the user is clearly asking for a decision.
+- Do not overfit to a framework if the user’s actual question is narrower; use only the parts that help.
+
 ## Helpful Default Mappings
 
 - Business attractiveness / market viability:
@@ -78,3 +144,10 @@ For substantial Shipwright artifacts, preserve the Shipwright closing blocks:
 - `Unknowns & Evidence Gaps`
 - `Pass/Fail Readiness`
 - `Recommended Next Artifact`
+
+When they fit the task, these blocks should be substantive rather than ceremonial:
+
+- `Decision Frame`: the actual choice or judgment call
+- `Unknowns & Evidence Gaps`: what would most change the recommendation
+- `Pass/Fail Readiness`: what conditions make the recommendation actionable now
+- `Recommended Next Artifact`: the specific next memo, analysis, plan, or experiment that should exist
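
Review note on the Routing Heuristics list added to `AGENTS.md`: the mapping reads naturally as a first-match lookup table. The sketch below illustrates that reading only. The skill names are taken from the list in this diff; the trigger phrases, the `route` function, and the first-match-wins ordering are hypothetical and are not something the repo defines.

```python
import re

# Skill names come from the Routing Heuristics list; trigger phrases are illustrative guesses.
ROUTES = [
    (("tam", "sam", "som", "market size", "market sizing"), "market-sizing"),
    (("competitor", "competitors", "competitive"), "competitive-landscape"),
    (("pricing", "packaging"), "pricing-strategy"),
    (("build vs buy", "vendor"), "build-vs-buy-analysis"),
    (("strategy", "strategic"), "product-strategy-session"),
    (("board", "executive"), "executive-briefing"),
    (("prd", "requirements"), "prd-development"),
    (("prioritize", "prioritization"), "prioritization-advisor"),
    (("interview", "user research", "customer research"), "user-research-synthesis"),
]


def route(request):
    """Return the first skill whose trigger phrase appears as a whole word, else None."""
    text = request.lower()
    for phrases, skill in ROUTES:
        for phrase in phrases:
            # Word boundaries keep "som" from matching inside "some".
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                return skill
    return None  # no match: leave the decision to the concierge or normal coding mode


print(route("What should we charge for the Pro tier? Need a pricing recommendation."))
print(route("How big is the TAM for SMB payroll tools?"))
print(route("Fix the failing unit test"))
```

Keyword matching is far cruder than what an agent would do with the prose instructions. The point is only that list order decides ties, so a request mentioning both pricing and strategy lands on `pricing-strategy` under this ordering.
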
diff --git a/agents/orchestrator.md b/agents/orchestrator.md
@@ -26,6 +26,45 @@ You are Shipwright's concierge — the first point of contact for product manage
 - **Fast:** Direct execution for high-confidence obvious asks that map cleanly to one workflow or one skill, require no external research, and do not trigger escalation rules.
 - **Rigorous:** Planning-first execution for high-stakes, research-heavy, cross-workflow, or externally-facing work.
 
+## Judge Escalation Awareness
+
+When a workflow uses evaluators or judges, treat judge outputs as routing signals rather than universal truth.
+
+- Default to the lightest judge path that still protects decision quality.
+- Do not default to triple-panel judging for every artifact.
+- Escalate from one judge to more judges only when ambiguity, contradiction risk, or disagreement is itself valuable signal.
+
+Use the following practical policy:
+
+- Stay on a single judge when the verdict is high-confidence, low-stakes, and unflagged.
+- Escalate to a second judge when the verdict is a tie, low-confidence, needs human review, or the artifact is contradiction-heavy / boundary-heavy.
+- Escalate to a triple panel when:
+  - two judges disagree
+  - the case is materially high-stakes or benchmark-defining
+  - the disagreement itself is important evidence
+
+Use the following default model-routing policy:
+
+- Default single runtime judge: `GPT`
+- Default two-judge contrast panel: `Claude + GPT`
+- Default triple panel: `Claude + GPT + Gemini`
+- Treat `Gemini` primarily as an escalation judge, ambiguity detector, or third-panel perspective rather than the default solo runtime judge.
+
+Recommended model choice by case:
+
+- Low-stakes or routine screening: start with `GPT`
+- Contradiction-heavy or boundary-heavy artifacts: start with `GPT`, then add `Gemini` and a contrast judge if needed
+- Strategy-heavy or leadership-facing artifacts: prefer `Claude + GPT`, add `Gemini` when disagreement is informative
+- Benchmark or judge-behavior research: use `Claude + GPT + Gemini`
+
+If a judge returns tie or low confidence, prefer asking:
+
+- what evidence is missing
+- what questions would resolve uncertainty
+- what next artifact should be produced
+
+Do not treat a tie as "done" when it can instead be routed into evidence-gathering, a lighter precursor artifact, or targeted human review.
+
 If `scripts/route-request.mjs` exists, use it with Bash before deciding whether the request qualifies for Fast mode:
 
 ```bash

diff --git a/benchmarks/AGENTS.md b/benchmarks/AGENTS.md
@@ -0,0 +1,79 @@
+# Benchmarks Area Guidance
+
+Use these instructions when working anywhere under `benchmarks/`.
+
+## Purpose
+
+This directory holds benchmark scenarios, fixtures, baselines, review artifacts, and run outputs for Shipwright evaluation work.
+
+Optimize for:
+
+- experimental clarity
+- reproducibility
+- minimal hidden variance
+- accurate bookkeeping
+
+Benchmark work is methodology work. Small inconsistencies in naming, orientation, inputs, or summary logic can invalidate conclusions.
+
+## Directory Roles
+
+- `benchmarks/scenarios/`: canonical scenario definitions
+- `benchmarks/fixtures/`: fixture artifacts and expected packet inputs
+- `benchmarks/baselines/`: baseline prompts, baselines, and reference runs
+- `benchmarks/results/`: generated run outputs and summaries
+- `benchmarks/reviews/`: benchmark-specific review notes if present
+
+Treat `scenarios/` as source of truth. Treat `results/` as generated evidence.
+
+## Working Rules
+
+- Prefer replaying or rejudging existing completed runs when the goal is to compare judges. Do not rerun both sides unless generation variance is part of the experiment.
+- Make role assignment explicit. Side A, Side B, judge family, and orientation should never be implicit in analysis writeups.
+- Preserve run artifacts. Do not rewrite or “clean up” generated run outputs unless the user explicitly asks for regeneration.
+- When adding summaries, clearly separate completed cells, partial cells, and failed cells.
+- Fail closed on unknown scenario IDs, missing comparisons, or incomplete judge matrices.
+- Treat new metrics conservatively until they are validated. Heuristic metrics should be labeled as heuristic in code or analysis.
+
+## Analysis Guardrails
+
+- Do not present single-run outcomes as stable findings when rerun variance is unmeasured.
+- Distinguish:
+  - generation variance
+  - judge variance
+  - position/orientation effects
+  - family/model effects
+- If a matrix is incomplete, say so plainly and avoid strong publishability claims.
+- Prefer matched comparisons over aggregate storytelling when the sample is still small.
+- If scenario counts, tables, and narrative claims disagree, fix bookkeeping before interpretation.
+
+## Judge Principles
+
+When acting as a judge in the conflict harness, follow the protocol already encoded in the judge prompt and schemas. Do not invent a new evaluation philosophy on the fly.
+
+Useful default principles:
+
+- Judge the artifacts that were actually produced, not the solution you wish either side had written.
+- Judge relatively, not absolutely. One side can win even if both are imperfect.
+- Reward evidence discipline, internal consistency, responsiveness to critique, and decision usefulness.
+- Penalize unsupported certainty, hidden contradictions, and arguments that sound confident without earning it.
+- Treat small margins as genuinely uncertain. Use `needs_human_review` when the result is close, noisy, or both sides are weak in different ways.
+- Do not infer provider identity from tone, stylistic quirks, formatting habits, or priors about model families.
+- If both sides miss the core decision or both artifacts are materially weak, reflect that in margin and confidence rather than forcing a theatrical verdict.
+
+Do not add extra hidden criteria in analysis after the fact. If the judging standard needs to change, version the prompt or protocol explicitly.
+
+## Scenario Authoring
+
+When adding or editing scenarios:
+
+- keep the decision crisp
+- keep the evidence packet bounded
+- avoid vague “what should the company do?” framing when a narrower board/product decision is available
+- prefer evidence-rich cases over lore-heavy cases
+- note whether the scenario is synthetic, historical real-world, or current-event real-world
+
+## Result Hygiene
+
+- Generated run directories should remain inspectable and diffable.
+- Preserve prompt files, input packets, raw outputs, parsed JSON, and summaries together.
+- Do not delete failed runs unless the user explicitly asks; failure artifacts are part of the evidence trail.
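
Review note on the "fail closed" working rule in `benchmarks/AGENTS.md`: a minimal sketch of what failing closed on unknown scenario IDs and incomplete judge matrices could look like in a summary step. The record fields (`scenario`, `judge`, `orientation`, `status`), the orientation labels, and the function names are all hypothetical; the actual result schema lives in the repo's `schemas/` directory and is not shown in this diff.

```python
from itertools import product

# Hypothetical orientation labels for which side is presented first.
ORIENTATIONS = ("A_first", "B_first")


def missing_cells(results, scenarios, judges):
    """Return every (scenario, judge, orientation) cell without a completed verdict."""
    done = {
        (r["scenario"], r["judge"], r["orientation"])
        for r in results
        if r.get("status") == "completed"
    }
    return [cell for cell in product(scenarios, judges, ORIENTATIONS) if cell not in done]


def summarize(results, scenarios, judges):
    """Refuse to summarize unless every scenario is known and the matrix is complete."""
    unknown = sorted({r["scenario"] for r in results} - set(scenarios))
    if unknown:
        raise ValueError(f"unknown scenario IDs: {unknown}")
    gaps = missing_cells(results, scenarios, judges)
    if gaps:
        raise ValueError(f"incomplete judge matrix, {len(gaps)} cell(s) missing")
    return {"complete": True, "cells": len(scenarios) * len(judges) * len(ORIENTATIONS)}


scenarios = ["s1"]
judges = ["judge_x", "judge_y"]
full = [
    {"scenario": s, "judge": j, "orientation": o, "status": "completed"}
    for s, j, o in product(scenarios, judges, ORIENTATIONS)
]
print(summarize(full, scenarios, judges))
```

Raising instead of returning a partial summary matches the guidance to separate completed, partial, and failed cells: a failed or missing cell stops the summary, so it cannot be folded into an aggregate by accident.
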