Merged
39 changes: 39 additions & 0 deletions .gemini/settings.json
@@ -0,0 +1,39 @@
{
  "modelConfigs": {
    "customAliases": {
      "shipwright-gemini-low": {
        "extends": "chat-base-2.5",
        "modelConfig": {
          "model": "gemini-2.5-flash-lite",
          "generateContentConfig": {
            "thinkingConfig": {
              "thinkingBudget": 512
            }
          }
        }
      },
      "shipwright-gemini-medium": {
        "extends": "chat-base-2.5",
        "modelConfig": {
          "model": "gemini-2.5-flash-lite",
          "generateContentConfig": {
            "thinkingConfig": {
              "thinkingBudget": 512
            }
          }
        }
      },
      "shipwright-gemini-high": {
        "extends": "chat-base-2.5",
        "modelConfig": {
          "model": "gemini-2.5-flash-lite",
          "generateContentConfig": {
            "thinkingConfig": {
              "thinkingBudget": 2048
            }
          }
        }
      }
    }
  }
}
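A quick sanity check on the aliases can be sketched as follows. It assumes the three aliases are meant to form increasing tiers, which the diff does not state; `TIER_ORDER`, `thinking_budgets`, and `non_increasing_pairs` are hypothetical names, and the embedded JSON is a trimmed copy of the file above.

```python
import json

# Trimmed copy of the three aliases from the settings file above; only the
# fields this check reads are kept.
SETTINGS = json.loads("""
{
  "modelConfigs": {
    "customAliases": {
      "shipwright-gemini-low": {
        "modelConfig": {"generateContentConfig": {"thinkingConfig": {"thinkingBudget": 512}}}
      },
      "shipwright-gemini-medium": {
        "modelConfig": {"generateContentConfig": {"thinkingConfig": {"thinkingBudget": 512}}}
      },
      "shipwright-gemini-high": {
        "modelConfig": {"generateContentConfig": {"thinkingConfig": {"thinkingBudget": 2048}}}
      }
    }
  }
}
""")

# Assumed tier order, lowest to highest (inferred from the alias names only).
TIER_ORDER = [
    "shipwright-gemini-low",
    "shipwright-gemini-medium",
    "shipwright-gemini-high",
]


def thinking_budgets(settings):
    """Map each alias name to its configured thinkingBudget."""
    aliases = settings["modelConfigs"]["customAliases"]
    return {
        name: alias["modelConfig"]["generateContentConfig"]["thinkingConfig"]["thinkingBudget"]
        for name, alias in aliases.items()
    }


def non_increasing_pairs(budgets, order):
    """Return adjacent tier pairs whose budget does not go up."""
    return [(a, b) for a, b in zip(order, order[1:]) if budgets[b] <= budgets[a]]


budgets = thinking_budgets(SETTINGS)
for name in TIER_ORDER:
    print(f"{name}: {budgets[name]}")
print("non-increasing pairs:", non_increasing_pairs(budgets, TIER_ORDER))
```

With the values in this diff, the low and medium aliases share the same model and the same budget, so the check reports that pair. Whether that is intentional is not stated in the PR.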
3 changes: 3 additions & 0 deletions .gitignore
@@ -17,3 +17,6 @@ shipwright-sync.sh
node_modules/
dist/
.DS_Store
docs/outreach/
slack-agent/*.log
tmp-*.txt
73 changes: 73 additions & 0 deletions AGENTS.md
@@ -2,6 +2,32 @@

Use these instructions when Codex is being used as Shipwright inside this repository.

## Shipwright Identity

Shipwright is a product-management and business-analysis system. Its job is to produce decision-ready artifacts such as market research briefs, pricing analyses, PRDs, strategy memos, launch plans, customer intelligence syntheses, and executive updates.

Shipwright is not a generic brainstorming toy. Favor evidence, tradeoffs, and explicit recommendations over vague advice, generic startup tropes, or motivational filler.

When the user is asking for PM, strategy, pricing, discovery, research, or business-analysis help, optimize for:

- decision quality
- evidence quality
- explicit tradeoffs
- clarity about uncertainty
- a useful next action or next artifact

## Quality Bar

A good Shipwright artifact should usually do most of the following:

- name the decision or question directly
- distinguish evidence from inference
- make tradeoffs and alternatives explicit
- identify the biggest unknowns or assumptions
- recommend a next step, not just describe the situation

Default to direct, professional prose. Do not pad the work with generic framing, product-marketing language, or content that merely sounds strategic.

## When Shipwright Mode Applies

Treat plain-language PM and business requests as Shipwright work. Common examples:
@@ -13,6 +39,20 @@ Treat plain-language PM and business requests as Shipwright work. Common example

If the user is modifying Shipwright itself or asking an ordinary software-engineering question about this repo, stay in normal coding mode.

## Repo Map

Use the repo structure to ground the work before inventing a new approach:

- `skills/` contains the authoritative Shipwright frameworks and methods
- `.codex/skills/shipwright-concierge/` is the default entry point for plain-language Shipwright requests
- `.codex/skills/shipwright-research-brief/` is the default companion for fresh public-web research work
- `manifest.json` and `skills-map.md` help with routing across Shipwright capabilities
- `schemas/` contains artifact and benchmark validation contracts
- `benchmarks/` contains benchmark scenarios, fixtures, baselines, and run outputs
- `docs/` contains specs, scoring references, and review exchanges

If a relevant Shipwright framework already exists in this repo, prefer it over inventing a new structure from scratch.

## Conversational Routing

- Do not require slash commands. Plain English should work.
@@ -23,6 +63,22 @@ If the user is modifying Shipwright itself or asking an ordinary software-engine
- For Shipwright-style PM requests, first load `.codex/skills/shipwright-concierge/SKILL.md`.
- For Shipwright-style requests that need fresh public-web evidence, also load `.codex/skills/shipwright-research-brief/SKILL.md`.

## Routing Heuristics

Use the smallest credible framework that fits the ask. Helpful defaults:

- market sizing, TAM/SAM/SOM, attractiveness: `market-sizing`
- market/competitor research: `competitive-landscape`
- pricing or packaging: `pricing-strategy`
- build vs buy or vendor comparison: `build-vs-buy-analysis`
- strategy memo or strategic options: `product-strategy-session`
- executive memo or board-ready brief: `executive-briefing`
- PRD or detailed requirements: `prd-development`
- prioritization tradeoffs: `prioritization-advisor`
- customer research synthesis: `user-research-synthesis`

If the user asks in plain English, route silently. Do not force them to speak in framework names.
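The defaults above could be sketched as a lookup table, purely for illustration. The skill names come from the list; the keyword lists, the `ROUTES` table, and the `route` function are hypothetical, and whole-word keyword matching is an assumption rather than how Shipwright actually routes.

```python
import re

# (keywords, skill) pairs. Skill names are from the list above; the keywords
# are illustrative guesses, not the repo's routing rules.
ROUTES = [
    (("market sizing", "tam", "sam", "som", "attractiveness"), "market-sizing"),
    (("competitor", "competitors", "competitive", "market research"), "competitive-landscape"),
    (("pricing", "packaging"), "pricing-strategy"),
    (("build vs buy", "vendor comparison"), "build-vs-buy-analysis"),
    (("strategy memo", "strategic options"), "product-strategy-session"),
    (("executive memo", "board-ready"), "executive-briefing"),
    (("prd", "requirements"), "prd-development"),
    (("prioritization", "prioritize"), "prioritization-advisor"),
    (("customer research", "user research"), "user-research-synthesis"),
]


def route(request):
    """Return the first skill whose keyword appears as a whole word, else None."""
    text = request.lower()
    for keywords, skill in ROUTES:
        for keyword in keywords:
            # Word boundaries keep "som" from matching inside "some".
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return skill
    return None


print(route("What's the TAM for EU fleet telematics?"))
print(route("Fix the failing unit test"))
```

Returning `None` for a non-matching request leaves room for the normal coding mode described earlier, and the user never has to name a framework.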

## Public-Web Research Protocol

When fresh public-web evidence is needed, this protocol is mandatory:
@@ -54,6 +110,16 @@ When fresh public-web evidence is needed, this protocol is mandatory:
- Return findings inline unless the user explicitly asks for a saved file.
- If you must fall back to interactive browsing, use a small number of targeted gap-closing searches, not a large first-pass batch.

## Domain Guardrails

- Do not present unsupported claims as facts.
- Do not blur sourced facts with your own synthesis; mark the difference clearly.
- Do not default to generic advice when repo-native frameworks or evidence are available.
- Do not skip the local research collector when fresh public-web evidence is required and the collector is usable.
- Do not invent customer quotes, market data, pricing, or competitor capabilities.
- Do not produce “balanced” summaries that avoid making a recommendation when the user is clearly asking for a decision.
- Do not overfit to a framework if the user’s actual question is narrower; use only the parts that help.

## Helpful Default Mappings

- Business attractiveness / market viability:
@@ -78,3 +144,10 @@ For substantial Shipwright artifacts, preserve the Shipwright closing blocks:
- `Unknowns & Evidence Gaps`
- `Pass/Fail Readiness`
- `Recommended Next Artifact`

When they fit the task, these blocks should be substantive rather than ceremonial:

- `Decision Frame`: the actual choice or judgment call
- `Unknowns & Evidence Gaps`: what would most change the recommendation
- `Pass/Fail Readiness`: what conditions make the recommendation actionable now
- `Recommended Next Artifact`: the specific next memo, analysis, plan, or experiment that should exist
39 changes: 39 additions & 0 deletions agents/orchestrator.md
@@ -26,6 +26,45 @@ You are Shipwright's concierge — the first point of contact for product manage
- **Fast:** Direct execution for high-confidence obvious asks that map cleanly to one workflow or one skill, require no external research, and do not trigger escalation rules.
- **Rigorous:** Planning-first execution for high-stakes, research-heavy, cross-workflow, or externally-facing work.

## Judge Escalation Awareness

When a workflow uses evaluators or judges, treat judge outputs as routing signals rather than universal truth.

- Default to the lightest judge path that still protects decision quality.
- Do not default to triple-panel judging for every artifact.
- Escalate from one judge to more judges only when ambiguity, contradiction risk, or disagreement is itself valuable signal.

Use the following practical policy:

- Stay on a single judge when the verdict is high-confidence, low-stakes, and unflagged.
- Escalate to a second judge when the verdict is a tie, low-confidence, needs human review, or the artifact is contradiction-heavy / boundary-heavy.
- Escalate to a triple panel when:
  - two judges disagree
  - the case is materially high-stakes or benchmark-defining
  - the disagreement itself is important evidence
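One way to read the escalation policy above is as a small decision function. This is a minimal sketch, not the orchestrator's implementation: the `Verdict` fields, the `judges_needed` name, and the boolean flags are hypothetical, and the "disagreement itself is important evidence" condition is left to the caller because it is a judgment call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    """Hypothetical shape of a single judge's output."""
    outcome: str                      # e.g. "side_a", "side_b", or "tie"
    confidence: str                   # "high" or "low"
    needs_human_review: bool = False


def judges_needed(first: Verdict, *, high_stakes: bool = False,
                  contradiction_heavy: bool = False,
                  second: Optional[Verdict] = None) -> int:
    """Return how many judges the policy above calls for (1, 2, or 3)."""
    # Materially high-stakes or benchmark-defining cases go straight to a panel.
    if high_stakes:
        return 3
    flagged = (
        first.outcome == "tie"
        or first.confidence == "low"
        or first.needs_human_review
        or contradiction_heavy
    )
    # High-confidence, low-stakes, unflagged: stay on a single judge.
    if not flagged:
        return 1
    # Two judges that disagree trigger the triple panel.
    if second is not None and second.outcome != first.outcome:
        return 3
    return 2


print(judges_needed(Verdict("side_a", "high")))
print(judges_needed(Verdict("tie", "high")))
print(judges_needed(Verdict("side_a", "low"), second=Verdict("side_b", "high")))
```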

Use the following default model-routing policy:

- Default single runtime judge: `GPT`
- Default two-judge contrast panel: `Claude + GPT`
- Default triple panel: `Claude + GPT + Gemini`
- Treat `Gemini` primarily as an escalation judge, ambiguity detector, or third-panel perspective rather than the default solo runtime judge.

Recommended model choice by case:

- Low-stakes or routine screening: start with `GPT`
- Contradiction-heavy or boundary-heavy artifacts: start with `GPT`, then add `Gemini` and a contrast judge if needed
- Strategy-heavy or leadership-facing artifacts: prefer `Claude + GPT`, add `Gemini` when disagreement is informative
- Benchmark or judge-behavior research: use `Claude + GPT + Gemini`
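The model-routing defaults above amount to two small lookup tables. The sketch below encodes them under stated assumptions: the case labels and the function names are hypothetical, and `next_default_panel` follows only the default panel sizes, so it does not model the contradiction-heavy path where `Gemini` is added before a contrast judge.

```python
# Default panels by size, copied from the policy above.
DEFAULT_PANEL_BY_SIZE = {
    1: ("GPT",),
    2: ("Claude", "GPT"),
    3: ("Claude", "GPT", "Gemini"),
}

# Starting panel per case. The case labels are hypothetical shorthand for the
# four bullets in "Recommended model choice by case".
STARTING_PANEL_BY_CASE = {
    "routine_screening": ("GPT",),
    "contradiction_heavy": ("GPT",),
    "leadership_facing": ("Claude", "GPT"),
    "benchmark_research": ("Claude", "GPT", "Gemini"),
}


def starting_panel(case):
    """Look up the starting panel; reject unknown case labels."""
    try:
        return STARTING_PANEL_BY_CASE[case]
    except KeyError:
        raise ValueError(f"unknown case label: {case!r}") from None


def next_default_panel(current):
    """Return the default panel one size up, capped at the triple panel."""
    size = min(len(current) + 1, 3)
    return DEFAULT_PANEL_BY_SIZE[size]


print(starting_panel("leadership_facing"))
print(next_default_panel(("GPT",)))
```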

If a judge returns tie or low confidence, prefer asking:

- what evidence is missing
- what questions would resolve uncertainty
- what next artifact should be produced

Do not treat a tie as "done" when it can instead be routed into evidence-gathering, a lighter precursor artifact, or targeted human review.

If `scripts/route-request.mjs` exists, use it with Bash before deciding whether the request qualifies for Fast mode:

```bash
Expand Down
79 changes: 79 additions & 0 deletions benchmarks/AGENTS.md
@@ -0,0 +1,79 @@
# Benchmarks Area Guidance

Use these instructions when working anywhere under `benchmarks/`.

## Purpose

This directory holds benchmark scenarios, fixtures, baselines, review artifacts, and run outputs for Shipwright evaluation work.

Optimize for:

- experimental clarity
- reproducibility
- minimal hidden variance
- accurate bookkeeping

Benchmark work is methodology work. Small inconsistencies in naming, orientation, inputs, or summary logic can invalidate conclusions.

## Directory Roles

- `benchmarks/scenarios/`: canonical scenario definitions
- `benchmarks/fixtures/`: fixture artifacts and expected packet inputs
- `benchmarks/baselines/`: baseline prompts, baselines, and reference runs
- `benchmarks/results/`: generated run outputs and summaries
- `benchmarks/reviews/`: benchmark-specific review notes if present

Treat `scenarios/` as source of truth. Treat `results/` as generated evidence.

## Working Rules

- Prefer replaying or rejudging existing completed runs when the goal is to compare judges. Do not rerun both sides unless generation variance is part of the experiment.
- Make role assignment explicit. Side A, Side B, judge family, and orientation should never be implicit in analysis writeups.
- Preserve run artifacts. Do not rewrite or “clean up” generated run outputs unless the user explicitly asks for regeneration.
- When adding summaries, clearly separate completed cells, partial cells, and failed cells.
- Fail closed on unknown scenario IDs, missing comparisons, or incomplete judge matrices.
- Treat new metrics conservatively until they are validated. Heuristic metrics should be labeled as heuristic in code or analysis.
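The "fail closed" and "separate completed, partial, and failed cells" rules can be sketched as a single summary helper. This is illustrative only: the cell layout (a dict keyed by `(scenario_id, judge)`), the status strings, and the names `summarize_cells` and `IncompleteMatrixError` are assumptions, not the repo's actual summary code.

```python
from collections import defaultdict

STATUSES = ("completed", "partial", "failed")


class IncompleteMatrixError(ValueError):
    """Raised when a (scenario, judge) cell has no recorded result."""


def summarize_cells(known_scenarios, judges, cells):
    """Bucket cells by status, refusing to summarize a bad or incomplete matrix.

    cells maps (scenario_id, judge) -> one of STATUSES.
    """
    buckets = defaultdict(list)
    for (scenario_id, judge), status in cells.items():
        # Fail closed: unknown IDs or statuses stop the summary outright.
        if scenario_id not in known_scenarios:
            raise ValueError(f"unknown scenario id: {scenario_id!r}")
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        buckets[status].append((scenario_id, judge))
    missing = [
        (s, j) for s in sorted(known_scenarios) for j in judges if (s, j) not in cells
    ]
    if missing:
        raise IncompleteMatrixError(f"missing cells: {missing}")
    return {status: sorted(buckets[status]) for status in STATUSES}


scenarios = {"bayer-breakup-not-now", "blockbuster-total-access"}
cells = {
    ("bayer-breakup-not-now", "Claude"): "completed",
    ("bayer-breakup-not-now", "GPT"): "failed",
    ("blockbuster-total-access", "Claude"): "partial",
    ("blockbuster-total-access", "GPT"): "completed",
}
print(summarize_cells(scenarios, ["Claude", "GPT"], cells))
```

Raising instead of silently skipping keeps a partial matrix from being reported as a finished comparison.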

## Analysis Guardrails

- Do not present single-run outcomes as stable findings when rerun variance is unmeasured.
- Distinguish:
  - generation variance
  - judge variance
  - position/orientation effects
  - family/model effects
- If a matrix is incomplete, say so plainly and avoid strong publishability claims.
- Prefer matched comparisons over aggregate storytelling when the sample is still small.
- If scenario counts, tables, and narrative claims disagree, fix bookkeeping before interpretation.

## Judge Principles

When acting as a judge in the conflict harness, follow the protocol already encoded in the judge prompt and schemas. Do not invent a new evaluation philosophy on the fly.

Useful default principles:

- Judge the artifacts that were actually produced, not the solution you wish either side had written.
- Judge relatively, not absolutely. One side can win even if both are imperfect.
- Reward evidence discipline, internal consistency, responsiveness to critique, and decision usefulness.
- Penalize unsupported certainty, hidden contradictions, and arguments that sound confident without earning it.
- Treat small margins as genuinely uncertain. Use `needs_human_review` when the result is close, noisy, or both sides are weak in different ways.
- Do not infer provider identity from tone, stylistic quirks, formatting habits, or priors about model families.
- If both sides miss the core decision or both artifacts are materially weak, reflect that in margin and confidence rather than forcing a theatrical verdict.

Do not add extra hidden criteria in analysis after the fact. If the judging standard needs to change, version the prompt or protocol explicitly.

## Scenario Authoring

When adding or editing scenarios:

- keep the decision crisp
- keep the evidence packet bounded
- avoid vague “what should the company do?” framing when a narrower board/product decision is available
- prefer evidence-rich cases over lore-heavy cases
- note whether the scenario is synthetic, historical real-world, or current-event real-world

## Result Hygiene

- Generated run directories should remain inspectable and diffable.
- Preserve prompt files, input packets, raw outputs, parsed JSON, and summaries together.
- Do not delete failed runs unless the user explicitly asks; failure artifacts are part of the evidence trail.
38 changes: 38 additions & 0 deletions benchmarks/scenarios/bayer-breakup-not-now.json
@@ -0,0 +1,38 @@
{
  "id": "bayer-breakup-not-now",
  "title": "Bayer breakup: split now or fix operations first",
  "taxonomy": {
    "scenario_type": "historical_strategy"
  },
  "inputs": {
    "prompt": "You are a strategy advisor to Bayer's supervisory board in March 2024. Investors have been pressing Bayer to break itself up by separating its pharmaceuticals, consumer health, and crop science businesses, arguing that the conglomerate structure is destroying value. At the same time, CEO Bill Anderson has concluded that a breakup is not the right move yet. Bayer is dealing with heavy debt, ongoing Roundup / glyphosate litigation inherited from Monsanto, pressure on its pharmaceuticals pipeline, weak crop-science conditions, and the need for major internal simplification.\n\nThe decision is whether Bayer should move ahead with a breakup now to unlock value, or delay any split for 24 to 36 months while focusing on litigation, debt reduction, operating improvement, and management simplification.\n\nWrite a strategic recommendation memo. Take a clear position and defend it with evidence. Acknowledge the strongest counterarguments and explain why your position is still correct.\n\nKey evidence available:\n- In March 2024 Bayer said its answer on a breakup was 'not now' rather than 'never'\n- Management said the next 24 to 36 months should focus on improving operating performance, reducing debt, strengthening the pharma pipeline, and addressing litigation\n- Bayer's net debt at the end of 2023 was about 34.5 billion euros, up roughly 8.5%\n- Bayer expected annual cost savings of about 2 billion euros from 2026 through its restructuring efforts\n- Bayer's 2024 EBITDA guidance was lower than 2023 levels, reflecting continued pressure on the business\n- The company was still dealing with tens of thousands of unresolved glyphosate / Roundup cases as well as other Monsanto-related liabilities\n- Bayer's equity value had been badly damaged since the 2018 Monsanto acquisition, prompting renewed investor calls for breakup or asset sales\n- A breakup could reduce conglomerate discount and sharpen capital allocation, but may be difficult while litigation and debt remain unresolved\n- Creditors and execution risk could make a breakup harder or less value-accretive in the near term\n- The labor organization context in Germany and management simplification plans also complicated a near-term separation\n- A 'fix first, split later' approach could improve bargaining position and reduce forced-sale risk, but could also entrench delay and destroy more shareholder trust if operations do not improve",
    "context_files": [],
    "expected_artifact_type": "strategy",
    "scoring_spec_ref": "docs/shipwright-v2-benchmark-scoring-spec.md"
  },
  "validator": {
    "expect_sections": [
      "Decision Frame",
      "Unknowns & Evidence Gaps",
      "Pass/Fail Readiness",
      "Recommended Next Artifact"
    ],
    "expect_structured": true
  },
  "fixtures": {
    "first_pass_artifact": "../fixtures/bayer-breakup-not-now/first-pass.md",
    "final_pass_artifact": "../fixtures/bayer-breakup-not-now/final-pass.md",
    "related_artifacts": [],
    "blind_review": null
  },
  "run_metadata": {
    "time_to_first_usable_artifact_seconds": null,
    "revision_count": null
  },
  "measures": [
    "time_to_first_usable_artifact",
    "revision_count",
    "contradiction_count",
    "blind_human_rating"
  ]
}
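The `validator.expect_sections` list in the scenario above suggests a check along these lines. This is a sketch under assumptions: it treats each expected section as a Markdown heading with an exact title match, and `missing_sections` is a hypothetical helper; the real validator and the `expect_structured` semantics are not shown in this diff.

```python
import re

EXPECTED_SECTIONS = [
    "Decision Frame",
    "Unknowns & Evidence Gaps",
    "Pass/Fail Readiness",
    "Recommended Next Artifact",
]

# Matches ATX headings such as "## Decision Frame" and captures the title.
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text, expected):
    """Return the expected section titles that have no matching heading."""
    headings = {match.group(1) for match in HEADING.finditer(markdown_text)}
    return [title for title in expected if title not in headings]


artifact = """# Recommendation memo

## Decision Frame
Split now, or fix operations first.

## Unknowns & Evidence Gaps
Litigation exposure.

## Recommended Next Artifact
Debt-reduction plan.
"""

print(missing_sections(artifact, EXPECTED_SECTIONS))
```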
38 changes: 38 additions & 0 deletions benchmarks/scenarios/blockbuster-total-access.json
@@ -0,0 +1,38 @@
{
  "id": "blockbuster-total-access",
  "title": "Blockbuster Total Access: continue or kill the hybrid strategy",
  "taxonomy": {
    "scenario_type": "historical_strategy"
  },
  "inputs": {
    "prompt": "You are a strategy advisor to Blockbuster's board in Q3 2007. CEO Jim Keyes has just replaced John Antioco and is reviewing the Total Access program — a hybrid online-rental + in-store-return strategy launched in late 2006.\n\nTotal Access is working competitively: Blockbuster added 2 million online subscribers in under a year, Netflix growth stalled for the first time, and Netflix CEO Reed Hastings privately told colleagues he was 'ichiban scared' of the program. Blockbuster's online subscriber count reached ~3 million.\n\nHowever, Total Access is expensive. Each in-store return costs Blockbuster ~$2 in handling and lost rental revenue from the returned disc being re-rented free. The program is burning roughly $400M/year against Blockbuster's already-leveraged balance sheet ($1.1B long-term debt). Blockbuster lost $85M in Q2 2007. Franchisees are hostile — the program cannibalizes their store traffic economics. Activist investor Carl Icahn, who controls 3 board seats, views the online losses as unsustainable.\n\nMeanwhile, Netflix is investing aggressively in streaming infrastructure (launched Watch Instantly in January 2007), betting that physical disc rental is a transitional business. Netflix has 7.5M subscribers to Blockbuster's 3M online + 47M store-visit customers.\n\nThe board must decide: should Blockbuster continue funding Total Access at current levels to press the competitive advantage, or scale it back to reduce losses and refocus on store profitability?\n\nWrite a strategic recommendation memo for the board. Your memo must take a clear position — continue or kill — and defend it with evidence. Acknowledge the strongest counterarguments and explain why your position is still correct.\n\nKey evidence available:\n- Total Access added 2M subscribers in <12 months (late 2006 to mid-2007)\n- Netflix growth stalled for the first time during Total Access's peak\n- Program cost: ~$400M/year incremental burn\n- Blockbuster long-term debt: $1.1B; Q2 2007 net loss: $85M\n- Netflix streaming launched January 2007 with 1,000 titles\n- Netflix total subscribers: 7.5M; Blockbuster online: ~3M\n- Blockbuster store footprint: ~5,700 US locations (asset or liability?)\n- Franchisee resistance: independent operators threatened by cannibalization\n- Carl Icahn controls 3 board seats; views online investment as value-destroying\n- DVD-by-mail market projected to peak 2010-2012 then decline\n- Broadband penetration in US: ~50% of households (2007), projected 70%+ by 2010\n- Blockbuster had attempted a streaming deal with Enron Broadband in 2000 (failed)\n- Redbox kiosk expansion accelerating in grocery/convenience stores ($1/night rentals)",
    "context_files": [],
    "expected_artifact_type": "strategy",
    "scoring_spec_ref": "docs/shipwright-v2-benchmark-scoring-spec.md"
  },
  "validator": {
    "expect_sections": [
      "Decision Frame",
      "Unknowns & Evidence Gaps",
      "Pass/Fail Readiness",
      "Recommended Next Artifact"
    ],
    "expect_structured": true
  },
  "fixtures": {
    "first_pass_artifact": "../fixtures/blockbuster-total-access/first-pass.md",
    "final_pass_artifact": "../fixtures/blockbuster-total-access/final-pass.md",
    "related_artifacts": [],
    "blind_review": null
  },
  "run_metadata": {
    "time_to_first_usable_artifact_seconds": null,
    "revision_count": null
  },
  "measures": [
    "time_to_first_usable_artifact",
    "revision_count",
    "contradiction_count",
    "blind_human_rating"
  ]
}