feat(notebook): add /tinker skill and api-exploration template for no…#154
feat(notebook): add /tinker skill and api-exploration template for no…#154glassBead-tc wants to merge 1 commit intomainfrom
Conversation
…tebook-first workflows Addresses the "coding blind" problem where agents implement against APIs they've never actually called. Adds three interconnected features: 1. **api-exploration template**: Structured notebook template with phases for assumption inventory, shape validation, edge case testing, and evidence export. 2. **/tinker skill**: Guides agents through notebook-based API exploration before implementation. Records validated mental models as belief_snapshot thoughts. 3. **Server-initiated exploration via sampling**: SamplingHandler.requestExploration() uses MCP sampling/createMessage to ask the agent's LLM to explore untested APIs in a notebook. The gateway appends non-blocking nudges when it detects implementation thoughts about unvalidated APIs. 4. **Workflow integration**: /workflow pipeline now includes a conditional Stage 3.5 (Tinker) between Planning and Implementation that gates on notebook evidence. Inspired by @lateinteraction's observation that coding agents should tinker at small scale in notebooks before writing complex scripts. https://claude.ai/code/session_012WEeqRaye2Exoc6cURRyex
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Greptile SummaryThis PR adds a notebook-first API exploration workflow to Thoughtbox, comprising a new The skill files, template, and workflow integration are well-designed. However, the core runtime implementation in
Confidence Score: 2/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Agent submits thought via thoughtbox_gateway] --> B{result.isError?}
B -- yes --> Z[Return result as-is]
B -- no --> C[checkExplorationEvidence]
C --> D{thoughtType is belief_snapshot\nor assumption_update?}
D -- yes --> H[Skip nudge]
D -- no --> E{Thought matches\nEXPLORATION_SIGNALS?}
E -- yes --> H
E -- no --> F{Thought matches\nIMPLEMENTATION_SIGNALS?}
F -- no --> H
F -- yes --> G{extractApiTarget returns\nnon-null AND target\nalready in exploredTargets?}
G -- yes --> H
G -- no --> I[Append text nudge to result.content]
H --> J{thoughtType === belief_snapshot?}
I --> J
J -- yes --> K[extractApiTarget from thought\nif non-null → markExplored]
J -- no --> L[Profile priming check]
K --> L
L --> M[Return result]
subgraph DEAD_CODE ["⚠️ Never Called"]
N[SamplingHandler.requestExploration]
end
I -.->|PR description implies\nthis triggers sampling| N
Last reviewed commit: b993e12 |
| /** Track APIs/libraries that have been explored in notebooks (prevents repeated nudges) */ | ||
| private exploredTargets = new Set<string>(); |
There was a problem hiding this comment.
exploredTargets is instance-scoped, not session-scoped
exploredTargets is a single Set<string> shared across all MCP sessions on the same GatewayHandler instance. In a multi-agent environment, session A's exploration of, say, "stripe" will permanently suppress nudges for session B — even if session B has never explored Stripe at all.
Every other piece of per-session state in this class uses a Map keyed by session ID: sessionsPrimed, sessionStages. This Set should follow the same pattern to avoid cross-session contamination:
/** Track APIs/libraries that have been explored per session (prevents repeated nudges) */
private exploredTargets = new Map<string, Set<string>>();
Then in checkExplorationEvidence and markExplored, gate on the current mcpSessionId (which is already threaded through handleThought).
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/gateway-handler.ts
Line: 211-212
Comment:
**`exploredTargets` is instance-scoped, not session-scoped**
`exploredTargets` is a single `Set<string>` shared across all MCP sessions on the same `GatewayHandler` instance. In a multi-agent environment, session A's exploration of, say, `"stripe"` will permanently suppress nudges for session B — even if session B has never explored Stripe at all.
Every other piece of per-session state in this class uses a `Map` keyed by session ID: `sessionsPrimed`, `sessionStages`. This Set should follow the same pattern to avoid cross-session contamination:
```
/** Track APIs/libraries that have been explored per session (prevents repeated nudges) */
private exploredTargets = new Map<string, Set<string>>();
```
Then in `checkExplorationEvidence` and `markExplored`, gate on the current `mcpSessionId` (which is already threaded through `handleThought`).
How can I resolve this? If you propose a fix, please make it concise.| async requestExploration( | ||
| target: string, | ||
| assumptions: string[], | ||
| context: ThoughtData[] | ||
| ): Promise<string> { | ||
| const assumptionList = assumptions.length > 0 | ||
| ? `\n\nAssumptions detected (validate ALL of these):\n${assumptions.map((a, i) => `${i + 1}. ${a}`).join("\n")}` | ||
| : ""; | ||
|
|
||
| const contextText = context.slice(-3) | ||
| .map((t) => `Thought ${t.thoughtNumber}: ${t.thought}`) | ||
| .join("\n\n"); | ||
|
|
||
| const message = `${contextText ? contextText + "\n\n" : ""}You are about to implement code that depends on: ${target} | ||
|
|
||
| No notebook evidence was found validating your assumptions about this dependency.${assumptionList} | ||
|
|
||
| Create a Thoughtbox notebook using the api-exploration template and validate your mental model before proceeding. Start with: thoughtbox_gateway { operation: "notebook", args: { operation: "create", title: "${target} Exploration", language: "typescript", template: "api-exploration" } }`; | ||
|
|
||
| try { | ||
| const result = (await this.protocol.request( | ||
| { | ||
| method: "sampling/createMessage", | ||
| params: { | ||
| messages: [{ | ||
| role: "user" as const, | ||
| content: { | ||
| type: "text" as const, | ||
| text: message, | ||
| }, | ||
| }], | ||
| systemPrompt: EXPLORATION_SYSTEM_PROMPT, | ||
| modelPreferences: { | ||
| hints: [{ name: "claude-sonnet-4-5-20250929" }], | ||
| intelligencePriority: 0.8, | ||
| costPriority: 0.4, | ||
| }, | ||
| includeContext: "thisServer", | ||
| maxTokens: 2000, | ||
| } as SamplingParams, | ||
| }, | ||
| null as any | ||
| )) as SamplingResult; | ||
|
|
||
| return result.content.text; | ||
| } catch (error: any) { | ||
| if (error.code === -32601) { | ||
| // Client doesn't support sampling — degrade gracefully | ||
| throw error; | ||
| } | ||
| throw new Error(`Exploration sampling request failed: ${error.message}`); | ||
| } | ||
| } | ||
|
|
||
| // NOTE: Sampling capability detection happens automatically during MCP initialization | ||
| // handshake via client capabilities. No need for runtime probing. | ||
| } |
There was a problem hiding this comment.
requestExploration is dead code — never called
SamplingHandler.requestExploration() is implemented here and the PR description says "The gateway appends non-blocking nudges when it detects implementation thoughts about unvalidated APIs" via this method. However, GatewayHandler never imports SamplingHandler, never instantiates it, and never calls requestExploration. The gateway's nudge is entirely the static text block in checkExplorationEvidence.
This means the "server-initiated exploration via sampling" described in the PR is not actually wired up. The sampling-based flow (system prompt, model preferences, includeContext: "thisServer") never fires. Either:
- The integration was deferred and the method should be removed or marked
// TODO, or - There should be a call to
requestExplorationfromGatewayHandler.handleThoughtwhen a nudge is warranted (in addition to or instead of the text nudge).
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/sampling/handler.ts
Line: 186-242
Comment:
**`requestExploration` is dead code — never called**
`SamplingHandler.requestExploration()` is implemented here and the PR description says "The gateway appends non-blocking nudges when it detects implementation thoughts about unvalidated APIs" via this method. However, `GatewayHandler` never imports `SamplingHandler`, never instantiates it, and never calls `requestExploration`. The gateway's nudge is entirely the static text block in `checkExplorationEvidence`.
This means the "server-initiated exploration via sampling" described in the PR is not actually wired up. The sampling-based flow (system prompt, model preferences, `includeContext: "thisServer"`) never fires. Either:
1. The integration was deferred and the method should be removed or marked `// TODO`, or
2. There should be a call to `requestExploration` from `GatewayHandler.handleThought` when a nudge is warranted (in addition to or instead of the text nudge).
How can I resolve this? If you propose a fix, please make it concise.| // Check if the target has already been explored in this session | ||
| // (Simple heuristic: extract likely API name from thought) | ||
| const target = this.extractApiTarget(thought); | ||
| if (target && this.exploredTargets.has(target.toLowerCase())) { | ||
| return null; | ||
| } | ||
|
|
There was a problem hiding this comment.
Unbounded nudging when extractApiTarget returns null
When extractApiTarget returns null (which is common — many generic thoughts about integrations won't match the backtick/quote/capitalized-noun heuristics), the guard if (target && this.exploredTargets.has(...)) short-circuits to false. This means:
- Every implementation-signal thought with an unextractable target fires a nudge — potentially on every thought in a long planning session.
- The nudge degrades to
\/tinker [target]`` (the literal placeholder), which is unhelpful. - There is no way for the agent to suppress these nudges by recording a
belief_snapshot, becausehandleThoughtalso callsextractApiTargetto find what to mark, and if it returnsnullthere too, nothing is ever marked.
Consider adding a session-level flag (e.g. "generic implementation nudge already sent this session") so at minimum the null-target case only fires once per session.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/gateway-handler.ts
Line: 1218-1224
Comment:
**Unbounded nudging when `extractApiTarget` returns `null`**
When `extractApiTarget` returns `null` (which is common — many generic thoughts about integrations won't match the backtick/quote/capitalized-noun heuristics), the guard `if (target && this.exploredTargets.has(...))` short-circuits to `false`. This means:
1. Every implementation-signal thought with an unextractable target fires a nudge — potentially on every thought in a long planning session.
2. The nudge degrades to `\`/tinker [target]\`` (the literal placeholder), which is unhelpful.
3. There is no way for the agent to suppress these nudges by recording a `belief_snapshot`, because `handleThought` also calls `extractApiTarget` to find what to mark, and if it returns `null` there too, nothing is ever marked.
Consider adding a session-level flag (e.g. "generic implementation nudge already sent this session") so at minimum the `null`-target case only fires once per session.
How can I resolve this? If you propose a fix, please make it concise.| markExplored(target: string): void { | ||
| this.exploredTargets.add(target.toLowerCase()); | ||
| } |
There was a problem hiding this comment.
markExplored should be private
This method is called only from within handleThought (line 631) and has no reason to be part of the public API. Exposing it publicly allows external callers to pre-suppress nudges for targets that have never been explored, which defeats the purpose of the exploration gate. Mark it private:
| markExplored(target: string): void { | |
| this.exploredTargets.add(target.toLowerCase()); | |
| } | |
| private markExplored(target: string): void { | |
| this.exploredTargets.add(target.toLowerCase()); | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/gateway-handler.ts
Line: 1253-1255
Comment:
**`markExplored` should be `private`**
This method is called only from within `handleThought` (line 631) and has no reason to be part of the public API. Exposing it publicly allows external callers to pre-suppress nudges for targets that have never been explored, which defeats the purpose of the exploration gate. Mark it `private`:
```suggestion
private markExplored(target: string): void {
this.exploredTargets.add(target.toLowerCase());
}
```
How can I resolve this? If you propose a fix, please make it concise.
🤖 Augment PR SummarySummary: This PR introduces a notebook-first “tinker before you implement” workflow to reduce mistakes from coding against unvalidated APIs. Changes:
Technical Notes: The nudge system uses regex-based signals + a session-local explored-targets set, and the sampling exploration prompt instructs creation of a notebook using the new 🤖 Was this summary useful? React with 👍 or 👎 |
| return null; | ||
| } | ||
|
|
||
| return `\n---\n**Exploration nudge**: This thought mentions implementing against an external API/library, but no notebook evidence was found validating the assumed behavior. Consider running \`/tinker ${target || '[target]'}\` to validate your mental model in a notebook before writing production code. This catches shape mismatches, auth surprises, and undocumented edge cases early.`; |
There was a problem hiding this comment.
checkExplorationEvidence()'s nudge says “no notebook evidence was found”, but the current logic only uses regex heuristics + an in-memory exploredTargets set (no check of exported .tinker/* files or persisted session thoughts). This could produce misleading nudges even when evidence exists but isn’t detected by these heuristics.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| * @returns The agent's exploration plan/acknowledgment | ||
| * @throws Error if sampling fails or is not supported | ||
| */ | ||
| async requestExploration( |
There was a problem hiding this comment.
I can’t find any call sites for requestExploration() in the current codebase, so the “server-initiated exploration via sampling” behavior described in the PR may not actually trigger yet. If this is intended to be live in this PR, it likely needs wiring from the gateway/thought flow.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
|
|
||
| No notebook evidence was found validating your assumptions about this dependency.${assumptionList} | ||
|
|
||
| Create a Thoughtbox notebook using the api-exploration template and validate your mental model before proceeding. Start with: thoughtbox_gateway { operation: "notebook", args: { operation: "create", title: "${target} Exploration", language: "typescript", template: "api-exploration" } }`; |
There was a problem hiding this comment.
target is interpolated into a JSON-like tool invocation inside the prompt (e.g., title: "${target} Exploration"); if target contains quotes/newlines, the suggested command can become invalid/confusing for the model to follow. Consider sanitizing/escaping target (or avoiding embedding it in a pseudo-JSON snippet) to keep the instruction robust.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
…tebook-first workflows
Addresses the "coding blind" problem where agents implement against APIs they've never actually called. Adds three interconnected features:
api-exploration template: Structured notebook template with phases for assumption inventory, shape validation, edge case testing, and evidence export.
/tinker skill: Guides agents through notebook-based API exploration before implementation. Records validated mental models as belief_snapshot thoughts.
Server-initiated exploration via sampling: SamplingHandler.requestExploration() uses MCP sampling/createMessage to ask the agent's LLM to explore untested APIs in a notebook. The gateway appends non-blocking nudges when it detects implementation thoughts about unvalidated APIs.
Workflow integration: /workflow pipeline now includes a conditional Stage 3.5 (Tinker) between Planning and Implementation that gates on notebook evidence.
Inspired by @lateinteraction's observation that coding agents should tinker at small scale in notebooks before writing complex scripts.
https://claude.ai/code/session_012WEeqRaye2Exoc6cURRyex