From 225fc57e748ea4dd5cc26886a7b930af1ab8e93c Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Mon, 2 Feb 2026 19:46:26 +0000
Subject: [PATCH 1/6] docs: add comprehensive research on tool choice parameter

- Analyzes AI SDK source code for toolChoice implementation
- Documents provider-specific translations (OpenAI, Anthropic, Google, DeepSeek, xAI)
- Covers form-filling patterns with mandatory tool use
- Includes troubleshooting for common issues (endless loops, hallucination, reliability)
- Provides best practices for ensuring agents use tools before generating output

https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB
---
 ...search-2026-02-02-tool-choice-parameter.md | 581 ++++++++++++++++++
 1 file changed, 581 insertions(+)
 create mode 100644 docs/project/research/research-2026-02-02-tool-choice-parameter.md
diff --git a/docs/project/research/research-2026-02-02-tool-choice-parameter.md b/docs/project/research/research-2026-02-02-tool-choice-parameter.md
new file mode 100644
index 00000000..285131f0
--- /dev/null
+++ b/docs/project/research/research-2026-02-02-tool-choice-parameter.md
@@ -0,0 +1,581 @@
+# Research: Tool Choice Parameter in AI SDK and Major LLM Providers
+
+**Date:** 2026-02-02 (last updated 2026-02-02)
+
+**Author:** AI Research
+
+**Status:** Complete
+
+## Overview
+
+This research document provides a comprehensive technical overview of the `toolChoice` parameter
+implementation in the Vercel AI SDK and across major LLM providers. The focus is on understanding
+how to ensure agents reliably use tools (especially for form-filling use cases where web search
+or other research tools should be invoked before populating form fields).
+
+## Questions to Answer
+
+1. How does the AI SDK implement `toolChoice` and translate it to each provider's native format?
+2. What are the exact behaviors and options for each major provider (OpenAI, Anthropic, Google,
+   DeepSeek, xAI/Grok)?
+3. What are the best practices for ensuring agents use tools (especially web search) before
+   filling in forms or generating structured output?
+4. What are the common issues and pitfalls when using `toolChoice`?
+5. What patterns exist for combining tool calling with structured output?
+
+## Scope
+
+- **Included**: AI SDK implementation details (source code analysis), provider-specific behaviors,
+  community best practices, form-filling patterns, troubleshooting guidance
+- **Excluded**: Implementation of specific form-filling applications, UI/UX considerations
+
+---
+
+## Findings
+
+### 1. AI SDK Core Implementation
+
+#### 1.1 Type Definition
+
+The AI SDK defines `ToolChoice` in `packages/ai/src/types/language-model.ts:100-104`:
+
+```typescript
+export type ToolChoice<TOOLS extends Record<string, unknown>> =
+  | 'auto'
+  | 'none'
+  | 'required'
+  | { type: 'tool'; toolName: Extract<keyof TOOLS, string> };
+```
+
+**Options:**
+- `'auto'` (default): The model can choose whether and which tools to call
+- `'none'`: The model must not call tools
+- `'required'`: The model must call a tool (can choose which one)
+- `{ type: 'tool', toolName: string }`: The model must call the specified tool
+
+#### 1.2 Core Translation Logic
+
+In `packages/ai/src/prompt/prepare-tools-and-tool-choice.ts:79-85`, the SDK translates the
+user-facing `toolChoice` to the internal provider format:
+
+```typescript
+toolChoice:
+  toolChoice == null
+    ? { type: 'auto' }
+    : typeof toolChoice === 'string'
+      ? { type: toolChoice }
+      : { type: 'tool' as const, toolName: toolChoice.toolName as string },
+```
+
+**Key insight**: When `toolChoice` is `undefined`/`null`, it defaults to `{ type: 'auto' }`.
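+
+The defaulting rule can be illustrated with a standalone sketch. The names below
+(`UserToolChoice`, `ProviderToolChoice`, `normalizeToolChoice`) are illustrative assumptions
+and are not SDK exports; only the branching mirrors the excerpt above.
+
+```typescript
+// Sketch only: mirrors the ternary from the excerpt; the names are hypothetical.
+type UserToolChoice =
+  | 'auto'
+  | 'none'
+  | 'required'
+  | { type: 'tool'; toolName: string };
+
+type ProviderToolChoice =
+  | { type: 'auto' | 'none' | 'required' }
+  | { type: 'tool'; toolName: string };
+
+function normalizeToolChoice(
+  toolChoice?: UserToolChoice | null,
+): ProviderToolChoice {
+  // Missing value falls back to automatic tool selection.
+  if (toolChoice == null) return { type: 'auto' };
+  // String shorthands are wrapped into the object form.
+  if (typeof toolChoice === 'string') return { type: toolChoice };
+  // A specific tool is passed through by name.
+  return { type: 'tool', toolName: toolChoice.toolName };
+}
+
+console.log(normalizeToolChoice(undefined)); // { type: 'auto' }
+console.log(normalizeToolChoice('required')); // { type: 'required' }
+```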
+
+#### 1.3 Provider-Level Type
+
+The internal type used by providers is `LanguageModelV3ToolChoice` in
+`packages/provider/src/language-model/v3/language-model-v3-tool-choice.ts`:
+
+```typescript
+export type LanguageModelV3ToolChoice =
+  | { type: 'auto' }    // tool selection is automatic (can be no tool)
+  | { type: 'none' }    // no tool must be selected
+  | { type: 'required' } // one of the available tools must be selected
+  | { type: 'tool'; toolName: string }; // a specific tool must be selected
+```
+
+---
+
+### 2. Provider-Specific Implementations
+
+#### 2.1 OpenAI
+
+**Source**: `packages/openai/src/chat/openai-chat-prepare-tools.ts:59-76`
+
+**Translation:**
+
+| AI SDK Value | OpenAI Native Value |
+|--------------|---------------------|
+| `auto` | `'auto'` |
+| `none` | `'none'` |
+| `required` | `'required'` |
+| `{ type: 'tool', toolName }` | `{ type: 'function', function: { name: toolName } }` |
+
+**Native API Documentation:**
+- `tool_choice: "auto"` - Model decides whether to call functions (default)
+- `tool_choice: "none"` - Model will not call any tool, generates message only
+- `tool_choice: "required"` - Model must call one or more tools
+- `tool_choice: { type: "function", function: { name: "..." } }` - Force specific function
+
+**Notable:** OpenAI supports parallel function calling by default.
+
+**Sources:**
+- [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)
+- [OpenAI Tools Guide](https://platform.openai.com/docs/guides/tools)
+
+#### 2.2 Anthropic (Claude)
+
+**Source**: `packages/anthropic/src/anthropic-prepare-tools.ts:310-353`
+
+**Translation (with important differences):**
+
+| AI SDK Value | Anthropic Native Value |
+|--------------|------------------------|
+| `auto` | `{ type: 'auto' }` |
+| `none` | *removes tools entirely* (the SDK treats `'none'` as unsupported for Anthropic) |
+| `required` | `{ type: 'any' }` (**Note: 'any', not 'required'**) |
+| `{ type: 'tool', toolName }` | `{ type: 'tool', name: toolName }` |
+
+**Critical Insight from Source Code (lines 333-335):**
+```typescript
+case 'none':
+  // Anthropic does not support 'none' tool choice, so we remove the tools:
+  return { tools: undefined, toolChoice: undefined, toolWarnings, betas };
+```
+
+**Anthropic-Specific Features:**
+- `disable_parallel_tool_use: boolean` - Can be combined with any toolChoice type
+- Setting `disable_parallel_tool_use=true` with `type: 'any'` or `type: 'tool'` ensures
+  exactly one tool is called
+
+**Extended Thinking Limitation:**
+When using extended thinking, only `tool_choice: {"type": "auto"}` and
+`tool_choice: {"type": "none"}` are compatible. Using `any` or `tool` types will error.
+
+**Sources:**
+- [Anthropic Tool Use Documentation](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use)
+- [Anthropic Advanced Tool Use](https://www.anthropic.com/engineering/advanced-tool-use)
+
+#### 2.3 Google (Gemini)
+
+**Source**: `packages/google/src/google-prepare-tools.ts:225-256`
+
+**Translation:**
+
+| AI SDK Value | Gemini Native Value |
+|--------------|---------------------|
+| `auto` | `{ functionCallingConfig: { mode: 'AUTO' } }` |
+| `none` | `{ functionCallingConfig: { mode: 'NONE' } }` |
+| `required` | `{ functionCallingConfig: { mode: 'ANY' } }` |
+| `{ type: 'tool', toolName }` | `{ functionCallingConfig: { mode: 'ANY', allowedFunctionNames: [toolName] } }` |
+
+**Native API Options:**
+- `AUTO` (default): Model decides whether to call functions
+- `NONE`: Model cannot make function calls
+- `ANY`: Forces model to predict a function call
+- `VALIDATED` (Preview): Like ANY but allows text responses too
+
+**Best Practices from Google:**
+- Keep active tools to **10-20 maximum** to reduce selection errors
+- Use **low temperature** (e.g., 0) for deterministic function calls
+- Apply **strong typing** (enums for fixed value sets)
+
+**Sources:**
+- [Google AI Function Calling](https://ai.google.dev/gemini-api/docs/function-calling)
+- [Vertex AI Function Calling](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/function-calling)
+
+#### 2.4 DeepSeek
+
+**Source**: `packages/deepseek/src/chat/deepseek-prepare-tools.ts:54-68`
+
+**Translation (follows OpenAI format):**
+
+| AI SDK Value | DeepSeek Native Value |
+|--------------|----------------------|
+| `auto` | `'auto'` |
+| `none` | `'none'` |
+| `required` | `'required'` |
+| `{ type: 'tool', toolName }` | `{ type: 'function', function: { name: toolName } }` |
+
+**Notable Limitations:**
+- DeepSeek's official documentation does not explicitly document the `tool_choice` parameter
+- Uses OpenAI-compatible API format
+- The model may hallucinate parameters not in your schema - validate arguments before calling
+- Multi-turn function calling is weak; it performs best when a single user message triggers the calls
+
+**Sources:**
+- [DeepSeek Function Calling](https://api-docs.deepseek.com/guides/function_calling)
+- [DeepSeek Tool Calls](https://api-docs.deepseek.com/guides/tool_calls)
+
+#### 2.5 xAI (Grok)
+
+**Source**: `packages/xai/src/xai-prepare-tools.ts:71-86` and
+`packages/xai/src/responses/xai-responses-prepare-tools.ts:156-186`
+
+**Translation:**
+
+| AI SDK Value | xAI Native Value |
+|--------------|------------------|
+| `auto` | `'auto'` |
+| `none` | `'none'` |
+| `required` | `'required'` |
+| `{ type: 'tool', toolName }` | `{ type: 'function', name: toolName }` |
+
+**Notable from Source Code (`xai-responses-prepare-tools.ts`, lines 173-180):**
+```typescript
+if (selectedTool.type === 'provider') {
+  // xAI API does not support forcing specific server-side tools via toolChoice
+  // Only function tools can be forced
+  toolWarnings.push({
+    type: 'unsupported',
+    feature: `toolChoice for server-side tool "${selectedTool.name}"`,
+  });
+```
+
+**Recommended Model:** xAI recommends `grok-4-1-fast` for agentic tool calling.
+
+**Sources:**
+- [xAI Function Calling](https://docs.x.ai/docs/guides/function-calling)
+- [xAI Tools Overview](https://docs.x.ai/docs/guides/tools/overview)
+
+---
+
+### 3. Summary: Provider Translation Table
+
+| AI SDK `toolChoice` | OpenAI | Anthropic | Google | DeepSeek | xAI |
+|---------------------|--------|-----------|--------|----------|-----|
+| `'auto'` | `'auto'` | `{ type: 'auto' }` | `mode: 'AUTO'` | `'auto'` | `'auto'` |
+| `'none'` | `'none'` | *removes tools* | `mode: 'NONE'` | `'none'` | `'none'` |
+| `'required'` | `'required'` | `{ type: 'any' }` | `mode: 'ANY'` | `'required'` | `'required'` |
+| `{ type: 'tool', toolName: 'x' }` | `{ type: 'function', function: { name: 'x' } }` | `{ type: 'tool', name: 'x' }` | `mode: 'ANY', allowedFunctionNames: ['x']` | `{ type: 'function', function: { name: 'x' } }` | `{ type: 'function', name: 'x' }` |
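+
+The table can also be read as a single lookup function. The sketch below is a restatement of
+the rows above, not the SDK's code: `Provider`, `SdkToolChoice`, and `toNativeToolChoice` are
+hypothetical names, and `undefined` stands in for the Anthropic "tools removed" case.
+
+```typescript
+// Sketch only: encodes the summary table as code; not the SDK implementation.
+type SdkToolChoice =
+  | { type: 'auto' }
+  | { type: 'none' }
+  | { type: 'required' }
+  | { type: 'tool'; toolName: string };
+
+type Provider = 'openai' | 'anthropic' | 'google' | 'deepseek' | 'xai';
+
+// Returns the native value from the table; undefined means "tools removed".
+function toNativeToolChoice(provider: Provider, choice: SdkToolChoice): unknown {
+  switch (provider) {
+    case 'openai':
+    case 'deepseek':
+      return choice.type === 'tool'
+        ? { type: 'function', function: { name: choice.toolName } }
+        : choice.type;
+    case 'xai':
+      return choice.type === 'tool'
+        ? { type: 'function', name: choice.toolName }
+        : choice.type;
+    case 'anthropic':
+      if (choice.type === 'none') return undefined; // tools are dropped instead
+      if (choice.type === 'required') return { type: 'any' };
+      if (choice.type === 'tool') return { type: 'tool', name: choice.toolName };
+      return { type: 'auto' };
+    case 'google': {
+      if (choice.type === 'tool') {
+        return {
+          functionCallingConfig: { mode: 'ANY', allowedFunctionNames: [choice.toolName] },
+        };
+      }
+      const mode = { auto: 'AUTO', none: 'NONE', required: 'ANY' }[choice.type];
+      return { functionCallingConfig: { mode } };
+    }
+  }
+}
+
+console.log(toNativeToolChoice('anthropic', { type: 'required' })); // { type: 'any' }
+console.log(toNativeToolChoice('openai', { type: 'required' })); // required
+```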
+
+---
+
+### 4. Form-Filling Use Cases and Patterns
+
+#### 4.1 The Challenge
+
+When building form-filling agents, a common issue is that the model may:
+1. **Hallucinate data** instead of using tools to research
+2. **Skip tool calls** and go directly to filling the form
+3. **Analyze/plan** what it would do instead of actually calling tools
+4. **Call tools unreliably** after ~5 messages in a conversation
+
+#### 4.2 Pattern: Answer Tool with `toolChoice: 'required'`
+
+**Recommended Approach (AI SDK 6+):**
+
+Use an "answer" tool without an `execute` function and `toolChoice: 'required'` to force
+structured output:
+
+```typescript
+import { generateText, hasToolCall, tool } from 'ai';
+import { z } from 'zod';
+
+const result = await generateText({
+  model: yourModel,
+  tools: {
+    webSearch: tool({
+      description: 'Search the web for information',
+      inputSchema: z.object({ query: z.string() }),
+      execute: async ({ query }) => { /* search implementation */ }
+    }),
+    submitForm: tool({
+      description: 'Submit the completed form with researched data',
+      inputSchema: z.object({
+        field1: z.string().describe('Value for field1 (must be researched)'),
+        field2: z.string().describe('Value for field2 (must be researched)'),
+      }),
+      // No execute function - acts as termination signal
+    }),
+  },
+  toolChoice: 'required', // Must use a tool at every step
+  stopWhen: hasToolCall('submitForm'), // Stop when form is submitted
+  system: `You are a research assistant. Before filling ANY form field:
+1. Use webSearch to find accurate, current information
+2. NEVER guess or hallucinate data
+3. Only call submitForm when you have researched ALL fields`,
+  prompt: userQuery,
+});
+
+// Read the form data from the submitForm call (a tool without execute)
+const formData = result.staticToolCalls.find(
+  call => call.toolName === 'submitForm'
+)?.input;
+```
+
+#### 4.3 Pattern: AI SDK 6 Unified Output
+
+**New in AI SDK 6:** Combine tool calling with structured output in one flow:
+
+```typescript
+import { generateText, Output } from 'ai';
+import { z } from 'zod';
+
+const result = await generateText({
+  model: yourModel,
+  tools: { webSearch, fetchUrl }, // tool definitions assumed to exist elsewhere
+  output: Output.object({
+    schema: z.object({
+      companyName: z.string(),
+      foundedYear: z.number(),
+      headquarters: z.string(),
+    }),
+  }),
+  system: `Research the company thoroughly using web search before
+           providing structured output. Do not hallucinate.`,
+  prompt: 'Get information about Anthropic',
+});
+
+// result.output contains the structured data
+```
+
+**Important:** Structured output generation counts as an additional step, so `stopWhen` must
+allow for the tool-calling steps plus one final step that produces the output.
+
+#### 4.4 Pattern: Explicit Tool Guidance in Prompts
+
+**System Prompt Best Practices:**
+
+```
+CRITICAL INSTRUCTIONS FOR TOOL USE:
+1. For ANY information that could be time-sensitive, ALWAYS use webSearch first
+2. For ANY factual claims (dates, numbers, names), ALWAYS verify with webSearch
+3. NEVER fill in form fields with guessed or assumed data
+4. If webSearch returns no results, explicitly state "Unknown" rather than guessing
+5. Call tools BEFORE reasoning about the answer, not after
+```
+
+#### 4.5 Pattern: Multi-Step Verification Loop
+
+For critical data accuracy, use a verification pattern:
+
+```typescript
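+import { generateText, hasToolCall, tool } from 'ai';
+import { z } from 'zod';
+
+// webSearch, formSchema, needsVerification, yourModel, and userQuery are assumed
+// to be defined elsewhere.
+const result = await generateText({
+  model: yourModel,
+  tools: {
+    webSearch,
+    verifyFact: tool({
+      description: 'Double-check a fact by searching again',
+      inputSchema: z.object({
+        fact: z.string(),
+        originalSource: z.string(),
+      }),
+      execute: async ({ fact }) => { /* second search */ },
+    }),
+    submitVerifiedForm: tool({
+      description: 'Submit only after all facts are verified',
+      inputSchema: formSchema,
+    }),
+  },
+  stopWhen: hasToolCall('submitVerifiedForm'),
+  prepareStep: ({ steps }) => {
+    // prepareStep receives the completed steps; inspect the latest tool results
+    const lastToolResults = steps.at(-1)?.toolResults ?? [];
+    // Force verification if not all fields verified
+    if (needsVerification(lastToolResults)) {
+      return { toolChoice: { type: 'tool', toolName: 'verifyFact' } };
+    }
+    return {};
+  },
+  prompt: userQuery,
+});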
+```
+
+---
+
+### 5. Common Issues and Troubleshooting
+
+#### 5.1 Tool Execution Becomes Unreliable After ~5 Messages
+
+**Issue:** Models increasingly fail to execute tools after approximately 5 messages, instead
+analyzing or describing what they would do.
+
+**Solutions:**
+- Add explicit tool-use reminders in subsequent messages: "Remember to USE the webSearch
+  tool, not describe using it"
+- Trim or summarize older messages periodically (for example, by returning a shortened
+  `messages` array from `prepareStep`)
+- Use `toolChoice: 'required'` to force tool usage
+
+#### 5.2 Endless Loop with `toolChoice: 'required'`
+
+**Issue:** Setting `toolChoice: 'required'` can cause infinite loops when using `streamText`.
+
+**Solutions:**
+- Use `stopWhen: hasToolCall('finalTool')` with a termination tool
+- Use `stopWhen: stepCountIs(n)` as a safety limit
+- Use `prepareStep` to dynamically change `toolChoice` on the final step:
+
+```typescript
+prepareStep: ({ stepNumber }) => {
+  if (stepNumber >= 5) {
+    return { toolChoice: 'auto' }; // Allow text response
+  }
+  return { toolChoice: 'required' };
+},
+```
+
+#### 5.3 Model Hallucinating Tool Calls
+
+**Issue:** The model says it is calling a tool but is actually hallucinating the results.
+
+**Solutions:**
+- Check for actual tool_use blocks in response, not just text mentioning tools
+- Use `toolChoice: 'required'` to force structured tool calls
+- Implement validation on tool results before accepting
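+
+The first check can be sketched as a small guard over the recorded steps. `StepLike` is a
+hypothetical minimal shape used only for this illustration (a real step result carries more
+fields), and `actuallyCalledTool` is an assumed helper name.
+
+```typescript
+// Sketch only: StepLike is a hypothetical minimal shape, not an SDK type.
+interface StepLike {
+  text: string;
+  toolCalls: { toolName: string }[];
+}
+
+// True only when some step contains a structured call to the named tool.
+// Text that merely mentions the tool does not count.
+function actuallyCalledTool(steps: StepLike[], toolName: string): boolean {
+  return steps.some(step =>
+    step.toolCalls.some(call => call.toolName === toolName),
+  );
+}
+
+const describedOnly: StepLike[] = [
+  { text: 'I will now use webSearch to find the founding year.', toolCalls: [] },
+];
+console.log(actuallyCalledTool(describedOnly, 'webSearch')); // false
+```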
+
+#### 5.4 Anthropic: 'none' Doesn't Work as Expected
+
+**Issue:** `toolChoice: 'none'` with Anthropic doesn't just prevent tool use, it removes
+all tool definitions.
+
+**Solution:** This is intentional per the source code. If you need tools available but not
+used in a specific call, use prompting instead: "Do not use any tools for this response."
+
+#### 5.5 Parallel Tool Calls Not Working
+
+**Causes:**
+1. Incorrect tool result formatting (separate messages instead of combined)
+2. Weak prompting
+3. Model-specific limitations (Claude Sonnet 3.7 is less likely to make parallel calls than Claude 4 models)
+
+**Solution:**
+```typescript
+// Wrong: Separate messages
+[
+  { role: 'assistant', content: [tool_use_1, tool_use_2] },
+  { role: 'user', content: [tool_result_1] },
+  { role: 'user', content: [tool_result_2] },  // Separate
+]
+
+// Correct: Single message with all results
+[
+  { role: 'assistant', content: [tool_use_1, tool_use_2] },
+  { role: 'user', content: [tool_result_1, tool_result_2] },  // Combined
+]
+```
+
+---
+
+### 6. Best Practices Summary
+
+#### 6.1 For Reliable Tool Use
+
+1. **Use `toolChoice: 'required'`** when tools MUST be used
+2. **Provide detailed tool descriptions** - this is the most important factor
+3. **Limit tools to 5-7** for optimal selection accuracy (10-20 max)
+4. **Use explicit prompting** about when to use which tool
+5. **Implement answer/termination tools** for structured output flows
+
+#### 6.2 For Form-Filling Specifically
+
+1. **Always research before filling** - use `toolChoice: 'required'` initially
+2. **Use an answer tool without execute** - terminates loop with structured data
+3. **Validate all tool results** - don't trust raw model outputs
+4. **Use the AI SDK 6 `output` option** for cleaner structured output flows
+5. **Add verification steps** for critical data
+
+#### 6.3 Provider-Specific Recommendations
+
+| Provider | Recommendation |
+|----------|----------------|
+| OpenAI | Use `'required'` directly; supports parallel calls |
+| Anthropic | Remember `required` → `any` translation; use `disable_parallel_tool_use` for single calls |
+| Google | Use `ANY` mode with `allowedFunctionNames` for specific tools |
+| DeepSeek | Validate tool arguments; avoid multi-turn tool flows |
+| xAI | Use `grok-4-1-fast` for best tool calling; can't force provider tools |
+
+---
+
+## Options Considered
+
+### Option A: Use `toolChoice: 'required'` Everywhere
+
+**Description:** Force tool use on every step until completion.
+
+**Pros:**
+- Guarantees tools are called
+- Prevents hallucination of tool results
+
+**Cons:**
+- Can cause infinite loops without proper termination
+- May force unnecessary tool calls
+- Not compatible with Anthropic extended thinking
+
+### Option B: Use `toolChoice: 'auto'` with Strong Prompting
+
+**Description:** Rely on system prompts to guide tool use.
+
+**Pros:**
+- More flexible
+- Works with all features (extended thinking, etc.)
+- Natural conversation flow
+
+**Cons:**
+- Model may ignore prompts and skip tools
+- Reliability degrades over long conversations
+- Harder to guarantee tool usage
+
+### Option C: Hybrid Approach with `prepareStep`
+
+**Description:** Use `toolChoice: 'required'` initially, switch to `'auto'` for final
+response.
+
+**Pros:**
+- Best of both worlds
+- Guarantees initial research
+- Allows natural completion
+
+**Cons:**
+- More complex implementation
+- Requires careful step management
+
+---
+
+## Recommendations
+
+1. **For form-filling with mandatory research:** Use Option C (Hybrid) with:
+   - `toolChoice: 'required'` for first N steps
+   - An answer tool without execute function
+   - `stopWhen: hasToolCall('submitForm')`
+
+2. **For simpler tool integration:** Use Option A with proper termination:
+   - Define a clear termination tool
+   - Use `stopWhen` to prevent infinite loops
+
+3. **For conversation-like interfaces:** Use Option B with:
+   - Strong system prompts
+   - Explicit tool-use instructions in user messages
+   - Periodic context compaction
+
+---
+
+## Next Steps
+
+- [ ] Implement the recommended hybrid pattern in markform
+- [ ] Add tool input validation for form fields
+- [ ] Create a verification step for critical data
+- [ ] Test across multiple providers for consistency
+
+---
+
+## References
+
+### AI SDK Documentation
+- [AI SDK Tool Calling](https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling)
+- [AI SDK Agents: Loop Control](https://ai-sdk.dev/docs/agents/loop-control)
+- [AI SDK Generating Structured Data](https://ai-sdk.dev/docs/ai-sdk-core/generating-structured-data)
+- [AI SDK Troubleshooting: Tool Calling with Structured Outputs](https://ai-sdk.dev/docs/troubleshooting/tool-calling-with-structured-outputs)
+- [AI SDK 6 Announcement](https://vercel.com/blog/ai-sdk-6)
+
+### Provider Documentation
+- [OpenAI Function Calling](https://platform.openai.com/docs/guides/function-calling)
+- [Anthropic Tool Use](https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview)
+- [Anthropic Implement Tool Use](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use)
+- [Google Gemini Function Calling](https://ai.google.dev/gemini-api/docs/function-calling)
+- [DeepSeek Function Calling](https://api-docs.deepseek.com/guides/function_calling)
+- [xAI Function Calling](https://docs.x.ai/docs/guides/function-calling)
+
+### AI SDK Source Code References
+- `packages/ai/src/types/language-model.ts:100-104` - ToolChoice type definition
+- `packages/ai/src/prompt/prepare-tools-and-tool-choice.ts` - Core translation logic
+- `packages/provider/src/language-model/v3/language-model-v3-tool-choice.ts` - Provider-level type
+- `packages/openai/src/chat/openai-chat-prepare-tools.ts:59-76` - OpenAI translation
+- `packages/anthropic/src/anthropic-prepare-tools.ts:310-353` - Anthropic translation
+- `packages/google/src/google-prepare-tools.ts:225-256` - Google translation
+- `packages/deepseek/src/chat/deepseek-prepare-tools.ts:54-68` - DeepSeek translation
+- `packages/xai/src/xai-prepare-tools.ts:71-86` - xAI translation
+
+### Community Resources
+- [GitHub Issue: Tool Execution Unreliable After ~5 Messages](https://github.com/vercel/ai/issues/10269)
+- [GitHub Issue: toolChoice 'required' Endless Loop](https://github.com/vercel/ai/issues/3944)
+- [Vercel Blog: We Removed 80% of Our Agent's Tools](https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools)
+- [GitHub Discussion: Tool Calling Loop Understanding](https://github.com/vercel/ai/discussions/8514)
+
+### Hallucination Prevention
+- [Zep: Reducing LLM Hallucinations](https://www.getzep.com/ai-agents/reducing-llm-hallucinations/)
+- [Cleanlab: Prevent Hallucinated Responses](https://cleanlab.ai/blog/prevent-hallucinated-responses/)
+- [AWS: Reducing Hallucinations with Verified Semantic Cache](https://aws.amazon.com/blogs/machine-learning/reducing-hallucinations-in-llm-agents-with-a-verified-semantic-cache-using-amazon-bedrock-knowledge-bases/)

From bbb981f72f8c87ce9ceaf1ce245bbeca68da93d4 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Mon, 2 Feb 2026 19:59:46 +0000
Subject: [PATCH 2/6] docs: add plan spec for tool choice policies

Defines a toolPolicy system to ensure agents reliably use tools:
- auto: Current behavior (model decides)
- require-tools: Force tool use every turn
- research-first: Force web search on first turn
- research-always: Force search every turn
- two-phase: Separate research and fill phases

Includes provider-specific considerations and testing plan for
DeepSeek, Anthropic, Google, OpenAI, and xAI.

Related: research-2026-02-02-tool-choice-parameter.md

https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB
---
 .../plan-2026-02-02-tool-choice-policies.md   | 396 ++++++++++++++++++
 1 file changed, 396 insertions(+)
 create mode 100644 docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md

diff --git a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
new file mode 100644
index 00000000..3a1d9e3b
--- /dev/null
+++ b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
@@ -0,0 +1,396 @@
+# Plan Spec: Tool Choice Policies for Reliable Form Filling
+
+**Date:** 2026-02-02 (last updated 2026-02-02)
+
+**Author:** AI Research
+
+**Status:** Draft
+
+## Overview
+
+This spec defines a **tool choice policy system** for Markform that gives form authors and
+consumers fine-grained control over how agents use tools (especially web search) during form
+filling. The goal is to ensure agents reliably research information before filling fields,
+reducing hallucination and improving data accuracy.
+
+**Related Docs:**
+- `docs/project/research/research-2026-02-02-tool-choice-parameter.md` - Research on AI SDK
+  toolChoice and provider behavior
+- `docs/project/specs/active/plan-2026-01-27-parallel-form-filling.md` - Parallel execution spec
+
+## Goals
+
+1. **Reduce hallucination**: Ensure agents use web search and other research tools before
+   filling fields that require external data
+2. **Configurable policies**: Provide multiple tool use policies that balance research
+   thoroughness against latency/cost
+3. **Cross-model compatibility**: Work reliably across providers (OpenAI, Anthropic, Google,
+   DeepSeek, xAI)
+4. **Form-level and field-level control**: Allow policies at form level with optional per-field
+   overrides
+5. **Integration with parallel execution**: Policies should compose with the existing `parallel`
+   and `order` attributes
+
+## Non-Goals
+
+- Custom tool definitions per form (tools are provided by the harness)
+- Model-specific prompt tuning (policies should work across models)
+- UI for policy configuration (CLI/API only for now)
+- Automatic policy selection based on field content
+
+## Background
+
+### The Problem
+
+Current Markform behavior uses `toolChoice: 'auto'` by default, meaning the model decides
+whether to use tools. In practice, models often:
+
+1. **Skip web search entirely** and fill fields with training data (which may be outdated)
+2. **Hallucinate tool calls** by describing what they would search without actually searching
+3. **Become unreliable after ~5 turns** as conversation length increases
+4. **Fill forms without research** when under time pressure or with simpler prompts
+
+### Research Findings
+
+From the research doc (`research-2026-02-02-tool-choice-parameter.md`):
+
+- `toolChoice: 'required'` forces tool use but can cause infinite loops without termination
+- `toolChoice: { type: 'tool', toolName: 'webSearch' }` forces a specific tool
+- Different providers translate these values differently (Anthropic: `required` → `any`)
+- AI SDK 6's `prepareStep` allows dynamic toolChoice per turn
+- Google recommends limiting to 10-20 tools for reliable selection
+- DeepSeek has unreliable multi-turn tool calling; best for single-turn
+
+### Current Implementation
+
+The harness provides these tools to the agent:
+- `fill_form` - Apply patches to the form
+- `web_search` (optional, via `enableWebSearch`) - Search the web for information
+
+Currently, `toolChoice` is not explicitly set, defaulting to `'auto'`.
+
+## Design
+
+### Tool Choice Policy Enum
+
+A new `toolPolicy` option controls how the harness manages tool selection:
+
+```typescript
+type ToolPolicy =
+  | 'auto'              // Current behavior: model chooses freely
+  | 'require-tools'     // toolChoice: 'required' on every turn
+  | 'research-first'    // Force web_search on first turn, then auto
+  | 'research-always'   // Force web_search every turn until form complete
+  | 'two-phase'         // Phase 1: research only, Phase 2: fill only
+```
+
+### Policy Behaviors
+
+#### `auto` (Default - Current Behavior)
+
+```
+toolChoice: 'auto' on every turn
+Model decides when to search and when to fill
+No enforcement of research before filling
+```
+
+**When to use:** Simple forms, fields that don't need external research, testing
+
+#### `require-tools`
+
+```
+toolChoice: 'required' on every turn
+Model must call SOME tool every turn (fill_form or web_search)
+Prevents "analysis paralysis" where model talks without acting
+Uses termination detection to allow final response
+```
+
+**When to use:** General production use, ensures progress every turn
+
+#### `research-first`
+
+```
+Turn 1: toolChoice: { type: 'tool', toolName: 'web_search' }
+Turn 2+: toolChoice: 'required'
+```
+
+**When to use:** Forms with factual fields that need current data, moderate latency tolerance
+
+#### `research-always`
+
+```
+Every turn: First call must be web_search (via prepareStep logic)
+After web_search returns, toolChoice: 'required' for rest of turn
+```
+
+**When to use:** High-accuracy requirements, fields with rapidly changing data
+
+**Implementation:**
+```typescript
+prepareStep: ({ steps }) => {
+  // prepareStep receives the completed steps; check the latest step's tool results
+  const lastToolResults = steps.at(-1)?.toolResults ?? [];
+  const hasSearchedThisTurn = lastToolResults.some(
+    r => r.toolName === 'web_search'
+  );
+  if (!hasSearchedThisTurn) {
+    return { toolChoice: { type: 'tool', toolName: 'web_search' } };
+  }
+  return { toolChoice: 'required' };
+},
+```
+
+#### `two-phase`
+
+```
+Phase 1 (Research): Only web_search available, toolChoice: 'required'
+                    Runs until configurable turn count or all fields researched
+Phase 2 (Fill):     Only fill_form available, toolChoice: 'required'
+                    Uses research context from Phase 1
+```
+
+**When to use:** Maximum accuracy, complex research forms, acceptable latency
+
+**Implementation:** Two separate agent invocations with different tool sets
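+
+Taken together, the policy behaviors above could be collapsed into one pure helper. This is a
+sketch under assumptions: `StepContext` and its fields are hypothetical, the helper signature
+is a guess at the shape the harness might use, and the `two-phase` branch relies on each phase
+exposing only one tool.
+
+```typescript
+// Sketch only: one possible policy-to-toolChoice mapping; the types are hypothetical.
+type ToolPolicy =
+  | 'auto'
+  | 'require-tools'
+  | 'research-first'
+  | 'research-always'
+  | 'two-phase';
+
+type ToolChoice = 'auto' | 'required' | { type: 'tool'; toolName: string };
+
+// Assumed per-step context; the real harness may track this differently.
+interface StepContext {
+  turn: number; // 1-based turn index
+  searchedThisTurn: boolean; // has web_search already returned this turn?
+}
+
+const FORCE_SEARCH: ToolChoice = { type: 'tool', toolName: 'web_search' };
+
+function getToolChoiceForPolicy(policy: ToolPolicy, ctx: StepContext): ToolChoice {
+  switch (policy) {
+    case 'auto':
+      return 'auto';
+    case 'require-tools':
+      return 'required';
+    case 'research-first':
+      return ctx.turn === 1 ? FORCE_SEARCH : 'required';
+    case 'research-always':
+      return ctx.searchedThisTurn ? 'required' : FORCE_SEARCH;
+    case 'two-phase':
+      // Each phase exposes a single tool, so 'required' is enough in both.
+      return 'required';
+  }
+}
+
+console.log(getToolChoiceForPolicy('research-first', { turn: 1, searchedThisTurn: false }));
+// { type: 'tool', toolName: 'web_search' }
+```
+
+Keeping the mapping free of SDK calls makes the policy-to-`toolChoice` translation easy to unit test.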
+
+### API Changes
+
+#### FillOptions Extension
+
+```typescript
+interface FillOptions {
+  // ... existing options
+
+  /**
+   * Tool choice policy for agent tool selection.
+   * Controls how strictly the harness enforces tool usage.
+   *
+   * @default 'auto'
+   */
+  toolPolicy?: ToolPolicy;
+
+  /**
+   * For 'two-phase' policy: max turns in research phase.
+   * After this many turns, switches to fill phase.
+   *
+   * @default 5
+   */
+  researchPhaseTurns?: number;
+
+  /**
+   * For 'research-always': max searches per turn.
+   * Prevents excessive API calls on forms with many fields.
+   *
+   * @default 3
+   */
+  maxSearchesPerTurn?: number;
+}
+```
+
+#### Frontmatter Configuration
+
+```yaml
+---
+markform:
+  spec: MF/0.1
+  harness_config:
+    tool_policy: research-first    # New option
+    research_phase_turns: 5        # For two-phase
+    max_searches_per_turn: 3       # For research-always
+---
+```
+
+#### CLI Extension
+
+```bash
+# New --tool-policy flag
+markform fill form.md --tool-policy=research-first
+
+# Override policy at CLI level
+markform fill form.md --tool-policy=two-phase --research-phase-turns=8
+```
+
+### Field-Level Research Hints (Future Enhancement)
+
+For v2, consider per-field annotations:
+
+```markdown
+<!-- field kind="string" id="revenue" label="Revenue" research="required" -->
+<!-- /field -->
+
+<!-- field kind="string" id="notes" label="Notes" research="none" -->
+<!-- /field -->
+```
+
+This is deferred - the form-level policy is sufficient for initial implementation.
+
+### Provider-Specific Considerations
+
+From research, key provider differences to handle:
+
+| Provider | Notes |
+|----------|-------|
+| OpenAI | `'required'` works directly; parallel tool calls supported |
+| Anthropic | `'required'` → `'any'` translation; `disable_parallel_tool_use` available |
+| Google | `'required'` → `mode: 'ANY'`; limit to 10-20 tools |
+| DeepSeek | Unreliable multi-turn; best with `auto` or single-turn `required` |
+| xAI | Can't force provider-defined tools; use grok-4-1-fast |
+
+**Recommendation:** Test `two-phase` and `research-always` across all providers before
+recommending as defaults. `require-tools` should work reliably across all providers.
+
+### Areas of Uncertainty (Requiring Testing)
+
+1. **DeepSeek multi-turn behavior**: Research indicates unreliable tool calling after first turn.
+   Need to test:
+   - Does `toolChoice: 'required'` work reliably on DeepSeek?
+   - What happens with `two-phase` policy?
+   - Should we auto-downgrade to `auto` for DeepSeek?
+
+2. **Anthropic extended thinking**: `toolChoice: 'required'` may conflict with extended thinking.
+   Need to test:
+   - Does `research-first` work with Claude 4 extended thinking?
+   - Should we detect extended thinking and adjust policy?
+
+3. **Termination detection**: With `toolChoice: 'required'`, how do we allow final text response?
+   Options to test:
+   - Use `stopWhen: hasToolCall('fill_form')` with form completion check
+   - Use `prepareStep` to switch to `'auto'` on last turn
+   - Add a no-op `complete` tool
+
+4. **Parallel execution interaction**: When `enableParallel: true`:
+   - Should each parallel agent have its own tool policy?
+   - Should research happen in loose-serial before parallel batches?
+   - Test: parallel agents with `research-first` - do they all search or just one?
+
+5. **Turn limits**: With `research-always`, does forcing search on every turn:
+   - Hit rate limits with providers?
+   - Significantly impact latency?
+   - Improve accuracy enough to justify cost?
+
+## Implementation Plan
+
+### Phase 1: Core Policy Engine
+
+**Goal:** Implement `toolPolicy` option with `auto`, `require-tools`, and `research-first`.
+
+- [ ] Add `ToolPolicy` type to `harnessTypes.ts`
+- [ ] Add `toolPolicy` to `FillOptions` and `HarnessConfig`
+- [ ] Add `tool_policy` to `HarnessConfigYaml` and mapping in `settings.ts`
+- [ ] Implement `getToolChoiceForPolicy()` helper that returns AI SDK toolChoice
+- [ ] Update `liveAgent.ts` to use `prepareStep` for policy enforcement
+- [ ] Update `fillRecord` to track policy and actual tool usage
+- [ ] Add `--tool-policy` flag to `markform fill` command
+- [ ] Write unit tests for policy → toolChoice translation
+- [ ] Write integration tests with mock agents
+
+### Phase 2: Advanced Policies
+
+**Goal:** Implement `research-always` and `two-phase` policies.
+
+- [ ] Implement `research-always` with `prepareStep` logic
+- [ ] Implement `two-phase` with separate agent invocations
+- [ ] Add `researchPhaseTurns` and `maxSearchesPerTurn` options
+- [ ] Update frontmatter parser for new options
+- [ ] Add session transcript support for two-phase (mark phase transitions)
+- [ ] Write integration tests for advanced policies
+
+### Phase 3: Provider Testing & Documentation
+
+**Goal:** Validate policies across providers and document recommendations.
+
+- [ ] Create test matrix: policy × provider × form complexity
+- [ ] Test DeepSeek specifically for multi-turn reliability
+- [ ] Test Anthropic with extended thinking
+- [ ] Document provider-specific recommendations in research doc
+- [ ] Update `docs/markform-apis.md` with policy documentation
+- [ ] Add example forms demonstrating each policy
+- [ ] Create troubleshooting guide for policy issues
+
+### Phase 4: Parallel Execution Integration
+
+**Goal:** Ensure policies work correctly with parallel execution.
+
+- [ ] Define policy behavior for parallel agents
+- [ ] Test `research-first` with parallel batches
+- [ ] Test `two-phase` with parallel batches (research → parallel fill)
+- [ ] Add policy options to `ParallelHarnessConfig`
+- [ ] Document parallel + policy interaction
+
+## Testing Strategy
+
+### Unit Tests
+
+- Policy → toolChoice translation for each policy type
+- Policy parsing from frontmatter
+- CLI flag parsing
+
+### Integration Tests (Mock Agents)
+
+- `auto`: Agent receives no toolChoice override
+- `require-tools`: Agent receives `toolChoice: 'required'`
+- `research-first`: Turn 1 gets forced webSearch, turn 2+ gets required
+- `research-always`: Each turn starts with forced webSearch
+- `two-phase`: Two agent invocations with different tool sets
+
+### End-to-End Tests (Real LLM Calls)
+
+- Test each policy with a factual research form
+- Verify web search is actually called (check fill record)
+- Compare accuracy: auto vs research-first vs two-phase
+- Measure latency impact
+
+### Provider Matrix Tests
+
+Create automated tests that run the same form across providers:
+
+```
+┌─────────────────┬────────┬──────────┬─────────┬────────┐
+│ Policy          │ OpenAI │ Anthropic│ DeepSeek│ Google │
+├─────────────────┼────────┼──────────┼─────────┼────────┤
+│ auto            │   ✓    │    ✓     │    ✓    │   ✓    │
+│ require-tools   │   ✓    │    ✓     │    ?    │   ✓    │
+│ research-first  │   ✓    │    ✓     │    ?    │   ✓    │
+│ research-always │   ✓    │    ?     │    ?    │   ✓    │
+│ two-phase       │   ✓    │    ✓     │    ?    │   ✓    │
+└─────────────────┴────────┴──────────┴─────────┴────────┘
+```
+
+## Rollout Plan
+
+1. **Phase 1 release**: Add `toolPolicy` with `auto`, `require-tools`, `research-first`
+   - Default remains `auto` for backward compatibility
+   - Document as experimental
+
+2. **Phase 2 release**: Add `research-always` and `two-phase`
+   - Include provider testing results
+   - Document recommended policies per use case
+
+3. **Future consideration**: Change default to `require-tools` once validated across providers
+
+## Open Questions
+
+1. **Default policy**: Should we change the default from `auto` to `require-tools`?
+   - Pro: More reliable tool usage out of the box
+   - Con: Breaking change, may cause issues with DeepSeek
+
+2. **Per-field policies**: Is form-level policy sufficient, or do we need field-level control?
+   - Current decision: Form-level first, field-level in v2
+
+3. **Policy inheritance in parallel execution**: Should parallel agents inherit the form policy
+   or have independent policies?
+   - Recommendation: Inherit form policy, with option to override per batch
+
+4. **Cost tracking**: Should we track web search costs separately in fill records?
+   - Recommendation: Yes, add `webSearchCalls` count to fill record
+
+5. **Policy composition with `order`**: For `two-phase`, should research happen only for
+   the current order level, or research all fields upfront?
+   - Recommendation: Research current order level only (progressive disclosure)
+
+## References
+
+- [Research: Tool Choice Parameter](../../research/research-2026-02-02-tool-choice-parameter.md)
+- [AI SDK Tool Calling](https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling)
+- [AI SDK Agents: Loop Control](https://ai-sdk.dev/docs/agents/loop-control)
+- [Parallel Form Filling Spec](plan-2026-01-27-parallel-form-filling.md)
+- [GitHub Issue: Tool Execution Unreliable](https://github.com/vercel/ai/issues/10269)
+- [GitHub Issue: toolChoice Endless Loop](https://github.com/vercel/ai/issues/3944)

From 69c16e833ed92ad58720ab847a6801954eb335d1 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 3 Feb 2026 00:57:15 +0000
Subject: [PATCH 3/6] docs: update tool policy names to snake_case, make
 require_tools default
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Rename policies to snake_case for consistency:
  - require-tools → require_tools
  - research-first → web_search_first
  - research-always → web_search_always
  - two-phase → two_phase
- Add 'none' policy for completeness
- Change default from 'auto' to 'require_tools'
- Update all references throughout the spec

https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB
---
 .../plan-2026-02-02-tool-choice-policies.md   | 136 ++++++++++--------
 1 file changed, 78 insertions(+), 58 deletions(-)

diff --git a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
index 3a1d9e3b..305d4ff2 100644
--- a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
+++ b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
@@ -1,6 +1,6 @@
 # Plan Spec: Tool Choice Policies for Reliable Form Filling
 
-**Date:** 2026-02-02 (last updated 2026-02-02)
+**Date:** 2026-02-02 (last updated 2026-02-03)
 
 **Author:** AI Research
 
@@ -77,16 +77,30 @@ A new `toolPolicy` option controls how the harness manages tool selection:
 
 ```typescript
 type ToolPolicy =
-  | 'auto'              // Current behavior: model chooses freely
-  | 'require-tools'     // toolChoice: 'required' on every turn
-  | 'research-first'    // Force webSearch on first turn, then auto
-  | 'research-always'   // Force webSearch every turn until form complete
-  | 'two-phase'         // Phase 1: research only, Phase 2: fill only
+  | 'none'                // No tools provided to agent
+  | 'auto'                // Model chooses freely whether to use tools
+  | 'require_tools'       // toolChoice: 'required' on every turn (DEFAULT)
+  | 'web_search_first'    // Force web_search on first turn, then require_tools
+  | 'web_search_always'   // Force web_search every turn until form complete
+  | 'two_phase'           // Phase 1: web search only, Phase 2: fill only
 ```
 
+**Default:** `require_tools` — ensures the agent always makes progress by calling a tool
+(either `fill_form` or `web_search`) on every turn. This prevents "analysis paralysis"
+where models describe what they would do without actually doing it.
+
 ### Policy Behaviors
 
-#### `auto` (Default - Current Behavior)
+#### `none`
+
+```
+No tools provided to agent
+Agent can only generate text responses
+```
+
+**When to use:** Testing, debugging, or when tools are intentionally disabled
+
+#### `auto`
 
 ```
 toolChoice: 'auto' on every turn
@@ -94,9 +108,10 @@ Model decides when to search and when to fill
 No enforcement of research before filling
 ```
 
-**When to use:** Simple forms, fields that don't need external research, testing
+**When to use:** Legacy behavior, simple forms that don't need research, or when you
+want maximum model flexibility
 
-#### `require-tools`
+#### `require_tools` (Default)
 
 ```
 toolChoice: 'required' on every turn
@@ -105,9 +120,10 @@ Prevents "analysis paralysis" where model talks without acting
 Uses termination detection to allow final response
 ```
 
-**When to use:** General production use, ensures progress every turn
+**When to use:** General production use, ensures progress every turn. This is the
+recommended default for most forms.
 
-#### `research-first`
+#### `web_search_first`
 
 ```
 Turn 1: toolChoice: { type: 'tool', toolName: 'web_search' }
@@ -116,7 +132,7 @@ Turn 2+: toolChoice: 'required'
 
 **When to use:** Forms with factual fields that need current data, moderate latency tolerance
 
-#### `research-always`
+#### `web_search_always`
 
 ```
 Every turn: First call must be web_search (via prepareStep logic)
@@ -138,7 +154,7 @@ prepareStep: ({ lastToolResults }) => {
 },
 ```
 
-#### `two-phase`
+#### `two_phase`
 
 ```
 Phase 1 (Research): Only web_search available, toolChoice: 'required'
@@ -163,12 +179,12 @@ interface FillOptions {
    * Tool choice policy for agent tool selection.
    * Controls how strictly the harness enforces tool usage.
    *
-   * @default 'auto'
+   * @default 'require_tools'
    */
   toolPolicy?: ToolPolicy;
 
   /**
-   * For 'two-phase' policy: max turns in research phase.
+   * For 'two_phase' policy: max turns in research phase.
    * After this many turns, switches to fill phase.
    *
    * @default 5
@@ -176,7 +192,7 @@ interface FillOptions {
   researchPhaseTurns?: number;
 
   /**
-   * For 'research-always': max searches per turn.
+   * For 'web_search_always': max searches per turn.
    * Prevents excessive API calls on forms with many fields.
    *
    * @default 3
@@ -192,9 +208,9 @@ interface FillOptions {
 markform:
   spec: MF/0.1
   harness_config:
-    tool_policy: research-first    # New option
-    research_phase_turns: 5        # For two-phase
-    max_searches_per_turn: 3       # For research-always
+    tool_policy: web_search_first    # New option
+    research_phase_turns: 5          # For two_phase
+    max_searches_per_turn: 3         # For web_search_always
 ---
 ```
 
@@ -202,10 +218,10 @@ markform:
 
 ```bash
 # New --tool-policy flag
-markform fill form.md --tool-policy=research-first
+markform fill form.md --tool-policy=web_search_first
 
 # Override policy at CLI level
-markform fill form.md --tool-policy=two-phase --research-phase-turns=8
+markform fill form.md --tool-policy=two_phase --research-phase-turns=8
 ```
 
 ### Field-Level Research Hints (Future Enhancement)
@@ -234,20 +250,20 @@ From research, key provider differences to handle:
 | DeepSeek | Unreliable multi-turn; best with `auto` or single-turn `required` |
 | xAI | Can't force provider-defined tools; use grok-4-1-fast |
 
-**Recommendation:** Test `two-phase` and `research-always` across all providers before
-recommending as defaults. `require-tools` should work reliably across all providers.
+**Recommendation:** Test `two_phase` and `web_search_always` across all providers before
+recommending as defaults. `require_tools` should work reliably across all providers.
 
 ### Areas of Uncertainty (Requiring Testing)
 
 1. **DeepSeek multi-turn behavior**: Research indicates unreliable tool calling after first turn.
    Need to test:
    - Does `toolChoice: 'required'` work reliably on DeepSeek?
-   - What happens with `two-phase` policy?
+   - What happens with `two_phase` policy?
    - Should we auto-downgrade to `auto` for DeepSeek?
 
 2. **Anthropic extended thinking**: `toolChoice: 'required'` may conflict with extended thinking.
    Need to test:
-   - Does `research-first` work with Claude 4 extended thinking?
+   - Does `web_search_first` work with Claude 4 extended thinking?
    - Should we detect extended thinking and adjust policy?
 
 3. **Termination detection**: With `toolChoice: 'required'`, how do we allow final text response?
@@ -259,9 +275,9 @@ recommending as defaults. `require-tools` should work reliably across all provid
 4. **Parallel execution interaction**: When `enableParallel: true`:
    - Should each parallel agent have its own tool policy?
    - Should research happen in loose-serial before parallel batches?
-   - Test: parallel agents with `research-first` - do they all search or just one?
+   - Test: parallel agents with `web_search_first` - do they all search or just one?
 
-5. **Turn limits**: With `research-always`, does forcing search on every turn:
+5. **Turn limits**: With `web_search_always`, does forcing search on every turn:
    - Hit rate limits with providers?
    - Significantly impact latency?
    - Improve accuracy enough to justify cost?
@@ -270,7 +286,7 @@ recommending as defaults. `require-tools` should work reliably across all provid
 
 ### Phase 1: Core Policy Engine
 
-**Goal:** Implement `toolPolicy` option with `auto`, `require-tools`, and `research-first`.
+**Goal:** Implement `toolPolicy` option with `none`, `auto`, `require_tools`, and `web_search_first`.
 
 - [ ] Add `ToolPolicy` type to `harnessTypes.ts`
 - [ ] Add `toolPolicy` to `FillOptions` and `HarnessConfig`
@@ -284,13 +300,13 @@ recommending as defaults. `require-tools` should work reliably across all provid
 
 ### Phase 2: Advanced Policies
 
-**Goal:** Implement `research-always` and `two-phase` policies.
+**Goal:** Implement `web_search_always` and `two_phase` policies.
 
-- [ ] Implement `research-always` with `prepareStep` logic
-- [ ] Implement `two-phase` with separate agent invocations
+- [ ] Implement `web_search_always` with `prepareStep` logic
+- [ ] Implement `two_phase` with separate agent invocations
 - [ ] Add `researchPhaseTurns` and `maxSearchesPerTurn` options
 - [ ] Update frontmatter parser for new options
-- [ ] Add session transcript support for two-phase (mark phase transitions)
+- [ ] Add session transcript support for two_phase (mark phase transitions)
 - [ ] Write integration tests for advanced policies
 
 ### Phase 3: Provider Testing & Documentation
@@ -310,8 +326,8 @@ recommending as defaults. `require-tools` should work reliably across all provid
 **Goal:** Ensure policies work correctly with parallel execution.
 
 - [ ] Define policy behavior for parallel agents
-- [ ] Test `research-first` with parallel batches
-- [ ] Test `two-phase` with parallel batches (research → parallel fill)
+- [ ] Test `web_search_first` with parallel batches
+- [ ] Test `two_phase` with parallel batches (research → parallel fill)
 - [ ] Add policy options to `ParallelHarnessConfig`
 - [ ] Document parallel + policy interaction
 
@@ -325,17 +341,18 @@ recommending as defaults. `require-tools` should work reliably across all provid
 
 ### Integration Tests (Mock Agents)
 
-- `auto`: Agent receives no toolChoice override
-- `require-tools`: Agent receives `toolChoice: 'required'`
-- `research-first`: Turn 1 gets forced webSearch, turn 2+ gets required
-- `research-always`: Each turn starts with forced webSearch
-- `two-phase`: Two agent invocations with different tool sets
+- `none`: Agent receives no tools
+- `auto`: Agent receives `toolChoice: 'auto'`
+- `require_tools`: Agent receives `toolChoice: 'required'`
+- `web_search_first`: Turn 1 gets forced web_search, turn 2+ gets required
+- `web_search_always`: Each turn starts with forced web_search
+- `two_phase`: Two agent invocations with different tool sets
 
 ### End-to-End Tests (Real LLM Calls)
 
 - Test each policy with a factual research form
 - Verify web search is actually called (check fill record)
-- Compare accuracy: auto vs research-first vs two-phase
+- Compare accuracy: auto vs web_search_first vs two_phase
 - Measure latency impact
 
 ### Provider Matrix Tests
@@ -343,34 +360,37 @@ recommending as defaults. `require-tools` should work reliably across all provid
 Create automated tests that run the same form across providers:
 
 ```
-┌─────────────────┬────────┬──────────┬─────────┬────────┐
-│ Policy          │ OpenAI │ Anthropic│ DeepSeek│ Google │
-├─────────────────┼────────┼──────────┼─────────┼────────┤
-│ auto            │   ✓    │    ✓     │    ✓    │   ✓    │
-│ require-tools   │   ✓    │    ✓     │    ?    │   ✓    │
-│ research-first  │   ✓    │    ✓     │    ?    │   ✓    │
-│ research-always │   ✓    │    ?     │    ?    │   ✓    │
-│ two-phase       │   ✓    │    ✓     │    ?    │   ✓    │
-└─────────────────┴────────┴──────────┴─────────┴────────┘
+┌───────────────────┬────────┬──────────┬─────────┬────────┐
+│ Policy            │ OpenAI │ Anthropic│ DeepSeek│ Google │
+├───────────────────┼────────┼──────────┼─────────┼────────┤
+│ none              │   ✓    │    ✓     │    ✓    │   ✓    │
+│ auto              │   ✓    │    ✓     │    ✓    │   ✓    │
+│ require_tools     │   ✓    │    ✓     │    ?    │   ✓    │
+│ web_search_first  │   ✓    │    ✓     │    ?    │   ✓    │
+│ web_search_always │   ✓    │    ?     │    ?    │   ✓    │
+│ two_phase         │   ✓    │    ✓     │    ?    │   ✓    │
+└───────────────────┴────────┴──────────┴─────────┴────────┘
 ```
 
 ## Rollout Plan
 
-1. **Phase 1 release**: Add `toolPolicy` with `auto`, `require-tools`, `research-first`
-   - Default remains `auto` for backward compatibility
-   - Document as experimental
+1. **Phase 1 release**: Add `toolPolicy` with `none`, `auto`, `require_tools`, `web_search_first`
+   - Default is `require_tools` for reliable tool usage out of the box
+   - Document policy options and when to use each
 
-2. **Phase 2 release**: Add `research-always` and `two-phase`
+2. **Phase 2 release**: Add `web_search_always` and `two_phase`
    - Include provider testing results
    - Document recommended policies per use case
 
-3. **Future consideration**: Change default to `require-tools` once validated across providers
+3. **Backward compatibility**: Existing forms without `tool_policy` get `require_tools`
+   - This is a behavior change from implicit `auto`, but improves reliability
+   - Users can explicitly set `tool_policy: auto` if needed
 
 ## Open Questions
 
-1. **Default policy**: Should we change the default from `auto` to `require-tools`?
-   - Pro: More reliable tool usage out of the box
-   - Con: Breaking change, may cause issues with DeepSeek
+1. **DeepSeek compatibility**: Does `require_tools` work reliably on DeepSeek?
+   - If not, should we auto-detect DeepSeek and fall back to `auto`?
+   - Need testing to determine
 
 2. **Per-field policies**: Is form-level policy sufficient, or do we need field-level control?
    - Current decision: Form-level first, field-level in v2
@@ -382,7 +402,7 @@ Create automated tests that run the same form across providers:
 4. **Cost tracking**: Should we track web search costs separately in fill records?
    - Recommendation: Yes, add `webSearchCalls` count to fill record
 
-5. **Policy composition with `order`**: For `two-phase`, should research happen only for
+5. **Policy composition with `order`**: For `two_phase`, should research happen only for
    the current order level, or research all fields upfront?
    - Recommendation: Research current order level only (progressive disclosure)
 

From 541e192f17923f3acda7ea2f02eb8bdbef9a95db Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sun, 15 Feb 2026 20:10:04 +0000
Subject: [PATCH 4/6] docs: revise tool choice policy spec after engineering
 review

Major changes:
- Document steps vs turns architecture (critical context)
- Simplify to 4 policies: none, auto, require_tools, require_web_search
- Drop web_search_always (redundant) and two_phase (broken: context
  lost across stateless turns)
- Add require_web_search using prepareStep for step-level control
- Propose harness-level research injection as future robust solution
- Add post-turn validation as alternative approach
- Include source code references for all architectural claims

https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB
---
 .../plan-2026-02-02-tool-choice-policies.md   | 487 ++++++++++--------
 1 file changed, 270 insertions(+), 217 deletions(-)

diff --git a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
index 305d4ff2..58cf8de6 100644
--- a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
+++ b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md
@@ -1,10 +1,10 @@
 # Plan Spec: Tool Choice Policies for Reliable Form Filling
 
-**Date:** 2026-02-02 (last updated 2026-02-03)
+**Date:** 2026-02-02 (last updated 2026-02-15)
 
 **Author:** AI Research
 
-**Status:** Draft
+**Status:** Draft (revised after senior engineering review)
 
 ## Overview
 
@@ -26,10 +26,7 @@ reducing hallucination and improving data accuracy.
    thoroughness against latency/cost
 3. **Cross-model compatibility**: Work reliably across providers (OpenAI, Anthropic, Google,
    DeepSeek, xAI)
-4. **Form-level and field-level control**: Allow policies at form level with optional per-field
-   overrides
-5. **Integration with parallel execution**: Policies should compose with the existing `parallel`
-   and `order` attributes
+4. **Form-level control**: Allow policies at form level via frontmatter and CLI
 
 ## Non-Goals
 
@@ -37,135 +34,244 @@ reducing hallucination and improving data accuracy.
 - Model-specific prompt tuning (policies should work across models)
 - UI for policy configuration (CLI/API only for now)
 - Automatic policy selection based on field content
+- Per-field tool policies (may be considered in v2)
 
 ## Background
 
 ### The Problem
 
-Current Markform behavior uses `toolChoice: 'auto'` by default, meaning the model decides
-whether to use tools. In practice, models often:
+Models often skip web search and fill fields with training data (which may be outdated
+or hallucinated). The current `toolChoice: 'required'` default (set at `liveAgent.ts:91`)
+forces the model to call *some* tool, but doesn't guarantee it uses web search over
+fill_form.
 
-1. **Skip web search entirely** and fill fields with training data (which may be outdated)
-2. **Hallucinate tool calls** by describing what they would search without actually searching
-3. **Become unreliable after ~5 turns** as conversation length increases
-4. **Fill forms without research** when under time pressure or with simpler prompts
+### Architecture: Steps vs Turns (Critical Context)
 
-### Research Findings
+The Markform harness has a two-level iteration model that any tool policy must respect:
 
-From the research doc (`research-2026-02-02-tool-choice-parameter.md`):
+```
+┌─────────────────────────────────────────────────────────────┐
+│ Harness Turn (one fillFormTool() call)                      │
+│                                                             │
+│  ┌──────────────────────────────────────────────────────┐   │
+│  │ Single generateText() invocation                     │   │
+│  │                                                      │   │
+│  │  Step 0: web_search("query")  →  results ✓           │   │
+│  │  Step 1: fill_form(patches)   ← sees search results  │   │
+│  │  Step 2: web_search("query2") →  results ✓           │   │
+│  │  Step 3: fill_form(patches)   ← sees all prior       │   │
+│  │  ...up to maxStepsPerTurn (default: 20)               │   │
+│  │                                                      │   │
+│  │  Context PRESERVED within these steps                 │   │
+│  └──────────────────────────────────────────────────────┘   │
+│                                                             │
+│  Form markdown updated with patches → next turn             │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+                          │
+                          ▼ Context RESET
+┌─────────────────────────────────────────────────────────────┐
+│ Next Harness Turn (fresh fillFormTool() call)                │
+│                                                             │
+│  Only sees: updated form markdown + remaining issues         │
+│  Previous web search results are LOST                       │
+│  Previous conversation/reasoning is LOST                    │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Key facts from code review:**
+
+1. **Within a turn** (`liveAgent.ts:201-209`): One `generateText()` call allows up to
+   `maxStepsPerTurn` (default 20) AI SDK steps. The model accumulates context across
+   steps—web search results from step 0 ARE visible when calling fill_form in step 1.
+   AI SDK's `prepareStep` callback fires between steps and can change `toolChoice` and
+   `activeTools` per step (`generate-text.ts:518-546`).
 
-- `toolChoice: 'required'` forces tool use but can cause infinite loops without termination
-- `toolChoice: { type: 'tool', toolName: 'webSearch' }` forces a specific tool
-- Different providers translate these values differently (Anthropic: `required` → `any`)
-- AI SDK 6's `prepareStep` allows dynamic toolChoice per turn
-- Google recommends limiting to 10-20 tools for reliable selection
-- DeepSeek has unreliable multi-turn tool calling; best for single-turn
+2. **Across turns** (`liveAgent.ts:123-125`): Each call is stateless. The full form
+   context is provided fresh. Only three things persist across turns:
+   - The form markdown (updated with filled values)
+   - The remaining issues list
+   - Previous patch rejections
 
-### Current Implementation
+3. **Web search results do NOT persist across turns.** Any policy that separates
+   "research" and "fill" into different harness turns will lose the research results.
 
-The harness provides these tools to the agent:
-- `fill_form` - Apply patches to the form
-- `web_search` (optional, via `enableWebSearch`) - Search the web for information
+### Implications for Policy Design
 
-Currently, `toolChoice` is not explicitly set, defaulting to `'auto'`.
+- **Policies must operate at the step level** (within a single `generateText()` call),
+  NOT at the turn level (across separate `fillFormTool()` invocations).
+- **"Two-phase" as separate invocations is broken** — research results from Phase 1 are
+  lost when Phase 2 starts fresh.
+- **`prepareStep`** is the correct mechanism for step-level control. It can dynamically
+  change `toolChoice` and `activeTools` between steps.
 
 ## Design
 
 ### Tool Choice Policy Enum
 
-A new `toolPolicy` option controls how the harness manages tool selection:
-
 ```typescript
 type ToolPolicy =
-  | 'none'                // No tools provided to agent
-  | 'auto'                // Model chooses freely whether to use tools
-  | 'require_tools'       // toolChoice: 'required' on every turn (DEFAULT)
-  | 'web_search_first'    // Force web_search on first turn, then require_tools
-  | 'web_search_always'   // Force web_search every turn until form complete
-  | 'two_phase'           // Phase 1: web search only, Phase 2: fill only
+  | 'none'                  // No tools provided to agent
+  | 'auto'                  // Model chooses freely whether to use tools
+  | 'require_tools'         // toolChoice: 'required' on every step (DEFAULT)
+  | 'require_web_search'    // Step 0 must be web_search, then require_tools
 ```
 
-**Default:** `require_tools` — ensures the agent always makes progress by calling a tool
-(either `fill_form` or `web_search`) on every turn. This prevents "analysis paralysis"
-where models describe what they would do without actually doing it.
+**Default:** `require_tools`
 
 ### Policy Behaviors
 
 #### `none`
 
 ```
-No tools provided to agent
-Agent can only generate text responses
+No tools provided to agent.
+Agent can only generate text responses.
+toolChoice: N/A
 ```
 
-**When to use:** Testing, debugging, or when tools are intentionally disabled
+**When to use:** Testing, debugging, or when tools are intentionally disabled.
 
 #### `auto`
 
 ```
-toolChoice: 'auto' on every turn
-Model decides when to search and when to fill
-No enforcement of research before filling
+toolChoice: 'auto' on every step.
+Model decides when to search and when to fill.
+No enforcement of tool usage.
 ```
 
 **When to use:** Legacy behavior, simple forms that don't need research, or when you
-want maximum model flexibility
+want maximum model flexibility.
 
 #### `require_tools` (Default)
 
 ```
-toolChoice: 'required' on every turn
-Model must call SOME tool every turn (fill_form or web_search)
-Prevents "analysis paralysis" where model talks without acting
-Uses termination detection to allow final response
+toolChoice: 'required' on every step.
+Model must call SOME tool on every step (fill_form or web_search).
+Prevents "analysis paralysis" where model talks without acting.
+Model can interleave web_search and fill_form freely within a turn.
 ```
 
-**When to use:** General production use, ensures progress every turn. This is the
-recommended default for most forms.
-
-#### `web_search_first`
-
-```
-Turn 1: toolChoice: { type: 'tool', toolName: 'web_search' }
-Turn 2+: toolChoice: 'required'
-```
+**When to use:** General production use. Ensures progress on every step.
+Already set as the default at `liveAgent.ts:91`.
 
-**When to use:** Forms with factual fields that need current data, moderate latency tolerance
+**Note:** This does NOT guarantee web search is used. The model may go straight
+to fill_form. For guaranteed research, use `require_web_search`.
 
-#### `web_search_always`
+#### `require_web_search`
 
 ```
-Every turn: First call must be web_search (via prepareStep logic)
-After web_search returns, toolChoice: 'required' for rest of turn
+Step 0: toolChoice: { type: 'tool', toolName: 'web_search' }
+Step 1+: toolChoice: 'required'
 ```
 
-**When to use:** High-accuracy requirements, fields with rapidly changing data
+**When to use:** Forms with factual fields that need current data. Guarantees at
+least one web search before any form filling within each turn.
 
 **Implementation:**
 ```typescript
-prepareStep: ({ lastToolResults }) => {
-  const hasSearchedThisTurn = lastToolResults?.some(
-    r => r.toolName === 'web_search'
+// In liveAgent.ts, pass prepareStep to generateText():
+prepareStep: ({ steps }) => {
+  const hasSearched = steps.some(step =>
+    step.toolCalls.some(tc => isWebSearchTool(tc.toolName))
   );
-  if (!hasSearchedThisTurn) {
+  if (!hasSearched) {
     return { toolChoice: { type: 'tool', toolName: 'web_search' } };
   }
   return { toolChoice: 'required' };
 },
 ```
 
-#### `two_phase`
+**Behavior within a single turn:**
+1. Step 0: Model is forced to call web_search
+2. Step 1: Model sees search results, chooses any tool (usually fill_form)
+3. Step 2+: Model continues with required tools (may search again or fill more)
 
+### Deferred: More Aggressive Research Policies
+
+The original spec proposed `web_search_always` and `two_phase` policies. These are
+**deferred** pending architectural work:
+
+#### Why `two_phase` doesn't work as designed
+
+The original design called for "Phase 1: research only, Phase 2: fill only" as separate
+agent invocations. This is broken because web search results from Phase 1 are completely
+lost when Phase 2 starts (turns are stateless).
+
+**Possible future approaches:**
+1. **Harness-level research injection** (recommended): The harness itself runs web
+   searches before calling the LLM, based on field labels/descriptions. Inject results
+   into the context prompt. The model never has to decide whether to search.
+2. **Research accumulator**: Store web search results in a sidecar that persists across
+   turns. Inject into subsequent prompts. Adds complexity.
+3. **Single-turn two-phase via `activeTools`**: Within one `generateText()` call, use
+   `prepareStep` to expose only web_search for steps 0 through N, then only fill_form
+   from step N+1 onward. Works within a turn but may exceed step limits for complex forms.
+
+#### Why `web_search_always` is questionable
+
+Forcing web search on every step within a turn doesn't make sense — after step 0
+returns search results, the model already has the information. Forcing redundant
+searches wastes API calls. The `require_web_search` policy (search on step 0) achieves
+the same goal more efficiently.
+
+If the concern is that the model needs to search for different fields at different
+points, `require_tools` already allows this — the model can interleave searches and
+fills freely.
+
+### Alternative Approaches Worth Considering
+
+#### Harness-Level Research Injection (Future)
+
+Instead of asking the model to decide when to search, the harness itself runs web
+searches based on field metadata:
+
+```typescript
+// Pseudocode for future harness-level research
+async function researchFields(form: ParsedForm, issues: InspectIssue[]): Promise<string> {
+  const queries = generateSearchQueries(form, issues);
+  const results = await Promise.all(queries.map(q => webSearch(q)));
+  return formatResearchContext(results);
+}
+
+// Inject into context prompt
+const contextPrompt = buildContextPrompt(issues, form, maxPatches, previousRejections);
+const researchContext = await researchFields(form, issues);
+const fullPrompt = contextPrompt + '\n\n# Research Results\n' + researchContext;
 ```
-Phase 1 (Research): Only web_search available, toolChoice: 'required'
-                    Runs until configurable turn count or all fields researched
-Phase 2 (Fill):     Only fill_form available, toolChoice: 'required'
-                    Uses research context from Phase 1
-```
 
-**When to use:** Maximum accuracy, complex research forms, acceptable latency
+**Advantages:**
+- Most reliable — doesn't depend on model behavior
+- Works identically across all providers
+- Research quality can be tuned independently of the LLM
+- Can be cached/reused across turns
+
+**Disadvantages:**
+- Harness must know what to search for (field labels may not be sufficient)
+- Upfront latency for search before LLM call
+- May search for things the model doesn't need
 
-**Implementation:** Two separate agent invocations with different tool sets
+This is the recommended direction for forms that truly need guaranteed research.
+
+#### Post-Turn Validation (Alternative)
+
+Let the model use `require_tools` freely, but validate after each turn:
+
+```typescript
+// After fillFormTool() returns, check tool usage
+const toolCalls = response.stats.toolCalls;
+const usedWebSearch = toolCalls.some(tc => isWebSearchTool(tc.name));
+
+if (!usedWebSearch && policyRequiresSearch) {
+  // Inject reminder into next turn's context
+  previousRejections.push({
+    message: 'You did not use web search. Research field values before filling.',
+    // ... triggers re-try
+  });
+}
+```
+
+**Advantages:** Simple, works across providers, no `prepareStep` complexity.
+**Disadvantages:** Wastes a turn when model doesn't search. Slower.
 
 ### API Changes
 
@@ -182,22 +288,6 @@ interface FillOptions {
    * @default 'require_tools'
    */
   toolPolicy?: ToolPolicy;
-
-  /**
-   * For 'two_phase' policy: max turns in research phase.
-   * After this many turns, switches to fill phase.
-   *
-   * @default 5
-   */
-  researchPhaseTurns?: number;
-
-  /**
-   * For 'web_search_always': max searches per turn.
-   * Prevents excessive API calls on forms with many fields.
-   *
-   * @default 3
-   */
-  maxSearchesPerTurn?: number;
 }
 ```
 
@@ -208,9 +298,7 @@ interface FillOptions {
 markform:
   spec: MF/0.1
   harness_config:
-    tool_policy: web_search_first    # New option
-    research_phase_turns: 5          # For two_phase
-    max_searches_per_turn: 3         # For web_search_always
+    tool_policy: require_web_search    # New option
 ---
 ```
 
@@ -218,193 +306,149 @@ markform:
 
 ```bash
 # New --tool-policy flag
-markform fill form.md --tool-policy=web_search_first
+markform fill form.md --tool-policy=require_web_search
 
-# Override policy at CLI level
-markform fill form.md --tool-policy=two_phase --research-phase-turns=8
+# Override policy at CLI level (CLI overrides frontmatter)
+markform fill form.md --tool-policy=auto
 ```
 
-### Field-Level Research Hints (Future Enhancement)
-
-For v2, consider per-field annotations:
-
-```markdown
-<!-- field kind="string" id="revenue" label="Revenue" research="required" -->
-<!-- /field -->
-
-<!-- field kind="string" id="notes" label="Notes" research="none" -->
-<!-- /field -->
-```
-
-This is deferred - the form-level policy is sufficient for initial implementation.
-
 ### Provider-Specific Considerations
 
-From research, key provider differences to handle:
-
-| Provider | Notes |
-|----------|-------|
-| OpenAI | `'required'` works directly; parallel tool calls supported |
-| Anthropic | `'required'` → `'any'` translation; `disable_parallel_tool_use` available |
-| Google | `'required'` → `mode: 'ANY'`; limit to 10-20 tools |
-| DeepSeek | Unreliable multi-turn; best with `auto` or single-turn `required` |
-| xAI | Can't force provider-defined tools; use grok-4-1-fast |
-
-**Recommendation:** Test `two_phase` and `web_search_always` across all providers before
-recommending as defaults. `require_tools` should work reliably across all providers.
+| Provider | `require_tools` | `require_web_search` | Notes |
+|----------|----------------|---------------------|-------|
+| OpenAI | Works directly | Works directly | Parallel tool calls supported |
+| Anthropic | `required` → `any` | `{ type: 'tool' }` works | Not compatible with extended thinking |
+| Google | `required` → `ANY` | `ANY` + `allowedFunctionNames` | Limit to 10-20 tools |
+| DeepSeek | **Needs testing** | **Needs testing** | Unreliable multi-turn; may need fallback |
+| xAI | Works | Can't force provider-defined tools | Use grok-4-1-fast |
 
 ### Areas of Uncertainty (Requiring Testing)
 
-1. **DeepSeek multi-turn behavior**: Research indicates unreliable tool calling after first turn.
-   Need to test:
-   - Does `toolChoice: 'required'` work reliably on DeepSeek?
-   - What happens with `two_phase` policy?
+1. **DeepSeek `require_tools` behavior**: Research indicates unreliable tool calling.
+   - Does `toolChoice: 'required'` work reliably?
    - Should we auto-downgrade to `auto` for DeepSeek?
+   - Test with both single-step and multi-step turns.
 
-2. **Anthropic extended thinking**: `toolChoice: 'required'` may conflict with extended thinking.
-   Need to test:
-   - Does `web_search_first` work with Claude 4 extended thinking?
-   - Should we detect extended thinking and adjust policy?
+2. **Anthropic extended thinking + `require_tools`**: Anthropic docs say extended
+   thinking supports only `tool_choice` `auto` (or `none`), not `any` or a forced tool.
+   - Does this affect our default? Should we detect extended thinking?
+   - Workaround: Use `auto` + strong prompting when extended thinking is enabled.
 
-3. **Termination detection**: With `toolChoice: 'required'`, how do we allow final text response?
-   Options to test:
-   - Use `stopWhen: hasToolCall('fill_form')` with form completion check
-   - Use `prepareStep` to switch to `'auto'` on last turn
-   - Add a no-op `complete` tool
+3. **`require_web_search` with providers using different search tool names**:
+   - OpenAI: `web_search`, Anthropic: `web_search`, Google: `google_search`
+   - Need to resolve correct tool name dynamically.
 
-4. **Parallel execution interaction**: When `enableParallel: true`:
-   - Should each parallel agent have its own tool policy?
-   - Should research happen in loose-serial before parallel batches?
-   - Test: parallel agents with `web_search_first` - do they all search or just one?
+4. **Step limits**: With `require_web_search`, do we burn a step on search?
+   - Current default: 20 steps per turn (plenty of room).
+   - Monitor if complex forms hit the step limit.
 
-5. **Turn limits**: With `web_search_always`, does forcing search on every turn:
-   - Hit rate limits with providers?
-   - Significantly impact latency?
-   - Improve accuracy enough to justify cost?
+5. **Provider-specific `toolChoice: { type: 'tool' }` support**:
+   - xAI can't force server-side tools — does this affect web search?
+   - Need to test forcing specific tool names across all providers.
 
 ## Implementation Plan
 
-### Phase 1: Core Policy Engine
+### Phase 1: Core Policies
 
-**Goal:** Implement `toolPolicy` option with `none`, `auto`, `require_tools`, and `web_search_first`.
+**Goal:** Implement `toolPolicy` with `none`, `auto`, `require_tools`, `require_web_search`.
 
 - [ ] Add `ToolPolicy` type to `harnessTypes.ts`
-- [ ] Add `toolPolicy` to `FillOptions` and `HarnessConfig`
+- [ ] Add `toolPolicy` to `FillOptions` and `LiveAgentConfig`
 - [ ] Add `tool_policy` to `HarnessConfigYaml` and mapping in `settings.ts`
-- [ ] Implement `getToolChoiceForPolicy()` helper that returns AI SDK toolChoice
-- [ ] Update `liveAgent.ts` to use `prepareStep` for policy enforcement
-- [ ] Update `fillRecord` to track policy and actual tool usage
+- [ ] Implement `prepareStep` callback in `liveAgent.ts` for `require_web_search`
+- [ ] Resolve web search tool name dynamically (provider-aware)
 - [ ] Add `--tool-policy` flag to `markform fill` command
+- [ ] Update `fillRecord` to track policy in metadata
 - [ ] Write unit tests for policy → toolChoice translation
 - [ ] Write integration tests with mock agents
 
-### Phase 2: Advanced Policies
-
-**Goal:** Implement `web_search_always` and `two_phase` policies.
+### Phase 2: Provider Validation
 
-- [ ] Implement `web_search_always` with `prepareStep` logic
-- [ ] Implement `two_phase` with separate agent invocations
-- [ ] Add `researchPhaseTurns` and `maxSearchesPerTurn` options
-- [ ] Update frontmatter parser for new options
-- [ ] Add session transcript support for two_phase (mark phase transitions)
-- [ ] Write integration tests for advanced policies
+**Goal:** Validate all policies across providers.
 
-### Phase 3: Provider Testing & Documentation
-
-**Goal:** Validate policies across providers and document recommendations.
-
-- [ ] Create test matrix: policy × provider × form complexity
-- [ ] Test DeepSeek specifically for multi-turn reliability
+- [ ] Create test matrix: policy × provider
+- [ ] Test DeepSeek specifically: `require_tools` and `require_web_search`
 - [ ] Test Anthropic with extended thinking
-- [ ] Document provider-specific recommendations in research doc
-- [ ] Update `docs/markform-apis.md` with policy documentation
-- [ ] Add example forms demonstrating each policy
-- [ ] Create troubleshooting guide for policy issues
+- [ ] Test xAI with forced tool names
+- [ ] Document provider-specific recommendations
+- [ ] Add fallback behavior for unsupported provider/policy combos
 
-### Phase 4: Parallel Execution Integration
+### Phase 3: Harness-Level Research (Future)
 
-**Goal:** Ensure policies work correctly with parallel execution.
+**Goal:** Enable guaranteed research without relying on model behavior.
 
-- [ ] Define policy behavior for parallel agents
-- [ ] Test `web_search_first` with parallel batches
-- [ ] Test `two_phase` with parallel batches (research → parallel fill)
-- [ ] Add policy options to `ParallelHarnessConfig`
-- [ ] Document parallel + policy interaction
+- [ ] Design search query generation from field metadata
+- [ ] Implement harness-level web search execution
+- [ ] Inject research results into context prompt
+- [ ] Add `research_mode: auto | manual` config option
+- [ ] Cache research results across turns for efficiency
+- [ ] Write integration tests
 
 ## Testing Strategy
 
 ### Unit Tests
 
 - Policy → toolChoice translation for each policy type
-- Policy parsing from frontmatter
-- CLI flag parsing
+- Policy parsing from frontmatter and CLI
+- Web search tool name resolution per provider
 
 ### Integration Tests (Mock Agents)
 
 - `none`: Agent receives no tools
 - `auto`: Agent receives `toolChoice: 'auto'`
 - `require_tools`: Agent receives `toolChoice: 'required'`
-- `web_search_first`: Turn 1 gets forced web_search, turn 2+ gets required
-- `web_search_always`: Each turn starts with forced web_search
-- `two_phase`: Two agent invocations with different tool sets
+- `require_web_search`: Step 0 forced to web_search, step 1+ required
 
 ### End-to-End Tests (Real LLM Calls)
 
 - Test each policy with a factual research form
-- Verify web search is actually called (check fill record)
-- Compare accuracy: auto vs web_search_first vs two_phase
+- Verify web search is actually called (check fill record / wire format)
+- Compare accuracy: auto vs require_tools vs require_web_search
 - Measure latency impact
 
-### Provider Matrix Tests
-
-Create automated tests that run the same form across providers:
+### Provider Matrix
 
 ```
-┌───────────────────┬────────┬──────────┬─────────┬────────┐
-│ Policy            │ OpenAI │ Anthropic│ DeepSeek│ Google │
-├───────────────────┼────────┼──────────┼─────────┼────────┤
-│ none              │   ✓    │    ✓     │    ✓    │   ✓    │
-│ auto              │   ✓    │    ✓     │    ✓    │   ✓    │
-│ require_tools     │   ✓    │    ✓     │    ?    │   ✓    │
-│ web_search_first  │   ✓    │    ✓     │    ?    │   ✓    │
-│ web_search_always │   ✓    │    ?     │    ?    │   ✓    │
-│ two_phase         │   ✓    │    ✓     │    ?    │   ✓    │
-└───────────────────┴────────┴──────────┴─────────┴────────┘
+┌─────────────────────┬────────┬──────────┬─────────┬────────┬─────┐
+│ Policy              │ OpenAI │ Anthropic│ DeepSeek│ Google │ xAI │
+├─────────────────────┼────────┼──────────┼─────────┼────────┼─────┤
+│ none                │   ✓    │    ✓     │    ✓    │   ✓    │  ✓  │
+│ auto                │   ✓    │    ✓     │    ✓    │   ✓    │  ✓  │
+│ require_tools       │   ✓    │    ✓     │    ?    │   ✓    │  ✓  │
+│ require_web_search  │   ✓    │    ?     │    ?    │   ?    │  ?  │
+└─────────────────────┴────────┴──────────┴─────────┴────────┴─────┘
 ```
 
 ## Rollout Plan
 
-1. **Phase 1 release**: Add `toolPolicy` with `none`, `auto`, `require_tools`, `web_search_first`
-   - Default is `require_tools` for reliable tool usage out of the box
-   - Document policy options and when to use each
+1. **Phase 1**: Ship `none`, `auto`, `require_tools`, `require_web_search`
+   - Default is `require_tools` (already the current behavior)
+   - `require_web_search` documented as beta until provider testing complete
 
-2. **Phase 2 release**: Add `web_search_always` and `two_phase`
-   - Include provider testing results
-   - Document recommended policies per use case
+2. **Phase 2**: Validate across providers, promote `require_web_search` to stable
 
-3. **Backward compatibility**: Existing forms without `tool_policy` get `require_tools`
-   - This is a behavior change from implicit `auto`, but improves reliability
-   - Users can explicitly set `tool_policy: auto` if needed
+3. **Phase 3**: Harness-level research injection as the robust solution for
+   forms that truly need guaranteed research
 
 ## Open Questions
 
-1. **DeepSeek compatibility**: Does `require_tools` work reliably on DeepSeek?
-   - If not, should we auto-detect DeepSeek and fall back to `auto`?
-   - Need testing to determine
+1. **Web search tool naming**: Should `require_web_search` resolve tool names
+   dynamically (e.g., `google_search` for Google provider), or should we normalize
+   all search tools to `web_search`?
+   - Current code: Google uses `google_search`, others use `web_search`
+   - Recommendation: Resolve dynamically using existing `isWebSearchTool()` helper
 
-2. **Per-field policies**: Is form-level policy sufficient, or do we need field-level control?
-   - Current decision: Form-level first, field-level in v2
+2. **DeepSeek fallback**: If `require_tools` doesn't work on DeepSeek, should we
+   auto-detect and fall back, or let it fail visibly?
+   - Recommendation: Fail visibly with a warning, let user set `auto` explicitly
 
-3. **Policy inheritance in parallel execution**: Should parallel agents inherit the form policy
-   or have independent policies?
-   - Recommendation: Inherit form policy, with option to override per batch
+3. **Extended thinking**: Should we auto-detect extended thinking on Anthropic and
+   downgrade to `auto`?
+   - Recommendation: Yes, with a logged warning
 
-4. **Cost tracking**: Should we track web search costs separately in fill records?
-   - Recommendation: Yes, add `webSearchCalls` count to fill record
-
-5. **Policy composition with `order`**: For `two_phase`, should research happen only for
-   the current order level, or research all fields upfront?
-   - Recommendation: Research current order level only (progressive disclosure)
+4. **Harness-level research scope**: When implemented, should it search for all
+   fields or only unfilled fields with `research: required` annotation?
+   - Deferred to Phase 3 design
 
 ## References
 
@@ -414,3 +458,12 @@ Create automated tests that run the same form across providers:
 - [Parallel Form Filling Spec](plan-2026-01-27-parallel-form-filling.md)
 - [GitHub Issue: Tool Execution Unreliable](https://github.com/vercel/ai/issues/10269)
 - [GitHub Issue: toolChoice Endless Loop](https://github.com/vercel/ai/issues/3944)
+
+### Source Code References
+
+- `liveAgent.ts:91` — current `toolChoice` default (`'required'`)
+- `liveAgent.ts:123-125` — stateless turn documentation
+- `liveAgent.ts:201-209` — `generateText()` invocation with `stepCountIs`
+- `liveAgent.ts:231` — tool call counting (for post-turn validation)
+- `liveAgent.ts:746-748` — `isWebSearchTool()` helper
+- `generate-text.ts:518-546` (AI SDK) — `prepareStep` callback invocation

From 84a5ccdc0564a5ff5874231227ddbec31c214475 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 17 Feb 2026 01:39:09 +0000
Subject: [PATCH 5/6] docs: update research doc recommendations to match
 revised design

- Replace Options A/B/C with concrete recommendations
- Add prepareStep pattern for forced web search
- Document critical architecture note (stateless turns)
- Add provider-specific compatibility table
- Remove stale "Next Steps" referencing old hybrid approach

https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB
---
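Note: the `prepareStep` pattern this patch documents can be checked in isolation.
The sketch below is a minimal, self-contained version of the policy-to-`toolChoice`
decision, not the code in `liveAgent.ts`. `toolChoiceForStep`, `StepLike`,
`isSearchTool`, and `SEARCH_TOOL_NAMES` are hypothetical names used only for
illustration. Mapping the `none` policy to `toolChoice: 'none'` is an assumption,
since the plan spec only says the agent receives no tools under that policy.

```typescript
// Hypothetical local types; real AI SDK step objects carry many more fields.
type ToolPolicy = 'none' | 'auto' | 'require_tools' | 'require_web_search';

interface StepLike {
  toolCalls: { toolName: string }[];
}

type ToolChoice =
  | 'none'
  | 'auto'
  | 'required'
  | { type: 'tool'; toolName: string };

// Assumed set of search tool names (the doc lists `web_search` and `google_search`).
const SEARCH_TOOL_NAMES = new Set(['web_search', 'google_search']);

function isSearchTool(toolName: string): boolean {
  return SEARCH_TOOL_NAMES.has(toolName);
}

// Decide the toolChoice for the next step from the policy and the steps so far.
function toolChoiceForStep(
  policy: ToolPolicy,
  steps: StepLike[],
  searchToolName: string,
): ToolChoice {
  switch (policy) {
    case 'none':
      // Assumption: the harness may instead omit tools entirely for this policy.
      return 'none';
    case 'auto':
      return 'auto';
    case 'require_tools':
      return 'required';
    case 'require_web_search': {
      // Force the search tool until at least one search call appears in this turn.
      const hasSearched = steps.some((step) =>
        step.toolCalls.some((tc) => isSearchTool(tc.toolName)),
      );
      return hasSearched
        ? 'required'
        : { type: 'tool', toolName: searchToolName };
    }
  }
}
```

Wired into `generateText()`, the callback would be roughly
`prepareStep: ({ steps }) => ({ toolChoice: toolChoiceForStep(policy, steps, name) })`,
with `name` resolved per provider. Keeping the decision in a pure function makes the
policy → toolChoice unit tests straightforward.
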
 ...search-2026-02-02-tool-choice-parameter.md | 102 ++++++++----------
 1 file changed, 43 insertions(+), 59 deletions(-)
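Note: the provider compatibility table added in this patch can be read as a small
translation function. The sketch below is illustrative only. It follows the publicly
documented `tool_choice` shapes for ordinary function tools as understood here, and it
is not the AI SDK's actual translation code. `toProviderToolChoice`, `Provider`, and
`SdkToolChoice` are hypothetical names. DeepSeek and xAI are left out because the
table marks them as needing testing, and provider-defined search tools may need a
different shape than the one shown.

```typescript
// Hypothetical names; covers only the three providers the table marks as working.
type Provider = 'openai' | 'anthropic' | 'google';

type SdkToolChoice = 'auto' | 'required' | { type: 'tool'; toolName: string };

function toProviderToolChoice(provider: Provider, choice: SdkToolChoice): unknown {
  switch (provider) {
    case 'openai':
      // Chat Completions style: strings pass through, a forced tool names a function.
      return typeof choice === 'string'
        ? choice
        : { type: 'function', function: { name: choice.toolName } };
    case 'anthropic':
      // 'required' becomes 'any'; a forced tool uses { type: 'tool', name }.
      if (choice === 'auto') return { type: 'auto' };
      if (choice === 'required') return { type: 'any' };
      return { type: 'tool', name: choice.toolName };
    case 'google':
      // 'required' becomes mode ANY; a forced tool adds allowedFunctionNames.
      if (choice === 'auto') return { functionCallingConfig: { mode: 'AUTO' } };
      if (choice === 'required') return { functionCallingConfig: { mode: 'ANY' } };
      return {
        functionCallingConfig: {
          mode: 'ANY',
          allowedFunctionNames: [choice.toolName],
        },
      };
  }
}
```

A provider × policy test matrix could call this with each `SdkToolChoice` value and
compare the result against the payload each provider is expected to receive.
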

diff --git a/docs/project/research/research-2026-02-02-tool-choice-parameter.md b/docs/project/research/research-2026-02-02-tool-choice-parameter.md
index 285131f0..eedbc94a 100644
--- a/docs/project/research/research-2026-02-02-tool-choice-parameter.md
+++ b/docs/project/research/research-2026-02-02-tool-choice-parameter.md
@@ -1,6 +1,6 @@
 # Research: Tool Choice Parameter in AI SDK and Major LLM Providers
 
-**Date:** 2026-02-02 (last updated 2026-02-02)
+**Date:** 2026-02-02 (last updated 2026-02-17)
 
 **Author:** AI Research
 
@@ -470,75 +470,59 @@ used in a specific call, use prompting instead: "Do not use any tools for this r
 
 ---
 
-## Options Considered
+## Recommendations for Markform
 
-### Option A: Use `toolChoice: 'required'` Everywhere
+Based on this research and confirmed by reviewing the Markform harness architecture
+(stateless turns with multi-step tool loops within each turn):
 
-**Description:** Force tool use on every step until completion.
+### Recommended: `toolChoice: 'required'` as Default
 
-**Pros:**
-- Guarantees tools are called
-- Prevents hallucination of tool results
+Use `toolChoice: 'required'` (exposed as Markform's `require_tools` policy) as the default
+for all form filling. This:
+- Prevents "analysis paralysis" where models describe what they'd do without acting
+- Works across OpenAI, Anthropic (→ `any`), Google (→ `ANY`), and xAI
+- Already set as default in `liveAgent.ts:91`
+- Uses `stopWhen: stepCountIs(maxStepsPerTurn)` to prevent infinite loops
 
-**Cons:**
-- Can cause infinite loops without proper termination
-- May force unnecessary tool calls
-- Not compatible with Anthropic extended thinking
+### For Guaranteed Web Search: `prepareStep` with Forced First Step
 
-### Option B: Use `toolChoice: 'auto'` with Strong Prompting
+When forms require factual research, use AI SDK's `prepareStep` callback to force
+`web_search` on step 0, then `'required'` for subsequent steps. This works because
+within a single `generateText()` call, the model accumulates context across steps—
+search results from step 0 are visible when calling `fill_form` in step 1.
 
-**Description:** Rely on system prompts to guide tool use.
-
-**Pros:**
-- More flexible
-- Works with all features (extended thinking, etc.)
-- Natural conversation flow
-
-**Cons:**
-- Model may ignore prompts and skip tools
-- Reliability degrades over long conversations
-- Harder to guarantee tool usage
-
-### Option C: Hybrid Approach with `prepareStep`
-
-**Description:** Use `toolChoice: 'required'` initially, switch to `'auto'` for final
-response.
-
-**Pros:**
-- Best of both worlds
-- Guarantees initial research
-- Allows natural completion
-
-**Cons:**
-- More complex implementation
-- Requires careful step management
-
----
-
-## Recommendations
-
-1. **For form-filling with mandatory research:** Use Option C (Hybrid) with:
-   - `toolChoice: 'required'` for first N steps
-   - An answer tool without execute function
-   - `stopWhen: hasToolCall('submitForm')`
+```typescript
+prepareStep: ({ steps }) => {
+  const hasSearched = steps.some(step =>
+    step.toolCalls.some(tc => isWebSearchTool(tc.toolName))
+  );
+  if (!hasSearched) {
+    return { toolChoice: { type: 'tool', toolName: 'web_search' } };
+  }
+  return { toolChoice: 'required' };
+},
+```
 
-2. **For simpler tool integration:** Use Option A with proper termination:
-   - Define a clear termination tool
-   - Use `stopWhen` to prevent infinite loops
+**Critical architecture note:** Markform turns are stateless—web search results do NOT
+persist across turns. All tool policies must operate at the **step** level (within a
+single `generateText()` call), not at the turn level.
 
-3. **For conversation-like interfaces:** Use Option B with:
-   - Strong system prompts
-   - Explicit tool-use instructions in user messages
-   - Periodic context compaction
+### Future: Harness-Level Research Injection
 
----
+For maximum reliability (not dependent on model behavior), the harness could run web
+searches itself before calling the LLM, injecting results into the context prompt.
+This is the most provider-agnostic approach and decouples research quality from
+model tool-calling behavior. See the plan spec for details.
 
-## Next Steps
+### Provider-Specific Notes
 
-- [ ] Implement the recommended hybrid pattern in markform
-- [ ] Add tool input validation for form fields
-- [ ] Create a verification step for critical data
-- [ ] Test across multiple providers for consistency
+| Provider | `'required'` | Forced specific tool | Caveat |
+|----------|-------------|---------------------|--------|
+| OpenAI | Works directly | Works directly | — |
+| Anthropic | → `any` | → `{ type: 'tool' }` | Not compatible with extended thinking |
+| Google | → `ANY` | → `allowedFunctionNames` | Limit to 10-20 tools |
+| DeepSeek | Needs testing | Needs testing | Unreliable multi-turn calling |
+| xAI | Works | Can't force server-side tools | Use grok-4-1-fast |
 
 ---
 

From 1a8f09e842ba747ac0733f7c8b3f119ae0b378b7 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 17 Feb 2026 01:42:39 +0000
Subject: [PATCH 6/6] chore: update tbd config for v0.1.13 shortcut/guideline
 renames

https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB
---
 .tbd/config.yml | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/.tbd/config.yml b/.tbd/config.yml
index 027671ba..2f67fcc2 100644
--- a/.tbd/config.yml
+++ b/.tbd/config.yml
@@ -3,7 +3,7 @@ display:
 # Documentation cache configuration.
 # files: Maps destination paths (relative to .tbd/docs/) to source locations.
 #   Sources can be:
-#   - internal: prefix for bundled docs (e.g., "internal:shortcuts/standard/code-review-and-commit.md")
+#   - internal: prefix for bundled docs (e.g., "internal:shortcuts/standard/commit-code.md")
 #   - Full URL for external docs (e.g., "https://raw.githubusercontent.com/org/repo/main/file.md")
 # lookup_path: Search paths for doc lookup (like shell $PATH). Earlier paths take precedence.
 #
@@ -15,12 +15,10 @@ display:
 docs_cache:
   files:
     guidelines/backward-compatibility-rules.md: internal:guidelines/backward-compatibility-rules.md
-    guidelines/bun-monorepo-patterns.md: internal:guidelines/bun-monorepo-patterns.md
     guidelines/cli-agent-skill-patterns.md: internal:guidelines/cli-agent-skill-patterns.md
     guidelines/commit-conventions.md: internal:guidelines/commit-conventions.md
     guidelines/convex-limits-best-practices.md: internal:guidelines/convex-limits-best-practices.md
     guidelines/convex-rules.md: internal:guidelines/convex-rules.md
-    guidelines/electron-app-development-patterns.md: internal:guidelines/electron-app-development-patterns.md
     guidelines/error-handling-rules.md: internal:guidelines/error-handling-rules.md
     guidelines/general-coding-rules.md: internal:guidelines/general-coding-rules.md
     guidelines/general-comment-rules.md: internal:guidelines/general-comment-rules.md
@@ -29,25 +27,19 @@ docs_cache:
     guidelines/general-tdd-guidelines.md: internal:guidelines/general-tdd-guidelines.md
     guidelines/general-testing-rules.md: internal:guidelines/general-testing-rules.md
     guidelines/golden-testing-guidelines.md: internal:guidelines/golden-testing-guidelines.md
-    guidelines/pnpm-monorepo-patterns.md: internal:guidelines/pnpm-monorepo-patterns.md
     guidelines/python-cli-patterns.md: internal:guidelines/python-cli-patterns.md
     guidelines/python-modern-guidelines.md: internal:guidelines/python-modern-guidelines.md
     guidelines/python-rules.md: internal:guidelines/python-rules.md
-    guidelines/release-notes-guidelines.md: internal:guidelines/release-notes-guidelines.md
-    guidelines/tbd-sync-troubleshooting.md: internal:guidelines/tbd-sync-troubleshooting.md
+    guidelines/sync-troubleshooting.md: internal:guidelines/sync-troubleshooting.md
     guidelines/typescript-cli-tool-rules.md: internal:guidelines/typescript-cli-tool-rules.md
     guidelines/typescript-code-coverage.md: internal:guidelines/typescript-code-coverage.md
+    guidelines/typescript-monorepo-patterns.md: internal:guidelines/typescript-monorepo-patterns.md
     guidelines/typescript-rules.md: internal:guidelines/typescript-rules.md
-    guidelines/typescript-sorting-patterns.md: internal:guidelines/typescript-sorting-patterns.md
-    guidelines/typescript-yaml-handling-rules.md: internal:guidelines/typescript-yaml-handling-rules.md
-    guidelines/writing-style-guidelines.md: internal:guidelines/writing-style-guidelines.md
     shortcuts/standard/agent-handoff.md: internal:shortcuts/standard/agent-handoff.md
-    shortcuts/standard/checkout-third-party-repo.md: internal:shortcuts/standard/checkout-third-party-repo.md
-    shortcuts/standard/code-cleanup-all.md: internal:shortcuts/standard/code-cleanup-all.md
-    shortcuts/standard/code-cleanup-docstrings.md: internal:shortcuts/standard/code-cleanup-docstrings.md
-    shortcuts/standard/code-cleanup-tests.md: internal:shortcuts/standard/code-cleanup-tests.md
-    shortcuts/standard/code-review-and-commit.md: internal:shortcuts/standard/code-review-and-commit.md
-    shortcuts/standard/coding-spike.md: internal:shortcuts/standard/coding-spike.md
+    shortcuts/standard/cleanup-all.md: internal:shortcuts/standard/cleanup-all.md
+    shortcuts/standard/cleanup-remove-trivial-tests.md: internal:shortcuts/standard/cleanup-remove-trivial-tests.md
+    shortcuts/standard/cleanup-update-docstrings.md: internal:shortcuts/standard/cleanup-update-docstrings.md
+    shortcuts/standard/commit-code.md: internal:shortcuts/standard/commit-code.md
     shortcuts/standard/create-or-update-pr-simple.md: internal:shortcuts/standard/create-or-update-pr-simple.md
     shortcuts/standard/create-or-update-pr-with-validation-plan.md: internal:shortcuts/standard/create-or-update-pr-with-validation-plan.md
     shortcuts/standard/implement-beads.md: internal:shortcuts/standard/implement-beads.md
@@ -55,7 +47,6 @@ docs_cache:
     shortcuts/standard/new-architecture-doc.md: internal:shortcuts/standard/new-architecture-doc.md
     shortcuts/standard/new-guideline.md: internal:shortcuts/standard/new-guideline.md
     shortcuts/standard/new-plan-spec.md: internal:shortcuts/standard/new-plan-spec.md
-    shortcuts/standard/new-qa-playbook.md: internal:shortcuts/standard/new-qa-playbook.md
     shortcuts/standard/new-research-brief.md: internal:shortcuts/standard/new-research-brief.md
     shortcuts/standard/new-shortcut.md: internal:shortcuts/standard/new-shortcut.md
     shortcuts/standard/new-validation-plan.md: internal:shortcuts/standard/new-validation-plan.md
@@ -72,12 +63,10 @@ docs_cache:
     shortcuts/standard/update-specs-status.md: internal:shortcuts/standard/update-specs-status.md
     shortcuts/standard/welcome-user.md: internal:shortcuts/standard/welcome-user.md
     shortcuts/system/shortcut-explanation.md: internal:shortcuts/system/shortcut-explanation.md
-    shortcuts/system/skill-baseline.md: internal:shortcuts/system/skill-baseline.md
     shortcuts/system/skill-brief.md: internal:shortcuts/system/skill-brief.md
-    shortcuts/system/skill-minimal.md: internal:shortcuts/system/skill-minimal.md
+    shortcuts/system/skill.md: internal:shortcuts/system/skill.md
     templates/architecture-doc.md: internal:templates/architecture-doc.md
     templates/plan-spec.md: internal:templates/plan-spec.md
-    templates/qa-playbook.md: internal:templates/qa-playbook.md
     templates/research-brief.md: internal:templates/research-brief.md
   lookup_path:
     - .tbd/docs/shortcuts/system