From 225fc57e748ea4dd5cc26886a7b930af1ab8e93c Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 2 Feb 2026 19:46:26 +0000 Subject: [PATCH 1/6] docs: add comprehensive research on tool choice parameter - Analyzes AI SDK source code for toolChoice implementation - Documents provider-specific translations (OpenAI, Anthropic, Google, DeepSeek, xAI) - Covers form-filling patterns with mandatory tool use - Includes troubleshooting for common issues (endless loops, hallucination, reliability) - Provides best practices for ensuring agents use tools before generating output https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB --- ...search-2026-02-02-tool-choice-parameter.md | 581 ++++++++++++++++++ 1 file changed, 581 insertions(+) create mode 100644 docs/project/research/research-2026-02-02-tool-choice-parameter.md diff --git a/docs/project/research/research-2026-02-02-tool-choice-parameter.md b/docs/project/research/research-2026-02-02-tool-choice-parameter.md new file mode 100644 index 00000000..285131f0 --- /dev/null +++ b/docs/project/research/research-2026-02-02-tool-choice-parameter.md @@ -0,0 +1,581 @@ +# Research: Tool Choice Parameter in AI SDK and Major LLM Providers + +**Date:** 2026-02-02 (last updated 2026-02-02) + +**Author:** AI Research + +**Status:** Complete + +## Overview + +This research document provides a comprehensive technical overview of the `toolChoice` parameter +implementation in the Vercel AI SDK and across major LLM providers. The focus is on understanding +how to ensure agents reliably use tools (especially for form-filling use cases where web search +or other research tools should be invoked before populating form fields). + +## Questions to Answer + +1. How does the AI SDK implement `toolChoice` and translate it to each provider's native format? +2. What are the exact behaviors and options for each major provider (OpenAI, Anthropic, Google, + Deepseek, xAI/Grok)? +3. 
What are the best practices for ensuring agents use tools (especially web search) before + filling in forms or generating structured output? +4. What are the common issues and pitfalls when using `toolChoice`? +5. What patterns exist for combining tool calling with structured output? + +## Scope + +- **Included**: AI SDK implementation details (source code analysis), provider-specific behaviors, + community best practices, form-filling patterns, troubleshooting guidance +- **Excluded**: Implementation of specific form-filling applications, UI/UX considerations + +--- + +## Findings + +### 1. AI SDK Core Implementation + +#### 1.1 Type Definition + +The AI SDK defines `ToolChoice` in `packages/ai/src/types/language-model.ts:100-104`: + +```typescript +export type ToolChoice<TOOLS extends Record<string, unknown>> = + | 'auto' + | 'none' + | 'required' + | { type: 'tool'; toolName: Extract<keyof TOOLS, string> }; +``` + +**Options:** +- `'auto'` (default): The model can choose whether and which tools to call +- `'none'`: The model must not call tools +- `'required'`: The model must call a tool (can choose which one) +- `{ type: 'tool', toolName: string }`: The model must call the specified tool + +#### 1.2 Core Translation Logic + +In `packages/ai/src/prompt/prepare-tools-and-tool-choice.ts:79-85`, the SDK translates the +user-facing `toolChoice` to the internal provider format: + +```typescript +toolChoice: + toolChoice == null + ? { type: 'auto' } + : typeof toolChoice === 'string' + ? { type: toolChoice } + : { type: 'tool' as const, toolName: toolChoice.toolName as string }, +``` + +**Key insight**: When `toolChoice` is `undefined`/`null`, it defaults to `{ type: 'auto' }`. 
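The translation in 1.2 can be exercised in isolation. The following is a minimal sketch, not the SDK's own code: `UserToolChoice`, `NormalizedToolChoice`, and `normalizeToolChoice` are hypothetical names introduced here for illustration, and the types are simplified stand-ins for the SDK types quoted above.

```typescript
// Hypothetical stand-in for the user-facing ToolChoice type (tool names simplified to string).
type UserToolChoice =
  | 'auto'
  | 'none'
  | 'required'
  | { type: 'tool'; toolName: string };

// Hypothetical stand-in for the internal, always-object-shaped provider format.
type NormalizedToolChoice =
  | { type: 'auto' }
  | { type: 'none' }
  | { type: 'required' }
  | { type: 'tool'; toolName: string };

// Mirrors the ternary shown in 1.2: missing values fall back to 'auto',
// string values are wrapped in an object, and forced-tool objects pass through.
function normalizeToolChoice(
  toolChoice: UserToolChoice | null | undefined,
): NormalizedToolChoice {
  if (toolChoice == null) {
    return { type: 'auto' };
  }
  if (typeof toolChoice === 'string') {
    return { type: toolChoice };
  }
  return { type: 'tool', toolName: toolChoice.toolName };
}
```

Because omitting `toolChoice` and passing `'auto'` normalize to the same value under this sketch, a harness that wants to guarantee tool use has to set the option explicitly rather than rely on the default.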
+ +#### 1.3 Provider-Level Type + +The internal type used by providers is `LanguageModelV3ToolChoice` in +`packages/provider/src/language-model/v3/language-model-v3-tool-choice.ts`: + +```typescript +export type LanguageModelV3ToolChoice = + | { type: 'auto' } // tool selection is automatic (can be no tool) + | { type: 'none' } // no tool must be selected + | { type: 'required' } // one of the available tools must be selected + | { type: 'tool'; toolName: string }; // a specific tool must be selected +``` + +--- + +### 2. Provider-Specific Implementations + +#### 2.1 OpenAI + +**Source**: `packages/openai/src/chat/openai-chat-prepare-tools.ts:59-76` + +**Translation:** + +| AI SDK Value | OpenAI Native Value | +|--------------|---------------------| +| `auto` | `'auto'` | +| `none` | `'none'` | +| `required` | `'required'` | +| `{ type: 'tool', toolName }` | `{ type: 'function', function: { name: toolName } }` | + +**Native API Documentation:** +- `tool_choice: "auto"` - Model decides whether to call functions (default) +- `tool_choice: "none"` - Model will not call any tool, generates message only +- `tool_choice: "required"` - Model must call one or more tools +- `tool_choice: { type: "function", function: { name: "..." } }` - Force specific function + +**Notable:** OpenAI supports parallel function calling by default. 
+ +**Sources:** +- [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling) +- [OpenAI Tools Guide](https://platform.openai.com/docs/guides/tools) + +#### 2.2 Anthropic (Claude) + +**Source**: `packages/anthropic/src/anthropic-prepare-tools.ts:310-353` + +**Translation (with important differences):** + +| AI SDK Value | Anthropic Native Value | +|--------------|------------------------| +| `auto` | `{ type: 'auto' }` | +| `none` | *removes tools entirely* (Anthropic doesn't support 'none') | +| `required` | `{ type: 'any' }` (**Note: 'any', not 'required'**) | +| `{ type: 'tool', toolName }` | `{ type: 'tool', name: toolName }` | + +**Critical Insight from Source Code (lines 333-335):** +```typescript +case 'none': + // Anthropic does not support 'none' tool choice, so we remove the tools: + return { tools: undefined, toolChoice: undefined, toolWarnings, betas }; +``` + +**Anthropic-Specific Features:** +- `disable_parallel_tool_use: boolean` - Can be combined with any toolChoice type +- Setting `disable_parallel_tool_use=true` with `type: 'any'` or `type: 'tool'` ensures + exactly one tool is called + +**Extended Thinking Limitation:** +When using extended thinking, only `tool_choice: {"type": "auto"}` and +`tool_choice: {"type": "none"}` are compatible. Using `any` or `tool` types will error. 
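The Anthropic-specific differences described above can be summarized in a small sketch. This is an illustration under stated assumptions, not the provider package's code: `toAnthropicToolChoice`, `isCompatibleWithExtendedThinking`, and the `keepTools` flag are hypothetical names, and the mapping simply restates the translation table and the extended thinking limitation from this section.

```typescript
// Hypothetical stand-in for the SDK's internal tool choice format.
type NormalizedToolChoice =
  | { type: 'auto' }
  | { type: 'none' }
  | { type: 'required' }
  | { type: 'tool'; toolName: string };

// Native values listed in the Anthropic translation table above.
type AnthropicToolChoice =
  | { type: 'auto' }
  | { type: 'any' }
  | { type: 'tool'; name: string };

interface AnthropicToolSettings {
  keepTools: boolean; // false means the tool definitions are dropped from the request
  toolChoice: AnthropicToolChoice | undefined;
}

// Restates the table: 'required' becomes 'any', and 'none' removes the tools entirely.
function toAnthropicToolChoice(choice: NormalizedToolChoice): AnthropicToolSettings {
  switch (choice.type) {
    case 'auto':
      return { keepTools: true, toolChoice: { type: 'auto' } };
    case 'required':
      return { keepTools: true, toolChoice: { type: 'any' } };
    case 'none':
      return { keepTools: false, toolChoice: undefined };
    case 'tool':
      return { keepTools: true, toolChoice: { type: 'tool', name: choice.toolName } };
  }
}

// Per the limitation above, only 'auto' and 'none' are usable with extended thinking.
function isCompatibleWithExtendedThinking(choice: NormalizedToolChoice): boolean {
  return choice.type === 'auto' || choice.type === 'none';
}
```

A caller could run the compatibility check before sending a request and fall back to `{ type: 'auto' }` plus stronger prompting when extended thinking is enabled, instead of waiting for the API to reject the call.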
+ +**Sources:** +- [Anthropic Tool Use Documentation](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use) +- [Anthropic Advanced Tool Use](https://www.anthropic.com/engineering/advanced-tool-use) + +#### 2.3 Google (Gemini) + +**Source**: `packages/google/src/google-prepare-tools.ts:225-256` + +**Translation:** + +| AI SDK Value | Gemini Native Value | +|--------------|---------------------| +| `auto` | `{ functionCallingConfig: { mode: 'AUTO' } }` | +| `none` | `{ functionCallingConfig: { mode: 'NONE' } }` | +| `required` | `{ functionCallingConfig: { mode: 'ANY' } }` | +| `{ type: 'tool', toolName }` | `{ functionCallingConfig: { mode: 'ANY', allowedFunctionNames: [toolName] } }` | + +**Native API Options:** +- `AUTO` (default): Model decides whether to call functions +- `NONE`: Model cannot make function calls +- `ANY`: Forces model to predict a function call +- `VALIDATED` (Preview): Like ANY but allows text responses too + +**Best Practices from Google:** +- Keep active tools to **10-20 maximum** to reduce selection errors +- Use **low temperature** (e.g., 0) for deterministic function calls +- Apply **strong typing** (enums for fixed value sets) + +**Sources:** +- [Google AI Function Calling](https://ai.google.dev/gemini-api/docs/function-calling) +- [Vertex AI Function Calling](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/function-calling) + +#### 2.4 DeepSeek + +**Source**: `packages/deepseek/src/chat/deepseek-prepare-tools.ts:54-68` + +**Translation (follows OpenAI format):** + +| AI SDK Value | DeepSeek Native Value | +|--------------|----------------------| +| `auto` | `'auto'` | +| `none` | `'none'` | +| `required` | `'required'` | +| `{ type: 'tool', toolName }` | `{ type: 'function', function: { name: toolName } }` | + +**Notable Limitations:** +- DeepSeek's official documentation does not explicitly document the `tool_choice` parameter +- Uses OpenAI-compatible API format +- The model may 
hallucinate parameters not in your schema - validate arguments before calling +- Not great at multi-turn function calling; performs best with single user message triggering calls + +**Sources:** +- [DeepSeek Function Calling](https://api-docs.deepseek.com/guides/function_calling) +- [DeepSeek Tool Calls](https://api-docs.deepseek.com/guides/tool_calls) + +#### 2.5 xAI (Grok) + +**Source**: `packages/xai/src/xai-prepare-tools.ts:71-86` and +`packages/xai/src/responses/xai-responses-prepare-tools.ts:156-186` + +**Translation:** + +| AI SDK Value | xAI Native Value | +|--------------|------------------| +| `auto` | `'auto'` | +| `none` | `'none'` | +| `required` | `'required'` | +| `{ type: 'tool', toolName }` | `{ type: 'function', name: toolName }` | + +**Notable from Source Code (lines 173-180):** +```typescript +if (selectedTool.type === 'provider') { + // xAI API does not support forcing specific server-side tools via toolChoice + // Only function tools can be forced + toolWarnings.push({ + type: 'unsupported', + feature: `toolChoice for server-side tool "${selectedTool.name}"`, + }); +``` + +**Recommended Model:** xAI recommends `grok-4-1-fast` for agentic tool calling. + +**Sources:** +- [xAI Function Calling](https://docs.x.ai/docs/guides/function-calling) +- [xAI Tools Overview](https://docs.x.ai/docs/guides/tools/overview) + +--- + +### 3. 
Summary: Provider Translation Table + +| AI SDK `toolChoice` | OpenAI | Anthropic | Google | DeepSeek | xAI | +|---------------------|--------|-----------|--------|----------|-----| +| `'auto'` | `'auto'` | `{ type: 'auto' }` | `mode: 'AUTO'` | `'auto'` | `'auto'` | +| `'none'` | `'none'` | *removes tools* | `mode: 'NONE'` | `'none'` | `'none'` | +| `'required'` | `'required'` | `{ type: 'any' }` | `mode: 'ANY'` | `'required'` | `'required'` | +| `{ type: 'tool', toolName: 'x' }` | `{ type: 'function', function: { name: 'x' } }` | `{ type: 'tool', name: 'x' }` | `mode: 'ANY', allowedFunctionNames: ['x']` | `{ type: 'function', function: { name: 'x' } }` | `{ type: 'function', name: 'x' }` | + +--- + +### 4. Form-Filling Use Cases and Patterns + +#### 4.1 The Challenge + +When building form-filling agents, a common issue is that the model may: +1. **Hallucinate data** instead of using tools to research +2. **Skip tool calls** and go directly to filling the form +3. **Analyze/plan** what it would do instead of actually calling tools +4. 
**Call tools unreliably** after ~5 messages in a conversation + +#### 4.2 Pattern: Answer Tool with `toolChoice: 'required'` + +**Recommended Approach (AI SDK 6+):** + +Use an "answer" tool without an `execute` function and `toolChoice: 'required'` to force +structured output: + +```typescript +import { generateText, hasToolCall, tool } from 'ai'; +import { z } from 'zod'; + +const result = await generateText({ + model: yourModel, + tools: { + webSearch: tool({ + description: 'Search the web for information', + parameters: z.object({ query: z.string() }), + execute: async ({ query }) => { /* search implementation */ } + }), + submitForm: tool({ + description: 'Submit the completed form with researched data', + parameters: z.object({ + field1: z.string().describe('Value for field1 (must be researched)'), + field2: z.string().describe('Value for field2 (must be researched)'), + }), + // No execute function - acts as termination signal + }), + }, + toolChoice: 'required', // Must use a tool at every step + stopWhen: hasToolCall('submitForm'), // Stop when form is submitted + system: `You are a research assistant. Before filling ANY form field: +1. Use webSearch to find accurate, current information +2. NEVER guess or hallucinate data +3. 
Only call submitForm when you have researched ALL fields`, + prompt: userQuery, +}); + +// Get the form data from staticToolCalls (tools without execute) +const formData = result.staticToolCalls.find( + call => call.toolName === 'submitForm' +)?.args; +``` + +#### 4.3 Pattern: AI SDK 6 Unified Output + +**New in AI SDK 6:** Combine tool calling with structured output in one flow: + +```typescript +import { generateText, Output } from 'ai'; +import { z } from 'zod'; + +const result = await generateText({ + model: yourModel, + tools: { webSearch, fetchUrl }, + output: Output.object({ + schema: z.object({ + companyName: z.string(), + foundedYear: z.number(), + headquarters: z.string(), + }), + }), + system: `Research the company thoroughly using web search before + providing structured output. Do not hallucinate.`, + prompt: 'Get information about Anthropic', +}); + +// result.object contains the structured data +``` + +**Important:** Structured output generation counts as an additional step. Adjust `stopWhen` +accordingly. + +#### 4.4 Pattern: Explicit Tool Guidance in Prompts + +**System Prompt Best Practices:** + +``` +CRITICAL INSTRUCTIONS FOR TOOL USE: +1. For ANY information that could be time-sensitive, ALWAYS use webSearch first +2. For ANY factual claims (dates, numbers, names), ALWAYS verify with webSearch +3. NEVER fill in form fields with guessed or assumed data +4. If webSearch returns no results, explicitly state "Unknown" rather than guessing +5. 
Call tools BEFORE reasoning about the answer, not after +``` + +#### 4.5 Pattern: Multi-Step Verification Loop + +For critical data accuracy, use a verification pattern: + +```typescript +const agent = createAgent({ + tools: { + webSearch, + verifyFact: tool({ + description: 'Double-check a fact by searching again', + parameters: z.object({ + fact: z.string(), + originalSource: z.string(), + }), + execute: async ({ fact }) => { /* second search */ }, + }), + submitVerifiedForm: tool({ + description: 'Submit only after all facts are verified', + parameters: formSchema, + }), + }, + stopWhen: hasToolCall('submitVerifiedForm'), + prepareStep: ({ lastToolResults }) => { + // Force verification if not all fields verified + if (needsVerification(lastToolResults)) { + return { toolChoice: { type: 'tool', toolName: 'verifyFact' } }; + } + return {}; + }, +}); +``` + +--- + +### 5. Common Issues and Troubleshooting + +#### 5.1 Tool Execution Becomes Unreliable After ~5 Messages + +**Issue:** Models increasingly fail to execute tools after approximately 5 messages, instead +analyzing or describing what they would do. + +**Solutions:** +- Add explicit tool-use reminders in subsequent messages: "Remember to USE the webSearch + tool, not describe using it" +- Reset context periodically with `context.compact()` in AI SDK 6 +- Use `toolChoice: 'required'` to force tool usage + +#### 5.2 Endless Loop with `toolChoice: 'required'` + +**Issue:** Setting `toolChoice: 'required'` can cause infinite loops when using `streamText`. 
+ +**Solutions:** +- Use `stopWhen: hasToolCall('finalTool')` with a termination tool +- Use `stopWhen: stepCountIs(n)` as a safety limit +- Use `prepareStep` to dynamically change `toolChoice` on final step: + +```typescript +prepareStep: ({ stepNumber }) => { + if (stepNumber >= 5) { + return { toolChoice: 'auto' }; // Allow text response + } + return { toolChoice: 'required' }; +}, +``` + +#### 5.3 Model Hallucinating Tool Calls + +**Issue:** Model says it's calling a tool but actually hallucinating results. + +**Solutions:** +- Check for actual tool_use blocks in response, not just text mentioning tools +- Use `toolChoice: 'required'` to force structured tool calls +- Implement validation on tool results before accepting + +#### 5.4 Anthropic: 'none' Doesn't Work as Expected + +**Issue:** `toolChoice: 'none'` with Anthropic doesn't just prevent tool use, it removes +all tool definitions. + +**Solution:** This is intentional per the source code. If you need tools available but not +used in a specific call, use prompting instead: "Do not use any tools for this response." + +#### 5.5 Parallel Tool Calls Not Working + +**Causes:** +1. Incorrect tool result formatting (separate messages instead of combined) +2. Weak prompting +3. Model-specific limitations (Sonnet 3.7 less likely than Claude 4) + +**Solution:** +```typescript +// Wrong: Separate messages +[ + { role: 'assistant', content: [tool_use_1, tool_use_2] }, + { role: 'user', content: [tool_result_1] }, + { role: 'user', content: [tool_result_2] }, // Separate +] + +// Correct: Single message with all results +[ + { role: 'assistant', content: [tool_use_1, tool_use_2] }, + { role: 'user', content: [tool_result_1, tool_result_2] }, // Combined +] +``` + +--- + +### 6. Best Practices Summary + +#### 6.1 For Reliable Tool Use + +1. **Use `toolChoice: 'required'`** when tools MUST be used +2. **Provide detailed tool descriptions** - this is the most important factor +3. 
**Limit tools to 5-7** for optimal selection accuracy (10-20 max) +4. **Use explicit prompting** about when to use which tool +5. **Implement answer/termination tools** for structured output flows + +#### 6.2 For Form-Filling Specifically + +1. **Always research before filling** - use `toolChoice: 'required'` initially +2. **Use an answer tool without execute** - terminates loop with structured data +3. **Validate all tool results** - don't trust raw model outputs +4. **Use the AI SDK 6 `output` option** for cleaner structured output flows +5. **Add verification steps** for critical data + +#### 6.3 Provider-Specific Recommendations + +| Provider | Recommendation | +|----------|----------------| +| OpenAI | Use `'required'` directly; supports parallel calls | +| Anthropic | Remember `required` → `any` translation; use `disable_parallel_tool_use` for single calls | +| Google | Use `ANY` mode with `allowedFunctionNames` for specific tools | +| DeepSeek | Validate tool arguments; avoid multi-turn tool flows | +| xAI | Use `grok-4-1-fast` for best tool calling; can't force provider tools | + +--- + +## Options Considered + +### Option A: Use `toolChoice: 'required'` Everywhere + +**Description:** Force tool use on every step until completion. + +**Pros:** +- Guarantees tools are called +- Prevents hallucination of tool results + +**Cons:** +- Can cause infinite loops without proper termination +- May force unnecessary tool calls +- Not compatible with Anthropic extended thinking + +### Option B: Use `toolChoice: 'auto'` with Strong Prompting + +**Description:** Rely on system prompts to guide tool use. + +**Pros:** +- More flexible +- Works with all features (extended thinking, etc.) 
+- Natural conversation flow + +**Cons:** +- Model may ignore prompts and skip tools +- Reliability degrades over long conversations +- Harder to guarantee tool usage + +### Option C: Hybrid Approach with `prepareStep` + +**Description:** Use `toolChoice: 'required'` initially, switch to `'auto'` for final +response. + +**Pros:** +- Best of both worlds +- Guarantees initial research +- Allows natural completion + +**Cons:** +- More complex implementation +- Requires careful step management + +--- + +## Recommendations + +1. **For form-filling with mandatory research:** Use Option C (Hybrid) with: + - `toolChoice: 'required'` for first N steps + - An answer tool without execute function + - `stopWhen: hasToolCall('submitForm')` + +2. **For simpler tool integration:** Use Option A with proper termination: + - Define a clear termination tool + - Use `stopWhen` to prevent infinite loops + +3. **For conversation-like interfaces:** Use Option B with: + - Strong system prompts + - Explicit tool-use instructions in user messages + - Periodic context compaction + +--- + +## Next Steps + +- [ ] Implement the recommended hybrid pattern in markform +- [ ] Add tool input validation for form fields +- [ ] Create a verification step for critical data +- [ ] Test across multiple providers for consistency + +--- + +## References + +### AI SDK Documentation +- [AI SDK Tool Calling](https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling) +- [AI SDK Agents: Loop Control](https://ai-sdk.dev/docs/agents/loop-control) +- [AI SDK Generating Structured Data](https://ai-sdk.dev/docs/ai-sdk-core/generating-structured-data) +- [AI SDK Troubleshooting: Tool Calling with Structured Outputs](https://ai-sdk.dev/docs/troubleshooting/tool-calling-with-structured-outputs) +- [AI SDK 6 Announcement](https://vercel.com/blog/ai-sdk-6) + +### Provider Documentation +- [OpenAI Function Calling](https://platform.openai.com/docs/guides/function-calling) +- [Anthropic Tool 
Use](https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview) +- [Anthropic Implement Tool Use](https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use) +- [Google Gemini Function Calling](https://ai.google.dev/gemini-api/docs/function-calling) +- [DeepSeek Function Calling](https://api-docs.deepseek.com/guides/function_calling) +- [xAI Function Calling](https://docs.x.ai/docs/guides/function-calling) + +### AI SDK Source Code References +- `packages/ai/src/types/language-model.ts:100-104` - ToolChoice type definition +- `packages/ai/src/prompt/prepare-tools-and-tool-choice.ts` - Core translation logic +- `packages/provider/src/language-model/v3/language-model-v3-tool-choice.ts` - Provider-level type +- `packages/openai/src/chat/openai-chat-prepare-tools.ts:59-76` - OpenAI translation +- `packages/anthropic/src/anthropic-prepare-tools.ts:310-353` - Anthropic translation +- `packages/google/src/google-prepare-tools.ts:225-256` - Google translation +- `packages/deepseek/src/chat/deepseek-prepare-tools.ts:54-68` - DeepSeek translation +- `packages/xai/src/xai-prepare-tools.ts:71-86` - xAI translation + +### Community Resources +- [GitHub Issue: Tool Execution Unreliable After ~5 Messages](https://github.com/vercel/ai/issues/10269) +- [GitHub Issue: toolChoice 'required' Endless Loop](https://github.com/vercel/ai/issues/3944) +- [Vercel Blog: We Removed 80% of Our Agent's Tools](https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools) +- [GitHub Discussion: Tool Calling Loop Understanding](https://github.com/vercel/ai/discussions/8514) + +### Hallucination Prevention +- [Zep: Reducing LLM Hallucinations](https://www.getzep.com/ai-agents/reducing-llm-hallucinations/) +- [Cleanlab: Prevent Hallucinated Responses](https://cleanlab.ai/blog/prevent-hallucinated-responses/) +- [AWS: Reducing Hallucinations with Verified Semantic 
Cache](https://aws.amazon.com/blogs/machine-learning/reducing-hallucinations-in-llm-agents-with-a-verified-semantic-cache-using-amazon-bedrock-knowledge-bases/) From bbb981f72f8c87ce9ceaf1ce245bbeca68da93d4 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 2 Feb 2026 19:59:46 +0000 Subject: [PATCH 2/6] docs: add plan spec for tool choice policies Defines a toolPolicy system to ensure agents reliably use tools: - auto: Current behavior (model decides) - require-tools: Force tool use every turn - research-first: Force web search on first turn - research-always: Force search every turn - two-phase: Separate research and fill phases Includes provider-specific considerations and testing plan for DeepSeek, Anthropic, Google, OpenAI, and xAI. Related: research-2026-02-02-tool-choice-parameter.md https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB --- .../plan-2026-02-02-tool-choice-policies.md | 396 ++++++++++++++++++ 1 file changed, 396 insertions(+) create mode 100644 docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md diff --git a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md new file mode 100644 index 00000000..3a1d9e3b --- /dev/null +++ b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md @@ -0,0 +1,396 @@ +# Plan Spec: Tool Choice Policies for Reliable Form Filling + +**Date:** 2026-02-02 (last updated 2026-02-02) + +**Author:** AI Research + +**Status:** Draft + +## Overview + +This spec defines a **tool choice policy system** for Markform that gives form authors and +consumers fine-grained control over how agents use tools (especially web search) during form +filling. The goal is to ensure agents reliably research information before filling fields, +reducing hallucination and improving data accuracy. 
+ +**Related Docs:** +- `docs/project/research/research-2026-02-02-tool-choice-parameter.md` - Research on AI SDK + toolChoice and provider behavior +- `docs/project/specs/active/plan-2026-01-27-parallel-form-filling.md` - Parallel execution spec + +## Goals + +1. **Reduce hallucination**: Ensure agents use web search and other research tools before + filling fields that require external data +2. **Configurable policies**: Provide multiple tool use policies that balance research + thoroughness against latency/cost +3. **Cross-model compatibility**: Work reliably across providers (OpenAI, Anthropic, Google, + DeepSeek, xAI) +4. **Form-level and field-level control**: Allow policies at form level with optional per-field + overrides +5. **Integration with parallel execution**: Policies should compose with the existing `parallel` + and `order` attributes + +## Non-Goals + +- Custom tool definitions per form (tools are provided by the harness) +- Model-specific prompt tuning (policies should work across models) +- UI for policy configuration (CLI/API only for now) +- Automatic policy selection based on field content + +## Background + +### The Problem + +Current Markform behavior uses `toolChoice: 'auto'` by default, meaning the model decides +whether to use tools. In practice, models often: + +1. **Skip web search entirely** and fill fields with training data (which may be outdated) +2. **Hallucinate tool calls** by describing what they would search without actually searching +3. **Become unreliable after ~5 turns** as conversation length increases +4. 
**Fill forms without research** when under time pressure or with simpler prompts + +### Research Findings + +From the research doc (`research-2026-02-02-tool-choice-parameter.md`): + +- `toolChoice: 'required'` forces tool use but can cause infinite loops without termination +- `toolChoice: { type: 'tool', toolName: 'webSearch' }` forces a specific tool +- Different providers translate these values differently (Anthropic: `required` → `any`) +- AI SDK 6's `prepareStep` allows dynamic toolChoice per turn +- Google recommends limiting to 10-20 tools for reliable selection +- DeepSeek has unreliable multi-turn tool calling; best for single-turn + +### Current Implementation + +The harness provides these tools to the agent: +- `fill_form` - Apply patches to the form +- `web_search` (optional, via `enableWebSearch`) - Search the web for information + +Currently, `toolChoice` is not explicitly set, defaulting to `'auto'`. + +## Design + +### Tool Choice Policy Enum + +A new `toolPolicy` option controls how the harness manages tool selection: + +```typescript +type ToolPolicy = + | 'auto' // Current behavior: model chooses freely + | 'require-tools' // toolChoice: 'required' on every turn + | 'research-first' // Force web_search on first turn, then 'required' + | 'research-always' // Force web_search every turn until form complete + | 'two-phase' // Phase 1: research only, Phase 2: fill only +``` + +### Policy Behaviors + +#### `auto` (Default - Current Behavior) + +``` +toolChoice: 'auto' on every turn +Model decides when to search and when to fill +No enforcement of research before filling +``` + +**When to use:** Simple forms, fields that don't need external research, testing + +#### `require-tools` + +``` +toolChoice: 'required' on every turn +Model must call SOME tool every turn (fill_form or web_search) +Prevents "analysis paralysis" where model talks without acting +Uses termination detection to allow final response +``` + +**When to use:** General production use, ensures 
progress every turn + +#### `research-first` + +``` +Turn 1: toolChoice: { type: 'tool', toolName: 'web_search' } +Turn 2+: toolChoice: 'required' +``` + +**When to use:** Forms with factual fields that need current data, moderate latency tolerance + +#### `research-always` + +``` +Every turn: First call must be web_search (via prepareStep logic) +After web_search returns, toolChoice: 'required' for rest of turn +``` + +**When to use:** High-accuracy requirements, fields with rapidly changing data + +**Implementation:** +```typescript +prepareStep: ({ lastToolResults }) => { + const hasSearchedThisTurn = lastToolResults?.some( + r => r.toolName === 'web_search' + ); + if (!hasSearchedThisTurn) { + return { toolChoice: { type: 'tool', toolName: 'web_search' } }; + } + return { toolChoice: 'required' }; +}, +``` + +#### `two-phase` + +``` +Phase 1 (Research): Only web_search available, toolChoice: 'required' + Runs until configurable turn count or all fields researched +Phase 2 (Fill): Only fill_form available, toolChoice: 'required' + Uses research context from Phase 1 +``` + +**When to use:** Maximum accuracy, complex research forms, acceptable latency + +**Implementation:** Two separate agent invocations with different tool sets + +### API Changes + +#### FillOptions Extension + +```typescript +interface FillOptions { + // ... existing options + + /** + * Tool choice policy for agent tool selection. + * Controls how strictly the harness enforces tool usage. + * + * @default 'auto' + */ + toolPolicy?: ToolPolicy; + + /** + * For 'two-phase' policy: max turns in research phase. + * After this many turns, switches to fill phase. + * + * @default 5 + */ + researchPhaseTurns?: number; + + /** + * For 'research-always': max searches per turn. + * Prevents excessive API calls on forms with many fields. 
+ * + * @default 3 + */ + maxSearchesPerTurn?: number; +} +``` + +#### Frontmatter Configuration + +```yaml +--- +markform: + spec: MF/0.1 + harness_config: + tool_policy: research-first # New option + research_phase_turns: 5 # For two-phase + max_searches_per_turn: 3 # For research-always +--- +``` + +#### CLI Extension + +```bash +# New --tool-policy flag +markform fill form.md --tool-policy=research-first + +# Override policy at CLI level +markform fill form.md --tool-policy=two-phase --research-phase-turns=8 +``` + +### Field-Level Research Hints (Future Enhancement) + +For v2, consider per-field annotations: + +```markdown + + + + + +``` + +This is deferred - the form-level policy is sufficient for initial implementation. + +### Provider-Specific Considerations + +From research, key provider differences to handle: + +| Provider | Notes | +|----------|-------| +| OpenAI | `'required'` works directly; parallel tool calls supported | +| Anthropic | `'required'` → `'any'` translation; `disable_parallel_tool_use` available | +| Google | `'required'` → `mode: 'ANY'`; limit to 10-20 tools | +| DeepSeek | Unreliable multi-turn; best with `auto` or single-turn `required` | +| xAI | Can't force provider-defined tools; use grok-4-1-fast | + +**Recommendation:** Test `two-phase` and `research-always` across all providers before +recommending as defaults. `require-tools` should work reliably across all providers. + +### Areas of Uncertainty (Requiring Testing) + +1. **DeepSeek multi-turn behavior**: Research indicates unreliable tool calling after first turn. + Need to test: + - Does `toolChoice: 'required'` work reliably on DeepSeek? + - What happens with `two-phase` policy? + - Should we auto-downgrade to `auto` for DeepSeek? + +2. **Anthropic extended thinking**: `toolChoice: 'required'` may conflict with extended thinking. + Need to test: + - Does `research-first` work with Claude 4 extended thinking? + - Should we detect extended thinking and adjust policy? + +3. 
**Termination detection**: With `toolChoice: 'required'`, how do we allow final text response? + Options to test: + - Use `stopWhen: hasToolCall('fill_form')` with form completion check + - Use `prepareStep` to switch to `'auto'` on last turn + - Add a no-op `complete` tool + +4. **Parallel execution interaction**: When `enableParallel: true`: + - Should each parallel agent have its own tool policy? + - Should research happen in loose-serial before parallel batches? + - Test: parallel agents with `research-first` - do they all search or just one? + +5. **Turn limits**: With `research-always`, does forcing search on every turn: + - Hit rate limits with providers? + - Significantly impact latency? + - Improve accuracy enough to justify cost? + +## Implementation Plan + +### Phase 1: Core Policy Engine + +**Goal:** Implement `toolPolicy` option with `auto`, `require-tools`, and `research-first`. + +- [ ] Add `ToolPolicy` type to `harnessTypes.ts` +- [ ] Add `toolPolicy` to `FillOptions` and `HarnessConfig` +- [ ] Add `tool_policy` to `HarnessConfigYaml` and mapping in `settings.ts` +- [ ] Implement `getToolChoiceForPolicy()` helper that returns AI SDK toolChoice +- [ ] Update `liveAgent.ts` to use `prepareStep` for policy enforcement +- [ ] Update `fillRecord` to track policy and actual tool usage +- [ ] Add `--tool-policy` flag to `markform fill` command +- [ ] Write unit tests for policy → toolChoice translation +- [ ] Write integration tests with mock agents + +### Phase 2: Advanced Policies + +**Goal:** Implement `research-always` and `two-phase` policies. 
+ +- [ ] Implement `research-always` with `prepareStep` logic +- [ ] Implement `two-phase` with separate agent invocations +- [ ] Add `researchPhaseTurns` and `maxSearchesPerTurn` options +- [ ] Update frontmatter parser for new options +- [ ] Add session transcript support for two-phase (mark phase transitions) +- [ ] Write integration tests for advanced policies + +### Phase 3: Provider Testing & Documentation + +**Goal:** Validate policies across providers and document recommendations. + +- [ ] Create test matrix: policy × provider × form complexity +- [ ] Test DeepSeek specifically for multi-turn reliability +- [ ] Test Anthropic with extended thinking +- [ ] Document provider-specific recommendations in research doc +- [ ] Update `docs/markform-apis.md` with policy documentation +- [ ] Add example forms demonstrating each policy +- [ ] Create troubleshooting guide for policy issues + +### Phase 4: Parallel Execution Integration + +**Goal:** Ensure policies work correctly with parallel execution. 
+ +- [ ] Define policy behavior for parallel agents +- [ ] Test `research-first` with parallel batches +- [ ] Test `two-phase` with parallel batches (research → parallel fill) +- [ ] Add policy options to `ParallelHarnessConfig` +- [ ] Document parallel + policy interaction + +## Testing Strategy + +### Unit Tests + +- Policy → toolChoice translation for each policy type +- Policy parsing from frontmatter +- CLI flag parsing + +### Integration Tests (Mock Agents) + +- `auto`: Agent receives no toolChoice override +- `require-tools`: Agent receives `toolChoice: 'required'` +- `research-first`: Turn 1 gets forced webSearch, turn 2+ gets required +- `research-always`: Each turn starts with forced webSearch +- `two-phase`: Two agent invocations with different tool sets + +### End-to-End Tests (Real LLM Calls) + +- Test each policy with a factual research form +- Verify web search is actually called (check fill record) +- Compare accuracy: auto vs research-first vs two-phase +- Measure latency impact + +### Provider Matrix Tests + +Create automated tests that run the same form across providers: + +``` +┌─────────────────┬────────┬──────────┬─────────┬────────┐ +│ Policy │ OpenAI │ Anthropic│ DeepSeek│ Google │ +├─────────────────┼────────┼──────────┼─────────┼────────┤ +│ auto │ ✓ │ ✓ │ ✓ │ ✓ │ +│ require-tools │ ✓ │ ✓ │ ? │ ✓ │ +│ research-first │ ✓ │ ✓ │ ? │ ✓ │ +│ research-always │ ✓ │ ? │ ? │ ✓ │ +│ two-phase │ ✓ │ ✓ │ ? │ ✓ │ +└─────────────────┴────────┴──────────┴─────────┴────────┘ +``` + +## Rollout Plan + +1. **Phase 1 release**: Add `toolPolicy` with `auto`, `require-tools`, `research-first` + - Default remains `auto` for backward compatibility + - Document as experimental + +2. **Phase 2 release**: Add `research-always` and `two-phase` + - Include provider testing results + - Document recommended policies per use case + +3. **Future consideration**: Change default to `require-tools` once validated across providers + +## Open Questions + +1. 
**Default policy**: Should we change the default from `auto` to `require-tools`? + - Pro: More reliable tool usage out of the box + - Con: Breaking change, may cause issues with DeepSeek + +2. **Per-field policies**: Is form-level policy sufficient, or do we need field-level control? + - Current decision: Form-level first, field-level in v2 + +3. **Policy inheritance in parallel execution**: Should parallel agents inherit the form policy + or have independent policies? + - Recommendation: Inherit form policy, with option to override per batch + +4. **Cost tracking**: Should we track web search costs separately in fill records? + - Recommendation: Yes, add `webSearchCalls` count to fill record + +5. **Policy composition with `order`**: For `two-phase`, should research happen only for + the current order level, or research all fields upfront? + - Recommendation: Research current order level only (progressive disclosure) + +## References + +- [Research: Tool Choice Parameter](../research/research-2026-02-02-tool-choice-parameter.md) +- [AI SDK Tool Calling](https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling) +- [AI SDK Agents: Loop Control](https://ai-sdk.dev/docs/agents/loop-control) +- [Parallel Form Filling Spec](plan-2026-01-27-parallel-form-filling.md) +- [GitHub Issue: Tool Execution Unreliable](https://github.com/vercel/ai/issues/10269) +- [GitHub Issue: toolChoice Endless Loop](https://github.com/vercel/ai/issues/3944) From 69c16e833ed92ad58720ab847a6801954eb335d1 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 3 Feb 2026 00:57:15 +0000 Subject: [PATCH 3/6] docs: update tool policy names to snake_case, make require_tools default MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Rename policies to snake_case for consistency: - require-tools → require_tools - research-first → web_search_first - research-always → web_search_always - two-phase → two_phase - Add 'none' policy for completeness - Change default from 
'auto' to 'require_tools' - Update all references throughout the spec https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB --- .../plan-2026-02-02-tool-choice-policies.md | 136 ++++++++++-------- 1 file changed, 78 insertions(+), 58 deletions(-) diff --git a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md index 3a1d9e3b..305d4ff2 100644 --- a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md +++ b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md @@ -1,6 +1,6 @@ # Plan Spec: Tool Choice Policies for Reliable Form Filling -**Date:** 2026-02-02 (last updated 2026-02-02) +**Date:** 2026-02-02 (last updated 2026-02-03) **Author:** AI Research @@ -77,16 +77,30 @@ A new `toolPolicy` option controls how the harness manages tool selection: ```typescript type ToolPolicy = - | 'auto' // Current behavior: model chooses freely - | 'require-tools' // toolChoice: 'required' on every turn - | 'research-first' // Force webSearch on first turn, then auto - | 'research-always' // Force webSearch every turn until form complete - | 'two-phase' // Phase 1: research only, Phase 2: fill only + | 'none' // No tools provided to agent + | 'auto' // Model chooses freely whether to use tools + | 'require_tools' // toolChoice: 'required' on every turn (DEFAULT) + | 'web_search_first' // Force web_search on first turn, then require_tools + | 'web_search_always' // Force web_search every turn until form complete + | 'two_phase' // Phase 1: web search only, Phase 2: fill only ``` +**Default:** `require_tools` — ensures the agent always makes progress by calling a tool +(either `fill_form` or `web_search`) on every turn. This prevents "analysis paralysis" +where models describe what they would do without actually doing it. 
+ ### Policy Behaviors -#### `auto` (Default - Current Behavior) +#### `none` + +``` +No tools provided to agent +Agent can only generate text responses +``` + +**When to use:** Testing, debugging, or when tools are intentionally disabled + +#### `auto` ``` toolChoice: 'auto' on every turn @@ -94,9 +108,10 @@ Model decides when to search and when to fill No enforcement of research before filling ``` -**When to use:** Simple forms, fields that don't need external research, testing +**When to use:** Legacy behavior, simple forms that don't need research, or when you +want maximum model flexibility -#### `require-tools` +#### `require_tools` (Default) ``` toolChoice: 'required' on every turn @@ -105,9 +120,10 @@ Prevents "analysis paralysis" where model talks without acting Uses termination detection to allow final response ``` -**When to use:** General production use, ensures progress every turn +**When to use:** General production use, ensures progress every turn. This is the +recommended default for most forms. -#### `research-first` +#### `web_search_first` ``` Turn 1: toolChoice: { type: 'tool', toolName: 'web_search' } @@ -116,7 +132,7 @@ Turn 2+: toolChoice: 'required' **When to use:** Forms with factual fields that need current data, moderate latency tolerance -#### `research-always` +#### `web_search_always` ``` Every turn: First call must be web_search (via prepareStep logic) @@ -138,7 +154,7 @@ prepareStep: ({ lastToolResults }) => { }, ``` -#### `two-phase` +#### `two_phase` ``` Phase 1 (Research): Only web_search available, toolChoice: 'required' @@ -163,12 +179,12 @@ interface FillOptions { * Tool choice policy for agent tool selection. * Controls how strictly the harness enforces tool usage. * - * @default 'auto' + * @default 'require_tools' */ toolPolicy?: ToolPolicy; /** - * For 'two-phase' policy: max turns in research phase. + * For 'two_phase' policy: max turns in research phase. * After this many turns, switches to fill phase. 
* * @default 5 @@ -176,7 +192,7 @@ interface FillOptions { researchPhaseTurns?: number; /** - * For 'research-always': max searches per turn. + * For 'web_search_always': max searches per turn. * Prevents excessive API calls on forms with many fields. * * @default 3 @@ -192,9 +208,9 @@ interface FillOptions { markform: spec: MF/0.1 harness_config: - tool_policy: research-first # New option - research_phase_turns: 5 # For two-phase - max_searches_per_turn: 3 # For research-always + tool_policy: web_search_first # New option + research_phase_turns: 5 # For two_phase + max_searches_per_turn: 3 # For web_search_always --- ``` @@ -202,10 +218,10 @@ markform: ```bash # New --tool-policy flag -markform fill form.md --tool-policy=research-first +markform fill form.md --tool-policy=web_search_first # Override policy at CLI level -markform fill form.md --tool-policy=two-phase --research-phase-turns=8 +markform fill form.md --tool-policy=two_phase --research-phase-turns=8 ``` ### Field-Level Research Hints (Future Enhancement) @@ -234,20 +250,20 @@ From research, key provider differences to handle: | DeepSeek | Unreliable multi-turn; best with `auto` or single-turn `required` | | xAI | Can't force provider-defined tools; use grok-4-1-fast | -**Recommendation:** Test `two-phase` and `research-always` across all providers before -recommending as defaults. `require-tools` should work reliably across all providers. +**Recommendation:** Test `two_phase` and `web_search_always` across all providers before +recommending as defaults. `require_tools` should work reliably across all providers. ### Areas of Uncertainty (Requiring Testing) 1. **DeepSeek multi-turn behavior**: Research indicates unreliable tool calling after first turn. Need to test: - Does `toolChoice: 'required'` work reliably on DeepSeek? - - What happens with `two-phase` policy? + - What happens with `two_phase` policy? - Should we auto-downgrade to `auto` for DeepSeek? 2. 
**Anthropic extended thinking**: `toolChoice: 'required'` may conflict with extended thinking. Need to test: - - Does `research-first` work with Claude 4 extended thinking? + - Does `web_search_first` work with Claude 4 extended thinking? - Should we detect extended thinking and adjust policy? 3. **Termination detection**: With `toolChoice: 'required'`, how do we allow final text response? @@ -259,9 +275,9 @@ recommending as defaults. `require-tools` should work reliably across all provid 4. **Parallel execution interaction**: When `enableParallel: true`: - Should each parallel agent have its own tool policy? - Should research happen in loose-serial before parallel batches? - - Test: parallel agents with `research-first` - do they all search or just one? + - Test: parallel agents with `web_search_first` - do they all search or just one? -5. **Turn limits**: With `research-always`, does forcing search on every turn: +5. **Turn limits**: With `web_search_always`, does forcing search on every turn: - Hit rate limits with providers? - Significantly impact latency? - Improve accuracy enough to justify cost? @@ -270,7 +286,7 @@ recommending as defaults. `require-tools` should work reliably across all provid ### Phase 1: Core Policy Engine -**Goal:** Implement `toolPolicy` option with `auto`, `require-tools`, and `research-first`. +**Goal:** Implement `toolPolicy` option with `none`, `auto`, `require_tools`, and `web_search_first`. - [ ] Add `ToolPolicy` type to `harnessTypes.ts` - [ ] Add `toolPolicy` to `FillOptions` and `HarnessConfig` @@ -284,13 +300,13 @@ recommending as defaults. `require-tools` should work reliably across all provid ### Phase 2: Advanced Policies -**Goal:** Implement `research-always` and `two-phase` policies. +**Goal:** Implement `web_search_always` and `two_phase` policies. 
-- [ ] Implement `research-always` with `prepareStep` logic -- [ ] Implement `two-phase` with separate agent invocations +- [ ] Implement `web_search_always` with `prepareStep` logic +- [ ] Implement `two_phase` with separate agent invocations - [ ] Add `researchPhaseTurns` and `maxSearchesPerTurn` options - [ ] Update frontmatter parser for new options -- [ ] Add session transcript support for two-phase (mark phase transitions) +- [ ] Add session transcript support for two_phase (mark phase transitions) - [ ] Write integration tests for advanced policies ### Phase 3: Provider Testing & Documentation @@ -310,8 +326,8 @@ recommending as defaults. `require-tools` should work reliably across all provid **Goal:** Ensure policies work correctly with parallel execution. - [ ] Define policy behavior for parallel agents -- [ ] Test `research-first` with parallel batches -- [ ] Test `two-phase` with parallel batches (research → parallel fill) +- [ ] Test `web_search_first` with parallel batches +- [ ] Test `two_phase` with parallel batches (research → parallel fill) - [ ] Add policy options to `ParallelHarnessConfig` - [ ] Document parallel + policy interaction @@ -325,17 +341,18 @@ recommending as defaults. 
`require-tools` should work reliably across all provid ### Integration Tests (Mock Agents) -- `auto`: Agent receives no toolChoice override -- `require-tools`: Agent receives `toolChoice: 'required'` -- `research-first`: Turn 1 gets forced webSearch, turn 2+ gets required -- `research-always`: Each turn starts with forced webSearch -- `two-phase`: Two agent invocations with different tool sets +- `none`: Agent receives no tools +- `auto`: Agent receives `toolChoice: 'auto'` +- `require_tools`: Agent receives `toolChoice: 'required'` +- `web_search_first`: Turn 1 gets forced web_search, turn 2+ gets required +- `web_search_always`: Each turn starts with forced web_search +- `two_phase`: Two agent invocations with different tool sets ### End-to-End Tests (Real LLM Calls) - Test each policy with a factual research form - Verify web search is actually called (check fill record) -- Compare accuracy: auto vs research-first vs two-phase +- Compare accuracy: auto vs web_search_first vs two_phase - Measure latency impact ### Provider Matrix Tests @@ -343,34 +360,37 @@ recommending as defaults. `require-tools` should work reliably across all provid Create automated tests that run the same form across providers: ``` -┌─────────────────┬────────┬──────────┬─────────┬────────┐ -│ Policy │ OpenAI │ Anthropic│ DeepSeek│ Google │ -├─────────────────┼────────┼──────────┼─────────┼────────┤ -│ auto │ ✓ │ ✓ │ ✓ │ ✓ │ -│ require-tools │ ✓ │ ✓ │ ? │ ✓ │ -│ research-first │ ✓ │ ✓ │ ? │ ✓ │ -│ research-always │ ✓ │ ? │ ? │ ✓ │ -│ two-phase │ ✓ │ ✓ │ ? │ ✓ │ -└─────────────────┴────────┴──────────┴─────────┴────────┘ +┌───────────────────┬────────┬──────────┬─────────┬────────┐ +│ Policy │ OpenAI │ Anthropic│ DeepSeek│ Google │ +├───────────────────┼────────┼──────────┼─────────┼────────┤ +│ none │ ✓ │ ✓ │ ✓ │ ✓ │ +│ auto │ ✓ │ ✓ │ ✓ │ ✓ │ +│ require_tools │ ✓ │ ✓ │ ? │ ✓ │ +│ web_search_first │ ✓ │ ✓ │ ? │ ✓ │ +│ web_search_always │ ✓ │ ? │ ? │ ✓ │ +│ two_phase │ ✓ │ ✓ │ ? 
│ ✓ │ +└───────────────────┴────────┴──────────┴─────────┴────────┘ ``` ## Rollout Plan -1. **Phase 1 release**: Add `toolPolicy` with `auto`, `require-tools`, `research-first` - - Default remains `auto` for backward compatibility - - Document as experimental +1. **Phase 1 release**: Add `toolPolicy` with `none`, `auto`, `require_tools`, `web_search_first` + - Default is `require_tools` for reliable tool usage out of the box + - Document policy options and when to use each -2. **Phase 2 release**: Add `research-always` and `two-phase` +2. **Phase 2 release**: Add `web_search_always` and `two_phase` - Include provider testing results - Document recommended policies per use case -3. **Future consideration**: Change default to `require-tools` once validated across providers +3. **Backward compatibility**: Existing forms without `tool_policy` get `require_tools` + - This is a behavior change from implicit `auto`, but improves reliability + - Users can explicitly set `tool_policy: auto` if needed ## Open Questions -1. **Default policy**: Should we change the default from `auto` to `require-tools`? - - Pro: More reliable tool usage out of the box - - Con: Breaking change, may cause issues with DeepSeek +1. **DeepSeek compatibility**: Does `require_tools` work reliably on DeepSeek? + - If not, should we auto-detect DeepSeek and fall back to `auto`? + - Need testing to determine 2. **Per-field policies**: Is form-level policy sufficient, or do we need field-level control? - Current decision: Form-level first, field-level in v2 @@ -382,7 +402,7 @@ Create automated tests that run the same form across providers: 4. **Cost tracking**: Should we track web search costs separately in fill records? - Recommendation: Yes, add `webSearchCalls` count to fill record -5. **Policy composition with `order`**: For `two-phase`, should research happen only for +5. 
**Policy composition with `order`**: For `two_phase`, should research happen only for the current order level, or research all fields upfront? - Recommendation: Research current order level only (progressive disclosure) From 541e192f17923f3acda7ea2f02eb8bdbef9a95db Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 15 Feb 2026 20:10:04 +0000 Subject: [PATCH 4/6] docs: revise tool choice policy spec after engineering review Major changes: - Document steps vs turns architecture (critical context) - Simplify to 4 policies: none, auto, require_tools, require_web_search - Drop web_search_always (redundant) and two_phase (broken: context lost across stateless turns) - Add require_web_search using prepareStep for step-level control - Propose harness-level research injection as future robust solution - Add post-turn validation as alternative approach - Include source code references for all architectural claims https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB --- .../plan-2026-02-02-tool-choice-policies.md | 487 ++++++++++-------- 1 file changed, 270 insertions(+), 217 deletions(-) diff --git a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md index 305d4ff2..58cf8de6 100644 --- a/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md +++ b/docs/project/specs/active/plan-2026-02-02-tool-choice-policies.md @@ -1,10 +1,10 @@ # Plan Spec: Tool Choice Policies for Reliable Form Filling -**Date:** 2026-02-02 (last updated 2026-02-03) +**Date:** 2026-02-02 (last updated 2026-02-15) **Author:** AI Research -**Status:** Draft +**Status:** Draft (revised after senior engineering review) ## Overview @@ -26,10 +26,7 @@ reducing hallucination and improving data accuracy. thoroughness against latency/cost 3. **Cross-model compatibility**: Work reliably across providers (OpenAI, Anthropic, Google, DeepSeek, xAI) -4. 
**Form-level and field-level control**: Allow policies at form level with optional per-field - overrides -5. **Integration with parallel execution**: Policies should compose with the existing `parallel` - and `order` attributes +4. **Form-level control**: Allow policies at form level via frontmatter and CLI ## Non-Goals @@ -37,135 +34,244 @@ reducing hallucination and improving data accuracy. - Model-specific prompt tuning (policies should work across models) - UI for policy configuration (CLI/API only for now) - Automatic policy selection based on field content +- Per-field tool policies (may be considered in v2) ## Background ### The Problem -Current Markform behavior uses `toolChoice: 'auto'` by default, meaning the model decides -whether to use tools. In practice, models often: +Models often skip web search and fill fields with training data (which may be outdated +or hallucinated). The current `toolChoice: 'required'` default (set at `liveAgent.ts:91`) +forces the model to call *some* tool, but doesn't guarantee it uses web search over +fill_form. -1. **Skip web search entirely** and fill fields with training data (which may be outdated) -2. **Hallucinate tool calls** by describing what they would search without actually searching -3. **Become unreliable after ~5 turns** as conversation length increases -4. 
**Fill forms without research** when under time pressure or with simpler prompts +### Architecture: Steps vs Turns (Critical Context) -### Research Findings +The Markform harness has a two-level iteration model that any tool policy must respect: -From the research doc (`research-2026-02-02-tool-choice-parameter.md`): +``` +┌─────────────────────────────────────────────────────────────┐ +│ Harness Turn (one fillFormTool() call) │ +│ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ Single generateText() invocation │ │ +│ │ │ │ +│ │ Step 0: web_search("query") → results ✓ │ │ +│ │ Step 1: fill_form(patches) ← sees search results │ │ +│ │ Step 2: web_search("query2") → results ✓ │ │ +│ │ Step 3: fill_form(patches) ← sees all prior │ │ +│ │ ...up to maxStepsPerTurn (default: 20) │ │ +│ │ │ │ +│ │ Context PRESERVED within these steps │ │ +│ └──────────────────────────────────────────────────────┘ │ +│ │ +│ Form markdown updated with patches → next turn │ +│ │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ Context RESET +┌─────────────────────────────────────────────────────────────┐ +│ Next Harness Turn (fresh fillFormTool() call) │ +│ │ +│ Only sees: updated form markdown + remaining issues │ +│ Previous web search results are LOST │ +│ Previous conversation/reasoning is LOST │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Key facts from code review:** + +1. **Within a turn** (`liveAgent.ts:201-209`): One `generateText()` call allows up to + `maxStepsPerTurn` (default 20) AI SDK steps. The model accumulates context across + steps—web search results from step 0 ARE visible when calling fill_form in step 1. + AI SDK's `prepareStep` callback fires between steps and can change `toolChoice` and + `activeTools` per step (`generate-text.ts:518-546`). 
-- `toolChoice: 'required'` forces tool use but can cause infinite loops without termination -- `toolChoice: { type: 'tool', toolName: 'webSearch' }` forces a specific tool -- Different providers translate these values differently (Anthropic: `required` → `any`) -- AI SDK 6's `prepareStep` allows dynamic toolChoice per turn -- Google recommends limiting to 10-20 tools for reliable selection -- DeepSeek has unreliable multi-turn tool calling; best for single-turn +2. **Across turns** (`liveAgent.ts:123-125`): Each call is stateless. The full form + context is provided fresh. Only three things persist across turns: + - The form markdown (updated with filled values) + - The remaining issues list + - Previous patch rejections -### Current Implementation +3. **Web search results do NOT persist across turns.** Any policy that separates + "research" and "fill" into different harness turns will lose the research results. -The harness provides these tools to the agent: -- `fill_form` - Apply patches to the form -- `web_search` (optional, via `enableWebSearch`) - Search the web for information +### Implications for Policy Design -Currently, `toolChoice` is not explicitly set, defaulting to `'auto'`. +- **Policies must operate at the step level** (within a single `generateText()` call), + NOT at the turn level (across separate `fillFormTool()` invocations). +- **"Two-phase" as separate invocations is broken** — research results from Phase 1 are + lost when Phase 2 starts fresh. +- **`prepareStep`** is the correct mechanism for step-level control. It can dynamically + change `toolChoice` and `activeTools` between steps. 
## Design ### Tool Choice Policy Enum -A new `toolPolicy` option controls how the harness manages tool selection: - ```typescript type ToolPolicy = - | 'none' // No tools provided to agent - | 'auto' // Model chooses freely whether to use tools - | 'require_tools' // toolChoice: 'required' on every turn (DEFAULT) - | 'web_search_first' // Force web_search on first turn, then require_tools - | 'web_search_always' // Force web_search every turn until form complete - | 'two_phase' // Phase 1: web search only, Phase 2: fill only + | 'none' // No tools provided to agent + | 'auto' // Model chooses freely whether to use tools + | 'require_tools' // toolChoice: 'required' on every step (DEFAULT) + | 'require_web_search' // Step 0 must be web_search, then require_tools ``` -**Default:** `require_tools` — ensures the agent always makes progress by calling a tool -(either `fill_form` or `web_search`) on every turn. This prevents "analysis paralysis" -where models describe what they would do without actually doing it. +**Default:** `require_tools` ### Policy Behaviors #### `none` ``` -No tools provided to agent -Agent can only generate text responses +No tools provided to agent. +Agent can only generate text responses. +toolChoice: N/A ``` -**When to use:** Testing, debugging, or when tools are intentionally disabled +**When to use:** Testing, debugging, or when tools are intentionally disabled. #### `auto` ``` -toolChoice: 'auto' on every turn -Model decides when to search and when to fill -No enforcement of research before filling +toolChoice: 'auto' on every step. +Model decides when to search and when to fill. +No enforcement of tool usage. ``` **When to use:** Legacy behavior, simple forms that don't need research, or when you -want maximum model flexibility +want maximum model flexibility. 
#### `require_tools` (Default) ``` -toolChoice: 'required' on every turn -Model must call SOME tool every turn (fill_form or web_search) -Prevents "analysis paralysis" where model talks without acting -Uses termination detection to allow final response +toolChoice: 'required' on every step. +Model must call SOME tool on every step (fill_form or web_search). +Prevents "analysis paralysis" where model talks without acting. +Model can interleave web_search and fill_form freely within a turn. ``` -**When to use:** General production use, ensures progress every turn. This is the -recommended default for most forms. - -#### `web_search_first` - -``` -Turn 1: toolChoice: { type: 'tool', toolName: 'web_search' } -Turn 2+: toolChoice: 'required' -``` +**When to use:** General production use. Ensures progress on every step. +Already set as the default at `liveAgent.ts:91`. -**When to use:** Forms with factual fields that need current data, moderate latency tolerance +**Note:** This does NOT guarantee web search is used. The model may go straight +to fill_form. For guaranteed research, use `require_web_search`. -#### `web_search_always` +#### `require_web_search` ``` -Every turn: First call must be web_search (via prepareStep logic) -After web_search returns, toolChoice: 'required' for rest of turn +Step 0: toolChoice: { type: 'tool', toolName: 'web_search' } +Step 1+: toolChoice: 'required' ``` -**When to use:** High-accuracy requirements, fields with rapidly changing data +**When to use:** Forms with factual fields that need current data. Guarantees at +least one web search before any form filling within each turn. 
**Implementation:** ```typescript -prepareStep: ({ lastToolResults }) => { - const hasSearchedThisTurn = lastToolResults?.some( - r => r.toolName === 'web_search' +// In liveAgent.ts, pass prepareStep to generateText(): +prepareStep: ({ stepNumber, steps }) => { + const hasSearched = steps.some(step => + step.toolCalls.some(tc => isWebSearchTool(tc.toolName)) ); - if (!hasSearchedThisTurn) { + if (!hasSearched) { return { toolChoice: { type: 'tool', toolName: 'web_search' } }; } return { toolChoice: 'required' }; }, ``` -#### `two_phase` +**Behavior within a single turn:** +1. Step 0: Model is forced to call web_search +2. Step 1: Model sees search results, chooses any tool (usually fill_form) +3. Step 2+: Model continues with required tools (may search again or fill more) +### Deferred: More Aggressive Research Policies + +The original spec proposed `web_search_always` and `two_phase` policies. These are +**deferred** pending architectural work: + +#### Why `two_phase` doesn't work as designed + +The original design called for "Phase 1: research only, Phase 2: fill only" as separate +agent invocations. This is broken because web search results from Phase 1 are completely +lost when Phase 2 starts (turns are stateless). + +**Possible future approaches:** +1. **Harness-level research injection** (recommended): The harness itself runs web + searches before calling the LLM, based on field labels/descriptions. Inject results + into the context prompt. The model never has to decide whether to search. +2. **Research accumulator**: Store web search results in a sidecar that persists across + turns. Inject into subsequent prompts. Adds complexity. +3. **Single-turn two-phase via `activeTools`**: Within one `generateText()` call, use + `prepareStep` to only expose web_search for steps 0-N, then only expose fill_form + for steps N+1+. Works within a turn but may exceed step limits for complex forms. 
+ +#### Why `web_search_always` is questionable + +Forcing web search on every step within a turn doesn't make sense — after step 0 +returns search results, the model already has the information. Forcing redundant +searches wastes API calls. The `require_web_search` policy (search on step 0) achieves +the same goal more efficiently. + +If the concern is that the model needs to search for different fields at different +points, `require_tools` already allows this — the model can interleave searches and +fills freely. + +### Alternative Approaches Worth Considering + +#### Harness-Level Research Injection (Future) + +Instead of asking the model to decide when to search, the harness itself runs web +searches based on field metadata: + +```typescript +// Pseudocode for future harness-level research +async function researchFields(form: ParsedForm, issues: InspectIssue[]): Promise<string> { + const queries = generateSearchQueries(form, issues); + const results = await Promise.all(queries.map(q => webSearch(q))); + return formatResearchContext(results); +} + +// Inject into context prompt +const contextPrompt = buildContextPrompt(issues, form, maxPatches, previousRejections); +const researchContext = await researchFields(form, issues); +const fullPrompt = contextPrompt + '\n\n# Research Results\n' + researchContext; ``` -Phase 1 (Research): Only web_search available, toolChoice: 'required' - Runs until configurable turn count or all fields researched -Phase 2 (Fill): Only fill_form available, toolChoice: 'required' - Uses research context from Phase 1 -``` -**When to use:** Maximum accuracy, complex research forms, acceptable latency +**Advantages:** +- Most reliable — doesn't depend on model behavior +- Works identically across all providers +- Research quality can be tuned independently of the LLM +- Can be cached/reused across turns + +**Disadvantages:** +- Harness must know what to search for (field labels may not be sufficient) +- Upfront latency for search before LLM call +- 
May search for things the model doesn't need -**Implementation:** Two separate agent invocations with different tool sets +This is the recommended direction for forms that truly need guaranteed research. + +#### Post-Turn Validation (Alternative) + +Let the model use `require_tools` freely, but validate after each turn: + +```typescript +// After fillFormTool() returns, check tool usage +const toolCalls = response.stats.toolCalls; +const usedWebSearch = toolCalls.some(tc => isWebSearchTool(tc.name)); + +if (!usedWebSearch && policyRequiresSearch) { + // Inject reminder into next turn's context + previousRejections.push({ + message: 'You did not use web search. Research field values before filling.', + // ... triggers re-try + }); +} +``` + +**Advantages:** Simple, works across providers, no `prepareStep` complexity. +**Disadvantages:** Wastes a turn when model doesn't search. Slower. ### API Changes @@ -182,22 +288,6 @@ interface FillOptions { * @default 'require_tools' */ toolPolicy?: ToolPolicy; - - /** - * For 'two_phase' policy: max turns in research phase. - * After this many turns, switches to fill phase. - * - * @default 5 - */ - researchPhaseTurns?: number; - - /** - * For 'web_search_always': max searches per turn. - * Prevents excessive API calls on forms with many fields. 
- * - * @default 3 - */ - maxSearchesPerTurn?: number; } ``` @@ -208,9 +298,7 @@ interface FillOptions { markform: spec: MF/0.1 harness_config: - tool_policy: web_search_first # New option - research_phase_turns: 5 # For two_phase - max_searches_per_turn: 3 # For web_search_always + tool_policy: require_web_search # New option --- ``` @@ -218,193 +306,149 @@ markform: ```bash # New --tool-policy flag -markform fill form.md --tool-policy=web_search_first +markform fill form.md --tool-policy=require_web_search -# Override policy at CLI level -markform fill form.md --tool-policy=two_phase --research-phase-turns=8 +# Override policy at CLI level (CLI overrides frontmatter) +markform fill form.md --tool-policy=auto ``` -### Field-Level Research Hints (Future Enhancement) - -For v2, consider per-field annotations: - -```markdown - - - - - -``` - -This is deferred - the form-level policy is sufficient for initial implementation. - ### Provider-Specific Considerations -From research, key provider differences to handle: - -| Provider | Notes | -|----------|-------| -| OpenAI | `'required'` works directly; parallel tool calls supported | -| Anthropic | `'required'` → `'any'` translation; `disable_parallel_tool_use` available | -| Google | `'required'` → `mode: 'ANY'`; limit to 10-20 tools | -| DeepSeek | Unreliable multi-turn; best with `auto` or single-turn `required` | -| xAI | Can't force provider-defined tools; use grok-4-1-fast | - -**Recommendation:** Test `two_phase` and `web_search_always` across all providers before -recommending as defaults. `require_tools` should work reliably across all providers. 
+| Provider | `require_tools` | `require_web_search` | Notes | +|----------|----------------|---------------------|-------| +| OpenAI | Works directly | Works directly | Parallel tool calls supported | +| Anthropic | `required` → `any` | `{ type: 'tool' }` works | Not compatible with extended thinking | +| Google | `required` → `ANY` | `ANY` + `allowedFunctionNames` | Limit to 10-20 tools | +| DeepSeek | **Needs testing** | **Needs testing** | Unreliable multi-turn; may need fallback | +| xAI | Works | Can't force provider-defined tools | Use grok-4-1-fast | ### Areas of Uncertainty (Requiring Testing) -1. **DeepSeek multi-turn behavior**: Research indicates unreliable tool calling after first turn. - Need to test: - - Does `toolChoice: 'required'` work reliably on DeepSeek? - - What happens with `two_phase` policy? +1. **DeepSeek `require_tools` behavior**: Research indicates unreliable tool calling. + - Does `toolChoice: 'required'` work reliably? - Should we auto-downgrade to `auto` for DeepSeek? + - Test with both single-step and multi-step turns. -2. **Anthropic extended thinking**: `toolChoice: 'required'` may conflict with extended thinking. - Need to test: - - Does `web_search_first` work with Claude 4 extended thinking? - - Should we detect extended thinking and adjust policy? +2. **Anthropic extended thinking + `require_tools`**: Anthropic docs say only + `tool_choice: 'auto'` is compatible with extended thinking. + - Does this affect our default? Should we detect extended thinking? + - Workaround: Use `auto` + strong prompting when extended thinking is enabled. -3. **Termination detection**: With `toolChoice: 'required'`, how do we allow final text response? - Options to test: - - Use `stopWhen: hasToolCall('fill_form')` with form completion check - - Use `prepareStep` to switch to `'auto'` on last turn - - Add a no-op `complete` tool +3. 
**`require_web_search` with providers using different search tool names**: + - OpenAI: `web_search`, Anthropic: `web_search`, Google: `google_search` + - Need to resolve correct tool name dynamically. -4. **Parallel execution interaction**: When `enableParallel: true`: - - Should each parallel agent have its own tool policy? - - Should research happen in loose-serial before parallel batches? - - Test: parallel agents with `web_search_first` - do they all search or just one? +4. **Step limits**: With `require_web_search`, do we burn a step on search? + - Current default: 20 steps per turn (plenty of room). + - Monitor if complex forms hit the step limit. -5. **Turn limits**: With `web_search_always`, does forcing search on every turn: - - Hit rate limits with providers? - - Significantly impact latency? - - Improve accuracy enough to justify cost? +5. **Provider-specific `toolChoice: { type: 'tool' }` support**: + - xAI can't force server-side tools — does this affect web search? + - Need to test forcing specific tool names across all providers. ## Implementation Plan -### Phase 1: Core Policy Engine +### Phase 1: Core Policies -**Goal:** Implement `toolPolicy` option with `none`, `auto`, `require_tools`, and `web_search_first`. +**Goal:** Implement `toolPolicy` with `none`, `auto`, `require_tools`, `require_web_search`. 
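A minimal sketch of the policy to `toolChoice` translation that Phase 1 describes, assuming the four policy names listed in the goal. `toolChoiceForStep`, `StepState`, and `webSearchToolName` are hypothetical names used for illustration only, not existing Markform or AI SDK identifiers. The local `ToolChoice` union only mirrors the shape of the AI SDK option.

```typescript
// Hypothetical sketch: these names are illustrative, not Markform's actual API.
type ToolPolicy = 'none' | 'auto' | 'require_tools' | 'require_web_search';

// Local mirror of the AI SDK `toolChoice` option shape.
type ToolChoice =
  | 'auto'
  | 'none'
  | 'required'
  | { type: 'tool'; toolName: string };

interface StepState {
  // True once an earlier step in the same turn called a web search tool.
  hasSearched: boolean;
  // Provider-specific search tool name, e.g. 'web_search' or 'google_search'.
  webSearchToolName: string;
}

function toolChoiceForStep(policy: ToolPolicy, state: StepState): ToolChoice {
  switch (policy) {
    case 'none':
      // Assumption: the plan says the agent receives no tools under `none`;
      // returning 'none' here is only for completeness.
      return 'none';
    case 'auto':
      return 'auto';
    case 'require_tools':
      return 'required';
    case 'require_web_search':
      // Force the search tool until a search has happened, then require any tool.
      return state.hasSearched
        ? 'required'
        : { type: 'tool', toolName: state.webSearchToolName };
  }
}
```

Inside a `prepareStep` callback, `hasSearched` could be derived from the tool calls of earlier steps, for instance with the existing `isWebSearchTool()` helper. Keeping the mapping a pure function makes the "policy → toolChoice translation" unit tests straightforward.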
- [ ] Add `ToolPolicy` type to `harnessTypes.ts` -- [ ] Add `toolPolicy` to `FillOptions` and `HarnessConfig` +- [ ] Add `toolPolicy` to `FillOptions` and `LiveAgentConfig` - [ ] Add `tool_policy` to `HarnessConfigYaml` and mapping in `settings.ts` -- [ ] Implement `getToolChoiceForPolicy()` helper that returns AI SDK toolChoice -- [ ] Update `liveAgent.ts` to use `prepareStep` for policy enforcement -- [ ] Update `fillRecord` to track policy and actual tool usage +- [ ] Implement `prepareStep` callback in `liveAgent.ts` for `require_web_search` +- [ ] Resolve web search tool name dynamically (provider-aware) - [ ] Add `--tool-policy` flag to `markform fill` command +- [ ] Update `fillRecord` to track policy in metadata - [ ] Write unit tests for policy → toolChoice translation - [ ] Write integration tests with mock agents -### Phase 2: Advanced Policies - -**Goal:** Implement `web_search_always` and `two_phase` policies. +### Phase 2: Provider Validation -- [ ] Implement `web_search_always` with `prepareStep` logic -- [ ] Implement `two_phase` with separate agent invocations -- [ ] Add `researchPhaseTurns` and `maxSearchesPerTurn` options -- [ ] Update frontmatter parser for new options -- [ ] Add session transcript support for two_phase (mark phase transitions) -- [ ] Write integration tests for advanced policies +**Goal:** Validate all policies across providers. -### Phase 3: Provider Testing & Documentation - -**Goal:** Validate policies across providers and document recommendations. 
- -- [ ] Create test matrix: policy × provider × form complexity -- [ ] Test DeepSeek specifically for multi-turn reliability +- [ ] Create test matrix: policy × provider +- [ ] Test DeepSeek specifically: `require_tools` and `require_web_search` - [ ] Test Anthropic with extended thinking -- [ ] Document provider-specific recommendations in research doc -- [ ] Update `docs/markform-apis.md` with policy documentation -- [ ] Add example forms demonstrating each policy -- [ ] Create troubleshooting guide for policy issues +- [ ] Test xAI with forced tool names +- [ ] Document provider-specific recommendations +- [ ] Add fallback behavior for unsupported provider/policy combos -### Phase 4: Parallel Execution Integration +### Phase 3: Harness-Level Research (Future) -**Goal:** Ensure policies work correctly with parallel execution. +**Goal:** Enable guaranteed research without relying on model behavior. -- [ ] Define policy behavior for parallel agents -- [ ] Test `web_search_first` with parallel batches -- [ ] Test `two_phase` with parallel batches (research → parallel fill) -- [ ] Add policy options to `ParallelHarnessConfig` -- [ ] Document parallel + policy interaction +- [ ] Design search query generation from field metadata +- [ ] Implement harness-level web search execution +- [ ] Inject research results into context prompt +- [ ] Add `research_mode: auto | manual` config option +- [ ] Cache research results across turns for efficiency +- [ ] Write integration tests ## Testing Strategy ### Unit Tests - Policy → toolChoice translation for each policy type -- Policy parsing from frontmatter -- CLI flag parsing +- Policy parsing from frontmatter and CLI +- Web search tool name resolution per provider ### Integration Tests (Mock Agents) - `none`: Agent receives no tools - `auto`: Agent receives `toolChoice: 'auto'` - `require_tools`: Agent receives `toolChoice: 'required'` -- `web_search_first`: Turn 1 gets forced web_search, turn 2+ gets required -- 
`web_search_always`: Each turn starts with forced web_search -- `two_phase`: Two agent invocations with different tool sets +- `require_web_search`: Step 0 forced to web_search, step 1+ required ### End-to-End Tests (Real LLM Calls) - Test each policy with a factual research form -- Verify web search is actually called (check fill record) -- Compare accuracy: auto vs web_search_first vs two_phase +- Verify web search is actually called (check fill record / wire format) +- Compare accuracy: auto vs require_tools vs require_web_search - Measure latency impact -### Provider Matrix Tests - -Create automated tests that run the same form across providers: +### Provider Matrix ``` -┌───────────────────┬────────┬──────────┬─────────┬────────┐ -│ Policy │ OpenAI │ Anthropic│ DeepSeek│ Google │ -├───────────────────┼────────┼──────────┼─────────┼────────┤ -│ none │ ✓ │ ✓ │ ✓ │ ✓ │ -│ auto │ ✓ │ ✓ │ ✓ │ ✓ │ -│ require_tools │ ✓ │ ✓ │ ? │ ✓ │ -│ web_search_first │ ✓ │ ✓ │ ? │ ✓ │ -│ web_search_always │ ✓ │ ? │ ? │ ✓ │ -│ two_phase │ ✓ │ ✓ │ ? │ ✓ │ -└───────────────────┴────────┴──────────┴─────────┴────────┘ +┌─────────────────────┬────────┬──────────┬─────────┬────────┬─────┐ +│ Policy │ OpenAI │ Anthropic│ DeepSeek│ Google │ xAI │ +├─────────────────────┼────────┼──────────┼─────────┼────────┼─────┤ +│ none │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ +│ auto │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ +│ require_tools │ ✓ │ ✓ │ ? │ ✓ │ ✓ │ +│ require_web_search │ ✓ │ ? │ ? │ ? │ ? │ +└─────────────────────┴────────┴──────────┴─────────┴────────┴─────┘ ``` ## Rollout Plan -1. **Phase 1 release**: Add `toolPolicy` with `none`, `auto`, `require_tools`, `web_search_first` - - Default is `require_tools` for reliable tool usage out of the box - - Document policy options and when to use each +1. **Phase 1**: Ship `none`, `auto`, `require_tools`, `require_web_search` + - Default is `require_tools` (already the current behavior) + - `require_web_search` documented as beta until provider testing complete -2. 
**Phase 2 release**: Add `web_search_always` and `two_phase` - - Include provider testing results - - Document recommended policies per use case +2. **Phase 2**: Validate across providers, promote `require_web_search` to stable -3. **Backward compatibility**: Existing forms without `tool_policy` get `require_tools` - - This is a behavior change from implicit `auto`, but improves reliability - - Users can explicitly set `tool_policy: auto` if needed +3. **Phase 3**: Harness-level research injection as the robust solution for + forms that truly need guaranteed research ## Open Questions -1. **DeepSeek compatibility**: Does `require_tools` work reliably on DeepSeek? - - If not, should we auto-detect DeepSeek and fall back to `auto`? - - Need testing to determine +1. **Web search tool naming**: Should `require_web_search` resolve tool names + dynamically (e.g., `google_search` for Google provider), or should we normalize + all search tools to `web_search`? + - Current code: Google uses `google_search`, others use `web_search` + - Recommendation: Resolve dynamically using existing `isWebSearchTool()` helper -2. **Per-field policies**: Is form-level policy sufficient, or do we need field-level control? - - Current decision: Form-level first, field-level in v2 +2. **DeepSeek fallback**: If `require_tools` doesn't work on DeepSeek, should we + auto-detect and fall back, or let it fail visibly? + - Recommendation: Fail visibly with a warning, let user set `auto` explicitly -3. **Policy inheritance in parallel execution**: Should parallel agents inherit the form policy - or have independent policies? - - Recommendation: Inherit form policy, with option to override per batch +3. **Extended thinking**: Should we auto-detect extended thinking on Anthropic and + downgrade to `auto`? + - Recommendation: Yes, with a logged warning -4. **Cost tracking**: Should we track web search costs separately in fill records? 
- - Recommendation: Yes, add `webSearchCalls` count to fill record - -5. **Policy composition with `order`**: For `two_phase`, should research happen only for - the current order level, or research all fields upfront? - - Recommendation: Research current order level only (progressive disclosure) +4. **Harness-level research scope**: When implemented, should it search for all + fields or only unfilled fields with `research: required` annotation? + - Deferred to Phase 3 design ## References @@ -414,3 +458,12 @@ Create automated tests that run the same form across providers: - [Parallel Form Filling Spec](plan-2026-01-27-parallel-form-filling.md) - [GitHub Issue: Tool Execution Unreliable](https://github.com/vercel/ai/issues/10269) - [GitHub Issue: toolChoice Endless Loop](https://github.com/vercel/ai/issues/3944) + +### Source Code References + +- `liveAgent.ts:91` — current `toolChoice` default (`'required'`) +- `liveAgent.ts:123-125` — stateless turn documentation +- `liveAgent.ts:201-209` — `generateText()` invocation with `stepCountIs` +- `liveAgent.ts:231` — tool call counting (for post-turn validation) +- `liveAgent.ts:746-748` — `isWebSearchTool()` helper +- `generate-text.ts:518-546` (AI SDK) — `prepareStep` callback invocation From 84a5ccdc0564a5ff5874231227ddbec31c214475 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 17 Feb 2026 01:39:09 +0000 Subject: [PATCH 5/6] docs: update research doc recommendations to match revised design - Replace Options A/B/C with concrete recommendations - Add prepareStep pattern for forced web search - Document critical architecture note (stateless turns) - Add provider-specific compatibility table - Remove stale "Next Steps" referencing old hybrid approach https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB --- ...search-2026-02-02-tool-choice-parameter.md | 102 ++++++++---------- 1 file changed, 43 insertions(+), 59 deletions(-) diff --git a/docs/project/research/research-2026-02-02-tool-choice-parameter.md 
b/docs/project/research/research-2026-02-02-tool-choice-parameter.md index 285131f0..eedbc94a 100644 --- a/docs/project/research/research-2026-02-02-tool-choice-parameter.md +++ b/docs/project/research/research-2026-02-02-tool-choice-parameter.md @@ -1,6 +1,6 @@ # Research: Tool Choice Parameter in AI SDK and Major LLM Providers -**Date:** 2026-02-02 (last updated 2026-02-02) +**Date:** 2026-02-02 (last updated 2026-02-17) **Author:** AI Research @@ -470,75 +470,59 @@ used in a specific call, use prompting instead: "Do not use any tools for this r --- -## Options Considered +## Recommendations for Markform -### Option A: Use `toolChoice: 'required'` Everywhere +Based on this research and confirmed by reviewing the Markform harness architecture +(stateless turns with multi-step tool loops within each turn): -**Description:** Force tool use on every step until completion. +### Recommended: `toolChoice: 'required'` as Default -**Pros:** -- Guarantees tools are called -- Prevents hallucination of tool results +Use `toolChoice: 'required'` (exposed as the Markform `require_tools` policy) as the default +for all form filling. This: +- Prevents "analysis paralysis" where models describe what they'd do without acting +- Works across OpenAI, Anthropic (→ `any`), Google (→ `ANY`), and xAI +- Already set as default in `liveAgent.ts:91` +- Uses `stopWhen: stepCountIs(maxStepsPerTurn)` to prevent infinite loops -**Cons:** -- Can cause infinite loops without proper termination -- May force unnecessary tool calls -- Not compatible with Anthropic extended thinking +### For Guaranteed Web Search: `prepareStep` with Forced First Step -### Option B: Use `toolChoice: 'auto'` with Strong Prompting +When forms require factual research, use AI SDK's `prepareStep` callback to force +`web_search` on step 0, then `'required'` for subsequent steps.
This works because +within a single `generateText()` call, the model accumulates context across steps— +search results from step 0 are visible when calling `fill_form` in step 1. -**Description:** Rely on system prompts to guide tool use. - -**Pros:** -- More flexible -- Works with all features (extended thinking, etc.) -- Natural conversation flow - -**Cons:** -- Model may ignore prompts and skip tools -- Reliability degrades over long conversations -- Harder to guarantee tool usage - -### Option C: Hybrid Approach with `prepareStep` - -**Description:** Use `toolChoice: 'required'` initially, switch to `'auto'` for final -response. - -**Pros:** -- Best of both worlds -- Guarantees initial research -- Allows natural completion - -**Cons:** -- More complex implementation -- Requires careful step management - ---- - -## Recommendations - -1. **For form-filling with mandatory research:** Use Option C (Hybrid) with: - - `toolChoice: 'required'` for first N steps - - An answer tool without execute function - - `stopWhen: hasToolCall('submitForm')` +```typescript +prepareStep: ({ steps }) => { + const hasSearched = steps.some(step => + step.toolCalls.some(tc => isWebSearchTool(tc.toolName)) + ); + if (!hasSearched) { + return { toolChoice: { type: 'tool', toolName: 'web_search' } }; + } + return { toolChoice: 'required' }; +}, +``` -2. **For simpler tool integration:** Use Option A with proper termination: - - Define a clear termination tool - - Use `stopWhen` to prevent infinite loops +**Critical architecture note:** Markform turns are stateless—web search results do NOT +persist across turns. All tool policies must operate at the **step** level (within a +single `generateText()` call), not at the turn level. -3. 
**For conversation-like interfaces:** Use Option B with: - - Strong system prompts - - Explicit tool-use instructions in user messages - - Periodic context compaction +### Future: Harness-Level Research Injection ---- +For maximum reliability (not dependent on model behavior), the harness could run web +searches itself before calling the LLM, injecting results into the context prompt. +This is the most provider-agnostic approach and decouples research quality from +model tool-calling behavior. See the plan spec for details. -## Next Steps +### Provider-Specific Notes -- [ ] Implement the recommended hybrid pattern in markform -- [ ] Add tool input validation for form fields -- [ ] Create a verification step for critical data -- [ ] Test across multiple providers for consistency +| Provider | `'required'` | Forced specific tool | Caveat | +|----------|-------------|---------------------|--------| +| OpenAI | Works directly | Works directly | — | +| Anthropic | → `any` | → `{ type: 'tool' }` | Not compatible with extended thinking | +| Google | → `ANY` | → `allowedFunctionNames` | Limit to 10-20 tools | +| DeepSeek | Needs testing | Needs testing | Unreliable multi-turn calling | +| xAI | Works | Can't force server-side tools | Use grok-4-1-fast | --- From 1a8f09e842ba747ac0733f7c8b3f119ae0b378b7 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 17 Feb 2026 01:42:39 +0000 Subject: [PATCH 6/6] chore: update tbd config for v0.1.13 shortcut/guideline renames https://claude.ai/code/session_011kQikoqnHhscL1RuuDJGHB --- .tbd/config.yml | 27 ++++++++------------------- 1 file changed, 8 insertions(+), 19 deletions(-) diff --git a/.tbd/config.yml b/.tbd/config.yml index 027671ba..2f67fcc2 100644 --- a/.tbd/config.yml +++ b/.tbd/config.yml @@ -3,7 +3,7 @@ display: # Documentation cache configuration. # files: Maps destination paths (relative to .tbd/docs/) to source locations. 
# Sources can be: -# - internal: prefix for bundled docs (e.g., "internal:shortcuts/standard/code-review-and-commit.md") +# - internal: prefix for bundled docs (e.g., "internal:shortcuts/standard/commit-code.md") # - Full URL for external docs (e.g., "https://raw.githubusercontent.com/org/repo/main/file.md") # lookup_path: Search paths for doc lookup (like shell $PATH). Earlier paths take precedence. # @@ -15,12 +15,10 @@ display: docs_cache: files: guidelines/backward-compatibility-rules.md: internal:guidelines/backward-compatibility-rules.md - guidelines/bun-monorepo-patterns.md: internal:guidelines/bun-monorepo-patterns.md guidelines/cli-agent-skill-patterns.md: internal:guidelines/cli-agent-skill-patterns.md guidelines/commit-conventions.md: internal:guidelines/commit-conventions.md guidelines/convex-limits-best-practices.md: internal:guidelines/convex-limits-best-practices.md guidelines/convex-rules.md: internal:guidelines/convex-rules.md - guidelines/electron-app-development-patterns.md: internal:guidelines/electron-app-development-patterns.md guidelines/error-handling-rules.md: internal:guidelines/error-handling-rules.md guidelines/general-coding-rules.md: internal:guidelines/general-coding-rules.md guidelines/general-comment-rules.md: internal:guidelines/general-comment-rules.md @@ -29,25 +27,19 @@ docs_cache: guidelines/general-tdd-guidelines.md: internal:guidelines/general-tdd-guidelines.md guidelines/general-testing-rules.md: internal:guidelines/general-testing-rules.md guidelines/golden-testing-guidelines.md: internal:guidelines/golden-testing-guidelines.md - guidelines/pnpm-monorepo-patterns.md: internal:guidelines/pnpm-monorepo-patterns.md guidelines/python-cli-patterns.md: internal:guidelines/python-cli-patterns.md guidelines/python-modern-guidelines.md: internal:guidelines/python-modern-guidelines.md guidelines/python-rules.md: internal:guidelines/python-rules.md - guidelines/release-notes-guidelines.md: 
internal:guidelines/release-notes-guidelines.md - guidelines/tbd-sync-troubleshooting.md: internal:guidelines/tbd-sync-troubleshooting.md + guidelines/sync-troubleshooting.md: internal:guidelines/sync-troubleshooting.md guidelines/typescript-cli-tool-rules.md: internal:guidelines/typescript-cli-tool-rules.md guidelines/typescript-code-coverage.md: internal:guidelines/typescript-code-coverage.md + guidelines/typescript-monorepo-patterns.md: internal:guidelines/typescript-monorepo-patterns.md guidelines/typescript-rules.md: internal:guidelines/typescript-rules.md - guidelines/typescript-sorting-patterns.md: internal:guidelines/typescript-sorting-patterns.md - guidelines/typescript-yaml-handling-rules.md: internal:guidelines/typescript-yaml-handling-rules.md - guidelines/writing-style-guidelines.md: internal:guidelines/writing-style-guidelines.md shortcuts/standard/agent-handoff.md: internal:shortcuts/standard/agent-handoff.md - shortcuts/standard/checkout-third-party-repo.md: internal:shortcuts/standard/checkout-third-party-repo.md - shortcuts/standard/code-cleanup-all.md: internal:shortcuts/standard/code-cleanup-all.md - shortcuts/standard/code-cleanup-docstrings.md: internal:shortcuts/standard/code-cleanup-docstrings.md - shortcuts/standard/code-cleanup-tests.md: internal:shortcuts/standard/code-cleanup-tests.md - shortcuts/standard/code-review-and-commit.md: internal:shortcuts/standard/code-review-and-commit.md - shortcuts/standard/coding-spike.md: internal:shortcuts/standard/coding-spike.md + shortcuts/standard/cleanup-all.md: internal:shortcuts/standard/cleanup-all.md + shortcuts/standard/cleanup-remove-trivial-tests.md: internal:shortcuts/standard/cleanup-remove-trivial-tests.md + shortcuts/standard/cleanup-update-docstrings.md: internal:shortcuts/standard/cleanup-update-docstrings.md + shortcuts/standard/commit-code.md: internal:shortcuts/standard/commit-code.md shortcuts/standard/create-or-update-pr-simple.md: 
internal:shortcuts/standard/create-or-update-pr-simple.md shortcuts/standard/create-or-update-pr-with-validation-plan.md: internal:shortcuts/standard/create-or-update-pr-with-validation-plan.md shortcuts/standard/implement-beads.md: internal:shortcuts/standard/implement-beads.md @@ -55,7 +47,6 @@ docs_cache: shortcuts/standard/new-architecture-doc.md: internal:shortcuts/standard/new-architecture-doc.md shortcuts/standard/new-guideline.md: internal:shortcuts/standard/new-guideline.md shortcuts/standard/new-plan-spec.md: internal:shortcuts/standard/new-plan-spec.md - shortcuts/standard/new-qa-playbook.md: internal:shortcuts/standard/new-qa-playbook.md shortcuts/standard/new-research-brief.md: internal:shortcuts/standard/new-research-brief.md shortcuts/standard/new-shortcut.md: internal:shortcuts/standard/new-shortcut.md shortcuts/standard/new-validation-plan.md: internal:shortcuts/standard/new-validation-plan.md @@ -72,12 +63,10 @@ docs_cache: shortcuts/standard/update-specs-status.md: internal:shortcuts/standard/update-specs-status.md shortcuts/standard/welcome-user.md: internal:shortcuts/standard/welcome-user.md shortcuts/system/shortcut-explanation.md: internal:shortcuts/system/shortcut-explanation.md - shortcuts/system/skill-baseline.md: internal:shortcuts/system/skill-baseline.md shortcuts/system/skill-brief.md: internal:shortcuts/system/skill-brief.md - shortcuts/system/skill-minimal.md: internal:shortcuts/system/skill-minimal.md + shortcuts/system/skill.md: internal:shortcuts/system/skill.md templates/architecture-doc.md: internal:templates/architecture-doc.md templates/plan-spec.md: internal:templates/plan-spec.md - templates/qa-playbook.md: internal:templates/qa-playbook.md templates/research-brief.md: internal:templates/research-brief.md lookup_path: - .tbd/docs/shortcuts/system