-
Notifications
You must be signed in to change notification settings - Fork 293
fix(kosong): parse DeepSeek inline tool calls when a backend leaves them unstructured #798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
b7c35de
60eab4d
42d0a7a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| --- | ||
| "@moonshot-ai/kosong": patch | ||
| --- | ||
|
|
||
| Parse DeepSeek-format inline tool calls when an OpenAI-compatible backend leaves them unstructured. | ||
|
|
||
| DeepSeek-architecture models (deepseek-v3/r1 and derivatives such as cogito) emit tool calls as special tokens rather than OpenAI `tool_calls`. DeepSeek's own API structures these server-side, but many compatible deployments — self-hosted vLLM/SGLang/llama.cpp, ollama, and some proxies — leak the raw `<|tool▁calls▁begin|>…` tokens into the assistant content, leaving the agent with nothing to dispatch and the turn dead-ending. The OpenAI chat-completions provider now detects that case, parses the tokens into structured tool calls, and strips them from the visible text — but only when the backend returned no structured call, so it stays a no-op for providers that already do the right thing. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,158 @@ | ||
| import type { ToolCall } from '#/message'; | ||
|
|
||
| /** | ||
| * Defensive fallback for DeepSeek-V3 style **inline** tool calls. | ||
| * | ||
| * DeepSeek-architecture models (deepseek-v3/r1 and derivatives such as | ||
| * cogito-2.1) emit tool calls in a special-token format rather than as OpenAI | ||
| * `tool_calls`. The official DeepSeek API parses this server-side and returns | ||
| * structured `tool_calls`, but many OpenAI-compatible deployments — self-hosted | ||
| * vLLM / SGLang / llama.cpp, ollama, and some proxies — do NOT, and instead leak | ||
| * the raw tokens into the assistant `content`: | ||
| * | ||
| * <|tool▁calls▁begin|> | ||
| * <|tool▁call▁begin|>function<|tool▁sep|>NAME | ||
| * ```json | ||
| * { ...arguments... } | ||
| * ```<|tool▁call▁end|> (repeated for parallel calls) | ||
| * <|tool▁calls▁end|> | ||
| * | ||
| * The bar is either ASCII `|` (U+007C, as ollama-cloud emits) or the tokenizer's | ||
| * full-width `|` (U+FF5C, as raw vLLM/SGLang/llama.cpp leaks); separators are | ||
| * `▁` (U+2581). The outer `…calls▁begin…` wrapper is sometimes omitted and the | ||
| * block starts straight at a per-call `…call▁begin…` (see vllm-project/vllm#21727), | ||
| * so detection anchors on either boundary. | ||
| * | ||
| * When this happens the agent sees no tool call and the turn dead-ends. This | ||
| * module parses those tokens client-side so the call can still be dispatched. It | ||
| * is applied ONLY when the provider returned no structured tool call AND a block | ||
| * boundary is present, so it is a no-op for every well-behaved provider/model. | ||
| */ | ||
|
|
||
| const SEP = '▁'; // U+2581 | ||
| const BAR = '[||]'; // ASCII U+007C or full-width U+FF5C | ||
|
|
||
| /** First block boundary: a calls-begin OR a call-begin token, either bar. */ | ||
| const BLOCK_START = new RegExp(`<${BAR}tool${SEP}calls?${SEP}begin${BAR}>`); | ||
|
|
||
| // One call: <…call▁begin…>[function]<…sep…>NAME ```[json] {ARGS} ``` | ||
| const CALL_RE = new RegExp( | ||
| `<${BAR}tool${SEP}call${SEP}begin${BAR}>\\s*(?:function)?\\s*<${BAR}tool${SEP}sep${BAR}>\\s*([A-Za-z0-9_.-]+)\\s*` + | ||
| '```(?:json)?\\s*([\\s\\S]*?)```', | ||
| 'g', | ||
| ); | ||
|
|
||
| /** Canonical ASCII calls-begin sentinel (documentation / convenience). */ | ||
| export const DEEPSEEK_TOOL_CALLS_BEGIN = `<|tool${SEP}calls${SEP}begin|>`; | ||
|
|
||
| // Longest sentinel we might hold back while detecting a marker split across deltas. | ||
| const MAX_MARKER_LEN = DEEPSEEK_TOOL_CALLS_BEGIN.length; | ||
|
|
||
| /** Index of the first DeepSeek tool-call block boundary in `content`, or -1. */ | ||
| export function firstBlockStart(content: string): number { | ||
| const match = BLOCK_START.exec(content); | ||
| return match ? match.index : -1; | ||
| } | ||
|
|
||
| /** | ||
| * Parse DeepSeek inline tool-call tokens from assistant content into structured | ||
| * {@link ToolCall}s. Calls whose argument block is not valid JSON are skipped, | ||
| * so a partially corrupted emission yields the calls it can rather than throwing. | ||
| */ | ||
| export function parseDeepSeekInlineToolCalls(content: string): ToolCall[] { | ||
| if (typeof content !== 'string' || firstBlockStart(content) < 0) { | ||
| return []; | ||
| } | ||
| const calls: ToolCall[] = []; | ||
| CALL_RE.lastIndex = 0; | ||
| let match: RegExpExecArray | null; | ||
| while ((match = CALL_RE.exec(content)) !== null) { | ||
| const name = (match[1] ?? '').trim(); | ||
| const args = (match[2] ?? '').trim(); | ||
| if (name.length === 0) continue; | ||
| try { | ||
| JSON.parse(args); | ||
| } catch { | ||
| continue; | ||
| } | ||
| calls.push({ type: 'function', id: crypto.randomUUID(), name, arguments: args }); | ||
| } | ||
| return calls; | ||
| } | ||
|
|
||
| /** | ||
| * Streaming-safe filter that lets text deltas through live until a DeepSeek | ||
| * tool-call block begins, then suppresses the raw tokens (so they never reach the | ||
| * UI) while accumulating the full content for {@link parseDeepSeekInlineToolCalls}. | ||
| * | ||
| * A small trailing holdback covers a block-start marker that straddles two deltas. | ||
| * Once the provider emits a structured tool call ({@link releaseHoldback}) no | ||
| * inline leak is possible, so the filter releases any held text and passes the | ||
| * rest through verbatim — preserving ordering for well-behaved providers. | ||
| */ | ||
| export class DeepSeekInlineToolCallFilter { | ||
| private buffer = ''; | ||
| private full = ''; | ||
| private suppressing = false; | ||
| private passthrough = false; | ||
|
|
||
| /** Feed a content delta; returns the text safe to yield now (possibly empty). */ | ||
| push(delta: string): string { | ||
| this.full += delta; | ||
| if (this.passthrough) return delta; | ||
| if (this.suppressing) return ''; | ||
| this.buffer += delta; | ||
| const idx = firstBlockStart(this.buffer); | ||
| if (idx >= 0) { | ||
| const out = this.buffer.slice(0, idx); | ||
| this.suppressing = true; | ||
| this.buffer = ''; | ||
| return out; | ||
| } | ||
| const holdback = MAX_MARKER_LEN - 1; | ||
| if (this.buffer.length > holdback) { | ||
| const out = this.buffer.slice(0, this.buffer.length - holdback); | ||
| this.buffer = this.buffer.slice(this.buffer.length - holdback); | ||
| return out; | ||
| } | ||
| return ''; | ||
| } | ||
|
|
||
| /** | ||
| * The provider emitted a structured tool call, so no inline leak is possible. | ||
| * Release any held-back text (in order) and stop buffering subsequent deltas — | ||
| * this keeps a short preamble from being reordered after the tool-call parts. | ||
| * | ||
| * No-op once suppression has begun: if an inline block was already detected we | ||
| * must keep stripping it rather than flip to passthrough (which would leak the | ||
| * remainder of the block as visible text). | ||
| */ | ||
| releaseHoldback(): string { | ||
| if (this.suppressing) return ''; | ||
| this.passthrough = true; | ||
| const out = this.buffer; | ||
| this.buffer = ''; | ||
| return out; | ||
| } | ||
|
|
||
| /** | ||
| * Remaining buffered text once the stream ends (empty if a block was | ||
| * suppressed). Idempotent: the buffer is cleared, so a second call returns ''. | ||
| */ | ||
| flush(): string { | ||
| if (this.suppressing) return ''; | ||
| const out = this.buffer; | ||
| this.buffer = ''; | ||
| return out; | ||
| } | ||
|
Comment on lines
+142
to
+147
|
||
|
|
||
| /** Whether a block-start marker was seen (and the rest of content suppressed). */ | ||
| get sawToolBlock(): boolean { | ||
| return this.suppressing; | ||
| } | ||
|
|
||
| /** Full accumulated content (for parsing). */ | ||
| get content(): string { | ||
| return this.full; | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -30,6 +30,11 @@ import { | |
| convertChatCompletionStreamToolCall, | ||
| type BufferedChatCompletionToolCall, | ||
| } from './chat-completions-stream'; | ||
| import { | ||
| DeepSeekInlineToolCallFilter, | ||
| firstBlockStart, | ||
| parseDeepSeekInlineToolCalls, | ||
| } from './deepseek-inline-tool-calls'; | ||
| import { | ||
| mergeRequestHeaders, | ||
| requireProviderApiKey, | ||
|
|
@@ -361,20 +366,36 @@ export class OpenAILegacyStreamedMessage implements StreamedMessage { | |
| yield { type: 'think', think: reasoning } satisfies StreamedMessagePart; | ||
| } | ||
|
|
||
| if (message.content) { | ||
| yield { type: 'text', text: message.content } satisfies StreamedMessagePart; | ||
| const structuredToolCalls = (message.tool_calls ?? []).filter(isFunctionToolCall); | ||
| const content = typeof message.content === 'string' ? message.content : ''; | ||
|
|
||
| // Fallback: a backend served a DeepSeek-format model but left its inline | ||
| // tool-call tokens in `content` instead of structuring them as `tool_calls`. | ||
| // Strip the block from visible text whenever a block boundary is present (and | ||
| // the provider returned no structured call) — even if some blocks fail to | ||
| // parse — so the raw tokens never render. Parse what we can into dispatchable | ||
| // tool calls. No-op when absent or already structured. | ||
| const blockStart = structuredToolCalls.length === 0 ? firstBlockStart(content) : -1; | ||
| const inlineToolCalls = blockStart >= 0 ? parseDeepSeekInlineToolCalls(content) : []; | ||
|
|
||
| if (content.length > 0) { | ||
| const text = blockStart >= 0 ? content.slice(0, blockStart) : content; | ||
| if (text.length > 0) { | ||
| yield { type: 'text', text } satisfies StreamedMessagePart; | ||
| } | ||
| } | ||
|
|
||
| if (message.tool_calls) { | ||
| for (const toolCall of message.tool_calls) { | ||
| if (!isFunctionToolCall(toolCall)) continue; | ||
| yield { | ||
| type: 'function', | ||
| id: toolCall.id || crypto.randomUUID(), | ||
| name: toolCall.function.name, | ||
| arguments: toolCall.function.arguments, | ||
| } satisfies ToolCall; | ||
| } | ||
| for (const toolCall of structuredToolCalls) { | ||
| yield { | ||
| type: 'function', | ||
| id: toolCall.id || crypto.randomUUID(), | ||
| name: toolCall.function.name, | ||
| arguments: toolCall.function.arguments, | ||
| } satisfies ToolCall; | ||
| } | ||
|
|
||
| for (const toolCall of inlineToolCalls) { | ||
| yield toolCall; | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -383,6 +404,8 @@ export class OpenAILegacyStreamedMessage implements StreamedMessage { | |
| reasoningKey: string | undefined, | ||
| ): AsyncGenerator<StreamedMessagePart> { | ||
| const bufferedToolCalls = new Map<number | string, BufferedChatCompletionToolCall>(); | ||
| const inlineFilter = new DeepSeekInlineToolCallFilter(); | ||
| let sawStructuredToolCall = false; | ||
|
|
||
| try { | ||
| for await (const chunk of response) { | ||
|
|
@@ -416,19 +439,46 @@ export class OpenAILegacyStreamedMessage implements StreamedMessage { | |
| yield { type: 'think', think: reasoning } satisfies StreamedMessagePart; | ||
| } | ||
|
|
||
| // text content | ||
| // text content — funnel through the inline filter so a leaked DeepSeek | ||
| // tool-call block is stripped from visible text (and captured for parsing | ||
| // once the stream ends) instead of being shown to the user. | ||
| if (delta.content) { | ||
| yield { type: 'text', text: delta.content } satisfies StreamedMessagePart; | ||
| const visible = inlineFilter.push(delta.content); | ||
| if (visible.length > 0) { | ||
| yield { type: 'text', text: visible } satisfies StreamedMessagePart; | ||
|
Comment on lines
+446
to
+448
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
For normal streamed OpenAI responses that already use Useful? React with 👍 / 👎. |
||
| } | ||
| } | ||
|
|
||
| // tool calls — preserve `index` on every yielded part so the generate | ||
| // loop can route interleaved argument deltas from parallel tool calls. | ||
| if (delta.tool_calls && delta.tool_calls.length > 0 && !sawStructuredToolCall) { | ||
| sawStructuredToolCall = true; | ||
| // A structured tool call means no inline leak is possible: release any | ||
| // held-back preamble text now so it isn't reordered after the call parts. | ||
| const released = inlineFilter.releaseHoldback(); | ||
| if (released.length > 0) { | ||
| yield { type: 'text', text: released } satisfies StreamedMessagePart; | ||
| } | ||
| } | ||
| for (const toolCall of delta.tool_calls ?? []) { | ||
| for (const part of convertChatCompletionStreamToolCall(toolCall, bufferedToolCalls)) { | ||
| yield part; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Flush any text held back for partial begin-marker detection. | ||
| const tail = inlineFilter.flush(); | ||
| if (tail.length > 0) { | ||
| yield { type: 'text', text: tail } satisfies StreamedMessagePart; | ||
| } | ||
| // Fallback: the backend served a DeepSeek-format model but left its inline | ||
| // tool-call tokens in `content` instead of structuring them. Parse them. | ||
| if (!sawStructuredToolCall && inlineFilter.sawToolBlock) { | ||
| for (const toolCall of parseDeepSeekInlineToolCalls(inlineFilter.content)) { | ||
| yield toolCall; | ||
| } | ||
| } | ||
| } catch (error: unknown) { | ||
| throw convertOpenAIError(error); | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ordinary OpenAI-compatible streams with no DeepSeek marker, this still appends every text delta to
full, butcontentis only read at EOF whensawToolBlockis true. That makes the fallback not a no-op for long normal completions: the provider now keeps a second copy of the entire assistant text and repeatedly concatenates it for no benefit. Consider starting the parse accumulator only after a block boundary is found, seeding it with the held buffer at that point.Useful? React with 👍 / 👎.