fix(kosong): parse DeepSeek inline tool calls when a backend leaves them unstructured #798

Open
mikkihugo wants to merge 3 commits into MoonshotAI:main from mikkihugo:fix/deepseek-inline-tool-parse
Conversation

@mikkihugo

Problem

DeepSeek-architecture models (deepseek-v3/r1 and derivatives such as cogito) emit tool calls in a special-token format rather than as OpenAI tool_calls:

<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>NAME
```json
{ ...args... }
```<|tool▁call▁end|><|tool▁calls▁end|>

DeepSeek's own API parses this server-side and returns structured tool_calls. A lot of OpenAI-compatible deployments don't — self-hosted vLLM/SGLang/llama.cpp, ollama, and some proxy layers pass the raw tokens straight through into the assistant content. kosong then sees an ordinary text turn with no tool call, the agent has nothing to dispatch, and the turn dead-ends: the model effectively narrates the call instead of making it.
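To make the failure mode concrete, here is a minimal sketch of the two response shapes. The `AssistantMessage` and `ToolCall` interfaces and `dispatchableCalls` are simplified, hypothetical stand-ins rather than kosong's or OpenAI's real types, and `get_weather` is only an illustrative tool name.

```typescript
// Hypothetical, simplified message shapes; the real types are richer.
interface ToolCall {
  id: string;
  function: { name: string; arguments: string };
}

interface AssistantMessage {
  content: string | null;
  tool_calls?: ToolCall[];
}

const TICKS = '`'.repeat(3); // the Markdown fence the model wraps arguments in

// What a backend that structures the call returns.
const structured: AssistantMessage = {
  content: null,
  tool_calls: [
    { id: 'call_0', function: { name: 'get_weather', arguments: '{"city":"Oslo"}' } },
  ],
};

// What a pass-through backend returns: the same call, but as plain text.
const leaked: AssistantMessage = {
  content:
    '<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_weather\n' +
    TICKS + 'json\n{"city":"Oslo"}\n' + TICKS +
    '<|tool▁call▁end|><|tool▁calls▁end|>',
};

// A consumer that only looks at tool_calls has nothing to dispatch for `leaked`.
function dispatchableCalls(message: AssistantMessage): ToolCall[] {
  return message.tool_calls ?? [];
}
```

Under these assumptions `dispatchableCalls(leaked)` is empty even though the model clearly intended a call, which is the dead-end described above.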

Where I ran into it

ollama-cloud serving cogito-2.1:671b (DeepSeek-MoE based). At temperature: 0 it leaks the tokens on essentially every tool turn — ollama's own parser only catches them some of the time. Any DeepSeek-format model behind a backend that doesn't structure tool calls hits the same wall.

Change

OpenAILegacyChatProvider now detects a leaked DeepSeek tool-call block and parses it into structured tool_calls — but only when the response carried no structured tool call and the begin-token is present. For any backend that already does the right thing (DeepSeek's API included) it's a no-op.

  • Both streaming and non-streaming paths.
  • Streaming is marker-aware: text is forwarded live up to the begin-token, then the block is suppressed so the raw tokens never reach the UI, with a one-token holdback so a begin-token split across two deltas is still caught.
  • Argument blocks that aren't valid JSON are skipped rather than emitted as a broken call.

The parser and the stream filter are a small standalone module (deepseek-inline-tool-calls.ts) so they're unit-testable on their own.
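A rough sketch of the non-streaming parse, assuming the ASCII-bar token spelling shown in the Problem section. `parseInlineToolCalls`, `ParsedToolCall`, `CALL_PATTERN`, and the `hasStructuredCalls` flag are hypothetical names for illustration and are not the module's actual exports.

```typescript
// Sketch only: names below are hypothetical, not the module's real exports.
const FENCE = '`'.repeat(3);
const CALLS_BEGIN = '<|tool▁calls▁begin|>';

interface ParsedToolCall {
  name: string;
  arguments: string; // JSON text, as OpenAI tool_calls carry it
}

// One call: begin-token, "function", separator, NAME, then a fenced json block.
const CALL_PATTERN = new RegExp(
  '<\\|tool▁call▁begin\\|>function<\\|tool▁sep\\|>([^\\n]+)\\n' +
    FENCE + 'json\\n([\\s\\S]*?)\\n' + FENCE + '<\\|tool▁call▁end\\|>',
  'g',
);

function parseInlineToolCalls(content: string, hasStructuredCalls: boolean): ParsedToolCall[] {
  // No-op when the backend already structured the call or no begin-token leaked.
  if (hasStructuredCalls || content.indexOf(CALLS_BEGIN) === -1) return [];

  const calls: ParsedToolCall[] = [];
  CALL_PATTERN.lastIndex = 0; // the regex is global, so reset its cursor
  let match: RegExpExecArray | null;
  while ((match = CALL_PATTERN.exec(content)) !== null) {
    const name = (match[1] ?? '').trim();
    const rawArgs = (match[2] ?? '').trim();
    try {
      JSON.parse(rawArgs); // validate only; keep the original text
    } catch {
      continue; // invalid JSON: skip rather than emit a broken call
    }
    if (name.length > 0) calls.push({ name, arguments: rawArgs });
  }
  return calls;
}
```

Keeping the arguments as a JSON string mirrors how OpenAI-style `tool_calls` carry them, so a caller could map each result onto its existing tool-call part without re-serializing.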

Tests

test/providers/deepseek-inline-tool-calls.test.ts covers single and parallel calls, the no-token no-op, invalid-JSON skipping, and the streaming filter (passthrough, suppression, split-marker detection). Typecheck and the kosong suite pass locally.

End to end against ollama-cloud/cogito, agentic tool loops that previously never completed now run reliably.

…enAI-compatible backends

DeepSeek-architecture models (deepseek-v3/r1 and derivatives like cogito) emit
tool calls as special tokens rather than OpenAI tool_calls. DeepSeek's own API
structures these server-side, but many compatible deployments — self-hosted
vLLM/SGLang/llama.cpp, ollama, some proxies — leak the raw <|tool_calls_begin|>
tokens into the assistant content, so the agent sees no tool call and the turn
dead-ends.

The OpenAI chat-completions provider now parses those tokens into structured
tool calls and strips them from visible text, but only when the backend returned
no structured call — a no-op for providers that already do the right thing.
Covers both streaming (marker-aware, so a split begin-token is still caught) and
non-streaming responses.
Copilot AI review requested due to automatic review settings June 15, 2026 20:55
@changeset-bot

changeset-bot Bot commented Jun 15, 2026

🦋 Changeset detected

Latest commit: 42d0a7a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@moonshot-ai/kosong Patch

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a defensive fallback to detect and parse DeepSeek-style inline tool-call tokens when an OpenAI-compatible backend fails to return structured tool_calls, ensuring tool calls are still dispatched and raw tokens aren’t shown in user-visible text.

Changes:

  • Introduces a DeepSeek inline tool-call parser and streaming-safe content filter.
  • Updates the OpenAI legacy provider to strip leaked inline tool-call blocks from visible text and emit parsed tool calls when appropriate.
  • Adds Vitest coverage for the parser and filter behaviors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/kosong/src/providers/deepseek-inline-tool-calls.ts | Adds parser for DeepSeek inline tool-call token blocks plus a streaming filter to suppress tokens from UI text. |
| packages/kosong/src/providers/openai-legacy.ts | Uses the parser/filter as a fallback when tool_calls are missing, stripping inline tokens from text and yielding tool calls. |
| packages/kosong/test/providers/deepseek-inline-tool-calls.test.ts | Adds tests for parsing and filtering inline tool calls, including split-marker streaming behavior. |
| .changeset/deepseek-inline-tool-calls.md | Documents the patch release and rationale for the fallback behavior. |

Comment on lines 380 to 388
if (typeof message.content === 'string' && message.content.length > 0) {
  const text =
    inlineToolCalls.length > 0
      ? message.content.slice(0, message.content.indexOf(DEEPSEEK_TOOL_CALLS_BEGIN))
      : message.content;
  if (text.length > 0) {
    yield { type: 'text', text } satisfies StreamedMessagePart;
  }
}
Comment on lines +98 to +100
flush(): string {
  return this.suppressing ? '' : this.buffer;
}

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7c35de0b0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +448 to +450
const visible = inlineFilter.push(delta.content);
if (visible.length > 0) {
  yield { type: 'text', text: visible } satisfies StreamedMessagePart;

[P2] Flush held-back text before yielding structured tool calls

For normal streamed OpenAI responses that already use delta.tool_calls, this filter still holds back up to DEEPSEEK_TOOL_CALLS_BEGIN.length - 1 characters. If the content preamble is shorter than that (for example the existing "Checking weather. " prefix is 18 chars while the holdback is 19), visible is empty here, the following tool_calls delta is yielded first, and the text is only emitted by the end-of-stream flush. That reorders onMessagePart/UI output for providers that are supposed to be a no-op for this fallback; flush or release the buffered text before yielding structured tool-call parts.
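The ordering problem and one possible fix can be sketched with a small filter. This is a hypothetical reconstruction, not the PR's `DeepSeekInlineToolCallFilter`: the class name `HoldbackFilter` and its internals are assumptions, and it only recognizes the ASCII-bar begin-token.

```typescript
// Hypothetical sketch; the PR's real filter may differ in detail.
const BEGIN = '<|tool▁calls▁begin|>'; // holdback is BEGIN.length - 1 characters

class HoldbackFilter {
  private buffer = '';
  private suppressing = false;

  /** Feed a content delta; returns the text that is safe to show now. */
  push(delta: string): string {
    if (this.suppressing) return '';
    this.buffer += delta;
    const at = this.buffer.indexOf(BEGIN);
    if (at !== -1) {
      const visible = this.buffer.slice(0, at);
      this.buffer = '';
      this.suppressing = true; // everything after the marker stays hidden
      return visible;
    }
    // Keep the tail that could still be the start of a split begin-token.
    const keep = Math.min(this.buffer.length, BEGIN.length - 1);
    const visible = this.buffer.slice(0, this.buffer.length - keep);
    this.buffer = this.buffer.slice(this.buffer.length - keep);
    return visible;
  }

  /** Call before yielding a structured tool call so text keeps its order. */
  releaseHoldback(): string {
    if (this.suppressing) return ''; // never reopen passthrough mid-block
    const held = this.buffer;
    this.buffer = '';
    return held;
  }

  /** End of stream; the buffer is cleared, so a second call returns ''. */
  flush(): string {
    return this.releaseHoldback();
  }
}
```

With this shape, a provider would emit `releaseHoldback()` as a text part immediately before the first `delta.tool_calls` part, so the 18-character preamble in the example is no longer delayed until the end-of-stream flush.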

…potent

- Non-stream path: strip the DeepSeek token block from visible text whenever the
  begin token is present, not only when a call parses — so a malformed emission
  never renders raw tokens to the user.
- DeepSeekInlineToolCallFilter.flush() clears its buffer, so it is idempotent and
  a reused instance can't re-emit.
- Tests for both, plus non-stream provider coverage (valid and malformed blocks).
@mikkihugo

Author

Thanks for the review — both addressed in 60eab4d:

  • Non-stream path now strips the token block based on the begin token's presence rather than a successful parse, so a malformed emission no longer renders raw tokens (it yields the clean prefix with no dispatched call).
  • DeepSeekInlineToolCallFilter.flush() clears its buffer, so it's idempotent and a reused instance can't re-emit.

Added tests for both, plus non-stream provider coverage for the valid and malformed cases.
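The first of those fixes amounts to cutting the visible text at the marker regardless of whether anything parsed. A minimal sketch, where `visibleText` and `BEGIN_TOKEN` are hypothetical names and only the ASCII-bar spelling is handled:

```typescript
const BEGIN_TOKEN = '<|tool▁calls▁begin|>';

// Hypothetical helper: the text to show for a non-streamed assistant message.
function visibleText(content: string): string {
  const at = content.indexOf(BEGIN_TOKEN);
  // Strip on presence of the marker, not on a successful parse, so a
  // malformed block still never reaches the user as raw tokens.
  return at === -1 ? content : content.slice(0, at);
}
```

The earlier version only sliced when at least one call had parsed, which is why a malformed block could fall through to the unmodified content.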

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60eab4d849

*/

const SEP = '▁'; // U+2581 LOWER ONE EIGHTH BLOCK
export const DEEPSEEK_TOOL_CALLS_BEGIN = `<|tool${SEP}calls${SEP}begin|>`;

[P2] Support the actual DeepSeek sentinels

For the DeepSeek/vLLM leak this fallback is meant to recover, raw content uses sentinels with full-width bars (U+FF5C) like <｜tool▁call▁begin｜>, and the reported vLLM repro can omit the outer calls-begin token entirely, starting at the per-call call-begin token (see vllm-project/vllm#21727). Because this constant is the ASCII <|...|> form and parseDeepSeekInlineToolCalls returns early unless it is present, both streaming and non-streaming paths still render those leaked calls as plain text and dispatch no tool call, so the targeted self-hosted DeepSeek scenario continues to dead-end.

…ering

Bot reviewers (Codex + advreview panel) flagged:
- Real DeepSeek/vLLM leaks use full-width sentinels (U+FF5C) and can omit the
  outer calls-begin token. Detection is now bar-agnostic and anchors on either
  the calls-begin or a per-call call-begin boundary, so the self-hosted
  vLLM/SGLang/llama.cpp case is actually recovered.
- A short text preamble before a structured tool_calls delta could be reordered
  after the call: a structured call now releases held text first, in order.
- releaseHoldback() is a no-op once suppression has begun, so a structured call
  arriving mid-block can't flip the filter to passthrough and leak the rest of
  the raw tokens.

Covered by tests (full-width, omitted wrapper, reorder, mid-block guard).
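Bar-agnostic detection that anchors on either boundary could look roughly like this. `BOUNDARY` and `findBlockStart` are hypothetical names, and the pattern is an illustration of the idea rather than the expression used in the commit.

```typescript
// Assumption: either the ASCII bar (U+007C) or the full-width bar (U+FF5C)
// may delimit a sentinel, and words are joined by U+2581.
const BAR = '[|\\uFF5C]';
const JOIN = '\\u2581';

// Matches both the outer "calls begin" and the per-call "call begin" token.
const BOUNDARY = new RegExp(
  '<' + BAR + 'tool' + JOIN + 'calls?' + JOIN + 'begin' + BAR + '>',
);

/** Index of the first tool-call boundary in `content`, or -1 if none. */
function findBlockStart(content: string): number {
  return content.search(BOUNDARY);
}
```

Because `calls?` accepts both spellings, a response that omits the outer wrapper is still detected at its first per-call token.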

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42d0a7ab87


/** Feed a content delta; returns the text safe to yield now (possibly empty). */
push(delta: string): string {
  this.full += delta;

[P2] Avoid retaining every normal streamed response

For ordinary OpenAI-compatible streams with no DeepSeek marker, this still appends every text delta to full, but content is only read at EOF when sawToolBlock is true. That makes the fallback not a no-op for long normal completions: the provider now keeps a second copy of the entire assistant text and repeatedly concatenates it for no benefit. Consider starting the parse accumulator only after a block boundary is found, seeding it with the held buffer at that point.
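One way to read that suggestion is sketched below. `LazyBlockAccumulator` and its methods are hypothetical, the marker is the ASCII-bar form only, and the sketch covers accumulation alone, not the visible-text filtering.

```typescript
const MARKER = '<|tool▁calls▁begin|>';

// Hypothetical sketch of the suggestion: keep no second copy of ordinary text.
class LazyBlockAccumulator {
  // Bounded tail: at most MARKER.length - 1 characters until a marker is seen.
  private held = '';
  // Allocated only once a marker has been found.
  private block: string | null = null;

  push(delta: string): void {
    if (this.block !== null) {
      this.block += delta;
      return;
    }
    const candidate = this.held + delta;
    const at = candidate.indexOf(MARKER);
    if (at !== -1) {
      this.block = candidate.slice(at); // seed from the held buffer onward
      this.held = '';
    } else {
      const keep = MARKER.length - 1;
      this.held = candidate.slice(Math.max(0, candidate.length - keep));
    }
  }

  /** Text to hand to the parser at end of stream, or null if no block was seen. */
  blockText(): string | null {
    return this.block;
  }

  /** How many characters this instance is currently retaining. */
  retainedLength(): number {
    return (this.block ?? this.held).length;
  }
}
```

For a marker-free stream the retained state stays bounded by the marker length, so the fallback remains close to a no-op however long the completion is.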

2 participants