
fix: avoid content leak when generation ends inside thinking block #2001

Open

mlpy0 wants to merge 1 commit into exo-explore:main from mlpy0:fix/thinking-finish-reason-leak

Conversation

mlpy0 (Contributor) commented Apr 28, 2026

When generation ends while the parser is still inside <think>...</think>, the chunk that carries finish_reason was unconditionally stamped is_thinking=False in parse_thinking_models. Its text was then routed to the content channel, leaking 1-8 characters of the last thinking token into content while the rest of the thinking output stayed in reasoning_content.

This change preserves is_thinking on the final token's text via a separate chunk (with finish_reason=None) and emits an empty-text content chunk to carry finish_reason. Consumers that read only content and finish_reason still see the terminating delta; consumers that read reasoning_content get the full thinking output.
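
For illustration, the split can be sketched like this; GenerationChunk and its fields are stand-ins for the parser's chunk type, not exo's actual classes:

from dataclasses import dataclass, replace
from typing import Iterator, Optional

@dataclass
class GenerationChunk:  # illustrative stand-in for the parser's chunk type
    text: str
    is_thinking: bool
    finish_reason: Optional[str] = None

def split_final_thinking_chunk(chunk: GenerationChunk) -> Iterator[GenerationChunk]:
    # Generation stopped while still inside <think>: keep the last token's text
    # on the thinking channel and carry finish_reason on a separate empty chunk.
    if chunk.finish_reason is not None and chunk.is_thinking and chunk.text:
        yield replace(chunk, finish_reason=None)
        yield GenerationChunk(text="", is_thinking=False, finish_reason=chunk.finish_reason)
    else:
        yield chunk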

Reproduction (Qwen3.6-27B-8bit, max_tokens=10, default thinking):

  • before: usage.completion_tokens=10, content=' '
  • after: usage.completion_tokens=10, content=None, reasoning_content populated

Tests: strengthens TestThinkingModelsFinishReason.test_finish_reason_during_thinking to pin the leak, and adds test_finish_reason_during_thinking_no_content_leak covering the starts_in_thinking=True path that hit the bug in production.
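
Expressed against the sketch above rather than the real test harness, the added test asserts roughly the following:

def test_finish_reason_during_thinking_no_content_leak():
    # Final token arrives while still inside <think>; generation stops at max_tokens.
    final = GenerationChunk(text=" 42", is_thinking=True, finish_reason="length")
    first, last = list(split_final_thinking_chunk(final))
    assert first.is_thinking and first.finish_reason is None   # text stays on the thinking channel
    assert last.text == "" and last.finish_reason == "length"  # empty content chunk carries finish_reason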

Evanev7 (Member) commented May 1, 2026

Sounds like an issue in a downstream client or in our adapters. Can you tell us what tools you were using that caused this output, and which endpoint (ollama, responses, chat-completions, or claude)?

mlpy0 (Contributor, Author) commented May 1, 2026

Endpoint: chat-completions (POST /v1/chat/completions, stream=false).

Client: small Python script using urllib.request from the stdlib. No SDK, no wrapper, no adapter on top. Just reads choices[0].message.content out of the JSON.

Model: mlx-community/Qwen3.6-27B-8bit (also reproduced intermittently on Qwen3.6-35B-A3B-8bit).

Sample request that reproduced it:

{
  "model": "mlx-community/Qwen3.6-27B-8bit",
  "messages": [{"role": "user", "content": "Count from 1 to 100, comma separated."}],
  "max_tokens": 10,
  "temperature": 0.0,
  "stream": false
}

Response had completion_tokens=10, finish_reason="stop", content=" ". Sweeping max_tokens up to 1000 keeps the same shape: completion_tokens tracks max_tokens, and content stays at 1-8 characters of whatever the last mid-thinking token decoded to.

Since the script does nothing beyond reading the content field out of the JSON, no downstream parser is in play on this side.
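
A script of that shape looks roughly like this (host and port are assumed, not the exact values used):

import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:52415/v1/chat/completions",  # assumed host/port
    data=json.dumps({
        "model": "mlx-community/Qwen3.6-27B-8bit",
        "messages": [{"role": "user", "content": "Count from 1 to 100, comma separated."}],
        "max_tokens": 10,
        "temperature": 0.0,
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

message = body["choices"][0]["message"]
print(body["usage"]["completion_tokens"])  # 10
print(repr(message.get("content")))        # ' ' before the fix, None after
print(message.get("reasoning_content"))    # populated after the fix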
